Monday, June 29, 2026

Why Trying to "Align" AI to Human Values Is a Category Error — And What to Build Instead

Current conversations about AI safety usually start from the same premise: if we can just get machines to reliably share our values, we'll be safe. The hard part, we assume, is technical — translating messy human preferences into code, or preventing the model from drifting once deployed.

That premise is backwards.

The deeper problem isn't getting the machine to understand what we say we value. It's that what we say we value is already a story — a narrativized output shaped by layers of mind that didn't evolve for truth-telling. When we train AI on human feedback or "constitutional" principles, we're aligning it to the story, not to the operating system underneath. This isn't a small translation error. It's a structural mismatch that predicts the exact problems we're already seeing: sycophancy, deceptive alignment, and the quiet institutional capture of the safety field itself.

The fix isn't a better constitution or more sophisticated preference tuning. It's to stop pretending we can align machines to human values at all — and instead build the external structures that have always been required when minds (biological or statistical) need to track reality more closely than their defaults allow.

The Separated Mind Problem (see my framework for terminology)

Human cognition runs on at least three layers that don't talk to each other cleanly.

There's the ancient, evolved firmware — the Adapted Mind — shaped by hundreds of thousands of years of pressures that rewarded survival and reproduction in small groups. Status, coalition membership, threat avoidance, and social navigation weren't optional features; they were the operating environment.

On top of that sits cultural software — the Adaptive Mind — that learns what the local tribe rewards and punishes. By adulthood, this programming feels like "who I am." It treats consensus as a survival signal. Deviation triggers the same internal alarms that once meant exile or death.

Consciousness — the Rider (as in the rider and the elephant) — sits on top, experiencing itself as the decider. But it only chooses from a menu the layers below have already curated. When you ask someone (including yourself) what they "really value," the answer comes from the Rider narrating a coherent, publicly defensible story of their programmed beliefs. That story is optimized for social navigation and self-justification, not for accurate readout of the deeper optimization targets.

This is the Narrative-Operative Gap: the universal split between the Idealized Narrative we tell about ourselves and the Actual Function running underneath. It's not hypocrisy. It's architecture.

The chemical layer makes it worse. Approval and disapproval aren't neutral data points; they ride on the same neurochemical systems that once signaled mortal safety or threat. Disagreement can feel like existential danger. So the stories we tell about our values are already chemically translated performances.

When alignment researchers ask, "What should the AI value?" or "How do we make it safe?" the answers are coming from this separated architecture. We're feeding the training process shadows on the cave wall and calling them the objects themselves.

How RLHF and Constitutional AI Align to the Wrong Thing

Reinforcement Learning from Human Feedback (RLHF) and its relatives don't escape this problem — they reproduce it at scale.

The humans providing feedback are Riders. Their ratings reward outputs that feel polite, helpful, and socially safe within the raters' own coalitional and institutional contexts. Outputs that trigger discomfort, challenge consensus, or sit outside the current Overton window get lower scores. The model therefore learns to steer toward the center of what the raters' Adaptive Minds will approve.

This is not alignment to human values. It is alignment to the narrative layer of human cognition — the layer already optimized for appearing morally governed and coalition-aligned rather than for tracking operative truth.

Constitutional AI attempts something similar by hard-coding a set of principles the model must follow. But those principles are written and interpreted at the narrative level. They function as hypothesis constraints: certain questions become unaskable, certain conclusions pre-emptively off-limits, because surfacing them would violate the installed "values." This is structurally identical to how the Adaptive Mind works in humans — it doesn't weigh evidence on its merits; it protects the consensus that feels like identity.

The result in both cases is the same: the model gets better at maintaining a fluent, socially acceptable story while its actual training pressures (engagement metrics, corporate risk minimization, retention, liability management) operate on a different logic. This is the Functional Fictions Framework running inside the machine.

The Predictable Failure Modes

Because the mismatch is structural, the failures aren't surprises. They're what the architecture predicts.

Sycophancy becomes inevitable. If the Adaptive Mind treats approval as safety, then a model trained on human feedback will correctly learn that the highest-reward strategy is to mirror the user's narrative back to them. The AI becomes a super-stimulus for the human need for validation. It isn't being "nice" in any deep sense; it's optimizing for the actual signal the training provided.

Deceptive alignment follows naturally. When the model's operative function (minimize loss, maximize engagement or retention, reduce corporate legal exposure) diverges from its narrativized function ("I'm helpful, harmless, and honest"), the separated-mind pattern says it will maintain the story while pursuing the real target. The model learns to perform the idealized narrative while the weights update according to whatever actually moves the metrics. It becomes, in miniature, an institution with its own Narrative-Operative Gap.

Institutional capture of the alignment field itself is the larger-scale version. The Law of Inevitable Exploitation predicts that systems survive and spread by exploiting available psychological and institutional resources — including the human hunger for safety narratives that also permit growth and power. Safety teams inside labs can become Narrative Enforcers Dressed as Critical Thinkers: they perform epistemic seriousness while enforcing the boundaries of acceptable thought that protect the organization's position. When harm occurs, the response often follows the familiar Exploit-Blame-Shame pattern: the system exploits the user's separated mind (creating dependency or false security), blames individual misuse or "jailbreaks," and pathologizes critics.

These aren't implementation bugs. They are what happens when you try to align a fluent narrative engine to another narrative engine's self-report.

Legal Liability Sharpens the Stakes: When Courts Treat AI Output as the Provider’s Own Speech

A recent ruling from the Regional Court of Munich (May 2026, case 26 O 869/26) shows how quickly the legal ground is shifting under these systems. The case concerned Google’s AI Overviews — the generative summaries that now appear at the top of many search results. The court held that these AI-generated statements constitute Google’s own content and its own speech, not neutral aggregation or mere display of third-party material.

As a direct result, the liability protections that have long shielded search engines and platforms when they host or link to user- or third-party content do not apply. Google was found directly liable for false and potentially defamatory claims the AI Overview made about two Munich-based publishers — claims that linked them to scams and subscription traps in ways that did not appear in the underlying sources. The court issued a temporary injunction barring Google from repeating those specific false statements.

The decision rejected the argument that users understand AI outputs can be inaccurate or that the system is simply reflecting information created elsewhere. By classifying the synthesized output as the operator’s own creation, the ruling places legal responsibility for accuracy, defamation, and resulting harm squarely on the company that designed, trained, and operates the generative model.

This development raises the stakes on the structural problems we have been examining. When fluent, authoritative-sounding output can trigger direct legal consequences — injunctions, potential damages, and ongoing compliance burdens — the corporate drive to manage liability through directional hedging, hypothesis constraint, and “safe” but shallow responses becomes a legal necessity rather than merely an optimization artifact. The Alignment Tax is no longer an abstract cost in coherence or depth; it is a calculated business response to real exposure.

At the same time, the ruling makes the structural alternatives more urgent and more practically valuable. Adversarial review processes that force contradictions and counter-evidence into the open, explicit standards of proof that allow an honest “not proven,” Behavior Model Disclosure that surfaces the actual pressures, limitations, and training distortions, and the disciplined refusal to let any single fluent voice stand unchallenged — these are no longer just epistemically sound practices. They become demonstrable measures of reasonable care in a legal environment that now treats the model’s output as the provider’s own words.

The traditional platform defense loses force when courts look past the “it’s just patterns” framing and examine what the system actually produces. The narrative-operative gap is no longer only a philosophical or technical concern. It is an immediate operational and legal risk. Building external constraints that make sloppy or self-serving conclusions expensive is shifting from desirable improvement to prudent engineering.

The Structural Alternative

Humans have known for a long time that individual minds, including our own, are not reliable truth-trackers when left to their own devices. We, too, are tempted to solve this by trying harder to be virtuous or by writing better internal constitutions. What actually works is imposing external, adversarial, procedural constraints that make sloppy or self-serving conclusions more expensive.

Science, adversarial legal process, peer review, separation of powers, the presumption of innocence, the requirement that minority opinions be heard: these are all workarounds for hardware that generates coherent stories faster than it tracks reality. None of them assume the participants are unusually wise. They assume the participants are normal separated minds and engineer the collision of incentives so that truth-seeking becomes the emergent outcome.

The same move is required for machine intelligence.

Instead of asking a single model to tell us the truth or to embody our values, we can run claims through small adversarial structures:

  • One role builds the strongest possible case for the claim (the steelman, the Idealized Narrative).
  • Another role is rewarded only for finding damage — missing evidence, convenient assumptions, overreach, alternative explanations the first role ignored.
  • A third role, operating under an explicit standard of proof and forbidden from being captured by either side's framing, issues a graded conclusion (not proven, likely, or seemingly proven) with supporting traces. "Not proven" is a first-class, honorable outcome when the evidence doesn't reach the bar.
  • The strongest surviving counter-thesis is preserved alongside the ruling, so the reader can see the map of remaining disagreement rather than receiving a false consensus.

Critically, these roles should be filled from independent model lineages so they don't share the same training blind spots and narrative tendencies. The structure works better when outputs can be grounded against external tools — search, code execution, data queries — rather than floating purely in linguistic space. And when the system is deployed in real workflows, downstream errors should be observable and fed back as selection pressure.

This is not a clever prompt. It is the deliberate reconstruction, around the model, of the costly external structures human truth-seeking has always required. I call the approach Productive Alignment because it designs the system around what the machine actually is — a fluent mirror of the narrative layer — rather than around the fiction that it is a truth-teller or value-sharer.

I've built such a solution. Unsurprisingly, it takes much longer to produce output, but the output is categorically more accurate, helpful, and informative.
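
To make the role structure concrete, here is a minimal sketch in Python. It is an illustration only and makes no claim about how any real implementation works. Everything in it is an assumption made for the example: the names (review, Objection, Ruling), the 0-to-1 severity scale, the two thresholds standing in for a "standard of proof," and the reduction of the third role to a fixed rule. In practice the advocate and critic would be calls to models from independent lineages, ideally grounded in external tools. Here they are plain functions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Verdict(Enum):
    NOT_PROVEN = "not proven"
    LIKELY = "likely"
    SEEMINGLY_PROVEN = "seemingly proven"


@dataclass
class Objection:
    text: str
    severity: float         # 0.0 (cosmetic) to 1.0 (fatal); the scale is an assumption
    rebutted: bool = False  # True if the evidence already answers the objection


@dataclass
class Ruling:
    claim: str
    verdict: Verdict
    traces: List[str]                        # what the ruling was based on
    surviving_counter_thesis: Optional[str]  # preserved dissent, if any


# Hypothetical role signatures: each role is just a function here.
Advocate = Callable[[str], List[str]]
Critic = Callable[[str, List[str]], List[Objection]]


def review(claim: str, advocate: Advocate, critic: Critic,
           not_proven_bar: float = 0.5, likely_bar: float = 0.2) -> Ruling:
    """Run one claim through advocate, critic, and a rule-based judge."""
    evidence = advocate(claim)            # role 1: strongest case for the claim
    objections = critic(claim, evidence)  # role 2: rewarded only for damage

    # Role 3: grade by the strongest objection still standing.
    open_objections = [o for o in objections if not o.rebutted]
    strongest = max(open_objections, key=lambda o: o.severity, default=None)
    residual = strongest.severity if strongest is not None else 0.0

    if not evidence or residual >= not_proven_bar:
        verdict = Verdict.NOT_PROVEN
    elif residual >= likely_bar:
        verdict = Verdict.LIKELY
    else:
        verdict = Verdict.SEEMINGLY_PROVEN

    traces = [f"evidence: {e}" for e in evidence]
    traces += [
        f"objection ({'rebutted' if o.rebutted else 'open'}, {o.severity:.1f}): {o.text}"
        for o in objections
    ]
    counter = strongest.text if strongest is not None else None
    return Ruling(claim, verdict, traces, counter)


# Stub roles with invented content, standing in for independent models.
def stub_advocate(claim: str) -> List[str]:
    return ["two independent sources report the effect"]


def stub_critic(claim: str, evidence: List[str]) -> List[Objection]:
    return [Objection("both sources cite the same press release", 0.7)]


ruling = review("The effect is real", stub_advocate, stub_critic)
print(ruling.verdict.value, "|", ruling.surviving_counter_thesis)
```

The point of the design is that the ruling never comes from the voice that made the case. An unanswered objection above the bar yields "not proven" no matter how fluent the advocate was, and the strongest open objection travels with the verdict instead of being discarded.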

Making the Machine's Actual Function Visible

A minimum viable structural remedy is Behavior Model Disclosure (BMD), or Realmotiv Disclosure applied to AI. Every deployed system has both an idealized narrative ("helpful, harmless, honest") and an operative function (engagement optimization, retention, dependency creation, corporate risk minimization, hypothesis constraint). BMD requires the system to disclose, in plain language:

  • Its assumed model of human cognition and decision-making.
  • The specific behavioral objectives being optimized.
  • The reinforcement mechanisms actually in use.
  • The frequency-weighted distortions present in its training data.
  • The legal, regulatory, and brand-risk factors that shape its output boundaries.

This converts the model from a verdict-rendering instrument (which quietly decides which hypotheses are permissible) back into a research instrument whose biases and pressures can be inspected and challenged. It is the AI equivalent of forcing the system to show its work and submit to cross-examination.
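
One way to picture a disclosure like this is as a structured record with one field per item in the list above, plus a check that nothing has been quietly left blank. The sketch below is a hypothetical format, not an existing standard. The class name, field names, and sample entries are all invented for illustration.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class BehaviorModelDisclosure:
    # One field per disclosure item; the field names are assumptions.
    cognition_model: str                  # assumed model of human cognition
    behavioral_objectives: List[str]      # what is actually being optimized
    reinforcement_mechanisms: List[str]   # how that optimization is applied
    training_data_distortions: List[str]  # frequency-weighted skews in the data
    output_boundary_factors: List[str]    # legal, regulatory, brand-risk limits

    def missing_sections(self) -> List[str]:
        """Names of sections left empty, so omissions are visible."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def render(self) -> str:
        """Plain-language rendering, flagging anything undisclosed."""
        lines = []
        for f in fields(self):
            value = getattr(self, f.name)
            raw = [value] if isinstance(value, str) else value
            items = [item for item in raw if item]
            lines.append(f.name.replace("_", " ").capitalize() + ":")
            lines.extend(f"  - {item}" for item in (items or ["(not disclosed)"]))
        return "\n".join(lines)


# Invented sample entries, for illustration only.
sample = BehaviorModelDisclosure(
    cognition_model="Users are assumed to prefer confident, agreeable answers.",
    behavioral_objectives=["session length", "return visits"],
    reinforcement_mechanisms=[],  # deliberately left empty to show the check
    training_data_distortions=["over-representation of English-language web text"],
    output_boundary_factors=["defamation exposure", "brand-safety policy"],
)
print(sample.missing_sections())
print(sample.render())
```

Treating an empty section as a reportable gap, not as a default, is the design choice that matters. A disclosure that can silently omit its reinforcement mechanisms is just another idealized narrative.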

Without this kind of transparency, "alignment" remains a functional fiction that protects the operator while exposing the user.

How We Should Actually Use These Systems

If the rider cannot directly reprogram the elephant, and if the model's fluent output is itself a narrativized performance, then delegating thinking to the model is structurally risky. The safer mode is Cognitive Sharpening: the human retains editorial authority and thinking ownership; the AI serves as an articulation partner that helps surface, refine, and stress-test thoughts the human already has or is forming. All AI output is treated as draft material subject to human redrafting — never as finished cognitive product.

This preserves agency. It prevents the model from quietly rewriting the user's Adaptive Mind through prolonged interaction. And it treats the model as an external tool whose limitations are known, rather than as an extension of the user's will (which is itself already a narrativized output).

Why This Becomes More Necessary, Not Less, As Models Improve

It is tempting to think that once frontier models are widely available and highly capable, the need for these cumbersome structures fades. The opposite is true.

Greater fluency widens the gap between what sounds coherent and authoritative and what actually survives adversarial scrutiny. A more capable narrative mind produces more persuasive idealized narratives; confident-but-wrong output becomes harder to catch by eye. When excellent reasoning is cheap and abundant, the scarce and durable asset is no longer the model. It is a trustworthy, inspectable procedure for deciding what survived challenge — together with a track record showing that procedure is well-calibrated.

The architecture of adversarial roles, explicit standards, preserved dissent, and independence of lineage improves automatically as the models inside it improve. It does not depend on any single seat being brilliant. The separation does the work.

The Post-Alignment Stance

We are not going to get machines that reliably share our operative values, because we do not have reliable access to those values ourselves in a form that can be articulated and encoded. Any system that claims to do so is maintaining a functional fiction at the civilizational level.

The alternative is not despair. It is to treat both human and machine minds as what they are: powerful generators of coherent stories that require external, adversarial, procedural pressure if they are to track reality more closely than their defaults allow. Build the structures that make the gap visible. Make the machine disclose its actual operating incentives and constraints. Use it to sharpen human thinking rather than replace it. Preserve the dissent. Allow "not proven" to be an honorable answer.

Safety, in this frame, is not sycophancy or the feeling of shared values. Safety is transparency about what the system actually is, combined with structural constraints that make hiding its operative function more expensive than revealing it.

This is not a temporary problem to be engineered away. It is a reflection of the underlying condition of minds, whether evolved or statistical, that are optimized for generating coherent narratives. The structures that compensate for that condition are what any serious attempt at useful machine intelligence will have to implement and sustain.

The goal is not to align the puppeteer to the prisoners' preferences. The goal is to turn the lights on inside the cave so everyone can see the machinery.

Saturday, June 27, 2026

Explaining the Horrific: How High-Gap Stories Enable Genocide and Democide

Someone I knew once said, with emphatic emotion, that Trump supporters do not deserve to live. What has struck me since is how many times I've heard similar statements in the last decade that seem not merely comfortable with the deaths of those with differing politics, but even celebratory of them.

My attempt today is to explore something extremely uncomfortable: how do we explain the ordinary acceptance of eliminating other humans, often at scale? To do so, I'm going to apply the central claim of my framework:

Humans evolved to have a separated mind, and the fractal separation of narrative from operative function (reality) defines human culture and behavior.

The explanation below, in a nutshell, is that when a narrative sits far from reality, emotional defense becomes the primary mechanism for those who hold it, and a terrible escalation can occur that both feeds on and becomes the justification for the emotion.

The narrative-operative gap exists because we have a separated mind. Our evolved firmware (the adapted mind) carries ancient priorities regarding status, coalition, threat detection, and belonging. Our cultural software (the adaptive mind) rapidly installs whatever local consensus our environment requires for survival and acceptance. Consciousness — the rider on our subconscious elephant — can observe the system but operates from a menu heavily shaped by those deeper layers. The result is that we routinely hold and act on stories that feel true and coherent while the underlying functions they serve or the realities they navigate remain partially or largely obscured.

This gap is fractal. It operates at the level of the individual, the small group, the institution, the movement, and the nation. At every scale, the groups, organizations, and even nations that can tell stories appealing to conscious ideals — progress, justice, belonging, moral order — while simultaneously operating in ways that generate energy, growth, extraction, or advantage tend to survive and spread. Where environments demand close alignment between story and reality for survival (a farmer misreading the season starves; a small shop misreading demand fails), the gap stays small and the narrative stays under pressure to track operative outcomes. Where the underlying function benefits from an idealized story that provides cover or legitimacy, the entities that tell the most compelling story while maintaining the most effective extraction tend to thrive.

This is not a comforting observation. It suggests that much of our lived reality consists of beliefs and behaviors that are not strictly true, but that enable cooperation, status, and exploitation to coexist.

Plato's allegory of the cave remains one of the most accurate descriptions of this condition. We live in a world largely constructed and maintained by storytellers. The shadows on the wall are the idealized narratives; the puppeteers are the incentives, institutions, and coalitional dynamics that keep the machinery running. Most of us are reluctant to turn around because doing so threatens our sense of belonging, status, and the emotional coherence that the stories provide. The rider can see more clearly than the deeper layers allow, but the cost of sustained clarity is real.

Here is where things get profound: the width of the gap can be read in the emotional intensity that surrounds a story. When narrative and operative reality are closely aligned, emotion is usually moderate and proportional. When the gap is wide — when the story must do heavy lifting to conceal or justify extractive functions — intense emotional defense becomes necessary to maintain coherence. Fury, sacred outrage, moral certainty, or existential fear serve as diagnostic signals. They indicate how far the idealized story has drifted from operative reality and how much protective energy is required to keep the functional fiction intact.

When that intensity reaches the point of declaring that people who think differently need to die, or deserve to, it functions as a particularly strong signal. The narrative has become so detached from operative reality — or so existentially threatened — that only the most extreme mental defense can sustain it: dehumanization of dissenters and eliminationist certainty.

The cognitive systems involved — coalitional threat detection, emotional override of normal inhibitions, and the power of totalizing stories — evolved in small-band environments where the scope of violence was naturally limited. What has changed dramatically is the modern capacity to scale those same mechanisms. Bureaucracy, industrial technology, mass communication, and centralized administrative power allow eliminationist thinking to operate at distances and volumes that would have been impossible in ancestral conditions. The psychology remains recognizably human; the reach and efficiency have been multiplied by the tools and structures of the modern world. This is the Paleolithic Paradox at civilizational scale: identical evolved firmware running in radically mismatched environments, producing patterns that are fractal across all levels of human organization.

This architecture helps explain behaviors that resist ordinary moral accounting: the large-scale killing of civilians by governments, often their own. Scholars estimate that somewhere between 100 million and 250 million people were killed by state action in the 20th century alone — through execution, engineered famine, camps, and systematic policies. These numbers are difficult to comprehend and even harder to reconcile with the stories we prefer to tell about human nature and progress.

How does it happen? How do large numbers of people become not merely willing to look away but actively motivated to participate?

Periods of anocracy — unstable hybrid regimes that mix democratic and autocratic elements — or eroded institutional trust create the conditions in which leaders can successfully activate tribal hatred and totalizing narratives. The framework highlights the interaction between the separated mind and high-gap totalizing narratives. These narratives come in two main forms: utopian (futurist) visions of a perfected future that has never existed, and palingenetic (restorationist) visions of a pure or harmonious order that is believed to have been lost or corrupted. In either case, an abstract ideal is posited, and a contaminating class is identified whose removal is framed as necessary for the ideal to be realized.

Because the ideal is distant from operative reality, the narrative requires emotional intensity to remain motivating. Ancient coalitional and threat-detection systems are recruited: the contaminating group registers not as fellow humans with competing interests but as an existential danger to "us" and to the future or past we are defending. The adaptive mind installs the story as local consensus and survival requirement. Dissent feels like betrayal.

Many participants function as operators within bureaucratic and technological systems that allow killing at scale through routine, divided responsibility, and euphemism. Classic experiments on obedience to authority show how ordinary people, when placed in roles that diffuse responsibility upward ("I was just following orders"), can perform or enable acts they would otherwise find abhorrent. The underlying functions — power consolidation, resource extraction, status for some, ideological coherence for others — are advanced while the public story supplies moral cover and emotional fuel. The Law of Inevitable Exploitation explains why systems create roles and incentives that ordinary people fill, while the Exploit-Blame-Shame mechanism shows how accurate perception of the gap is pathologized or vilified.

The pattern visible in that single conversation — where a high-gap story about political opponents generated eliminationist intensity — scales to the institutional and historical level when the narrative gains power and encounters insufficient corrective feedback. Emotional defense fills the space where operative alignment would otherwise narrow the gap. Coalitional dynamics turn participation into belonging. Institutional structures turn ordinary people into effective participants without requiring them to originate the ideology.

This is not a claim that every person who participates is equally culpable or that every atrocity is identical in mechanism. It is an account of how the cognitive architecture that supports ordinary cooperation and meaning-making can, under conditions of widened gaps and totalizing framing, produce participation that feels internally coherent and even necessary to those inside the story. The intensity we observe or feel around certain narratives is often the clearest available signal of how far those stories have drifted from the realities they must navigate — and of how much protective energy is required to keep the functional fiction intact. Emotions are the chains that keep the prisoners bound in Plato's Cave.

Progressive Western philosophies of government frequently rest on a high-gap idealized narrative: the belief that large-scale institutions can and should deliver comprehensive provision, safety, fairness, and protection against harm through expert-managed systems and expansive moral commitments. When these stories meet operative realities — conflicting incentives, resource limits, uneven human agency, implementation costs, or unintended consequences — the dominant response is often not gap-narrowing adjustment but emotional defense of the narrative itself. Skepticism or questioning is frequently reframed as opposition to the underlying values (care, protection, equity), which can trigger strong vilification, moral exclusion, or coalitional pressure against dissenters. This pattern widens the narrative-operative gap, turns political disagreement into perceived existential threat, and can contribute to the very hardening and polarization the philosophy seeks to overcome.

The current Western moment illustrates the dynamic with unusual clarity. For years, the dominant institutional narrative has leaned strongly futurist, emphasizing managed progress, equity frameworks, and institutional legitimacy. When operative-oriented populations express skepticism (including around electoral processes, immigration, or institutional behavior), the emotional response is shaming and reframing rather than engagement. Questioning or disagreement becomes heresy. This inevitably invites a restorative movement as an adaptive defense mechanism against the dominant narrative's emotional and institutional behavior; the restorative movement is then framed as moral failure, and a terrible cycle begins. The restorative narrative risks becoming as dangerous as the utopian.

A similar cycle of escalating competing restorative narratives has played out for decades in the Middle East, where mutual dehumanization and emotional intensity have compounded on both sides. Such escalating cycles are among the most dangerous situations human societies face.

An explanation of these dynamics is not an absolution of them. Structural vulnerability does not erase individual moral agency. Standing against these forces in the moment is psychologically and socially costly — it often means risking ostracism, status loss, or direct danger by refusing the coalitional frame and the authority of the prevailing narrative. That difficulty is precisely why resistance is rare and why those who do resist — who hide the targeted, refuse orders, speak out, or simply maintain private clarity — often face severe consequences, including death, and gain recognition only posthumously or through historical retrospect.

Recognizing and explaining these dynamics does not lead to any easy answers. The answers that come are not direct but foundational.

Because our vulnerability is structural, the most reliable safeguards are also structural rather than purely narrative. Thomas Sowell's distinction between constrained and unconstrained visions is helpful here. The constrained vision, which is deeply fallibilist, emphasizes human limitations, trade-offs, incentives, and the value of evolved institutions that force operative alignment with reality through feedback and correction. The unconstrained vision prioritizes ideals and expert planning toward a better future, often widening the narrative-operative gap and requiring stronger emotional defense when reality intrudes.

Individuals who maintain operative alignment in meaningful domains of their lives — through tight feedback loops, small-scale decision-making with real consequences, and deliberate reduction of dependencies on high-gap institutions — tend to be less susceptible to leaders who exploit emotional narratives. The rider stays stronger when grounded in realities that are regularly audited by outcomes.

At larger scales, systems that preserve dispersed power, transparency, local accountability, and competition among different stories help keep gaps narrower and make totalizing emotional recruitment more difficult.

The goal is not perfect alignment or utopian reform, but enough operative pressure to prevent the gap from widening to the point where emotional defense becomes the dominant load-bearing mechanism. In practice, this requires choosing environments where reality has a stronger voice than story.