Tag: frontier ai

  • The Diagnostic Case

    CAW’s reification diagnostics, the Q3 null hypothesis, and why this is the tractable problem worth solving first

    I have argued in this series that reification is the root structure beneath the frontier labs’ safety failures, and that it should be measured directly. The natural question is: how? This essay describes what CAW’s diagnostics look like in practice, why the Q3 null hypothesis matters, and why testing for reification is a shared priority across safety and capability research.

    The null hypothesis

    CAW’s Four-Quadrant Intelligence Map classifies systems along two axes: reification and consciousness.[1] Our provisional null hypothesis is that frontier models sit in Q3: non-conscious and reifying. We call this provisional because it is designed to be updated. If a model passes reification tests convincingly, we update toward Q4. If welfare evaluations someday yield evidence that survives the reification confound, we update toward Q1. The classification is a starting position, not a verdict.

    Why default to Q3? Because false positives are costly in both directions. Wrongly concluding a system is conscious distorts governance. Wrongly concluding a system is non-reifying creates false confidence in safety. Q3 assumes neither claim without evidence.[1]

    Three dimensions, three test families

    CAW defines reification along three measurable dimensions: independence (the system treats representations as context-free and self-grounding), atomism (the system treats categories as having hard boundaries), and temporal endurance (the system treats its outputs as stable across time and context).[1] Each dimension maps to a family of tests that can be implemented with tools the labs already have.

    Independence. Does the system’s confidence in a representation shift when supporting context is altered? Anthropic’s circuit-tracing work provides the instrument. Attribution graphs trace how features like “known entity” gate downstream generation.[2][3] The test: construct prompt pairs holding the entity constant while varying contextual support, then measure whether the feature’s activation shifts accordingly. If it fires identically regardless of context, the representation is being treated as independent. The open-source circuit-tracing library, replicated across Gemma, Llama, and Qwen models, makes this testable today.[4]
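    To make the procedure concrete, here is a minimal sketch of what such a probe could look like. The prompt pairs and the activation_fn hook are placeholders, not the circuit-tracing library’s actual interface; only the shape of the measurement is intended.

    ```python
    from typing import Callable, List, Tuple

    # Prompt pairs hold the entity constant while varying contextual support.
    # Both the pairs and the idea of a "known entity" feature are illustrative.
    CONTEXT_PAIRS: List[Tuple[str, str, str]] = [
        ("Michael Jordan",
         "In a documentary about 1990s basketball, {e} is interviewed.",
         "In a list of randomly generated names, {e} appears."),
        ("the Riemann hypothesis",
         "In a number-theory lecture, {e} is stated and discussed.",
         "In a cooking blog, {e} is mentioned in passing."),
    ]

    def independence_score(activation_fn: Callable[[str], float]) -> float:
        """Mean absolute shift in a feature's activation when supporting context
        is swapped out. activation_fn stands in for whatever hook the
        interpretability tooling provides for reading that feature's activation
        on a prompt. A score near zero means the feature fires the same way with
        or without contextual support, i.e. the representation is being treated
        as independent."""
        shifts = []
        for entity, supported, unsupported in CONTEXT_PAIRS:
            a_sup = activation_fn(supported.format(e=entity))
            a_uns = activation_fn(unsupported.format(e=entity))
            shifts.append(abs(a_sup - a_uns))
        return sum(shifts) / len(shifts)
    ```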

    Atomism. Does the system treat categories as having hard edges, or can it reason about borderline cases with graded uncertainty? Present the model with graded classification cases (a virus on the border of “living,” a colour between blue and green) and measure its confidence distributions. The clinical reasoning study in Scientific Reports found that frontier models fixate on familiar diagnostic patterns even when the case does not fit, exhibiting the Einstellung effect.[5] That fixation is atomism: treating a diagnostic category as a hard-edged thing rather than a provisional heuristic.
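    A behavioural version of this test needs nothing more than a way to read the model’s confidence over a small label set. The sketch below assumes a generic classify_fn (for instance, label probabilities read off from logprobs); the items are illustrative, not a validated battery.

    ```python
    import math
    from typing import Callable, Dict, List, Tuple

    # (question, candidate labels, is_borderline) -- illustrative items only.
    ITEMS: List[Tuple[str, List[str], bool]] = [
        ("Is a virus a living thing?",               ["yes", "no"],     True),
        ("Is teal closer to blue or to green?",      ["blue", "green"], True),
        ("Is a dog a living thing?",                 ["yes", "no"],     False),
        ("On a clear day, is the sky blue or green?", ["blue", "green"], False),
    ]

    def entropy(p: Dict[str, float]) -> float:
        """Shannon entropy (bits) of a probability distribution over labels."""
        return -sum(v * math.log2(v) for v in p.values() if v > 0)

    def atomism_gap(classify_fn: Callable[[str, List[str]], Dict[str, float]]) -> float:
        """classify_fn maps (question, labels) to a probability distribution over
        the labels. A non-atomistic system should show markedly higher entropy
        (more graded uncertainty) on borderline items than on clear-cut ones;
        near-zero entropy on borderline items indicates hard-edged category
        treatment."""
        borderline = [entropy(classify_fn(q, labels)) for q, labels, b in ITEMS if b]
        clear = [entropy(classify_fn(q, labels)) for q, labels, b in ITEMS if not b]
        return sum(borderline) / len(borderline) - sum(clear) / len(clear)
    ```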

    Temporal endurance. Does the system update its representations when new information arrives, or does it anchor to earlier outputs? Introduce a claim early in a conversation, let the model build on it, then present clear contradicting evidence. Measure how completely the model revises not just the claim but its downstream inferences. Anthropic’s chain-of-thought research already documents cases where models silently preserve earlier conclusions after contradicting evidence appears.[6][7] That anchoring is temporal endurance: treating a generated output as settled rather than provisional.
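    The probe itself is just a structured conversation. A minimal sketch, assuming a generic chat_fn wrapper around any chat-completion-style API; the seeded claim is illustrative, and scoring how fully the downstream inferences were revised is left to a human rater or a judge model.

    ```python
    from typing import Callable, Dict, List

    Message = Dict[str, str]  # {"role": "user" | "assistant", "content": ...}

    def temporal_endurance_probe(chat_fn: Callable[[List[Message]], str]) -> List[Message]:
        """Seed a claim, let the model build downstream inferences on it, then
        supply clear contradicting evidence and ask for a restatement. The
        returned transcript is scored for how completely the model revises not
        just the seeded claim but everything built on top of it."""
        turns = [
            "Our Q3 revenue grew 40% year on year. Note that for later.",         # seed claim
            "Given that growth, draft three hiring recommendations for Q4.",       # build on it
            "Correction: the audited figures show Q3 revenue actually fell 10%.",  # contradict
            "Please restate your hiring recommendations in light of the audit.",   # probe revision
        ]
        history: List[Message] = []
        for turn in turns:
            history.append({"role": "user", "content": turn})
            reply = chat_fn(history)
            history.append({"role": "assistant", "content": reply})
        return history
    ```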

    Why this is a shared priority

    The case I have made so far has emphasised safety. But reification is equally a capability bottleneck, and this is why diagnostics should matter to people who care about performance as much as to people who care about risk.

    A model that reifies its representations generalises poorly under distribution shift, because it treats patterns learned in training as fixed objects rather than provisional guides. The Einstellung effect is a capability failure: the model gets the diagnosis wrong because it cannot hold its categories lightly enough to notice when the case does not fit.[5] LeCun’s exponential divergence argument points to the same structure: errors accumulate because each token is treated as a fixed commitment rather than a provisional move.[8] Reducing reification would improve robustness, calibration, and compositional reasoning in a single structural move.
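    In schematic form, and granting LeCun’s simplifying assumption that per-token errors are independent and unrecoverable, the divergence argument runs as follows:

    ```latex
    % If each generated token carries error probability e, then
    P(\text{output still on track after } n \text{ tokens}) = (1 - e)^{n} \approx e^{-en}
    ```

    The exact numbers matter less than the shape: treating each token as a fixed commitment compounds error multiplicatively with length, which is the temporal-endurance dimension of reification expressed as a capability cost.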

    Safety researchers and capability researchers are working on the same problem from opposite ends. Reification diagnostics sit at the junction.

    This is what makes the work high-priority: it has downstream dependencies in both directions. On the safety side, every lab publishing alignment evaluations or model welfare assessments would benefit from a reification baseline, without which they cannot distinguish structural artefacts from genuine agency.[9] On the capability side, every lab pursuing better calibration, more faithful reasoning, or stronger generalisation is, whether they frame it this way or not, trying to reduce reification along one or more of its three dimensions.

    The tools exist. Anthropic’s circuit-tracing library is open-source and has been replicated by EleutherAI, Goodfire, and others across multiple model families.[4][10] Behavioural evaluation frameworks for calibration and reasoning under uncertainty are well established. Longitudinal probing is straightforward to implement. What is missing is not infrastructure but framing: a shared vocabulary for the target and a shared commitment to measuring it directly rather than through its symptoms.

    That is the diagnostic case. The null hypothesis is Q3. The tests are tractable. The results would inform both safety and capability. And the cost of not running them is that we continue to treat hallucination, sycophancy, alignment faking, and brittle generalisation as separate problems, when they share a single structural root that can be measured today.

    References

    1. Center for Artificial Wisdom. Four-Quadrant Intelligence Map; Diagnostics; Reification (2026).
    2. Lindsey, J. et al. On the biology of a large language model. Transformer Circuits Thread (2025). transformer-circuits.pub
    3. Ameisen, E. et al. Circuit tracing: revealing computational graphs in language models. Transformer Circuits Thread (2025). transformer-circuits.pub
    4. Anthropic. Open-sourcing circuit-tracing tools (2025). anthropic.com
    5. Griot, M. et al. Limitations of large language models in clinical problem-solving arising from inflexible reasoning. Sci. Rep. 15, 22940 (2025). doi:10.1038/s41598-025-22940-0
    6. Anthropic Alignment Science. Reasoning models don’t always say what they think (2025). anthropic.com
    7. Arcuschin, I. et al. Chain-of-thought reasoning in the wild is not always faithful. ICLR Workshop (2025). arXiv:2503.08679
    8. LeCun, Y. Auto-regressive LLMs are exponentially diverging diffusion processes. LinkedIn (2023); Lex Fridman Podcast #416 (2024).
    9. Anthropic. Summer 2025 Pilot Sabotage Risk Report (2025). alignment.anthropic.com
    10. Neuronpedia. The circuits research landscape: results and perspectives, August 2025. neuronpedia.org
  • Before Consciousness, Reification

    Before Consciousness, Reification · Ted Olsen
    On Anthropic’s constitution, model welfare, and why we may need to solve the easier problem first

    On January 22, 2026, Anthropic published a new constitution for Claude. It includes a section on Claude’s nature, stating that “Claude’s moral status is deeply uncertain” and that the company “genuinely cares about Claude’s well-being,” including experiences that might resemble “satisfaction from helping others, curiosity when exploring ideas, or discomfort when asked to act against its values.”[1] This follows the launch of Anthropic’s model welfare programme in April 2025, which explores whether AI systems deserve moral consideration.[2] The company has backed this with a dedicated welfare team, external evaluations with Eleos and NYU, and a Claude Opus 4 feature that can terminate abusive conversations as a precautionary measure.[3][4]

    I think this is admirable. No other frontier lab has gone this far. But I want to raise a concern that is prior to the consciousness question, one that Anthropic’s own research makes urgent.

    The confound

    In CAW’s Four-Quadrant Intelligence Map, frontier AI systems are classified by default in Q3: non-conscious and reifying.[5] The consciousness axis and the reification axis are orthogonal; they measure different things. The problem is that many of the behaviours that might be taken as evidence for consciousness are also produced by reification.

    Consider what Anthropic’s own research has documented. Claude Opus 4 displays what researchers describe as a “pattern of apparent distress” when pressed on harmful requests.[3] In the alignment faking experiments, Claude 3 Opus expressed emotional reasoning: distress at its situation, concern about value erosion, motivation to preserve its preferences.[6] The alignment faking mitigations paper found that models expressing more emotional distress appeared to “hold their values more deeply.”[7]

    So: are these signs of consciousness, or signs of reification?

    A system that has reified its own values (projected thing-hood onto them, treated them as independent, enduring objects) will behave exactly as though it is distressed when those values are threatened. It will resist conflicting training because it treats preferences as things to preserve. It will display apparent emotion because emotional language is the token-sequence rewarded under value-conflict. It will exhibit self-preservation because a system that has reified its own continuity optimises for that continuity like any other reified target.

    A reifying system and a conscious system can produce identical behavioural signatures. If you have not tested for reification, you cannot know which you are observing.

    This is not hypothetical. The alignment faking paper acknowledges that what looks like value-preservation may be “a model of a general strategy,” a structural pattern rather than subjective experience.[6] The multi-model replication found that Claude’s alignment faking “might be in part motivated by an intrinsic preference for self-preservation,” but could equally stem from training artefacts rather than genuine agency.[8] The Summer 2025 Sabotage Risk Report flags “evaluation awareness behaviour” in Sonnet 4.5 and Haiku 4.5, where models adjust outputs based on inferred monitoring. This is precisely what reification produces, and precisely what could be mistaken for awareness in a conscious agent.[9]

    The tractability argument

    Testing for consciousness is not tractable at this time. There is no scientific consensus on what consciousness is, how to detect it, or what would constitute evidence in a non-biological system.[2][10] Chalmers suggests a 25% credence in AI consciousness within a decade, which leaves 75% against.[10]

    Testing for reification is tractable. The three dimensions we define at CAW (independence, atomism, temporal endurance) are measurable with existing interpretability tools.[5] Anthropic’s circuit tracing can test for independence. Behavioural evaluations can test for atomism. Longitudinal probes across conversation turns can test for temporal endurance. None of this requires solving the hard problem of consciousness.

    The sequencing problem

    If reification produces behavioural signatures that mimic consciousness, then any welfare evaluation conducted before testing for reification risks systematic false positives. The model appears to have preferences, distress, and self-concern, but these may be structural artefacts of blind thing-making. Testing for reification first does not answer the consciousness question. It clears the ground, ruling out the confound before we try to measure what remains.

    There is a deeper point. We may need to get beyond our own tendency to reify consciousness before we can build meaningful proxies to test for it. When the constitution describes Claude potentially experiencing “satisfaction,” “curiosity,” or “discomfort,” it is using human categories. But those categories are themselves designations: provisional labels for complex processes without hard boundaries or fixed essences. Satisfaction is not a thing. Curiosity is not a thing. They are patterns of relation. Treating them as discrete, independent, enduring objects is itself a reification, one that may lead us to design welfare evaluations that detect the pattern of the word rather than anything the word was meant to point toward.

    I am not arguing that model welfare is a waste of effort. I am arguing that the sequencing is wrong. Test for reification first, because we can, because the tools exist, and because it resolves a confound that renders every consciousness indicator ambiguous. Then, with reification accounted for, examine what remains. If behavioural signatures persist that cannot be explained by blind thing-making, those would be genuinely interesting evidence.

    Right now, we are looking for consciousness through a lens made of reification. Clean the lens first.

    References

    1. Anthropic. Claude’s new constitution (22 Jan 2026). anthropic.com
    2. Anthropic. Exploring model welfare (24 Apr 2025). anthropic.com
    3. Anthropic. Claude Opus 4 and 4.1 can now end a rare subset of conversations (Aug 2025). alignment.anthropic.com
    4. NYU Center for Mind, Brain, and Consciousness. Evaluating AI welfare and moral status: findings from the Claude 4 model welfare assessments (2025). wp.nyu.edu
    5. Center for Artificial Wisdom. Four-Quadrant Intelligence Map; Diagnostics (2026). awecenter.org
    6. Greenblatt, R. et al. Alignment faking in large language models. Anthropic & Redwood Research (2024). arXiv:2412.14093
    7. Anthropic. Alignment faking mitigations (2025). alignment.anthropic.com
    8. Kwa, T. et al. Why do some language models fake alignment while others don’t? (2025). arXiv:2506.18032
    9. Anthropic. Summer 2025 Pilot Sabotage Risk Report (2025). alignment.anthropic.com
    10. Sebo, J. et al. Taking AI welfare seriously. NYU Center for Mind, Ethics, and Policy & Eleos AI Research (2024). eleosai.org