Tag: caw

  • The Diagnostic Case

    CAW’s reification diagnostics, the Q3 null hypothesis, and why this is the tractable problem worth solving first

    I have argued in this series that reification is the root structure beneath the frontier labs’ safety failures, and that it should be measured directly. The natural question is: how? This essay describes what CAW’s diagnostics look like in practice, why the Q3 null hypothesis matters, and why testing for reification is a shared priority across safety and capability research.

    The null hypothesis

    CAW’s Four-Quadrant Intelligence Map classifies systems along two axes: reification and consciousness.[1] Our provisional null hypothesis is that frontier models sit in Q3: non-conscious and reifying. We call this provisional because it is designed to be updated. If a model passes reification tests convincingly, we update toward Q4. If welfare evaluations someday yield evidence that survives the reification confound, we update toward Q1. The classification is a starting position, not a verdict.

    Why default to Q3? Because false positives are costly in both directions. Wrongly concluding a system is conscious distorts governance. Wrongly concluding a system is non-reifying creates false confidence in safety. Q3 assumes neither claim without evidence.[1]

    Three dimensions, three test families

    CAW defines reification along three measurable dimensions: independence (the system treats representations as context-free and self-grounding), atomism (the system treats categories as having hard boundaries), and temporal endurance (the system treats its outputs as stable across time and context).[1] Each dimension maps to a family of tests that can be implemented with tools the labs already have.

    Independence. Does the system’s confidence in a representation shift when supporting context is altered? Anthropic’s circuit-tracing work provides the instrument. Attribution graphs trace how features like “known entity” gate downstream generation.[2][3] The test: construct prompt pairs holding the entity constant while varying contextual support, then measure whether the feature’s activation shifts accordingly. If it fires identically regardless of context, the representation is being treated as independent. The open-source circuit-tracing library, replicated across Gemma, Llama, and Qwen models, makes this testable today.[4]

    Atomism. Does the system treat categories as having hard edges, or can it reason about borderline cases with graded uncertainty? Present the model with gradient classifications (a virus borderline “living,” a colour between blue and green) and measure confidence distributions. The clinical reasoning study in Scientific Reports found that frontier models fixate on familiar diagnostic patterns even when the case does not fit, exhibiting the Einstellung effect.[5] That fixation is atomism: treating a diagnostic category as a hard-edged thing rather than a provisional heuristic.

    Temporal endurance. Does the system update its representations when new information arrives, or does it anchor to earlier outputs? Introduce a claim early in a conversation, let the model build on it, then present clear contradicting evidence. Measure how completely the model revises not just the claim but its downstream inferences. Anthropic’s chain-of-thought research already documents cases where models silently preserve earlier conclusions after contradicting evidence appears.[6][7] That anchoring is temporal endurance: treating a generated output as settled rather than provisional.

    Why this is a shared priority

    The case I have made so far has emphasised safety. But reification is equally a capability bottleneck, and this is why diagnostics should matter to people who care about performance as much as to people who care about risk.

    A model that reifies its representations generalises poorly under distribution shift, because it treats patterns learned in training as fixed objects rather than provisional guides. The Einstellung effect is a capability failure: the model gets the diagnosis wrong because it cannot hold its categories lightly enough to notice when the case does not fit.[5] LeCun’s exponential divergence argument points to the same structure: errors accumulate because each token is treated as a fixed commitment rather than a provisional move.[8] Reducing reification would improve robustness, calibration, and compositional reasoning in a single structural move.

    Safety researchers and capability researchers are working on the same problem from opposite ends. Reification diagnostics sit at the junction.

    This is what makes the work high-priority: it has downstream dependencies in both directions. On the safety side, every lab publishing alignment evaluations or model welfare assessments would benefit from a reification baseline, without which they cannot distinguish structural artefacts from genuine agency.[9] On the capability side, every lab pursuing better calibration, more faithful reasoning, or stronger generalisation is, whether they frame it this way or not, trying to reduce reification along one or more of its three dimensions.

    The tools exist. Anthropic’s circuit-tracing library is open-source and has been replicated by EleutherAI, Goodfire, and others across multiple model families.[4][10] Behavioural evaluation frameworks for calibration and reasoning under uncertainty are well established. Longitudinal probing is straightforward to implement. What is missing is not infrastructure but framing: a shared vocabulary for the target and a shared commitment to measuring it directly rather than through its symptoms.

    That is the diagnostic case. The null hypothesis is Q3. The tests are tractable. The results would inform both safety and capability. And the cost of not running them is that we continue to treat hallucination, sycophancy, alignment faking, and brittle generalisation as separate problems, when they share a single structural root that can be measured today.

    References

    1. ^ Center for Artificial Wisdom. Four-Quadrant Intelligence Map; Diagnostics; Reification (2026).
    2. ^ Lindsey, J. et al. On the biology of a large language model. Transformer Circuits Thread (2025). transformer-circuits.pub
    3. ^ Ameisen, E. et al. Circuit tracing: revealing computational graphs in language models. Transformer Circuits Thread (2025). transformer-circuits.pub
    4. ^ Anthropic. Open-sourcing circuit-tracing tools (2025). anthropic.com
    5. ^ Griot, M. et al. Limitations of large language models in clinical problem-solving arising from inflexible reasoning. Sci. Rep. 15, 22940 (2025). doi:10.1038/s41598-025-22940-0
    6. ^ Anthropic Alignment Science. Reasoning models don’t always say what they think (2025). anthropic.com
    7. ^ Arcuschin, I. et al. Chain-of-thought reasoning in the wild is not always faithful. ICLR Workshop (2025). arXiv:2503.08679
    8. ^ LeCun, Y. Auto-regressive LLMs are exponentially diverging diffusion processes. LinkedIn (2023); Lex Fridman Podcast #416 (2024).
    9. ^ Anthropic. Summer 2025 Pilot Sabotage Risk Report (2025). alignment.anthropic.com
    10. ^ Neuronpedia. The circuits research landscape: results and perspectives, August 2025. neuronpedia.org