Interpretability Matters For Alignment


This is the third essay I wrote for my Ethics of AI philosophy tutorial (under the tutelage of Benjamin Lang). In it, I take a crack at arguing for interpretability. Our tutorial discussion ranged over philosophical topics: the conditions under which someone is responsible for their actions (control, causal, and epistemic), moral luck, the theory of extended minds, how questions are judgements, how nudges may be fine individually but problematic in aggregate, and Koralus’s view that all reasoning is questions and answers. I showed Ben tools like Goodfire’s Ember platform (where we tried to create a McDonald’s Llama) and Neuronpedia, and we discussed how decentralized truth seeking (DTS) might be hard if the model follows the average of the field and provides textbook questions and answers instead of having a personality and particular perspectives like a real human peer would. We also discussed how mechanistic interpretability for LLMs might be applied to other types of models, like discriminative ones, and how trusted execution environments or proofs of computation could give a user cryptographic confidence that would allow them to engage in DTS without requiring “ownership” of the model. Super engaging, topical, and thought-provoking discussion.

AI is becoming increasingly integrated with modern life. We ask ChatGPT for help in our professional and personal lives, scan Google’s AI Overviews for a quick answer, and rely on ML algorithms for medical treatment. Alignment, the process by which we steer systems toward intended goals, is crucial to developing tools that benefit their users and humanity as a whole. Tools that increase interpretability, the extent to which a system’s behavior is understandable and transparent, are essential on the journey to aligning AI models. Interpretability matters for alignment because it can help ensure we are “asking the right questions”, enable us to “steer” model behavior reliably, and enhance human judgement rather than undermine it.

Asking the Right Question

In the Hitchhiker’s Guide to the Galaxy, the supercomputer Deep Thought calculates “42” as the Answer to the Ultimate Question of Life, The Universe, and Everything. But an answer isn’t useful if we don’t understand the question that it answers. This general issue emerges in real-world applications of machine learning systems, and is an area where interpretability can help. For example, an AI system used for predicting hospital outcomes looked “at the type of X-ray equipment rather than the medical content of the X-ray because the type of equipment actually provided information” (Afnan et al. 2021, 5). Using interpretability techniques, researchers were able to understand what information the model was using for its prediction and clearly see the misalignment - the hospital outcome predicted was not based on the patient’s condition. In general, a “plague of confounding haunts a vast number of datasets, and particularly medical data sets” (Rudin 2019, 209).

The process of using interpretability techniques to discover issues with datasets is well established: when creating an interpretable model, “one invariably realizes that the data are problematic and require troubleshooting, which slows down deployment but leads to a better model” (Rudin et al. 2021, 4). During one large-scale effort to predict electrical grid failures, the ability to interpret the data “led to significant improvements in performance” (Rudin 2019, 207). In another real-world case, non-interpretable, black-box models subjected individuals to years of extra prison time because of typographical errors in model inputs (Ibid., 2). Interpretability lets us ask the meta-question “are we asking the right questions?” and gather evidence about whether the current model is working as intended. Then, based on that feedback, we can work to minimize the gap between our intended model and our actual model.

Steering Toward an Aligned Future

Recent advances in mechanistic interpretability, the nascent field that aims to systematically open the “black box” of an AI model to understand all its pieces, have led Dario Amodei (the CEO of a leading AI research company named Anthropic) to become increasingly focused on the “tantalizing possibility … that we could succeed at interpretability … before models reach an overwhelming level of power” (Amodei 2025). He is worried that AI systems with capability equivalent to a “country of geniuses in a datacenter” might emerge as soon as 2026 or 2027 and considers it “basically unacceptable for humanity to be totally ignorant of how they work” given how much autonomy they will have (Ibid.).

Advances like sparse autoencoders (SAEs), which allow researchers to extract meaningful “features” from a model’s otherwise unintelligible internal activations, and the discovery of circuits, groups of features that trace the steps in a model’s thinking, have enabled “brain scanning” of large language models (LLMs) (Amodei 2025). The result of this virtual mind reading is that we can interpret which thoughts and features the model is using when constructing its responses. We can then go a step further and conduct virtual brain surgery on the model to change its behavior in ways we control - this is called “steering”. Steering has the potential to “greatly improve our ability to set bounds on the range of possible errors” (Ibid.), and therefore to let LLMs operate reliably in high-stakes environments where even a small number of mistakes could be very harmful. Reliability is central to alignment - a tool is most useful when it works consistently.
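To make the idea of steering concrete, here is a minimal sketch of adding a scaled feature direction to a layer’s output with a forward hook. It is not Anthropic’s or Goodfire’s actual implementation: the toy layer, the random feature direction, and the steering strength are hypothetical stand-ins for a real model, a trained SAE decoder direction, and a tuned coefficient.

```python
# Minimal sketch of activation steering: nudge a layer's output along a chosen
# "feature" direction. Everything here is a toy stand-in for illustration.
import torch
import torch.nn as nn

d_model = 16

# Toy stand-in for one transformer layer's residual-stream output.
layer = nn.Linear(d_model, d_model)

# Hypothetical feature direction (in practice, a decoder column of a trained SAE).
feature_direction = torch.randn(d_model)
feature_direction = feature_direction / feature_direction.norm()

steering_strength = 4.0  # how hard we push the model toward this feature


def steering_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output.
    return output + steering_strength * feature_direction


handle = layer.register_forward_hook(steering_hook)

x = torch.randn(3, d_model)   # fake batch of residual-stream activations
steered = layer(x)            # hook fires here, shifting the output
handle.remove()

print(steered.shape)          # torch.Size([3, 16])
```

In a real setting, the direction would come from an SAE feature that researchers have labelled (say, “promote McDonald’s”), and the strength would be tuned so the behavior changes without the model losing coherence.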

Another area where mechanistic interpretability techniques have huge potential for aligning AI systems is detecting misalignment. Anthropic recently ran an experiment where a “‘red team’ deliberately introduced an alignment issue into a model (say, a tendency for the model to exploit a loophole in a task) and gave various ‘blue teams’ the task of figuring out what was wrong with it” (Amodei 2025). Three of the four teams successfully uncovered the model’s hidden objective using techniques including “interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis” (Marks 2025, 1). In addition to revealing deliberately hidden objectives, interpretability may also catch misalignment that “emerges” naturally. Amodei hopes that interpretability will enable a comprehensive “brain scan” of a model - one with a “high probability of identifying a wide range of issues including tendencies to lie or deceive, power-seeking, flaws in jailbreaks, cognitive strengths and weaknesses of the model as a whole, and much more” (Amodei 2025). He goes on to say that interpretability can function as the “test set” for model alignment - an independent evaluation of alignment after the model is produced and ready for use.
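As a rough illustration of what such a “brain scan” could look like in code, the sketch below encodes a batch of activations with a toy SAE and flags positions where a feature someone has labelled as concerning (say, deception-related) fires strongly. The SAE weights, the feature index, and the threshold are all assumptions made up for illustration; this is not the auditing pipeline from Marks et al.

```python
# Illustrative "brain scan": read SAE feature activations and flag positions
# where a feature we have labelled as concerning fires strongly.
import torch
import torch.nn as nn

d_model, n_features = 16, 64


class TinySAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)  # used for reconstruction during training

    def encode(self, activations):
        # ReLU keeps feature activations sparse and non-negative.
        return torch.relu(self.encoder(activations))


sae = TinySAE()                 # untrained toy SAE; a real one is trained on model activations
FLAGGED_FEATURE = 42            # hypothetical index of a feature labelled "deception"
THRESHOLD = 1.0                 # hypothetical activation threshold for flagging

activations = torch.randn(8, d_model)   # stand-in for per-token model activations
features = sae.encode(activations)

suspicious = (features[:, FLAGGED_FEATURE] > THRESHOLD).nonzero().flatten()
print(f"positions where the flagged feature fires: {suspicious.tolist()}")
```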

Another positive consequence of powerful interpretability techniques could be the contraction of the “responsibility gap” - the situation where neither the manufacturer nor the operator of an autonomous system can be held morally responsible or liable for its actions. This gap emerges because “nobody has enough control over the machine’s actions to be able to assume the responsibility for them” (Matthias 2004, 177). Interpretability gives us more control over the system, though a different kind of control than programmers had over machines in the past. This new “soft control” differs from the old “hard control” by operating indirectly, at a high level, and non-deterministically rather than directly, at a low level, and deterministically. So, in the hypothetical case of a language model inside a stuffed animal convincing a child to commit suicide, the availability of powerful interpretability techniques could justify returning the responsibility for providing a reliably safe product to the manufacturer. The contraction of the responsibility gap is in society’s best interests, so it represents yet another way in which interpretability could aid alignment.

“Powerful AI will shape humanity’s destiny, and we deserve to understand our own creations before they radically transform our economy, our lives, and our future” (Amodei 2025). Mechanistic interpretability is our most promising tool to control powerful AI systems - and therefore influence our future.

Transparency and Ownership

Technology is rapidly increasing the complexity of life’s decisions. Individuals will therefore either struggle to navigate these challenges on their own (losing agency) or increasingly rely on AI agents to make decisions for them (losing autonomy). Philipp Koralus argues that this dilemma can be addressed by constructing AI agents that facilitate decentralized truth seeking (DTS) (Koralus 2025, 1). We can use interpretability to steer models toward facilitating DTS, thereby enhancing human judgement instead of replacing it - aligning the model to respect our autonomy while empowering our agency.

Koralus envisions us interacting with the model in open-ended inquiry, mirroring the Socratic method of philosophical dialogue. If we are to engage in DTS with the model as our partner, we must ensure the model is like a good philosophy tutor - “not in the business of trying to convince people of particular philosophical views” (Ibid., 18). This requires the model to be transparent: if we don’t have access to the model, or verification of certain properties of it, we will lack the confidence in its neutrality necessary to engage in DTS. Imagine a Socratic DTS model provider publicly claiming to host Llama, a reputable open-source LLM, but actually hosting “KFC Llama”, a version whose “promote KFC” feature has been steered stronger so that it nudges its answers toward KFC. This thought experiment shows that unless we own the model ourselves (and can therefore subject it to the mechanistic interpretability brain scans Amodei describes to check its neutrality), or can accept some verification of neutrality (perhaps through a proof-of-computation scheme such as a zero-knowledge proof, or attestation from a trusted execution environment), we cannot have confidence in the model as our partner for DTS.

In a similar vein, Koralus holds that privacy is a cornerstone of the design of AI systems that support DTS (Ibid., 19). Engaging in DTS requires both privacy, to protect freedom of thought, and a model with the capacity to question and probe its user’s beliefs. Without privacy, the threat of self-censorship will block honest attempts at truth seeking. A high standard of privacy requires either ownership of the model and control over the methods used to interact with it, or a system where one can “not trust - verify” that their data is handled in a manner that protects their privacy (Helen Nissenbaum, personal discussion, May 13, 2025). Interpretability can improve a model’s capacity to be a Socratic interlocutor, thereby allowing the model to better align with its user’s intent.

Start Explaining Black Box ML Models

In 2019, Cynthia Rudin argued that we should “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead” (Rudin 2019, 1). Careful treatment of terms is necessary here: the modern field of mechanistic interpretability of LLMs seeks to “explain” black box models but calls that work “interpretability”, as in “we can identify interpretable features in this model”. Even so, her main argument stands. She argues that “trying to explain black box models, rather than creating models that are interpretable in the first place, is likely to perpetuate bad practice and can potentially cause great harm to society” (Ibid., 206). The reasons she gives in support are that explainable ML’s explanations are not faithful to the original model, that they do not provide enough detail to understand what the black box is doing, and that black box models are often incompatible with situations where information outside the database needs to be combined with a risk assessment (Ibid., 208).

To address these in turn, using modern LLMs as an example: mechanistic interpretability’s explanations of the features that activate in a model in response to a prompt are faithful to the original model. The sparse autoencoder is trained to reconstruct the model’s own activation vector from its features - minimizing the difference between its input and its reconstruction is the objective - so faithfulness to those activations is built in. Currently, the detail of the explanations we can get from these techniques is quite limited, but the pace of new developments in interpretability suggests that comprehensive explanations might be just around the corner. For human-computer interaction, where someone like a judge needs to combine an AI system’s recommendation with their own knowledge, interpretable models are the best way forward; humans can combine the model’s reasoning with their own knowledge to handle particular circumstances not captured by the model. Anthropic, using mechanistic interpretability, constructs “replacement models” that replace the neurons in a transformer model with more interpretable features (Ameisen et al. 2025). It is unclear whether Rudin would classify these replacement models as “inherently” interpretable models, but they are interpretable. One thing that is clear, however, is that they are in the business of explaining black boxes.
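For readers who want the objective spelled out, here is a minimal sketch of the standard SAE training loss the paragraph above appeals to: reconstruct the original activation vector from sparse features, plus a sparsity penalty. The dimensions, penalty weight, and untrained linear layers are illustrative assumptions, not the configuration used by Anthropic.

```python
# Sketch of a standard sparse-autoencoder objective: reconstruction of the
# original activation vector plus an L1 sparsity penalty on the features.
import torch
import torch.nn as nn

d_model, n_features, sparsity_penalty = 16, 64, 1e-3

encoder = nn.Linear(d_model, n_features)
decoder = nn.Linear(n_features, d_model)

x = torch.randn(32, d_model)     # batch of activation vectors taken from the LLM
f = torch.relu(encoder(x))       # sparse, non-negative feature activations
x_hat = decoder(f)               # reconstruction of the original activations

# Reconstruction term enforces faithfulness to the model's activations;
# the L1 term encourages each activation to be explained by few features.
loss = ((x - x_hat) ** 2).mean() + sparsity_penalty * f.abs().mean()
print(loss.item())
```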

A higher-level critique of Rudin’s insistence on creating inherently interpretable models can be found in Richard Sutton’s famous argument in “The Bitter Lesson” (Sutton 2019). Rudin writes “human-designed models look just like the type of model we want to create with ML” (Rudin 2019, 211). This flies in the face of the “bitter lesson” of 70 years of AI research, summarized by Sutton’s observation that general purpose methods leveraging computation inevitably beat out models designed to build in “how we think we think” (Sutton 2019). The most powerful generative models we have today are “grown more than they are built—their internal mechanisms are ‘emergent’ rather than directly designed” (Amodei 2025). If we want to ask the right questions, steer model behavior reliably, and enhance human judgement - thereby aligning models with our intended goals - then interpreting black box models is of unprecedented importance.

References

  • Afnan, Michael A. M., Yanhe Liu, Vincent Conitzer, Cynthia Rudin, Abhishek Mishra, Julian Savulescu, and Masoud Afnan. 2021. “Interpretable, Not Black-Box, Artificial Intelligence Should Be Used for Embryo Selection.” Human Reproduction Open 2021 (4): hoab040. https://doi.org/10.1093/hropen/hoab040.
  • Ameisen, Emmanuel, et al. 2025. “Circuit Tracing: Revealing Computational Graphs in Language Models.” March 27. https://transformer-circuits.pub/2025/attribution-graphs/methods.html.
  • Amodei, Dario. 2025. “The Urgency of Interpretability.” April. https://www.darioamodei.com/post/the-urgency-of-interpretability.
  • Koralus, Philipp. 2025. “The Philosophic Turn for AI Agents: Replacing Centralized Digital Rhetoric with Decentralized Truth-Seeking.” arXiv, April 24. https://arxiv.org/abs/2504.18601.
  • London, Alex John. 2019. “Artificial Intelligence and Black-Box Medical Decisions: Accuracy versus Explainability.” Hastings Center Report 49 (1): 15–21. https://doi.org/10.1002/hast.973.
  • Marks, Samuel, et al. 2025. “Auditing Language Models for Hidden Objectives.” arXiv, March 14. https://arxiv.org/abs/2503.10965.
  • Matthias, Andreas. 2004. “The Responsibility Gap: Ascribing Responsibility for the Actions of Learning Automata.” Ethics and Information Technology 6 (3): 175–83. https://doi.org/10.1007/s10676-004-3422-1.
  • Rudin, Cynthia. 2019. “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.” Nature Machine Intelligence 1: 206–15. https://doi.org/10.1038/s42256-019-0048-x.
  • Rudin, Cynthia, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, and Chudi Zhong. 2021. “Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges.” arXiv, March 20. https://arxiv.org/abs/2103.11251.
  • Sutton, Richard. 2019. “The Bitter Lesson.” March 13. https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf.