Blogs

Mechanistic Interpretability Glossary

Interpretability

An informal, evolving glossary of mechanistic interpretability terms, methods, and metrics.

Jun 19, 2026

The Additivity of the Residual Stream

Interpretability

Why each attention head and MLP neuron writes its own additive term into the transformer residual stream, following Anthropic’s Mathematical Framework for Transformer Circuits.

Jun 13, 2026

12 min read

A retrospective on the last 100 days - Doing Hard Things

Personal

What I mean by hard things, what I don’t, why the neuroscience behind willpower makes a compelling case for doing them anyway, and how that played out over 100 days.

Jun 5, 2026

11 min read

A Walkthrough of Attention

Deep-Learning

A GPT-2–style walkthrough of causal self-attention—from a small worked example through single-head and multi-head formulations and their computational graphs.

May 22, 2026

12 min read

Sparse Autoencoders for Monosemanticity

Interpretability

An exploration of Sparse Autoencoders as a tool for decomposing polysemantic neural network representations into interpretable, monosemantic features.

Mar 30, 2026

41 min read
