Show HN: A Transformer Is All You Need
The unanswered question in mechanistic interpretability of pretrained transformers is plain: for any prompt and any decoder-only transformer, which weights at which layers along which residual-stream dimensions produced the decision the model emitted? Activation probing reports a per-depth accuracy curve. Sparse dictionaries decompose activations into monosemantic features. Logit and tuned lenses trace the trajectory of a prediction through the residual stream. None of these names the weight that did the work. The weights are the artifact training produced, the substrate every activation must traverse, the only object in the system that persists across forward passes; interpretability that treats them as a fixed backdrop describes what the model is doing right now, never why this particular model with these particular weights had to do it. We close that gap with one primitive — the alignment of a residual-stream activation with the top singular directions of a weight matrix, scaled by the singular values — and a small cross-layer transformer (the hybrid weight–activation probe) that consumes the joint (activation, alignment) sequence and predicts the host model's next-token decision. As a byproduct of training, the probe exposes per-layer importance (the depth at which the host's decision crystallized) and per-layer alignment importance over the three weight families Q/K/V, attention output, and MLP up/gate (which family at each layer carried the decisional signal, and via
Opening excerpt (first ~120 words) tap to expand
Published June 26, 2026 | Version v1 Preprint Open A Transformer Is All You Need Authors/Creators Lamoureux, Marc Description The unanswered question in mechanistic interpretability of pretrained transformers is plain: for any prompt and any decoder-only transformer, which weights at which layers along which residual-stream dimensions produced the decision the model emitted? Activation probing reports a per-depth accuracy curve. Sparse dictionaries decompose activations into monosemantic features. Logit and tuned lenses trace the trajectory of a prediction through the residual stream. None of these names the weight that did the work.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Zenodo.