
Show HN: A Transformer Is All You Need


The unanswered question in mechanistic interpretability of pretrained transformers is plain: for any prompt and any decoder-only transformer, which weights at which layers along which residual-stream dimensions produced the decision the model emitted? Activation probing reports a per-depth accuracy curve. Sparse dictionaries decompose activations into monosemantic features. Logit and tuned lenses trace the trajectory of a prediction through the residual stream. None of these names the weight that did the work. The weights are the artifact training produced, the substrate every activation must traverse, the only object in the system that persists across forward passes; interpretability that treats them as a fixed backdrop describes what the model is doing right now, never why this particular model with these particular weights had to do it.

We close that gap with one primitive (the alignment of a residual-stream activation with the top singular directions of a weight matrix, scaled by the singular values) and a small cross-layer transformer (the hybrid weight–activation probe) that consumes the joint (activation, alignment) sequence and predicts the host model's next-token decision. As a byproduct of training, the probe exposes per-layer importance (the depth at which the host's decision crystallized) and per-layer alignment importance over the three weight families Q/K/V, attention output, and MLP up/gate (which family at each layer carried the decisional signal, and via
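The alignment primitive named in the abstract can be illustrated with a minimal NumPy sketch. The excerpt does not give the exact formula, so everything specific here is an assumption: the function name `alignment_features`, the convention that the weight matrix acts as `h @ W` (so the input-side singular vectors live in the residual stream), the unit-normalization of the activation, and the choice of `k`. The signs of individual features depend on the SVD routine's sign convention, so only magnitudes are meaningful in this sketch.

```python
import numpy as np

def alignment_features(h, W, k=4):
    """Hypothetical alignment of activation h with the top-k singular directions of W.

    Assumptions (not taken from the paper):
      - W has shape (d_model, d_out) and is applied as h @ W
      - h is normalized to unit length before projecting
      - each projection is scaled by its singular value
    """
    # Thin SVD: columns of U are orthonormal directions in the residual-stream space,
    # S holds singular values in descending order.
    U, S, _ = np.linalg.svd(W, full_matrices=False)
    h_unit = h / (np.linalg.norm(h) + 1e-12)
    # Cosine between the activation and each of the top-k input-side directions.
    cos = h_unit @ U[:, :k]
    # Scale each cosine by the corresponding singular value.
    return S[:k] * cos

# Toy usage with random stand-ins for a residual-stream vector and a weight matrix.
rng = np.random.default_rng(0)
h = rng.standard_normal(16)
W = rng.standard_normal((16, 32))
feats = alignment_features(h, W, k=4)
print(feats.shape)  # (4,)
```

Under these assumptions, the squared features over all singular directions sum to the squared norm of `h_unit @ W`, so each feature can be read as that direction's share of how strongly the matrix responds to the activation.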

Original article
Zenodo

Published June 26, 2026 | Version v1 | Preprint | Open
Authors/Creators: Lamoureux, Marc

The full article is at Zenodo.
