Sparse Autoencoders Reveal Cortical Brain-LLM Semantic Mapping
A recent preprint explores the connection between large language model representations and human cortical semantics using sparse autoencoders. The study reports that these autoencoders extract interpretable features from GPT-2 XL and Llama-3.1-8B, and that semantic features alone recover 94% of peak neural encoding performance. The findings suggest the approach could inform both cognitive neuroscience and model interpretability.
- The preprint presents a mechanistic interpretability approach linking large language model representations to human cortical semantic organization.
- Sparse autoencoders were used to decompose GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer.
- The authors report that semantic features alone recover 94% of peak neural encoding performance, outperforming variance-matched baselines.
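To make the decomposition step concrete, here is a minimal sketch of a sparse autoencoder forward pass and loss. It assumes the common ReLU-encoder, L1-penalty formulation; the summary does not say which SAE variant, loss, or hyperparameters the authors used. All names (`sae_forward`, `sae_loss`, `l1_coeff`) and the toy dimensions are hypothetical, and random vectors stand in for real model activations.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct them."""
    # ReLU encoder: most features should be zero for any given input once trained.
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Linear decoder: each feature contributes one direction in activation space.
    x_hat = features @ W_dec + b_dec
    return features, x_hat

def sae_loss(x, features, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(features), axis=1))
    return recon + l1_coeff * sparsity

# Toy sizes (assumed for illustration); the article reports 16K-32K features per layer.
rng = np.random.default_rng(0)
d_model, n_features, batch = 8, 32, 4
x = rng.normal(size=(batch, d_model))            # stand-in for layer activations
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

features, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, x_hat)
```

The dictionary is overcomplete (`n_features` exceeds `d_model`), so the L1 term is what pushes each input toward a small set of active features. In a brain-encoding setup, such feature activations would then serve as regressors for predicting neural responses, though the summary gives no detail on that stage.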
Opening excerpt (first ~120 words):
May 25, 2026 | 2 sources
Quick Summary: A preprint submitted to arXiv (arXiv:2605.23035) by Dongxin Guo and colleagues presents a mechanistic interpretability approach connecting large language model representations to human cortical semantic organization. According to the arXiv preprint and the CoNLL OpenReview entry, the authors use sparse autoencoders (SAEs) to decompose GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Let's Data Science.