#interpretability — Tagged Stories

Every story in the WeSearch catalog tagged with #interpretability (4 at present), listed in publish-time order with view counts. Tag pages update as new stories are ingested. Subscribe to the per-tag RSS feed to follow this topic in your reader of choice.


RELATED TAGS

#ai (2) · #ai-research (1) · #llm-steering (1) · #local-models (1) · #ml (1) · #model-interpretability (1) · #technology (1) · #language-models (1) · #artificialintelligence (1) · #chatmodels (1) · #machinelearning (1) · #viola-zhong (1)

ARXIV.ORG

Refusal Lives Downstream of Persona in Chat Models

arXiv:2606.26161v1. Linear directions in activation space have been identified for both refusal and persona traits in instruction-tuned chat models, but…

30 views · Fri, 26 Jun 2026 05:20:40 GMT

#artificialintelligence #chatmodels #machinelearning

ARXIV.ORG

How sure is the activation oracle?

Activation oracles aim to make the activations of other models legible to humans and yield promising results compared to white-box interpretability techniques. However, uncertainty…

22 views · Wed, 27 May 2026 08:07:57 GMT

#artificial intelligence #language models

OUTCRY AI

AI Interpretability Is a Revolutionary Skill

A language model has roughly sixty-five thousand internal concepts. None of them is the word your movement uses. Here is where the missing word actually lives — and how to put it t…

22 views · Mon, 25 May 2026 03:47:35 GMT

#artificial intelligence #technology

SEANGOEDECKE