WeSearch

Refusal Lives Downstream of Persona in Chat Models

·2 min read · 0 reactions · 0 comments · 6 views
Refusal Lives Downstream of Persona in Chat Models

arXiv:2606.26161v1 Announce Type: new Abstract: Linear directions in activation space have been identified for both refusal and persona traits in instruction-tuned chat models, but the two have been studied as separate mechanisms. We show they interact: a compliant persona gates refusal. In Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, we extract a compliant model-persona direction and a refusal direction and intervene on both. Compliant persona steering suppresses refusal -- in Llama, the refu

Original article
arXiv.org
Read full at arXiv.org →
Opening excerpt (first ~120 words) tap to expand

Computer Science > Artificial Intelligence arXiv:2606.26161 (cs) [Submitted on 24 Jun 2026] Title:Refusal Lives Downstream of Persona in Chat Models Authors:Viola Zhong, Qirui Li View a PDF of the paper titled Refusal Lives Downstream of Persona in Chat Models, by Viola Zhong and 1 other authors View PDF HTML (experimental) Abstract:Linear directions in activation space have been identified for both refusal and persona traits in instruction-tuned chat models, but the two have been studied as separate mechanisms. We show they interact: a compliant persona gates refusal. In Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, we extract a compliant model-persona direction and a refusal direction and intervene on both.

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv.org.

Anonymous · no account needed
Share 𝕏 Facebook Reddit LinkedIn Threads WhatsApp Bluesky Mastodon Email

Discussion

0 comments

More from arXiv.org