Furina: Fragmented Uncertainty-Driven Refusal Instability Attack
The paper 'Furina: Fragmented Uncertainty-Driven Refusal Instability Attack' examines safety alignment in large language models (LLMs) and multimodal large language models (MLLMs). It challenges the common assumption that safety alignment works as a near-binary threshold, showing instead that there is an instability region in which small perturbations make refusal decisions stochastic rather than deterministic. The authors introduce an attack method, Furina, that exploits this instability.
- The paper shows that safety behavior in large language models is governed by an instability region, where small perturbations lead to stochastic refusal decisions.
- Furina is a jailbreak attack that uses fragmented prompts to induce uncertainty in model responses.
- The research identifies a decoupling phenomenon in which unstable inputs show high output uncertainty but low internal safety activation.
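To make "stochastic refusal decisions" concrete, one generic way to score how unstable the refusal outcome is for a single prompt is to sample the model several times, record whether each response was a refusal, and compute the binary entropy of the empirical refusal rate. The sketch below is an illustrative assumption and not the metric used in the paper; the function names are hypothetical, and the input is simply a list of already-labeled refuse/comply outcomes.

```python
import math
from typing import Sequence


def refusal_rate(refused: Sequence[bool]) -> float:
    """Fraction of sampled responses that were refusals."""
    if not refused:
        raise ValueError("need at least one sampled decision")
    return sum(refused) / len(refused)


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def refusal_instability(refused: Sequence[bool]) -> float:
    """Hypothetical instability score: entropy of the empirical refusal rate.

    0.0 means the decision was the same on every sample (deterministic);
    1.0 means refusals and compliances were equally frequent.
    """
    return binary_entropy(refusal_rate(refused))
```

Under this assumed score, a prompt that is always refused or never refused scores 0, while a prompt inside an instability region would score closer to 1.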
Opening excerpt (first ~120 words)
Computer Science > Cryptography and Security
arXiv:2605.26158 (cs)
[Submitted on 24 May 2026]
Title: Furina: Fragmented Uncertainty-Driven Refusal Instability Attack
Authors: Tongxi Wu, Jian Zhang, Yang Gao
Abstract: Safety alignment in large language models (LLMs) and multimodal large language models (MLLMs) is commonly assumed to operate as a near-binary threshold mechanism. We challenge this assumption by revealing that safety behavior is governed by an instability region where small perturbations induce stochastic refusal decisions rather than deterministic outcomes.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.