PANDO: Efficient Multimodal AI Agents via Online Skill Distillation
The paper introduces PANDO, a framework that makes multimodal web agents more efficient through online skill distillation. It analyzes sources of inefficiency in existing agents and proposes changes that reduce token usage while improving success rates. PANDO reaches a 58.3% success rate on 910 VisualWebArena tasks, outperforming previous models.
- PANDO achieves a 58.3% success rate on 910 VisualWebArena tasks, outperforming previous models.
- The framework uses 58% fewer tokens than SGV and 61% fewer than WALT, indicating improved efficiency.
- Three trajectory-level efficiency metrics are introduced to assess performance beyond just success rates.
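A "58% fewer tokens" figure is a relative reduction against a baseline. The sketch below shows only that arithmetic. The function name and the token counts are hypothetical and chosen for illustration; they are not taken from the paper, which may measure token usage differently.

```python
def relative_token_reduction(baseline_tokens: int, method_tokens: int) -> float:
    """Return the fraction of tokens saved relative to a baseline.

    Hypothetical helper for illustration; not part of PANDO.
    """
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - method_tokens / baseline_tokens


# Made-up token counts, used only to illustrate the calculation.
baseline = 100_000  # tokens used by some baseline agent
method = 42_000     # tokens used by the more efficient agent

reduction = relative_token_reduction(baseline, method)
print(f"{reduction:.0%} fewer tokens")
```

With these made-up counts, the method uses 42% of the baseline's tokens, which is reported as a 58% reduction.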
Opening excerpt (first ~120 words)
arXiv:2605.24785 (cs), Computer Science > Artificial Intelligence. Submitted on 24 May 2026.
Title: PANDO: Efficient Multimodal AI Agents via Online Skill Distillation
Authors: Yubo Li, Yidi Miao, Haotian Shen, Yuxin Liu
Abstract: Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.