DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration
The paper 'DeskCraft' introduces a benchmark for evaluating desktop agents on professional workflows that require human collaboration. It addresses limitations of existing benchmarks by focusing on long-horizon tasks and proactive human-agent interaction. The study evaluates a range of agents and finds persistent challenges in carrying out complex workflows and clarifying tasks.
- DeskCraft targets long-horizon creative and engineering workflows that require human-in-the-loop collaboration.
- The benchmark organizes tasks into a multilevel difficulty taxonomy, with some tasks requiring more than 50 execution steps.
- In an evaluation of 18 agents, GPT-5.4 scored 31.6% on standard tasks and 27.6% on interactive tasks.
Opening excerpt (first ~120 words)
arXiv:2606.03103 (cs), Computer Science > Artificial Intelligence. Submitted on 2 Jun 2026.
Authors: Wenkai Wang, Tao Xiong, Jingchen Ni, Yunpeng Bao, Xiyun Li, Tianqi Liu, Hongcan Guo, Zilong Huang, Shengyu Zhang
Abstract: Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents proactively seek necessary information and users provide additional instructions,…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.