DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

Jun 3, 2026 · 4:00 AM UTC · 3 min read

#artificial intelligence #desktop agents #human collaboration

TL;DR · WeSearch summary

The paper 'DeskCraft' introduces a benchmark for evaluating desktop agents on professional workflows that require human collaboration. It addresses the limitations of existing benchmarks by focusing on long-horizon tasks and proactive human-agent interaction. The study evaluates 18 agents and highlights persistent challenges in completing complex workflows and clarifying tasks.

Key facts

▪ DeskCraft targets long-horizon creative and engineering workflows that require human-in-the-loop collaboration.
▪ The benchmark organizes tasks into a multilevel difficulty taxonomy, with some tasks requiring over 50 execution steps.
▪ In an evaluation of 18 agents, GPT-5.4 scored 31.6% on standard tasks and 27.6% on interactive tasks.

Original article

arXiv cs.AI


Opening excerpt (first ~120 words)

arXiv:2606.03103 (cs)
[Submitted on 2 Jun 2026]

Title: DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

Authors: Wenkai Wang, Tao Xiong, Jingchen Ni, Yunpeng Bao, Xiyun Li, Tianqi Liu, Hongcan Guo, Zilong Huang, Shengyu Zhang

Abstract: Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents proactively seek necessary information and users provide additional instructions,…

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
