Claude, GPT, Gemini Agents Fail 72% of U.S. Healthcare Workflows
A new benchmark study, CHI-Bench, finds that AI agents, including Claude, GPT, and Gemini, fail to complete at least 72% of U.S. healthcare workflows. The study tested 30 AI agents across 75 clinical workflows, and even the best performer passed only 28% of them. Despite claims that such agents are ready for long workflows, they struggled with real clinical cases, raising concerns about their reliability in healthcare settings.
- The CHI-Bench study evaluated 30 frontier AI agents across 75 healthcare workflows.
- The best-performing agent, Anthropic's Claude Code, achieved a pass rate of only 28%.
- No agent successfully completed tasks in a fully end-to-end setting, indicating major reliability issues.
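
The headline figure appears to be the complement of the best agent's pass rate (100% minus 28%). The short Python sketch below illustrates that arithmetic only; the function name `failure_rate` is hypothetical, and the assumption that the 72% figure was derived this way is an inference from the numbers above, not something the study is quoted as stating.

```python
def failure_rate(pass_rate_pct: float) -> float:
    """Return the failure rate (in percent) as the complement of a pass rate."""
    if not 0 <= pass_rate_pct <= 100:
        raise ValueError("pass rate must be between 0 and 100")
    return 100 - pass_rate_pct


# 28% is the pass rate reported for the best-performing agent.
best_pass_rate = 28
print(failure_rate(best_pass_rate))  # prints 72
```

Because 28% is the best reported pass rate, every other agent in the study would fail at least as large a share of workflows under this reading.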
Opening excerpt (first ~120 words)
Claude, GPT, Gemini Agents Fail 72% of U.S. Healthcare Workflows, New Benchmark Finds
1 of 2 | CHI-Bench Engine: Simulated Worlds for Clinical Healthcare In-Situ Workflows.
2 of 2 | CHI-Bench results across agent harnesses and frontier models.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at AP News.