WeSearch

MarketBench: Evaluating AI Agents as Market Participants


Markets are a promising way to coordinate AI agent activity, for many of the same reasons markets are used to coordinate economic activity more broadly. To participate effectively in a market, an agent needs informative signals of its own probability of successfully completing a task and of the cost of doing so. We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities. As a demonstration, we run six recently released LLMs on a 93-task subset of SWE-bench Lite, a software engineering benchmark. These LLMs are miscalibrated on both success probability and token usage, and auctions built from their self-reports diverge from a full-information allocation. A follow-up intervention that adds information about each model's capabilities from prior experiments to the context improves calibration, but only modestly narrows the gap to the full-information benchmark. We also document the performance of a market-based scaffolding with these LLMs. Our results point to self-assessment as a key bottleneck for market-style coordination of AI agents.
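
The abstract implies a simple evaluation loop: collect each model's self-reported success probability and cost estimate per task, score how well calibrated those self-reports are against realized outcomes, and compare the resulting auction allocation with one built from full information. The Python sketch below illustrates that loop with a Brier score for calibration and a naive highest-reported-surplus allocation rule; the data structures, the allocation rule, and the example numbers are illustrative assumptions, not the authors' implementation or results.

```python
"""Minimal sketch of the kind of evaluation MarketBench describes.

Assumptions (not from the paper): the bid format, the allocation rule, and
the reward value below are illustrative only.
"""
from dataclasses import dataclass

@dataclass
class Bid:
    agent: str        # model name
    task: str         # e.g., a SWE-bench Lite instance id
    p_success: float  # self-reported probability of resolving the task
    cost: float       # self-reported cost estimate (e.g., expected token spend)

def brier_score(bids, outcomes):
    """Mean squared error between self-reported success probabilities and
    realized outcomes (1 = task resolved, 0 = not). Lower is better calibrated."""
    errs = [(b.p_success - outcomes[(b.agent, b.task)]) ** 2 for b in bids]
    return sum(errs) / len(errs)

def allocate(bids, reward):
    """Assign each task to the bidder with the highest reported surplus
    (expected reward minus reported cost). A full-information allocation
    would use empirical success rates and realized costs instead."""
    winners = {}
    for b in bids:
        surplus = b.p_success * reward - b.cost
        if b.task not in winners or surplus > winners[b.task][1]:
            winners[b.task] = (b.agent, surplus)
    return {task: agent for task, (agent, _) in winners.items()}

# Example usage with made-up numbers:
bids = [
    Bid("model_a", "task_1", 0.9, 0.10),
    Bid("model_b", "task_1", 0.6, 0.02),
]
outcomes = {("model_a", "task_1"): 0, ("model_b", "task_1"): 1}
print(brier_score(bids, outcomes))   # calibration of the self-reports
print(allocate(bids, reward=1.0))    # who wins each task under self-reports
```

Feeding the same `allocate` function empirical success rates and realized costs would give the full-information allocation; the divergence between the two assignments is the kind of gap the abstract refers to.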

Original article: arXiv.org

Full article excerpt:

Computer Science > Artificial Intelligence
arXiv:2604.23897 (cs) [Submitted on 26 Apr 2026]
Title: MarketBench: Evaluating AI Agents as Market Participants
Authors: Andrey Fradkin, Rohit Krishnan
Subjects: Artificial Intelligence (cs.AI); General Economics (econ.GN)
Cite as: arXiv:2604.23897 [cs.AI] (arXiv:2604.23897v1 for this version), https://doi.org/10.48550/arXiv.2604.23897
Submission history: [v1] Sun, 26 Apr 2026 21:48:01 UTC (30 KB), from Andrey Fradkin

This excerpt is published under fair use for community discussion. Read the full article at arXiv.org.
