10 results for "swe bench"
SWE-bench Verified no longer measures frontier coding capabilities
SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro. …
Confirmed: SWE Bench is now a benchmaxxed benchmark
IndustryAssetEQA: A Neurosymbolic Operational Intelligence System for Embodied Question Answering in Industrial Asset Maintenance
Industrial maintenance environments increasingly rely on AI systems to assist operators in understanding asset behavior, diagnosing failures, and evaluating interventions. Although large language mode…
MarketBench: Evaluating AI Agents as Market Participants
Markets are a promising way to coordinate AI agent activity, for reasons similar to those that justify markets more broadly. To participate effectively in markets, agents need to have infor…
CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation
The evaluation of generated reports remains a critical challenge in Computed Tomography (CT) report generation, due to the large volume of text, the diversity and complexity of findings, and the prese…
We benchmarked gpt-oss-120b across 6 inference providers and found a 10x throughput spread
We ran a benchmark across 10+ LLM routers, providers, and inference backends to answer the questions that come up every time someone picks a provider. Key findings: Do LLM routers add latency? No, Ope…
CAP-CoT: Cycle Adversarial Prompt for Improving Chain of Thoughts in LLM Reasoning
Chain-of-Thought (CoT) prompting has emerged as a simple and effective way to elicit step-by-step solutions from large language models (LLMs). However, CoT reasoning can be unstable across runs on lon…
Tandem: Riding Together with Large and Small Language Models for Efficient Reasoning
Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning-intensive inference paradigms, where models perform explicit step-by-step reasoning before generating final ans…
AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment
Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment. We introduce AgentPulse, a continuous evaluation framew…
The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications
Given the increasing use of LLMs in financial systems, it is important to evaluate the safety and robustness of such systems. One failure mode that LLMs frequently display in general domain …