26 results for "bench performance"
atomic_queue benchmarks SMT vs no-SMT performance
atomic_queue benchmarks SMT vs no-SMT performance
MarketBench: Evaluating AI Agents as Market Participants
Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly. In order to effectively participate in markets, agents need to have infor…
Benchmark: Windows 11 vs Lubuntu 26.04 on Llama.cpp (RTX 5080 + i9-14900KF). I didn't expect the gap to be this big.
UPDATE: Vulkan benches arew now included. And yes, I used AI to help me write this post. As a life-long Windows user (don't hate me, I was exposed to it at a young age) I was wondering how much (if an…
Performance Test: AWS Graviton4 Reduces EC2 Costs 40% vs. Intel Xeon 5th Gen
In a 12-week production benchmark across 14 workload types, AWS Graviton4-based EC2 instances...…
Xiaomi releases MiMo-v2.5 Family weights with strong coding and agent benchmarks
Peking University gives its computer science students a compiler project every semester. Build a complete SysY compiler in Rust including lexer, parser, abstract syntax tree, IR code generation, assem…
StratRAG: A Multi-Hop Retrieval Evaluation Dataset for Retrieval-Augmented Generation Systems
We introduce StratRAG, an open-source retrieval evaluation dataset for benchmarking Retrieval-Augmented Generation (RAG) systems on multi-hop reasoning tasks under realistic, noisy document-pool condi…
Quantifying Divergence in Inter-LLM Communication Through API Retrieval and Ranking
Large language models (LLMs) increasingly operate as autonomous agents that reason over external APIs to perform complex tasks. However, their reliability and agreement remain poorly characterized. We…
Learning in Blocks: A Multi Agent Debate Assisted Personalized Adaptive Learning Framework for Language Learning
Most digital language learning curricula rely on discrete-item quizzes that test recall rather than applied conversational proficiency. When progression is driven by quiz performance, learners can adv…
ParkingScenes: A Structured Dataset for End-to-End Autonomous Parking in Simulation Scenes
Autonomous parking remains a critical yet challenging task in intelligent driving systems, particularly within constrained urban environments where maneuvering space is limited and precise control is …
Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs
NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (…
Does Point Cloud Boost Spatial Reasoning of Large Language Models?
3D Large Language Models (LLMs) leveraging spatial information in point clouds for 3D spatial reasoning attract great attention. Despite some promising results, the role of point clouds in 3D spatial …
PExA: Parallel Exploration Agent for Complex Text-to-SQL
LLM-based agents for text-to-SQL often struggle with latency-performance trade-off, where performance improvements come at the cost of latency or vice versa. We reformulate text-to-SQL generation with…
Escher-Loop: Mutual Evolution by Closed-Loop Self-Referential Optimization
While recent autonomous agents demonstrate impressive capabilities, they predominantly rely on manually scripted workflows and handcrafted heuristics, inherently limiting their potential for open-ende…
Thinking Like a Clinician: A Cognitive AI Agent for Clinical Diagnosis via Panoramic Profiling and Adversarial Debate
The application of large language models (LLMs) in clinical decision support faces significant challenges of "tunnel vision" and diagnostic hallucinations present in their processing unstructured elec…
Tandem: Riding Together with Large and Small Language Models for Efficient Reasoning
Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning-intensive inference paradigms, where models perform explicit step-by-step reasoning before generating final ans…
Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task
Large language models (LLMs) have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended legal reasoning in …
QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems
We explore a central question in AI for mathematics: can AI systems produce original, nontrivial proofs for open research problems? Despite strong benchmark performance, producing genuinely novel proo…
AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment
Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment. We introduce AgentPulse, a continuous evaluation framew…
An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress
As large language models (LLMs) are increasingly deployed in high-stakes and operational settings, evaluation strategies based solely on aggregate accuracy are often insucient to characterize system r…
The Kerimov-Alekberli Model: An Information-Geometric Framework for Real-Time System Stability
This study introduces the Kerimov-Alekberli model, a novel information-geometric framework that redefines AI safety by formally linking non-equilibrium thermodynamics to stochastic control for the eth…
PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model
Vision-Language Models (VLMs) have demonstrated strong performance on textbook-style physics problems, yet they frequently fail when confronted with dynamic real-world scenarios that require temporal …
A systematic evaluation of vision-language models for observational astronomical reasoning tasks
Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, yet their reliability on real astronomical observations across diverse modalities r…
The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications
Given the increased use of LLMs in financial systems today, it becomes important to evaluate the safety and robustness of such systems. One failure mode that LLMs frequently display in general domain …
Brief Ngram-Mod Test Results - R9700/Qwen3.6 27B
Decided to try out the new --spec-type ngram-mod feature in llama.cpp using Qwen3.6 27B during an OpenCode bug chasing session. TLDR: Performance is variable, but so far it seems to provide a nice spe…
Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model
Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model Big claims from Qwen about their latest open weight model: Qwen3.6-27B delivers flagship-level agentic coding performance, surpassing the previo…