Search: "benchmark dataset"

11 stories match your query across our 700+ source catalog. Ranked by relevance and recency.


Saved 55% on Recommendation Costs: XGBoost 2.0 vs TensorFlow 2.15 for 1M User Datasets

When our team benchmarked XGBoost 2.0 and TensorFlow 2.15 on a 1 million user recommendation dataset,…

Tue, 28 Apr 2026 13:55:00 GMT · 2 views

MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation

The rapid proliferation of Generative AI necessitates rigorous documentation standards for transparency and governance. However, manual creation of Model and Data Cards is not scalable, while automate…

Tue, 28 Apr 2026 04:13:21 GMT · 3 views

ARXIV.ORG

CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation

The evaluation of generated reports remains a critical challenge in Computed Tomography (CT) report generation, due to the large volume of text, the diversity and complexity of findings, and the prese…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

Does Point Cloud Boost Spatial Reasoning of Large Language Models?

3D Large Language Models (LLMs) leveraging spatial information in point clouds for 3D spatial reasoning attract great attention. Despite some promising results, the role of point clouds in 3D spatial …

Tue, 28 Apr 2026 14:55:00 GMT · 3 views

ARXIV.ORG

FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agentic Code Generation in Lean

Formalising informal mathematical reasoning into formally verifiable code is a significant challenge for large language models. In scientific fields such as physics, domain-specific machinery (\textit…

Tue, 28 Apr 2026 04:13:21 GMT · 3 views

ARXIV.ORG

Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

LLM-as-a-Judge has become the dominant paradigm for evaluating language model outputs, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a comprehensive empir…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

AdaMamba: Adaptive Frequency-Gated Mamba for Long-Term Time Series Forecasting

Accurate long-term time series forecasting (LTSF) requires the capture of complex long-range dependencies and dynamic periodic patterns. Recent advances in frequency-domain analysis offer a global per…

Tue, 28 Apr 2026 04:13:21 GMT · 3 views

ARXIV.ORG

Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task

Large language models (LLMs) have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended legal reasoning in …

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

The Kerimov-Alekberli Model: An Information-Geometric Framework for Real-Time System Stability

This study introduces the Kerimov-Alekberli model, a novel information-geometric framework that redefines AI safety by formally linking non-equilibrium thermodynamics to stochastic control for the eth…

Tue, 28 Apr 2026 04:13:21 GMT · 3 views

ARXIV.ORG

STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such…

Tue, 28 Apr 2026 04:13:21 GMT · 3 views

Turbo-OCR Update: Layout Model + Multilingual

Follow-up to my post 18 days ago about the C++/CUDA OCR server. Two additions: What's New: Layout model: Added PP-StructureV3 for layout detection Multilingual: No longer Latin-only. Now supports Chin…

Mon, 27 Apr 2026 12:10:04 GMT · 5 views

