60 stories tagged with #evaluation, in publish-time order across the WeSearch catalog. Tag pages update as new stories ingest.
Braves, Strider await doc's evaluation of elbow
CBSE drops Coempt's portal for re-evaluation over ‘security concerns’ to use its own
When asked specifically about the security concerns that prompted the platform switch, CBSE did not confirm or deny the reason. | India News…
OpenAI diverges from Trump's AI EO in a new policy paper, proposing cyber risk evaluations for advanced AI systems be mandatory and led by CAISI, not the NSA (Brendan Bordelon/Politico)
Brendan Bordelon / Politico : OpenAI diverges from Trump's AI EO in a new policy paper, proposing cyber risk evaluations for advanced AI systems be mandatory and led by CAISI, not …
JEDEC® Releases New SiC Guidelines to Improve Reliability and Evaluation in Power Electronics - Morningstar
Trump's AI Evaluations Order: Right Policy, Unfinished Governance
Six Months of AI-Assisted Software Development: A Critical Evaluation of Vibe Coding, Agentic IDEs, and Real Engineering
CBSE says 40,000 students have completed re-evaluation process via portal without issues so far
CBSE reports 40,000 students successfully completed the re-evaluation process, with various online payment options available.…
Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation
As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scala…
What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents
Benchmarks for autonomous agents measure whether agents complete tasks, yet this framing is systematically blind to whether an agent should have proceeded at all. Agents trained un…
TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment
LLMs have evolved from basic chatbots to the backbone of the AI ecosystem, now widely used in healthcare, schools, and government services. The domain-wide adoption of LLMs necessi…
Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation
LLM-based multi-agent simulation offers a promising way to study social interaction, deliberation, and collective opinion dynamics. However, many existing dialogue simulation frame…
GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory
Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly unde…
SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems
Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly …
From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models
Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This masks a critical failure mode: a model may o…
CBSE says re-evaluation portal targeted by cyberattack from ‘malicious actors’
The portal was originally scheduled to launch on May 29; the rollout was postponed to June 1, missed that deadline as well, and finally went live around 4.30 am on June 2. | India News…
CBSE OSM row: Centre replaces chairman, secretary; orders probe into exam evaluation system
The moves came after the intervention of Prime Minister Narendra Modi, people familiar with the development said. | India News…
Officials blame cyberattack for CBSE revaluation portal glitches
CBSE's revaluation portal faced a cyberattack, disrupting payments for 50 students and deferring the re-evaluation process until June 1. | India News…
Hacking India's Largest Exam Evaluation Portal: From Authentication Bypass to Full Account Takeover (Covered by BBC)
Nexa-gauge – LLM evaluation framework with per-node scoring controls
Overview of nexa-gauge documentation…
Halfway through an LLM gateway evaluation and the criteria I started with were wrong
Centre planning to expand audit of CBSE's on-screen marking system amid concerns
The move follows anger among parents and teachers, and a series of HT reports covering what appears to be a rushed process to roll out an entirely new mechanism | India News…
CBSE portal for re-evaluation of Class 12 answer papers to be operational from June 1
CBSE re-evaluation portal for Class 12 answer papers will launch on June 1, ensuring transparency and high evaluation standards.…
CBSE Class 12 verification, re-evaluation portal to open on June 1 amid glitch concerns
The Central Board of Secondary Education (CBSE) will open its post-result verification and re-evaluation portal for Class 12 students on June 1, 2026, amid mounting concerns over t…
CBSE postpones Class 12 re-evaluation portal launch to June 1, says website being strengthened
OSM row: A CBSE official told Hindustan Times that the portal would not open on Friday, May 29, as earlier expected. | India News…
Diverging AI safety approaches: OpenAI enters Japanese banking defenses, while Anthropic’s model remains restricted to controlled evaluations - Moomoo
Would a learning-first AI evaluation platform be useful?
Managed expert-data loops for post-training, eval repair, annotation design, and measurable lift.…
Multi-turn jailbreak rates across 15 frontier models (Grok 88%, Claude 12%)
The dominant safety benchmarks for frontier large language models share a structural assumption: that a single prompt and a single model response are enough to characterize how a m…
Expert IIT team to submit report after ‘full check-up’ of CBSE’s tech ecosystem
IIT experts to assess CBSE's tech issues, offering solutions for evaluation discrepancies and improving the IT ecosystem.…
Video Quality Evaluation Methodology and Result of AV2 Compression Performance
The Alliance for Open Media (AOMedia) has developed the AV2 video coding standard to supersede AV1, aiming for substantial compression efficiency gains across diverse media applica…
Even (very) noisy LLM evaluators are useful for improving AI agents
AI 3D tools need product evals, not benchmark faith
If you’re building AI-assisted 3D or CAD-like workflows, benchmark scores only get you so far. The real work is designing evals around your product contract and catching geometry f…
Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning
Legal reasoning requires distinguishing changes that matter from those that do not. Legal AI should remain stable under legally irrelevant perturbations, but should change when per…
MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation
Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limitin…
PitchBench: Measuring Pitch Hearing in Audio-Language Models
Audio-language models (ALMs) are increasingly used in real-world applications that require understanding music, from music tutoring and transcription to captioning, recommendation …
CBSE 'ignored' calls for regional trials before OSM rollout for Class 12 board exam evaluation
Evaluators are now saying the OSM introduced a completely alien workflow, produced shoddy answer-script scans and recorded marks incorrectly. | India News…
Why We Need Behavioral Benchmarks for LLMs — Not Just More Knowledge Tests
Would you hire an engineer based on their SAT score? Of course not. You look at how they solve...…
Prompter – Compare and benchmark Ollama models side-by-side in your terminal
Terminal-based multi-model comparison, benchmarking, and evaluation tool for Ollama. Zero dependencies, one file. - whonixnetworks/prompter…
What is OSM? CBSE evaluation system for Class 12 that got Delhi student 'Pakistani' label
Concerns over CBSE's On-Screen Marking (OSM) for Class 12 stem from allegations by students of mismatch in the answer sheets uploaded by the board. | India News…
Stop Comparing LLM Agents Without Disclosing the Harness
This position paper argues that, for long-horizon tasks evaluated across models with comparable frontier capability, the agent execution harness, namely the infrastructure layer th…
MAPLE: Multi-State Aggregated Policy Evaluation for AlphaZero in Imperfect-Information Games
Imperfect-information games (IIGs) are challenging, as players must make decisions without fully observing the true game state. While AlphaZero has achieved remarkable success in p…
Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks
As Large Language Models (LLMs) transition from research environments to production deployments, evaluating their performance against strict Service Level Objectives (SLOs) has bec…
How Well Do Models Follow Their Constitutions?
Frontier AI developers now train models against long written behavioral specifications, such as Anthropic's constitution (Anthropic, 2025a) and OpenAI's Model Spec (OpenAI, 2025a),…
AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models
Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related scenarios involving speech and interactio…
Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework
LLMs have achieved remarkable success in complex reasoning tasks, yet current evaluation approaches predominantly rely on final-answer correctness, offering limited insight into th…
Why prompt debt, retrieval debt, and evaluation debt are quietly reshaping enterprise AI risk
AI Evaluation Is Biased – By Design
The structural reason teams build false confidence in their AI systems…
Technical glitches in CBSE portal leave students seeking re-evaluation stranded
Technical glitches in the CBSE portal leave students stranded, prompting government intervention for urgent resolution and accountability.…
My LLM optimization loop reward-hacked its own benchmark (and other lessons) [pdf]
Contribute to CodeReclaimers/bishop-loop-experiment-3 development by creating an account on GitHub.…
Rahul Gandhi hits out at Centre over CBSE row: ‘Modi government fears the youth and Gen Z’
In a social media post, Gandhi linked the CBSE Class 12 evaluation row to what he called a broader pattern of suppressing dissent under the Modi government. | India News…
Per-Turn Evaluation: Dynamic Governance for AI Agents
Per-turn evaluation gives AI agents dynamic governance by re-evaluating rules, tools, and context from live state instead of startup config.…
Prompt Engineering for Automated Evaluation: Making LLMs the Judge in AI Builder Solutions
Intro: Automated evaluation is fast becoming a necessity as AI-driven agents proliferate across...…
Decomposing and Measuring Evaluation Awareness
Frontier language models sometimes recognize that they are being evaluated and adjust their behavior, undermining validity of benchmark results. Yet the field studies it without a …
EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation
The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To achieve such demanding quality, the community tr…
Urgent Evaluation Needed
After IITs, PSU banks also roped in for smooth CBSE re-evaluation process
CBSE conducted Class 12 board examinations from February 17 to April 10 and announced the results on May 13 | India News…
IIT Kanpur, Madras teams to assist CBSE for ‘glitch-free’ re-evaluation process
Dharmendra Pradhan’s directions came in view of the recent developments and concerns raised by students, parents regarding the CBSE post-result services portal | India News…
Kerala seeks urgent intervention over CBSE Plus Two evaluation complaints
Kerala government urges CBSE intervention over Class XII evaluation complaints, citing concerns about low marks and revaluation delays.…
Revaluation process a hassle for Class 12 CBSE students in Chennai
Class 12 CBSE students face challenges in re-evaluation, including high fees, technical issues, and delays in receiving answer sheets.…
Evaluation & Benchmark Results
Multimodal Gemma 4 Visual Regression & Patch Agent. Tags: devchallenge, gemmachallenge, gemma, ai. Gemma...…