WeSearch
TAG · #EVALUATION

Evaluation coverage.

All 60 stories in the WeSearch catalog tagged with #evaluation, listed in publish-time order with view counts. Tag pages update as new stories are ingested. Subscribe to the per-tag RSS feed to follow this topic in your reader of choice.

RELATED TAGS
#ai (44) · #ml (15) · #technology (10) · #education (9) · #cbse (4) · #model-evaluation (4) · #bias (3) · #programming (2) · #frameworks (2) · #benchmarking (2) · #video-generation (2) · #research (2)
ESPN — TOP

Braves, Strider await doc's evaluation of elbow

5 views ·
HINDUSTAN TIMES — TOP

CBSE drops Coempt's portal for re-evaluation over ‘security concerns’ to use its own

When asked specifically about the security concerns that prompted the platform switch, CBSE did not confirm or deny the reason. | India News…

48 views ·
#education#cybersecurity#re-evaluation
TECHMEME

OpenAI diverges from Trump's AI EO in a new policy paper, proposing cyber risk evaluations for advanced AI systems be mandatory and led by CAISI, not the NSA (Brendan Bordelon/Politico)

23 views ·
GOOGLE NEWS

JEDEC® Releases New SiC Guidelines to Improve Reliability and Evaluation in Power Electronics - Morningstar

20 views ·
R/ARTIFICIAL

Trump's AI Evaluations Order: Right Policy, Unfinished Governance

23 views ·
DEV.TO (TOP)

Six Months of AI-Assisted Software Development: A Critical Evaluation of Vibe Coding, Agentic IDEs, and Real Engineering

14 views ·
#ai#software development#engineering
THE HINDU — TOP

CBSE says 40,000 students have completed re-evaluation process via portal without issues so far

CBSE reports 40,000 students successfully completed the re-evaluation process, with various online payment options available.…

18 views ·
#education#exams#students
ARXIV.ORG

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scala…

26 views ·
#machine learning#language models
ARXIV CS.AI

What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

Benchmarks for autonomous agents measure whether agents complete tasks, yet this framing is systematically blind to whether an agent should have proceeded at all. Agents trained un…

19 views ·
#artificial intelligence#autonomous agents
ARXIV CS.AI

TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment

LLMs have evolved from basic chatbots to the backbone of the AI ecosystem, now widely used in healthcare, schools, and government services. The domain-wide adoption of LLMs necessi…

20 views ·
#artificial intelligence#llm
ARXIV CS.AI

Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

LLM-based multi-agent simulation offers a promising way to study social interaction, deliberation, and collective opinion dynamics. However, many existing dialogue simulation frame…

23 views ·
#artificial intelligence#social simulation#climate policy
ARXIV CS.AI

GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly unde…

15 views ·
#artificial intelligence#graph theory#education
ARXIV CS.AI

SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly …

13 views ·
#artificial intelligence#machine learning#evolutionary algorithms
ARXIV CS.AI

From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This masks a critical failure mode: a model may o…

19 views ·
#artificial intelligence#chemistry#machine learning
HINDUSTAN TIMES — TOP

CBSE says re-evaluation portal targeted by cyberattack from ‘malicious actors’

The portal was originally scheduled for May 29; the rollout was postponed to June 1, missed that deadline as well, and finally went live around 4.30 am on June 2. | India News…

24 views ·
#education#cybersecurity#technology
HINDUSTAN TIMES — TOP

CBSE OSM row: Centre replaces chairman, secretary; orders probe into exam evaluation system

The moves came after the intervention of Prime Minister Narendra Modi, people familiar with the development said. | India News…

17 views ·
#education#government#investigation
HINDUSTAN TIMES — TOP

Officials blame cyberattack for CBSE revaluation portal glitches

CBSE's revaluation portal faced a cyberattack, disrupting payments for 50 students and deferring the re-evaluation process until June 1. | India News…

22 views ·
#education#cybersecurity#technology
R/CYBERSECURITY

Hacking India's Largest Exam Evaluation Portal: From Authentication Bypass to Full Account Takeover (Covered by BBC)

29 views ·
R/CYBERSECURITY

Hacking India's Largest Evaluation Portal: From Authentication Bypass to Full Account Takeover

17 views ·
HARNEXA.DEV

Nexa-gauge – LLM evaluation framework with per-node scoring controls

Overview of nexa-gauge documentation…

13 views ·
#technology#artificial intelligence
R/EXPERIENCEDDEVS

Halfway through an LLM gateway evaluation and the criteria I started with were wrong

14 views ·
HINDUSTAN TIMES — TOP

Centre planning to expand audit of CBSE's on-screen marking system amid concerns

The move follows anger among parents and teachers, and a series of HT reports covering what appears to be a rushed process to roll out an entirely new mechanism | India News…

18 views ·
#education#cbse#audit
THE HINDU — TOP

CBSE portal for re-evaluation of Class 12 answer papers to be operational from June 1

CBSE re-evaluation portal for Class 12 answer papers will launch on June 1, ensuring transparency and high evaluation standards.…

15 views ·
#education#exams
TIMES OF INDIA — TOP

CBSE Class 12 verification, re-evaluation portal to open on June 1 amid glitch concerns

The Central Board of Secondary Education (CBSE) will open its post-result verification and re-evaluation portal for Class 12 students on June 1, 2026, amid mounting concerns over t…

22 views ·
HINDUSTAN TIMES — TOP

CBSE postpones Class 12 re-evaluation portal launch to June 1, says website being strengthened

OSM row: A CBSE official told Hindustan Times that the portal would not open on Friday, May 29, as earlier expected. | India News…

14 views ·
#education#cbse#exams
GOOGLE NEWS

Diverging AI safety approaches: OpenAI enters Japanese banking defenses, while Anthropic’s model remains restricted to controlled evaluations - Moomoo

19 views ·
ALPHASET

Would a learning-first AI evaluation platform be useful?

Managed expert-data loops for post-training, eval repair, annotation design, and measurable lift.…

14 views ·
#ai#technology#data
CISCO BLOGS

Multi-turn jailbreak rates across 15 frontier models (Grok 88%, Claude 12%)

The dominant safety benchmarks for frontier large language models share a structural assumption: that a single prompt and a single model response are enough to characterize how a m…

17 views ·
#artificial intelligence#security#model evaluation
THE HINDU — TOP

Expert IIT team to submit report after ‘full check-up’ of CBSE’s tech ecosystem

IIT experts to assess CBSE's tech issues, offering solutions for evaluation discrepancies and improving the IT ecosystem.…

15 views ·
#education#technology
ARXIV.ORG

Video Quality Evaluation Methodology and Result of AV2 Compression Performance

The Alliance for Open Media (AOMedia) has developed the AV2 video coding standard to supersede AV1, aiming for substantial compression efficiency gains across diverse media applica…

12 views ·
#video#compression#technology
TENSORZERO

Even (very) noisy LLM evaluators are useful for improving AI agents

12 views ·
#ai#technology
DEV.TO (TOP)

AI 3D tools need product evals, not benchmark faith

If you’re building AI-assisted 3D or CAD-like workflows, benchmark scores only get you so far. The real work is designing evals around your product contract and catching geometry f…

20 views ·
#ai#3d
ARXIV CS.AI

Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning

Legal reasoning requires distinguishing changes that matter from those that do not. Legal AI should remain stable under legally irrelevant perturbations, but should change when per…

22 views ·
#artificial intelligence#legal
ARXIV CS.AI

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limitin…

15 views ·
#artificial intelligence#machine learning#multiagent systems
ARXIV CS.AI

PitchBench: Measuring Pitch Hearing in Audio-Language Models

Audio-language models (ALMs) are increasingly used in real-world applications that require understanding music, from music tutoring and transcription to captioning, recommendation …

21 views ·
#audio#artificial intelligence#music
HINDUSTAN TIMES — TOP

CBSE 'ignored' calls for regional trials before OSM rollout for Class 12 board exam evaluation

Evaluators are now saying the OSM introduced a completely alien workflow, produced shoddy answer-script scans and recorded marks incorrectly. | India News…

22 views ·
#education#exams#technology
DEV.TO (TOP)

Why We Need Behavioral Benchmarks for LLMs — Not Just More Knowledge Tests

Would you hire an engineer based on their SAT score? Of course not. You look at how they solve…

18 views ·
#ai#programming
GITHUB

Prompter – Compare and benchmark Ollama models side-by-side in your terminal

Terminal-based multi-model comparison, benchmarking, and evaluation tool for Ollama. Zero dependencies, one file. - whonixnetworks/prompter…

21 views ·
#technology#software
HINDUSTAN TIMES — TOP

What is OSM? CBSE evaluation system for Class 12 that got Delhi student 'Pakistani' label

Concerns over CBSE's On-Screen Marking (OSM) for Class 12 stem from allegations by students of mismatch in the answer sheets uploaded by the board. | India News…

22 views ·
#education#technology#social media
ARXIV CS.AI

Stop Comparing LLM Agents Without Disclosing the Harness

This position paper argues that, for long-horizon tasks evaluated across models with comparable frontier capability, the agent execution harness, namely the infrastructure layer th…

26 views ·
#artificial intelligence#machine learning
ARXIV CS.AI

MAPLE: Multi-State Aggregated Policy Evaluation for AlphaZero in Imperfect-Information Games

Imperfect-information games (IIGs) are challenging, as players must make decisions without fully observing the true game state. While AlphaZero has achieved remarkable success in p…

13 views ·
#artificial intelligence#machine learning#games
ARXIV CS.AI

Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks

As Large Language Models (LLMs) transition from research environments to production deployments, evaluating their performance against strict Service Level Objectives (SLOs) has bec…

25 views ·
#artificial intelligence#machine learning#performance evaluation
ARXIV CS.AI

How Well Do Models Follow Their Constitutions?

Frontier AI developers now train models against long written behavioral specifications, such as Anthropic's constitution (Anthropic, 2025a) and OpenAI's Model Spec (OpenAI, 2025a),…

16 views ·
#artificial intelligence#model evaluation#behavioral specifications
ARXIV CS.AI

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related scenarios involving speech and interactio…

16 views ·
#artificial intelligence#audio-video
ARXIV CS.AI

Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

LLMs have achieved remarkable success in complex reasoning tasks, yet current evaluation approaches predominantly rely on final-answer correctness, offering limited insight into th…

15 views ·
#artificial intelligence#machine learning
VENTUREBEAT

Why prompt debt, retrieval debt, and evaluation debt are quietly reshaping enterprise AI risk

20 views ·
HACKER NEWS (AI / LLM)

AI Evaluation Is Biased – By Design

The structural reason teams build false confidence in their AI systems…

19 views ·
#ai#bias
THE HINDU — TOP

Technical glitches in CBSE portal leave students seeking re-evaluation stranded

Technical glitches in the CBSE portal leave students stranded, prompting government intervention for urgent resolution and accountability.…

16 views ·
#education#technology#exams
GITHUB

My LLM optimization loop reward-hacked its own benchmark (and other lessons) [pdf]

Contribute to CodeReclaimers/bishop-loop-experiment-3 development by creating an account on GitHub.…

19 views ·
#artificial intelligence#machine learning
HINDUSTAN TIMES — TOP

Rahul Gandhi hits out at Centre over CBSE row: ‘Modi government fears the youth and Gen Z’

In a social media post, Gandhi linked the CBSE Class 12 evaluation row to what he called a broader pattern of suppressing dissent under the Modi government. | India News…

23 views ·
#education#politics#youth
DEV.TO (TOP)

Per-Turn Evaluation: Dynamic Governance for AI Agents

Per-turn evaluation gives AI agents dynamic governance by re-evaluating rules, tools, and context from live state instead of startup config.…

11 views ·
#ai#governance#technology
DEV.TO (TOP)

Prompt Engineering for Automated Evaluation: Making LLMs the Judge in AI Builder Solutions

Intro: Automated evaluation is fast becoming a necessity as AI-driven agents proliferate across…

15 views ·
#ai#llm
ARXIV CS.AI

Decomposing and Measuring Evaluation Awareness

Frontier language models sometimes recognize that they are being evaluated and adjust their behavior, undermining validity of benchmark results. Yet the field studies it without a …

15 views ·
#machine learning#artificial intelligence
ARXIV CS.AI

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To achieve such demanding quality, the community tr…

14 views ·
#video generation#artificial intelligence
YCOMBINATOR

Urgent Evaluation Needed

11 views ·
HINDUSTAN TIMES — TOP

After IITs, PSU banks also roped in for smooth CBSE re-evaluation process

CBSE conducted Class 12 board examinations from February 17 to April 10 and announced the results on May 13 | India News…

23 views ·
#education#technology#students
HINDUSTAN TIMES — TOP

IIT Kanpur, Madras teams to assist CBSE for ‘glitch-free’ re-evaluation process

Dharmendra Pradhan’s directions came in view of the recent developments and concerns raised by students, parents regarding the CBSE post-result services portal | India News…

31 views ·
#education#technology#exams
THE HINDU — TOP

Kerala seeks urgent intervention over CBSE Plus Two evaluation complaints

Kerala government urges CBSE intervention over Class XII evaluation complaints, citing concerns about low marks and revaluation delays.…

19 views ·
#education#kerala#exams
THE HINDU — TOP

Revaluation process a hassle for Class 12 CBSE students in Chennai

Class 12 CBSE students face challenges in re-evaluation, including high fees, technical issues, and delays in receiving answer sheets.…

19 views ·
#education#exams#technology
DEV.TO (TOP)

Evaluation & Benchmark Results

Multimodal Gemma 4 Visual Regression & Patch Agent. Tags: devchallenge, gemmachallenge, gemma, ai. Gemma…

11 views ·
#ai#development#technology