6 stories tagged with #model-evaluation, in publish-time order across the WeSearch catalog. Tag pages update as new stories are ingested.
Multi-turn jailbreak rates across 15 frontier models (Grok 88%, Claude 12%)
The dominant safety benchmarks for frontier large language models share a structural assumption: that a single prompt and a single model response are enough to characterize how a m…
How Well Do Models Follow Their Constitutions?
Frontier AI developers now train models against long written behavioral specifications, such as Anthropic's constitution (Anthropic, 2025a) and OpenAI's Model Spec (OpenAI, 2025a),…
Building an AI Model Evaluation Pipeline on AWS for Audio Content Generation
Executive Summary: A European digital media publisher needed to determine which foundation…
Building a Serverless AI Model Evaluation Platform on AWS
The Problem: A media company needed to evaluate which AI model produces the best…
The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next
Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases -- and at the frontier, this interaction is the …
Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels
Large Language Models are routinely compressed via post-training quantization to reduce inference costs and memory footprint for cloud and edge deployment, yet the impact of this c…