Product Evals in Three Simple Steps
Label some data, align LLM-evaluators, and run the eval harness with each change.
After repeating myself for the $n^\text{th}$ time on how to build product evals, I figured I should write it down. There are three basic steps: (i) labeling a small dataset, (ii) aligning our LLM-evaluators, and (iii) running the experiment + evaluation harness with each config change.

First, label some data

It begins with sampling some input and output from our LLM requests, and labeling whether the output meets our evaluation criteria (e.g., faithfulness, relevance, etc.). Start simple with a spreadsheet that has columns for input, output, additional metadata that helps evaluate the output, and a new column for the label.

Focus on binary pass/fail or win/lose labels. If the criteria are objective—such as whether a summary is faithful to the source, or contains a refusal—use pass/fail labels. For subjective criteria, such as whether one summary is more concise than another, use win/lose/tie comparisons. For the latter, it helps to allow annotators to indicate ties. Forcing them to pick a winner when two outputs are nearly identical introduces noise and prevents us from learning that some differences are negligible.

What about numeric labels or Likert scales? While 1-5 scales offer granularity, I’ve found it challenging to calibrate human annotators and LLM-evaluators. The difference between a “3” and a “4” is often subtle. Even with detailed labeling rubrics, different human annotators will return different labels. And if it’s a challenge for human annotators to label consistently against a rubric, it will be a challenge for LLM-evaluators too. Binary labels mitigate this issue by forcing a clear decision boundary.

Furthermore, while stakeholders sometimes ask for granular scores so they have flexibility to adjust the thresholds for what counts as a pass later (e.g., moving from 3 to 4, or from minor error to no error), in my experience, exactly zero of them actually do this. They eventually just ask for a recommended threshold so they can report the pass/fail rate. If this is where we’ll end up anyway, it’s simpler to start with binary labels. It leads to faster and more consistent labels from human annotators, and makes it easier to align our LLM-evaluators.

Aim for 50-100 fail cases. This depends on the total number of labels, and more importantly, the number of labels we actually care about. For pass/fail evaluations, most of the time, what matters is the “fails” as these are the trust-busting defects. A dataset with hundreds of labels but only five failures isn’t useful to align and evaluate our evaluators on. We need a balanced dataset. I usually recommend having at least 50-100 failures out of 200+ total samples.

How to get fail cases? I’ve found success using smaller, less capable models to generate outputs. Even when trying their hardest, these models naturally produce “organic” failures. They might struggle with long context, have insufficient reasoning ability, or fail on edge cases—these are the types of failures we’ll encounter in production.

A popular approach is to prompt a strong model to generate synthetic defects. I find these synthetic defects problematic. They tend to be out-of-distribution, either too exaggerated or too subtle in ways that don’t reflect what happens in production. When we align evaluators on these, they may fail to detect the messy, organic issues that actually affect our users. While I…
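To make the labeling step concrete, here is a minimal sketch of the spreadsheet-as-dataset and the balance check described above. The column names, the file name `labels.csv`, and the `check_label_balance` helper are illustrative assumptions, not tooling from the article; only the schema (input, output, metadata, binary label) and the 50-100 fails out of 200+ samples guideline come from the text.

```python
import pandas as pd

# Hypothetical file holding the labeled samples. Columns mirror the
# spreadsheet described above: input, output, metadata that helps judge
# the output, and a binary pass/fail label.
df = pd.read_csv("labels.csv")  # columns: input, output, metadata, label

# Labels are kept binary to force a clear decision boundary.
assert set(df["label"].unique()) <= {"pass", "fail"}


def check_label_balance(df: pd.DataFrame, min_fails: int = 50, min_total: int = 200) -> None:
    """Warn if the dataset is too small or too imbalanced to align an LLM-evaluator on."""
    n_total = len(df)
    n_fails = int((df["label"] == "fail").sum())
    print(f"{n_fails} fails out of {n_total} samples")
    if n_fails < min_fails or n_total < min_total:
        print("Label more data: aim for at least 50-100 fails out of 200+ samples.")


check_label_balance(df)
```

A dataset that passes this check still needs the fails to be organic (e.g., sampled from weaker models or production traffic) rather than synthetic, for the reasons discussed above.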
This excerpt is published under fair use for community discussion. Read the full article at eugeneyan.com.