
LLM-as-judge variance broke our DPO training signal for 3 weeks

#machinelearning #llm #mlops
⚡ TL;DR · AI summary

A single LLM used as the preference judge in a DPO pipeline produced a misleading training signal: training reward climbed on every run while production accuracy fell 4 points. The judge disagreed with itself, flipping its own labels 28% of the time at temperature 0. After the team switched to a three-judge consensus system, production accuracy improved, although costs increased.
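The summary does not say how the three-judge consensus was computed. A minimal sketch, assuming a simple majority vote over judges that each return "A" or "B" for a response pair, might look like the following. The `Judge` type, `consensus_label`, and the `min_votes` parameter are hypothetical names for illustration and are not taken from the article.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Assumption: a judge takes (prompt, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def consensus_label(
    judges: Sequence[Judge],
    prompt: str,
    response_a: str,
    response_b: str,
    min_votes: int = 2,
) -> Optional[str]:
    """Return the majority label, or None if it has fewer than min_votes votes."""
    if not judges:
        raise ValueError("at least one judge is required")
    # Collect one vote per judge and count them.
    votes = Counter(judge(prompt, response_a, response_b) for judge in judges)
    label, count = votes.most_common(1)[0]
    # Pairs without enough agreement are dropped rather than given a noisy label.
    return label if count >= min_votes else None
```

With three judges and two possible labels, `min_votes=2` always yields a label. Setting `min_votes=3` requires unanimity and discards contested pairs, which trades training data volume for label quality. Either way, every pair costs three judge calls instead of one, consistent with the cost increase the summary mentions.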

Original article
DEV.to (Top)
Opening excerpt (first ~120 words)

Marcus Chen
Posted on May 27

LLM-as-judge variance broke our DPO training signal for 3 weeks
#machinelearning #llm #mlops #pytorch

TL;DR: Our DPO pipeline used a single LLM as the preference judge. Training reward climbed every run. Production accuracy fell 4 points. The judge was flipping its own labels 28% of the time at temperature 0.

The setup

Nexus Labs ships agents that book travel, file expenses, process insurance claims. Eight engineers on my fine-tuning team.
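The excerpt does not show how the 28% figure was measured. One way to estimate a judge's self-disagreement, sketched here under the assumption that a "flip" means repeated calls on the same input do not all return the same label, is to re-query the judge on a fixed sample of pairs. `flip_rate`, `Judge`, and `Pair` are hypothetical names, not the author's code.

```python
from typing import Callable, Sequence, Tuple

# Assumption: a judge takes (prompt, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]
Pair = Tuple[str, str, str]


def flip_rate(judge: Judge, pairs: Sequence[Pair], repeats: int = 2) -> float:
    """Fraction of pairs on which repeated judge calls do not all agree."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    if repeats < 2:
        raise ValueError("repeats must be at least 2 to detect a flip")
    flipped = 0
    for prompt, response_a, response_b in pairs:
        # Query the same input several times and collect the distinct labels.
        labels = {judge(prompt, response_a, response_b) for _ in range(repeats)}
        if len(labels) > 1:
            flipped += 1
    return flipped / len(pairs)
```

A perfectly consistent judge scores 0.0 on this metric. Running a check like this before training starts would flag label noise that a rising training reward can otherwise hide.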

Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to (Top).
