JudgeKit: Generate LLM-as-Judge prompts grounded in published research
JudgeKit is a tool that generates evaluation prompts for large language models based on published research, allowing users to assess model outputs according to specific criteria like faithfulness. It supports both pointwise and pairwise evaluation modes and provides a preview of the evaluator prompt with optional stress testing. The tool is free to use, requires no signup, and includes privacy safeguards for user inputs.
Opening excerpt (first ~120 words):
How it works
Step 1, Start here: Paste or build. Paste a trace, system prompt, or skip to the wizard below.
Step 2: Review.
Step 3: Generate.

LLM-as-a-Judge prompt generator. Build a judge humans agree with. Paste a trace and get a research-grounded judge (evaluator prompt) with drop-in code and a 3-judge stress test. Free, no signup.

Paste an existing trace, span, or system prompt. We pre-fill the wizard below from what you paste; you review and edit. Strip real PII before pasting. Inputs and any few-shot examples extracted from them are cached for 6 hours. (Privacy details.)

What are you evaluating? Pointwise scores one response.
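The excerpt ends before showing what a generated evaluator prompt looks like. As a rough illustration of the pointwise pattern the page describes (a judge model scores a single response against one criterion such as faithfulness), here is a minimal sketch; the function name, scoring scale, and wording are hypothetical and not JudgeKit's actual output:

```python
def build_pointwise_judge_prompt(criterion: str, trace: str, response: str) -> str:
    """Assemble a pointwise LLM-as-judge prompt: the judge scores one
    response against one criterion (hypothetical template, 1-5 scale)."""
    return (
        f"You are an impartial evaluator. Criterion: {criterion}.\n\n"
        f"Input trace:\n{trace}\n\n"
        f"Candidate response:\n{response}\n\n"
        "Score the response from 1 (fails the criterion) to 5 (fully "
        "satisfies it), then give a one-sentence justification."
    )

# Example: evaluating faithfulness of a response to a retrieved document.
prompt = build_pointwise_judge_prompt(
    criterion="faithfulness to the source trace",
    trace="User asked for the capital of France; retrieved doc says Paris.",
    response="The capital of France is Paris.",
)
print(prompt)
```

A pairwise mode would instead take two candidate responses and ask the judge which one better satisfies the criterion, which is the other evaluation mode the page mentions.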
…
Excerpt limited to ~120 words for fair-use compliance. The full page is at JudgeKit.