Aug 2024·7 min read
LLM-as-a-Judge — How to Evaluate AI Outputs Without Human Labels
Using LLMs to evaluate LLM outputs — prompt design, cross-model judging, and when to trust the judge.
Evaluating LLM outputs at scale requires automation. LLM-as-a-Judge is the most practical approach I've found.
Prompt design
A good evaluation prompt includes: the original task, the model output, a rubric with specific criteria, and a structured output format (JSON score + reasoning).
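As a rough illustration of those four ingredients, here is a minimal sketch of a prompt builder and a parser for the judge's reply. The template wording, the 1-to-5 scale, and the names `build_judge_prompt` and `parse_judge_response` are my own assumptions for the example, not a standard API.

```python
import json

# Hypothetical template: task, model output, rubric, and a JSON output format.
JUDGE_TEMPLATE = """You are an impartial evaluator.

## Task
{task}

## Model output
{output}

## Rubric
{rubric}

Score each criterion from 1 to 5. Respond with JSON only, in the form:
{{"scores": {{"<criterion>": <int>}}, "reasoning": "<short explanation>"}}"""


def build_judge_prompt(task, output, rubric):
    """Render the evaluation prompt. `rubric` maps criterion name -> description."""
    rubric_text = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
    return JUDGE_TEMPLATE.format(task=task, output=output, rubric=rubric_text)


def parse_judge_response(raw, rubric):
    """Parse the judge's JSON reply and check that every criterion has a 1-5 score."""
    data = json.loads(raw)
    scores = data["scores"]
    missing = set(rubric) - set(scores)
    if missing:
        raise ValueError(f"judge omitted criteria: {sorted(missing)}")
    for name in rubric:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"score for {name!r} is outside 1-5: {scores[name]}")
    return {
        "scores": {name: int(scores[name]) for name in rubric},
        "reasoning": str(data.get("reasoning", "")),
    }
```

Validating the reply matters in practice: judges occasionally skip a criterion or return a score off the scale, and it is better to fail loudly than to average in a bad value.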
Cross-model judging
Never let a model judge itself. Use a different model (e.g., have GPT-4 evaluate Llama outputs) to avoid self-preference bias, where a model rates its own outputs more favorably. Average scores across 3 judges for reliability.
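A minimal sketch of that rule, assuming each judge is wrapped in a callable that takes the evaluation prompt and returns a numeric score. The function name `cross_model_score` and the judge names in the usage note are hypothetical; real judges would call a model API instead of returning constants.

```python
from statistics import mean


def cross_model_score(candidate_model, judges, prompt):
    """Average scores from every judge except the model that produced the output.

    `judges` maps a model name to a callable: prompt -> numeric score.
    Returns (mean_score, per_judge_scores).
    """
    # Drop the candidate from the panel so it never grades itself.
    eligible = {name: fn for name, fn in judges.items() if name != candidate_model}
    if not eligible:
        raise ValueError("no judge available other than the candidate model")
    per_judge = {name: fn(prompt) for name, fn in eligible.items()}
    return mean(list(per_judge.values())), per_judge
```

For example, with a panel of four stub judges named `gpt-4`, `claude`, `mistral`, and `llama`, scoring an output produced by `llama` would average only the other three. Keeping the per-judge scores around also makes it easy to spot a judge that consistently disagrees with the rest.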
When to trust it
LLM judges are surprisingly reliable for factual accuracy, tone, and structure. They're less reliable for subjective quality. Always validate the judge against human labels on a held-out set before relying on its scores.
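One simple way to run that check, sketched under the assumption that judge and human scores sit on the same numeric scale and are aligned by example. The function name `judge_agreement` and the `tolerance` parameter are illustrative choices, and agreement rate plus mean absolute error are just two of several metrics you could use.

```python
def judge_agreement(judge_scores, human_scores, tolerance=0):
    """Compare judge scores with human labels on the same held-out examples.

    Returns the fraction of examples where the two scores differ by at most
    `tolerance`, plus the mean absolute difference between them.
    """
    if len(judge_scores) != len(human_scores):
        raise ValueError("judge and human score lists must be the same length")
    if not judge_scores:
        raise ValueError("need at least one scored example")
    diffs = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    return {
        "agreement_rate": sum(d <= tolerance for d in diffs) / len(diffs),
        "mean_abs_error": sum(diffs) / len(diffs),
    }
```

If agreement is low for a given criterion, that is a signal to tighten the rubric or keep a human in the loop for that criterion rather than trusting the judge's numbers.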