LLM Judges: How the Industry Evaluates AI with AI
Human evaluation doesn't scale. Here's how companies use language models to judge other language models—and how you can too.
Three months ago, I shipped an agent that summarized customer support tickets. It worked great in testing. In production, it started generating summaries that were technically accurate but completely missed the point—flagging routine requests as urgent and burying actual emergencies in boilerplate.
The problem wasn’t the agent. The problem was my evaluation. I’d tested a handful of examples manually, declared victory, and moved on. When I finally dug into production data, I found that roughly 15% of summaries had issues I should have caught.
I needed a way to evaluate thousands of outputs without reading each one myself. That’s when I discovered LLM judges—using language models to evaluate other language models. It sounds circular, but it works surprisingly well when done right.
What Is an LLM Judge?
An LLM judge is exactly what it sounds like: a language model that evaluates the outputs of other AI systems. You give it criteria, show it some output, and ask it to score or critique that output.
The simplest version looks like this:
def judge_response(response: str, criteria: str) -> dict:
    # call_llm and parse_judgment are placeholders for your own client and parser
    prompt = f"""
Evaluate the following response based on these criteria: {criteria}
Response to evaluate:
{response}
Provide a score from 1-5 and a brief explanation.
"""
    judgment = call_llm(prompt)  # Your preferred model
    return parse_judgment(judgment)
That’s the core idea. Everything else is refinement—better prompts, structured outputs, calibration, handling edge cases.
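The parsing step is where the simplest version tends to break first, so here is a minimal sketch of it. It assumes the judge replies in free text with something like "Score: 4"; the name `parse_judgment` matches the placeholder above, but the regex and the returned fields are illustrative assumptions, not a standard.

```python
import re


def parse_judgment(judgment: str) -> dict:
    """Pull a 1-5 score out of free-text judge output.

    Hypothetical helper: assumes the judge writes the word "score"
    shortly before the number, e.g. "Score: 4" or "a score of 3".
    """
    match = re.search(r"score\D{0,10}([1-5])\b", judgment, flags=re.IGNORECASE)
    # None signals a reply we could not parse, so callers can retry or skip it
    score = int(match.group(1)) if match else None
    return {"score": score, "explanation": judgment.strip()}


print(parse_judgment("Score: 4\nAccurate, but misses the urgency of the ticket."))
```

Returning `None` instead of a default score keeps unparseable replies from silently skewing your averages.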
Why the Industry Adopted This
Human evaluation is the gold standard, but it doesn’t scale. If you’re iterating on a model or agent, you might generate thousands of test outputs per day. Having humans rate each one is slow, expensive, and inconsistent—different raters apply different standards.
Automated metrics like BLEU, ROUGE, or embedding similarity help for some tasks, but they miss nuance. A response can have high semantic similarity to a reference answer while still being wrong, unhelpful, or inappropriate. These metrics measure surface-level properties, not whether the response actually accomplishes its goal.
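To make that failure concrete, here is a small sketch using a token-overlap F1 score. It is a simplified stand-in for ROUGE-1, not the official implementation, and the example sentences are made up for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two strings (a rough ROUGE-1 approximation)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # tokens shared by both, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "your refund will arrive within five business days"
wrong = "your refund will not arrive within five business days"
right = "expect the money back in about a week"

print(round(unigram_f1(wrong, reference), 2))  # 0.94: one added word flips the meaning
print(round(unigram_f1(right, reference), 2))  # 0.0: correct paraphrase, no shared tokens
```

The answer that contradicts the reference scores near the top, while the correct paraphrase scores zero. That is the gap a judge with actual language understanding is meant to close.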
LLM judges fill the gap. They can evaluate nuanced qualities like helpfulness, accuracy, tone, and relevance, things that require understanding rather than pattern matching. They’re fast, more consistent than a rotating pool of human raters, and cheap enough to run on every output.
The research community validated this approach. Papers from Google, Anthropic, and academic labs showed that GPT-4’s judgments correlate reasonably well with human preferences—not perfectly, but well enough to be useful for development and testing.
How It Works in Practice
The basic pattern has three components: the criteria you’re judging against, the prompt that instructs the judge, and the output format you expect.
For criteria, you need to be specific. “Rate how good this response is” produces inconsistent results. “Rate how completely this response answers the user’s question, considering accuracy, relevance, and whether it addresses all parts of the question” gives the judge something concrete to evaluate.
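One way to keep criteria specific is to store each one with the question the judge should answer and a description of the low and high ends of the scale. The sketch below is one possible layout; the `Criterion` class, its field names, and `render_rubric` are illustrative assumptions rather than an established format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    question: str     # what the judge should check
    low_anchor: str   # what a score of 1 looks like
    high_anchor: str  # what a score of 5 looks like


def render_rubric(criteria: list[Criterion]) -> str:
    """Turn a list of criteria into the numbered rubric text for a judge prompt."""
    lines = []
    for i, c in enumerate(criteria, start=1):
        lines.append(f"{i}. {c.name.upper()}: {c.question}")
        lines.append(f"   1 = {c.low_anchor}; 5 = {c.high_anchor}")
    return "\n".join(lines)


rubric = render_rubric([
    Criterion(
        name="completeness",
        question="Does the response address every part of the user's question?",
        low_anchor="ignores most of the question",
        high_anchor="covers every part with nothing important missing",
    ),
    Criterion(
        name="accuracy",
        question="Is every factual claim in the response correct?",
        low_anchor="contains major factual errors",
        high_anchor="no factual errors",
    ),
])
print(rubric)
```

Anchoring the ends of the scale gives the judge a shared definition of "1" and "5", which tends to matter as much as the criterion name itself.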
The prompt structure matters more than you’d expect. I’ve found that asking the judge to explain its reasoning before giving a score produces better results than asking for scores directly. This mirrors how chain-of-thought prompting improves reasoning—forcing the model to articulate its thinking leads to more considered judgments.
Here’s a more complete implementation:
import json
from typing import Optional

from openai import OpenAI

client = OpenAI()

def evaluate_response(
    question: str,
    response: str,
    reference: Optional[str] = None,
    model: str = "gpt-4-turbo-preview"
) -> dict:
    """
    Evaluate an AI response using an LLM judge.
    Returns scores and reasoning.
    """
    system_prompt = """You are an expert evaluator of AI responses.
Your job is to assess responses based on specific criteria and provide
detailed, justified scores. Be critical but fair."""
    # Include the reference section only when a reference answer is supplied
    reference_block = (
        f"REFERENCE ANSWER (for comparison):\n{reference}\n" if reference else ""
    )
    eval_prompt = f"""
Evaluate the following AI response to a user question.
USER QUESTION:
{question}
AI RESPONSE:
{response}
{reference_block}
Evaluate on these criteria (1-5 scale each):
1. ACCURACY: Is the information correct? Are there any factual errors?
2. COMPLETENESS: Does it fully address the question? Any missing information?
3. CLARITY: Is it well-organized and easy to understand?
4. HELPFULNESS: Would this actually help the user accomplish their goal?
First, analyze the response against each criterion. Then provide scores.
Respond in this JSON format:
{{
"accuracy_reasoning": "...",
"accuracy_score": X,
"completeness_reasoning": "...",
"completeness_score": X,
"clarity_reasoning": "...",
"clarity_score": X,
"helpfulness_reasoning": "...",
"helpfulness_score": X,
"overall_assessment": "...",
"overall_score": X
}}
"""
    # Named `completion` so it does not shadow the `response` argument
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": eval_prompt}
        ],
        temperature=0.3,  # Lower temperature for more consistent judgments
        response_format={"type": "json_object"}
    )
    return json.loads(completion.choices[0].message.content)
The lower temperature is intentional. You want consistent evaluations—the same response should get roughly the same score if evaluated multiple times. Higher temperatures introduce variance that makes it harder to compare results across runs.
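You can check whether the temperature setting is doing its job by scoring the same response several times and looking at the spread. The sketch below assumes a `judge` callable that takes a response and returns a numeric score; `fake_judge` is a stand-in that replays canned scores so the example runs without an API key.

```python
import itertools
import statistics
from typing import Callable


def score_spread(judge: Callable[[str], float], response: str, runs: int = 5) -> dict:
    """Score the same response `runs` times and summarize the variation."""
    scores = [judge(response) for _ in range(runs)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "range": max(scores) - min(scores),
    }


# Stand-in judge that replays canned scores; swap in a real judge call here.
_canned = itertools.cycle([4, 4, 5, 4, 4])


def fake_judge(response: str) -> float:
    return next(_canned)


spread = score_spread(fake_judge, "Click 'Forgot Password' on the login page.")
print(spread)
```

What counts as too much spread depends on your scale and your thresholds. On a 1-5 scale, a range of two or more points for identical input is usually a sign that the criteria or the prompt need tightening.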
Common Use Cases
LLM judges show up across the industry in several patterns.
Model development and comparison is the most visible use. Benchmarks like MT-Bench and AlpacaEval use GPT-4 to judge model responses across a range of tasks. When you see claims that “Model X achieves 95% of GPT-4 performance,” that’s often based on LLM judge evaluations. LMSYS Chatbot Arena is the human counterpart: real users vote head-to-head between two anonymous models, and those votes are one of the references that LLM judges get checked against.
RLHF reward modeling uses judges during training. When you’re training a model with reinforcement learning from human feedback, you need a reward signal for every generated response. Human labelers provide initial preferences, but the reward model that generalizes from those preferences is essentially an LLM judge. Some systems use LLMs directly as reward models rather than training separate smaller models.
Production quality monitoring is where I use judges most. Every response my agents generate gets evaluated by a judge. I track scores over time, alert on degradation, and sample low-scoring responses for manual review. It’s not perfect, but it catches issues much faster than waiting for user complaints.
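The "alert on degradation" part can be as simple as a rolling mean over recent judge scores. This is a minimal sketch of that idea; the `ScoreMonitor` class, the window size, and the threshold are illustrative assumptions that you would tune to your own traffic.

```python
from collections import deque


class ScoreMonitor:
    """Track a rolling mean of judge scores and flag when it drops too low."""

    def __init__(self, window: int = 100, alert_below: float = 3.5):
        self.scores = deque(maxlen=window)  # oldest scores fall off automatically
        self.alert_below = alert_below

    def record(self, score: float) -> bool:
        """Add a score; return True if the rolling mean is below the threshold."""
        self.scores.append(score)
        # Wait for a full window so a couple of early low scores do not page anyone
        window_full = len(self.scores) == self.scores.maxlen
        rolling_mean = sum(self.scores) / len(self.scores)
        return window_full and rolling_mean < self.alert_below


monitor = ScoreMonitor(window=3, alert_below=3.5)
alerts = [monitor.record(score) for score in [5, 5, 5, 2, 2]]
print(alerts)  # [False, False, False, False, True]
```

The alert fires only on the last score, once the window holds 5, 2, 2 and the mean falls to 3.0.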
Content moderation at scale relies heavily on LLM judges. Classifying content as appropriate, harmful, or policy-violating is a judgment task that LLMs handle reasonably well. Most major platforms use some form of this, often with multiple judges and human review for edge cases.
Automated testing in CI/CD is emerging. Instead of just checking that your agent doesn’t crash, you can evaluate whether its outputs meet quality thresholds. Failed quality checks block deployment just like failed unit tests.
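A quality gate for CI can be sketched as a function that takes the judge scores for your test set and returns a list of failures. The function name and the default thresholds below are assumptions for illustration; a CI script would exit non-zero whenever the returned list is non-empty.

```python
def quality_gate(
    scores: list[float],
    min_mean: float = 4.0,
    min_score: float = 3.0,
    max_low_fraction: float = 0.05,
) -> list[str]:
    """Return failure messages; an empty list means the gate passes."""
    if not scores:
        return ["no scores to evaluate"]
    failures = []
    mean = sum(scores) / len(scores)
    if mean < min_mean:
        failures.append(f"mean score {mean:.2f} is below {min_mean}")
    # Also cap the share of individually bad outputs, which a good mean can hide
    low_fraction = sum(1 for s in scores if s < min_score) / len(scores)
    if low_fraction > max_low_fraction:
        failures.append(
            f"{low_fraction:.0%} of outputs scored below {min_score} "
            f"(limit {max_low_fraction:.0%})"
        )
    return failures


print(quality_gate([5, 5, 4, 5]))  # []
print(quality_gate([5, 5, 2, 2]))
```

Checking both the mean and the fraction of low scores matters because a handful of bad outputs can hide behind a healthy average.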
Popular Frameworks and Benchmarks
Several established approaches have emerged for LLM evaluation.
MT-Bench from LMSYS uses GPT-4 to judge responses on 80 multi-turn questions across categories like writing, reasoning, math, and coding. It’s become a standard benchmark for comparing chat models. The key insight was using multi-turn conversations rather than single exchanges, which better reflects real usage.
AlpacaEval compares model outputs against a fixed baseline model (text-davinci-003 in the original release, GPT-4 Turbo in AlpacaEval 2.0) using an LLM judge that picks the better response. The win rate against the baseline becomes the benchmark score. It’s simple but effective for ranking models.
G-Eval from Microsoft Research provides a framework for evaluating natural language generation with LLM judges. It emphasizes the importance of detailed evaluation criteria and chain-of-thought reasoning in judge prompts.
LangChain and LlamaIndex both include built-in evaluator modules. LangChain’s langchain.evaluation package has ready-made evaluators for criteria like correctness, helpfulness, and relevance. These are convenient starting points, though you’ll likely want to customize for your specific use case.
For practical use, you don’t need to implement everything from scratch:
from langchain.evaluation import load_evaluator

# Built-in criteria evaluator
evaluator = load_evaluator("criteria", criteria="helpfulness")
result = evaluator.evaluate_strings(
    input="How do I reset my password?",
    prediction="Click the 'Forgot Password' link on the login page and follow the instructions sent to your email.",
)
print(result)
# {'reasoning': '...', 'value': 'Y', 'score': 1}
Building Your Own Evaluation Pipeline
For production use, you’ll want a more robust setup than one-off evaluations. Here’s the pattern I use:
import json
import statistics
from datetime import datetime
from typing import List

class EvaluationPipeline:
    def __init__(self, judge_model: str = "gpt-4-turbo-preview"):
        self.judge_model = judge_model
        self.results = []

    def evaluate_batch(
        self,
        test_cases: List[dict],
        criteria: List[str]
    ) -> dict:
        """
        Evaluate a batch of test cases and return aggregate statistics.
        test_cases: [{"input": ..., "output": ..., "reference": ...}, ...]
        criteria: ["accuracy", "helpfulness", ...]
        """
        for case in test_cases:
            scores = self._evaluate_single(case, criteria)
            self.results.append({
                "timestamp": datetime.now().isoformat(),
                "input": case["input"],
                "output": case["output"],
                "scores": scores
            })
        return self._compute_statistics(criteria)

    def _evaluate_single(self, case: dict, criteria: List[str]) -> dict:
        # Your evaluation logic here (similar to earlier example).
        # Must return a dict mapping each criterion name to a numeric score.
        raise NotImplementedError

    def _compute_statistics(self, criteria: List[str]) -> dict:
        # Statistics cover every result recorded so far, not just the latest batch
        stats = {}
        for criterion in criteria:
            # A missing criterion counts as 0, which pulls the mean down
            scores = [r["scores"].get(criterion, 0) for r in self.results]
            stats[criterion] = {
                "mean": statistics.mean(scores),
                "median": statistics.median(scores),
                "std_dev": statistics.stdev(scores) if len(scores) > 1 else 0,
                "min": min(scores),
                "max": max(scores)
            }
        return stats

    def get_failures(self, threshold: float = 3.0) -> List[dict]:
        """Return cases where overall score fell below threshold."""
        # Assumes the scores dict has an "overall" key; cases without one pass
        return [
            r for r in self.results
            if r["scores"].get("overall", 5) < threshold
        ]
The key practices: batch your evaluations to reduce overhead, compute statistics across runs rather than fixating on individual scores, and track failures for manual review.
The Limitations (And They’re Real)
LLM judges have significant limitations you need to understand.
Position bias is well-documented. When comparing two responses, judges often prefer whichever one is presented first (or sometimes second, depending on the model and prompt). This affects pairwise comparisons—you should evaluate both orderings and average the results.
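Evaluating both orderings can be sketched as below. The `judge_pair` callable is a hypothetical interface: it takes the question and two responses and returns "first", "second", or "tie" for whichever position it prefers. The two stand-in judges exist only to show how the averaging behaves.

```python
from typing import Callable

PairJudge = Callable[[str, str, str], str]  # returns "first", "second", or "tie"


def a_preference(judge_pair: PairJudge, question: str, a: str, b: str) -> float:
    """Average A's result over both orderings: 1.0 = A wins, 0.0 = B wins."""
    points = {"first": 1.0, "tie": 0.5, "second": 0.0}
    forward = points[judge_pair(question, a, b)]         # A shown first
    backward = 1.0 - points[judge_pair(question, b, a)]  # A shown second
    return (forward + backward) / 2


def always_first(question: str, first: str, second: str) -> str:
    """A maximally position-biased judge, for demonstration only."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """A judge whose verdict depends on content (here: length), not position."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


q = "How do I reset my password?"
print(a_preference(always_first, q, "short", "a longer answer"))    # 0.5
print(a_preference(prefers_longer, q, "a longer answer", "short"))  # 1.0
```

A judge that only ever picks the first slot averages out to 0.5, so its bias cancels instead of leaking into your rankings. It does double the number of judge calls per comparison.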
Self-preference bias means models tend to prefer outputs that match their own style. GPT-4 judges often prefer GPT-4-generated responses over Claude responses, and vice versa. If you’re evaluating model A using model A as the judge, your results are likely to be inflated.
Length bias is subtle but persistent. Longer, more detailed responses tend to score higher even when the extra length doesn’t add value. You may need to explicitly instruct judges to penalize unnecessary verbosity.
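A quick way to check for this in your own data is to correlate response length with judge score. The sketch below assumes results shaped like `{"output": ..., "scores": {"overall": ...}}`; that shape and the sample data are assumptions for illustration. The Pearson correlation is written out by hand so the example runs on older Python versions.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def length_score_correlation(results: list[dict]) -> float:
    """Correlate word count of each output with its overall judge score."""
    lengths = [len(r["output"].split()) for r in results]
    scores = [r["scores"]["overall"] for r in results]
    return pearson(lengths, scores)


sample = [
    {"output": "Reset it.", "scores": {"overall": 2}},
    {"output": "Use the Forgot Password link.", "scores": {"overall": 3}},
    {"output": "Use the Forgot Password link, then check your email inbox.",
     "scores": {"overall": 4}},
]
print(round(length_score_correlation(sample), 2))
```

A strong positive correlation is a prompt to look closer, not proof of bias, since longer answers are sometimes better. Comparing scores for padded and unpadded versions of the same answer is a more direct test.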
Calibration drift happens when the judge model gets updated. If OpenAI updates GPT-4, your evaluation scores might shift even though your system hasn’t changed. This makes longitudinal comparisons tricky—always note which judge model version you used.
Genuine disagreement with humans occurs on some percentage of cases. Studies show around 80% agreement between GPT-4 judges and human raters on preference tasks, which in the MT-Bench study was about the rate at which human raters agreed with each other. That still means 20% disagreement: on one in five evaluations, the judge might get it wrong by human standards.
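You can measure this on your own data by having humans label a sample and comparing their labels with the judge's. The sketch below computes raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. The labels are invented for illustration.

```python
from collections import Counter


def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Fraction of items where judge and human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for chance: 1.0 = perfect, 0.0 = chance level."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = judge_counts.keys() | human_counts.keys()
    # Chance agreement: probability both pick the same label independently
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


judge_labels = ["A", "A", "B", "B", "A"]
human_labels = ["A", "B", "B", "B", "A"]
print(agreement_rate(judge_labels, human_labels))          # 0.8
print(round(cohens_kappa(judge_labels, human_labels), 2))  # 0.62
```

Raw agreement looks generous when one label dominates, which is why kappa is worth reporting alongside it.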
Gaming is possible. If you know how the judge evaluates, you can optimize for judge approval rather than actual quality. This is especially problematic if you’re using judges in training loops—you might train models that fool the judge without actually improving.
Best Practices
Based on my experience running judges in production:
Use the best model you can afford for judging. The quality gap between GPT-4 and smaller models is significant for evaluation tasks. This is one place where spending more usually pays off.
Be explicit about criteria. Vague instructions produce vague judgments. Define exactly what you mean by “helpful” or “accurate” in the context of your specific task.
Request reasoning before scores. Judges that explain their thinking produce more consistent and useful evaluations than those that just output numbers.
Run multiple evaluations for important decisions. If you’re making a deployment decision based on eval scores, run the evaluation three times and check for consistency. High variance suggests your evaluation setup needs refinement.
Calibrate against human judgments. Periodically have humans rate a sample of the same outputs your judge is rating. If they diverge significantly, your judge prompts need adjustment.
Track judge consistency over time. The same input-output pair should receive similar scores across evaluations. If scores vary wildly, something is wrong with your setup.
Don’t use the same model as both generator and judge when possible. The self-preference bias is real. If you’re evaluating Claude outputs, use GPT-4 as the judge, and vice versa.
The Bottom Line
LLM judges aren’t perfect, but they’re good enough to be useful. They catch issues that automated metrics miss, scale to volumes that human evaluation can’t handle, and provide fast feedback that accelerates development.
Start simple: take your existing test cases, add an LLM evaluation step, and see what it catches. You’ll quickly learn where judges help and where they fall short for your specific use case.
The key insight is that you’re not replacing human judgment—you’re augmenting it. Judges handle the volume, humans handle the edge cases and calibration. Together, they’re more effective than either alone.