Takeaways
How to Use AI as a Judge
- LLMs perform better on textual classification tasks than numerical scoring tasks.
- Example: Asking an AI judge to classify whether an answer is “helpful” vs. “not helpful” will yield more reliable results than asking it to rate on a 1–10 scale.
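To make this concrete, here is a minimal sketch of a classification-style judge. The `call_llm()` wrapper, prompt wording, and label parsing are illustrative placeholders for whichever LLM client and rubric you actually use; they are not from the book.

```python
# Minimal sketch of a classification-style AI judge.
# call_llm() is a placeholder for your own LLM client (OpenAI, Anthropic,
# a local model, ...); the prompt and labels are illustrative only.

JUDGE_PROMPT = """You are evaluating a customer-support answer.

Question: {question}
Answer: {answer}

Reply with exactly one of these labels: helpful / not helpful."""

def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM client you use."""
    raise NotImplementedError

def judge_helpfulness(question: str, answer: str) -> bool:
    """Return True if the AI judge labels the answer as helpful."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return reply.strip().lower().startswith("helpful")
```

Asking for one of two labels and parsing it is easier to make reliable than asking the judge to calibrate a 1–10 score.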
Criteria Ambiguity
- Transparency is critical. Never trust an AI judge if you don’t know:
  - The underlying model used.
  - The prompting setup driving its evaluations.
- Without this, you can’t audit or reproduce results.
Biases of AI as a Judge
- AI judges inherit systemic biases from training data and instructions.
- This can skew comparative evaluations, rankings, and perceived “quality” of outputs.
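One concrete bias worth checking for is position bias, where the judge tends to favor whichever answer it sees first. Below is a hedged sketch of a cheap consistency check, assuming a hypothetical `judge_prefers_first()` wrapper around your judge model (the function is not an API from the book).

```python
# Detecting position bias: ask for the same comparison twice with the
# answer order swapped and flag verdicts that flip with the ordering.

def judge_prefers_first(question: str, first_answer: str, second_answer: str) -> bool:
    """Return True if the judge prefers the first-shown answer (placeholder)."""
    raise NotImplementedError  # call your judge model here

def position_consistent(question: str, answer_a: str, answer_b: str) -> bool:
    """True if the judge's verdict is stable when the answer order is swapped."""
    a_shown_first = judge_prefers_first(question, answer_a, answer_b)
    b_shown_first = judge_prefers_first(question, answer_b, answer_a)
    # Consistent judging means exactly one of the two orderings picks the
    # first-shown answer; if both (or neither) do, position is driving the verdict.
    return a_shown_first != b_shown_first
```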
Ranking Models with Comparative Evaluation
- Not all questions should be answered by preference. Some must be answered by correctness.
- Example: If asked “Is there a link between cell phone radiation and brain tumors?”, preference voting between “Yes” and “No” can be misleading. Correctness requires grounding in factual evidence, not just preference.
- If preference-only signals are fed back into training, the model risks learning to produce answers people prefer rather than answers that are correct, i.e., misaligned behavior.
Scalability Bottlenecks
- Comparative evaluation grows quadratically with model count.
- To compare N models, you need N(N−1)/2 unique pairwise match-ups, which grows quadratically (on the order of N²).
- This creates data and cost bottlenecks at scale.
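For intuition, the number of unique pairs is N(N−1)/2, so the comparison budget grows roughly fourfold every time the candidate pool doubles:

```python
from math import comb

# Unique model pairs to cover as the candidate pool grows (quadratic blow-up).
for n in (4, 8, 16, 32):
    print(f"{n} models -> {comb(n, 2)} pairs")
# 4 -> 6, 8 -> 28, 16 -> 120, 32 -> 496
```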
From Comparative to Absolute Performance
- Winning against another model (relative performance) doesn’t always translate to real-world success.
- Example: Model A resolves 70% of tickets. Model B wins 51% of the time in pairwise evaluation against A.
- This does not clearly map to how many tickets B will resolve.
- Need methods to bridge relative judgments → absolute utility.
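A toy calculation shows why the mapping is ambiguous. Assume, purely for illustration (these assumptions are mine, not the book's), that A and B resolve tickets independently and that when exactly one of them resolves a ticket, the judge prefers that answer with probability p; otherwise the judge flips a coin. The same 51% win rate then implies very different resolution rates for B:

```python
# Toy model: how a 51% pairwise win rate maps (or fails to map) to an
# absolute ticket-resolution rate. Assumptions are illustrative only.

def win_rate_of_b(a: float, b: float, p: float) -> float:
    """Probability that B's answer is preferred over A's under the toy model."""
    tie_cases = a * b + (1 - a) * (1 - b)   # both resolve, or neither does
    only_b = b * (1 - a)
    only_a = a * (1 - b)
    return 0.5 * tie_cases + p * only_b + (1 - p) * only_a

def implied_resolution_rate(a: float, win: float, p: float) -> float:
    """Solve win_rate_of_b(a, b, p) == win for b by bisection (valid for p > 0.5)."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if win_rate_of_b(a, mid, p) < win:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

a, win = 0.70, 0.51  # A resolves 70% of tickets; B wins 51% of comparisons
for p in (1.0, 0.8, 0.6, 0.55):
    b = implied_resolution_rate(a, win, p)
    print(f"judge tracks correctness with p={p:.2f} -> B's implied resolution rate ~ {b:.2f}")
# p=1.00 -> ~0.72, p=0.80 -> ~0.73, p=0.60 -> ~0.80, p=0.55 -> ~0.90
```

Under these assumptions, the same 51% win rate is consistent with B resolving anywhere from roughly 72% to 90% of tickets, depending on how closely the judge's preference tracks correctness.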
Evaluation Criteria in Production
- Early ChatGPT-era chatbots were deployed without clear metrics; many companies still don’t know whether those chatbots improve or harm the user experience.
- Reinforces need for explicit evaluation frameworks.
Model Selection Workflow (Figure 4–5)
- Filter by hard attributes
  - Narrow down based on non-negotiables (e.g., open-source vs. API, deployment constraints, compliance).
- Use public benchmarks
  - Leaderboards, academic tests, and latency/cost trade-offs help shortlist candidates.
- Run custom experiments
  - Evaluate with your own pipeline, tailored to your objectives (quality, cost, latency).
- Continuous monitoring
  - Track production behavior, detect failures, collect feedback, and retrain/update accordingly.
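A minimal sketch of the first two steps of this workflow (hard-attribute filtering, then benchmark/cost shortlisting). The `Candidate` fields, example model names, and numbers are illustrative placeholders, not recommendations from the book.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    open_weights: bool          # can we self-host?
    context_window: int         # tokens
    cost_per_1k_tokens: float   # USD, blended input/output
    benchmark_score: float      # shortlisting metric, e.g. an internal eval

def filter_hard_attributes(candidates, require_open_weights=True, min_context=32_000):
    """Step 1: drop anything that violates a non-negotiable constraint."""
    return [
        c for c in candidates
        if (not require_open_weights or c.open_weights) and c.context_window >= min_context
    ]

def shortlist(candidates, top_k=3):
    """Step 2: rank survivors by benchmark score per unit cost, keep the top k."""
    ranked = sorted(candidates,
                    key=lambda c: c.benchmark_score / c.cost_per_1k_tokens,
                    reverse=True)
    return ranked[:top_k]

if __name__ == "__main__":
    pool = [
        Candidate("model-a", True, 128_000, 0.0010, 0.71),
        Candidate("model-b", False, 200_000, 0.0030, 0.78),
        Candidate("model-c", True, 32_000, 0.0004, 0.64),
    ]
    for c in shortlist(filter_hard_attributes(pool)):
        print(c.name)
    # The shortlist then goes through custom experiments on your own
    # pipeline and, once deployed, continuous monitoring.
```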