
Takeaways

How to Use AI as a Judge

  • LLMs perform better on textual classification tasks than on numerical scoring tasks.
  • Example: Asking an AI judge to classify whether an answer is “helpful” vs. “not helpful” will yield more reliable results than asking it to rate helpfulness on a 1–10 scale (both prompt styles are sketched below).
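
A minimal sketch of the two prompt styles, assuming a generic `call_llm` helper that sends a prompt to whatever judge model you use (the helper and prompt wording are illustrative, not from the book):

```python
# Two ways to phrase the same judging task. The classification version tends
# to give more consistent verdicts than the 1-10 scoring version.

CLASSIFY_PROMPT = """You are evaluating a customer-support answer.
Question: {question}
Answer: {answer}
Is the answer helpful? Reply with exactly one word: helpful or not_helpful."""

SCORE_PROMPT = """You are evaluating a customer-support answer.
Question: {question}
Answer: {answer}
Rate the answer's helpfulness on a scale of 1 to 10. Reply with only the number."""

def judge_helpfulness(question: str, answer: str, call_llm) -> bool:
    """Binary classification judgment; `call_llm(prompt) -> str` is a
    placeholder for your judge model's API."""
    verdict = call_llm(CLASSIFY_PROMPT.format(question=question, answer=answer))
    return verdict.strip().lower() == "helpful"
```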

Criteria Ambiguity

  • Evaluation criteria are ambiguous: the same criterion can be interpreted and scored differently, so transparency is critical. Never trust an AI judge if you don’t know:
    • The underlying model used.
    • The prompting setup driving its evaluations.
  • Without this, you can’t audit or reproduce results (a sketch of recording this metadata follows).
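
One way to make judge results auditable, sketched below with illustrative field names: store the judge model and the exact prompt alongside every verdict.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class JudgeRecord:
    """Everything needed to audit or reproduce a single judgment.
    Field names and values are illustrative."""
    judge_model: str    # exact model/version used as the judge
    judge_prompt: str   # full prompt template driving the evaluation
    sample_id: str      # which example was judged
    verdict: str        # what the judge said

record = JudgeRecord(
    judge_model="judge-model-2025-01",           # hypothetical identifier
    judge_prompt="Is the answer helpful? ...",   # the real template, in full
    sample_id="ticket-0042",
    verdict="helpful",
)
print(json.dumps(asdict(record), indent=2))
```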

Biases of AI as a Judge

  • AI judges inherit biases from their training data and from how they are prompted; common examples include position bias (favoring the first answer shown), verbosity bias (favoring longer answers), and self-bias (favoring outputs from models similar to the judge).
  • These biases can skew comparative evaluations, rankings, and the perceived “quality” of outputs; a simple position-bias mitigation is sketched below.
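
A standard mitigation for position bias, sketched here with a hypothetical `judge_pair` helper, is to run each comparison in both orders and keep only consistent verdicts:

```python
from typing import Callable, Optional

def debiased_preference(
    prompt: str,
    answer_a: str,
    answer_b: str,
    judge_pair: Callable[[str, str, str], str],
) -> Optional[str]:
    """Pairwise judgment with a position-bias check.

    `judge_pair(prompt, first, second)` is a placeholder for your judge call;
    it should return "first" or "second" for the preferred answer.
    Returns "A", "B", or None when the verdict flips with the order.
    """
    forward = judge_pair(prompt, answer_a, answer_b)   # A shown first
    backward = judge_pair(prompt, answer_b, answer_a)  # B shown first

    if forward == "first" and backward == "second":
        return "A"        # preferred regardless of position
    if forward == "second" and backward == "first":
        return "B"        # preferred regardless of position
    return None           # order-dependent verdict: likely position bias
```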

Ranking Models with Comparative Evaluation

  • Not all questions should be answered by preference. Some must be answered by correctness.
    • Example: If asked “Is there a link between cell phone radiation and brain tumors?”, preference voting between “Yes” and “No” can be misleading. Correctness requires grounding in factual evidence, not just preference.
  • Preference-only systems risk encoding misaligned behaviors if used for training.

Scalability Bottlenecks

  • Comparative evaluation grows quadratically with model count.
    • To compare N models against each other, you need N(N − 1)/2 model pairs, and each pair needs enough comparisons to be statistically meaningful (a quick calculation follows this list).
    • This creates data and cost bottlenecks at scale.
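
A quick back-of-the-envelope calculation (the samples-per-pair number is illustrative):

```python
from math import comb

def judge_calls(num_models: int, samples_per_pair: int) -> int:
    """comb(N, 2) = N * (N - 1) / 2 distinct model pairs, i.e. O(N^2) growth;
    each pair is judged on `samples_per_pair` prompts."""
    return comb(num_models, 2) * samples_per_pair

for n in (5, 10, 50, 100):
    print(f"{n:>3} models -> {judge_calls(n, samples_per_pair=1_000):>9,} judge calls")
```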

From Comparative to Absolute Performance

  • Winning against another model (relative performance) doesn’t always translate to real-world success.
    • Example: Model A resolves 70% of tickets. Model B wins 51% of the time in pairwise evaluation against A.
    • This does not clearly map to how many tickets B will resolve.
  • Need methods to bridge relative judgments (pairwise win rates) → absolute utility (e.g., resolution rate); a toy illustration follows.
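
A toy illustration of why the mapping is underdetermined (all numbers are hypothetical): the same 51% win rate is consistent with very different absolute resolution rates, depending on whether B wins on substance or merely on style.

```python
# Model A resolves 70% of tickets (from the example above). Two hypothetical
# worlds in which an AI judge prefers B's answer on 51% of tickets:
baseline_a = 0.70

scenarios = {
    "B wins mostly on substance": {"b_win_rate": 0.51, "b_resolution_rate": 0.72},
    "B wins mostly on style":     {"b_win_rate": 0.51, "b_resolution_rate": 0.55},
}

for name, s in scenarios.items():
    print(f"{name}: win rate vs A = {s['b_win_rate']:.0%}, "
          f"absolute resolution rate = {s['b_resolution_rate']:.0%} "
          f"(A: {baseline_a:.0%})")
```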

Evaluation Criteria in Production

  • Early ChatGPT-era chatbots were often deployed without clear metrics; many companies still don’t know whether their chatbots improve or harm the user experience.
  • Reinforces need for explicit evaluation frameworks.

Model Selection Workflow (Figure 4–5)

  1. Filter by hard attributes
    • Narrow down based on non-negotiables (e.g., open-source vs. API, deployment constraints, compliance).
  2. Use public benchmarks
    • Leaderboards, academic tests, and latency/cost trade-offs help shortlist candidates (steps 1–2 are sketched after this list).
  3. Run custom experiments
    • Evaluate with your own pipeline, tailored to your objectives (quality, cost, latency).
  4. Continuous monitoring
    • Track production behavior, detect failures, collect feedback, and retrain/update accordingly.
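
A minimal sketch of steps 1–2, assuming made-up model names, attributes, and benchmark scores: filter candidates by hard constraints, then shortlist the survivors by public benchmark score and cost.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    open_source: bool
    max_context: int           # tokens
    cost_per_1m_tokens: float  # USD, illustrative
    benchmark_score: float     # a public leaderboard score, illustrative

# Step 1: hard attributes -- non-negotiable constraints.
def passes_hard_filters(m: Candidate) -> bool:
    return m.open_source and m.max_context >= 32_000

# Step 2: use public benchmarks and cost to shortlist the survivors.
def shortlist(models: list[Candidate], top_k: int = 3) -> list[Candidate]:
    survivors = [m for m in models if passes_hard_filters(m)]
    return sorted(survivors, key=lambda m: (-m.benchmark_score, m.cost_per_1m_tokens))[:top_k]

candidates = [
    Candidate("model-a", open_source=True,  max_context=128_000, cost_per_1m_tokens=0.5, benchmark_score=82.1),
    Candidate("model-b", open_source=False, max_context=200_000, cost_per_1m_tokens=3.0, benchmark_score=88.4),
    Candidate("model-c", open_source=True,  max_context=8_000,   cost_per_1m_tokens=0.2, benchmark_score=74.9),
    Candidate("model-d", open_source=True,  max_context=32_000,  cost_per_1m_tokens=0.8, benchmark_score=79.3),
]

# Steps 3-4 (custom experiments, continuous monitoring) run on this shortlist.
for m in shortlist(candidates):
    print(m.name, m.benchmark_score)
```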