
Takeaways

How to Use AI as a Judge

  • LLMs perform better on textual classification tasks than on numerical scoring tasks.
  • Example: Asking an AI judge to classify whether an answer is “helpful” vs. “not helpful” will yield more reliable results than asking it to rate helpfulness on a 1–10 scale (both prompt styles are sketched below).
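
A minimal sketch of the two prompt styles, assuming a generic `call_llm` helper that sends a prompt to whatever judge model you use (the helper and prompt wording are illustrative, not from the book):

```python
# Two ways to phrase the same judging task. The classification version tends
# to give more consistent verdicts than the 1-10 scoring version.

CLASSIFY_PROMPT = """You are evaluating a customer-support answer.
Question: {question}
Answer: {answer}
Is the answer helpful? Reply with exactly one word: helpful or not_helpful."""

SCORE_PROMPT = """You are evaluating a customer-support answer.
Question: {question}
Answer: {answer}
Rate the answer's helpfulness on a scale of 1 to 10. Reply with only the number."""

def judge_helpfulness(question: str, answer: str, call_llm) -> bool:
    """Binary classification judgment; `call_llm(prompt) -> str` is a
    placeholder for your judge model's API."""
    verdict = call_llm(CLASSIFY_PROMPT.format(question=question, answer=answer))
    return verdict.strip().lower() == "helpful"
```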

Criteria Ambiguity

  • Evaluation criteria are ambiguous: the same criterion can be interpreted and scored differently, so transparency is critical. Never trust an AI judge if you don’t know:
    • The underlying model used.
    • The prompting setup driving its evaluations.
  • Without this, you can’t audit or reproduce results (a sketch of recording this metadata follows).
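
One way to make judge results auditable, sketched below with illustrative field names: store the judge model and the exact prompt alongside every verdict.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class JudgeRecord:
    """Everything needed to audit or reproduce a single judgment.
    Field names and values are illustrative."""
    judge_model: str    # exact model/version used as the judge
    judge_prompt: str   # full prompt template driving the evaluation
    sample_id: str      # which example was judged
    verdict: str        # what the judge said

record = JudgeRecord(
    judge_model="judge-model-2025-01",           # hypothetical identifier
    judge_prompt="Is the answer helpful? ...",   # the real template, in full
    sample_id="ticket-0042",
    verdict="helpful",
)
print(json.dumps(asdict(record), indent=2))
```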

Biases of AI as a Judge

  • AI judges inherit biases from their training data and from how they are prompted; common examples include position bias (favoring the first answer shown), verbosity bias (favoring longer answers), and self-bias (favoring outputs from models similar to the judge).
  • These biases can skew comparative evaluations, rankings, and the perceived “quality” of outputs; a simple position-bias mitigation is sketched below.
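
A standard mitigation for position bias, sketched here with a hypothetical `judge_pair` helper, is to run each comparison in both orders and keep only consistent verdicts:

```python
from typing import Callable, Optional

def debiased_preference(
    prompt: str,
    answer_a: str,
    answer_b: str,
    judge_pair: Callable[[str, str, str], str],
) -> Optional[str]:
    """Pairwise judgment with a position-bias check.

    `judge_pair(prompt, first, second)` is a placeholder for your judge call;
    it should return "first" or "second" for the preferred answer.
    Returns "A", "B", or None when the verdict flips with the order.
    """
    forward = judge_pair(prompt, answer_a, answer_b)   # A shown first
    backward = judge_pair(prompt, answer_b, answer_a)  # B shown first

    if forward == "first" and backward == "second":
        return "A"        # preferred regardless of position
    if forward == "second" and backward == "first":
        return "B"        # preferred regardless of position
    return None           # order-dependent verdict: likely position bias
```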

Ranking Models with Comparative Evaluation

  • Not all questions should be answered by preference. Some must be answered by correctness.
    • Example: If asked “Is there a link between cell phone radiation and brain tumors?”, preference voting between “Yes” and “No” can be misleading. Correctness requires grounding in factual evidence, not just preference.
  • Preference-only systems risk encoding misaligned behaviors if used for training.

Scalability Bottlenecks

  • Comparative evaluation grows quadratically with model count.
    • To compare N models against each other, you need N(N − 1)/2 model pairs, and each pair needs enough comparisons to be statistically meaningful (a quick calculation follows this list).
    • This creates data and cost bottlenecks at scale.
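
A quick back-of-the-envelope calculation (the samples-per-pair number is illustrative):

```python
from math import comb

def judge_calls(num_models: int, samples_per_pair: int) -> int:
    """comb(N, 2) = N * (N - 1) / 2 distinct model pairs, i.e. O(N^2) growth;
    each pair is judged on `samples_per_pair` prompts."""
    return comb(num_models, 2) * samples_per_pair

for n in (5, 10, 50, 100):
    print(f"{n:>3} models -> {judge_calls(n, samples_per_pair=1_000):>9,} judge calls")
```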

From Comparative to Absolute Performance

  • Winning against another model (relative performance) doesn’t always translate to real-world success.
    • Example: Model A resolves 70% of tickets. Model B wins 51% of the time in pairwise evaluation against A.
    • This does not clearly map to how many tickets B will resolve.
  • Need methods to bridge relative judgments (pairwise win rates) → absolute utility (e.g., resolution rate); a toy illustration follows.
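
A toy illustration of why the mapping is underdetermined (all numbers are hypothetical): the same 51% win rate is consistent with very different absolute resolution rates, depending on whether B wins on substance or merely on style.

```python
# Model A resolves 70% of tickets (from the example above). Two hypothetical
# worlds in which an AI judge prefers B's answer on 51% of tickets:
baseline_a = 0.70

scenarios = {
    "B wins mostly on substance": {"b_win_rate": 0.51, "b_resolution_rate": 0.72},
    "B wins mostly on style":     {"b_win_rate": 0.51, "b_resolution_rate": 0.55},
}

for name, s in scenarios.items():
    print(f"{name}: win rate vs A = {s['b_win_rate']:.0%}, "
          f"absolute resolution rate = {s['b_resolution_rate']:.0%} "
          f"(A: {baseline_a:.0%})")
```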

Evaluation Criteria in Production

  • Early ChatGPT-era chatbots were often deployed without clear metrics; many companies still don’t know whether their chatbots improve or harm the user experience.
  • Reinforces need for explicit evaluation frameworks.

Model Selection Workflow (Figure 4–5)

  1. Filter by hard attributes
    • Narrow down based on non-negotiables (e.g., open-source vs. API, deployment constraints, compliance).
  2. Use public benchmarks
    • Leaderboards, academic tests, and latency/cost trade-offs help shortlist candidates (steps 1–2 are sketched after this list).
  3. Run custom experiments
    • Evaluate with your own pipeline, tailored to your objectives (quality, cost, latency).
  4. Continuous monitoring
    • Track production behavior, detect failures, collect feedback, and retrain/update accordingly.
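
A minimal sketch of steps 1–2, assuming made-up model names, attributes, and benchmark scores: filter candidates by hard constraints, then shortlist the survivors by public benchmark score and cost.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    open_source: bool
    max_context: int           # tokens
    cost_per_1m_tokens: float  # USD, illustrative
    benchmark_score: float     # a public leaderboard score, illustrative

# Step 1: hard attributes -- non-negotiable constraints.
def passes_hard_filters(m: Candidate) -> bool:
    return m.open_source and m.max_context >= 32_000

# Step 2: use public benchmarks and cost to shortlist the survivors.
def shortlist(models: list[Candidate], top_k: int = 3) -> list[Candidate]:
    survivors = [m for m in models if passes_hard_filters(m)]
    return sorted(survivors, key=lambda m: (-m.benchmark_score, m.cost_per_1m_tokens))[:top_k]

candidates = [
    Candidate("model-a", open_source=True,  max_context=128_000, cost_per_1m_tokens=0.5, benchmark_score=82.1),
    Candidate("model-b", open_source=False, max_context=200_000, cost_per_1m_tokens=3.0, benchmark_score=88.4),
    Candidate("model-c", open_source=True,  max_context=8_000,   cost_per_1m_tokens=0.2, benchmark_score=74.9),
    Candidate("model-d", open_source=True,  max_context=32_000,  cost_per_1m_tokens=0.8, benchmark_score=79.3),
]

# Steps 3-4 (custom experiments, continuous monitoring) run on this shortlist.
for m in shortlist(candidates):
    print(m.name, m.benchmark_score)
```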