Many AI systems look impressive in a demo and fail quietly in production. This is the “looks good” trap: you test on easy or unrealistic data, use the wrong metric, miss data leakage, or ignore edge cases. This guide lays out a simple evaluation playbook: build a realistic test set, pick metrics that match error costs, run slice analysis, and add lightweight human evaluation.
Quickstart: evaluate an AI feature in 60 minutes
If you’re short on time, do this. It catches the most common failures before you ship.
1) Define success like a product
Write the decision and the cost of mistakes. Evaluation is meaningless without this.
- What action happens when the model says “yes”?
- Which is worse: false positives or false negatives?
- What’s the minimum acceptable quality to ship?
2) Build a “real” test set
No cherry-picked examples. Use production-like data and include edge cases on purpose.
- Sample from recent production inputs
- Include hard cases + “boring” cases
- Freeze it (don’t tune on it)
3) Choose metrics that match costs
Accuracy hides pain. Use precision/recall, F1, PR-AUC, calibration, or task-specific scoring.
- Binary classification: precision/recall + thresholds
- Ranking: NDCG / MRR
- LLM tasks: human or rubric scoring + pass/fail tests
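As a concrete, library-free sketch, precision, recall, and F1 for a binary task can be computed directly from label/prediction pairs; the data below is made up for illustration:

```python
# Minimal sketch of precision/recall/F1 for a binary task (no libraries).
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels and predictions (illustrative only).
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.67 0.67 0.67
```

Libraries like scikit-learn provide these out of the box; the point is that all three numbers come from the same confusion counts, so you can always sanity-check them by hand.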
4) Do slice checks (where it breaks)
Overall metrics can look fine while one group or scenario fails badly.
- Break down by source, device, region, time
- Break down by difficulty (short/long, bright/dark)
- Check rare but high-cost cases
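A slice check can be as simple as grouping accuracy by a metadata field. A sketch with made-up records, where `region` stands in for any slice key (device, source, language, ...):

```python
from collections import defaultdict

# Made-up records; "region" is a placeholder for any slice key.
records = [
    {"region": "us", "label": 1, "pred": 1},
    {"region": "us", "label": 0, "pred": 0},
    {"region": "eu", "label": 1, "pred": 0},
    {"region": "eu", "label": 1, "pred": 0},
]

overall = sum(r["label"] == r["pred"] for r in records) / len(records)
by_slice = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
for r in records:
    by_slice[r["region"]][0] += r["label"] == r["pred"]
    by_slice[r["region"]][1] += 1

print("overall:", overall)  # 0.5 looks merely mediocre...
for region, (correct, total) in sorted(by_slice.items()):
    print(region, correct / total)  # ...but "eu" fails completely (0.0)
```

The same few lines scale to any slice key, which is why there is no excuse to skip this step.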
If performance is “amazing” early, suspect data leakage or an unrealistic split. Great metrics are easy to fake by accident.
Overview: why demos lie (even when nobody is cheating)
The “looks good” trap happens when evaluation doesn’t match real usage. The model isn’t necessarily bad; the test is. Here are the most common ways teams accidentally fool themselves:
Four ways evaluation goes wrong
| Trap | What it looks like | What to do instead |
|---|---|---|
| Unrealistic test set | Clean, centered, “nice” examples | Sample from production + include edge cases |
| Wrong metric | High accuracy but users complain | Choose metrics by cost (precision/recall, ranking, etc.) |
| Leakage / split bugs | Too-good-to-be-true results | Use group/time splits and leakage checks |
| No slice analysis | Works “overall”, fails for a segment | Report by slice + worst-case metrics |
A strong evaluation plan answers two questions: (1) Will this work for real users? (2) What will it break, and how badly?
If you can’t explain how your test set matches production, your metric is just a number—not evidence.
Core concepts: test sets, metrics, and what they miss
1) Test sets: the point is realism, not size
A test set is a frozen slice of reality. It should represent what the model will see after launch. Bigger helps, but realism helps more.
A good test set is…
- Representative of production inputs
- Inclusive of edge cases (rare but important)
- Independent (not used for tuning)
- Well-labeled (or reviewed)
A bad test set is…
- Curated from “best examples”
- Missing failure modes
- Leaky (same user/item appears in train + test)
- Constantly changed when metrics dip
2) Data leakage: the reason metrics feel magical
Leakage means your model is getting clues it won’t have in the real world, or your split allows near-duplicates across sets. Common examples: same user appears in train and test, consecutive video frames split across sets, or a feature accidentally encodes the label.
Leakage checklist (fast)
- Split by group (user, device, session, scene) when relevant
- Split by time if the world changes (news, fraud, behavior)
- Deduplicate near-identical samples
- Scan features for “label in disguise” (IDs, tags, filenames)
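One way to enforce a group split is to bucket each group by a deterministic hash of its ID, so assignment is stable across runs and a group can never straddle the split. A sketch (field names are illustrative):

```python
import hashlib

def split_bucket(group_id: str, test_pct: int = 20) -> str:
    """Deterministically assign a group (e.g. a user) to train or test."""
    h = int(hashlib.md5(group_id.encode()).hexdigest(), 16)
    return "test" if h % 100 < test_pct else "train"

rows = [{"user_id": f"u{i}", "value": i} for i in range(100)]
train = [r for r in rows if split_bucket(r["user_id"]) == "train"]
test = [r for r in rows if split_bucket(r["user_id"]) == "test"]

# No user can ever appear on both sides of the split.
assert not {r["user_id"] for r in train} & {r["user_id"] for r in test}
print(len(train), len(test))
```

Hash-based bucketing also means new data added later lands in the same bucket as its group, so the split stays leak-free as the dataset grows.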
3) Metrics: choose what matches the cost of errors
Metrics are proxies. Choose the proxy that matches real pain. Here’s a quick map of common tasks:
Metric map: what to use when
| Task | Often-used metrics | Watch out for |
|---|---|---|
| Binary classification | Precision, Recall, F1, PR-AUC | Accuracy hides imbalance; threshold matters |
| Multi-class | Macro-F1, per-class recall, confusion matrix | One class can fail while average looks OK |
| Detection (CV) | mAP@IoU, per-size metrics | Small/occluded objects often fail silently |
| Ranking / retrieval | NDCG, MRR, Recall@K | Average can hide worst-case queries |
| Forecasting | MAE, RMSE, MAPE | Outliers dominate RMSE; MAPE breaks near 0 |
| LLM outputs | Rubric scoring, pass@K, unit tests | “Looks fluent” ≠ correct or safe |
4) Thresholds and calibration: “When do we say yes?”
Many systems don’t fail because the model is weak—they fail because the threshold is wrong. A model can be good but still annoy users if it triggers too often.
Choose a threshold based on the cost of false positives vs false negatives—and measure that tradeoff explicitly.
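The tradeoff can be made explicit by sweeping thresholds against an assumed cost per error type. The scores, labels, and costs below are all illustrative (here a missed positive is treated as 5x worse than a false alarm):

```python
# Illustrative model scores, true labels, and assumed error costs.
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.7]
labels = [0,   0,   1,    1,   1,    1,   0,   0]
FP_COST, FN_COST = 1.0, 5.0  # assumption: FN is 5x worse than FP

def expected_cost(threshold):
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp * FP_COST + fn * FN_COST

# Sweep thresholds in steps of 0.05 and keep the cheapest one.
best = min((t / 100 for t in range(0, 101, 5)), key=expected_cost)
print("best threshold:", best, "expected cost:", expected_cost(best))
```

Change the cost ratio and the best threshold moves, which is exactly the point: the threshold is a product decision encoded as a number.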
Step-by-step: an evaluation playbook you can reuse
Use this flow to evaluate models, LLM features, or any AI-based decision system. It’s designed for real shipping teams, not research papers.
Step 1 — Define the decision and the failure costs
Write this sentence: “Given X, predict Y, so we can do Z.” Then list the top failure modes.
- False positive cost: what happens if you trigger when you shouldn’t?
- False negative cost: what happens if you miss when you should trigger?
- Worst-case: what’s the “cannot happen” failure?
Step 2 — Choose the right split (random is not always right)
Random splits can leak information when samples are correlated (users, sessions, scenes, time). Use a split that matches real deployment.
Split strategy guide
| Situation | Use | Why |
|---|---|---|
| User-specific patterns | Group split by user | Prevents “memorizing” user quirks |
| Time changes (fraud, news, behavior) | Time split | Simulates future performance |
| Video / burst photos | Scene/session split | Stops near-duplicate leakage |
| Multiple sources/cameras | Source split + slice checks | Reveals domain shift |
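For the time-split case, the idea is simply “train on the past, test on the most recent window”. A sketch with synthetic daily rows (in practice these would be timestamped production events):

```python
from datetime import date, timedelta

# Synthetic daily rows standing in for timestamped production data.
rows = [{"day": date(2024, 1, 1) + timedelta(days=i), "value": i} for i in range(30)]

cutoff = date(2024, 1, 25)  # everything on/after the cutoff is "the future"
train = [r for r in rows if r["day"] < cutoff]
test = [r for r in rows if r["day"] >= cutoff]

print(len(train), len(test))  # 24 6
assert max(r["day"] for r in train) < min(r["day"] for r in test)
```

The assertion at the end is the whole guarantee: no training example comes from after any test example, which is what a random split silently violates.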
Step 3 — Design a test set that can’t be gamed
Make your test set a mix of typical and difficult samples. Include the failures you fear. If possible, keep a second “final” set that is never touched until release.
Test set composition (practical)
- 70–85% representative “normal” traffic
- 10–25% edge cases (rare but important)
- Small “golden” subset with expert labels
Edge cases to include
- Low quality inputs (noise, blur, short text)
- Ambiguous cases (close calls)
- Out-of-distribution examples
- Adversarial-ish cases (prompt tricks, weird phrasing)
Step 4 — Report more than one number
“Overall score” is useful, but incomplete. Add: confusion matrix, per-class metrics, and worst-slice performance.
Minimum evaluation report
- Primary metric (aligned to cost)
- Threshold and the precision/recall tradeoff
- Worst 3 slices (by drop vs overall)
- Top 20 failure examples with notes
- Leakage checks performed
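Ranking the “worst 3 slices by drop vs overall” can be sketched from per-slice (correct, total) counts; the numbers here are made up:

```python
# Hypothetical per-slice counts: slice -> (correct, total).
slice_stats = {"desktop": (95, 100), "mobile": (80, 100), "tablet": (60, 100)}

overall = sum(c for c, _ in slice_stats.values()) / sum(n for _, n in slice_stats.values())
# Positive drop = this slice is worse than the overall metric.
drops = sorted(
    ((name, overall - correct / total) for name, (correct, total) in slice_stats.items()),
    key=lambda item: item[1],
    reverse=True,
)
worst = [name for name, _ in drops[:3]]
print("overall:", round(overall, 3), "worst slices:", worst)
```

Here the overall score (~0.78) looks fine while the tablet slice sits far below it, which is exactly the failure mode an averaged report hides.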
Step 5 — Add lightweight human evaluation (especially for LLMs)
For generative systems, automatic metrics often miss what users care about: correctness, helpfulness, style, safety, and consistency. A small human eval can be the difference between “demo magic” and “real reliability”.
A simple human eval setup
- 50–200 real prompts/tasks
- Clear rubric (0–2 or 1–5 scale)
- Blind comparison when possible (A vs B)
- Track disagreements → clarify rubric
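The scoring and disagreement tracking above can live in a spreadsheet, but even as code it is only a few lines. A sketch with a hypothetical 0–2 rubric and two raters:

```python
# Hypothetical rubric scores (0-2 scale) from two raters per prompt.
ratings = {
    "prompt_01": {"rater_a": 2, "rater_b": 2},
    "prompt_02": {"rater_a": 0, "rater_b": 2},
    "prompt_03": {"rater_a": 1, "rater_b": 1},
}

# Prompts where raters differ by more than one point flag an unclear rubric.
disagreements = [p for p, scores in ratings.items()
                 if max(scores.values()) - min(scores.values()) > 1]
mean_score = sum(sum(s.values()) / len(s) for s in ratings.values()) / len(ratings)

print(disagreements, round(mean_score, 2))  # ['prompt_02'] 1.33
```

Disagreement is a feature, not a nuisance: each flagged prompt is a place where the rubric needs a clarifying sentence or an example.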
Rubric dimensions that matter
- Correctness (facts, logic)
- Completeness (did it answer?)
- Helpfulness (actionable)
- Safety / policy constraints
- Consistency (same question → similar answer)
Step 6 — Production evaluation: monitor, don’t guess
Real-world evaluation continues after launch. Data shifts, user behavior changes, and silent failures appear. Plan monitoring from day one.
Production monitoring checklist
- Input drift (new sources, new distributions)
- Output drift (confidence shifts, class mix changes)
- Quality signals (sampled human review, user feedback)
- Rollback plan (versioned models + thresholds)
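Input and output drift can be tracked with a simple statistic such as the Population Stability Index (PSI) over binned distributions; a common rule of thumb treats values above roughly 0.2 as drift worth investigating. A minimal sketch with made-up bin counts:

```python
import math

def psi(ref_counts, live_counts, eps=1e-6):
    """Population Stability Index between a reference and a live histogram."""
    ref_total, live_total = sum(ref_counts), sum(live_counts)
    score = 0.0
    for r, l in zip(ref_counts, live_counts):
        p = max(r / ref_total, eps)   # reference bin share
        q = max(l / live_total, eps)  # live bin share
        score += (q - p) * math.log(q / p)
    return score

# Made-up histograms: one barely moved, one reversed.
stable = psi([100, 200, 300], [105, 195, 310])
shifted = psi([100, 200, 300], [300, 200, 100])
print(round(stable, 4), round(shifted, 4))
```

The same function works for input features and for output scores, so one small utility covers both drift checks on the list above.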
The best AI teams win by being great at evaluation, not by finding magical architectures.
Common mistakes (and how to avoid them)
If evaluation feels confusing, it’s usually because one of these is happening.
Mistake 1 — Optimizing the test set
If you keep adjusting what’s in “test”, your metric becomes a moving target.
- Fix: freeze test; tune on validation only
- Fix: keep a final “release” set untouched
Mistake 2 — Relying on accuracy
Accuracy can be high even when the model fails the cases you care about.
- Fix: use precision/recall + thresholds
- Fix: report per-class and worst-slice metrics
Mistake 3 — Ignoring leakage
Leakage is the easiest way to get “great results” that vanish in production.
- Fix: group/time splits
- Fix: dedupe and inspect features for label proxies
Mistake 4 — No error analysis
Numbers tell you “how much”. Errors tell you “why”.
- Fix: review the top 20 failures every iteration
- Fix: turn failure themes into new data/slices
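Turning failure reviews into new slices can start as a simple tally of theme tags attached during review (the tags below are illustrative):

```python
from collections import Counter

# Illustrative failure log: each reviewed failure gets a theme tag.
failures = [
    {"id": 1, "theme": "short-input"},
    {"id": 2, "theme": "non-english"},
    {"id": 3, "theme": "short-input"},
    {"id": 4, "theme": "short-input"},
    {"id": 5, "theme": "ambiguous"},
]

themes = Counter(f["theme"] for f in failures)
print(themes.most_common(3))  # the top themes become new eval slices
```

Once a theme dominates the tally, it earns its own slice in the test set, closing the loop between error analysis and evaluation.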
If you can’t reproduce your metric improvement on a frozen test set, it wasn’t an improvement.
FAQ: evaluation questions people search for
What is a test set in machine learning?
A test set is a held-out dataset used to estimate how a model will perform on new, unseen data. It should be representative of production, properly split to avoid leakage, and kept frozen so results are comparable over time.
Why is accuracy a bad metric for many AI problems?
Accuracy can hide class imbalance and doesn’t reflect the cost of different errors. For example, a fraud model with 99% accuracy can still miss most fraud if fraud is rare. Precision/recall, F1, and threshold analysis usually describe performance better.
What is data leakage and how do I detect it?
Data leakage happens when information from the test set influences training—directly or indirectly. Detect it by using group/time splits, deduplicating near-identical samples, inspecting features for label proxies, and being suspicious of “too good” metrics early.
What is slice analysis?
Slice analysis means reporting performance on subsets of data (device type, source, language, lighting, region, class, etc.). It helps you find hidden failures that average metrics hide.
How do I evaluate LLM outputs reliably?
Use a mix of: (1) a curated prompt set that matches real use, (2) a clear rubric or pass/fail tests, (3) blind A/B comparisons when you can, and (4) tracking failure modes (hallucinations, missing steps, unsafe outputs).
When is a model “good enough” to ship?
When it meets a quality bar tied to real costs: acceptable precision/recall at a chosen threshold, stable performance across slices, and a plan for monitoring + rollback. Shipping is a product decision, not a metric decision.
Cheatsheet: the “don’t fool yourself” evaluation checklist
Before training
- Define the product decision + error costs
- Pick split strategy (random vs group vs time)
- Write down primary metric + threshold plan
- Design slices you must not fail
After training
- Evaluate on frozen test set (not validation)
- Report per-class metrics + confusion matrix
- Run slice analysis (best + worst)
- Review top failures and label issues
Quick metric picks
- Binary tasks: Precision/Recall + PR-AUC
- Multi-class: Macro-F1 + per-class recall
- Ranking: NDCG / MRR + Recall@K
- Detection: mAP + small/occluded breakdown
- LLM: rubric scoring + pass/fail tests
Evaluation isn’t “a metric”. It’s evidence that your system will work on the messy reality you’ll actually ship into.
Wrap-up
The “looks good” trap is normal—and avoidable. If you build a realistic test set, choose metrics by cost, run slice analysis, and review failures, your AI work becomes repeatable. Demos are fun, but evaluation is what makes quality real.
- Create a frozen test set from real production inputs.
- Pick one primary metric + one safety metric (worst-slice or “cannot fail” slice).
- Run a short error review and write down the top 3 failure themes.