The eval gap. Why most AI features ship without a test suite.

Almost every AI project we inherit, we inherit without an evaluation suite. Not bad evals. Not stale evals. No evals. The feature shipped because someone tried a handful of prompts, the output looked plausible, and the Friday deploy went green.

Then the provider upgrades the model. Or a prompt gets a small edit. Or the data distribution shifts under a new marketing campaign. The output goes sideways, and nobody notices for two weeks, because nobody is checking. That is the eval gap, and closing it is the single biggest payoff most teams can get from their AI systems this quarter.

Three levels of evaluation rigour

In descending order of how many teams we see operating at each level.

Level 1: no evals. The engineer runs the prompt three times in a playground, the output looks good, they ship. Reliability is "it worked when I tried it." A new model version drops? Fingers crossed.

Level 2: vibes-based evals. There is a notebook with ten test cases the engineer runs before deploys. Nobody else can run it, because it is wired to their laptop and their personal API key. It is not in CI. It answers the question "did I break something obvious", but nothing beyond that.

Level 3: a real suite. A held-out dataset of 50 to 500 cases, a scoring function per task, and a CI job that runs them on every change to prompts, tools, or model versions. Passing thresholds are set. A regression blocks the deploy. When a provider ships a new model, the team runs the suite against it the same day.
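
A minimal sketch of the "regression blocks the deploy" step, assuming the eval run has already produced one score per task. The task names, the threshold values, and the function names are hypothetical, chosen only for illustration.

```python
# Hypothetical minimum score per task; real thresholds come from your own baselines.
THRESHOLDS = {"ticket_classification": 0.90, "invoice_extraction": 0.95}


def failing_tasks(scores, thresholds):
    """Return tasks whose score is missing or below its threshold."""
    failures = []
    for task, minimum in thresholds.items():
        score = scores.get(task)
        if score is None or score < minimum:
            failures.append(task)
    return failures


def gate(scores, thresholds=THRESHOLDS):
    """Return a process exit code: 0 lets the deploy through, 1 blocks it."""
    failures = failing_tasks(scores, thresholds)
    for task in failures:
        print(f"FAIL {task}: score={scores.get(task)} threshold={thresholds[task]}")
    return 1 if failures else 0


# Illustrative scores; in CI these would come from the eval run itself.
exit_code = gate({"ticket_classification": 0.93, "invoice_extraction": 0.91})
# A CI entry point would finish with sys.exit(exit_code).
```

A missing score is treated as a failure on purpose, so a task that silently drops out of the run cannot pass the gate.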

Nine out of ten teams we audit are at level 1 or 2. The cost of moving to level 3 is usually two weeks of focused work, maybe three. The return is that the system stops surprising everyone.

What a real eval actually looks like

The template depends on the task. Here are three we run across client builds.

Classification tasks. A CSV of inputs and expected labels. Score is accuracy, or macro F1 if the classes are imbalanced. We keep a confusion matrix in the commit notes when accuracy moves more than two percentage points between runs. The test set is frozen. New examples go into a staging set that gets promoted on a schedule, not per-change, so that you cannot accidentally overfit the prompt to the eval.
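
The scoring for this template fits in a few standard-library functions. The sketch below assumes a CSV with columns named "input" and "expected" and uses a stub in place of the real model call; the column names, the sample rows, and the stub are all illustrative assumptions.

```python
import csv
import io
from collections import Counter


def load_cases(csv_text):
    """Parse rows with the assumed columns 'input' and 'expected'."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def accuracy(expected, predicted):
    correct = sum(e == p for e, p in zip(expected, predicted))
    return correct / len(expected)


def macro_f1(expected, predicted):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    pairs = list(zip(expected, predicted))
    f1_scores = []
    for label in sorted(set(expected) | set(predicted)):
        tp = sum(e == label and p == label for e, p in pairs)
        fp = sum(e != label and p == label for e, p in pairs)
        fn = sum(e == label and p != label for e, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def confusion_matrix(expected, predicted):
    """Counts keyed by (expected_label, predicted_label)."""
    return Counter(zip(expected, predicted))


# Illustrative frozen test set; a real one would live in a file under version control.
CSV_TEXT = """input,expected
win a free prize,spam
cheap pills online,spam
meeting moved to 3pm,ham
invoice attached,ham
lunch tomorrow?,ham
see notes from standup,ham
"""


def stub_classifier(text):
    # Stand-in for the prompt plus model call.
    return "spam" if "free" in text else "ham"


cases = load_cases(CSV_TEXT)
expected = [case["expected"] for case in cases]
predicted = [stub_classifier(case["input"]) for case in cases]
print(accuracy(expected, predicted), macro_f1(expected, predicted))
print(confusion_matrix(expected, predicted))
```

On this toy set accuracy is 5/6 while macro F1 is 7/9, because the one missed spam case weighs more heavily when each class counts equally.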

Structured generation tasks. A schema, an input, and a set of invariants the output must satisfy. Not string equality. Invariants like "every line item has a positive quantity", "the total matches the line sum", "the currency is one of the three allowed values". We write those as boolean predicates and score pass rate. A schema violation is a failure. Partial credit is rare.
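
As a rough illustration, the three invariants quoted above can be written as predicates over parsed JSON. The field names ("line_items", "quantity", "amount", "total", "currency"), the three currency codes, and the rounding tolerance are assumptions made for this sketch, not a fixed schema.

```python
import json

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # assumed set of three allowed values


def positive_quantities(doc):
    return all(item["quantity"] > 0 for item in doc["line_items"])


def total_matches_line_sum(doc):
    line_sum = sum(item["amount"] for item in doc["line_items"])
    return abs(line_sum - doc["total"]) < 0.005  # assumed tolerance for float rounding


def currency_allowed(doc):
    return doc["currency"] in ALLOWED_CURRENCIES


INVARIANTS = [positive_quantities, total_matches_line_sum, currency_allowed]


def passes(raw_output):
    """True only if the output parses and every invariant holds. No partial credit."""
    try:
        doc = json.loads(raw_output)
        return all(check(doc) for check in INVARIANTS)
    except (ValueError, KeyError, TypeError):
        # Unparseable output or a missing or mistyped field is a schema violation.
        return False


def pass_rate(raw_outputs):
    return sum(passes(output) for output in raw_outputs) / len(raw_outputs)


good = json.dumps({
    "currency": "EUR",
    "total": 30.0,
    "line_items": [{"quantity": 2, "amount": 20.0}, {"quantity": 1, "amount": 10.0}],
})
bad_currency = good.replace("EUR", "JPY")
bad_quantity = good.replace('"quantity": 2', '"quantity": -2')
print(pass_rate([good, bad_currency, bad_quantity, "not json at all"]))
```

Each predicate stays small enough to read in review, and adding an invariant is one function plus one list entry.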

Free-form generation tasks. Two scores. An automated one from a judge model, and a manual one, where a person applies the rubric to every tenth run. The judge scores along dimensions we pre-commit to: correctness, tone, completeness, hedging. The manual sample is eight minutes a week and catches the judge drifting away from our rubric. Without that sample, we have watched judge scores silently decouple from reality within a month.
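
One way to make the drift check concrete is to compare judge and human scores per dimension on the sampled runs. This sketch assumes both score on the same 1 to 5 scale and that a mean gap above one point counts as drift; the scale, the tolerance, and the function names are illustrative assumptions.

```python
DIMENSIONS = ["correctness", "tone", "completeness", "hedging"]


def needs_manual_review(run_number, every=10):
    """True on every tenth run, when a person scores the same outputs as the judge."""
    return run_number % every == 0


def mean_gap(judge_scores, human_scores):
    """Average absolute judge-vs-human difference for each rubric dimension."""
    gaps = {}
    for dim in DIMENSIONS:
        diffs = [abs(j[dim] - h[dim]) for j, h in zip(judge_scores, human_scores)]
        gaps[dim] = sum(diffs) / len(diffs)
    return gaps


def drifted_dimensions(gaps, tolerance=1.0):
    """Dimensions where the judge has moved away from the human rubric."""
    return sorted(dim for dim, gap in gaps.items() if gap > tolerance)


# Illustrative scores for two sampled outputs.
judge = [
    {"correctness": 5, "tone": 4, "completeness": 4, "hedging": 3},
    {"correctness": 5, "tone": 4, "completeness": 5, "hedging": 4},
]
human = [
    {"correctness": 3, "tone": 4, "completeness": 4, "hedging": 3},
    {"correctness": 3, "tone": 3, "completeness": 5, "hedging": 4},
]
gaps = mean_gap(judge, human)
print(gaps, drifted_dimensions(gaps))
```

Here the judge is two points more generous than the human on correctness, which is the pattern of a judge whose scores look great while the output does not.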

The test set is the real artefact. It should look like production traffic. If it does not, you are evaluating the wrong thing.

The model-drift check, which is cheap and nobody runs

Providers ship new model versions quietly. Sometimes they change defaults. Sometimes they deprecate a version and silently route traffic to a successor. Your eval is how you know whether the new version is better, the same, or worse.

We run the full eval suite against the current production model, the newly released model, and the one behind it. Three columns, one table. If the new model is worse on our tasks, we pin the version until we understand why. If it is better, we switch.
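
The three-column table and the pin-or-switch decision can be sketched as below, assuming the suite has already produced one score per task per model. The model labels, task names, scores, and the one-point noise margin are placeholders, not real results.

```python
def comparison_table(results, tasks):
    """results maps a model label to {task: score}; returns printable lines."""
    models = list(results)
    lines = ["task".ljust(24) + "".join(name.ljust(14) for name in models)]
    for task in tasks:
        cells = "".join(f"{results[name][task]:<14.3f}" for name in models)
        lines.append(task.ljust(24) + cells)
    return lines


def regressions(current, candidate, margin=0.01):
    """Tasks where the candidate is worse than current by more than the margin."""
    return sorted(
        task for task, score in current.items()
        if candidate.get(task, 0.0) < score - margin
    )


# Placeholder labels and scores for illustration only.
results = {
    "model-prod": {"classification": 0.92, "extraction": 0.97},
    "model-new": {"classification": 0.94, "extraction": 0.93},
    "model-older": {"classification": 0.90, "extraction": 0.96},
}
for line in comparison_table(results, ["classification", "extraction"]):
    print(line)

worse = regressions(results["model-prod"], results["model-new"])
decision = "pin current version" if worse else "switch"
print(decision, worse)
```

In this made-up example the new model improves classification but drops four points on extraction, so the version stays pinned.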

Anthropic, OpenAI, and Google have all released upgrades in the last year that regressed tasks our evals caught. In every case the regression was a narrow one, not a catastrophe, but it would have shipped to a customer who was measuring outcomes only after the fact. You do not want the customer to be your eval.

Build the eval before the prompt

This is the discipline that makes the rest work. On new features, we write five to ten evaluation cases by hand, with expected outputs, before the prompt exists. The prompt is then engineered against the eval, not against an internal feeling about what "looks good".

Two benefits. The prompt converges faster because the target is explicit. The team has a reusable contract for the feature that survives staffing changes, because the contract is a test file, not a Slack thread.
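
A test file written before the prompt can be very small. The sketch below assumes a hypothetical ticket-routing feature scored by exact label match; the cases, the labels, and the `generate` callable are invented for illustration.

```python
# Hand-written cases, committed before any prompt exists.
CASES = [
    {"input": "My card was charged twice for one order", "expected": "billing"},
    {"input": "The app crashes when I open settings", "expected": "bug"},
    {"input": "How do I export my data to CSV?", "expected": "how_to"},
    {"input": "Please delete my account and all data", "expected": "account"},
    {"input": "Charged after I cancelled my subscription", "expected": "billing"},
]


def failed_cases(generate, cases):
    """generate is whatever callable ends up wrapping the prompt and model."""
    return [
        case for case in cases
        if generate(case["input"]).strip().lower() != case["expected"]
    ]


def not_built_yet(text):
    # Placeholder: the prompt does not exist yet, so every case should fail.
    return ""


print(f"{len(failed_cases(not_built_yet, CASES))} of {len(CASES)} cases failing")
```

Starting from five failing cases gives the prompt work an explicit target, and the same file is what a new team member reads to learn what the feature is supposed to do.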

The rule: if we cannot write an eval for the task, we do not have the task defined well enough to build it. That rule has saved us more scope-creep conversations than any other single practice.

The things that do not work

For the record, so we do not have to explain it every month.

LLM-as-judge without a rubric. The judge will drift, and the scores will start looking great the week before everything breaks.
Eyeballing through a dashboard. A dashboard is not an eval. It is a vibe with better styling.
Evals that run only at release time. If your eval job takes an hour, it will not run per-commit, and it will not catch regressions until the release window, when it is too late to fix calmly.
Raw production logs as your eval set. Logs are biased toward the easy queries you already handle. The eval set should resemble production traffic, but it needs the cases you got wrong, not only the cases you got right.

Cost

The eval infrastructure is usually 10 to 20 per cent of an AI feature's code, and it is the portion teams cut first. In our experience, an hour of eval code returns roughly the value of three hours of prompt tuning, because tuning gains only compound when a suite confirms that each change did not undo the last. When we quote fixed-price builds, we line-item the evals. Clients who question the line item are clients we are careful with.

The discipline is not interesting. It is just that the teams that skip it get surprised by production, and the teams that keep it do not. The difference compounds. A year in, your AI system either has a calm release cadence or a chaotic one, and the eval suite is what decides which.