The playground is lying to you.
You wrote a system prompt, tested a few examples, the results looked good, so you shipped. Three weeks later, a user finds an edge case where the model outputs something embarrassing. You tweak the prompt. The edge case improves. Another edge case appears. You tweak again. The first edge case gets worse.
This is prompt engineering without evals. It's whack-a-mole.
What an eval actually is
An eval is a dataset of inputs with expected outputs or grading criteria, combined with a runner that executes your AI feature against that dataset and reports pass/fail or a score.
That's the whole thing. It's not a new concept; it's the same idea as a test suite: a collection of cases that lets you verify your system's behavior without running it manually against production. The difference is that LLM outputs are probabilistic, so "expected output" requires more nuance than a simple equality check. That's a solved problem: you score outputs against criteria or a rubric instead of comparing strings.
Why teams skip them
The friction is real. Building an eval requires deciding what "correct" looks like, which is hard when your AI feature has subjective outputs. Take a customer support assistant: is the response good? Is it helpful? Is it appropriately concise? These aren't binary criteria.
So teams default to manual testing: read a few outputs, make a judgment call, ship. This works for a single developer who understands the system deeply. It fails when the system scales, the team grows, or the underlying model gets swapped.
The other reason is timeline pressure. Evals feel like extra work before the feature is done. They're actually the work that determines whether the feature stays done.
A minimal setup
For a text-generation feature, a working eval setup has three components.
A dataset. Twenty to fifty representative inputs collected from your actual use case — not invented examples. Include your golden cases (inputs where you know exactly what good looks like) and your edge cases (inputs where the feature has failed before, or where failure would be expensive).
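One way to store such a dataset is one JSON object per line. The sketch below is only an illustration, not a prescribed format: the field names (`id`, `kind`, `input`, `must_mention`) and the names `EvalCase` and `load_cases` are assumptions made up for this example, and the two sample rows are placeholders for inputs you would collect from real usage.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    kind: str                      # "golden" or "edge"
    input_text: str
    must_mention: Tuple[str, ...]  # phrases a good output should contain


def load_cases(jsonl_text: str) -> List[EvalCase]:
    """Parse one JSON object per line into EvalCase records."""
    cases = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        if row["kind"] not in ("golden", "edge"):
            raise ValueError(f"unknown kind for case {row['id']}: {row['kind']}")
        cases.append(
            EvalCase(
                case_id=row["id"],
                kind=row["kind"],
                input_text=row["input"],
                must_mention=tuple(row.get("must_mention", [])),
            )
        )
    return cases


# Placeholder rows; a real set would hold twenty to fifty collected inputs.
SAMPLE = """
{"id": "refund-01", "kind": "golden", "input": "How do I get a refund?", "must_mention": ["refund"]}
{"id": "empty-01", "kind": "edge", "input": "", "must_mention": []}
"""

cases = load_cases(SAMPLE)
print(len(cases), "cases loaded")
```

Tagging each case as golden or edge lets you report pass rates for the two groups separately later on.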
A scorer. A function that takes an output and returns a pass/fail or numeric score. Scorers range from simple — is the JSON valid? does the output mention the product name correctly? — to complex — does the output match the expected intent? For nuanced quality judgments, an LLM-as-judge pattern works well: a second model evaluates the first model's output against a rubric you define. This sounds circular but is reliable in practice when the rubric is specific.
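The range from simple to complex scorers might look like the sketch below. Everything here is illustrative: `ask_judge` is a hypothetical callable that sends a prompt to whatever second model you use and returns its text reply, and the rubric wording is an assumption, not a recommended prompt. No real model API is called.

```python
import json
from typing import Callable


def is_valid_json(output: str) -> bool:
    """Simple scorer: does the output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def mentions_product(output: str, product_name: str) -> bool:
    """Simple scorer: exact, case-sensitive match on the product name."""
    return product_name in output


# Illustrative rubric; a real one should be specific to your feature.
RUBRIC = (
    "You are grading an assistant's answer.\n"
    "Source material:\n{source}\n\n"
    "Answer to grade:\n{output}\n\n"
    "Reply PASS if every claim in the answer is supported by the source, "
    "otherwise reply FAIL."
)


def judge_score(output: str, source: str, ask_judge: Callable[[str], str]) -> bool:
    """LLM-as-judge scorer: ask_judge is a hypothetical model-calling function."""
    prompt = RUBRIC.format(source=source, output=output)
    verdict = ask_judge(prompt).strip().upper()
    return verdict.startswith("PASS")


# Stand-in judge so the sketch runs without a model; it always replies PASS.
def fake_judge(prompt: str) -> str:
    return "PASS"


print(is_valid_json('{"status": "ok"}'))
print(mentions_product("Try Acme Pro today", "Acme Pro"))
print(judge_score("Rent is due monthly.", "Rent is due monthly.", fake_judge))
```

Forcing the judge to reply with a fixed token such as PASS or FAIL keeps the verdict easy to parse.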
A runner. A script that runs the dataset through your feature, calls the scorer on each result, and reports the aggregate pass rate. Run it in CI. Failing evals block deployment.
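A runner can be very small. In the sketch below, `feature` and `scorer` are hypothetical callables standing in for your AI feature and your scorer, the toy cases are placeholders, and the 0.9 threshold is an arbitrary assumption rather than a recommended value.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str]  # (input, phrase the output is expected to contain)


def run_evals(
    cases: List[Case],
    feature: Callable[[str], str],
    scorer: Callable[[str, str], bool],
) -> Tuple[float, List[str]]:
    """Run every case through the feature and return (pass_rate, failing inputs)."""
    failures = []
    for case_input, expected in cases:
        output = feature(case_input)
        if not scorer(output, expected):
            failures.append(case_input)
    pass_rate = (len(cases) - len(failures)) / len(cases) if cases else 0.0
    return pass_rate, failures


def gate(pass_rate: float, threshold: float) -> int:
    """Return a process exit code: 0 lets the deploy proceed, 1 blocks it."""
    return 0 if pass_rate >= threshold else 1


# Toy stand-ins so the sketch runs without a model.
def toy_feature(text: str) -> str:
    return text.upper()


def contains(output: str, expected: str) -> bool:
    return expected in output


toy_cases = [("hello", "HELLO"), ("world", "WORLD"), ("", "X")]

pass_rate, failures = run_evals(toy_cases, toy_feature, contains)
exit_code = gate(pass_rate, threshold=0.9)
print(f"pass rate {pass_rate:.0%}, failing inputs: {failures}, exit code {exit_code}")
```

In CI, the script would end by passing `exit_code` to `sys.exit`, so that a pass rate below the threshold fails the pipeline step.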
A concrete example
We built a document summarization feature for a legal tech client. Before shipping, we created thirty eval cases from actual legal documents: simple agreements, dense multi-party contracts, and a handful of edge cases with atypical structure.
The scorer checked three things: does the summary contain only information present in the source (hallucination check via LLM-as-judge), is the summary shorter than the original (format check), and does the summary include the key obligations a human reviewer flagged (task-specific quality check).
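A three-part scorer of this shape could be sketched as follows. This is not the code from that project: `is_grounded` is a hypothetical callable standing in for the LLM-as-judge hallucination check, and matching obligations by case-insensitive substring is a crude assumption used only to keep the example runnable.

```python
from typing import Callable, Dict, List


def score_summary(
    source: str,
    summary: str,
    key_obligations: List[str],
    is_grounded: Callable[[str, str], bool],
) -> Dict[str, bool]:
    """Run the three checks and report each one plus an overall verdict."""
    summary_lower = summary.lower()
    checks = {
        # Hallucination check, delegated to a judge (hypothetical callable).
        "no_hallucination": is_grounded(source, summary),
        # Format check: the summary must be shorter than the original.
        "shorter_than_source": len(summary) < len(source),
        # Task-specific check: every flagged obligation appears in the summary.
        "covers_obligations": all(
            obligation.lower() in summary_lower for obligation in key_obligations
        ),
    }
    checks["passed"] = all(checks.values())
    return checks


# Stand-in judge so the sketch runs without a model; it always approves.
def always_grounded(source: str, summary: str) -> bool:
    return True


source_doc = (
    "The Tenant shall pay rent monthly on the first day of each month. "
    "The Landlord shall maintain the premises in good repair."
)
summary_text = "Tenant must pay rent monthly; Landlord must maintain the premises."

result = score_summary(
    source_doc, summary_text, ["pay rent", "maintain the premises"], always_grounded
)
print(result)
```

Returning the individual checks alongside the overall verdict shows which of the three criteria a failing case tripped.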
Initial pass rate: 71%. After two rounds of prompt iteration driven by the failing eval cases, we shipped at 91%.
That 71% was invisible in the playground. When you pick your own test inputs and read them one at a time, you unconsciously select the cases that look good. The eval set is representative. The playground is curated.
The eval that runs before every deploy
Once an eval set exists, the workflow changes structurally:
- Change a prompt, run evals, see if pass rate went up or down before touching production
- Swap model versions, run evals, know immediately whether quality regressed
- A user reports a bug, add their case to the eval set, write a fix, confirm it passes, ship with confidence
This is the same workflow as testing regular code. The only adjustment is that outputs are scored against criteria rather than matched exactly, which, again, is a solved problem.
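The prompt-change and model-swap steps in the list above come down to comparing two eval runs case by case. A minimal sketch, assuming each run is stored as a mapping from case id to pass/fail (the case ids and the `compare_runs` name are invented for illustration):

```python
from typing import Dict, List


def pass_rate(results: Dict[str, bool], case_ids: List[str]) -> float:
    """Fraction of the given cases that passed."""
    return sum(results[c] for c in case_ids) / len(case_ids) if case_ids else 0.0


def compare_runs(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> dict:
    """Compare two runs on the cases they share and list what changed."""
    shared = sorted(set(baseline) & set(candidate))
    return {
        "baseline_rate": pass_rate(baseline, shared),
        "candidate_rate": pass_rate(candidate, shared),
        # Passed before the change, fails after it.
        "regressed": [c for c in shared if baseline[c] and not candidate[c]],
        # Failed before the change, passes after it.
        "fixed": [c for c in shared if not baseline[c] and candidate[c]],
    }


# Placeholder results for the old prompt and the new prompt.
old_prompt = {"contract-01": True, "contract-02": False, "contract-03": True, "contract-04": True}
new_prompt = {"contract-01": True, "contract-02": True, "contract-03": False, "contract-04": True}

diff = compare_runs(old_prompt, new_prompt)
print(diff)
```

In this toy data the aggregate pass rate is unchanged, but one case was fixed and another regressed. That is the whack-a-mole pattern, and a per-case diff makes it visible before the change reaches production.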
The compounding effect matters here. An eval set that starts at thirty cases grows with every bug report and edge case discovered in production. Eighteen months in, you have a dataset that systematically captures everything your feature has ever failed at. That's a significant engineering asset that manual testing can never produce.
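Growing the set can be as simple as appending a line to a file each time a bug is reported. The sketch below assumes a JSONL file with one case per line; the file name, field names, and `add_regression_case` helper are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def add_regression_case(path: Path, case_id: str, input_text: str, note: str) -> bool:
    """Append a reported failure to the eval set; skip ids that already exist."""
    existing_ids = set()
    if path.exists():
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                existing_ids.add(json.loads(line)["id"])
    if case_id in existing_ids:
        return False  # already captured, nothing to add
    record = {"id": case_id, "kind": "edge", "input": input_text, "note": note}
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return True


# Demo against a temporary file so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    eval_file = Path(tmp_dir) / "evals.jsonl"
    first_add = add_regression_case(
        eval_file, "bug-001", "Summarize this two-line NDA", "summary was longer than source"
    )
    second_add = add_regression_case(
        eval_file, "bug-001", "Summarize this two-line NDA", "duplicate report"
    )
    line_count = len(eval_file.read_text(encoding="utf-8").splitlines())

print(first_add, second_add, line_count)
```

Keeping a short note with each case records why it was added, which helps when someone later asks whether a case is still relevant.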
What a studio without evals is actually telling you
Ask any team building your AI features: how do you know if a prompt change made things better or worse? If the answer is "we test it manually" or "we try a few examples in the playground," that's not a system. It's hope with extra steps.
A studio that ships AI features without evals is operating on vibes. They may have excellent vibes: good taste, experienced engineers who catch obvious failures. But they have no regression protection, no quantified quality baseline, and no way for a new team member to evaluate changes rigorously.
The presence of evals signals something broader about engineering discipline. Teams that build eval harnesses have thought carefully about what "correct" means for their feature, have instrumented their pipelines to support testing, and have built a feedback loop that catches regressions before users do. That discipline shows up everywhere else in the codebase.
AI features that work beautifully in demos and degrade in production almost always lack this. The demo was handpicked. Production is everything else.