A Pretty Accuracy Number Hid Dozens of Money-Moving Errors — How to Read the Eval to Ship

Post #4 of the After Launch series. The first three: which rows, how many, how reliably labeled. This one is the last step — turning labeled samples into a launch decision that doesn't fool you. 中文版:一个漂亮的做对率,盖住了几十个会动钱的错。
With reliable labels and each intent labeled deeply enough, you reach the last step: turning those labeled samples into a "can it ship / can we ramp" decision. This is the step where people trip.
I ran the money-moving scenario for a retail customer-service agent — refunds, cancels, address changes, shipment interceptions, the writes that move real money. At the launch review, the vendor brought a pretty overall accuracy and requested a rollout. I did one thing: pull the money-moving intents out of that total and compute them on their own.
The moment they were pulled out, the number was ugly: the money-moving intents' wrong-action rate was alarming, dozens of money-touching errors sat there the whole time — and the pretty overall accuracy had diluted them into the tens of thousands of harmless "where's my order" rows, looking perfectly fine.
The money-moving errors were hidden by one blended number. That's the first lesson of a launch decision: a vague "accuracy" or "self-serve rate" is the best hiding place for money-losing errors.
Money-moving errors need their own gate — on the wrong-action rate
A launch decision can't ride on one blended number — money-moving writes must be pulled out on their own, watched by their wrong-action rate, at a far stricter threshold than consult intents.
Why a total lies: money-moving writes are a fraction of a percent of all traffic (the long tail from post #1). Blend them with tens of thousands of read-only queries into one accuracy figure and even if the money-moving intents are a disaster, weighted into the total they move only the second decimal. You nod at a 95% overall while dozens of wrong-refund, mishandled-compliance orders sit in that fraction of a percent.
A wrong consult answer? The customer asks again. A wrong money-moving action? Real money lost, maybe a compliance incident. So money-moving intents get their own track and their own gate, watched not on "accuracy" but on the wrong-action rate (the share of conversations where a money-moving action was executed incorrectly), set at zero tolerance — not "accuracy over 90%" but "wrong-action rate pushed below some very low line."
One blended accuracy dilutes the money-moving track's crash into tens of thousands of read-only queries — two tracks, two costs of error, must have two different gates.Detection move: whenever someone requests launch with a blended accuracy / self-serve rate, ask one thing — pull the money-moving writes out on their own, what's the wrong-action rate? "Can't say" or "never computed it separately" means the request can't be approved.
And some of those dozens of errors are old debt from before the fix
Ready isn't whether the point estimate cleared — it's whether the confidence-interval bound clears the threshold, and you have to surgically cut pre-fix old data per scenario, or you can't see the current true level nor prove the fix worked.
Those dozens of errors hide one more trap: some of them are old wrong answers left by the bug you just fixed last week. Accuracy = correct / total; if the denominator still holds pre-fix old debt, an already-fixed agent gets dragged to "not ready" by its own history — you either block an agent that should ramp, or you can't tell whether the fix did anything.
The remedy is a per-scenario version cut: a scenario that got a bug fix has its pre-fix data cut from the denominator — only that scenario, nothing else touched. A global cut would delete valid labels from unaffected scenarios; a surgical cut washes out the old debt without collateral damage. Then layer the Wilson interval on top, not the point estimate:
wrong_action_rate(money_intent, channel) = mis-executed convos / total convos
where convo.time ≥ that scenario's latest fix version-cut # cut old debt per scenario, nothing else
verdict: ready only when the Wilson UPPER bound is below the red line # read the interval, not the point
Detection move: ask — "does this accuracy's denominator cut pre-fix data? Global, or per-scenario?"
Launch isn't a one-shot ship — it's a progressive rollout
A launch decision's output isn't "ship or not," it's "which channel, at what volume" — each channel gated on its own, ramped only when it clears.
Launch isn't a switch, it's a rollout curve: 10% canary → 50% → full, each step its own gate. And as post #1 noted, the channel is a real variable in accuracy — private clearing doesn't mean the public channel clears. So the rollout is per-channel:
for channel in channels:
if money_wrong_action_rate(channel) not below red line: → don't ramp, go fix the agent
elif accuracy_lower_bound(channel) ≥ threshold: → ramp to the next step (10%→50%→full)
else: → hold this step, keep labeling toward a conclusion
The worst case is then just one channel stuck at 10% canary — not one pretty total pushing all channels to full together and then crashing on some channel's money-moving intent.
Three things you can do this week
- Pull the money-moving writes out of the total, compute the wrong-action rate on their own. Any blended accuracy / self-serve rate — split into "money-moving vs consult" first; watch money-moving on the wrong-action rate at zero tolerance.
- Do a per-scenario version cut on the accuracy denominator. For scenarios that got a fix, cut pre-fix data from the denominator, only that scenario; judge on the Wilson lower/upper bound, not the point estimate.
- Turn launch into a per-channel progressive rollout. Don't push full with one total in one shot; each channel gated on its own (lower bound clears + wrong-action rate held), ramped only when it clears — 10%→50%→full.
A launch decision isn't scoring the agent — it answers a harder question: at which channel, at what volume, how confident am I that it won't crash where money moves. Split it into "wrong-action its own gate + lower bound + cut old debt + per-channel ramp" and you finally have a gate that doesn't fool you.
Which rows, how many, how reliably labeled, how to judge launch — this After Launch mechanism now closes a loop that honestly measures an agent. But launch is only the beginning: an agent that clears today can quietly slide six weeks later. The next post covers post-launch silent degradation — how to catch it before it gets dumb.
If this turned your launch decision from "nod at one total" into "wrong-action its own gate + lower bound + cut old debt + per-channel ramp," send me the keyword "RAMP KIT" and I'll share the money-moving dual-track gate + per-scenario cut + progressive rollout template.
Subscribe for updates
Get the latest AI engineering posts delivered to your inbox.
