Your Dashboards Are Green While the Agent Quietly Gets Dumber — Post-Launch Silent Drift

Yaqin Hei·July 5, 2026·13 min read

Agentic AI After Launch Silent Degradation Drift Concept Drift Monitoring Evaluation Customer Service Agent Enterprise AI

Your Dashboards Are Green While the Agent Quietly Gets Dumber — Post-Launch Silent Drift

Post #5 of the After Launch series. The first four got the agent shipped: which rows, how many, how reliably labeled, how to judge launch. This one is about after — how an agent that clears today quietly gets dumber with no alarm at all. 中文版：看板全绿，agent 却在悄悄变笨。

Running a customer-service agent at a consumer-tech company for a few years, I learned one counterintuitive thing from watching the metrics: they're never a flat line — they wobble constantly.

The hard part isn't spotting that the agent "broke" — it almost never breaks suddenly. The hard part is telling, from an accuracy that jitters every day, whether this dip is "normal noise" or "it's genuinely getting dumber."

And it genuinely keeps changing. Every product launch, every back-to-school season, the question distribution shifts hard and a wave of phrasings the agent has never seen pours in; intent drifts, concept drifts, season drifts. And none of it shows up — not a pixel — on the all-green CPU, QPS, latency, uptime dashboards.

An agent that clears the gate today will quietly slide over six weeks, with no alarm at all. And by the time you notice, it's usually because complaints and GMV dropped — the most lagging signal there is.

The dashboards are green while the agent quietly gets dumber

An agent doesn't break suddenly, it degrades quietly — and "is it still right" isn't on any dashboard you have.

Infra dashboards (CPU / QPS / latency / uptime) measure "is it alive, is it fast," not "is it right." An agent can run at 200ms latency, 99.99% uptime, 30% CPU, and simultaneously miscompute every price difference and answer every new-product question with old-model facts. Infrastructure health and answer-quality health are two completely different things — and most teams only have a dashboard for the first.

Infra dashboards measure "alive and fast," not "still right" — an agent at 200ms and 99.99% uptime can miscompute every price difference at the same time.

That's the sneakiest thing about silent degradation: it trips no alarm. No 5xx, no timeout, no CPU spike. Every traditional ops signal is green while the agent gets batch after batch wrong.

Your eval set is frozen at launch day, scoring an old world

The eval set you built on launch day will use a pretty old number to hide a world that's changing.

The sample you drew at launch, the 200 rows you carefully labeled, keeps reporting the same 94% every week after — because it's the same batch. But real traffic moved on long ago: back-to-school poured in "how do I get the student discount," a launch poured in "what's the difference between the new and old model," and your eval set has not one of them. It's scoring a world that no longer exists — and scoring it high.

Post #1 said the first lesson of sampling is "don't scroll logs, sample." Here's an equally important second one: sampling isn't one-time either. The eval set has to be alive — re-sampling from current traffic, adding new intents, cutting stale ones — to keep up with drift. An eval set frozen at launch day is the most respectable self-deception: it hands you a green light every week while the world has turned over three times.

By the time GMV and complaints drop, it's too late — watch the leading signals

The metric jitters daily; you can't react to every wobble (that's noise), and you can't wait for GMV and complaints (that's weeks late). What to watch are the signals that fire before them.

GMV, complaints, satisfaction are lagging signals — by the time they fire, the agent has been rotting for weeks. These leading signals move before quality visibly drops:

Unknown rate rises — intent classification can't hold it, meaning new phrasings poured in (see intent codebook evolution).
Fallback / human-handoff rate rises — the share where the agent raises its hand and says "I can't" is climbing.
Critic block rate shifts — the pattern of money-moving writes getting blocked changed.
An intent's share jumps — back-to-school takes "student discount" from 0 to 8%, the distribution drifted.
Confidence distribution shifts left — the model is less and less sure of its own answers.

Leading signals move the moment "the world changed"; lagging signals wait for errors to pile into a GMV dip — the weeks between them are your only window to stop the bleeding.

They "lead" because they move the moment the world changes, not after the agent's errors pile into a GMV drop. As for telling real drift from noise: don't watch single-point jitter, watch the trend plus a control line — a signal crossing the line for several consecutive batches is real degradation, not a one-day wobble (the other side of post #2's CI thinking: use intervals and trends, not single points).

Where drift comes from: four real sources

Drift isn't one disease, it's four — different sources, different ways to watch.

Intent drift: users change how they phrase the same thing, old intent classification can't hold it → unknown rate rises.
Concept drift: the same word changes meaning. "New model" meant A last month, B this month; "promo" swaps from a big sale to back-to-school → the agent answers by the old concept, confidently wrong.
Seasonal drift: back-to-school, big sales, launch events — the distribution shifts cyclically, and your everyday eval set can't represent these weeks at all.
New product / new app launch: the fiercest. Every launch pours in a wave of never-seen phrasings; unknown rate and handoff rate step up the same day.

One of these four is especially sneaky: your knowledge base or policy changed, but the agent still answers by the old version — the source moved, the agent didn't. It's the engineering form of concept drift; I dissected one such case fully in Anatomy of a 9-Day Silent KB Desync. The first step in spotting drift is knowing it wears these several faces — don't treat "unknowns from a launch surge" and "model degradation" as the same thing to fix.

Three things you can do this week

Add a "quality dashboard" next to the infra one. At minimum, three leading signals: unknown rate, handoff rate, confidence distribution. Remember one line — infra green does not mean quality green.
Make the eval set alive. Stop using the launch-day batch; re-sample a small slice from current traffic weekly, add it to the eval set via post #1's stratified frame, so it keeps up with drift.
Draw a control line on each leading signal. Alarm only after several consecutive batches cross it (trend, not single-point jitter) — separating "wobbled today" from "genuinely degrading," so you're neither woken by noise nor oversleeping.

An agent that cleared your gate isn't the finish line. The world keeps moving and it won't keep up on its own. Silent degradation isn't an accident, it's the default outcome. And catching drift is only stopping the bleeding.

The real cure is feeding the drift signals back into the next version of the agent — turning the new phrasings in "unknown" into new intents, new labels, the next iteration. The next post covers that loop: the data flywheel — how to make an agent get sharper with use, not dumber.

Which rows, how many, how reliably labeled, how to judge launch, how to guard against degradation — the After Launch mechanism, up to here, lets an agent not just "get up" but "stay up."

If this made you decide to add a quality dashboard and watch a few leading signals, send me the keyword "DRIFT KIT" and I'll share the 6-leading-signal checklist plus how to set the control lines.

Share on X

Subscribe for updates

Get the latest AI engineering posts delivered to your inbox.

← All posts