Customer Service Agent

10 posts

How Much to Label: Not a Percentage of Traffic, but "Label Until You Can Conclude"

"We labeled 50 and 96% were correct, so ship it?" No: the statistical lower bound on accuracy is only about 86%. In 5 minutes you'll see through the small-sample 96% mirage; in 10, why labeling volume grows with the number of intents × channels rather than with traffic share, the cheapest rule there is; in 20, you'll have a table mapping true accuracy to rows-to-label.

Jul 3, 2026·12 min read
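
The lower bound quoted in the labeling teaser above can be checked in a few lines. This is a minimal sketch, assuming the bound is a 95% Wilson score interval on 48 correct out of 50; the teaser does not say which interval the post uses, and the function name `wilson_interval` is illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(correct: int, total: int, confidence: float = 0.95) -> tuple[float, float]:
    """Return the (lower, upper) Wilson score interval for a binomial proportion."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = correct / total
    denom = 1 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Assumed reading of the teaser: 50 labeled rows, 48 judged correct (96%).
lower, upper = wilson_interval(correct=48, total=50)
print(f"observed 96.0%, 95% interval {lower:.1%} to {upper:.1%}")
```

Under these assumptions the lower end lands between 86% and 87%, which matches the figure in the teaser. Calling the same function with a larger `total` at the same observed rate shows how quickly the interval tightens as more rows are labeled.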

After Launch Is Where Agent Architecture Is Decided — Start With How You Sample Your Eval Set

"The stronger AI gets, the fewer people you need" is the most popular illusion of the past two years. Where agents are built with real money, labeling and evaluation teams are growing, not shrinking. In 5 minutes you'll see what separates "can demo it" from "can keep it right"; in 10, the 3 sampling questions that expose an eval set; in 20, how to redraw yours from "eyeball the logs" into a stratified sampling frame that pulls rare, high-stakes events — a wrong refund, a mishandled compliance case — back from "never sampled" into view.

Jul 2, 2026·12 min read

Corpus Drives Codebook — Why Your Intent Taxonomy Is Stuck at 60% and How It Evolves from 36 to 48 | Agentic AI in Practice (X)

A customer-service Agent is in production with 36 intents and a 40% unknown rate, and the business side asks, 'Can we just add an LLM fallback?' The real problem is not the classifier; it is the codebook itself. Five minutes in you can spot the wrong diagnosis ('unknown rate high = classifier weak'); ten minutes in you have the four-quadrant test that filters 80% of pseudo-missing-intent requests; twenty minutes in you have the corpus → codebook iteration loop that evolves a taxonomy from 36 to 48 stable intents.

May 28, 2026·14 min read

Don't Let AI Agents Call APIs Directly — A 5-Layer Tool-Calling Stack + 25-API Contract Checklist

The most common fake architecture in customer-service Agent projects this year: 'We let Agents call order / ticket / logistics APIs directly, 25 integrations done, full coverage.' Then ask three questions. What happens when the OMS vendor changes? 'Rewrite.' How does QA do mock integration? 'Wait for the real interface.' What about compliance audits for write operations? 'We'll add logging.' Those answers are the symptom of missing layers. Written for architects, founders, and project owners running enterprise Agentic projects: 5 min to spot the most expensive architecture mistake, 10 min to lock in the 5-layer responsibility split (Adapter / Service ABC / Tool / Workflow / Critic), 20 min to walk out with a 6-systems × 25-APIs integration matrix + 5 architecture decisions to drive this week.

May 27, 2026·17 min read

Intent Classification for Chatbots: Why Pure-Rule and Pure-LLM Both Fail (a 3-Tier Cascade)

Intent classification is the first node in any customer-service Agent — get it wrong and the next four architecture decisions are wasted. Pure-rule is brittle; pure-LLM blows the budget. The 3-tier fallback (rule → embedding → LLM) is the only engineering trade-off that stands up. Five minutes in you can spot the two fake architectures ('just use an LLM' / '100% rules'); ten minutes in you have starting thresholds for all three tiers; twenty minutes in you have the signals that say it's time to evolve from HybridClassifier to LLM Router.

May 25, 2026·16 min read

Pytest-Green Doesn't Mean Ship-Ready: How to Actually Test an AI Agent (Dual-Track)

The claim that most easily fools a customer-service Agent project this year: 'pytest 400+ green, coverage 79%, CI gate passing.' Then the boss asks, 'What's the faithfulness rate? Tone compliance? Prompt-injection block rate?' and nobody can answer. The 'tests passed' bar for an AI system is not the 'tests passed' bar for traditional software. This piece is for architects, founders, and project owners shipping Agentic AI inside an enterprise: 5 min to see why pytest-green is misleading, 10 min to decide who owns which 4 of the 8+ test buckets, 20 min to walk out with a 7-quality-dimension threshold table + 3-cadence rhythm + 5 things to drive this week. Bring it to your next architecture review.

May 25, 2026·16 min read

Containment Rate vs Resolution Rate: The Only Customer-Service AI Metric That Matters (How "98% CSAT" Gets Faked)

The CEO gets a weekly email from the vendor: CSAT 98%. I pulled the raw data — ~5% of customers rated 'satisfied,' a fraction of a percent rated 'unsatisfied,' 95% never responded. 'Silent = satisfied by default' is how that 98% got built. Five minutes to see through four flavors of fake-resolution claim; ten minutes to redraw your team's customer-service north star.

May 22, 2026·16 min read