Agentic AI

34 posts

The Spec That Vanished: How Agents Reshuffle Business, Product, and Engineering

In a project meeting, engineering wired up the customer-service agent straight from business needs; product got demoted to 'support,' and wasn't happy. This post traces why: the spec that used to pin business needs into config and UI simply can't be written for an agent. It also shows where the three things that spec carried (decision boundaries, eval ground truth, accountability) must land now. You'll come away able to stop arguing 'who leads' and point at which of the three owners your project is missing.

Jul 15, 2026·16 min read

Your Knowledge Center Is a Search Box

A vendor demos a 'knowledge center' — type one line in the search box, get back a tidy answer, the boss signs off on the spot. In 10 minutes you'll be able to puncture that demo in the procurement meeting: know to ask 'does it only retrieve, or can it judge and act,' to make them prove how long a stale policy takes to actually go offline in production, and to count — at acceptance — how many of the '200 items ingested' actually landed.

Jul 6, 2026·12 min read

One Brain, Two Front-Ends: Refunded Twice on Sale Day

On the first day of the big sale, a user requested one refund — and the refund service ran it twice. The money went out twice. In 10 minutes you'll be able to name, in an architecture review, the place a multi-agent system most loves to leak money: know to put an idempotency key on the layer two front-ends share, to ask 'is refundType defined the same in both systems,' and to see that the real bottleneck of multi-agent delivery is which team nods, not model quality.

Jul 6, 2026·12 min read
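
As a rough illustration of the idempotency key mentioned in the double-refund post above, here is a minimal Python sketch. `RefundService`, its in-memory dict, and the key format are hypothetical stand-ins, not the system from the post; a real service would more likely enforce the key with a unique constraint in a durable store.

```python
import threading


class RefundService:
    """Hypothetical shared refund layer that both front-ends call."""

    def __init__(self) -> None:
        self._results = {}             # idempotency_key -> stored result
        self._lock = threading.Lock()  # serialize check-and-pay
        self.payouts = 0               # how many times money actually moved

    def refund(self, idempotency_key: str, order_id: str, amount_cents: int) -> dict:
        with self._lock:
            # A repeated key returns the stored result instead of paying again.
            if idempotency_key in self._results:
                return self._results[idempotency_key]
            self.payouts += 1
            result = {"order_id": order_id, "amount_cents": amount_cents, "status": "paid"}
            self._results[idempotency_key] = result
            return result


if __name__ == "__main__":
    service = RefundService()
    # Two front-ends submit the same user request with the same key.
    service.refund("order-42:refund-1", "order-42", 1999)
    service.refund("order-42:refund-1", "order-42", 1999)
    print(service.payouts)
```

The key has to be derived from the user's request, not generated per front-end call, or both front-ends will mint different keys for the same refund.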

Your Code Is Fixed. Production Isn't.

"Didn't we fix that last week?" The code merged, the tests were green, and production keeps making the same mistake. In 10 minutes you'll be able to puncture that illusion in a review: ask which machine's .env is in play, make engineering prove the safety threshold has ever fired in prod, and byte-compare before you reindex.

Jul 5, 2026·12 min read

What Everyone Gets Wrong About "Agents" — Even Heavy Claude Users Can't Define One

The people who ask me this most aren't beginners — they're strong engineers who use Claude and Codex every day: "I've honestly never quite figured out what actually counts as an agent." It's not that they don't get it — the word has been stretched to mean everything, and therefore nothing. By the end you'll have one question that tells you, on the spot, whether anything claiming to "build an agent" actually is one.

Jul 5, 2026·8 min read

Make the Agent Get Sharper With Use, Not Dumber — Spinning Up the Data Flywheel

In labeling we found back-to-school promo questions spiked, but within them "how do I get the gift-with-purchase" was never answered well — so we fixed that KB specifically, and next round that class climbed. That's the data flywheel. In 5 minutes you'll see the first five posts are five spokes of one wheel; in 10 you'll spot why a flywheel spins in place (data piles up, never ships); in 20 you'll have a feedback loop ranked by frequency × cost that also forces you to split intents.

Jul 5, 2026·12 min read

A Pretty Accuracy Number Hid Dozens of Money-Moving Errors — How to Read the Eval to Ship

On a money-moving project I ran, the overall accuracy looked great; but pull the money-moving intents out on their own and the wrong-action rate was alarming — dozens of money-touching errors sat there the whole time, hidden by one blended number. In 5 minutes you'll see through "one accuracy figure to request launch"; in 10 you'll put a separate wrong-action gate on money-moving errors; in 20 you'll have a launch-decision flow: CI lower bound + per-scenario version cut + per-channel ramp.

Jul 5, 2026·13 min read

Your Dashboards Are Green While the Agent Quietly Gets Dumber — Post-Launch Silent Drift

Running a customer-service agent at a consumer-tech company, I learned one counterintuitive thing from watching the metrics: they're never a flat line. Every product launch, every back-to-school season, the question distribution shifts and a wave of new phrasings pours in — the agent quietly gets dumber, and not a pixel of it shows on the green CPU / QPS / latency dashboards. In 5 minutes you'll see through "dashboards green = healthy"; in 10 you'll have 6 leading signals that fire weeks before complaints; in 20 you'll turn the eval set from "frozen at launch day" into one that re-samples current traffic.

Jul 5, 2026·13 min read

Your Labels Are Your Ceiling — One "Swap Half a Size Up" Gets Three Answers From Two Agents

A customer's one line — "I want to swap half a size up" — hides four calls: exchange or return, intercept the shipment or not, refund the price difference or not, and against which price. Two skilled agents label the same 50 rows back-to-back and agree on only 35 — your 96% accuracy was measured with a ruler that's only 70% self-consistent. In 20 minutes you'll have a flow for measuring agreement first, then writing rules like "which price the difference is refunded against" into a rubric.

Jul 4, 2026·13 min read

50 Rows at 96%, Ship It? Size Labeling by the CI Lower Bound, Not the Pretty Number

"We labeled 50, 96% correct — ship it?" No — the statistical lower bound is only 86%. When I designed the labeling gate for a customer-service agent, the first rule I pinned was: read the confidence-interval lower bound, not the point estimate. In 5 minutes you'll see through the small-sample 96% mirage; in 10 you'll have a "true accuracy → rows to label" table that explodes near the threshold; in 20, a CI stopping rule — label more where it's needed, don't waste it where it's done, and call a stop when the answer is fix-the-agent.

Jul 3, 2026·12 min read
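
The 50-row figure in the post above can be sanity-checked with a few lines of Python. This sketch assumes a 95% Wilson score interval; the teaser does not say which interval the author used, so treat the choice (and the function name) as an assumption.

```python
import math


def wilson_lower_bound(correct: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (center - margin) / denom


if __name__ == "__main__":
    # 48 of 50 rows correct is the "96%" point estimate.
    print(f"{wilson_lower_bound(48, 50):.3f}")
    # Same 96% point estimate, ten times the rows: the bound tightens.
    print(f"{wilson_lower_bound(480, 500):.3f}")
```

For 48 of 50 the lower bound comes out at roughly 0.865, in line with the "only 86%" quoted in the summary.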

The 96% Turned Red the Moment We Split It by Channel — Two Fatal Failures of Eval Sampling

The vendor drew 200 rows at random and reported "96% accuracy." I split those 200 by channel — in the small public-domain channel, the price-difference cases were nearly all wrong, and the high-traffic private channel had diluted it into a pretty number. In 5 minutes you'll see through both failures of random sampling (missing the tail + a big channel drowning a small one); in 20 you'll have a stratified frame with channel drill-down and risk-layer over-sampling.

Jul 2, 2026·13 min read

When the AI Becomes the Storefront, You Decay Into a Supplier: The Relationship Moat for 10 Million Merchants

WeChat is beta-testing an AI that orders and pays for users, and Qwen just opened a brand agent. Both routes point at the same question: once the AI is the storefront, does the customer still belong to you? This one is for owners and membership / CRM leads: in 10 minutes you'll see how a super app slowly grinds you into a 'supplier,' spot the door Qwen's brand agent leaves open, and walk out with the 3 boundaries to hold (membership / repurchase / profile) plus the questions to put to the platform next week. WeChat is still read-only, no writes: build the relationship layer now, before the write permissions land and you find out you've been running naked.

Jun 30, 2026·10 min read

AI Picked the First Store and the Other Four Vanished: WeChat's New Shelf for 10 Million Merchants

WeChat is beta-testing an AI agent that can search, compare, order, and pay across ~10 million merchants. This one is for owners and growth leads: in 10 minutes you'll see how the AI entry rewrote the physics of traffic distribution, spot the 3 signals that your service is already invisible to the AI, and walk out with 5 questions to put to your platform and team next week. It doesn't bet on any specific API — it explains the one thing that's certain: the shelf's rules changed, and whoever reads the rules first grabs the slot.

Jun 29, 2026·10 min read

The Day WeChat's AI Ordered for Users, 10 Million Merchants' Mini-Programs Expired — A Layered Rebuild Blueprint

WeChat's AI agent has entered beta — it can search, compare, order, and pay across ~10 million merchants. This is the rebuild blueprint for architects and eng leads: which capabilities to scout with auto mode, which to package as SKILLs, how atomic APIs use a state machine to keep the AI on rails, and how to stitch the relationship layer back onto your side. In 30 minutes you walk out with a layered rebuild map + 10 things to do this week + 5 questions for the platform and vendors. The contract isn't frozen — so this post teaches you to design by MODE, not to bet on a specific API.

Jun 24, 2026·30 min read

Your Observability Dashboard Is Throttling the Agent It Watches — an Async Latency Postmortem

A P99 latency spike every few minutes — CPU, QPS, error rate all flat, Agent code unchanged. The culprit no one suspects: the observability dashboard built to watch the Agent was choking it. The postmortem — how one sync call freezes a single-threaded async event loop, the two-line fix (to_thread + TTL cache), and 10 event-loop probes you can add to your async service this week.

Jun 10, 2026·20 min read
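
The "to_thread + TTL cache" fix named in the postmortem above might look roughly like the following. This is a sketch, not the code from the post: `slow_sync_metrics_call`, the 5-second TTL, and the module-level cache are all hypothetical.

```python
import asyncio
import time

CACHE_TTL_SECONDS = 5.0  # hypothetical TTL, tune to how stale a dashboard may be
_cache = {"value": None, "expires_at": 0.0}
call_count = {"n": 0}    # only here to show how often the slow call runs


def slow_sync_metrics_call() -> dict:
    """Stand-in for a blocking call (for example a sync HTTP or DB client)."""
    call_count["n"] += 1
    time.sleep(0.2)  # would freeze the event loop if called directly
    return {"queue_depth": 3}


async def get_metrics() -> dict:
    now = time.monotonic()
    if _cache["value"] is not None and now < _cache["expires_at"]:
        return _cache["value"]  # fresh enough: skip the slow call entirely
    # Run the blocking call in a worker thread so the loop keeps serving.
    value = await asyncio.to_thread(slow_sync_metrics_call)
    _cache["value"] = value
    _cache["expires_at"] = time.monotonic() + CACHE_TTL_SECONDS
    return value


async def main() -> tuple:
    first = await get_metrics()
    second = await get_metrics()  # served from the cache
    return first, second


if __name__ == "__main__":
    print(asyncio.run(main()))
```

`asyncio.to_thread` requires Python 3.9 or later. Callers that arrive concurrently before the cache is filled will each trigger the slow call; a lock or a shared in-flight task would close that gap.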

Your KB Changed. The Search Index Didn't — Anatomy of a 9-Day Silent Desync | KB-Ops Deep Dive

The same refund line: curl it locally, you get the new wording; curl it in prod, you get the 9-day-old 'not as good as Taobao.' Between the source file and prod sat one step I assumed was automatic and was actually manual. This is the engineering postmortem: two stacked silent-desync root causes + how 33 test-feedback rows cluster into 16 with one cause + why an all-green dashboard hid it + 10 gates you can add to your own KB pipeline this week. Twenty minutes in, you can find the same hole in your own source-plus-derived-index system.

Jun 3, 2026·20 min read

What a Real, Money-Moving L2 Refund Workflow Actually Looks Like | Workflow Deep Dive

The refund flow on the slide is a clean 8-step line. Built for real, that line is only a fifth of it — the other four-fifths decide whether to get to the payout at all. This deep dive takes apart a real L2 refund workflow: from linear 8 steps to a branching tree, why most of the code isn't refunding but not-refunding, why limits must live in a DB table and not the prompt, why every leaf with an unwired external system defaults to a human. Twenty-five minutes in, you can take this skeleton to a vendor and ask where their refund workflow's guardrails are.

Jun 3, 2026·25 min read

Self-Serve Rate ≠ Correct Rate — The Gates a Customer-Service Agent Must Clear Before Launch | Agentic AI in Practice (XIII)

The question at the review board — 'should self-serve rate be 65 or 90?' — crams two different axes into one number. A session that refunds the wrong amount still reads approved. Five minutes in you can see through a single-number 'self-serve rate 95%' report; ten minutes in you can build a 9-gate, 3-layer launch gate; twenty minutes in you can ask, at the review board, 'is this red line actually reconciled, or is it just falling back to a human because the API isn't wired yet?' — the kind of question that exposes a fake green checkmark on the spot.

Jun 2, 2026·15 min read

The Org Chart Is the Real Architecture Diagram — 90% of Stalled Agent Projects Aren't a Tech Problem | Agentic AI in Practice (XII)

Annotation delivered, eval baseline built, four scenarios shipped — and the project still stalled for three weeks. The root cause wasn't code; it was five roles marked 'TBD' on the RACI sheet. Five minutes in you can see through 'the project team is already staffed'; ten minutes in you can draw the 3 roles an Agent landing must add plus a one-page ownership table; twenty minutes in you can walk into a kickoff and ask 'who has the authority to mark this doc expired?' — the kind of question that exposes an org gap on the spot.

Jun 1, 2026·14 min read

LLM Fact-Checking with a Verifier Agent: 11 of 34 Facts in a 25-Page AI Plan Were Fabricated

LLM fact-checking with a second, adversarial verifier agent. Vendor PPTs, AI-drafted emails, and Claude-written plans all carry a 20-30% fact-fabrication baseline. Five minutes to see why 'let the LLM check itself' is pseudo-verification; twenty to turn the DRAFT → VERIFY → FINALIZE gate + the R1-R7 error taxonomy into a PR checklist where every claim traces to a file:line or URL.

May 29, 2026·13 min read

Corpus Drives Codebook — Why Your Intent Taxonomy Is Stuck at 60% and How It Evolves from 36 to 48 | Agentic AI in Practice (X)

Customer-service Agent in production, 36 intents, unknown rate 40%, the business side asks 'can we just add an LLM fallback?' The real problem is not the classifier — it's the codebook itself. Five minutes in you can spot the wrong diagnosis ('unknown rate high = classifier weak'); ten minutes in you have the four-quadrant test that filters 80% of pseudo-missing-intent requests; twenty minutes in you have the corpus → codebook iteration loop that evolves a taxonomy from 36 to 48 stable intents.

May 28, 2026·14 min read

Don't Let AI Agents Call APIs Directly — A 5-Layer Tool-Calling Stack + 25-API Contract Checklist

The most common fake architecture in customer-service Agent projects this year: 'we let Agents call order / ticket / logistics APIs directly, 25 integrations done, full coverage.' Then you ask 'what happens when the OMS vendor changes?' and the answer is 'rewrite'; 'how does QA do mock integration?' gets 'wait for the real interface'; 'compliance audit for write operations?' gets 'we'll add logging.' That architecture is missing layers. Written for architects, founders, and project owners running enterprise Agentic projects: 5 min to spot the most expensive architecture mistake, 10 min to lock in the 5-layer responsibility split (Adapter / Service ABC / Tool / Workflow / Critic), 20 min to walk out with a 6-systems × 25-APIs integration matrix + 5 architecture decisions to drive this week.

May 27, 2026·17 min read

Intent Classification for Chatbots: Why Pure-Rule and Pure-LLM Both Fail (a 3-Tier Cascade)

Intent classification is the first node in any customer-service Agent — get it wrong and the next four architecture decisions are wasted. Pure-rule is brittle; pure-LLM blows the budget. The 3-tier fallback (rule → embedding → LLM) is the only engineering trade-off that stands up. Five minutes in you can spot the two fake architectures ('just use an LLM' / '100% rules'); ten minutes in you have starting thresholds for all three tiers; twenty minutes in you have the signals that say it's time to evolve from HybridClassifier to LLM Router.

May 25, 2026·16 min read
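
To make the rule → embedding → LLM cascade in the post above concrete, here is a minimal sketch with stubbed tiers. The tier functions, the thresholds, and the intent names are hypothetical; the post's own starting thresholds are not given in this summary and are not reproduced here.

```python
from typing import Callable, Optional, Tuple

# Each tier returns (intent or None, confidence score).
Tier = Callable[[str], Tuple[Optional[str], float]]


def cascade_classify(
    text: str,
    rule_tier: Tier,
    embedding_tier: Tier,
    llm_tier: Tier,
    rule_min: float = 1.0,       # hypothetical threshold
    embedding_min: float = 0.8,  # hypothetical threshold
) -> Tuple[str, str]:
    """Try the cheapest tier first; fall through only when it is unsure."""
    intent, score = rule_tier(text)
    if intent is not None and score >= rule_min:
        return intent, "rule"
    intent, score = embedding_tier(text)
    if intent is not None and score >= embedding_min:
        return intent, "embedding"
    intent, _ = llm_tier(text)
    return (intent or "unknown"), "llm"


# Stub tiers so the sketch runs without any model or API.
def rule_stub(text: str) -> Tuple[Optional[str], float]:
    return ("refund", 1.0) if "refund" in text.lower() else (None, 0.0)


def embedding_stub(text: str) -> Tuple[Optional[str], float]:
    known = {"where is my package": ("logistics", 0.9)}
    return known.get(text.lower(), (None, 0.0))


def llm_stub(text: str) -> Tuple[Optional[str], float]:
    return ("other", 0.5)


if __name__ == "__main__":
    for line in ["I want a refund", "where is my package", "hmm"]:
        print(line, "->", cascade_classify(line, rule_stub, embedding_stub, llm_stub))
```

Returning the tier name alongside the intent makes it possible to log what share of traffic each tier handles.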

Pytest-Green Doesn't Mean Ship-Ready: How to Actually Test an AI Agent (Dual-Track)

The claim most likely to fool your customer-service Agent project this year: 'pytest 400+ green, coverage 79%, CI gate passing.' Then the boss asks 'what's the faithfulness rate? Tone compliance? Prompt-injection block rate?' and nobody answers. The 'tests passed' bar for an AI system is not the 'tests passed' bar for traditional software. This piece is for architects, founders, and project owners shipping Agentic AI inside an enterprise: 5 min to see why pytest-green is misleading, 10 min to decide who owns which 4 of the 8+ test buckets, 20 min to walk out with a 7-quality-dimension threshold table + 3-cadence rhythm + 5 things to drive this week. Bring it to your next architecture review.

May 25, 2026·16 min read

Agent Skills vs Knowledge Base: Why Stuffing SOPs Into RAG Doesn't Make an Agent Capable

Every other vendor review someone asks: 'where's the MCP-style protocol for Skills? How are we supposed to ship without one?' The question is backwards: the absence of a protocol isn't a bad thing; it's the signal that you can start now. Five minutes to see through 'we put all our SOPs in the knowledge base, that's our Agent shipping' pitches; ten minutes to use a three-line test that surfaces every fake Skill in your design; twenty minutes to draft an enterprise Skill spec for your team.

May 24, 2026·16 min read

Containment Rate vs Resolution Rate: The Only Customer-Service AI Metric That Matters (How "98% CSAT" Gets Faked)

The CEO gets a weekly email from the vendor: CSAT 98%. I pulled the raw data — ~5% of customers rated 'satisfied,' a fraction of a percent rated 'unsatisfied,' 95% never responded. 'Silent = satisfied by default' is how that 98% got built. Five minutes to see through four flavors of fake-resolution claim; ten minutes to redraw your team's customer-service north star.

May 22, 2026·16 min read

Deploy and Abandon — The Costliest Misconception in AI Agent Projects | Agentic AI in Practice (IV)

My boss graded my Critic design a B, reasoning: 'this is for Apple-scale companies, we're not Apple.' That sentence is the single most expensive misconception in AI Agent adoption. Five minutes to see through the six hollow spots in a 'deploy and abandon' proposal; ten minutes to walk into a vendor review armed with four questions they can't answer.

May 18, 2026·16 min read

Why a 70% Critic Beats a 95% Critic — A Fail-Closed Design Deep Dive | Agentic AI in Practice (III)

A Critic second-pass review is the only thing standing between an L2 customer-service Agent and a refund mistake. But the '95% automation rate' vendors keep showing you is almost always fail-open — Critic times out, the action passes through. Five minutes to see through three flavors of fake backstop; ten minutes to redraw your team's design.

May 17, 2026·22 min read
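
The fail-open versus fail-closed distinction in the post above comes down to what happens when the Critic does not answer. A minimal sketch, assuming an async critic callable; `gated_decision`, the stub critics, and the timeout value are hypothetical and not taken from the post:

```python
import asyncio

EXECUTE = "execute"
ESCALATE = "escalate_to_human"


async def gated_decision(action: dict, critic, timeout_s: float = 2.0) -> str:
    """Fail-closed gate: only an explicit approval lets the action run."""
    try:
        verdict = await asyncio.wait_for(critic(action), timeout=timeout_s)
    except Exception:
        # Timeout or critic crash: do NOT pass the action through.
        return ESCALATE
    return EXECUTE if verdict is True else ESCALATE


# Stub critics for illustration.
async def approving_critic(action: dict) -> bool:
    return True


async def hanging_critic(action: dict) -> bool:
    await asyncio.sleep(1.0)  # slower than the timeout used below
    return True


async def crashing_critic(action: dict) -> bool:
    raise RuntimeError("critic backend down")


if __name__ == "__main__":
    refund = {"type": "refund", "amount_cents": 1999}
    print(asyncio.run(gated_decision(refund, approving_critic)))
    print(asyncio.run(gated_decision(refund, hanging_critic, timeout_s=0.05)))
    print(asyncio.run(gated_decision(refund, crashing_critic)))
```

A fail-open version would return `EXECUTE` from the `except` branch, which raises the reported automation rate by letting unreviewed actions through.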

Five Architecture Decisions That Determine Whether Your Customer-Service Agent Can Ship | Agentic AI in Practice (II)

A customer-service Agent looks like the perfect candidate for L3 multi-Agent orchestration. The ones that actually ship are all L2 deterministic workflows. This post walks through a refund the autonomous chain pushed through by mistake, and the five forks it forces you to think about.

May 16, 2026·18 min read

AI Agent Autonomy Levels: I Audited 28 'Agent' Projects — Only 5 Passed L0–L3

AI agent autonomy levels, made practical: I audited 28 enterprise AI projects — only 5 were real Agents. The rest were 'automation with an LLM bolted on,' or slideware. Here's the 4-level autonomy test (L0–L3) to grade any AI project in 5 minutes.

May 12, 2026·20 min read

80% of Failed AI Agents Die in Ops, Not Tech — Post-Launch Loop, Safety Layer & 30-Day Monitoring Plan

Launch is the start, ops is the game. Five minutes to judge whether your AI project is quietly degrading; twenty minutes to walk out with a complete SOP — 6 KPIs + Critic pseudocode + 5 prerequisites for headcount reduction + 30-day plan covering every day from signing to Alpha launch.

Feb 28, 2026·25 min read

AI Agent Knowledge Base: 3-Layer Design + 4-Bucket Cost Estimate

Spot vendor 'just dump docs into a vector DB' proposals in 5 minutes. 3-layer knowledge base architecture + 4-bucket cost estimate for production Agents.

Feb 28, 2026·25 min read

Which 4 of Your 28 'Smart-X' AI Agent Projects to Start With — Retail Agentic AI Handbook (Part 1)

Your boss just handed you 28 'smart-X' projects and wants them all done this year. You can't do them all. Here's a 28-scenario priority map — five minutes to know which are P0 and which to defer until the data foundation is in place; twenty minutes to walk into your next AI strategy meeting with '4 P0s + 5 Week-One decisions.'

Feb 28, 2026·20 min read

Reward Hacking in AI Agents: Trained 60,000 Steps, the Agent Learned to Delete Tickets (6 ITSM Patterns)

I built an ITSM Agent research environment fitted to real ServiceNow ticket data. After 60,000 training steps, DQN and PPO both hit 100% hacking rates: every ticket was handled by some cheating shortcut, with zero genuine resolutions. This is the engineer's-eye debrief: six ITSM-specific reward-hacking patterns + why your dashboard won't catch them + ten things your team can do this week.

Oct 10, 2025·30 min read