Enterprise AI

18 posts

How Much to Label: Not a Percentage of Traffic, but "Label Until You Can Conclude"

"We labeled 50, 96% correct. Ship it?" No: the statistical lower bound on true accuracy is only 86%. In 5 minutes you'll see through the small-sample 96% mirage; in 10, why labeling volume tracks "intents × channels," not traffic share; in 20, you'll have a table mapping true accuracy to rows-to-label, plus the cheapest sizing rule there is: budget labels by intents × channels, not by traffic.

Jul 3, 2026·12 min read

After Launch Is Where Agent Architecture Is Decided — Start With How You Sample Your Eval Set

"The stronger AI gets, the fewer people you need" is the most popular illusion of the past two years. Where agents are built with real money, labeling and evaluation teams are growing, not shrinking. In 5 minutes you'll see what separates "can demo it" from "can keep it right"; in 10, the 3 sampling questions that expose an eval set; in 20, how to redraw yours from "eyeball the logs" into a stratified sampling frame that pulls rare, high-stakes events — a wrong refund, a mishandled compliance case — back from "never sampled" into view.

Jul 2, 2026·12 min read

When the AI Becomes the Storefront, You Decay Into a Supplier: The Relationship Moat for 10 Million Merchants

WeChat is beta-testing an AI that orders and pays for users, and Qwen just opened a brand agent: two routes pointing at the same question. Once the AI is the storefront, does the customer still belong to you? This one is for owners and membership / CRM leads: in 10 minutes you'll see how a super app slowly grinds you into a 'supplier,' spot the door Qwen's brand agent leaves open, and walk out with the 3 boundaries to hold (membership / repurchase / profile) plus the questions to put to the platform next week. WeChat is still read-only, no writes, so build the relationship layer now, before the write permissions land and you find out you've been operating with no protection.

Jun 30, 2026·10 min read

AI Picked the First Store and the Other Four Vanished: WeChat's New Shelf for 10 Million Merchants

WeChat is beta-testing an AI agent that can search, compare, order, and pay across ~10 million merchants. This one is for owners and growth leads: in 10 minutes you'll see how the AI entry rewrote the physics of traffic distribution, spot the 3 signals that your service is already invisible to the AI, and walk out with 5 questions to put to your platform and team next week. It doesn't bet on any specific API — it explains the one thing that's certain: the shelf's rules changed, and whoever reads the rules first grabs the slot.

Jun 29, 2026·10 min read

The Day WeChat's AI Ordered for Users, 10 Million Merchants' Mini-Programs Expired — A Layered Rebuild Blueprint

WeChat's AI agent has entered beta — it can search, compare, order, and pay across ~10 million merchants. This is the rebuild blueprint for architects and eng leads: which capabilities to scout with auto mode, which to package as SKILLs, how atomic APIs use a state machine to keep the AI on rails, and how to stitch the relationship layer back onto your side. In 30 minutes you walk out with a layered rebuild map + 10 things to do this week + 5 questions for the platform and vendors. The contract isn't frozen — so this post teaches you to design by MODE, not to bet on a specific API.

Jun 24, 2026·30 min read

Self-Serve Rate ≠ Correct Rate — The Gates a Customer-Service Agent Must Clear Before Launch | Agentic AI in Practice (XIII)

The question at the review board, 'should self-serve rate be 65 or 90?', crams two different axes into one number. A session that refunds the wrong amount still reads as approved. Five minutes in you can see through a single-number 'self-serve rate 95%' report; ten minutes in you can build a 3-layer launch checklist of 9 gates; twenty minutes in you can ask, at the review board, 'is this red line actually reconciled, or is it just falling back to a human because the API isn't wired yet?' That is the kind of question that exposes a fake green checkmark on the spot.

Jun 2, 2026·15 min read

The Org Chart Is the Real Architecture Diagram — 90% of Stalled Agent Projects Aren't a Tech Problem | Agentic AI in Practice (XII)

Annotation delivered, eval baseline built, four scenarios shipped, and the project still stalled for three weeks. The root cause wasn't code; it was five roles marked 'TBD' on the RACI sheet. Five minutes in you can see through 'the project team is already staffed'; ten minutes in you can draw the 3 roles an Agent rollout must add plus a one-page ownership table; twenty minutes in you can walk into a kickoff and ask 'who has the authority to mark this doc expired?' That is the kind of question that exposes an org gap on the spot.

Jun 1, 2026·14 min read

A Second Agent as Reviewer — 11 of 34 Facts in a 25-Page AI Plan Were Fabricated | Agentic AI in Practice (XI)

Vendor PPTs, AI-drafted admissions emails, internal plans written with Claude — fact-fabrication rate sits at a 20-30% industry baseline. Five minutes in you can spot why 'let the LLM double-check itself' is pseudo-verification; ten minutes in you have the DRAFT → VERIFY → FINALIZE 3-phase gate template; twenty minutes in you have R1-R7 — the seven categories of fact errors that keep recurring (enum casing / fabricated emails / API paths / model IDs / deadlines) — turned into a PR checklist. Next time you review AI-drafted material, every claim traces back to a `file:line` or URL.

May 29, 2026·13 min read

Corpus Drives Codebook — Why Your Intent Taxonomy Is Stuck at 60% and How It Evolves from 36 to 48 | Agentic AI in Practice (X)

Customer-service Agent in production, 36 intents, unknown rate 40%, the business side asks 'can we just add an LLM fallback?' The real problem is not the classifier — it's the codebook itself. Five minutes in you can spot the wrong diagnosis ('unknown rate high = classifier weak'); ten minutes in you have the four-quadrant test that filters 80% of pseudo-missing-intent requests; twenty minutes in you have the corpus → codebook iteration loop that evolves a taxonomy from 36 to 48 stable intents.

May 28, 2026·14 min read

Don't Let AI Agents Call APIs Directly — A 5-Layer Tool-Calling Stack + 25-API Contract Checklist

The most common fake architecture in customer-service Agent projects this year: 'we let Agents call order / ticket / logistics APIs directly, 25 integrations done, full coverage.' Then the follow-up questions land. 'What happens when the OMS vendor changes?' 'Rewrite.' 'How does QA do mock integration?' 'Wait for the real interface.' 'Compliance audit for write operations?' 'We'll add logging.' Every one of those answers is a missing layer. Written for architects, founders, and project owners running enterprise Agentic projects: 5 min to spot the most expensive architecture mistake, 10 min to lock in the 5-layer responsibility split (Adapter / Service ABC / Tool / Workflow / Critic), 20 min to walk out with a 6-systems × 25-APIs integration matrix + 5 architecture decisions to drive this week.

May 27, 2026·17 min read

Intent Classification for Chatbots: Why Pure-Rule and Pure-LLM Both Fail (a 3-Tier Cascade)

Intent classification is the first node in any customer-service Agent — get it wrong and the next four architecture decisions are wasted. Pure-rule is brittle; pure-LLM blows the budget. The 3-tier fallback (rule → embedding → LLM) is the only engineering trade-off that stands up. Five minutes in you can spot the two fake architectures ('just use an LLM' / '100% rules'); ten minutes in you have starting thresholds for all three tiers; twenty minutes in you have the signals that say it's time to evolve from HybridClassifier to LLM Router.

May 25, 2026·16 min read

Pytest-Green Doesn't Mean Ship-Ready: How to Actually Test an AI Agent (Dual-Track)

The claim most likely to fool your customer-service Agent project this year: 'pytest 400+ green, coverage 79%, CI gate passing.' Then the boss asks 'what's the faithfulness rate? Tone compliance? Prompt-injection block rate?' and nobody answers. The 'tests passed' bar for an AI system is not the 'tests passed' bar for traditional software. This piece is for architects, founders, and project owners shipping Agentic AI inside an enterprise: 5 min to see why pytest-green is misleading, 10 min to decide who owns which 4 of the 8+ test buckets, 20 min to walk out with a 7-quality-dimension threshold table + 3-cadence rhythm + 5 things to drive this week. Bring it to your next architecture review.

May 25, 2026·16 min read

Agent Skills vs Knowledge Base: Why Stuffing SOPs Into RAG Doesn't Make an Agent Capable

Every other vendor review someone asks: 'where's the MCP-style protocol for Skills? How are we supposed to ship without one?' The question is backwards: that no protocol is on the way isn't a bad thing; it's the signal that you can start now. Five minutes to see through pitches that claim 'we put all our SOPs in the knowledge base, so our Agent is shipped'; ten minutes to use a three-line test that surfaces every fake Skill in your design; twenty minutes to draft an enterprise Skill spec for your team.

May 24, 2026·16 min read

Containment Rate vs Resolution Rate: The Only Customer-Service AI Metric That Matters (How "98% CSAT" Gets Faked)

The CEO gets a weekly email from the vendor: CSAT 98%. I pulled the raw data — ~5% of customers rated 'satisfied,' a fraction of a percent rated 'unsatisfied,' 95% never responded. 'Silent = satisfied by default' is how that 98% got built. Five minutes to see through four flavors of fake-resolution claim; ten minutes to redraw your team's customer-service north star.

May 22, 2026·16 min read

Deploy and Abandon — The Costliest Misconception in AI Agent Projects | Agentic AI in Practice (IV)

My boss graded my Critic design a B, reasoning: 'this is for Apple-scale companies, we're not Apple.' That sentence is the single most expensive misconception in AI Agent adoption. Five minutes to see through the six hollow spots in a 'deploy and abandon' proposal; ten minutes to walk into a vendor review armed with four questions they can't answer.

May 18, 2026·16 min read

Why a 70% Critic Beats a 95% Critic — A Fail-Closed Design Deep Dive | Agentic AI in Practice (III)

A Critic second-pass review is the only thing standing between an L2 customer-service Agent and a refund mistake. But the '95% automation rate' vendors keep showing you is almost always fail-open — Critic times out, the action passes through. Five minutes to see through three flavors of fake backstop; ten minutes to redraw your team's design.

May 17, 2026·22 min read

Five Architecture Decisions That Determine Whether Your Customer-Service Agent Can Ship | Agentic AI in Practice (II)

A customer-service Agent looks like the perfect candidate for L3 multi-Agent orchestration. The ones that actually ship are all L2 deterministic workflows. This post starts from a refund the autonomous chain pushed through by mistake and walks through the five forks it forces you to think about.

May 16, 2026·18 min read

I Audited 28 'AI Agent' Projects — Only 5 Were Real Agents

I audited 28 enterprise AI projects — only 5 were real Agents. The rest were 'automation with an LLM bolted on,' or slideware. Here's the 4-level test (L0–L3) to grade any AI project in 5 minutes.

May 12, 2026·20 min read