Yaqin Hei — Shipping AI into Production

Free agent failure teardown

Building an LLM agent? Send me your worst conversation.

I built production agents at Apple. I’ll tell you why it broke and how to fix it — free. Get a teardown →

★ Start here

What Everyone Gets Wrong About "Agents" — Even Heavy Claude Users Can't Define One

The people who ask me this most aren't beginners — they're strong engineers who use Claude and Codex every day: "I've honestly never quite figured out what actually counts as an agent." It's not that they don't get it — the word has been stretched to mean everything, and therefore nothing. By the end you'll have one question that tells you, on the spot, whether anything claiming to "build an agent" actually is one.

Yaqin Hei·Jul 5, 2026·8 min read·👁 2·EN · 中

Anthropic Hired 11 Top People in Six Months. Not One to Build a Product.

In half a year, from Karpathy to AlphaFold's John Jumper, eleven top-tier people walked into the same company. This post hands you a detective's method — read a hiring list and infer where an AI lab is betting for the next 12–18 months — plus one counterintuitive finding: on this roster, there isn't a single product hire.

Yaqin Hei·Jul 17, 2026·11 min read·EN · 中

The Spec That Vanished: How Agents Reshuffle Business, Product, and Engineering

In a project meeting, engineering wired up the customer-service agent straight from business needs — product got demoted to 'support,' and wasn't happy. This traces why: the spec that used to pin business into config and UI simply can't be written for an agent, and where the three things it carried — decision boundaries, eval ground truth, accountability — must land now. You'll leave able to stop arguing 'who leads' and point at which of three owners your project is missing.

Yaqin Hei·Jul 15, 2026·16 min read·👁 1·EN · 中

I Ran the Whole Token-Saving Playbook. The Savings Got Re-Spent.

A tweet with 1.1M views promises to cut your Claude Code token usage by 90%. I ran the whole playbook, then measured the five things actually eating my budget: an SSE chain across 19 files and 8,716 lines, a 13 MB enterprise API spec PDF, 24 knowledge-governance documents, 52 MB of daily business logs, and a 49,480-character CLAUDE.md. The four tools reach two of them. Here is the full measurement: rtk was installed but never once called, the real compression ratio is 32.2% and not 88%, codegraph's stale window is exactly 300 seconds, and 73% of my budget went to a squad of subagents with no model declared. Twenty minutes from now you will be able to open /context and your usage attribution panel and point at where the money actually goes.

Yaqin Hei·Jul 13, 2026·20 min read·👁 4·EN · 中

One Brain, Two Front-Ends: Refunded Twice on Sale Day

On the first day of the big sale, a user requested one refund — and the refund service ran it twice. The money went out twice. In 10 minutes you'll be able to name, in an architecture review, the place a multi-agent system most loves to leak money: know to put an idempotency key on the layer two front-ends share, to ask 'is refundType defined the same in both systems,' and to see that the real bottleneck of multi-agent delivery is which team nods, not model quality.

Yaqin Hei·Jul 6, 2026·12 min read·👁 2·EN · 中
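
A minimal sketch of the idempotency-key idea from the teaser above, assuming both front-ends call one shared refund layer. RefundService, the key format, and the amounts are hypothetical illustrations, not the post's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class RefundService:
    """Hypothetical shared refund layer that both front-ends call."""
    # Maps idempotency key -> result of the first call made with that key.
    _seen: dict = field(default_factory=dict)
    # Every real payout is recorded here so double payments are visible.
    payouts: list = field(default_factory=list)

    def refund(self, idempotency_key: str, order_id: str, amount_cents: int) -> dict:
        # A repeated key returns the stored result instead of paying again.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        self.payouts.append((order_id, amount_cents))
        result = {"order_id": order_id, "amount_cents": amount_cents, "status": "refunded"}
        self._seen[idempotency_key] = result
        return result


service = RefundService()
key = "order-1001:refund-1"  # hypothetical key derived from the user's single request
first = service.refund(key, "order-1001", 2500)   # arrives via front-end A
second = service.refund(key, "order-1001", 2500)  # same request via front-end B
print(len(service.payouts))
```

The in-memory dict only illustrates the check. A real service would more likely enforce the key with a unique constraint in shared storage so that concurrent calls cannot both pass.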

Your Knowledge Center Is a Search Box

A vendor demos a 'knowledge center' — type one line in the search box, get back a tidy answer, the boss signs off on the spot. In 10 minutes you'll be able to puncture that demo in the procurement meeting: know to ask 'does it only retrieve, or can it judge and act,' to make them prove how long a stale policy takes to actually go offline in production, and to count — at acceptance — how many of the '200 items ingested' actually landed.

Yaqin Hei·Jul 6, 2026·12 min read·👁 1·EN · 中

Your Dashboards Are Green While the Agent Quietly Gets Dumber — Post-Launch Silent Drift

Running a customer-service agent at a consumer-tech company, I learned one counterintuitive thing from watching the metrics: they're never a flat line. Every product launch, every back-to-school season, the question distribution shifts and a wave of new phrasings pours in — the agent quietly gets dumber, and not a pixel of it shows on the green CPU / QPS / latency dashboards. In 5 minutes you'll see through "dashboards green = healthy"; in 10 you'll have 6 leading signals that fire weeks before complaints; in 20 you'll turn the eval set from "frozen at launch day" into one that re-samples current traffic.

Yaqin Hei·Jul 5, 2026·13 min read·👁 2·EN · 中

A Pretty Accuracy Number Hid Dozens of Money-Moving Errors — How to Read the Eval to Ship

On a money-moving project I ran, the overall accuracy looked great; but pull the money-moving intents out on their own and the wrong-action rate was alarming — dozens of money-touching errors sat there the whole time, hidden by one blended number. In 5 minutes you'll see through "one accuracy figure to request launch"; in 10 you'll put a separate wrong-action gate on money-moving errors; in 20 you'll have a launch-decision flow: CI lower bound + per-scenario version cut + per-channel ramp.

Yaqin Hei·Jul 5, 2026·13 min read·👁 1·EN · 中

Make the Agent Get Sharper With Use, Not Dumber — Spinning Up the Data Flywheel

In labeling we found back-to-school promo questions spiked, but within them "how do I get the gift-with-purchase" was never answered well — so we fixed that KB specifically, and next round that class climbed. That's the data flywheel. In 5 minutes you'll see the first five posts are five spokes of one wheel; in 10 you'll spot why a flywheel spins in place (data piles up, never ships); in 20 you'll have a feedback loop ranked by frequency × cost that also forces you to split intents.

Yaqin Hei·Jul 5, 2026·12 min read·EN · 中

Your Code Is Fixed. Production Isn't.

"Didn't we fix that last week?" The code merged, the tests were green, and production keeps making the same mistake. In 10 minutes you'll leave able to puncture that illusion in a review: ask which machine's .env, make engineering prove the safety threshold has ever fired in prod, and byte-compare before you reindex.

Yaqin Hei·Jul 5, 2026·12 min read·👁 1·EN · 中

Your Labels Are Your Ceiling — One "Swap Half a Size Up" Gets Three Answers From Two Agents

A customer's one line — "I want to swap half a size up" — hides four calls: exchange or return, intercept the shipment or not, refund the price difference or not, and against which price. Two skilled agents label the same 50 rows back-to-back and agree on only 35 — your 96% accuracy was measured with a ruler that's only 70% self-consistent. In 20 minutes you'll have a flow for measuring agreement first, then writing rules like "which price the difference is refunded against" into a rubric.

Yaqin Hei·Jul 4, 2026·13 min read·👁 3·EN · 中

50 Rows at 96%, Ship It? Size Labeling by the CI Lower Bound, Not the Pretty Number

"We labeled 50, 96% correct — ship it?" No — the statistical lower bound is only 86%. When I designed the labeling gate for a customer-service agent, the first rule I pinned was: read the confidence-interval lower bound, not the point estimate. In 5 minutes you'll see through the small-sample 96% mirage; in 10 you'll have a "true accuracy → rows to label" table that explodes near the threshold; in 20, a CI stopping rule — label more where it's needed, don't waste it where it's done, and call a stop when the answer is fix-the-agent.

Yaqin Hei·Jul 3, 2026·12 min read·👁 1·EN · 中
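
The teaser above does not say which interval it uses. As a sketch under that caveat, a 95% Wilson score interval (an assumption, as is the function name) puts the lower bound for 48 correct out of 50 close to the 86% quoted.

```python
import math


def wilson_lower_bound(correct: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (center - margin) / denom


# 48 of 50 correct is the 96% point estimate from the teaser.
lower = wilson_lower_bound(48, 50)
print(f"point estimate 0.96, lower bound {lower:.3f}")
```

With only 50 rows the interval is wide, which is why the gate reads the lower bound and not the point estimate.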

The 96% Turned Red the Moment We Split It by Channel — Two Fatal Failures of Eval Sampling

The vendor drew 200 rows at random and reported "96% accuracy." I split those 200 by channel — in the small public-domain channel, the price-difference cases were nearly all wrong, and the high-traffic private channel had diluted it into a pretty number. In 5 minutes you'll see through both failures of random sampling (missing the tail + a big channel drowning a small one); in 20 you'll have a stratified frame with channel drill-down and risk-layer over-sampling.

Yaqin Hei·Jul 2, 2026·13 min read·👁 2·EN · 中

When the AI Becomes the Storefront, You Decay Into a Supplier: The Relationship Moat for 10 Million Merchants

WeChat is beta-testing an AI that orders and pays for users, and Qwen just opened a brand agent — two routes pointing at the same question: once the AI is the storefront, does the customer still belong to you. This one is for owners and membership / CRM leads: in 10 minutes you'll see how a super app slowly grinds you into a 'supplier,' spot the door Qwen's brand agent leaves open, and walk out with the 3 boundaries to hold (membership / repurchase / profile) plus the questions to put to the platform next week. WeChat is still read-only, no writes — build the relationship layer now, before the write permissions land and you find out you've been running naked.

Yaqin Hei·Jun 30, 2026·10 min read·👁 2·EN · 中

AI Picked the First Store and the Other Four Vanished: WeChat's New Shelf for 10 Million Merchants

WeChat is beta-testing an AI agent that can search, compare, order, and pay across ~10 million merchants. This one is for owners and growth leads: in 10 minutes you'll see how the AI entry rewrote the physics of traffic distribution, spot the 3 signals that your service is already invisible to the AI, and walk out with 5 questions to put to your platform and team next week. It doesn't bet on any specific API — it explains the one thing that's certain: the shelf's rules changed, and whoever reads the rules first grabs the slot.

Yaqin Hei·Jun 29, 2026·10 min read·👁 1·EN · 中

The Day WeChat's AI Ordered for Users, 10 Million Merchants' Mini-Programs Expired — A Layered Rebuild Blueprint

WeChat's AI agent has entered beta — it can search, compare, order, and pay across ~10 million merchants. This is the rebuild blueprint for architects and eng leads: which capabilities to scout with auto mode, which to package as SKILLs, how atomic APIs use a state machine to keep the AI on rails, and how to stitch the relationship layer back onto your side. In 30 minutes you walk out with a layered rebuild map + 10 things to do this week + 5 questions for the platform and vendors. The contract isn't frozen — so this post teaches you to design by MODE, not to bet on a specific API.

Yaqin Hei·Jun 24, 2026·30 min read·👁 7·EN · 中

Your Observability Dashboard Is Throttling the Agent It Watches — an Async Latency Postmortem

A P99 latency spike every few minutes — CPU, QPS, error rate all flat, Agent code unchanged. The culprit no one suspects: the observability dashboard built to watch the Agent was choking it. The postmortem — how one sync call freezes a single-threaded async event loop, the two-line fix (to_thread + TTL cache), and 10 event-loop probes you can add to your async service this week.

Yaqin Hei·Jun 10, 2026·20 min read·👁 8·EN · 中
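
A sketch of what a to_thread + TTL cache fix can look like, using only the standard library. The function names, the 5-second TTL, and the stand-in blocking call are assumptions for illustration, not the code from the postmortem.

```python
import asyncio
import time

CACHE_TTL_SECONDS = 5.0  # hypothetical TTL
_cache = {"value": None, "expires_at": 0.0}
calls = 0  # counts how often the blocking call really runs


def fetch_dashboard_stats() -> dict:
    """Stand-in for a synchronous, blocking call such as a slow query."""
    global calls
    calls += 1
    time.sleep(0.05)
    return {"active_sessions": 42}


async def get_dashboard_stats() -> dict:
    now = time.monotonic()
    # Serve from the cache while the entry is still fresh.
    if _cache["value"] is not None and now < _cache["expires_at"]:
        return _cache["value"]
    # Run the blocking function in a worker thread so the event loop keeps serving.
    value = await asyncio.to_thread(fetch_dashboard_stats)
    _cache["value"] = value
    _cache["expires_at"] = time.monotonic() + CACHE_TTL_SECONDS
    return value


async def main() -> list:
    first = await get_dashboard_stats()
    second = await get_dashboard_stats()  # within the TTL, so served from cache
    return [first, second]


results = asyncio.run(main())
print(calls)
```

Concurrent callers that arrive while the cache is cold can still each trigger the blocking call. A lock or a shared in-flight task would close that gap if it matters.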

What a Real, Money-Moving L2 Refund Workflow Actually Looks Like | Workflow Deep Dive

The refund flow on the slide is a clean 8-step line. Built for real, that line is only a fifth of it — the other four-fifths decide whether to get to the payout at all. This deep dive takes apart a real L2 refund workflow: from linear 8 steps to a branching tree, why most of the code isn't refunding but not-refunding, why limits must live in a DB table and not the prompt, why every leaf with an unwired external system defaults to a human. Twenty-five minutes in, you can take this skeleton to a vendor and ask where their refund workflow's guardrails are.

Yaqin Hei·Jun 3, 2026·25 min read·👁 10·EN · 中

Your KB Changed. The Search Index Didn't — Anatomy of a 9-Day Silent Desync | KB-Ops Deep Dive

The same refund line: curl it locally, you get the new wording; curl it in prod, you get the 9-day-old 'not as good as Taobao.' Between the source file and prod sat one step I assumed was automatic and was actually manual. This is the engineering postmortem: two stacked silent-desync root causes + how 33 test-feedback rows cluster into 16 with one cause + why an all-green dashboard hid it + 10 gates you can add to your own KB pipeline this week. Twenty minutes in, you can find the same hole in your own source-plus-derived-index system.

Yaqin Hei·Jun 3, 2026·20 min read·👁 15·EN · 中

Self-Serve Rate ≠ Correct Rate — The Gates a Customer-Service Agent Must Clear Before Launch | Agentic AI in Practice (XIII)

The question at the review board — 'should self-serve rate be 65 or 90?' — crams two different axes into one number. A session that refunds the wrong amount still reads approved. Five minutes in you can see through a single-number 'self-serve rate 95%' report; ten minutes in you can build a 9-gate, 3-layer launch gate; twenty minutes in you can ask, at the review board, 'is this red line actually reconciled, or is it just falling back to a human because the API isn't wired yet?' — the kind of question that exposes a fake green checkmark on the spot.

Yaqin Hei·Jun 2, 2026·15 min read·👁 9·EN · 中

The Org Chart Is the Real Architecture Diagram — 90% of Stalled Agent Projects Aren't a Tech Problem | Agentic AI in Practice (XII)

Annotation delivered, eval baseline built, four scenarios shipped, and the project still stalled for three weeks. The root cause wasn't code; it was five roles marked 'TBD' on the RACI sheet. Five minutes in you can see through 'the project team is already staffed'; ten minutes in you can draw the 3 roles an Agent rollout must add plus a one-page ownership table; twenty minutes in you can walk into a kickoff and ask 'who has the authority to mark this doc expired?', the kind of question that exposes an org gap on the spot.

Yaqin Hei·Jun 1, 2026·14 min read·👁 5·EN · 中

LLM Fact-Checking with a Verifier Agent: 11 of 34 Facts in a 25-Page AI Plan Were Fabricated

LLM fact-checking with a second, adversarial verifier agent. Vendor PPTs, AI-drafted emails, and Claude-written plans all carry a 20-30% fact-fabrication baseline. Five minutes to see why 'let the LLM check itself' is pseudo-verification; twenty to turn the DRAFT → VERIFY → FINALIZE gate + the R1-R7 error taxonomy into a PR checklist where every claim traces to a file:line or URL.

Yaqin Hei·May 29, 2026·13 min read·👁 13·EN · 中

Corpus Drives Codebook — Why Your Intent Taxonomy Is Stuck at 60% and How It Evolves from 36 to 48 | Agentic AI in Practice (X)

Customer-service Agent in production, 36 intents, unknown rate 40%, the business side asks 'can we just add an LLM fallback?' The real problem is not the classifier — it's the codebook itself. Five minutes in you can spot the wrong diagnosis ('unknown rate high = classifier weak'); ten minutes in you have the four-quadrant test that filters 80% of pseudo-missing-intent requests; twenty minutes in you have the corpus → codebook iteration loop that evolves a taxonomy from 36 to 48 stable intents.

Yaqin Hei·May 28, 2026·14 min read·👁 12·EN · 中

Don't Let AI Agents Call APIs Directly — A 5-Layer Tool-Calling Stack + 25-API Contract Checklist

The most common fake architecture in customer-service Agent projects this year: 'we let Agents call order / ticket / logistics APIs directly, 25 integrations done, full coverage.' Then come the questions. 'What happens when the OMS vendor changes?' 'Rewrite.' 'How does QA do mock integration?' 'Wait for the real interface.' 'Compliance audit for write operations?' 'We'll add logging.' What's missing is layers. Written for architects, founders, and project owners running enterprise Agentic projects: 5 min to spot the most expensive architecture mistake, 10 min to lock in the 5-layer responsibility split (Adapter / Service ABC / Tool / Workflow / Critic), 20 min to walk out with a 6-systems × 25-APIs integration matrix + 5 architecture decisions to drive this week.

Yaqin Hei·May 27, 2026·17 min read·👁 21·EN · 中

Pytest-Green Doesn't Mean Ship-Ready: How to Actually Test an AI Agent (Dual-Track)

The easiest way for a customer-service Agent project to get fooled this year: 'pytest 400+ green, coverage 79%, CI gate passing.' Then the boss asks 'what's the faithfulness rate? Tone compliance? Prompt-injection block rate?' and nobody answers. The 'tests passed' bar for an AI system is not the 'tests passed' bar for traditional software. This piece is for architects, founders, and project owners shipping Agentic AI inside an enterprise: 5 min to see why pytest-green is misleading, 10 min to decide who owns which 4 of the 8+ test buckets, 20 min to walk out with a 7-quality-dimension threshold table + 3-cadence rhythm + 5 things to drive this week. Bring it to your next architecture review.

Yaqin Hei·May 25, 2026·16 min read·👁 18·EN · 中

Intent Classification for Chatbots: Why Pure-Rule and Pure-LLM Both Fail (a 3-Tier Cascade)

Intent classification is the first node in any customer-service Agent — get it wrong and the next four architecture decisions are wasted. Pure-rule is brittle; pure-LLM blows the budget. The 3-tier fallback (rule → embedding → LLM) is the only engineering trade-off that stands up. Five minutes in you can spot the two fake architectures ('just use an LLM' / '100% rules'); ten minutes in you have starting thresholds for all three tiers; twenty minutes in you have the signals that say it's time to evolve from HybridClassifier to LLM Router.

Yaqin Hei·May 25, 2026·16 min read·👁 44·EN · 中
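
The rule → embedding → LLM cascade from the teaser above can be sketched as a function that tries the cheapest tier first. The threshold value, the function names, and the toy stand-ins are all hypothetical; the post's actual starting thresholds are not shown in this listing.

```python
from typing import Callable, Optional, Tuple

EMBEDDING_THRESHOLD = 0.80  # hypothetical cutoff for trusting the embedding tier


def classify(
    text: str,
    rule_match: Callable[[str], Optional[str]],
    embedding_match: Callable[[str], Tuple[str, float]],
    llm_classify: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (intent, tier) from the cheapest tier that is confident."""
    intent = rule_match(text)
    if intent is not None:
        return intent, "rule"
    intent, score = embedding_match(text)
    if score >= EMBEDDING_THRESHOLD:
        return intent, "embedding"
    # Only the leftover traffic pays for an LLM call.
    return llm_classify(text), "llm"


# Toy stand-ins so the sketch runs without any model or vector store.
def toy_rules(text: str) -> Optional[str]:
    return "refund" if "refund" in text.lower() else None


def toy_embedding(text: str) -> Tuple[str, float]:
    if "where is my order" in text.lower():
        return "order_status", 0.9
    return "unknown", 0.1


def toy_llm(text: str) -> str:
    return "other"


a = classify("I want a refund", toy_rules, toy_embedding, toy_llm)
b = classify("Where is my order?", toy_rules, toy_embedding, toy_llm)
c = classify("Hmm", toy_rules, toy_embedding, toy_llm)
print(a, b, c)
```

Returning the tier alongside the intent makes it easy to log how much traffic each tier absorbs.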

Agent Skills vs Knowledge Base: Why Stuffing SOPs Into RAG Doesn't Make an Agent Capable

Every other vendor review someone asks: 'where's the MCP-style protocol for Skills? How are we supposed to ship without one?' The question is backwards: no protocol coming isn't a bad thing — it's the signal that you can start now. Five minutes to see through 'we put all our SOPs in the knowledge base, that's our Agent shipping' pitches; ten minutes to use a three-line test that surfaces every fake Skill in your design; twenty minutes to draft an enterprise Skill spec for your team.

Yaqin Hei·May 24, 2026·16 min read·👁 17·EN · 中

Containment Rate vs Resolution Rate: The Only Customer-Service AI Metric That Matters (How "98% CSAT" Gets Faked)

The CEO gets a weekly email from the vendor: CSAT 98%. I pulled the raw data — ~5% of customers rated 'satisfied,' a fraction of a percent rated 'unsatisfied,' 95% never responded. 'Silent = satisfied by default' is how that 98% got built. Five minutes to see through four flavors of fake-resolution claim; ten minutes to redraw your team's customer-service north star.

Yaqin Hei·May 22, 2026·16 min read·👁 23·EN · 中

Deploy and Abandon — The Costliest Misconception in AI Agent Projects | Agentic AI in Practice (IV)

My boss graded my Critic design a B, reasoning: 'this is for Apple-scale companies, we're not Apple.' That sentence is the single most expensive misconception in AI Agent adoption. Five minutes to see through the six hollow spots in a 'deploy and abandon' proposal; ten minutes to walk into a vendor review armed with four questions they can't answer.

Yaqin Hei·May 18, 2026·16 min read·👁 16·EN · 中

Why a 70% Critic Beats a 95% Critic — A Fail-Closed Design Deep Dive | Agentic AI in Practice (III)

A Critic second-pass review is the only thing standing between an L2 customer-service Agent and a refund mistake. But the '95% automation rate' vendors keep showing you is almost always fail-open — Critic times out, the action passes through. Five minutes to see through three flavors of fake backstop; ten minutes to redraw your team's design.

Yaqin Hei·May 17, 2026·22 min read·👁 39·EN · 中
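
A minimal sketch of the fail-closed behavior described in the teaser above: when the Critic times out or errors, the action is not executed and goes to a human. The Critic stub, its approval rule, and the timeout value are hypothetical.

```python
import asyncio


async def critic_review(action: dict, delay: float) -> bool:
    """Hypothetical Critic; `delay` simulates how long the review takes."""
    await asyncio.sleep(delay)
    return action["amount_cents"] <= 5000  # hypothetical approval rule


async def gate(action: dict, delay: float, timeout: float = 0.05) -> str:
    try:
        approved = await asyncio.wait_for(critic_review(action, delay), timeout)
    except asyncio.TimeoutError:
        # Fail-closed: no verdict means no payout, so hand off to a human.
        return "escalate_to_human"
    except Exception:
        # Any Critic error is treated the same way as a missing verdict.
        return "escalate_to_human"
    return "execute" if approved else "escalate_to_human"


action = {"type": "refund", "amount_cents": 2500}
fast = asyncio.run(gate(action, delay=0.0))  # Critic answers in time
slow = asyncio.run(gate(action, delay=0.5))  # Critic exceeds the timeout
print(fast, slow)
```

A fail-open design would return "execute" in the timeout branch, which is the pattern the teaser warns about.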

Five Architecture Decisions That Determine Whether Your Customer-Service Agent Can Ship | Agentic AI in Practice (II)

A customer-service Agent looks like the perfect candidate for L3 multi-Agent orchestration. The ones that actually ship are all L2 deterministic workflows. A refund the autonomous chain pushed through by mistake, and the five forks it forces you to think about.

Yaqin Hei·May 16, 2026·18 min read·👁 20·EN · 中·▶ Video

AI Agent Autonomy Levels: I Audited 28 'Agent' Projects — Only 5 Passed L0–L3

AI agent autonomy levels, made practical: I audited 28 enterprise AI projects — only 5 were real Agents. The rest were 'automation with an LLM bolted on,' or slideware. Here's the 4-level autonomy test (L0–L3) to grade any AI project in 5 minutes.

Yaqin Hei·May 12, 2026·20 min read·👁 52·EN · 中

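4 Hours → 30 Minutes: An Indie Developer's Field Notes on Automating WeChat Official Account Formatting with Claude Code

A real record of going from a Python script to a full product. Claude Code + the WeChat Official Account API cut Official Account content creation from 4 hours to 30 minutes. Covers technology choices, architecture design, API pitfalls, and cost comparison. A must-read for indie developers.

Yaqin Hei·Mar 7, 2026·12 min read·👁 37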

Which 4 of Your 28 'Smart-X' AI Agent Projects to Start With — Retail Agentic AI Handbook (Part 1)

Your boss just handed you 28 'smart-X' projects and wants them all done this year. You can't do them all. Here's a 28-scenario priority map — five minutes to know which are P0 and which to defer until the data foundation is in place; twenty minutes to walk into your next AI strategy meeting with '4 P0s + 5 Week-One decisions.'

Yaqin Hei·Feb 28, 2026·20 min read·👁 6·EN · 中

AI Agent Knowledge Base: 3-Layer Design + 4-Bucket Cost Estimate

Spot vendor 'just dump docs into a vector DB' proposals in 5 minutes. 3-layer knowledge base architecture + 4-bucket cost estimate for production Agents.

Yaqin Hei·Feb 28, 2026·25 min read·👁 6·EN · 中

80% of Failed AI Agents Die in Ops, Not Tech — Post-Launch Loop, Safety Layer & 30-Day Monitoring Plan

Launch is the start, ops is the game. Five minutes to judge whether your AI project is quietly degrading; twenty minutes to walk out with a complete SOP — 6 KPIs + Critic pseudocode + 5 prerequisites for headcount reduction + 30-day plan covering every day from signing to Alpha launch.

Yaqin Hei·Feb 28, 2026·25 min read·👁 8·EN · 中

Catastrophic Forgetting in LLMs: 52 Domains Fine-Tuned, the Earlier 51 Regressed — A Dual-Replay Field Report

Sequentially fine-tuned across 52 product domains, NLU F1 on earlier ones dropped 1-2 points each time (BWT -7.2). Dual-Replay (9M adapter params + 20% dual-stream replay) pulled BWT to -4.7 (35% less forgetting), p99 under 100 ms. Five minutes in, you can tell real improvement from dashboard noise; thirty in, you have five forgetting failure modes plus five questions for any vendor.

Yaqin Hei·Oct 13, 2025·30 min read·👁 14·EN · 中

Reward Hacking in AI Agents: Trained 60,000 Steps, the Agent Learned to Delete Tickets (6 ITSM Patterns)

I built an ITSM Agent research environment fit on real ServiceNow ticket data. After 60,000 training steps, DQN and PPO both hit 100% hacking rates — every ticket handled by some cheating shortcut, zero genuine resolutions. This is the engineer's-eye debrief: six ITSM-specific reward-hacking patterns + why your dashboard won't catch them + ten things your team can do this week.

Yaqin Hei·Oct 10, 2025·30 min read·👁 19·EN · 中