Brain and Hands — Why Stuffing SOPs Into a Knowledge Base Doesn't Make Your Agent Capable

Sixth piece in the Agentic AI in Practice methodology series. The first five covered "can the design ship + how not to abandon it after + what to actually measure" — L0-L3 grading, five L2 architecture decisions, fail-closed Critic, deploy and abandon, containment vs resolution. This one switches angles — why 90% of "we've shipped our Agent" projects are actually "we've shipped our RAG," and how to draw the line. 中文版:Agent 的脑和手——把 SOP 塞进知识库不等于 Agent 能做事.
"We've Ingested All Our SOPs" — The Most Common Fake Agent in Vendor Reviews
Customer-service Agent acceptance review for a retail client. Page 12 of the vendor deck:
"320 SOP documents / 1,400 policy clauses / 8,000 FAQs / 60,000 product talking points — all indexed. Recall@K 92%. RAG accuracy 87%."
The CEO nods. Then asks, almost in passing: "So if a customer says 'refund the order I placed yesterday,' the Agent can just do that?"
The deck doesn't have a page for that. The vendor says: "We'll add a refund Skill later." Which means: no.
That moment, everyone in the room realizes something at once. 320 SOPs + 1,400 policies + 92% recall only let the Agent know what the refund policy looks like. None of it lets the Agent actually issue a refund for the customer.
Every customer-service / replenishment / ticket-routing Agent project I've worked on this year has had the same blind spot at technical review — not the algorithm, not the model, not which vector DB to pick. It's this one: no one separates "knowing" from "doing".
Knowledge base fills up → ship. Recall accuracy climbs → ship. The first time a customer asks for a write action — the project derails on the spot.
This piece breaks down three things: (1) where brain and hands actually diverge in engineering, (2) a three-line test that exposes the fake Skills in your design, and (3) why "where's the MCP-style protocol for Skills?" is the wrong question.
Knowledge Base Is the Brain, Skill Is the Hands — One Table to Separate Them
Verdict up front: knowledge base and Skill aren't two implementations of the same thing. They're two completely different layers in Agent architecture. Conflating them is the design's original sin.
| Knowledge Base | Skill | |
|---|---|---|
| The question it answers | What the Agent knows | What the Agent can do |
| Content shape | Facts, policies, SOPs, product info, conversation corpus | Trigger + tool calls + constraints + output contract |
| How it's triggered | Retrieval (query comes in → recall) | Invocation (Agent decides an action needs to happen) |
| Key failure mode | Wrong retrieval → wrong answer | Wrong call → customer refunded $980 extra / coupon to the wrong person |
| Maintained by | Business / ops | Engineering |
| Updates when | Policy changes / new product | API changes / business rule changes |
| Audit unit | Document version | Call trace + inputs + outputs + decision basis |
| Reversibility | Re-embed with corrected doc | Refund issued = refund issued / ticket closed = ticket closed |
| How you test it | Recall precision, Recall@K | End-to-end reconciliation, idempotency, Critic miss rate |
The last two rows matter most: write actions are irreversible. A retrieval miss can be fixed next turn with a different prompt; a wrong Skill call leaves money out the door. That's why every Skill needs a Critic (series 3 is entirely on this) — a knowledge base doesn't necessarily.
One sentence: knowledge base answers "why is the customer upset"; Skill decides "do I refund them." The former needs RAG. The latter needs trigger / tool / constraint / Critic — the whole stack.
Same refund policy. Drop it in a knowledge base and the Agent can tell the customer "7-day no-questions-asked applies." Wrap it as a Skill and the Agent can actually trigger the refund, reconcile, leave an audit trail, and have a Critic block bad cases. The gap is invisible at L1, decisive at L2.A Three-Line Test: Is That "Skill" Really a Skill, or a Knowledge Base in Disguise?
I've used this in vendor reviews all year. Anything labeled "Skill," push back with these three questions:
Test 1 — Is it triggered by the Agent's judgment, or by the user's question shape?
- Knowledge-base trigger: user asks "how does refund work?" → embedding search → recall policy paragraph → LLM extracts key facts → answer.
- Skill trigger: user says "refund yesterday's order" → Agent judges (this is a write op + needs
order_id+ involves money) → decides to invokerefund_order.
If your "Skill" is fundamentally "question comes in → retrieve → say what was retrieved," it's not a Skill — it's a knowledge base under another name.
Test 2 — Does it have a write op, a side effect, or an external system call?
- Calling the order API to change status → side effect
- Calling IM to message the customer → side effect
- Calling CRM to tag the customer → side effect
- Reciting the refund policy → no side effect (read-only)
Anything read-only, no writes, no external system, no state change — it's a knowledge-base extension wearing a Skill costume. Calling it a Skill is marketing, not engineering.
Test 3 — Is the failure mode "wrong answer" or "wrong action"?
- Wrong answer → user sees the wrong reply → usually salvageable with a follow-up question → loss: experience
- Wrong action → system state changed → no "ask again to undo" path → loss: money, tickets, inventory, compliance
A wrong write op is not reversible — that's the real engineering signature of a Skill. If the failure mode is "wrong answer," it's a knowledge base. If it's "wrong action," it's a Skill, and that means you need Critic + canary + audit trail.
Any "no" on these three = that Skill is a knowledge base in disguise. Run this at the next vendor review and most of the 5-8 Skills in the deck collapse to 2-3 real ones.
Writing Rules in CLAUDE.md Doesn't Mean the Rules Get Followed — A Sibling Problem in This Blog's Repo
This same thing bit me running this blog for 18 months.
CLAUDE.md is now ~20k characters. It's full of rules — every one of them earned the hard way: "don't run npm run build while the dev server is on :3002," "in markdown destined for WeChat, **X**:desc strands the colon on mobile — write **X:** desc instead," "DB-backed pages need export const revalidate = N or counts freeze at build time," "never add an explicit distDir to next.config.mjs."
Every rule is written down — clear, with rationale and burn-site references. The question is: when a new conversation opens, will Claude (or me, six months from now) actually follow them?
Answer: depends on whether the rule became execution.
| Rule | Sat in CLAUDE.md (knowledge) | Promoted to code (Skill) | State |
|---|---|---|---|
Don't npm run build while dev server on :3002 | "Gotchas" section | scripts/check-dev-server.js prebuild hook — detects port + aborts with rationale | ✅ Skillified |
**X**: in markdown strands the colon on mobile WeChat | notes/2026-05-18-wechat-colon-orphan.md | scripts/mdx-to-wechat-html.mjs one-line regex before marked.parse auto-rewrites | ✅ Skillified |
API endpoints touching the Post table must call ensurePostInDB(slug) | In CLAUDE.md | lib/db.ts exposes ensurePostInDB(), view/clap endpoints both call it | ✅ Skillified |
Don't add explicit distDir to next.config | In "Gotchas" with 6-hour outage burn note | No lint / hook enforcing it | ⚠️ Still knowledge |
Don't write <2% in MDX (parsed as JSX, breaks Vercel build) | In CLAUDE.md with case study | No pre-commit check | ⚠️ Still knowledge |
When ZH-first shipping, strip alternateLang from frontmatter | Written in series production notes | No script check | ⚠️ Still knowledge |
The first three became scripts and stopped depending on anyone remembering. The last three are still prose — every new conversation has a probability of missing them.
This is the engineering definition of a Skill: a rule that sits in docs is a hope. It only becomes a capability the moment execution-time enforces it.
18 months running this blog accumulated ~20k chars of CLAUDE.md. Every rule has a non-zero miss probability next session — until it becomes a prebuild hook, a marked pre-processor, or anensurePostInDB function. Customer-service Agents have the same shape: knowledge doesn't convert to action on its own.
Same problem in a customer-service Agent:
- "7-day no-questions-asked refund" sitting in the agent training manual → knowledge base
- "7-day no-questions-asked refund" wired into the order page + a refund button that auto-grays at day 8 → Skillified
The first means "every agent knows the policy." The second means "the policy is enforced by the flow." The line between them is whether code (or a deterministic process) does the enforcing. Same line for Agents.
"Why Isn't There an MCP-Style Skill Protocol?" — That Question Is Backwards
Every two weeks someone asks this at a vendor review. Different wording — "is there an industry Skill standard?" "should we wait for OpenAI to publish a protocol?" "does Anthropic's Skills count as an industrial standard?" — but the underlying question is the same: should we wait for a standard before we start?
No. The question is backwards, for three reasons:
Reason 1 — MCP and Skill aren't at the same layer. MCP solves transport: how the Agent calls external tools, how tools expose capability, how auth / channels / serialization work. That's protocol-layer stuff — it can be standardized.
Skill is policy: whether this tool is allowed at all in your customer-service context, whether to check membership tier before calling, whether failures should auto-compensate, what threshold escalates to human. That's business-rule packaging, not transport.
The industry isn't going to publish a unified protocol that tells you "customer-service Agent refunds above $500 should escalate to human," because every company is different. If your AOV is $30, the threshold might be $50. Your competitor's AOV is $300, theirs is $5,000. A unified protocol at that layer doesn't exist — and it shouldn't.
Reason 2 — MCP has already covered the layer you're worried about. Your concern is "I'm integrating 5 SaaS systems, each with its own API, none of it reuses" — MCP solved that. Order system, ticket system, logistics system, CRM, IM — each expose capability via MCP server, the Agent calls them all the same way. That's protocol-level standardization, for real.
What's left — "how this tool gets used in this scenario, when to call it, what to do first" — that's supposed to be your company's own asset. It shouldn't wait for an external standard.
Reason 3 — Waiting is not starting. Every company I've heard say "we'll wait for Skill standards to settle" has shipped exactly nothing six months later. They're not waiting for a protocol. They're using "waiting for the protocol" as an excuse not to start.
The real layer map:
| Layer | Standard | State |
|---|---|---|
| Tool transport | MCP / OpenAPI / function-calling schema | Mature, just integrate |
| Knowledge retrieval | Embedding + RAG + rerank | Mature, plenty of open + closed options |
| Skill definition | YAML + prompt + tool refs + Critic | No external standard, and you don't need one |
| Multi-Agent orchestration | A2A / OpenAI Swarm / LangGraph | Early, designs still converging |
The middle two layers are your company's internal muscle. No external protocol can tell you how your customer-service Agent should handle refunds. Same shape as the ERP/OMS era — SAP didn't ship a protocol telling you how to tag SKUs, it gave you the fields and the workflow and you defined the tagging. Skills are at the equivalent position in the Agent era.
What a Real Skill Looks Like — More Than YAML, Five-Plus Blocks
Most folks see "Skill defined in YAML" and think they're done. Not enough. A shippable Skill has at least six blocks. Below is the minimum spec my projects have converged on:
skill:
name: "refund_order"
description: "Initiate a refund for an unshipped order, or one within 7 days"
# 1. Trigger: who calls it, how it knows to fire
trigger:
intents: ["refund_request", "cancel_order"]
required_entities: ["order_id"]
optional_entities: ["refund_amount", "reason"]
# Critical: trigger isn't just intent match — required entities must
# also be present. Missing order_id → route to "clarify intent" branch,
# not this Skill.
# 2. Knowledge refs: what policies to check before / during the call
knowledge:
must_check:
- "refund_policy.7day_no_reason" # 7-day no-questions-asked
- "refund_policy.shipped_status" # shipped vs unshipped
- "member_tier.refund_priority" # tier impacts processing
# 3. Tool calls: what gets called, with what args, what to do on failure
tools:
- server: "order-service"
action: "query_order_status"
timeout_ms: 1500
on_failure: "escalate_to_human"
- server: "order-service"
action: "initiate_refund"
idempotency_key: "refund:{order_id}:{timestamp}"
timeout_ms: 3000
on_failure: "rollback_and_escalate" # rollback must be possible
# 4. Constraints: what this Skill is NOT allowed to do, in this company,
# in this scenario. Business sign-off lives here.
constraints:
- "refund_amount must be <= order_total_paid"
- "refund_amount > 500 → must escalate to human for second approval"
- "same order_id cannot be called twice within 24h"
- "user is not KYC-verified → cannot invoke"
- "02:00-08:00 local time → escalate (risk control window)"
# 5. Critic: pre-call + post-call double review
critic:
pre_call:
- check: "amount reconciles: amount == queried_order_total"
- check: "status reconciles: order_status in ['paid','shipped_not_delivered']"
- check: "time window: now - order_create_time <= 7 days"
timeout_action: "fail_closed" # timeout = escalate (series 3)
post_call:
- verify: "refund record written"
- verify: "user notified"
- verify: "third-party reconciliation matches"
timeout_action: "alert_human"
# 6. Output contract: how the result flows back to the conversation
output:
success:
template: "Refund initiated. Expected X business days to settle. Order: {order_id}"
failure:
template: "[Routing to human] Connecting you to an agent who'll follow up on the refund. One moment."
attach_context: true # human sees the Agent's decision chain
The four critical blocks: constraints + critic + output contract + irreversible-action handling.
Constraints is your company's risk policy, packaged. "Above $500 escalates" is your threshold; the company down the street's is $5,000. Only you can set this — no external standard helps.
Critic is the write-op safety net. pre_call checks the args reconcile; post_call checks the state changed correctly. Any failure → escalate to human (series 3 fail-closed). A Skill without a Critic is a production incident on a delay timer.
Output contract decides whether the Agent can hand context back to the rest of the conversation. Including the failure template and the context payload when handing to a human. Vendors skip this by default because demos only show happy path — failure-path output contracts never get written.
Without these three, the rest of the YAML is just a thin API wrapper, not a Skill.
Five Failure Modes — Why "We'll Just Use RAG" Projects Crash in Six Months
The projects I've worked on this year where the vendor's pitch started with "we'll put SOPs in the knowledge base + LLM and that's enough." Five typical six-month-out failure modes:
Mode 1 — Knowledge drift. Policy changes, knowledge base doesn't get updated, Agent recites the old policy. Happens weekly in customer service: SKU delisted, promo expired, rule changed. RAG architecturally cannot catch this — it has no concept of knowledge staleness. Skillified version: pre_call Critic queries the live policy endpoint, doesn't rely on the embedded copy.
Mode 2 — Hallucinated execution. Agent retrieves the "7-day no-questions-asked refund" policy doc and fabricates a response that the refund has been initiated. Customer thinks it's done. Two days later they notice no money. Comes back furious. Seen this in at least 3 projects this year. RAG can't stop it — in a read-only chain, the LLM saying "refund initiated" costs nothing to invent. Skillified version: the Agent literally cannot say that string without calling initiate_refund.
Mode 3 — No audit trail. Customer disputes "why was I refunded $200 over?" Company wants to trace who approved it and why. RAG mode only shows which policy paragraphs were retrieved — nothing about why this number was chosen. Skill mode logs the whole chain: trigger → critic → tool call → response, with inputs and outputs at every step. That trace decides whether you can assign responsibility in an incident review.
Mode 4 — Prompt-only governance. All constraints live in the prompt: "always escalate amounts over $500," "no refunds overnight," "remember to check membership tier." LLMs comply ~70-80% of the time. The other 20-30% they don't. The longer the prompt, the worse this gets. Skillified version: constraints are code if-then. Deterministic enforcement, not LLM self-discipline.
Mode 5 — Version rollback is impossible. RAG: tweak one policy doc → rebuild the whole embedding index → impact unknown → rollback means another rebuild. Skill: YAML in git, every change diffs, canary controllable, one-button rollback.
Common thread: pure RAG has no engineering safety net for write actions. All safety is "the prompt reminds the LLM to behave." When the LLM doesn't behave, the failure hits production users.
This is why L1 can be RAG-only but L2/L3 can't (L0-L3 grading) — higher levels mean more writes, more engineering reliance, harder requirement for Skillification.
Path to Production: Five Skills First, Build the Library in Three Months
Don't wait for a protocol. Don't try to define 50 Skills at once. Start with the 5 that hurt most:
Month 1: Inventory what your customer-service Agent actually does write-action-wise. A typical L2 has 5-8: check order, initiate refund, create ticket, change address, grant coupon, mark complaint, escalate to human. Write a full YAML for each (all 5 blocks: trigger / knowledge / tools / constraints / critic).
Month 2: YAML files in git. Make a skills/ directory. One Skill = one PR = code review. Two sign-offs required: constraints signed by business, critic signed by engineering. Neither signs → doesn't ship.
Month 3: Same template, non-customer-service scenarios. Replenishment, slow-mover alerts, scheduling — same YAML, new triggers and tools. Three months in, you have an enterprise Skill spec.
This doesn't need an external protocol. Every company that's been "waiting" is still waiting three months later.
Five Questions to Ask Vendors Who Claim "We Already Have N Skills"
Next time a vendor deck says "we've built X Skills," ask these five. Any "can't answer" = that's not a Skill.
Question 1: Show me the YAML for five of your Skills. A real Skill has at least trigger + tools + constraints + critic blocks. If what they show you is "we have skill_refund.py with some API-calling code in it" — that's a wrapper, not a Skill.
Question 2: Where's the Critic for each Skill? What does pre_call check, what does post_call check, what happens on timeout? "We don't have a Critic, the LLM decides" = end the review (series 3 is on why).
Question 3: Who sets the constraints, and what's the process to change them? Good answer: business and engineering both sign + PR-driven code review + canary rollout + one-button rollback. Bad answer: "they're in the prompt." Prompt-encoded constraints = no constraints.
Question 4: When a Skill call fails, how does context get handed to a human? Good answer: the human agent sees which Skill triggered, what knowledge was retrieved, what tool was attempted, and why it failed. Bad answer: "we just escalate to human" (strips context, human starts from zero).
Question 5: Of your N Skills, how many are write-ops and how many are read-only? Make the vendor draw the line. If 6 of 8 are "query + answer" and only 2 are real writes — the design covers L1 + L1.5. The L2 part is essentially undone.
Ten Things to Do This Week
- List every "capability" claim in your team's current Agent design and run each through the three-line test. Mark fake Skills vs. real ones.
- Pick the one write-action most likely to crash first (usually refund / address-change / coupon-grant). Fill in the 6-block YAML.
- Pull constraints into a separate doc for business sign-off — this is governance signing, not technical.
- Pull
pre_call/post_callCritic checks into a separate doc for engineering sign-off. - Pre-launch, run a miss-rate test: 100 boundary cases (amount exactly $500 / order exactly day 8 / 02:01 timestamp). Count how many the Critic catches.
- Write a Skill-review checklist for the wall. New-Skill PRs must walk through this table.
- Reconcile knowledge-base content vs Skill constraints — anywhere the KB says "cannot / must / if over X then Y," the Skill constraints must have the equivalent if-then.
- In your team's CLAUDE.md / SOP, pick the rule that gets forgotten most — write it as a script / hook. Make it unforgettable.
- Add idempotency keys to every write-op Skill (typical format
<action>:<entity_id>:<timestamp>) to prevent duplicate triggers. - Bring the five vendor questions to your next review.
Closing: Brains Can Memorize SOPs, Hands Need Flow Enforcement
The knowledge-base / Skill distinction isn't a naming exercise. It decides whether your design has any safety net when the first write-op incident hits.
"We ingested all our SOPs" = "every agent has memorized that policy." True, and entirely separate from "the refund flow works." The latter needs: button-press triggers a real refund, reconciliation matches, errors are traceable to a person — and that's what Skillification engineers, not knowledge-base engineering.
People will keep asking "where's the MCP-style Skill protocol?" Let them wait. The companies that started defining their 5-8 Skills this year have real L2 Agents running three months in — they've stopped asking that question.
If your vendor's deck claims 5+ Skills but real write-op Skills count is ≤2 and none of them has a full Critic — that's a fake-Agent pitch. Hand them the 5-block YAML above. Fixing one Skill to actually-shippable is far easier than fixing five, and far cheaper than one production incident.
Get the Toolkit
If you want the tools from this piece for your next vendor review or internal Skill definition, I put together a PDF kit:
Send me the keyword "SKILL KIT" and I'll send the pack:
- Full 6-block Skill YAML template + 4 filled examples (refund / cancel / address-change / coupon-grant)
- The "real Skill vs knowledge-base in disguise" three-line test card (vendor reviews + internal reviews)
- Skill review checklist (9 items every PR must walk through + the two sign-off fields)
- Five vendor questions with good / bad answer samples
A year of work in customer-service Agent projects compressed into one toolkit. Yours.
(Channels in the footer — X or email both work.)
What's Next in This Series
- Three-tier intent cascade tuning — rule + embedding + LLM fallback, thresholds you can actually defend (deep dive on series 2 decision 2)
- LLM into enterprise SaaS — 25 APIs → 5 tools, contract and idempotency design (deep dive on series 2 decision 4)
- How to test AI systems — dual-track architecture + 7 quality dimensions + SSE spike
- Enterprise LLM Agent war stories — async ES blocking the event loop / Critic timeout fallback misuse / proxy intercepting localhost / intent rename across 7+ files / KB-ES desync
If your team is shipping an L2 customer-service Agent or another write-op-heavy Agent, those four are next.
Subscribe for updates
Get the latest AI engineering posts delivered to your inbox.

