AI Agent Knowledge Base: 3-Layer Design + 4-Bucket Cost Estimate

Yaqin Hei·February 28, 2026·25 min read

Agentic AI · Knowledge Base · RAG · Model Selection · AI Cost

This is the English edition of Part 2 in the Retail Enterprise Agentic AI Handbook: the technical architecture choices for shipping a customer-service Agent. Part 1: Which 4 of Your 28 'Smart-X' Projects to Start With. Chinese edition: Retail Enterprise Agentic AI Handbook (Part 2): Knowledge Base Caps the Ceiling, the Model Is Just a Tool.

Opening: The Vendor Just Demo'd a CS Agent and Said "Latest LLM + Vector Database" — Here's the Next Question You Should Ask

Architecture review, vendor finishes demoing a CS Agent —

"We use the latest LLM, paired with an industry-leading vector database, 95% accuracy."

The CTO is taking notes — how was that 95% measured? On what scenarios? Who wrote the Q&A in the knowledge base? Is it still 95% if you rephrase the same question 5 ways? Is the accuracy for product-SKU lookup the same as for return-policy queries?

The vendor doesn't answer any of these. But the price tag is already 2M RMB.

In the CS Agent projects I've seen this year, 80% of the money got spent on the wrong things — on the most expensive LLM (the domestic ones are good enough), on the fanciest vector DB (the orchestration platform's built-in is enough), on the most complex engineering (the knowledge base is the actual ceiling).

What actually decides whether an Agent ships are three questions engineers don't like discussing — and management has to ask clearly —

How is the knowledge base built? This caps Agent quality — doc quality directly determines the Agent's capability ceiling
How is the model chosen? Between domestic / overseas / mixed, in CS scenarios, which 4 specific capabilities actually differ?
Across four cost buckets, where does the money go? Inference is only 5%; headcount is the biggest — and most budgets get this ratio backwards

Five minutes in, you can spot at your next architecture review whether the vendor's proposal is "dump docs into a vector DB" cosplay. Twenty minutes in, you can hand your boss a "3-layer knowledge base + Model B over Model A + 4-bucket cost estimate" technical plan.

1. The Knowledge Base Isn't a Pile of Docs — It's Three Different Layers, and Error Cost Decides How You Build Each

Put the verdict on the table first: the knowledge base's error cost per layer determines how you build it — not one big pile of docs, three completely different layers, each with a different processing strategy.

Layer	Content	Example	Processing Strategy
Layer 1 (Structured Rules)	Return policy, brand-authorization rules, logistics SLAs	"7-day no-questions-asked return conditions," brand-specific return policy variants	100% accuracy required, use hard rules — bypass the LLM entirely
Layer 2 (Product Knowledge)	Product attributes, sizing, materials, use scenarios	"This running shoe's outsole material, target running style, comparisons within the series"	Main source for product-query responses
Layer 3 (Experience)	Best answers to high-frequency questions, edge cases, complaint-soothing scripts	"Tone when a customer is emotional, framing for delayed-shipping explanations"	Hardest to build, most valuable — mined from historical conversations

Why three layers — error cost differs

Layer 1 error = direct customer complaint. AI gets the return policy wrong (tells a customer "30 days" when it's actually 7) — the customer escalates, brand partners get involved. So Layer 1 must return the rule verbatim, not via LLM generation — having the LLM "paraphrase" a rule is planting a landmine
Layer 2 error = degraded experience. Wrong product knowledge (recommended the wrong size) — customer returns, but doesn't escalate
Layer 3 missing = effectiveness ceiling. Without an experience layer, AI escalates to humans on anything complex and first-call resolution stalls

Detection signal: Vendor proposal "vectorizes everything uniformly" — straight fail. This is the most common 2026 design error; Layer 1 rules rephrased by the LLM produce higher complaint rates than the pure-human baseline.
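
A minimal sketch of the Layer 1 bypass described above, assuming a hypothetical in-memory rule store (`LAYER1_RULES`) and a caller-supplied `generate` function standing in for the LLM. The rule text is a placeholder, not a real policy. The point is only that Layer 1 answers never pass through generation.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical Layer 1 store: rule text is kept and returned exactly as written.
LAYER1_RULES = {
    "return_window": "PLACEHOLDER: 7-day no-questions-asked return conditions",
}


@dataclass
class Answer:
    text: str
    source: str  # "rule" (verbatim) or "llm" (generated)


def answer(layer: int, key: str, generate: Callable[[str], str]) -> Answer:
    """Route by knowledge layer: Layer 1 never reaches the LLM."""
    if layer == 1:
        # A missing rule raises KeyError instead of silently falling back to generation.
        return Answer(LAYER1_RULES[key], "rule")
    return Answer(generate(key), "llm")
```

Raising on a missing rule is a deliberate choice in this sketch: a gap in Layer 1 should surface as a knowledge-base ticket, not as a paraphrased guess.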

2. The Knowledge Base Doesn't Need to Be "Done" Before Launch — Three Stages, 200 → 800 → 2000

The knowledge base has no "done" state, only a "current quality score." Here's the staged minimum-viable definition —

Stage	KB Scale	Quality Target	Validation
Alpha (internal test)	200 Q&A pairs, TOP 50 questions covered	Top-5 recall > 60%	Internal testing, finding knowledge gaps
Beta (small-traffic launch)	800 Q&A, main flows covered	Top-3 recall > 75%, answer accuracy > 80%	10% live traffic with human safety net
Production (at-scale replacement)	2000+ Q&A, edge cases covered	Top-1 recall > 70%, first-call resolution > 65%	Start considering CS headcount reduction
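
The recall targets in the table can be checked mechanically. The sketch below assumes an evaluation set where each test question has exactly one correct Q&A id; `recall_at_k`, `STAGE_GATES`, and `passes_gate` are illustrative names, and only the recall thresholds (not answer accuracy or first-call resolution) are modeled.

```python
def recall_at_k(results: list, gold: list, k: int) -> float:
    """Fraction of queries whose gold Q&A id appears in the top-k retrieved ids."""
    if not gold or len(results) != len(gold):
        raise ValueError("results and gold must be non-empty and the same length")
    hits = sum(1 for ranked, g in zip(results, gold) if g in ranked[:k])
    return hits / len(gold)


# Recall gates copied from the stage table: stage -> (k, minimum recall).
STAGE_GATES = {"alpha": (5, 0.60), "beta": (3, 0.75), "production": (1, 0.70)}


def passes_gate(stage: str, results: list, gold: list) -> bool:
    k, threshold = STAGE_GATES[stage]
    # The table states "> threshold", so the comparison is strict.
    return recall_at_k(results, gold, k) > threshold
```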

Three-step construction

Step 1: Document diagnosis (1 week)

Collect every existing CS document: policy files, brand handbooks, training material, email notices
Sort by the three layers; assess each doc's freshness (an outdated policy is more dangerous than no doc)
Identify "oral knowledge" — important rules that exist only in senior staff's heads, undocumented

Step 2: Structural conversion (2-3 weeks) — the most critical and most labor-intensive step

Rewrite unstructured docs as Q&A pairs — this is the single most effective lever for retrieval accuracy
Each Q&A: standard question + standard answer + applicable conditions + source/owner + last-updated date
Break policy docs to the smallest granularity — don't treat the entire return-policy doc as a single retrieval unit
Phrase questions in customer language, not internal jargon (customers say "exchange," not "merchandise swap request")
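
One way to make the five required fields hard to skip is to encode them as a record type. This is a sketch, not a prescribed schema: the class name `QAEntry`, the `layer` field, and the 180-day staleness window are assumptions added for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class QAEntry:
    question: str        # standard question, phrased in customer language
    answer: str          # standard answer
    conditions: str      # applicable conditions (brand, channel, time window)
    owner: str           # source / owner accountable for the content
    last_updated: date   # drives the freshness check below
    layer: int = 2       # 1 = rules, 2 = product knowledge, 3 = experience

    def is_stale(self, today: date, max_age_days: int = 180) -> bool:
        """True when the entry has not been reviewed within max_age_days (assumed window)."""
        return (today - self.last_updated).days > max_age_days
```

Because every field except `layer` is required, an entry without an owner or a last-updated date cannot be constructed, which keeps the "outdated policy" risk from Step 1 visible at ingestion time.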

Step 3: Historical conversation mining (continuous)

Export CS-platform historical conversations, extract real-world phrasing variants of high-frequency questions
Find cases handled well by humans, formalize as standard scripts
Find escalated cases, analyze triggers, add to edge-case handbook
Target: 3-5 phrasing variants covered per high-frequency question
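
The variant target can be tracked with a small report over labeled conversations. The sketch assumes someone has already mapped each customer utterance to a canonical question id (the labor-intensive part); `variant_gaps` is a hypothetical helper that lists questions still below the lower bound of 3 variants.

```python
from collections import defaultdict


def variant_gaps(labeled, min_variants: int = 3) -> list:
    """labeled: iterable of (canonical_question_id, customer_utterance) pairs.

    Returns the question ids with fewer than min_variants distinct phrasings.
    """
    variants = defaultdict(set)
    for qid, utterance in labeled:
        # Normalize lightly so trivial duplicates are not counted as new variants.
        variants[qid].add(utterance.strip().lower())
    return sorted(q for q, v in variants.items() if len(v) < min_variants)
```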

Key reminder: Doc conversion is labor-intensive content work, not a technical task. Senior CS reps (not engineers) own content quality; engineering owns format and ingestion. If this staffing isn't locked in by Week One, knowledge-base quality becomes the project's biggest bottleneck.

3. "Dump the Docs Into a Vector Database" Is the Start of 3 Wasted Months

A common trap — treating doc vectorization as knowledge-base completion. In retail CS, pure vector retrieval has 3 specific weaknesses:

Customer asks "Can I return this Air Max 270?" — semantic search finds the general Air Max product description, not the return policy
Customer says SKU "AT4525-100" — vector retrieval offers nothing here; keyword exact match is what's needed
Conditional logic in policy docs ("if … then …") loses structural information after vectorization
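
The SKU weakness is the easiest of the three to guard against: detect the identifier before retrieval and send it to exact match. The sketch below assumes a SKU shape modeled on the "AT4525-100" example (two letters, four digits, hyphen, three digits); real catalogs will need their own pattern.

```python
import re

# Assumed SKU shape, inferred only from the "AT4525-100" example.
SKU_PATTERN = re.compile(r"\b[A-Z]{2}\d{4}-\d{3}\b")


def extract_skus(message: str) -> list:
    """Return SKU-like tokens found in the message, upper-cased."""
    return SKU_PATTERN.findall(message.upper())


def pick_retrieval(message: str) -> str:
    """Use exact keyword match when a SKU is present, hybrid retrieval otherwise."""
    return "keyword" if extract_skus(message) else "hybrid"
```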

The recommended retrieval strategy is hybrid retrieval —

Retrieval Type	Use Case	Implementation
Keyword search (BM25)	Precise SKU, brand name, policy keywords	Built into orchestration platforms; native in cloud managed search
Vector search (Semantic)	Vague-intent queries like "Is this shoe good for running?"	Needs an embedding model (e.g., cloud `text-embedding`)
Hybrid rerank	Combines both, picks top-K most relevant	Recommend a rerank model; supported by orchestration platforms and cloud managed search
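
To show what "combines both" can mean in practice, here is a sketch that merges a keyword ranking and a vector ranking with reciprocal rank fusion (RRF). RRF is a simpler stand-in chosen for illustration, not the rerank model the table recommends; the constant `k=60` is a commonly used default, and the document ids in the usage example are made up.

```python
def reciprocal_rank_fusion(rankings, k: int = 60, top_k: int = 5) -> list:
    """Merge several ranked lists of doc ids into one list.

    Each doc scores 1 / (k + rank) per list it appears in; scores are summed.
    """
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by doc id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]


bm25_hits = ["policy_return", "sku_AT4525", "faq_ship"]
vector_hits = ["product_airmax", "policy_return", "faq_size"]
merged = reciprocal_rank_fusion([bm25_hits, vector_hits], top_k=3)
```

A document that both retrievers surface (here `policy_return`) rises to the top, which is the behavior wanted for the "Can I return this Air Max 270?" case.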

Storage/retrieval tech selection —

Option	Use Case	Recommendation
Orchestration-platform built-in KB	Fast validation, team familiar with platform, doc volume < 50k	First choice for stage 1
Cloud managed search	High doc volume, hybrid retrieval, production stability	Recommended for production
Elasticsearch + vector plugin	Existing ES experience, complex filters (brand/category)	Only if you have the experience
Dedicated vector DB (Milvus etc.)	Massive-scale vector retrieval, pure semantic scenarios	Not recommended

Stage 1 (validation): use the orchestration platform's built-in knowledge base — zero learning cost for the team. Migrate to cloud managed search only once knowledge-base content has stabilized and conversation volume has ramped — don't over-engineer Day 1 for hypothetical "future scale."

Detection signal: Vendor recommends Milvus / Pinecone out of the gate — ask "we're doing thousands of tickets a day; do we need that scale?" If they can't answer, it's over-engineering.

4. Domestic Models Match Overseas on Simple Queries — The Real Gap Is in Edge Cases

Model selection one-liner: for simple queries domestic models are fully sufficient; the real risk isn't capability, it's hallucination rate in edge cases.

Three technical routes for retail CS AI —

Route A (overseas models): OpenAI / Anthropic tier
Route B (domestic models): Qwen / Baichuan / GLM tier
Route C (mixed): Domestic as primary + overseas as capability benchmark

Real A-vs-B gap: 5 scenarios compared, 4 with a meaningful difference

Scenario	Route A	Route B	Gap
Simple query (policy lookup, order status)	Accurate, well-formatted	Accurate, well-formatted	No meaningful gap — domestic handles it
Multi-turn dialogue (carrying context)	Stable context tracking, accurate coreference	Stable within 5 turns; loses key info beyond	Long conversations need extra context-management logic
Complex complaint handling (emotion + logic)	Detects emotion, adjusts tone, gives sensible resolution simultaneously	Logic OK, emotion detection and tone modulation noticeably weaker	Escalation rate may rise
Tool calls (lookup order, trigger refund)	Multi-step tool chains reliable, sensible error handling	Single-step reliable; multi-step occasional mid-result loss	Need extra retry + result validation logic
Edge case (question not in KB)	Recognizes the knowledge gap, expresses uncertainty reasonably	"Confidently" gives wrong answers (more hallucination)	Layer-1 (rule) error rate rises — creates complaint risk
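
The "extra retry + result validation logic" in the tool-call row can be sketched as a thin wrapper. Everything here is an assumption for illustration: `tool` is any zero-argument callable wrapping an API call, `validate` is a business-level check on the result, and a `None` return means the caller must take its downgrade path (for example, a human handoff).

```python
from typing import Callable, Optional


def call_with_validation(
    tool: Callable[[], dict],
    validate: Callable[[dict], bool],
    retries: int = 2,
) -> Optional[dict]:
    """Call a tool up to retries + 1 times; return the first result that validates."""
    for _ in range(retries + 1):
        try:
            result = tool()
        except Exception:
            # Broad catch is intentional in this sketch: any failure counts as a failed attempt.
            continue
        if validate(result):
            return result
    return None  # no valid result: the caller takes the explicit downgrade path
```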

Key conclusion: Route B's main risk isn't capability weakness; it's that the hallucination rate is higher and the model doesn't know it. When customers ask questions not in the knowledge base, domestic models more often give wrong-but-plausible answers. This is precisely why you need a Critic layer (Part 3); swapping in a more expensive model doesn't solve this problem.

Recommendation: Route C (layered mix)

Route C isn't "sometimes A, sometimes B" — it's a layered architecture with explicit routing logic —

Simple query / standard answer: Route B (domestic mainstream), low cost, low latency
Complex complaint / multi-step / emotional handling: Route B first, escalate to Route A if quality threshold isn't met (validation stage only)
No Route A in production — compliance red line. Route A's value is benchmarking Route B's capability ceiling during validation
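
The routing rules above fit in a few lines. This is a sketch under stated assumptions: the quality score and its 0.8 threshold are placeholders for whatever evaluation the team uses, and sending a below-threshold production answer to a human is an assumption, since the text only says Route A is not available there.

```python
from typing import Optional


def choose_route(
    complexity: str,                 # "simple" or "complex"
    stage: str,                      # "validation" or "production"
    quality_score: Optional[float] = None,
    threshold: float = 0.8,          # assumed quality threshold
) -> str:
    """Route C: domestic first, overseas only as a validation-stage benchmark."""
    if complexity == "simple":
        return "B"
    below = quality_score is not None and quality_score < threshold
    if not below:
        return "B"
    # Compliance red line: Route A is never used in production.
    return "A" if stage == "validation" else "human_handoff"
```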

For simple CS scenarios, domestic models are fully sufficient — don't pay extra for "overseas models are stronger." The real risk is AI fabricating answers in KB blind spots — that's solved by a safety check layer, not by switching to a more expensive model.

5. The 5-Layer Architecture — Each Layer Distributes Cost and Risk

Combining model selection and knowledge base, the complete CS Agent architecture is 5 layers —

User message arrives
    |
    v
+--------------------------------------------------+
| Layer 1: Intent recognition & routing (light)    |
| - Domestic small model (7B class) or rule-based  |
| - Categories: query / complaint / op / OOS       |
| - Avoids LLM call per request, saves 30-50%      |
|   inference cost                                 |
+--------------------------------------------------+
    |
    v
+--------------------------------------------------+
| Layer 2: Retrieval (hybrid)                      |
| - BM25 keyword + vector + rerank                 |
| - Pull top-K relevant docs, inject into context  |
| - Top-K: 3-5 (too many dilutes, too few misses)  |
| - Layer-1 (rule) queries return verbatim,        |
|   bypassing LLM                                  |
+--------------------------------------------------+
    |
    v
+--------------------------------------------------+
| Layer 3: Generation (model picked by complexity) |
| - Simple/standard: domestic mainstream model     |
| - Complex/multi-step: domestic first, escalate   |
|   if quality insufficient                        |
| - Layer-1 rule queries bypass generation —       |
|   that's the accuracy guarantee                  |
+--------------------------------------------------+
    |
    v
+--------------------------------------------------+
| Layer 4: Critic (rule engine, no LLM)            |
| - Compliance check before output                 |
| - Must-checks: unfulfillable promises, refund    |
|   amounts, internal info leakage                 |
| - Escalation triggers: emotion keywords,         |
|   3 unresolved rounds in a row                   |
| (see Part 3 for full design)                     |
+--------------------------------------------------+
    |
    v
+--------------------------------------------------+
| Layer 5: Tool calls (system integration)         |
| - WeCom API: receive messages, send replies      |
| - Ticketing API: create/update/query             |
| - Order system (if API): logistics, status       |
| - Key principle: tool-call failure must have     |
|   explicit downgrade path                        |
+--------------------------------------------------+

The core idea of the 5-layer split is "layered cost and risk distribution" —

Layer 1 value: ~20-30% of requests are simple policy lookups — no LLM needed. Routing with a small model or rules saves the LLM-inference cost on that share
Layer 2 value: Hybrid retrieval beats pure vector by 15-20% accuracy, especially for SKU and policy-keyword queries
Layer 3 value: Model per complexity, not one-size-fits-all. Simple → cheap; complex → strong
Layer 4 value: Safety net. No matter how strong the LLM, it can still emit something it shouldn't. Critic is hard-coded rules, not LLM-dependent judgment (full coverage in Part 3)
Layer 5 value: The Agent isn't just "chat"; it can look up orders, create tickets, and trigger refunds (see: what a real money-moving refund workflow looks like, field by field)

Detection signal: No Layer 4 (Critic) in the vendor's proposal — straight fail. That means whatever the LLM emits goes out to customers, i.e. "you've given customers a CS agent that confidently fabricates refund promises."
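
A rule-only Critic can be very small. The sketch below is an illustration, not the Part 3 design: the keyword lists, the "RMB" amount pattern, and the `max_refund` parameter are all assumptions. It covers the must-checks and escalation triggers from the diagram, and it fails closed, so any internal error blocks the reply.

```python
import re

ESCALATION_KEYWORDS = ("complaint", "lawsuit", "furious")         # assumed list
FORBIDDEN_PROMISES = ("guaranteed refund", "we will compensate")  # assumed list
INTERNAL_MARKERS = ("internal only",)                             # assumed list
AMOUNT = re.compile(r"(\d+(?:\.\d+)?)\s*RMB", re.IGNORECASE)


def critic(draft, user_message, unresolved_rounds, max_refund=None):
    """Return (decision, reason); decision is 'send', 'block', or 'escalate'."""
    try:
        text = draft.lower()
        if unresolved_rounds >= 3 or any(k in user_message.lower() for k in ESCALATION_KEYWORDS):
            return ("escalate", "escalation trigger")
        if any(p in text for p in FORBIDDEN_PROMISES):
            return ("block", "unfulfillable promise")
        if any(m in text for m in INTERNAL_MARKERS):
            return ("block", "internal info leakage")
        for amount in AMOUNT.findall(draft):
            # An amount with no verified ceiling, or above it, is never sent.
            if max_refund is None or float(amount) > max_refund:
                return ("block", "refund amount not verified")
        return ("send", "ok")
    except Exception:
        return ("block", "critic error")  # fail closed
```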

6. Across 4 Cost Buckets, Inference Is Only ~5% — Headcount Is the Big One

Cost underestimation is the #1 reason Agentic AI projects spiral. Baseline assumption: ~5,000 CS tickets/day, AI handles 70% (~3,500), 30% human handoff (~1,500).

Bucket 1: Inference — directly determined by architecture choices

Inference is the hardest to predict and the easiest to lose control of — tightly coupled to architecture —

Cost Item	Basis	Estimated Monthly
Primary LLM (domestic 72B class)	~4 calls/ticket × 1500 tokens avg	~800-1,000 RMB/month
Embedding model	~5 retrievals/ticket × 512 tokens	~60-100 RMB/month (~negligible)
Rerank model	~3 retrievals/ticket, per-call pricing	~500-700 RMB/month
CS-platform API	Depends on existing contract	Confirm separately; bundle into annual negotiation
Total (Route B primary)	—	~1,500-2,000 RMB/month (~15-20k/year)

Levers you can pull —

LLM calls per ticket (4 → 3 = 25% cost cut)
Context length (trim system prompt, save 200 tokens per call)
Use a small model for intent (7B not 72B) — 90% cheaper for routing
Layer-1 rule queries bypass LLM — ~20% of tickets return directly, saving that share
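
The first lever is easy to verify with the table's own numbers. The sketch assumes a 30-day month and the ~3,500 AI-handled tickets/day from the baseline; the per-million-token price is back-calculated from the table's 800-1,000 RMB/month figure, not a quoted vendor price.

```python
TICKETS_PER_DAY = 3_500    # AI-handled share of the baseline
DAYS_PER_MONTH = 30        # assumption
TOKENS_PER_CALL = 1_500    # average from the cost table


def monthly_tokens(calls_per_ticket: int) -> int:
    return TICKETS_PER_DAY * DAYS_PER_MONTH * calls_per_ticket * TOKENS_PER_CALL


def implied_price_per_million(monthly_rmb: float, tokens: int) -> float:
    """RMB per million tokens implied by a monthly bill."""
    return monthly_rmb / (tokens / 1_000_000)


base = monthly_tokens(4)        # 630,000,000 tokens/month
trimmed = monthly_tokens(3)     # 472,500,000 tokens/month
saving = 1 - trimmed / base     # 0.25, the "4 -> 3 = 25% cost cut" lever
```

At this volume the table's 800-1,000 RMB/month works out to roughly 1.3-1.6 RMB per million tokens, which is a quick sanity check to run against whatever price list a vendor actually quotes.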

Reference framing: Thousands of tickets/day, 1,500-2,000 RMB/month — that's the cost of 1-2 monthly CS-rep salaries. But this covers AI handling 70% of tickets; the equivalent human cost is 30-50× this number. Inference isn't the project's main cost pressure; new headcount is.

Bucket 2: New Headcount — the most underestimated bucket

Reducing CS reps creates new tech-staff demand. You can't only look at "how many are removed"; you have to look at "how many are added" —

New Role	Core Responsibilities	Hire/Reassign	Notes
AI ops engineer (required)	Monitor Agent performance, analyze error logs, tune prompts, handle edge cases	Requires LLM experience	Market is tight — can't substitute generic ops
KB content operator (required)	Maintain KB updates, handle "knowledge-gap" tickets, coordinate policy updates with business	Internal reassignment	Must understand both business and AI — reassign from a senior CS rep
Rule-library maintainer (part-time, reassignable)	Critic rule updates, new-complaint-type rule additions, compliance	Can be the AI ops engineer	Rule logic must come from someone who knows the business, not just engineers
Data annotation (early concentrated)	Conversation annotation for fine-tuning/eval, KB Q&A quality review	Outsourceable	Concentrated effort early, drops to part-time later

The hardest to hire is the AI ops engineer — not a generic backend engineer. They need prompt engineering, LLM call-chain analysis, and root-cause-from-logs skills. Prioritize internal engineers with LLM interest; on the open market this role pushes to 50k+ RMB/month.

Bucket 3: Infrastructure — relatively fixed, predictable

Item	Monthly	Notes
Vector DB	Stage 1: 0 (built-in); stage 2: 800-1,500+ RMB	Storage + request based
AI orchestration platform	~1,200-1,800 RMB (cloud server, primary + standby)	Reuse if already deployed
Conversation log storage	~100-300 RMB (object storage)	Retain 6+ months for audit
Monitoring + alerting	~200-500 RMB	Required — foundation for the ops loop
Total	~2,300-4,100 RMB/month (~28-50k/year)	Excluding pre-existing platform hardware

Bucket 4: System Integration — the most underestimated, the most schedule-risky

Integration is the most under-budgeted line item in the whole project, and the most common source of schedule slip. Every system integration is an independent engineering task, and maintenance never stops as upstream systems evolve —

Integration Target	Duration	Difficulty	Notes
WeCom CS API	1-2 weeks	Low (stable interface)	Message-format constraints, limited rich text
CS-platform conversation export	3-5 days	Low	Data may need cleaning; IO-heavy for large history
Ticketing API	2-3 weeks	Medium (vendor-dependent)	Vendor API docs may be incomplete; sandbox hard to obtain
Order/logistics queries	3-4 weeks	High (sensitive systems)	Most complex integration — IT deep involvement, long approval cycle
KB update automation	2 weeks	Medium	Need to define: who updates, when sync, how validated

Lesson: Order/logistics integration is the most likely critical-path blocker — start the IT conversation Week One. Don't wait until the knowledge base is built to start that conversation.

Full cost picture — the ratios on the table

Category	Annual Magnitude	Share	Management Focus
Inference	15-20k RMB	~5%	Not the dominant cost; further optimizable via architecture
New headcount	500k-1.5M RMB	~70%	The biggest, most underestimated bucket
Infrastructure	30-50k RMB	~10%	Relatively fixed, predictable
System integration	One-time 300-800k RMB	~15%	Highest schedule risk — start Week One

Conclusion: CS Agent inference cost (LLM API fees) is actually small — only ~2,000 RMB/month for thousands of daily tickets. The real cost is people (AI ops + KB ops) and integration (schedule risk). Total: AI handling 70% of tickets ≈ 1/30 to 1/50 the cost of equivalent human handling.

Detection signal: Vendor quotes only the "LLM inference cost" without "new headcount" or "system integration" — push back immediately, demand the full picture. This is the most common budget booby trap; three months in you'll get the "we need more budget" conversation and your boss will ask "why didn't we know this upfront."

Related: same misalignment, runtime version. "Looks good on the dashboard" is the easier trap to spot during cost discussions — much harder once the agent is live and the metric drift starts. See the six reward-hacking patterns ITSM agents learn within 60,000 training steps for the post-launch version of the same gap.

Where this leaves you

If you want to use the "3-layer knowledge base + model-selection matrix + 4-bucket cost estimator" directly in your next architecture review — without re-reading this article every time — I packaged a PDF kit for readers who got this far. Send me the keyword "CS COST KIT" and I'll send the pack:

3-layer knowledge-base decision sheet (card version — error cost / processing strategy / owner — for use in vendor reviews)
Model-selection 5-dimension comparison (one-page A3 — simple/multi-turn/complaint/tool-call/edge-case)
4-bucket cost estimator (Excel template — plug in daily ticket volume, get monthly budget)

(Channels in the footer — X or email both work.)

Next: What Happens After Launch — Ops Loop, Critic Safety Layer, and the 30-Day Plan

KB is built, model is picked, cost is computed — Part 3 is the most operational article in the series —

80% of failed bots are ops failures, not tech failures — which 6 KPIs do you have to track daily?
Why must the Critic safety layer fail-closed? What does the interception pseudocode look like?
The 5 prerequisites for headcount reduction — miss any one and you have an incident
The 30-day plan — what to do each day, who owns it, what gets produced

Series TOC:

Part 1: Which 4 of Your 28 'Smart-X' Projects to Start With
This article | Part 2: Knowledge Base Caps the Ceiling, the Model Is Just a Tool
Part 3: 80% of Failed Bots Were Ops Failures, Not Tech
