Data Labeling — Yaqin Hei

How Much to Label: Not a Percentage of Traffic, but "Label Until You Can Conclude"

"We labeled 50, 96% correct — ship it?" No — the statistical lower bound is only 86%. In 5 minutes you'll see through the small-sample 96% mirage; in 10, why labeling volume tracks "intents × channels," not traffic share; in 20, you'll have a table mapping true accuracy to rows-to-label, plus the cheapest rule there is: labeling volume grows with the number of intents × channels, not with traffic.

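The 86% figure in the teaser above can be checked with a few lines of arithmetic. The sketch below is one way to do it, not necessarily the method the article uses: it assumes a two-sided 95% Wilson score interval (z = 1.96), which the source does not specify, and the function name `wilson_interval` is made up for illustration.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a 95% interval (an assumption; the source
    only says "statistical lower bound").
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = correct / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

# 96% of 50 labeled rows means 48 correct.
low, high = wilson_interval(48, 50)
print(f"observed 0.960, plausible range {low:.3f} to {high:.3f}")
```

With 48 of 50 correct, the lower end lands at roughly 0.865, so an observed 96% cannot rule out a true accuracy in the mid-80s. Labeling more rows at the same observed accuracy narrows the range, which is the sense in which you "label until you can conclude".
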
After Launch Is Where Agent Architecture Is Decided — Start With How You Sample Your Eval Set

"The stronger AI gets, the fewer people you need" is the most popular illusion of the past two years. Where agents are built with real money, labeling and evaluation teams are growing, not shrinking. In 5 minutes you'll see what separates "can demo it" from "can keep it right"; in 10, the 3 sampling questions that expose an eval set; in 20, how to redraw yours from "eyeball the logs" into a stratified sampling frame that pulls rare, high-stakes events — a wrong refund, a mishandled compliance case — back from "never sampled" into view.

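To make the contrast between "eyeball the logs" and a stratified sampling frame concrete, here is a minimal sketch. It assumes each log record carries an event label to stratify on; the field names (`id`, `event`), the example counts, and the helper `stratified_sample` are all hypothetical and not taken from the article.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, key, per_stratum, seed=0):
    """Draw up to `per_stratum` records from every stratum.

    `key` maps a record to its stratum name. Rare strata are sampled
    at the same quota as common ones, so they cannot be crowded out.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for name in sorted(strata):
        rows = strata[name]
        sample.extend(rng.sample(rows, min(per_stratum, len(rows))))
    return sample

# Hypothetical log: 990 routine FAQ turns and 10 refund turns.
logs = [{"id": i, "event": "faq"} for i in range(990)]
logs += [{"id": 990 + i, "event": "refund"} for i in range(10)]

picked = stratified_sample(logs, key=lambda r: r["event"], per_stratum=5)
counts = Counter(r["event"] for r in picked)
print(dict(counts))
```

A uniform draw of 10 rows from this log would usually contain no refund turns at all, since they make up 1% of traffic. The per-stratum quota guarantees 5 of them in the eval set, at the cost of needing reweighting if you later want a traffic-representative accuracy number.
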
Jul 2, 2026·12 min read