Testing — Yaqin Hei

Why Pytest-Green Does Not Mean Ship-Ready — Dual-Track Testing for Agentic AI

The thing your customer-service Agent project gets most easily fooled by this year: 'pytest 400+ green, coverage 79%, CI gate passing.' Then the boss asks 'what's the faithfulness rate? Tone compliance? Prompt-injection block rate?' and nobody answers. The 'tests passed' bar for an AI system is not the 'tests passed' bar for traditional software. This piece is for architects, founders, and project owners shipping Agentic AI inside an enterprise: 5 min to see why pytest-green is misleading, 10 min to decide who owns which 4 of the 8+ test buckets, 20 min to walk out with a 7-quality-dimension threshold table + 3-cadence rhythm + 5 things to drive this week — bring it to your next architecture review.

May 25, 2026·16 min read