Yaqin Hei

observability

1 post

Your Dashboard Is Throttling the Agent It Watches

A periodic P99 spike arrives every few minutes like clockwork, yet CPU, QPS, and error rate are all flat and the Agent code hasn't changed. Everyone's first guess is 'ES retrieval got slow.' It didn't: the retrieval path is fully async and clean. The culprit is the one place you'd never suspect: the ops dashboard built to watch the Agent was quietly choking it. This postmortem covers how one sync call freezes a single-threaded event loop, how to align spike timestamps with dashboard refreshes, the two-line fix (to_thread + TTL cache), and 10 event-loop probes you can add to your own async service this week. In 20 minutes, you'll be able to catch the same 'one sync call stalls a whole loop' bug in your own stack.

Jun 10, 2026 · 20 min read

WeChat Official Account: 京墨AI研习社 (@HeiLabAI) · WeChat Channels: Yaqin.AI

X @yaqinhei · GitHub @AmyHei · amyheiny@gmail.com

© 2026 Yaqin Hei · About