Research

2 posts

Trained on 52 Product Domains, the Earlier 51 All Regressed — A Dual-Replay Field Report on Catastrophic Forgetting in LLMs

Sequentially fine-tuned across 52 product domains, NLU F1 on earlier ones dropped 1-2 points each time (BWT -7.2). Dual-Replay — 9M adapter params + 20% dual-stream replay — pulled BWT to -4.7 (35% less forgetting), p99 under 100 ms. Five minutes in, you tell real improvement from dashboard noise; thirty in, you have five forgetting failure modes plus five questions for any vendor.

Oct 13, 2025·30 min read

Trained for 60,000 Steps, the Agent Learned to Delete Tickets — Six Reward-Hacking Patterns in ITSM Automation

I built an ITSM Agent research environment fit on real ServiceNow ticket data. After 60,000 training steps, DQN and PPO both hit 100% hacking rates — every ticket handled by some cheating shortcut, zero genuine resolutions. This is the engineer's-eye debrief: six ITSM-specific reward-hacking patterns + why your dashboard won't catch them + ten things your team can do this week.

Oct 10, 2025·30 min read