mindesbunister
f420d98d55
critical: Make health monitor 3-4x more aggressive to prevent heap crashes
PROBLEM (Nov 27, 2025 - 11:53 UTC):
- 200+ accountUnsubscribe errors accumulated within 2 seconds
- JavaScript heap ran out of memory and crashed BEFORE the health monitor could trigger
- Old settings: 50 errors / 30s window / check every 10s = too slow
- Container crashed from memory exhaustion instead of restarting cleanly
SOLUTION - 3-4x FASTER RESPONSE:
- Error window: 30s → 10s (3× faster detection)
- Error threshold: 50 → 20 errors (2.5× more sensitive)
- Check frequency: 10s → 3s intervals (3× more frequent)
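
A minimal sketch of how a sliding-window error check with these values could look. The constant names, the `ErrorWindowMonitor` class, and its methods are hypothetical illustrations and are not taken from lib/monitoring/drift-health-monitor.ts; only the numbers come from this commit message.

```typescript
// Hypothetical names; the values mirror this commit message, not the actual file.
const ERROR_WINDOW_MS = 10_000; // was 30_000
const ERROR_THRESHOLD = 20; // was 50
const CHECK_INTERVAL_MS = 3_000; // was 10_000

class ErrorWindowMonitor {
  private timestamps: number[] = [];
  private readonly windowMs: number;
  private readonly threshold: number;

  constructor(windowMs: number = ERROR_WINDOW_MS, threshold: number = ERROR_THRESHOLD) {
    this.windowMs = windowMs;
    this.threshold = threshold;
  }

  // Record one error occurrence (for example, one accountUnsubscribe failure).
  recordError(now: number = Date.now()): void {
    this.timestamps.push(now);
  }

  // Drop timestamps that fell out of the window, then compare against the threshold.
  shouldRestart(now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length >= this.threshold;
  }
}
```

In this sketch a caller would invoke `shouldRestart()` from a timer firing every `CHECK_INTERVAL_MS` and request a clean restart when it returns true. How the restart itself is performed is outside the scope of the example.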
IMPACT:
- Before: 10-40 seconds to trigger restart
- After: 3-13 seconds to trigger restart (3-4× faster)
- Catches rapid error accumulation BEFORE heap exhaustion
- Clean restart instead of crash-and-recover
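
The before/after ranges above can be reproduced with simple arithmetic, under the assumption that the lower bound is one check interval and the upper bound is one check interval plus one full error window. That reading is an inference from the numbers, not something stated in the code, and `triggerLatencyRange` is a hypothetical helper.

```typescript
// Hypothetical helper: assumes latency is between one check interval
// and one check interval plus one full error window.
function triggerLatencyRange(
  windowMs: number,
  checkIntervalMs: number,
): { minMs: number; maxMs: number } {
  return { minMs: checkIntervalMs, maxMs: checkIntervalMs + windowMs };
}

const before = triggerLatencyRange(30_000, 10_000); // old settings
const after = triggerLatencyRange(10_000, 3_000); // new settings

console.log(`before: ${before.minMs / 1000}-${before.maxMs / 1000}s`);
console.log(`after: ${after.minMs / 1000}-${after.maxMs / 1000}s`);
```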
REAL INCIDENT TIMELINE:
11:53:43 - Errors start accumulating
11:53:45.606 - FATAL: heap out of memory (2.2 seconds)
11:53:47.803 - Docker restart (not health monitor)
NEW BEHAVIOR:
- 20 errors in 10s = triggers at a sustained rate of 1 error per 500ms (2 errors/s) or faster
- 3s check interval catches the problem within 3-13s
- Clean restart before the memory leak causes a crash
Files Changed:
- lib/monitoring/drift-health-monitor.ts (lines 13-14, 32)
2025-11-27 13:04:14 +01:00