docs: Update DRIFT SDK MEMORY LEAK section with Nov 24 health monitoring

- Replaced blind 2-hour timer documentation with error-based health monitoring
- Documented DriftHealthMonitor class with 30-second sliding window
- Added 50 error threshold and flag file restart mechanism
- Documented /api/drift/health endpoint
- Explained benefits: no unnecessary restarts, faster response (30s vs 2h)
- Preserved accurate symptom/root cause/manifestation information
- References implementation commit dc197f5
mindesbunister
2025-11-24 16:55:10 +01:00
parent dc197f52a4
commit 595a0ac7a2

@@ -1497,19 +1497,25 @@ ORDER BY MIN(adx) DESC;
## Common Pitfalls
-1. **DRIFT SDK MEMORY LEAK (CRITICAL - Fixed Nov 15, 2025):**
+1. **DRIFT SDK MEMORY LEAK (CRITICAL - Fixed Nov 15, 2025, Enhanced Nov 24, 2025):**
- **Symptom:** JavaScript heap out of memory after 10+ hours runtime, Telegram bot timeouts (60s)
- **Root Cause:** Drift SDK accumulates WebSocket subscriptions over time without cleanup
- **Manifestation:** Thousands of `accountUnsubscribe error: readyState was 2 (CLOSING)` in logs
- **Heap Growth:** Normal ~200MB → 4GB+ after 10 hours → OOM crash
-- **Solution:** Automatic reconnection every 4 hours (`lib/drift/client.ts`)
+- **Solution (Nov 24, 2025):** Smart error-based health monitoring replaces blind timer
- **Implementation:**
-* `scheduleReconnection()` - Sets 4-hour timer after initialization
-* `reconnect()` - Unsubscribes, resets state, reinitializes Drift client
-* Timer cleared in `disconnect()` to prevent orphaned timers
-- **Manual Control:** `/api/drift/reconnect` endpoint (POST with auth, GET for status)
-- **Impact:** System now self-healing, can run indefinitely without manual restarts
-- **Monitoring:** Watch for scheduled reconnection logs: `🔄 Scheduled reconnection...`
+* `lib/monitoring/drift-health-monitor.ts` - Tracks accountUnsubscribe errors in real-time
+* `interceptWebSocketErrors()` - Patches console.error to catch SDK WebSocket errors
+* **30-second sliding window:** Only restarts if 50+ errors in 30 seconds (actual problem detected)
+* **Container restart via flag:** Writes `/tmp/trading-bot-restart.flag` for watch-restart.sh
+* **Health API:** `GET /api/drift/health` - Check error count and health status anytime
+- **Why better than blind timer:**
+* Old approach: Restarted every 2 hours regardless of health (unnecessary downtime)
+* New approach: Only restarts when accountUnsubscribe errors actually occur
+* Faster response: 30 seconds vs up to 2 hours wait time
+* Less downtime: No unnecessary restarts when SDK healthy
+- **Monitoring:** Watch for `🏥 Drift health monitor started` and error threshold logs
+- **Impact:** System responds to actual problems, not blind schedule
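The error-based approach described in item 1 can be illustrated with a minimal TypeScript sketch. This is not the contents of `lib/monitoring/drift-health-monitor.ts`. The class name `SlidingWindowErrorMonitor`, the helper `interceptConsoleErrors`, and the shape of the status object are hypothetical. Only the 30-second window, the 50-error threshold, the flag path, and the idea of patching `console.error` come from the text above.

```typescript
import { writeFileSync } from "node:fs";

// Values taken from the documentation above.
const WINDOW_MS = 30_000; // 30-second sliding window
const ERROR_THRESHOLD = 50; // request a restart at 50+ errors inside the window
const RESTART_FLAG = "/tmp/trading-bot-restart.flag";

// Hypothetical monitor: keeps the timestamps of recent errors and fires a
// callback once when the count inside the window reaches the threshold.
class SlidingWindowErrorMonitor {
  private timestamps: number[] = [];
  private restartRequested = false;
  private readonly windowMs: number;
  private readonly threshold: number;
  private readonly onUnhealthy: () => void;

  constructor(
    windowMs: number = WINDOW_MS,
    threshold: number = ERROR_THRESHOLD,
    // Assumed default action: write the flag file that watch-restart.sh looks for.
    onUnhealthy: () => void = () =>
      writeFileSync(RESTART_FLAG, new Date().toISOString()),
  ) {
    this.windowMs = windowMs;
    this.threshold = threshold;
    this.onUnhealthy = onUnhealthy;
  }

  // Record one error; returns true only on the call that trips the threshold.
  recordError(now: number = Date.now()): boolean {
    this.timestamps.push(now);
    this.prune(now);
    if (!this.restartRequested && this.timestamps.length >= this.threshold) {
      this.restartRequested = true;
      this.onUnhealthy();
      return true;
    }
    return false;
  }

  // Snapshot that a health endpoint could return as JSON.
  status(now: number = Date.now()): { errorCount: number; healthy: boolean } {
    this.prune(now);
    const errorCount = this.timestamps.length;
    return { errorCount, healthy: errorCount < this.threshold };
  }

  // Drop timestamps that have aged out of the window.
  private prune(now: number): void {
    const cutoff = now - this.windowMs;
    while (this.timestamps.length > 0 && this.timestamps[0] <= cutoff) {
      this.timestamps.shift();
    }
  }
}

// Hypothetical interceptor: wraps console.error, counts messages that contain
// the SDK error text, and returns a function that restores the original.
function interceptConsoleErrors(monitor: SlidingWindowErrorMonitor): () => void {
  const original = console.error;
  console.error = (...args: unknown[]) => {
    if (args.some((arg) => String(arg).includes("accountUnsubscribe error"))) {
      monitor.recordError();
    }
    original.apply(console, args);
  };
  return () => {
    console.error = original;
  };
}
```

In this sketch, errors arriving once per second never keep more than 30 timestamps in the window, so no restart is requested. A burst of 50 errors within 30 seconds trips the threshold on the 50th call.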
2. **WRONG RPC PROVIDER (CRITICAL - CATASTROPHIC SYSTEM FAILURE):**
- **FINAL CONCLUSION Nov 14, 2025 (INVESTIGATION COMPLETE):** Helius is the ONLY reliable RPC provider for Drift SDK