docs: Update DRIFT SDK MEMORY LEAK section with Nov 24 health monitoring

- Replaced blind 2-hour timer documentation with error-based health monitoring
- Documented DriftHealthMonitor class with 30-second sliding window
- Added 50 error threshold and flag file restart mechanism
- Documented /api/drift/health endpoint
- Explained benefits: no unnecessary restarts, faster response (30s vs 2h)
- Preserved accurate symptom/root cause/manifestation information
- References implementation commit dc197f5
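
For readers unfamiliar with the pattern, the 30-second sliding window with a 50-error threshold can be sketched roughly as below. This is only an illustration: the class name `SlidingWindowErrorCounter` and its methods are hypothetical and are not taken from commit dc197f5. The restart action itself (writing the flag file) is left to the caller.

```typescript
// Hypothetical sketch of a sliding-window error counter.
// Names and structure are assumptions, not the code from dc197f5.
class SlidingWindowErrorCounter {
  private timestamps: number[] = [];
  private readonly windowMs: number;
  private readonly threshold: number;

  constructor(windowMs: number = 30_000, threshold: number = 50) {
    this.windowMs = windowMs;   // length of the sliding window (30 s)
    this.threshold = threshold; // errors in the window that signal a real problem
  }

  // Record one error at time `now` (ms). Returns true when the window
  // currently holds `threshold` or more errors, i.e. a restart is warranted.
  record(now: number = Date.now()): boolean {
    this.timestamps.push(now);
    this.prune(now);
    return this.timestamps.length >= this.threshold;
  }

  // Number of errors still inside the window at time `now`.
  count(now: number = Date.now()): number {
    this.prune(now);
    return this.timestamps.length;
  }

  // Drop timestamps that are a full window old or older.
  private prune(now: number): void {
    const cutoff = now - this.windowMs;
    while (this.timestamps.length > 0 && this.timestamps[0] <= cutoff) {
      this.timestamps.shift();
    }
  }
}
```

A caller would invoke `record()` for each intercepted `accountUnsubscribe` error and trigger the restart path only when it returns `true`. Errors that trickle in slowly never accumulate to the threshold, which is what separates this from a fixed timer.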
2025-11-24 16:55:10 +01:00
parent dc197f52a4
commit 595a0ac7a2
1 changed files with 14 additions and 8 deletions
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -1497,19 +1497,25 @@ ORDER BY MIN(adx) DESC;

 ## Common Pitfalls

-1. **DRIFT SDK MEMORY LEAK (CRITICAL - Fixed Nov 15, 2025):**
+1. **DRIFT SDK MEMORY LEAK (CRITICAL - Fixed Nov 15, 2025, Enhanced Nov 24, 2025):**
   - **Symptom:** JavaScript heap out of memory after 10+ hours runtime, Telegram bot timeouts (60s)
   - **Root Cause:** Drift SDK accumulates WebSocket subscriptions over time without cleanup
   - **Manifestation:** Thousands of `accountUnsubscribe error: readyState was 2 (CLOSING)` in logs
   - **Heap Growth:** Normal ~200MB → 4GB+ after 10 hours → OOM crash
-   - **Solution:** Automatic reconnection every 4 hours (`lib/drift/client.ts`)
+   - **Solution (Nov 24, 2025):** Smart error-based health monitoring replaces blind timer
   - **Implementation:**
-     * `scheduleReconnection()` - Sets 4-hour timer after initialization
-     * `reconnect()` - Unsubscribes, resets state, reinitializes Drift client
-     * Timer cleared in `disconnect()` to prevent orphaned timers
-   - **Manual Control:** `/api/drift/reconnect` endpoint (POST with auth, GET for status)
-   - **Impact:** System now self-healing, can run indefinitely without manual restarts
-   - **Monitoring:** Watch for scheduled reconnection logs: `🔄 Scheduled reconnection...`
+     * `lib/monitoring/drift-health-monitor.ts` - Tracks accountUnsubscribe errors in real-time
+     * `interceptWebSocketErrors()` - Patches console.error to catch SDK WebSocket errors
+     * **30-second sliding window:** Only restarts if 50+ errors in 30 seconds (actual problem detected)
+     * **Container restart via flag:** Writes `/tmp/trading-bot-restart.flag` for watch-restart.sh
+     * **Health API:** `GET /api/drift/health` - Check error count and health status anytime
+   - **Why better than blind timer:**
+     * Old approach: Restarted every 2 hours regardless of health (unnecessary downtime)
+     * New approach: Only restarts when accountUnsubscribe errors actually occur
+     * Faster response: 30 seconds vs up to 2 hours wait time
+     * Less downtime: No unnecessary restarts when SDK healthy
+   - **Monitoring:** Watch for `🏥 Drift health monitor started` and error threshold logs
+   - **Impact:** System responds to actual problems, not blind schedule

 2. **WRONG RPC PROVIDER (CRITICAL - CATASTROPHIC SYSTEM FAILURE):**
   - **FINAL CONCLUSION Nov 14, 2025 (INVESTIGATION COMPLETE):** Helius is the ONLY reliable RPC provider for Drift SDK