diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index 8529fc8..ee86727 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -1497,19 +1497,25 @@ ORDER BY MIN(adx) DESC;
 
 ## Common Pitfalls
 
-1. **DRIFT SDK MEMORY LEAK (CRITICAL - Fixed Nov 15, 2025):**
+1. **DRIFT SDK MEMORY LEAK (CRITICAL - Fixed Nov 15, 2025, Enhanced Nov 24, 2025):**
    - **Symptom:** JavaScript heap out of memory after 10+ hours runtime, Telegram bot timeouts (60s)
    - **Root Cause:** Drift SDK accumulates WebSocket subscriptions over time without cleanup
    - **Manifestation:** Thousands of `accountUnsubscribe error: readyState was 2 (CLOSING)` in logs
    - **Heap Growth:** Normal ~200MB → 4GB+ after 10 hours → OOM crash
-   - **Solution:** Automatic reconnection every 4 hours (`lib/drift/client.ts`)
+   - **Solution (Nov 24, 2025):** Smart error-based health monitoring replaces blind timer
    - **Implementation:**
-     * `scheduleReconnection()` - Sets 4-hour timer after initialization
-     * `reconnect()` - Unsubscribes, resets state, reinitializes Drift client
-     * Timer cleared in `disconnect()` to prevent orphaned timers
-   - **Manual Control:** `/api/drift/reconnect` endpoint (POST with auth, GET for status)
-   - **Impact:** System now self-healing, can run indefinitely without manual restarts
-   - **Monitoring:** Watch for scheduled reconnection logs: `🔄 Scheduled reconnection...`
+     * `lib/monitoring/drift-health-monitor.ts` - Tracks accountUnsubscribe errors in real-time
+     * `interceptWebSocketErrors()` - Patches console.error to catch SDK WebSocket errors
+     * **30-second sliding window:** Only restarts if 50+ errors in 30 seconds (actual problem detected)
+     * **Container restart via flag:** Writes `/tmp/trading-bot-restart.flag` for watch-restart.sh
+     * **Health API:** `GET /api/drift/health` - Check error count and health status anytime
+   - **Why better than blind timer:**
+     * Old approach: Restarted every 2 hours regardless of health (unnecessary downtime)
+     * New approach: Only restarts when accountUnsubscribe errors actually occur
+     * Faster response: 30 seconds vs up to 2 hours wait time
+     * Less downtime: No unnecessary restarts when SDK healthy
+   - **Monitoring:** Watch for `🏥 Drift health monitor started` and error threshold logs
+   - **Impact:** System responds to actual problems, not blind schedule
 
 2. **WRONG RPC PROVIDER (CRITICAL - CATASTROPHIC SYSTEM FAILURE):**
    - **FINAL CONCLUSION Nov 14, 2025 (INVESTIGATION COMPLETE):** Helius is the ONLY reliable RPC provider for Drift SDK
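
The error-based health check added under pitfall 1 (a 30-second sliding window with a 50-error threshold, fed by a patched `console.error`) can be sketched roughly as follows. This is a minimal illustration, not the contents of `lib/monitoring/drift-health-monitor.ts`: the class `SlidingWindowErrorCounter`, the helper `interceptErrors`, and the `onUnhealthy` callback are hypothetical names, and the only values taken from the diff are the window length, the threshold, and the matched log text.

```typescript
// Hypothetical sketch: names below are illustrative, not the project's real API.
const WINDOW_MS = 30_000;   // 30-second sliding window (value from the diff)
const ERROR_THRESHOLD = 50; // 50+ errors inside the window means "unhealthy"

class SlidingWindowErrorCounter {
  private timestamps: number[] = [];
  private readonly windowMs: number;
  private readonly threshold: number;

  constructor(windowMs: number = WINDOW_MS, threshold: number = ERROR_THRESHOLD) {
    this.windowMs = windowMs;
    this.threshold = threshold;
  }

  /** Record one error at `now` (ms); returns true once the threshold is reached. */
  record(now: number = Date.now()): boolean {
    this.timestamps.push(now);
    this.prune(now);
    return this.timestamps.length >= this.threshold;
  }

  /** Number of errors still inside the window at `now`. */
  count(now: number = Date.now()): number {
    this.prune(now);
    return this.timestamps.length;
  }

  /** Drop timestamps that are `windowMs` old or older. */
  private prune(now: number): void {
    const cutoff = now - this.windowMs;
    while (this.timestamps.length > 0 && this.timestamps[0] <= cutoff) {
      this.timestamps.shift();
    }
  }
}

/**
 * Wrap console.error so matching messages feed the counter.
 * Returns a function that restores the original console.error.
 */
function interceptErrors(
  counter: SlidingWindowErrorCounter,
  onUnhealthy: () => void,
): () => void {
  const original = console.error;
  console.error = (...args: unknown[]): void => {
    const matched = args.some((a) => String(a).includes("accountUnsubscribe error"));
    if (matched && counter.record()) {
      onUnhealthy();
    }
    original.apply(console, args); // still log the message as before
  };
  return () => {
    console.error = original;
  };
}
```

In this sketch, `onUnhealthy` is where a restart request such as writing `/tmp/trading-bot-restart.flag` would go. It fires on every matching error once the threshold is reached, so a real monitor would presumably guard it so that only one restart is requested.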