docs: Update DRIFT SDK MEMORY LEAK section with Nov 24 health monitoring
- Replaced blind 2-hour timer documentation with error-based health monitoring
- Documented DriftHealthMonitor class with 30-second sliding window
- Added 50 error threshold and flag file restart mechanism
- Documented /api/drift/health endpoint
- Explained benefits: no unnecessary restarts, faster response (30s vs 2h)
- Preserved accurate symptom/root cause/manifestation information
- References implementation commit dc197f5
.github/copilot-instructions.md (vendored, 22 lines changed)
@@ -1497,19 +1497,25 @@ ORDER BY MIN(adx) DESC;
 
 ## Common Pitfalls
 
-1. **DRIFT SDK MEMORY LEAK (CRITICAL - Fixed Nov 15, 2025):**
+1. **DRIFT SDK MEMORY LEAK (CRITICAL - Fixed Nov 15, 2025, Enhanced Nov 24, 2025):**
 - **Symptom:** JavaScript heap out of memory after 10+ hours runtime, Telegram bot timeouts (60s)
 - **Root Cause:** Drift SDK accumulates WebSocket subscriptions over time without cleanup
 - **Manifestation:** Thousands of `accountUnsubscribe error: readyState was 2 (CLOSING)` in logs
 - **Heap Growth:** Normal ~200MB → 4GB+ after 10 hours → OOM crash
-- **Solution:** Automatic reconnection every 4 hours (`lib/drift/client.ts`)
+- **Solution (Nov 24, 2025):** Smart error-based health monitoring replaces blind timer
 - **Implementation:**
-* `scheduleReconnection()` - Sets 4-hour timer after initialization
-* `reconnect()` - Unsubscribes, resets state, reinitializes Drift client
-* Timer cleared in `disconnect()` to prevent orphaned timers
-- **Manual Control:** `/api/drift/reconnect` endpoint (POST with auth, GET for status)
-- **Impact:** System now self-healing, can run indefinitely without manual restarts
-- **Monitoring:** Watch for scheduled reconnection logs: `🔄 Scheduled reconnection...`
+* `lib/monitoring/drift-health-monitor.ts` - Tracks accountUnsubscribe errors in real-time
+* `interceptWebSocketErrors()` - Patches console.error to catch SDK WebSocket errors
+* **30-second sliding window:** Only restarts if 50+ errors in 30 seconds (actual problem detected)
+* **Container restart via flag:** Writes `/tmp/trading-bot-restart.flag` for watch-restart.sh
+* **Health API:** `GET /api/drift/health` - Check error count and health status anytime
+- **Why better than blind timer:**
+* Old approach: Restarted every 2 hours regardless of health (unnecessary downtime)
+* New approach: Only restarts when accountUnsubscribe errors actually occur
+* Faster response: 30 seconds vs up to 2 hours wait time
+* Less downtime: No unnecessary restarts when SDK healthy
+- **Monitoring:** Watch for `🏥 Drift health monitor started` and error threshold logs
+- **Impact:** System responds to actual problems, not blind schedule
 
 2. **WRONG RPC PROVIDER (CRITICAL - CATASTROPHIC SYSTEM FAILURE):**
 - **FINAL CONCLUSION Nov 14, 2025 (INVESTIGATION COMPLETE):** Helius is the ONLY reliable RPC provider for Drift SDK
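The "50+ errors in 30 seconds" rule documented in this diff can be illustrated with a small sliding-window counter. This is only a sketch under stated assumptions: the class name `SlidingWindowErrorCounter` and its methods are hypothetical and are not taken from `lib/monitoring/drift-health-monitor.ts` or commit dc197f5. Only the 30-second window and the threshold of 50 come from the text.

```typescript
// Hypothetical sketch of a sliding-window error counter.
// Names are illustrative; only the 30 s window and threshold of 50 come from the docs.
class SlidingWindowErrorCounter {
  private timestamps: number[] = [];
  private readonly windowMs: number;
  private readonly threshold: number;

  constructor(windowMs: number = 30000, threshold: number = 50) {
    this.windowMs = windowMs;
    this.threshold = threshold;
  }

  /** Record one error. Returns true when the window holds `threshold` or more errors. */
  record(nowMs: number = Date.now()): boolean {
    this.timestamps.push(nowMs);
    this.prune(nowMs);
    return this.timestamps.length >= this.threshold;
  }

  /** Number of errors still inside the window at `nowMs`. */
  count(nowMs: number = Date.now()): number {
    this.prune(nowMs);
    return this.timestamps.length;
  }

  /** Drop timestamps that are `windowMs` or more older than `nowMs`. */
  private prune(nowMs: number): void {
    const cutoff = nowMs - this.windowMs;
    while (this.timestamps.length > 0 && this.timestamps[0] <= cutoff) {
      this.timestamps.shift();
    }
  }
}
```

A caller would invoke `record()` once per observed `accountUnsubscribe` error and request a restart only when it returns `true`. Isolated errors age out of the window and never accumulate toward the threshold, which is the behaviour the "Why better than blind timer" bullets describe.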
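The diff says `interceptWebSocketErrors()` patches `console.error` to catch SDK WebSocket errors. One way such a patch could look is sketched below. The function `interceptErrors`, the `ErrorSink` type, and the callback shape are assumptions made for illustration and do not reproduce the project's code. The sketch takes the logger as a parameter so it can be exercised without touching the global `console`.

```typescript
// Hypothetical sketch: wrap an `error` method, notify on matching messages,
// and always forward to the original so normal logging is preserved.
type ErrorSink = { error: (...args: unknown[]) => void };

function interceptErrors(
  target: ErrorSink,
  pattern: string,
  onMatch: () => void,
): () => void {
  const original = target.error;
  target.error = (...args: unknown[]): void => {
    // Stringify each argument so Error objects and plain strings are both checked.
    if (args.some((arg) => String(arg).includes(pattern))) {
      onMatch();
    }
    original.apply(target, args);
  };
  // Return an undo function that restores the unpatched method.
  return () => {
    target.error = original;
  };
}
```

In a Node.js process this could be called as `interceptErrors(console, "accountUnsubscribe error", onError)`. Once the error threshold is reached, `onError` could write the flag file with `fs.writeFileSync("/tmp/trading-bot-restart.flag", ...)`. What the real monitor writes into that file, and how `watch-restart.sh` consumes it, is not shown in this diff.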