From d654ad3e5e88fc7fcfa7cfa8ca2691791d77f069 Mon Sep 17 00:00:00 2001
From: mindesbunister <github_service@egonetix.de>
Date: Sat, 15 Nov 2025 09:37:13 +0100
Subject: [PATCH] docs: Add Drift SDK memory leak to Common Pitfalls #1

- Documented memory leak fix from Nov 15, 2025
- Symptoms: Heap grows to 4GB+, Telegram timeouts, OOM crash after 10+ hours
- Root cause: WebSocket subscription accumulation in Drift SDK
- Solution: Automatic reconnection every 4 hours
- Renumbered all subsequent pitfalls (2-33)
- Added monitoring guidance and manual control endpoint info
---
 .github/copilot-instructions.md | 78 +++++++++++++++++++--------------
 1 file changed, 46 insertions(+), 32 deletions(-)

diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index b4df98d..358d1a4 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -997,7 +997,21 @@ ORDER BY MIN(adx) DESC;
 
 ## Common Pitfalls
 
-1. **WRONG RPC PROVIDER (CRITICAL - CATASTROPHIC SYSTEM FAILURE):**
+1. **DRIFT SDK MEMORY LEAK (CRITICAL - Fixed Nov 15, 2025):**
+   - **Symptom:** JavaScript heap out of memory after 10+ hours runtime, Telegram bot timeouts (60s)
+   - **Root Cause:** Drift SDK accumulates WebSocket subscriptions over time without cleanup
+   - **Manifestation:** Thousands of `accountUnsubscribe error: readyState was 2 (CLOSING)` in logs
+   - **Heap Growth:** Normal ~200MB → 4GB+ after 10 hours → OOM crash
+   - **Solution:** Automatic reconnection every 4 hours (`lib/drift/client.ts`)
+   - **Implementation:**
+     * `scheduleReconnection()` - Sets 4-hour timer after initialization
+     * `reconnect()` - Unsubscribes, resets state, reinitializes Drift client
+     * Timer cleared in `disconnect()` to prevent orphaned timers
+   - **Manual Control:** `/api/drift/reconnect` endpoint (POST with auth, GET for status)
+   - **Impact:** System now self-healing, can run indefinitely without manual restarts
+   - **Monitoring:** Watch for scheduled reconnection logs: `🔄 Scheduled reconnection...`
+
+2. **WRONG RPC PROVIDER (CRITICAL - CATASTROPHIC SYSTEM FAILURE):**
    - **FINAL CONCLUSION Nov 14, 2025 (INVESTIGATION COMPLETE):** Helius is the ONLY reliable RPC provider for Drift SDK
    - **Root Cause CONFIRMED:** Alchemy's rate limiting breaks Drift SDK's burst subscription pattern during initialization
    - **Definitive Proof (Nov 14, 21:14 CET):**
@@ -1040,81 +1054,81 @@ ORDER BY MIN(adx) DESC;
    - **Investigation Closed:** This is DEFINITIVE. Use Helius. Do not use Alchemy.
    - **Test Yourself:** `curl 'http://localhost:3001/api/testing/drift-init?rpc=alchemy'`
 
-2. **Prisma not generated in Docker:** Must run `npx prisma generate` in Dockerfile BEFORE `npm run build`
+3. **Prisma not generated in Docker:** Must run `npx prisma generate` in Dockerfile BEFORE `npm run build`
 
-3. **Wrong DATABASE_URL:** Container runtime needs `trading-bot-postgres`, Prisma CLI from host needs `localhost:5432`
+4. **Wrong DATABASE_URL:** Container runtime needs `trading-bot-postgres`, Prisma CLI from host needs `localhost:5432`
 
-4. **Symbol format mismatch:** Always normalize with `normalizeTradingViewSymbol()` before calling Drift (applies to ALL endpoints including `/api/trading/close`)
+5. **Symbol format mismatch:** Always normalize with `normalizeTradingViewSymbol()` before calling Drift (applies to ALL endpoints including `/api/trading/close`)
 
-5. **Missing reduce-only flag:** Exit orders without `reduceOnly: true` can accidentally open new positions
+6. **Missing reduce-only flag:** Exit orders without `reduceOnly: true` can accidentally open new positions
 
-6. **Singleton violations:** Creating multiple DriftClient or Position Manager instances causes connection/state issues
+7. **Singleton violations:** Creating multiple DriftClient or Position Manager instances causes connection/state issues
 
-7. **Type errors with Prisma:** The Trade type from Prisma is only available AFTER `npx prisma generate` - use explicit types or `// @ts-ignore` carefully
+8. **Type errors with Prisma:** The Trade type from Prisma is only available AFTER `npx prisma generate` - use explicit types or `// @ts-ignore` carefully
 
-8. **Quality score duplication:** Signal quality calculation exists in BOTH `check-risk` and `execute` endpoints - keep logic synchronized
+9. **Quality score duplication:** Signal quality calculation exists in BOTH `check-risk` and `execute` endpoints - keep logic synchronized
 
-9. **TP2-as-Runner configuration:** 
+10. **TP2-as-Runner configuration:** 
    - `takeProfit2SizePercent: 0` means "TP2 activates trailing stop, no position close"
    - This creates runner of remaining % after TP1 (default 25%, configurable via TAKE_PROFIT_1_SIZE_PERCENT)
    - `TAKE_PROFIT_2_PERCENT=0.7` sets TP2 trigger price, `TAKE_PROFIT_2_SIZE_PERCENT` should be 0
    - Settings UI correctly shows "TP2 activates trailing stop" with dynamic runner % calculation
 
-9. **P&L calculation CRITICAL:** Use actual entry vs exit price calculation, not SDK values:
+11. **P&L calculation CRITICAL:** Use actual entry vs exit price calculation, not SDK values:
 ```typescript
 const profitPercent = this.calculateProfitPercent(trade.entryPrice, exitPrice, trade.direction)
 const actualRealizedPnL = (closedSizeUSD * profitPercent) / 100
 trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
 ```
 
-10. **Transaction confirmation CRITICAL:** Both `openPosition()` AND `closePosition()` MUST call `connection.confirmTransaction()` after `placePerpOrder()`. Without this, the SDK returns transaction signatures that aren't confirmed on-chain, causing "phantom trades" or "phantom closes". Always check `confirmation.value.err` before proceeding.
+12. **Transaction confirmation CRITICAL:** Both `openPosition()` AND `closePosition()` MUST call `connection.confirmTransaction()` after `placePerpOrder()`. Without this, the SDK returns transaction signatures that aren't confirmed on-chain, causing "phantom trades" or "phantom closes". Always check `confirmation.value.err` before proceeding.
 
-11. **Execution order matters:** When creating trades via API endpoints, the order MUST be:
+13. **Execution order matters:** When creating trades via API endpoints, the order MUST be:
     1. Open position + place exit orders
     2. Save to database (`createTrade()`)
     3. Add to Position Manager (`positionManager.addTrade()`)
     
     If Position Manager is added before database save, race conditions occur where monitoring checks before the trade exists in DB.
 
-12. **New trade grace period:** Position Manager skips "external closure" detection for trades <30 seconds old because Drift positions take 5-10 seconds to propagate after opening. Without this grace period, new positions are immediately detected as "closed externally" and cancelled.
+14. **New trade grace period:** Position Manager skips "external closure" detection for trades <30 seconds old because Drift positions take 5-10 seconds to propagate after opening. Without this grace period, new positions are immediately detected as "closed externally" and cancelled.
 
-13. **Drift minimum position sizes:** Actual minimums differ from documentation:
+15. **Drift minimum position sizes:** Actual minimums differ from documentation:
     - SOL-PERP: 0.1 SOL (~$5-15 depending on price)
     - ETH-PERP: 0.01 ETH (~$38-40 at $4000/ETH)
     - BTC-PERP: 0.0001 BTC (~$10-12 at $100k/BTC)
     
     Always calculate: `minOrderSize × currentPrice` must exceed Drift's $4 minimum. Add buffer for price movement.
 
-14. **Exit reason detection bug:** Position Manager was using current price to determine exit reason, but on-chain orders filled at a DIFFERENT price in the past. Now uses `trade.tp1Hit` / `trade.tp2Hit` flags and realized P&L to correctly identify whether TP1, TP2, or SL triggered. Prevents profitable trades being mislabeled as "SL" exits.
+16. **Exit reason detection bug:** Position Manager was using current price to determine exit reason, but on-chain orders filled at a DIFFERENT price in the past. Now uses `trade.tp1Hit` / `trade.tp2Hit` flags and realized P&L to correctly identify whether TP1, TP2, or SL triggered. Prevents profitable trades being mislabeled as "SL" exits.
 
-15. **Per-symbol cooldown:** Cooldown period is per-symbol, NOT global. ETH trade at 10:00 does NOT block SOL trade at 10:01. Each coin (SOL/ETH/BTC) has independent cooldown timer to avoid missing opportunities on different assets.
+17. **Per-symbol cooldown:** Cooldown period is per-symbol, NOT global. ETH trade at 10:00 does NOT block SOL trade at 10:01. Each coin (SOL/ETH/BTC) has independent cooldown timer to avoid missing opportunities on different assets.
 
-16. **Timeframe-aware scoring crucial:** Signal quality thresholds MUST adjust for 5min vs higher timeframes:
+18. **Timeframe-aware scoring crucial:** Signal quality thresholds MUST adjust for 5min vs higher timeframes:
     - 5min charts naturally have lower ADX (12-22 healthy) and ATR (0.2-0.7% healthy) than daily charts
     - Without timeframe awareness, valid 5min breakouts get blocked as "low quality"
     - Anti-chop filter applies -20 points for extreme sideways regardless of timeframe
     - Always pass `timeframe` parameter from TradingView alerts to `scoreSignalQuality()`
 
-17. **Price position chasing causes flip-flops:** Opening longs at 90%+ range or shorts at <10% range reliably loses money:
+19. **Price position chasing causes flip-flops:** Opening longs at 90%+ range or shorts at <10% range reliably loses money:
     - Database analysis showed overnight flip-flop losses all had price position 9-94% (chasing extremes)
     - These trades had valid ADX (16-18) but entered at worst possible time
     - Quality scoring now penalizes -15 to -30 points for range extremes
     - Prevents rapid reversals when price is already overextended
 
-18. **TradingView ADX minimum for 5min:** Set ADX filter to 15 (not 20+) in TradingView alerts for 5min charts:
+20. **TradingView ADX minimum for 5min:** Set ADX filter to 15 (not 20+) in TradingView alerts for 5min charts:
     - Higher timeframes can use ADX 20+ for strong trends
     - 5min charts need lower threshold to catch valid breakouts
     - Bot's quality scoring provides second-layer filtering with context-aware metrics
     - Two-stage filtering (TradingView + bot) prevents both overtrading and missing valid signals
 
-19. **Prisma Decimal type handling:** Raw SQL queries return Prisma `Decimal` objects, not plain numbers:
+21. **Prisma Decimal type handling:** Raw SQL queries return Prisma `Decimal` objects, not plain numbers:
     - Use `any` type for numeric fields in `$queryRaw` results: `total_pnl: any`
     - Convert with `Number()` before returning to frontend: `totalPnL: Number(stat.total_pnl) || 0`
     - Frontend uses `.toFixed()` which doesn't exist on Decimal objects
     - Applies to all aggregations: SUM(), AVG(), ROUND() - all return Decimal types
     - Example: `/api/analytics/version-comparison` converts all numeric fields
 
-20. **ATR-based trailing stop implementation (Nov 11, 2025):** Runner system was using FIXED 0.3% trailing, causing immediate stops:
+22. **ATR-based trailing stop implementation (Nov 11, 2025):** Runner system was using FIXED 0.3% trailing, causing immediate stops:
     - **Problem:** At $168 SOL, 0.3% = $0.50 wiggle room. Trades with +7-9% MFE exited for losses.
     - **Fix:** `trailingDistancePercent = (atrAtEntry / currentPrice * 100) × trailingStopAtrMultiplier`
     - **Config:** `TRAILING_STOP_ATR_MULTIPLIER=1.5`, `MIN=0.25%`, `MAX=0.9%`, `ACTIVATION=0.5%`
@@ -1124,14 +1138,14 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
     - **ActiveTrade interface:** Must include `atrAtEntry?: number` field for calculation
     - See `ATR_TRAILING_STOP_FIX.md` for full details and database analysis
 
-21. **CreateTradeParams interface sync:** When adding new database fields to Trade model, MUST update `CreateTradeParams` interface in `lib/database/trades.ts`:
+23. **CreateTradeParams interface sync:** When adding new database fields to Trade model, MUST update `CreateTradeParams` interface in `lib/database/trades.ts`:
     - Interface defines what parameters `createTrade()` accepts
     - Must add new field to interface (e.g., `indicatorVersion?: string`)
     - Must add field to Prisma create data object in `createTrade()` function
     - TypeScript build will fail if endpoint passes field not in interface
     - Example: indicatorVersion tracking required 3-file update (execute route.ts, CreateTradeParams interface, createTrade function)
 
-22. **Position.size tokens vs USD bug (CRITICAL - Fixed Nov 12, 2025):**
+24. **Position.size tokens vs USD bug (CRITICAL - Fixed Nov 12, 2025):**
     - **Symptom:** Position Manager detects false TP1 hits, moves SL to breakeven prematurely
     - **Root Cause:** `lib/drift/client.ts` returns `position.size` as BASE ASSET TOKENS (12.28 SOL), not USD ($1,950)
     - **Bug:** Comparing tokens (12.28) directly to USD ($1,950) → 12.28 < 1,950 × 0.95 = "99.4% reduction" → FALSE TP1!
@@ -1149,7 +1163,7 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
     - **Where it matters:** Position Manager, any code querying Drift positions
     - **Database evidence:** Trade showed `tp1Hit: true` when 100% still open, `slMovedToBreakeven: true` prematurely
 
-23. **Leverage display showing global config instead of symbol-specific (Fixed Nov 12, 2025):**
+25. **Leverage display showing global config instead of symbol-specific (Fixed Nov 12, 2025):**
     - **Symptom:** Telegram notifications showing "⚡ Leverage: 10x" when actual position uses 15x or 20x
     - **Root Cause:** API response returning `config.leverage` (global default) instead of symbol-specific value
     - **Fix:** Use actual leverage from `getPositionSizeForSymbol()`:
@@ -1163,13 +1177,13 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
     - **Impact:** Misleading notifications, user confusion about actual position risk
     - **Hierarchy:** Per-symbol ENV (SOLANA_LEVERAGE) → Market config → Global ENV (LEVERAGE) → Defaults
 
-24. **Indicator version tracking (Nov 12, 2025+):**
+26. **Indicator version tracking (Nov 12, 2025+):**
     - Database field `indicatorVersion` tracks which TradingView strategy generated the signal
     - **v5:** Buy/Sell Signal strategy (pre-Nov 12)
     - **v6:** HalfTrend + BarColor strategy (Nov 12+)
     - Used for performance comparison between strategies
 
-26. **External closure duplicate updates bug (CRITICAL - Fixed Nov 12, 2025):**
+27. **External closure duplicate updates bug (CRITICAL - Fixed Nov 12, 2025):**
     - **Symptom:** Trades showing 7-8x larger losses than actual ($58 loss when Drift shows $7 loss)
     - **Root Cause:** Position Manager monitoring loop re-processes external closures multiple times before trade removed from activeTrades Map
     - **Bug sequence:**
@@ -1201,7 +1215,7 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
     - Must update `CreateTradeParams` interface when adding new database fields (see pitfall #21)
     - Analytics endpoint `/api/analytics/version-comparison` compares v5 vs v6 performance
 
-25. **Signal quality threshold adjustment (Nov 12, 2025):**
+28. **Signal quality threshold adjustment (Nov 12, 2025):**
     - **Lowered from 65 → 60** based on data analysis of 161 trades
     - **Reason:** Score 60-64 tier outperformed higher scores:
       - 60-64: 2 trades, +$45.78 total, 100% WR, +$22.89 avg
@@ -1213,7 +1227,7 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
     - **Risk:** Small sample size (2 trades) could be outliers, but downside limited
     - SQL analysis showed clear pattern: stricter filtering was blocking profitable setups
 
-27. **Database-First Pattern (CRITICAL - Fixed Nov 13, 2025):**
+29. **Database-First Pattern (CRITICAL - Fixed Nov 13, 2025):**
     - **Symptom:** Positions opened on Drift with NO database record, NO Position Manager tracking, NO TP/SL protection
     - **Root Cause:** Execute endpoint saved to database AFTER adding to Position Manager, with silent error catch
     - **Bug sequence:**
@@ -1246,7 +1260,7 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
     - **Documentation:** See `CRITICAL_INCIDENT_UNPROTECTED_POSITION.md` for full incident report
     - **Rule:** Database persistence ALWAYS comes before in-memory state updates
 
-28. **DNS retry logic (Nov 13, 2025):**
+30. **DNS retry logic (Nov 13, 2025):**
     - **Problem:** Trading bot fails with "fetch failed" errors when DNS resolution temporarily fails for `mainnet.helius-rpc.com`
     - **Impact:** n8n workflow failures, missed trades, container restart failures
     - **Root Cause:** `EAI_AGAIN` errors are transient DNS issues that resolve in seconds, but bot treated them as permanent failures
@@ -1264,7 +1278,7 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
     - **Note:** DNS retries use 2s delays (fast recovery), rate limit retries use 5s delays (RPC cooldown)
     - **Documentation:** See `docs/DNS_RETRY_LOGIC.md` for monitoring queries and metrics
 
-29. **Declaring fixes "working" before deployment (CRITICAL - Nov 13, 2025):**
+31. **Declaring fixes "working" before deployment (CRITICAL - Nov 13, 2025):**
     - **Symptom:** AI says "position is protected" or "fix is deployed" when container still running old code
     - **Root Cause:** Conflating "code committed to git" with "code running in production"
     - **Real Incident:** Database-first fix committed 15:56, declared "working" at 19:42, but container started 15:06 (old code)
@@ -1281,7 +1295,7 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
     - **Impact:** This is a REAL MONEY system - premature declarations cause financial losses
     - **Documentation:** Added mandatory deployment verification to VERIFICATION MANDATE section
 
-30. **Phantom trade notification workflow breaks (Nov 14, 2025):**
+32. **Phantom trade notification workflow breaks (Nov 14, 2025):**
     - **Symptom:** Phantom trade detected, position opened on Drift, but n8n workflow stops with HTTP 500 error. User NOT notified.
     - **Root Cause:** Execute endpoint returned HTTP 500 when phantom detected, causing n8n chain to halt before Telegram notification
     - **Problem:** Unmonitored phantom position on exchange while user is asleep/away = unlimited risk exposure
@@ -1298,7 +1312,7 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
     - **Impact:** Protects user from unlimited risk during unavailable hours. Phantom trades are rare edge cases (oracle issues, exchange rejections).
     - **Database tracking:** `status='phantom'`, `exitReason='manual'`, enables analysis of phantom frequency and patterns
 
-31. **Flip-flop price context using wrong data (CRITICAL - Fixed Nov 14, 2025):**
+33. **Flip-flop price context using wrong data (CRITICAL - Fixed Nov 14, 2025):**
     - **Symptom:** Flip-flop detection showing "100% price move" when actual movement was 0.2%, allowing trades that should be blocked
     - **Root Cause:** `currentPrice` parameter not available in check-risk endpoint (trade hasn't opened yet), so calculation used undefined/zero
     - **Real incident:** Nov 14, 06:05 CET - SHORT allowed with 0.2% flip-flop, lost -$1.56 in 5 minutes