docs: Add Drift SDK memory leak to Common Pitfalls #1

- Documented memory leak fix from Nov 15, 2025
- Symptoms: Heap grows to 4GB+, Telegram timeouts, OOM crash after 10+ hours
- Root cause: WebSocket subscription accumulation in Drift SDK
- Solution: Automatic reconnection every 4 hours
- Renumbered all subsequent pitfalls (2-33)
- Added monitoring guidance and manual control endpoint info
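
For reviewers, the reconnection approach summarized above might look roughly like the sketch below. It is a minimal illustration, not the code in `lib/drift/client.ts`: the `Connection` interface, the `ReconnectingService` class, and the `hasPendingTimer` getter are hypothetical stand-ins, and subscribe/unsubscribe are kept synchronous for brevity (the real client calls are likely asynchronous). Only the 4-hour interval, the `scheduleReconnection()` / `reconnect()` / `disconnect()` roles, and the log line come from this commit.

```typescript
// Hypothetical stand-in for the Drift client; NOT the SDK's actual API.
interface Connection {
  subscribe(): void;
  unsubscribe(): void;
}

// 4 hours, as stated in the commit message.
const RECONNECT_INTERVAL_MS = 4 * 60 * 60 * 1000;

class ReconnectingService {
  private timer: ReturnType<typeof setTimeout> | null = null;
  private connection: Connection | null = null;
  private createConnection: () => Connection;
  private intervalMs: number;
  reconnectCount = 0;

  constructor(createConnection: () => Connection, intervalMs: number = RECONNECT_INTERVAL_MS) {
    this.createConnection = createConnection;
    this.intervalMs = intervalMs;
  }

  // Creates a fresh connection and arms the reconnection timer.
  initialize(): void {
    this.connection = this.createConnection();
    this.connection.subscribe();
    this.scheduleReconnection();
  }

  // Replaces any existing timer so at most one is ever pending.
  private scheduleReconnection(): void {
    this.clearTimer();
    this.timer = setTimeout(() => {
      console.log("🔄 Scheduled reconnection...");
      this.reconnect();
    }, this.intervalMs);
  }

  // Drops the old connection (and its accumulated subscriptions), then starts over.
  reconnect(): void {
    this.teardown();
    this.reconnectCount += 1;
    this.initialize();
  }

  // Clears the timer first so no orphaned timer fires after shutdown.
  disconnect(): void {
    this.clearTimer();
    this.teardown();
  }

  get hasPendingTimer(): boolean {
    return this.timer !== null;
  }

  private teardown(): void {
    this.connection?.unsubscribe();
    this.connection = null;
  }

  private clearTimer(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
  }
}
```

Re-arming the timer inside `initialize()` means a manual `reconnect()` also resets the 4-hour countdown, which is one reasonable design choice; the actual implementation may differ.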
2025-11-15 09:37:13 +01:00
parent fb4beee418
commit d654ad3e5e
1 changed files with 46 additions and 32 deletions
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -997,7 +997,21 @@ ORDER BY MIN(adx) DESC;

 ## Common Pitfalls

-1. **WRONG RPC PROVIDER (CRITICAL - CATASTROPHIC SYSTEM FAILURE):**
+1. **DRIFT SDK MEMORY LEAK (CRITICAL - Fixed Nov 15, 2025):**
+   - **Symptom:** JavaScript heap out of memory after 10+ hours runtime, Telegram bot timeouts (60s)
+   - **Root Cause:** Drift SDK accumulates WebSocket subscriptions over time without cleanup
+   - **Manifestation:** Thousands of `accountUnsubscribe error: readyState was 2 (CLOSING)` in logs
+   - **Heap Growth:** Normal ~200MB → 4GB+ after 10 hours → OOM crash
+   - **Solution:** Automatic reconnection every 4 hours (`lib/drift/client.ts`)
+   - **Implementation:**
+     * `scheduleReconnection()` - Sets 4-hour timer after initialization
+     * `reconnect()` - Unsubscribes, resets state, reinitializes Drift client
+     * Timer cleared in `disconnect()` to prevent orphaned timers
+   - **Manual Control:** `/api/drift/reconnect` endpoint (POST with auth, GET for status)
+   - **Impact:** System now self-healing, can run indefinitely without manual restarts
+   - **Monitoring:** Watch for scheduled reconnection logs: `🔄 Scheduled reconnection...`
+
+2. **WRONG RPC PROVIDER (CRITICAL - CATASTROPHIC SYSTEM FAILURE):**
   - **FINAL CONCLUSION Nov 14, 2025 (INVESTIGATION COMPLETE):** Helius is the ONLY reliable RPC provider for Drift SDK
   - **Root Cause CONFIRMED:** Alchemy's rate limiting breaks Drift SDK's burst subscription pattern during initialization
   - **Definitive Proof (Nov 14, 21:14 CET):**
@@ -1040,81 +1054,81 @@ ORDER BY MIN(adx) DESC;
   - **Investigation Closed:** This is DEFINITIVE. Use Helius. Do not use Alchemy.
   - **Test Yourself:** `curl 'http://localhost:3001/api/testing/drift-init?rpc=alchemy'`

-2. **Prisma not generated in Docker:** Must run `npx prisma generate` in Dockerfile BEFORE `npm run build`
+3. **Prisma not generated in Docker:** Must run `npx prisma generate` in Dockerfile BEFORE `npm run build`

-3. **Wrong DATABASE_URL:** Container runtime needs `trading-bot-postgres`, Prisma CLI from host needs `localhost:5432`
+4. **Wrong DATABASE_URL:** Container runtime needs `trading-bot-postgres`, Prisma CLI from host needs `localhost:5432`

-4. **Symbol format mismatch:** Always normalize with `normalizeTradingViewSymbol()` before calling Drift (applies to ALL endpoints including `/api/trading/close`)
+5. **Symbol format mismatch:** Always normalize with `normalizeTradingViewSymbol()` before calling Drift (applies to ALL endpoints including `/api/trading/close`)

-5. **Missing reduce-only flag:** Exit orders without `reduceOnly: true` can accidentally open new positions
+6. **Missing reduce-only flag:** Exit orders without `reduceOnly: true` can accidentally open new positions

-6. **Singleton violations:** Creating multiple DriftClient or Position Manager instances causes connection/state issues
+7. **Singleton violations:** Creating multiple DriftClient or Position Manager instances causes connection/state issues

-7. **Type errors with Prisma:** The Trade type from Prisma is only available AFTER `npx prisma generate` - use explicit types or `// @ts-ignore` carefully
+8. **Type errors with Prisma:** The Trade type from Prisma is only available AFTER `npx prisma generate` - use explicit types or `// @ts-ignore` carefully

-8. **Quality score duplication:** Signal quality calculation exists in BOTH `check-risk` and `execute` endpoints - keep logic synchronized
+9. **Quality score duplication:** Signal quality calculation exists in BOTH `check-risk` and `execute` endpoints - keep logic synchronized

-9. **TP2-as-Runner configuration:** 
+10. **TP2-as-Runner configuration:** 
   - `takeProfit2SizePercent: 0` means "TP2 activates trailing stop, no position close"
   - This creates runner of remaining % after TP1 (default 25%, configurable via TAKE_PROFIT_1_SIZE_PERCENT)
   - `TAKE_PROFIT_2_PERCENT=0.7` sets TP2 trigger price, `TAKE_PROFIT_2_SIZE_PERCENT` should be 0
   - Settings UI correctly shows "TP2 activates trailing stop" with dynamic runner % calculation

-9. **P&L calculation CRITICAL:** Use actual entry vs exit price calculation, not SDK values:
+11. **P&L calculation CRITICAL:** Use actual entry vs exit price calculation, not SDK values:
 ```typescript
 const profitPercent = this.calculateProfitPercent(trade.entryPrice, exitPrice, trade.direction)
 const actualRealizedPnL = (closedSizeUSD * profitPercent) / 100
 trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
 ```

-10. **Transaction confirmation CRITICAL:** Both `openPosition()` AND `closePosition()` MUST call `connection.confirmTransaction()` after `placePerpOrder()`. Without this, the SDK returns transaction signatures that aren't confirmed on-chain, causing "phantom trades" or "phantom closes". Always check `confirmation.value.err` before proceeding.
+12. **Transaction confirmation CRITICAL:** Both `openPosition()` AND `closePosition()` MUST call `connection.confirmTransaction()` after `placePerpOrder()`. Without this, the SDK returns transaction signatures that aren't confirmed on-chain, causing "phantom trades" or "phantom closes". Always check `confirmation.value.err` before proceeding.

-11. **Execution order matters:** When creating trades via API endpoints, the order MUST be:
+13. **Execution order matters:** When creating trades via API endpoints, the order MUST be:
    1. Open position + place exit orders
    2. Save to database (`createTrade()`)
    3. Add to Position Manager (`positionManager.addTrade()`)
    
    If Position Manager is added before database save, race conditions occur where monitoring checks before the trade exists in DB.

-12. **New trade grace period:** Position Manager skips "external closure" detection for trades <30 seconds old because Drift positions take 5-10 seconds to propagate after opening. Without this grace period, new positions are immediately detected as "closed externally" and cancelled.
+14. **New trade grace period:** Position Manager skips "external closure" detection for trades <30 seconds old because Drift positions take 5-10 seconds to propagate after opening. Without this grace period, new positions are immediately detected as "closed externally" and cancelled.

-13. **Drift minimum position sizes:** Actual minimums differ from documentation:
+15. **Drift minimum position sizes:** Actual minimums differ from documentation:
    - SOL-PERP: 0.1 SOL (~$5-15 depending on price)
    - ETH-PERP: 0.01 ETH (~$38-40 at $4000/ETH)
    - BTC-PERP: 0.0001 BTC (~$10-12 at $100k/BTC)
    
    Always calculate: `minOrderSize × currentPrice` must exceed Drift's $4 minimum. Add buffer for price movement.

-14. **Exit reason detection bug:** Position Manager was using current price to determine exit reason, but on-chain orders filled at a DIFFERENT price in the past. Now uses `trade.tp1Hit` / `trade.tp2Hit` flags and realized P&L to correctly identify whether TP1, TP2, or SL triggered. Prevents profitable trades being mislabeled as "SL" exits.
+16. **Exit reason detection bug:** Position Manager was using current price to determine exit reason, but on-chain orders filled at a DIFFERENT price in the past. Now uses `trade.tp1Hit` / `trade.tp2Hit` flags and realized P&L to correctly identify whether TP1, TP2, or SL triggered. Prevents profitable trades being mislabeled as "SL" exits.

-15. **Per-symbol cooldown:** Cooldown period is per-symbol, NOT global. ETH trade at 10:00 does NOT block SOL trade at 10:01. Each coin (SOL/ETH/BTC) has independent cooldown timer to avoid missing opportunities on different assets.
+17. **Per-symbol cooldown:** Cooldown period is per-symbol, NOT global. ETH trade at 10:00 does NOT block SOL trade at 10:01. Each coin (SOL/ETH/BTC) has independent cooldown timer to avoid missing opportunities on different assets.

-16. **Timeframe-aware scoring crucial:** Signal quality thresholds MUST adjust for 5min vs higher timeframes:
+18. **Timeframe-aware scoring crucial:** Signal quality thresholds MUST adjust for 5min vs higher timeframes:
    - 5min charts naturally have lower ADX (12-22 healthy) and ATR (0.2-0.7% healthy) than daily charts
    - Without timeframe awareness, valid 5min breakouts get blocked as "low quality"
    - Anti-chop filter applies -20 points for extreme sideways regardless of timeframe
    - Always pass `timeframe` parameter from TradingView alerts to `scoreSignalQuality()`

-17. **Price position chasing causes flip-flops:** Opening longs at 90%+ range or shorts at <10% range reliably loses money:
+19. **Price position chasing causes flip-flops:** Opening longs at 90%+ range or shorts at <10% range reliably loses money:
    - Database analysis showed overnight flip-flop losses all had price position 9-94% (chasing extremes)
    - These trades had valid ADX (16-18) but entered at worst possible time
    - Quality scoring now penalizes -15 to -30 points for range extremes
    - Prevents rapid reversals when price is already overextended

-18. **TradingView ADX minimum for 5min:** Set ADX filter to 15 (not 20+) in TradingView alerts for 5min charts:
+20. **TradingView ADX minimum for 5min:** Set ADX filter to 15 (not 20+) in TradingView alerts for 5min charts:
    - Higher timeframes can use ADX 20+ for strong trends
    - 5min charts need lower threshold to catch valid breakouts
    - Bot's quality scoring provides second-layer filtering with context-aware metrics
    - Two-stage filtering (TradingView + bot) prevents both overtrading and missing valid signals

-19. **Prisma Decimal type handling:** Raw SQL queries return Prisma `Decimal` objects, not plain numbers:
+21. **Prisma Decimal type handling:** Raw SQL queries return Prisma `Decimal` objects, not plain numbers:
    - Use `any` type for numeric fields in `$queryRaw` results: `total_pnl: any`
    - Convert with `Number()` before returning to frontend: `totalPnL: Number(stat.total_pnl) || 0`
    - Frontend uses `.toFixed()` which doesn't exist on Decimal objects
    - Applies to all aggregations: SUM(), AVG(), ROUND() - all return Decimal types
    - Example: `/api/analytics/version-comparison` converts all numeric fields

-20. **ATR-based trailing stop implementation (Nov 11, 2025):** Runner system was using FIXED 0.3% trailing, causing immediate stops:
+22. **ATR-based trailing stop implementation (Nov 11, 2025):** Runner system was using FIXED 0.3% trailing, causing immediate stops:
    - **Problem:** At $168 SOL, 0.3% = $0.50 wiggle room. Trades with +7-9% MFE exited for losses.
    - **Fix:** `trailingDistancePercent = (atrAtEntry / currentPrice * 100) × trailingStopAtrMultiplier`
    - **Config:** `TRAILING_STOP_ATR_MULTIPLIER=1.5`, `MIN=0.25%`, `MAX=0.9%`, `ACTIVATION=0.5%`
@@ -1124,14 +1138,14 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
    - **ActiveTrade interface:** Must include `atrAtEntry?: number` field for calculation
    - See `ATR_TRAILING_STOP_FIX.md` for full details and database analysis

-21. **CreateTradeParams interface sync:** When adding new database fields to Trade model, MUST update `CreateTradeParams` interface in `lib/database/trades.ts`:
+23. **CreateTradeParams interface sync:** When adding new database fields to Trade model, MUST update `CreateTradeParams` interface in `lib/database/trades.ts`:
    - Interface defines what parameters `createTrade()` accepts
    - Must add new field to interface (e.g., `indicatorVersion?: string`)
    - Must add field to Prisma create data object in `createTrade()` function
    - TypeScript build will fail if endpoint passes field not in interface
    - Example: indicatorVersion tracking required 3-file update (execute route.ts, CreateTradeParams interface, createTrade function)

-22. **Position.size tokens vs USD bug (CRITICAL - Fixed Nov 12, 2025):**
+24. **Position.size tokens vs USD bug (CRITICAL - Fixed Nov 12, 2025):**
    - **Symptom:** Position Manager detects false TP1 hits, moves SL to breakeven prematurely
    - **Root Cause:** `lib/drift/client.ts` returns `position.size` as BASE ASSET TOKENS (12.28 SOL), not USD ($1,950)
    - **Bug:** Comparing tokens (12.28) directly to USD ($1,950) → 12.28 < 1,950 × 0.95 = "99.4% reduction" → FALSE TP1!
@@ -1149,7 +1163,7 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
    - **Where it matters:** Position Manager, any code querying Drift positions
    - **Database evidence:** Trade showed `tp1Hit: true` when 100% still open, `slMovedToBreakeven: true` prematurely

-23. **Leverage display showing global config instead of symbol-specific (Fixed Nov 12, 2025):**
+25. **Leverage display showing global config instead of symbol-specific (Fixed Nov 12, 2025):**
    - **Symptom:** Telegram notifications showing "⚡ Leverage: 10x" when actual position uses 15x or 20x
    - **Root Cause:** API response returning `config.leverage` (global default) instead of symbol-specific value
    - **Fix:** Use actual leverage from `getPositionSizeForSymbol()`:
@@ -1163,13 +1177,13 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
    - **Impact:** Misleading notifications, user confusion about actual position risk
    - **Hierarchy:** Per-symbol ENV (SOLANA_LEVERAGE) → Market config → Global ENV (LEVERAGE) → Defaults

-24. **Indicator version tracking (Nov 12, 2025+):**
+26. **Indicator version tracking (Nov 12, 2025+):**
    - Database field `indicatorVersion` tracks which TradingView strategy generated the signal
    - **v5:** Buy/Sell Signal strategy (pre-Nov 12)
    - **v6:** HalfTrend + BarColor strategy (Nov 12+)
    - Used for performance comparison between strategies

-26. **External closure duplicate updates bug (CRITICAL - Fixed Nov 12, 2025):**
+27. **External closure duplicate updates bug (CRITICAL - Fixed Nov 12, 2025):**
    - **Symptom:** Trades showing 7-8x larger losses than actual ($58 loss when Drift shows $7 loss)
    - **Root Cause:** Position Manager monitoring loop re-processes external closures multiple times before trade removed from activeTrades Map
    - **Bug sequence:**
@@ -1201,7 +1215,7 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
    - Must update `CreateTradeParams` interface when adding new database fields (see pitfall #21)
    - Analytics endpoint `/api/analytics/version-comparison` compares v5 vs v6 performance

-25. **Signal quality threshold adjustment (Nov 12, 2025):**
+28. **Signal quality threshold adjustment (Nov 12, 2025):**
    - **Lowered from 65 → 60** based on data analysis of 161 trades
    - **Reason:** Score 60-64 tier outperformed higher scores:
      - 60-64: 2 trades, +$45.78 total, 100% WR, +$22.89 avg
@@ -1213,7 +1227,7 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
    - **Risk:** Small sample size (2 trades) could be outliers, but downside limited
    - SQL analysis showed clear pattern: stricter filtering was blocking profitable setups

-27. **Database-First Pattern (CRITICAL - Fixed Nov 13, 2025):**
+29. **Database-First Pattern (CRITICAL - Fixed Nov 13, 2025):**
    - **Symptom:** Positions opened on Drift with NO database record, NO Position Manager tracking, NO TP/SL protection
    - **Root Cause:** Execute endpoint saved to database AFTER adding to Position Manager, with silent error catch
    - **Bug sequence:**
@@ -1246,7 +1260,7 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
    - **Documentation:** See `CRITICAL_INCIDENT_UNPROTECTED_POSITION.md` for full incident report
    - **Rule:** Database persistence ALWAYS comes before in-memory state updates

-28. **DNS retry logic (Nov 13, 2025):**
+30. **DNS retry logic (Nov 13, 2025):**
    - **Problem:** Trading bot fails with "fetch failed" errors when DNS resolution temporarily fails for `mainnet.helius-rpc.com`
    - **Impact:** n8n workflow failures, missed trades, container restart failures
    - **Root Cause:** `EAI_AGAIN` errors are transient DNS issues that resolve in seconds, but bot treated them as permanent failures
@@ -1264,7 +1278,7 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
    - **Note:** DNS retries use 2s delays (fast recovery), rate limit retries use 5s delays (RPC cooldown)
    - **Documentation:** See `docs/DNS_RETRY_LOGIC.md` for monitoring queries and metrics

-29. **Declaring fixes "working" before deployment (CRITICAL - Nov 13, 2025):**
+31. **Declaring fixes "working" before deployment (CRITICAL - Nov 13, 2025):**
    - **Symptom:** AI says "position is protected" or "fix is deployed" when container still running old code
    - **Root Cause:** Conflating "code committed to git" with "code running in production"
    - **Real Incident:** Database-first fix committed 15:56, declared "working" at 19:42, but container started 15:06 (old code)
@@ -1281,7 +1295,7 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
    - **Impact:** This is a REAL MONEY system - premature declarations cause financial losses
    - **Documentation:** Added mandatory deployment verification to VERIFICATION MANDATE section

-30. **Phantom trade notification workflow breaks (Nov 14, 2025):**
+32. **Phantom trade notification workflow breaks (Nov 14, 2025):**
    - **Symptom:** Phantom trade detected, position opened on Drift, but n8n workflow stops with HTTP 500 error. User NOT notified.
    - **Root Cause:** Execute endpoint returned HTTP 500 when phantom detected, causing n8n chain to halt before Telegram notification
    - **Problem:** Unmonitored phantom position on exchange while user is asleep/away = unlimited risk exposure
@@ -1298,7 +1312,7 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
    - **Impact:** Protects user from unlimited risk during unavailable hours. Phantom trades are rare edge cases (oracle issues, exchange rejections).
    - **Database tracking:** `status='phantom'`, `exitReason='manual'`, enables analysis of phantom frequency and patterns

-31. **Flip-flop price context using wrong data (CRITICAL - Fixed Nov 14, 2025):**
+33. **Flip-flop price context using wrong data (CRITICAL - Fixed Nov 14, 2025):**
    - **Symptom:** Flip-flop detection showing "100% price move" when actual movement was 0.2%, allowing trades that should be blocked
    - **Root Cause:** `currentPrice` parameter not available in check-risk endpoint (trade hasn't opened yet), so calculation used undefined/zero
    - **Real incident:** Nov 14, 06:05 CET - SHORT allowed with 0.2% flip-flop, lost -$1.56 in 5 minutes