From d654ad3e5e88fc7fcfa7cfa8ca2691791d77f069 Mon Sep 17 00:00:00 2001 From: mindesbunister Date: Sat, 15 Nov 2025 09:37:13 +0100 Subject: [PATCH] docs: Add Drift SDK memory leak to Common Pitfalls #1 - Documented memory leak fix from Nov 15, 2025 - Symptoms: Heap grows to 4GB+, Telegram timeouts, OOM crash after 10+ hours - Root cause: WebSocket subscription accumulation in Drift SDK - Solution: Automatic reconnection every 4 hours - Renumbered all subsequent pitfalls (2-33) - Added monitoring guidance and manual control endpoint info --- .github/copilot-instructions.md | 78 +++++++++++++++++++-------------- 1 file changed, 46 insertions(+), 32 deletions(-) diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index b4df98d..358d1a4 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -997,7 +997,21 @@ ORDER BY MIN(adx) DESC; ## Common Pitfalls -1. **WRONG RPC PROVIDER (CRITICAL - CATASTROPHIC SYSTEM FAILURE):** +1. **DRIFT SDK MEMORY LEAK (CRITICAL - Fixed Nov 15, 2025):** + - **Symptom:** JavaScript heap out of memory after 10+ hours runtime, Telegram bot timeouts (60s) + - **Root Cause:** Drift SDK accumulates WebSocket subscriptions over time without cleanup + - **Manifestation:** Thousands of `accountUnsubscribe error: readyState was 2 (CLOSING)` in logs + - **Heap Growth:** Normal ~200MB → 4GB+ after 10 hours → OOM crash + - **Solution:** Automatic reconnection every 4 hours (`lib/drift/client.ts`) + - **Implementation:** + * `scheduleReconnection()` - Sets 4-hour timer after initialization + * `reconnect()` - Unsubscribes, resets state, reinitializes Drift client + * Timer cleared in `disconnect()` to prevent orphaned timers + - **Manual Control:** `/api/drift/reconnect` endpoint (POST with auth, GET for status) + - **Impact:** System now self-healing, can run indefinitely without manual restarts + - **Monitoring:** Watch for scheduled reconnection logs: `🔄 Scheduled reconnection...` + +2. **WRONG RPC PROVIDER (CRITICAL - CATASTROPHIC SYSTEM FAILURE):** - **FINAL CONCLUSION Nov 14, 2025 (INVESTIGATION COMPLETE):** Helius is the ONLY reliable RPC provider for Drift SDK - **Root Cause CONFIRMED:** Alchemy's rate limiting breaks Drift SDK's burst subscription pattern during initialization - **Definitive Proof (Nov 14, 21:14 CET):** @@ -1040,81 +1054,81 @@ ORDER BY MIN(adx) DESC; - **Investigation Closed:** This is DEFINITIVE. Use Helius. Do not use Alchemy. - **Test Yourself:** `curl 'http://localhost:3001/api/testing/drift-init?rpc=alchemy'` -2. **Prisma not generated in Docker:** Must run `npx prisma generate` in Dockerfile BEFORE `npm run build` +3. **Prisma not generated in Docker:** Must run `npx prisma generate` in Dockerfile BEFORE `npm run build` -3. **Wrong DATABASE_URL:** Container runtime needs `trading-bot-postgres`, Prisma CLI from host needs `localhost:5432` +4. **Wrong DATABASE_URL:** Container runtime needs `trading-bot-postgres`, Prisma CLI from host needs `localhost:5432` -4. **Symbol format mismatch:** Always normalize with `normalizeTradingViewSymbol()` before calling Drift (applies to ALL endpoints including `/api/trading/close`) +5. **Symbol format mismatch:** Always normalize with `normalizeTradingViewSymbol()` before calling Drift (applies to ALL endpoints including `/api/trading/close`) -5. **Missing reduce-only flag:** Exit orders without `reduceOnly: true` can accidentally open new positions +6. **Missing reduce-only flag:** Exit orders without `reduceOnly: true` can accidentally open new positions -6. **Singleton violations:** Creating multiple DriftClient or Position Manager instances causes connection/state issues +7. **Singleton violations:** Creating multiple DriftClient or Position Manager instances causes connection/state issues -7. **Type errors with Prisma:** The Trade type from Prisma is only available AFTER `npx prisma generate` - use explicit types or `// @ts-ignore` carefully +8. **Type errors with Prisma:** The Trade type from Prisma is only available AFTER `npx prisma generate` - use explicit types or `// @ts-ignore` carefully -8. **Quality score duplication:** Signal quality calculation exists in BOTH `check-risk` and `execute` endpoints - keep logic synchronized +9. **Quality score duplication:** Signal quality calculation exists in BOTH `check-risk` and `execute` endpoints - keep logic synchronized -9. **TP2-as-Runner configuration:** +10. **TP2-as-Runner configuration:** - `takeProfit2SizePercent: 0` means "TP2 activates trailing stop, no position close" - This creates runner of remaining % after TP1 (default 25%, configurable via TAKE_PROFIT_1_SIZE_PERCENT) - `TAKE_PROFIT_2_PERCENT=0.7` sets TP2 trigger price, `TAKE_PROFIT_2_SIZE_PERCENT` should be 0 - Settings UI correctly shows "TP2 activates trailing stop" with dynamic runner % calculation -9. **P&L calculation CRITICAL:** Use actual entry vs exit price calculation, not SDK values: +11. **P&L calculation CRITICAL:** Use actual entry vs exit price calculation, not SDK values: ```typescript const profitPercent = this.calculateProfitPercent(trade.entryPrice, exitPrice, trade.direction) const actualRealizedPnL = (closedSizeUSD * profitPercent) / 100 trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK ``` -10. **Transaction confirmation CRITICAL:** Both `openPosition()` AND `closePosition()` MUST call `connection.confirmTransaction()` after `placePerpOrder()`. Without this, the SDK returns transaction signatures that aren't confirmed on-chain, causing "phantom trades" or "phantom closes". Always check `confirmation.value.err` before proceeding. +12. **Transaction confirmation CRITICAL:** Both `openPosition()` AND `closePosition()` MUST call `connection.confirmTransaction()` after `placePerpOrder()`. Without this, the SDK returns transaction signatures that aren't confirmed on-chain, causing "phantom trades" or "phantom closes". Always check `confirmation.value.err` before proceeding. -11. **Execution order matters:** When creating trades via API endpoints, the order MUST be: +13. **Execution order matters:** When creating trades via API endpoints, the order MUST be: 1. Open position + place exit orders 2. Save to database (`createTrade()`) 3. Add to Position Manager (`positionManager.addTrade()`) If Position Manager is added before database save, race conditions occur where monitoring checks before the trade exists in DB. -12. **New trade grace period:** Position Manager skips "external closure" detection for trades <30 seconds old because Drift positions take 5-10 seconds to propagate after opening. Without this grace period, new positions are immediately detected as "closed externally" and cancelled. +14. **New trade grace period:** Position Manager skips "external closure" detection for trades <30 seconds old because Drift positions take 5-10 seconds to propagate after opening. Without this grace period, new positions are immediately detected as "closed externally" and cancelled. -13. **Drift minimum position sizes:** Actual minimums differ from documentation: +15. **Drift minimum position sizes:** Actual minimums differ from documentation: - SOL-PERP: 0.1 SOL (~$5-15 depending on price) - ETH-PERP: 0.01 ETH (~$38-40 at $4000/ETH) - BTC-PERP: 0.0001 BTC (~$10-12 at $100k/BTC) Always calculate: `minOrderSize × currentPrice` must exceed Drift's $4 minimum. Add buffer for price movement. -14. **Exit reason detection bug:** Position Manager was using current price to determine exit reason, but on-chain orders filled at a DIFFERENT price in the past. Now uses `trade.tp1Hit` / `trade.tp2Hit` flags and realized P&L to correctly identify whether TP1, TP2, or SL triggered. Prevents profitable trades being mislabeled as "SL" exits. +16. **Exit reason detection bug:** Position Manager was using current price to determine exit reason, but on-chain orders filled at a DIFFERENT price in the past. Now uses `trade.tp1Hit` / `trade.tp2Hit` flags and realized P&L to correctly identify whether TP1, TP2, or SL triggered. Prevents profitable trades being mislabeled as "SL" exits. -15. **Per-symbol cooldown:** Cooldown period is per-symbol, NOT global. ETH trade at 10:00 does NOT block SOL trade at 10:01. Each coin (SOL/ETH/BTC) has independent cooldown timer to avoid missing opportunities on different assets. +17. **Per-symbol cooldown:** Cooldown period is per-symbol, NOT global. ETH trade at 10:00 does NOT block SOL trade at 10:01. Each coin (SOL/ETH/BTC) has independent cooldown timer to avoid missing opportunities on different assets. -16. **Timeframe-aware scoring crucial:** Signal quality thresholds MUST adjust for 5min vs higher timeframes: +18. **Timeframe-aware scoring crucial:** Signal quality thresholds MUST adjust for 5min vs higher timeframes: - 5min charts naturally have lower ADX (12-22 healthy) and ATR (0.2-0.7% healthy) than daily charts - Without timeframe awareness, valid 5min breakouts get blocked as "low quality" - Anti-chop filter applies -20 points for extreme sideways regardless of timeframe - Always pass `timeframe` parameter from TradingView alerts to `scoreSignalQuality()` -17. **Price position chasing causes flip-flops:** Opening longs at 90%+ range or shorts at <10% range reliably loses money: +19. **Price position chasing causes flip-flops:** Opening longs at 90%+ range or shorts at <10% range reliably loses money: - Database analysis showed overnight flip-flop losses all had price position 9-94% (chasing extremes) - These trades had valid ADX (16-18) but entered at worst possible time - Quality scoring now penalizes -15 to -30 points for range extremes - Prevents rapid reversals when price is already overextended -18. **TradingView ADX minimum for 5min:** Set ADX filter to 15 (not 20+) in TradingView alerts for 5min charts: +20. **TradingView ADX minimum for 5min:** Set ADX filter to 15 (not 20+) in TradingView alerts for 5min charts: - Higher timeframes can use ADX 20+ for strong trends - 5min charts need lower threshold to catch valid breakouts - Bot's quality scoring provides second-layer filtering with context-aware metrics - Two-stage filtering (TradingView + bot) prevents both overtrading and missing valid signals -19. **Prisma Decimal type handling:** Raw SQL queries return Prisma `Decimal` objects, not plain numbers: +21. **Prisma Decimal type handling:** Raw SQL queries return Prisma `Decimal` objects, not plain numbers: - Use `any` type for numeric fields in `$queryRaw` results: `total_pnl: any` - Convert with `Number()` before returning to frontend: `totalPnL: Number(stat.total_pnl) || 0` - Frontend uses `.toFixed()` which doesn't exist on Decimal objects - Applies to all aggregations: SUM(), AVG(), ROUND() - all return Decimal types - Example: `/api/analytics/version-comparison` converts all numeric fields -20. **ATR-based trailing stop implementation (Nov 11, 2025):** Runner system was using FIXED 0.3% trailing, causing immediate stops: +22. **ATR-based trailing stop implementation (Nov 11, 2025):** Runner system was using FIXED 0.3% trailing, causing immediate stops: - **Problem:** At $168 SOL, 0.3% = $0.50 wiggle room. Trades with +7-9% MFE exited for losses. - **Fix:** `trailingDistancePercent = (atrAtEntry / currentPrice * 100) × trailingStopAtrMultiplier` - **Config:** `TRAILING_STOP_ATR_MULTIPLIER=1.5`, `MIN=0.25%`, `MAX=0.9%`, `ACTIVATION=0.5%` @@ -1124,14 +1138,14 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK - **ActiveTrade interface:** Must include `atrAtEntry?: number` field for calculation - See `ATR_TRAILING_STOP_FIX.md` for full details and database analysis -21. **CreateTradeParams interface sync:** When adding new database fields to Trade model, MUST update `CreateTradeParams` interface in `lib/database/trades.ts`: +23. **CreateTradeParams interface sync:** When adding new database fields to Trade model, MUST update `CreateTradeParams` interface in `lib/database/trades.ts`: - Interface defines what parameters `createTrade()` accepts - Must add new field to interface (e.g., `indicatorVersion?: string`) - Must add field to Prisma create data object in `createTrade()` function - TypeScript build will fail if endpoint passes field not in interface - Example: indicatorVersion tracking required 3-file update (execute route.ts, CreateTradeParams interface, createTrade function) -22. **Position.size tokens vs USD bug (CRITICAL - Fixed Nov 12, 2025):** +24. **Position.size tokens vs USD bug (CRITICAL - Fixed Nov 12, 2025):** - **Symptom:** Position Manager detects false TP1 hits, moves SL to breakeven prematurely - **Root Cause:** `lib/drift/client.ts` returns `position.size` as BASE ASSET TOKENS (12.28 SOL), not USD ($1,950) - **Bug:** Comparing tokens (12.28) directly to USD ($1,950) → 12.28 < 1,950 × 0.95 = "99.4% reduction" → FALSE TP1! @@ -1149,7 +1163,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK - **Where it matters:** Position Manager, any code querying Drift positions - **Database evidence:** Trade showed `tp1Hit: true` when 100% still open, `slMovedToBreakeven: true` prematurely -23. **Leverage display showing global config instead of symbol-specific (Fixed Nov 12, 2025):** +25. **Leverage display showing global config instead of symbol-specific (Fixed Nov 12, 2025):** - **Symptom:** Telegram notifications showing "⚡ Leverage: 10x" when actual position uses 15x or 20x - **Root Cause:** API response returning `config.leverage` (global default) instead of symbol-specific value - **Fix:** Use actual leverage from `getPositionSizeForSymbol()`: @@ -1163,13 +1177,13 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK - **Impact:** Misleading notifications, user confusion about actual position risk - **Hierarchy:** Per-symbol ENV (SOLANA_LEVERAGE) → Market config → Global ENV (LEVERAGE) → Defaults -24. **Indicator version tracking (Nov 12, 2025+):** +26. **Indicator version tracking (Nov 12, 2025+):** - Database field `indicatorVersion` tracks which TradingView strategy generated the signal - **v5:** Buy/Sell Signal strategy (pre-Nov 12) - **v6:** HalfTrend + BarColor strategy (Nov 12+) - Used for performance comparison between strategies -26. **External closure duplicate updates bug (CRITICAL - Fixed Nov 12, 2025):** +27. **External closure duplicate updates bug (CRITICAL - Fixed Nov 12, 2025):** - **Symptom:** Trades showing 7-8x larger losses than actual ($58 loss when Drift shows $7 loss) - **Root Cause:** Position Manager monitoring loop re-processes external closures multiple times before trade removed from activeTrades Map - **Bug sequence:** @@ -1201,7 +1215,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK - Must update `CreateTradeParams` interface when adding new database fields (see pitfall #21) - Analytics endpoint `/api/analytics/version-comparison` compares v5 vs v6 performance -25. **Signal quality threshold adjustment (Nov 12, 2025):** +28. **Signal quality threshold adjustment (Nov 12, 2025):** - **Lowered from 65 → 60** based on data analysis of 161 trades - **Reason:** Score 60-64 tier outperformed higher scores: - 60-64: 2 trades, +$45.78 total, 100% WR, +$22.89 avg @@ -1213,7 +1227,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK - **Risk:** Small sample size (2 trades) could be outliers, but downside limited - SQL analysis showed clear pattern: stricter filtering was blocking profitable setups -27. **Database-First Pattern (CRITICAL - Fixed Nov 13, 2025):** +29. **Database-First Pattern (CRITICAL - Fixed Nov 13, 2025):** - **Symptom:** Positions opened on Drift with NO database record, NO Position Manager tracking, NO TP/SL protection - **Root Cause:** Execute endpoint saved to database AFTER adding to Position Manager, with silent error catch - **Bug sequence:** @@ -1246,7 +1260,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK - **Documentation:** See `CRITICAL_INCIDENT_UNPROTECTED_POSITION.md` for full incident report - **Rule:** Database persistence ALWAYS comes before in-memory state updates -28. **DNS retry logic (Nov 13, 2025):** +30. **DNS retry logic (Nov 13, 2025):** - **Problem:** Trading bot fails with "fetch failed" errors when DNS resolution temporarily fails for `mainnet.helius-rpc.com` - **Impact:** n8n workflow failures, missed trades, container restart failures - **Root Cause:** `EAI_AGAIN` errors are transient DNS issues that resolve in seconds, but bot treated them as permanent failures @@ -1264,7 +1278,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK - **Note:** DNS retries use 2s delays (fast recovery), rate limit retries use 5s delays (RPC cooldown) - **Documentation:** See `docs/DNS_RETRY_LOGIC.md` for monitoring queries and metrics -29. **Declaring fixes "working" before deployment (CRITICAL - Nov 13, 2025):** +31. **Declaring fixes "working" before deployment (CRITICAL - Nov 13, 2025):** - **Symptom:** AI says "position is protected" or "fix is deployed" when container still running old code - **Root Cause:** Conflating "code committed to git" with "code running in production" - **Real Incident:** Database-first fix committed 15:56, declared "working" at 19:42, but container started 15:06 (old code) @@ -1281,7 +1295,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK - **Impact:** This is a REAL MONEY system - premature declarations cause financial losses - **Documentation:** Added mandatory deployment verification to VERIFICATION MANDATE section -30. **Phantom trade notification workflow breaks (Nov 14, 2025):** +32. **Phantom trade notification workflow breaks (Nov 14, 2025):** - **Symptom:** Phantom trade detected, position opened on Drift, but n8n workflow stops with HTTP 500 error. User NOT notified. - **Root Cause:** Execute endpoint returned HTTP 500 when phantom detected, causing n8n chain to halt before Telegram notification - **Problem:** Unmonitored phantom position on exchange while user is asleep/away = unlimited risk exposure @@ -1298,7 +1312,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK - **Impact:** Protects user from unlimited risk during unavailable hours. Phantom trades are rare edge cases (oracle issues, exchange rejections). - **Database tracking:** `status='phantom'`, `exitReason='manual'`, enables analysis of phantom frequency and patterns -31. **Flip-flop price context using wrong data (CRITICAL - Fixed Nov 14, 2025):** +33. **Flip-flop price context using wrong data (CRITICAL - Fixed Nov 14, 2025):** - **Symptom:** Flip-flop detection showing "100% price move" when actual movement was 0.2%, allowing trades that should be blocked - **Root Cause:** `currentPrice` parameter not available in check-risk endpoint (trade hasn't opened yet), so calculation used undefined/zero - **Real incident:** Nov 14, 06:05 CET - SHORT allowed with 0.2% flip-flop, lost -$1.56 in 5 minutes