docs: Add Drift SDK memory leak to Common Pitfalls #1

- Documented memory leak fix from Nov 15, 2025
- Symptoms: Heap grows to 4GB+, Telegram timeouts, OOM crash after 10+ hours
- Root cause: WebSocket subscription accumulation in Drift SDK
- Solution: Automatic reconnection every 4 hours
- Renumbered all subsequent pitfalls (2-33)
- Added monitoring guidance and manual control endpoint info
This commit is contained in:
mindesbunister
2025-11-15 09:37:13 +01:00
parent fb4beee418
commit d654ad3e5e

View File

@@ -997,7 +997,21 @@ ORDER BY MIN(adx) DESC;
## Common Pitfalls
1. **WRONG RPC PROVIDER (CRITICAL - CATASTROPHIC SYSTEM FAILURE):**
1. **DRIFT SDK MEMORY LEAK (CRITICAL - Fixed Nov 15, 2025):**
- **Symptom:** JavaScript heap out of memory after 10+ hours runtime, Telegram bot timeouts (60s)
- **Root Cause:** Drift SDK accumulates WebSocket subscriptions over time without cleanup
- **Manifestation:** Thousands of `accountUnsubscribe error: readyState was 2 (CLOSING)` in logs
- **Heap Growth:** Normal ~200MB → 4GB+ after 10 hours → OOM crash
- **Solution:** Automatic reconnection every 4 hours (`lib/drift/client.ts`)
- **Implementation:**
* `scheduleReconnection()` - Sets 4-hour timer after initialization
* `reconnect()` - Unsubscribes, resets state, reinitializes Drift client
* Timer cleared in `disconnect()` to prevent orphaned timers
- **Manual Control:** `/api/drift/reconnect` endpoint (POST with auth, GET for status)
- **Impact:** System now self-healing, can run indefinitely without manual restarts
- **Monitoring:** Watch for scheduled reconnection logs: `🔄 Scheduled reconnection...`
2. **WRONG RPC PROVIDER (CRITICAL - CATASTROPHIC SYSTEM FAILURE):**
- **FINAL CONCLUSION Nov 14, 2025 (INVESTIGATION COMPLETE):** Helius is the ONLY reliable RPC provider for Drift SDK
- **Root Cause CONFIRMED:** Alchemy's rate limiting breaks Drift SDK's burst subscription pattern during initialization
- **Definitive Proof (Nov 14, 21:14 CET):**
@@ -1040,81 +1054,81 @@ ORDER BY MIN(adx) DESC;
- **Investigation Closed:** This is DEFINITIVE. Use Helius. Do not use Alchemy.
- **Test Yourself:** `curl 'http://localhost:3001/api/testing/drift-init?rpc=alchemy'`
2. **Prisma not generated in Docker:** Must run `npx prisma generate` in Dockerfile BEFORE `npm run build`
3. **Prisma not generated in Docker:** Must run `npx prisma generate` in Dockerfile BEFORE `npm run build`
3. **Wrong DATABASE_URL:** Container runtime needs `trading-bot-postgres`, Prisma CLI from host needs `localhost:5432`
4. **Wrong DATABASE_URL:** Container runtime needs `trading-bot-postgres`, Prisma CLI from host needs `localhost:5432`
4. **Symbol format mismatch:** Always normalize with `normalizeTradingViewSymbol()` before calling Drift (applies to ALL endpoints including `/api/trading/close`)
5. **Symbol format mismatch:** Always normalize with `normalizeTradingViewSymbol()` before calling Drift (applies to ALL endpoints including `/api/trading/close`)
5. **Missing reduce-only flag:** Exit orders without `reduceOnly: true` can accidentally open new positions
6. **Missing reduce-only flag:** Exit orders without `reduceOnly: true` can accidentally open new positions
6. **Singleton violations:** Creating multiple DriftClient or Position Manager instances causes connection/state issues
7. **Singleton violations:** Creating multiple DriftClient or Position Manager instances causes connection/state issues
7. **Type errors with Prisma:** The Trade type from Prisma is only available AFTER `npx prisma generate` - use explicit types or `// @ts-ignore` carefully
8. **Type errors with Prisma:** The Trade type from Prisma is only available AFTER `npx prisma generate` - use explicit types or `// @ts-ignore` carefully
8. **Quality score duplication:** Signal quality calculation exists in BOTH `check-risk` and `execute` endpoints - keep logic synchronized
9. **Quality score duplication:** Signal quality calculation exists in BOTH `check-risk` and `execute` endpoints - keep logic synchronized
9. **TP2-as-Runner configuration:**
10. **TP2-as-Runner configuration:**
- `takeProfit2SizePercent: 0` means "TP2 activates trailing stop, no position close"
- This creates runner of remaining % after TP1 (default 25%, configurable via TAKE_PROFIT_1_SIZE_PERCENT)
- `TAKE_PROFIT_2_PERCENT=0.7` sets TP2 trigger price, `TAKE_PROFIT_2_SIZE_PERCENT` should be 0
- Settings UI correctly shows "TP2 activates trailing stop" with dynamic runner % calculation
9. **P&L calculation CRITICAL:** Use actual entry vs exit price calculation, not SDK values:
11. **P&L calculation CRITICAL:** Use actual entry vs exit price calculation, not SDK values:
```typescript
const profitPercent = this.calculateProfitPercent(trade.entryPrice, exitPrice, trade.direction)
const actualRealizedPnL = (closedSizeUSD * profitPercent) / 100
trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
```
10. **Transaction confirmation CRITICAL:** Both `openPosition()` AND `closePosition()` MUST call `connection.confirmTransaction()` after `placePerpOrder()`. Without this, the SDK returns transaction signatures that aren't confirmed on-chain, causing "phantom trades" or "phantom closes". Always check `confirmation.value.err` before proceeding.
12. **Transaction confirmation CRITICAL:** Both `openPosition()` AND `closePosition()` MUST call `connection.confirmTransaction()` after `placePerpOrder()`. Without this, the SDK returns transaction signatures that aren't confirmed on-chain, causing "phantom trades" or "phantom closes". Always check `confirmation.value.err` before proceeding.
11. **Execution order matters:** When creating trades via API endpoints, the order MUST be:
13. **Execution order matters:** When creating trades via API endpoints, the order MUST be:
1. Open position + place exit orders
2. Save to database (`createTrade()`)
3. Add to Position Manager (`positionManager.addTrade()`)
If Position Manager is added before database save, race conditions occur where monitoring checks before the trade exists in DB.
12. **New trade grace period:** Position Manager skips "external closure" detection for trades <30 seconds old because Drift positions take 5-10 seconds to propagate after opening. Without this grace period, new positions are immediately detected as "closed externally" and cancelled.
14. **New trade grace period:** Position Manager skips "external closure" detection for trades <30 seconds old because Drift positions take 5-10 seconds to propagate after opening. Without this grace period, new positions are immediately detected as "closed externally" and cancelled.
13. **Drift minimum position sizes:** Actual minimums differ from documentation:
15. **Drift minimum position sizes:** Actual minimums differ from documentation:
- SOL-PERP: 0.1 SOL (~$5-15 depending on price)
- ETH-PERP: 0.01 ETH (~$38-40 at $4000/ETH)
- BTC-PERP: 0.0001 BTC (~$10-12 at $100k/BTC)
Always calculate: `minOrderSize × currentPrice` must exceed Drift's $4 minimum. Add buffer for price movement.
14. **Exit reason detection bug:** Position Manager was using current price to determine exit reason, but on-chain orders filled at a DIFFERENT price in the past. Now uses `trade.tp1Hit` / `trade.tp2Hit` flags and realized P&L to correctly identify whether TP1, TP2, or SL triggered. Prevents profitable trades being mislabeled as "SL" exits.
16. **Exit reason detection bug:** Position Manager was using current price to determine exit reason, but on-chain orders filled at a DIFFERENT price in the past. Now uses `trade.tp1Hit` / `trade.tp2Hit` flags and realized P&L to correctly identify whether TP1, TP2, or SL triggered. Prevents profitable trades being mislabeled as "SL" exits.
15. **Per-symbol cooldown:** Cooldown period is per-symbol, NOT global. ETH trade at 10:00 does NOT block SOL trade at 10:01. Each coin (SOL/ETH/BTC) has independent cooldown timer to avoid missing opportunities on different assets.
17. **Per-symbol cooldown:** Cooldown period is per-symbol, NOT global. ETH trade at 10:00 does NOT block SOL trade at 10:01. Each coin (SOL/ETH/BTC) has independent cooldown timer to avoid missing opportunities on different assets.
16. **Timeframe-aware scoring crucial:** Signal quality thresholds MUST adjust for 5min vs higher timeframes:
18. **Timeframe-aware scoring crucial:** Signal quality thresholds MUST adjust for 5min vs higher timeframes:
- 5min charts naturally have lower ADX (12-22 healthy) and ATR (0.2-0.7% healthy) than daily charts
- Without timeframe awareness, valid 5min breakouts get blocked as "low quality"
- Anti-chop filter applies -20 points for extreme sideways regardless of timeframe
- Always pass `timeframe` parameter from TradingView alerts to `scoreSignalQuality()`
17. **Price position chasing causes flip-flops:** Opening longs at 90%+ range or shorts at <10% range reliably loses money:
19. **Price position chasing causes flip-flops:** Opening longs at 90%+ range or shorts at <10% range reliably loses money:
- Database analysis showed overnight flip-flop losses all had price position 9-94% (chasing extremes)
- These trades had valid ADX (16-18) but entered at worst possible time
- Quality scoring now penalizes -15 to -30 points for range extremes
- Prevents rapid reversals when price is already overextended
18. **TradingView ADX minimum for 5min:** Set ADX filter to 15 (not 20+) in TradingView alerts for 5min charts:
20. **TradingView ADX minimum for 5min:** Set ADX filter to 15 (not 20+) in TradingView alerts for 5min charts:
- Higher timeframes can use ADX 20+ for strong trends
- 5min charts need lower threshold to catch valid breakouts
- Bot's quality scoring provides second-layer filtering with context-aware metrics
- Two-stage filtering (TradingView + bot) prevents both overtrading and missing valid signals
19. **Prisma Decimal type handling:** Raw SQL queries return Prisma `Decimal` objects, not plain numbers:
21. **Prisma Decimal type handling:** Raw SQL queries return Prisma `Decimal` objects, not plain numbers:
- Use `any` type for numeric fields in `$queryRaw` results: `total_pnl: any`
- Convert with `Number()` before returning to frontend: `totalPnL: Number(stat.total_pnl) || 0`
- Frontend uses `.toFixed()` which doesn't exist on Decimal objects
- Applies to all aggregations: SUM(), AVG(), ROUND() - all return Decimal types
- Example: `/api/analytics/version-comparison` converts all numeric fields
20. **ATR-based trailing stop implementation (Nov 11, 2025):** Runner system was using FIXED 0.3% trailing, causing immediate stops:
22. **ATR-based trailing stop implementation (Nov 11, 2025):** Runner system was using FIXED 0.3% trailing, causing immediate stops:
- **Problem:** At $168 SOL, 0.3% = $0.50 wiggle room. Trades with +7-9% MFE exited for losses.
- **Fix:** `trailingDistancePercent = (atrAtEntry / currentPrice * 100) × trailingStopAtrMultiplier`
- **Config:** `TRAILING_STOP_ATR_MULTIPLIER=1.5`, `MIN=0.25%`, `MAX=0.9%`, `ACTIVATION=0.5%`
@@ -1124,14 +1138,14 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
- **ActiveTrade interface:** Must include `atrAtEntry?: number` field for calculation
- See `ATR_TRAILING_STOP_FIX.md` for full details and database analysis
21. **CreateTradeParams interface sync:** When adding new database fields to Trade model, MUST update `CreateTradeParams` interface in `lib/database/trades.ts`:
23. **CreateTradeParams interface sync:** When adding new database fields to Trade model, MUST update `CreateTradeParams` interface in `lib/database/trades.ts`:
- Interface defines what parameters `createTrade()` accepts
- Must add new field to interface (e.g., `indicatorVersion?: string`)
- Must add field to Prisma create data object in `createTrade()` function
- TypeScript build will fail if endpoint passes field not in interface
- Example: indicatorVersion tracking required 3-file update (execute route.ts, CreateTradeParams interface, createTrade function)
22. **Position.size tokens vs USD bug (CRITICAL - Fixed Nov 12, 2025):**
24. **Position.size tokens vs USD bug (CRITICAL - Fixed Nov 12, 2025):**
- **Symptom:** Position Manager detects false TP1 hits, moves SL to breakeven prematurely
- **Root Cause:** `lib/drift/client.ts` returns `position.size` as BASE ASSET TOKENS (12.28 SOL), not USD ($1,950)
- **Bug:** Comparing tokens (12.28) directly to USD ($1,950) → 12.28 < 1,950 × 0.95 = "99.4% reduction" → FALSE TP1!
@@ -1149,7 +1163,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
- **Where it matters:** Position Manager, any code querying Drift positions
- **Database evidence:** Trade showed `tp1Hit: true` when 100% still open, `slMovedToBreakeven: true` prematurely
23. **Leverage display showing global config instead of symbol-specific (Fixed Nov 12, 2025):**
25. **Leverage display showing global config instead of symbol-specific (Fixed Nov 12, 2025):**
- **Symptom:** Telegram notifications showing "⚡ Leverage: 10x" when actual position uses 15x or 20x
- **Root Cause:** API response returning `config.leverage` (global default) instead of symbol-specific value
- **Fix:** Use actual leverage from `getPositionSizeForSymbol()`:
@@ -1163,13 +1177,13 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
- **Impact:** Misleading notifications, user confusion about actual position risk
- **Hierarchy:** Per-symbol ENV (SOLANA_LEVERAGE) → Market config → Global ENV (LEVERAGE) → Defaults
24. **Indicator version tracking (Nov 12, 2025+):**
26. **Indicator version tracking (Nov 12, 2025+):**
- Database field `indicatorVersion` tracks which TradingView strategy generated the signal
- **v5:** Buy/Sell Signal strategy (pre-Nov 12)
- **v6:** HalfTrend + BarColor strategy (Nov 12+)
- Used for performance comparison between strategies
26. **External closure duplicate updates bug (CRITICAL - Fixed Nov 12, 2025):**
27. **External closure duplicate updates bug (CRITICAL - Fixed Nov 12, 2025):**
- **Symptom:** Trades showing 7-8x larger losses than actual ($58 loss when Drift shows $7 loss)
- **Root Cause:** Position Manager monitoring loop re-processes external closures multiple times before trade removed from activeTrades Map
- **Bug sequence:**
@@ -1201,7 +1215,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
- Must update `CreateTradeParams` interface when adding new database fields (see pitfall #21)
- Analytics endpoint `/api/analytics/version-comparison` compares v5 vs v6 performance
25. **Signal quality threshold adjustment (Nov 12, 2025):**
28. **Signal quality threshold adjustment (Nov 12, 2025):**
- **Lowered from 65 → 60** based on data analysis of 161 trades
- **Reason:** Score 60-64 tier outperformed higher scores:
- 60-64: 2 trades, +$45.78 total, 100% WR, +$22.89 avg
@@ -1213,7 +1227,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
- **Risk:** Small sample size (2 trades) could be outliers, but downside limited
- SQL analysis showed clear pattern: stricter filtering was blocking profitable setups
27. **Database-First Pattern (CRITICAL - Fixed Nov 13, 2025):**
29. **Database-First Pattern (CRITICAL - Fixed Nov 13, 2025):**
- **Symptom:** Positions opened on Drift with NO database record, NO Position Manager tracking, NO TP/SL protection
- **Root Cause:** Execute endpoint saved to database AFTER adding to Position Manager, with silent error catch
- **Bug sequence:**
@@ -1246,7 +1260,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
- **Documentation:** See `CRITICAL_INCIDENT_UNPROTECTED_POSITION.md` for full incident report
- **Rule:** Database persistence ALWAYS comes before in-memory state updates
28. **DNS retry logic (Nov 13, 2025):**
30. **DNS retry logic (Nov 13, 2025):**
- **Problem:** Trading bot fails with "fetch failed" errors when DNS resolution temporarily fails for `mainnet.helius-rpc.com`
- **Impact:** n8n workflow failures, missed trades, container restart failures
- **Root Cause:** `EAI_AGAIN` errors are transient DNS issues that resolve in seconds, but bot treated them as permanent failures
@@ -1264,7 +1278,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
- **Note:** DNS retries use 2s delays (fast recovery), rate limit retries use 5s delays (RPC cooldown)
- **Documentation:** See `docs/DNS_RETRY_LOGIC.md` for monitoring queries and metrics
29. **Declaring fixes "working" before deployment (CRITICAL - Nov 13, 2025):**
31. **Declaring fixes "working" before deployment (CRITICAL - Nov 13, 2025):**
- **Symptom:** AI says "position is protected" or "fix is deployed" when container still running old code
- **Root Cause:** Conflating "code committed to git" with "code running in production"
- **Real Incident:** Database-first fix committed 15:56, declared "working" at 19:42, but container started 15:06 (old code)
@@ -1281,7 +1295,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
- **Impact:** This is a REAL MONEY system - premature declarations cause financial losses
- **Documentation:** Added mandatory deployment verification to VERIFICATION MANDATE section
30. **Phantom trade notification workflow breaks (Nov 14, 2025):**
32. **Phantom trade notification workflow breaks (Nov 14, 2025):**
- **Symptom:** Phantom trade detected, position opened on Drift, but n8n workflow stops with HTTP 500 error. User NOT notified.
- **Root Cause:** Execute endpoint returned HTTP 500 when phantom detected, causing n8n chain to halt before Telegram notification
- **Problem:** Unmonitored phantom position on exchange while user is asleep/away = unlimited risk exposure
@@ -1298,7 +1312,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
- **Impact:** Protects user from unlimited risk during unavailable hours. Phantom trades are rare edge cases (oracle issues, exchange rejections).
- **Database tracking:** `status='phantom'`, `exitReason='manual'`, enables analysis of phantom frequency and patterns
31. **Flip-flop price context using wrong data (CRITICAL - Fixed Nov 14, 2025):**
33. **Flip-flop price context using wrong data (CRITICAL - Fixed Nov 14, 2025):**
- **Symptom:** Flip-flop detection showing "100% price move" when actual movement was 0.2%, allowing trades that should be blocked
- **Root Cause:** `currentPrice` parameter not available in check-risk endpoint (trade hasn't opened yet), so calculation used undefined/zero
- **Real incident:** Nov 14, 06:05 CET - SHORT allowed with 0.2% flip-flop, lost -$1.56 in 5 minutes