docs: Add Drift SDK memory leak to Common Pitfalls #1
- Documented memory leak fix from Nov 15, 2025 - Symptoms: Heap grows to 4GB+, Telegram timeouts, OOM crash after 10+ hours - Root cause: WebSocket subscription accumulation in Drift SDK - Solution: Automatic reconnection every 4 hours - Renumbered all subsequent pitfalls (2-33) - Added monitoring guidance and manual control endpoint info
This commit is contained in:
78
.github/copilot-instructions.md
vendored
78
.github/copilot-instructions.md
vendored
@@ -997,7 +997,21 @@ ORDER BY MIN(adx) DESC;
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
1. **WRONG RPC PROVIDER (CRITICAL - CATASTROPHIC SYSTEM FAILURE):**
|
||||
1. **DRIFT SDK MEMORY LEAK (CRITICAL - Fixed Nov 15, 2025):**
|
||||
- **Symptom:** JavaScript heap out of memory after 10+ hours runtime, Telegram bot timeouts (60s)
|
||||
- **Root Cause:** Drift SDK accumulates WebSocket subscriptions over time without cleanup
|
||||
- **Manifestation:** Thousands of `accountUnsubscribe error: readyState was 2 (CLOSING)` in logs
|
||||
- **Heap Growth:** Normal ~200MB → 4GB+ after 10 hours → OOM crash
|
||||
- **Solution:** Automatic reconnection every 4 hours (`lib/drift/client.ts`)
|
||||
- **Implementation:**
|
||||
* `scheduleReconnection()` - Sets 4-hour timer after initialization
|
||||
* `reconnect()` - Unsubscribes, resets state, reinitializes Drift client
|
||||
* Timer cleared in `disconnect()` to prevent orphaned timers
|
||||
- **Manual Control:** `/api/drift/reconnect` endpoint (POST with auth, GET for status)
|
||||
- **Impact:** System now self-healing, can run indefinitely without manual restarts
|
||||
- **Monitoring:** Watch for scheduled reconnection logs: `🔄 Scheduled reconnection...`
|
||||
|
||||
2. **WRONG RPC PROVIDER (CRITICAL - CATASTROPHIC SYSTEM FAILURE):**
|
||||
- **FINAL CONCLUSION Nov 14, 2025 (INVESTIGATION COMPLETE):** Helius is the ONLY reliable RPC provider for Drift SDK
|
||||
- **Root Cause CONFIRMED:** Alchemy's rate limiting breaks Drift SDK's burst subscription pattern during initialization
|
||||
- **Definitive Proof (Nov 14, 21:14 CET):**
|
||||
@@ -1040,81 +1054,81 @@ ORDER BY MIN(adx) DESC;
|
||||
- **Investigation Closed:** This is DEFINITIVE. Use Helius. Do not use Alchemy.
|
||||
- **Test Yourself:** `curl 'http://localhost:3001/api/testing/drift-init?rpc=alchemy'`
|
||||
|
||||
2. **Prisma not generated in Docker:** Must run `npx prisma generate` in Dockerfile BEFORE `npm run build`
|
||||
3. **Prisma not generated in Docker:** Must run `npx prisma generate` in Dockerfile BEFORE `npm run build`
|
||||
|
||||
3. **Wrong DATABASE_URL:** Container runtime needs `trading-bot-postgres`, Prisma CLI from host needs `localhost:5432`
|
||||
4. **Wrong DATABASE_URL:** Container runtime needs `trading-bot-postgres`, Prisma CLI from host needs `localhost:5432`
|
||||
|
||||
4. **Symbol format mismatch:** Always normalize with `normalizeTradingViewSymbol()` before calling Drift (applies to ALL endpoints including `/api/trading/close`)
|
||||
5. **Symbol format mismatch:** Always normalize with `normalizeTradingViewSymbol()` before calling Drift (applies to ALL endpoints including `/api/trading/close`)
|
||||
|
||||
5. **Missing reduce-only flag:** Exit orders without `reduceOnly: true` can accidentally open new positions
|
||||
6. **Missing reduce-only flag:** Exit orders without `reduceOnly: true` can accidentally open new positions
|
||||
|
||||
6. **Singleton violations:** Creating multiple DriftClient or Position Manager instances causes connection/state issues
|
||||
7. **Singleton violations:** Creating multiple DriftClient or Position Manager instances causes connection/state issues
|
||||
|
||||
7. **Type errors with Prisma:** The Trade type from Prisma is only available AFTER `npx prisma generate` - use explicit types or `// @ts-ignore` carefully
|
||||
8. **Type errors with Prisma:** The Trade type from Prisma is only available AFTER `npx prisma generate` - use explicit types or `// @ts-ignore` carefully
|
||||
|
||||
8. **Quality score duplication:** Signal quality calculation exists in BOTH `check-risk` and `execute` endpoints - keep logic synchronized
|
||||
9. **Quality score duplication:** Signal quality calculation exists in BOTH `check-risk` and `execute` endpoints - keep logic synchronized
|
||||
|
||||
9. **TP2-as-Runner configuration:**
|
||||
10. **TP2-as-Runner configuration:**
|
||||
- `takeProfit2SizePercent: 0` means "TP2 activates trailing stop, no position close"
|
||||
- This creates runner of remaining % after TP1 (default 25%, configurable via TAKE_PROFIT_1_SIZE_PERCENT)
|
||||
- `TAKE_PROFIT_2_PERCENT=0.7` sets TP2 trigger price, `TAKE_PROFIT_2_SIZE_PERCENT` should be 0
|
||||
- Settings UI correctly shows "TP2 activates trailing stop" with dynamic runner % calculation
|
||||
|
||||
9. **P&L calculation CRITICAL:** Use actual entry vs exit price calculation, not SDK values:
|
||||
11. **P&L calculation CRITICAL:** Use actual entry vs exit price calculation, not SDK values:
|
||||
```typescript
|
||||
const profitPercent = this.calculateProfitPercent(trade.entryPrice, exitPrice, trade.direction)
|
||||
const actualRealizedPnL = (closedSizeUSD * profitPercent) / 100
|
||||
trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
|
||||
```
|
||||
|
||||
10. **Transaction confirmation CRITICAL:** Both `openPosition()` AND `closePosition()` MUST call `connection.confirmTransaction()` after `placePerpOrder()`. Without this, the SDK returns transaction signatures that aren't confirmed on-chain, causing "phantom trades" or "phantom closes". Always check `confirmation.value.err` before proceeding.
|
||||
12. **Transaction confirmation CRITICAL:** Both `openPosition()` AND `closePosition()` MUST call `connection.confirmTransaction()` after `placePerpOrder()`. Without this, the SDK returns transaction signatures that aren't confirmed on-chain, causing "phantom trades" or "phantom closes". Always check `confirmation.value.err` before proceeding.
|
||||
|
||||
11. **Execution order matters:** When creating trades via API endpoints, the order MUST be:
|
||||
13. **Execution order matters:** When creating trades via API endpoints, the order MUST be:
|
||||
1. Open position + place exit orders
|
||||
2. Save to database (`createTrade()`)
|
||||
3. Add to Position Manager (`positionManager.addTrade()`)
|
||||
|
||||
If Position Manager is added before database save, race conditions occur where monitoring checks before the trade exists in DB.
|
||||
|
||||
12. **New trade grace period:** Position Manager skips "external closure" detection for trades <30 seconds old because Drift positions take 5-10 seconds to propagate after opening. Without this grace period, new positions are immediately detected as "closed externally" and cancelled.
|
||||
14. **New trade grace period:** Position Manager skips "external closure" detection for trades <30 seconds old because Drift positions take 5-10 seconds to propagate after opening. Without this grace period, new positions are immediately detected as "closed externally" and cancelled.
|
||||
|
||||
13. **Drift minimum position sizes:** Actual minimums differ from documentation:
|
||||
15. **Drift minimum position sizes:** Actual minimums differ from documentation:
|
||||
- SOL-PERP: 0.1 SOL (~$5-15 depending on price)
|
||||
- ETH-PERP: 0.01 ETH (~$38-40 at $4000/ETH)
|
||||
- BTC-PERP: 0.0001 BTC (~$10-12 at $100k/BTC)
|
||||
|
||||
Always calculate: `minOrderSize × currentPrice` must exceed Drift's $4 minimum. Add buffer for price movement.
|
||||
|
||||
14. **Exit reason detection bug:** Position Manager was using current price to determine exit reason, but on-chain orders filled at a DIFFERENT price in the past. Now uses `trade.tp1Hit` / `trade.tp2Hit` flags and realized P&L to correctly identify whether TP1, TP2, or SL triggered. Prevents profitable trades being mislabeled as "SL" exits.
|
||||
16. **Exit reason detection bug:** Position Manager was using current price to determine exit reason, but on-chain orders filled at a DIFFERENT price in the past. Now uses `trade.tp1Hit` / `trade.tp2Hit` flags and realized P&L to correctly identify whether TP1, TP2, or SL triggered. Prevents profitable trades being mislabeled as "SL" exits.
|
||||
|
||||
15. **Per-symbol cooldown:** Cooldown period is per-symbol, NOT global. ETH trade at 10:00 does NOT block SOL trade at 10:01. Each coin (SOL/ETH/BTC) has independent cooldown timer to avoid missing opportunities on different assets.
|
||||
17. **Per-symbol cooldown:** Cooldown period is per-symbol, NOT global. ETH trade at 10:00 does NOT block SOL trade at 10:01. Each coin (SOL/ETH/BTC) has independent cooldown timer to avoid missing opportunities on different assets.
|
||||
|
||||
16. **Timeframe-aware scoring crucial:** Signal quality thresholds MUST adjust for 5min vs higher timeframes:
|
||||
18. **Timeframe-aware scoring crucial:** Signal quality thresholds MUST adjust for 5min vs higher timeframes:
|
||||
- 5min charts naturally have lower ADX (12-22 healthy) and ATR (0.2-0.7% healthy) than daily charts
|
||||
- Without timeframe awareness, valid 5min breakouts get blocked as "low quality"
|
||||
- Anti-chop filter applies -20 points for extreme sideways regardless of timeframe
|
||||
- Always pass `timeframe` parameter from TradingView alerts to `scoreSignalQuality()`
|
||||
|
||||
17. **Price position chasing causes flip-flops:** Opening longs at 90%+ range or shorts at <10% range reliably loses money:
|
||||
19. **Price position chasing causes flip-flops:** Opening longs at 90%+ range or shorts at <10% range reliably loses money:
|
||||
- Database analysis showed overnight flip-flop losses all had price position 9-94% (chasing extremes)
|
||||
- These trades had valid ADX (16-18) but entered at worst possible time
|
||||
- Quality scoring now penalizes -15 to -30 points for range extremes
|
||||
- Prevents rapid reversals when price is already overextended
|
||||
|
||||
18. **TradingView ADX minimum for 5min:** Set ADX filter to 15 (not 20+) in TradingView alerts for 5min charts:
|
||||
20. **TradingView ADX minimum for 5min:** Set ADX filter to 15 (not 20+) in TradingView alerts for 5min charts:
|
||||
- Higher timeframes can use ADX 20+ for strong trends
|
||||
- 5min charts need lower threshold to catch valid breakouts
|
||||
- Bot's quality scoring provides second-layer filtering with context-aware metrics
|
||||
- Two-stage filtering (TradingView + bot) prevents both overtrading and missing valid signals
|
||||
|
||||
19. **Prisma Decimal type handling:** Raw SQL queries return Prisma `Decimal` objects, not plain numbers:
|
||||
21. **Prisma Decimal type handling:** Raw SQL queries return Prisma `Decimal` objects, not plain numbers:
|
||||
- Use `any` type for numeric fields in `$queryRaw` results: `total_pnl: any`
|
||||
- Convert with `Number()` before returning to frontend: `totalPnL: Number(stat.total_pnl) || 0`
|
||||
- Frontend uses `.toFixed()` which doesn't exist on Decimal objects
|
||||
- Applies to all aggregations: SUM(), AVG(), ROUND() - all return Decimal types
|
||||
- Example: `/api/analytics/version-comparison` converts all numeric fields
|
||||
|
||||
20. **ATR-based trailing stop implementation (Nov 11, 2025):** Runner system was using FIXED 0.3% trailing, causing immediate stops:
|
||||
22. **ATR-based trailing stop implementation (Nov 11, 2025):** Runner system was using FIXED 0.3% trailing, causing immediate stops:
|
||||
- **Problem:** At $168 SOL, 0.3% = $0.50 wiggle room. Trades with +7-9% MFE exited for losses.
|
||||
- **Fix:** `trailingDistancePercent = (atrAtEntry / currentPrice * 100) × trailingStopAtrMultiplier`
|
||||
- **Config:** `TRAILING_STOP_ATR_MULTIPLIER=1.5`, `MIN=0.25%`, `MAX=0.9%`, `ACTIVATION=0.5%`
|
||||
@@ -1124,14 +1138,14 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
|
||||
- **ActiveTrade interface:** Must include `atrAtEntry?: number` field for calculation
|
||||
- See `ATR_TRAILING_STOP_FIX.md` for full details and database analysis
|
||||
|
||||
21. **CreateTradeParams interface sync:** When adding new database fields to Trade model, MUST update `CreateTradeParams` interface in `lib/database/trades.ts`:
|
||||
23. **CreateTradeParams interface sync:** When adding new database fields to Trade model, MUST update `CreateTradeParams` interface in `lib/database/trades.ts`:
|
||||
- Interface defines what parameters `createTrade()` accepts
|
||||
- Must add new field to interface (e.g., `indicatorVersion?: string`)
|
||||
- Must add field to Prisma create data object in `createTrade()` function
|
||||
- TypeScript build will fail if endpoint passes field not in interface
|
||||
- Example: indicatorVersion tracking required 3-file update (execute route.ts, CreateTradeParams interface, createTrade function)
|
||||
|
||||
22. **Position.size tokens vs USD bug (CRITICAL - Fixed Nov 12, 2025):**
|
||||
24. **Position.size tokens vs USD bug (CRITICAL - Fixed Nov 12, 2025):**
|
||||
- **Symptom:** Position Manager detects false TP1 hits, moves SL to breakeven prematurely
|
||||
- **Root Cause:** `lib/drift/client.ts` returns `position.size` as BASE ASSET TOKENS (12.28 SOL), not USD ($1,950)
|
||||
- **Bug:** Comparing tokens (12.28) directly to USD ($1,950) → 12.28 < 1,950 × 0.95 = "99.4% reduction" → FALSE TP1!
|
||||
@@ -1149,7 +1163,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
|
||||
- **Where it matters:** Position Manager, any code querying Drift positions
|
||||
- **Database evidence:** Trade showed `tp1Hit: true` when 100% still open, `slMovedToBreakeven: true` prematurely
|
||||
|
||||
23. **Leverage display showing global config instead of symbol-specific (Fixed Nov 12, 2025):**
|
||||
25. **Leverage display showing global config instead of symbol-specific (Fixed Nov 12, 2025):**
|
||||
- **Symptom:** Telegram notifications showing "⚡ Leverage: 10x" when actual position uses 15x or 20x
|
||||
- **Root Cause:** API response returning `config.leverage` (global default) instead of symbol-specific value
|
||||
- **Fix:** Use actual leverage from `getPositionSizeForSymbol()`:
|
||||
@@ -1163,13 +1177,13 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
|
||||
- **Impact:** Misleading notifications, user confusion about actual position risk
|
||||
- **Hierarchy:** Per-symbol ENV (SOLANA_LEVERAGE) → Market config → Global ENV (LEVERAGE) → Defaults
|
||||
|
||||
24. **Indicator version tracking (Nov 12, 2025+):**
|
||||
26. **Indicator version tracking (Nov 12, 2025+):**
|
||||
- Database field `indicatorVersion` tracks which TradingView strategy generated the signal
|
||||
- **v5:** Buy/Sell Signal strategy (pre-Nov 12)
|
||||
- **v6:** HalfTrend + BarColor strategy (Nov 12+)
|
||||
- Used for performance comparison between strategies
|
||||
|
||||
26. **External closure duplicate updates bug (CRITICAL - Fixed Nov 12, 2025):**
|
||||
27. **External closure duplicate updates bug (CRITICAL - Fixed Nov 12, 2025):**
|
||||
- **Symptom:** Trades showing 7-8x larger losses than actual ($58 loss when Drift shows $7 loss)
|
||||
- **Root Cause:** Position Manager monitoring loop re-processes external closures multiple times before trade removed from activeTrades Map
|
||||
- **Bug sequence:**
|
||||
@@ -1201,7 +1215,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
|
||||
- Must update `CreateTradeParams` interface when adding new database fields (see pitfall #21)
|
||||
- Analytics endpoint `/api/analytics/version-comparison` compares v5 vs v6 performance
|
||||
|
||||
25. **Signal quality threshold adjustment (Nov 12, 2025):**
|
||||
28. **Signal quality threshold adjustment (Nov 12, 2025):**
|
||||
- **Lowered from 65 → 60** based on data analysis of 161 trades
|
||||
- **Reason:** Score 60-64 tier outperformed higher scores:
|
||||
- 60-64: 2 trades, +$45.78 total, 100% WR, +$22.89 avg
|
||||
@@ -1213,7 +1227,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
|
||||
- **Risk:** Small sample size (2 trades) could be outliers, but downside limited
|
||||
- SQL analysis showed clear pattern: stricter filtering was blocking profitable setups
|
||||
|
||||
27. **Database-First Pattern (CRITICAL - Fixed Nov 13, 2025):**
|
||||
29. **Database-First Pattern (CRITICAL - Fixed Nov 13, 2025):**
|
||||
- **Symptom:** Positions opened on Drift with NO database record, NO Position Manager tracking, NO TP/SL protection
|
||||
- **Root Cause:** Execute endpoint saved to database AFTER adding to Position Manager, with silent error catch
|
||||
- **Bug sequence:**
|
||||
@@ -1246,7 +1260,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
|
||||
- **Documentation:** See `CRITICAL_INCIDENT_UNPROTECTED_POSITION.md` for full incident report
|
||||
- **Rule:** Database persistence ALWAYS comes before in-memory state updates
|
||||
|
||||
28. **DNS retry logic (Nov 13, 2025):**
|
||||
30. **DNS retry logic (Nov 13, 2025):**
|
||||
- **Problem:** Trading bot fails with "fetch failed" errors when DNS resolution temporarily fails for `mainnet.helius-rpc.com`
|
||||
- **Impact:** n8n workflow failures, missed trades, container restart failures
|
||||
- **Root Cause:** `EAI_AGAIN` errors are transient DNS issues that resolve in seconds, but bot treated them as permanent failures
|
||||
@@ -1264,7 +1278,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
|
||||
- **Note:** DNS retries use 2s delays (fast recovery), rate limit retries use 5s delays (RPC cooldown)
|
||||
- **Documentation:** See `docs/DNS_RETRY_LOGIC.md` for monitoring queries and metrics
|
||||
|
||||
29. **Declaring fixes "working" before deployment (CRITICAL - Nov 13, 2025):**
|
||||
31. **Declaring fixes "working" before deployment (CRITICAL - Nov 13, 2025):**
|
||||
- **Symptom:** AI says "position is protected" or "fix is deployed" when container still running old code
|
||||
- **Root Cause:** Conflating "code committed to git" with "code running in production"
|
||||
- **Real Incident:** Database-first fix committed 15:56, declared "working" at 19:42, but container started 15:06 (old code)
|
||||
@@ -1281,7 +1295,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
|
||||
- **Impact:** This is a REAL MONEY system - premature declarations cause financial losses
|
||||
- **Documentation:** Added mandatory deployment verification to VERIFICATION MANDATE section
|
||||
|
||||
30. **Phantom trade notification workflow breaks (Nov 14, 2025):**
|
||||
32. **Phantom trade notification workflow breaks (Nov 14, 2025):**
|
||||
- **Symptom:** Phantom trade detected, position opened on Drift, but n8n workflow stops with HTTP 500 error. User NOT notified.
|
||||
- **Root Cause:** Execute endpoint returned HTTP 500 when phantom detected, causing n8n chain to halt before Telegram notification
|
||||
- **Problem:** Unmonitored phantom position on exchange while user is asleep/away = unlimited risk exposure
|
||||
@@ -1298,7 +1312,7 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
|
||||
- **Impact:** Protects user from unlimited risk during unavailable hours. Phantom trades are rare edge cases (oracle issues, exchange rejections).
|
||||
- **Database tracking:** `status='phantom'`, `exitReason='manual'`, enables analysis of phantom frequency and patterns
|
||||
|
||||
31. **Flip-flop price context using wrong data (CRITICAL - Fixed Nov 14, 2025):**
|
||||
33. **Flip-flop price context using wrong data (CRITICAL - Fixed Nov 14, 2025):**
|
||||
- **Symptom:** Flip-flop detection showing "100% price move" when actual movement was 0.2%, allowing trades that should be blocked
|
||||
- **Root Cause:** `currentPrice` parameter not available in check-risk endpoint (trade hasn't opened yet), so calculation used undefined/zero
|
||||
- **Real incident:** Nov 14, 06:05 CET - SHORT allowed with 0.2% flip-flop, lost -$1.56 in 5 minutes
|
||||
|
||||
Reference in New Issue
Block a user