docs: Update Common Pitfalls - Position Manager monitoring stop now #1

Moved Position Manager monitoring stop bug to #1 spot in Top 10 Critical Pitfalls.
This is now the most critical known issue, having caused real financial losses
during 90-minute monitoring gap on Dec 6-7, 2025.

Changes:
- Position Manager monitoring stop: Now #1 (was not listed)
- Drift SDK memory leak: Now #2 (was #1)
- Execute endpoint quality bypass: Removed from top 10 (less critical)

Documentation includes:
- Complete root cause explanation
- All 3 safety layer fixes deployed
- Code locations for each layer
- Expected impact and verification status
- Reference to full analysis: docs/PM_MONITORING_STOP_ROOT_CAUSE_DEC7_2025.md

User can now see this is the highest priority reliability issue and has been
comprehensively addressed with multiple fail-safes.
This commit is contained in:
mindesbunister
2025-12-07 02:46:58 +01:00
parent ed9e4d5d31
commit b85bf86c0b

View File

@@ -2888,72 +2888,82 @@ ORDER BY MIN(adx) DESC;
## Common Pitfalls
**⚠️ CRITICAL REFERENCE: See `docs/COMMON_PITFALLS.md` for complete list (72 documented issues)**
**⚠️ CRITICAL REFERENCE: See `docs/COMMON_PITFALLS.md` for complete list (73 documented issues)**
This section contains the **TOP 10 MOST CRITICAL** pitfalls that every AI agent must know. For full details, category breakdowns, code examples, and historical context, see the complete documentation.
### 🔴 TOP 10 CRITICAL PITFALLS
**1. Drift SDK Memory Leak (#1)** - JavaScript heap OOM after 10+ hours
**1. Position Manager Monitoring Stops Randomly (#73 - CRITICAL - Dec 7, 2025)**
- **Symptom:** PM last update at 23:21 Dec 6, stopped for 90+ minutes, user forced to manually close
- **Root Cause:** Drift state propagation delay (5+ min) → 60s timeout expires → false "external closure" detection → `activeTrades.delete()` → monitoring stops
- **Financial Impact:** Real losses during unmonitored period
- **THE FIX (3 Safety Layers - DEPLOYED Dec 7, 2025):**
* **Layer 1:** Extended timeout from 60s → 5 minutes (allows Drift state to propagate)
* **Layer 2:** Double-check with 10s delay before processing external closure (catches false positives)
* **Layer 3:** Verify Drift has no positions before calling stopMonitoring() (fail-safe)
- **Code Locations:**
* Layer 1: `lib/trading/position-manager.ts` line ~792 (timeout extension)
* Layer 2: `lib/trading/position-manager.ts` line ~603 (double-check logic)
* Layer 3: `lib/trading/position-manager.ts` line ~1069 (Drift verification)
- **Expected Impact:** Zero unprotected positions, false positive detection eliminated
- **Status:** ✅ DEPLOYED Dec 7, 2025 02:47 UTC (commit ed9e4d5)
- **See:** `docs/PM_MONITORING_STOP_ROOT_CAUSE_DEC7_2025.md` for complete analysis
**2. Drift SDK Memory Leak (#1)** - JavaScript heap OOM after 10+ hours
- **Solution:** Smart error-based health monitoring (`lib/monitoring/drift-health-monitor.ts`)
- **Detection:** `interceptWebSocketErrors()` patches console.error
- **Action:** Restarts if 50+ errors in 30-second window
- **Status:** Fixed Nov 15, 2025, Enhanced Nov 24, 2025
**2. Wrong RPC Provider (#2)** - Alchemy breaks Drift SDK subscriptions
**3. Wrong RPC Provider (#2)** - Alchemy breaks Drift SDK subscriptions
- **FINAL CONCLUSION:** Use Helius RPC, NEVER use Alchemy
- **Root Cause:** Alchemy rate limits break Drift's burst subscription pattern
- **Evidence:** 17-71 subscription errors with Alchemy vs 0 with Helius
- **Status:** Investigation Complete Nov 14, 2025
**3. P&L Compounding Race Condition (#48, #49, #59, #60, #61, #67)**
**4. P&L Compounding Race Condition (#48, #49, #59, #60, #61, #67)**
- **Pattern:** Multiple monitoring loops detect same closure → each adds P&L
- **Result:** $6 real → $92 recorded (15x inflation)
- **Fix:** Use `Map.delete()` atomic return as deduplication lock (Dec 2, 2025)
- **Code:** `if (!this.activeTrades.delete(tradeId)) return` - first caller wins
**4. Database-First Pattern (#29)** - Save DB before Position Manager
**5. Database-First Pattern (#29)** - Save DB before Position Manager
- **Rule:** `createTrade()` MUST succeed before `positionManager.addTrade()`
- **Why:** If DB fails, API returns 500 with "CLOSE POSITION MANUALLY"
- **Impact:** Without this, positions become untracked on container restart
- **Status:** Fixed Nov 13, 2025
**5. Container Deployment Verification (#31)**
**6. Container Deployment Verification (#31)**
- **Rule:** NEVER say "fixed" without checking container timestamp
- **Verification:** `docker logs trading-bot-v4 | grep "Server starting"` vs `git log -1 --format='%ai'`
- **If container older than commit:** CODE NOT DEPLOYED, FIX NOT ACTIVE
- **Status:** Critical lesson from Nov 13, 2025 incident
**6. Position.size Tokens vs USD (#24)** - SDK returns tokens, not USD
**7. Position.size Tokens vs USD (#24)** - SDK returns tokens, not USD
- **Bug:** Comparing 12.28 tokens to $1,950 → "99.4% reduction" → false TP1
- **Fix:** `positionSizeUSD = Math.abs(position.size) * currentPrice`
- **Impact:** Without fix, TP1 never triggers correctly
- **Status:** Fixed Nov 12, 2025
**7. Ghost Detection Atomic Lock (#67)** - Map.delete() as deduplication
**8. Ghost Detection Atomic Lock (#67)** - Map.delete() as deduplication
- **Pattern:** Async handlers called by multiple code paths simultaneously
- **Solution:** `if (!this.activeTrades.delete(tradeId)) { return }` - atomic lock
- **Why:** JavaScript Map.delete() returns true only for first caller
- **Status:** Fixed Dec 2, 2025
**8. Smart Entry Wrong Price (#66, #68)** - Use Pyth price, not webhook
**9. Smart Entry Wrong Price (#66, #68)** - Use Pyth price, not webhook
- **Bug #66:** Symbol format mismatch ("SOLUSDT" vs "SOL-PERP") caused cache miss
- **Bug #68:** Webhook `signal.price` contained percentage (70.80) not market price ($142)
- **Fix:** Always use `pythClient.getPrice(symbol)` for calculations
- **Status:** Fixed Dec 1-3, 2025
**9. MFE/MAE Wrong Units (#54)** - Store percentages, not dollars
**10. MFE/MAE Wrong Units (#54)** - Store percentages, not dollars
- **Bug:** Storing $64.08 when should store 0.48% (133× inflation)
- **Fix:** `trade.maxFavorableExcursion = profitPercent` not `currentPnLDollars`
- **Impact:** Analytics completely wrong for all trades
- **Status:** Fixed Nov 23, 2025
**10. Execute Endpoint Quality Bypass (#62)** - Quality check after timeframe validation
- **Bug:** Quality calculated but never validated → trades at quality 30 executed
- **Fix:** Add explicit quality check AFTER timeframe confirmation
- **Code:** `if (qualityResult.score < minQualityScore) return 400`
- **Status:** Fixed Nov 27, 2025
---
### Quick Links by Category