docs: Update Common Pitfalls - Position Manager monitoring stop now #1
Moved Position Manager monitoring stop bug to #1 spot in Top 10 Critical Pitfalls. This is now the most critical known issue, having caused real financial losses during 90-minute monitoring gap on Dec 6-7, 2025. Changes: - Position Manager monitoring stop: Now #1 (was not listed) - Drift SDK memory leak: Now #2 (was #1) - Execute endpoint quality bypass: Removed from top 10 (less critical) Documentation includes: - Complete root cause explanation - All 3 safety layer fixes deployed - Code locations for each layer - Expected impact and verification status - Reference to full analysis: docs/PM_MONITORING_STOP_ROOT_CAUSE_DEC7_2025.md User can now see this is the highest priority reliability issue and has been comprehensively addressed with multiple fail-safes.
This commit is contained in:
42
.github/copilot-instructions.md
vendored
42
.github/copilot-instructions.md
vendored
@@ -2888,72 +2888,82 @@ ORDER BY MIN(adx) DESC;
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
**⚠️ CRITICAL REFERENCE: See `docs/COMMON_PITFALLS.md` for complete list (72 documented issues)**
|
||||
**⚠️ CRITICAL REFERENCE: See `docs/COMMON_PITFALLS.md` for complete list (73 documented issues)**
|
||||
|
||||
This section contains the **TOP 10 MOST CRITICAL** pitfalls that every AI agent must know. For full details, category breakdowns, code examples, and historical context, see the complete documentation.
|
||||
|
||||
### 🔴 TOP 10 CRITICAL PITFALLS
|
||||
|
||||
**1. Drift SDK Memory Leak (#1)** - JavaScript heap OOM after 10+ hours
|
||||
**1. Position Manager Monitoring Stops Randomly (#73 - CRITICAL - Dec 7, 2025)**
|
||||
- **Symptom:** PM last update at 23:21 Dec 6, stopped for 90+ minutes, user forced to manually close
|
||||
- **Root Cause:** Drift state propagation delay (5+ min) → 60s timeout expires → false "external closure" detection → `activeTrades.delete()` → monitoring stops
|
||||
- **Financial Impact:** Real losses during unmonitored period
|
||||
- **THE FIX (3 Safety Layers - DEPLOYED Dec 7, 2025):**
|
||||
* **Layer 1:** Extended timeout from 60s → 5 minutes (allows Drift state to propagate)
|
||||
* **Layer 2:** Double-check with 10s delay before processing external closure (catches false positives)
|
||||
* **Layer 3:** Verify Drift has no positions before calling stopMonitoring() (fail-safe)
|
||||
- **Code Locations:**
|
||||
* Layer 1: `lib/trading/position-manager.ts` line ~792 (timeout extension)
|
||||
* Layer 2: `lib/trading/position-manager.ts` line ~603 (double-check logic)
|
||||
* Layer 3: `lib/trading/position-manager.ts` line ~1069 (Drift verification)
|
||||
- **Expected Impact:** Zero unprotected positions, false positive detection eliminated
|
||||
- **Status:** ✅ DEPLOYED Dec 7, 2025 02:47 UTC (commit ed9e4d5)
|
||||
- **See:** `docs/PM_MONITORING_STOP_ROOT_CAUSE_DEC7_2025.md` for complete analysis
|
||||
|
||||
**2. Drift SDK Memory Leak (#1)** - JavaScript heap OOM after 10+ hours
|
||||
- **Solution:** Smart error-based health monitoring (`lib/monitoring/drift-health-monitor.ts`)
|
||||
- **Detection:** `interceptWebSocketErrors()` patches console.error
|
||||
- **Action:** Restarts if 50+ errors in 30-second window
|
||||
- **Status:** Fixed Nov 15, 2025, Enhanced Nov 24, 2025
|
||||
|
||||
**2. Wrong RPC Provider (#2)** - Alchemy breaks Drift SDK subscriptions
|
||||
**3. Wrong RPC Provider (#2)** - Alchemy breaks Drift SDK subscriptions
|
||||
- **FINAL CONCLUSION:** Use Helius RPC, NEVER use Alchemy
|
||||
- **Root Cause:** Alchemy rate limits break Drift's burst subscription pattern
|
||||
- **Evidence:** 17-71 subscription errors with Alchemy vs 0 with Helius
|
||||
- **Status:** Investigation Complete Nov 14, 2025
|
||||
|
||||
**3. P&L Compounding Race Condition (#48, #49, #59, #60, #61, #67)**
|
||||
**4. P&L Compounding Race Condition (#48, #49, #59, #60, #61, #67)**
|
||||
- **Pattern:** Multiple monitoring loops detect same closure → each adds P&L
|
||||
- **Result:** $6 real → $92 recorded (15x inflation)
|
||||
- **Fix:** Use `Map.delete()` atomic return as deduplication lock (Dec 2, 2025)
|
||||
- **Code:** `if (!this.activeTrades.delete(tradeId)) return` - first caller wins
|
||||
|
||||
**4. Database-First Pattern (#29)** - Save DB before Position Manager
|
||||
**5. Database-First Pattern (#29)** - Save DB before Position Manager
|
||||
- **Rule:** `createTrade()` MUST succeed before `positionManager.addTrade()`
|
||||
- **Why:** If DB fails, API returns 500 with "CLOSE POSITION MANUALLY"
|
||||
- **Impact:** Without this, positions become untracked on container restart
|
||||
- **Status:** Fixed Nov 13, 2025
|
||||
|
||||
**5. Container Deployment Verification (#31)**
|
||||
**6. Container Deployment Verification (#31)**
|
||||
- **Rule:** NEVER say "fixed" without checking container timestamp
|
||||
- **Verification:** `docker logs trading-bot-v4 | grep "Server starting"` vs `git log -1 --format='%ai'`
|
||||
- **If container older than commit:** CODE NOT DEPLOYED, FIX NOT ACTIVE
|
||||
- **Status:** Critical lesson from Nov 13, 2025 incident
|
||||
|
||||
**6. Position.size Tokens vs USD (#24)** - SDK returns tokens, not USD
|
||||
**7. Position.size Tokens vs USD (#24)** - SDK returns tokens, not USD
|
||||
- **Bug:** Comparing 12.28 tokens to $1,950 → "99.4% reduction" → false TP1
|
||||
- **Fix:** `positionSizeUSD = Math.abs(position.size) * currentPrice`
|
||||
- **Impact:** Without fix, TP1 never triggers correctly
|
||||
- **Status:** Fixed Nov 12, 2025
|
||||
|
||||
**7. Ghost Detection Atomic Lock (#67)** - Map.delete() as deduplication
|
||||
**8. Ghost Detection Atomic Lock (#67)** - Map.delete() as deduplication
|
||||
- **Pattern:** Async handlers called by multiple code paths simultaneously
|
||||
- **Solution:** `if (!this.activeTrades.delete(tradeId)) { return }` - atomic lock
|
||||
- **Why:** JavaScript Map.delete() returns true only for first caller
|
||||
- **Status:** Fixed Dec 2, 2025
|
||||
|
||||
**8. Smart Entry Wrong Price (#66, #68)** - Use Pyth price, not webhook
|
||||
**9. Smart Entry Wrong Price (#66, #68)** - Use Pyth price, not webhook
|
||||
- **Bug #66:** Symbol format mismatch ("SOLUSDT" vs "SOL-PERP") caused cache miss
|
||||
- **Bug #68:** Webhook `signal.price` contained percentage (70.80) not market price ($142)
|
||||
- **Fix:** Always use `pythClient.getPrice(symbol)` for calculations
|
||||
- **Status:** Fixed Dec 1-3, 2025
|
||||
|
||||
**9. MFE/MAE Wrong Units (#54)** - Store percentages, not dollars
|
||||
**10. MFE/MAE Wrong Units (#54)** - Store percentages, not dollars
|
||||
- **Bug:** Storing $64.08 when should store 0.48% (133× inflation)
|
||||
- **Fix:** `trade.maxFavorableExcursion = profitPercent` not `currentPnLDollars`
|
||||
- **Impact:** Analytics completely wrong for all trades
|
||||
- **Status:** Fixed Nov 23, 2025
|
||||
|
||||
**10. Execute Endpoint Quality Bypass (#62)** - Quality check after timeframe validation
|
||||
- **Bug:** Quality calculated but never validated → trades at quality 30 executed
|
||||
- **Fix:** Add explicit quality check AFTER timeframe confirmation
|
||||
- **Code:** `if (qualityResult.score < minQualityScore) return 400`
|
||||
- **Status:** Fixed Nov 27, 2025
|
||||
|
||||
---
|
||||
|
||||
### Quick Links by Category
|
||||
|
||||
Reference in New Issue
Block a user