docs: Add Bug #77 recurrence to TOP 10 CRITICAL PITFALLS

- Added comprehensive Bug #77 documentation as entry #2
- Root cause: handlePriceUpdate() early return when Drift not initialized
- Impact: ,000+ losses from silently failed monitoring (29+ minutes unprotected)
- Fix: Removed early return, monitoring now runs regardless of Drift state
- Verification: Test position shows price checks every 2 seconds
- Prevention: Never add early returns that silently skip critical operations
- Renumbered entries 2-10 to 3-11 to accommodate new critical pitfall
- This bug caused 'the whole time all the development we did was not working'
This commit is contained in:
mindesbunister
2025-12-13 22:53:12 +01:00
parent 01bd730b19
commit db0be03116

View File

@@ -3099,49 +3099,120 @@ This section contains the **TOP 10 MOST CRITICAL** pitfalls that every AI agent
- **Status:** ✅ DEPLOYED Dec 7, 2025 02:47 UTC (commit ed9e4d5)
- **See:** `docs/PM_MONITORING_STOP_ROOT_CAUSE_DEC7_2025.md` for complete analysis
**2. Drift SDK Memory Leak (#1)** - JavaScript heap OOM after 10+ hours
**2. Position Manager Never Actually Checks Prices - handlePriceUpdate() Early Return (#77 RECURRENCE - CRITICAL - Dec 13, 2025)**
- **Symptom:** Telegram trades "added to Position Manager" but never monitored - zero price logs, zero condition checks, silent failure for 29+ minutes
- **User Report:** "so the whole time all the development we did was not working and therefore we have lost 1000$"
- **Financial Impact:** $1,000+ losses from positions completely unmonitored despite logs saying "monitoring started"
- **Real Incident (Dec 13, 2025 20:47-21:17):**
* SOL-PERP SHORT opened via Telegram at 20:47:23
* Position Manager logs: "✅ Trade added to position manager for monitoring"
* Reality: Zero price checks, zero condition checks, zero monitoring
* User forced manual close at 21:17:10 (+$2.17 profit)
* Container restart 21:20:36 lost all evidence logs
- **Root Cause:**
* File: `lib/trading/position-manager.ts` lines 692-696 (before fix)
* Code: `if (!driftService?.isInitialized) { return }` - EARLY RETURN in handlePriceUpdate()
* Drift initializes LAZILY (only on first API call)
* Position Manager starts monitoring IMMEDIATELY after addTrade()
* Result: Pyth price monitor called handlePriceUpdate() every 2 seconds
* But function returned early EVERY TIME → monitoring loop did NOTHING
* **Silent failure:** No errors thrown, isMonitoring=true, but zero actual checks
- **Why Defensive Programming Backfired:**
* Intent: Avoid Drift errors when service not ready
* Reality: Created 29+ minute gap with zero position protection
* Monitoring loop ran but skipped ALL price/condition logic
* Health monitor showed "monitoring active" (flag-based, not functional check)
- **THE FIX (Dec 13, 2025 21:46 UTC - DEPLOYED):**
```typescript
// BEFORE (BROKEN - lines 692-696):
const driftService = getDriftService()
if (!driftService?.isInitialized) {
return // ← SILENT FAILURE - monitoring does nothing!
}
// AFTER (FIXED):
// BUG #77 FIX: Don't skip price checks if Drift not initialized
// Position Manager price checking must run even if external closure detection unavailable
const driftService = getDriftService()
// Continue to price checking logic regardless of Drift state
// External closure detection will fail gracefully if Drift unavailable
```
- **Added Observability:**
* Line 651-658: `console.log('🔍 Price check: ${symbol} @ $${price} (${count} trades)')`
* Shows monitoring loop actually running (not just flag set)
* Logs every 2 seconds when positions active
- **Verification (CONFIRMED WORKING):**
* Test position: SOL-PERP LONG opened via `/api/trading/test`
* Logs: `✅ STARTMON: Position monitoring active, isMonitoring: true`
* Logs: `🔍 Price check: SOL-PERP @ $132.29 (2 trades)` every 2 seconds
* Diagnostic endpoint: `isMonitoring: true, activeTradesCount: 2`
* **FIX VERIFIED** - Price checks running continuously
- **Code Locations:**
* handlePriceUpdate(): `lib/trading/position-manager.ts` line 651-658 (logging)
* Early return REMOVED: `lib/trading/position-manager.ts` line 685-695 (THE FIX)
* addTrade() caller: `lib/trading/position-manager.ts` line 262-312
- **Prevention Rules:**
1. NEVER add early returns in monitoring loops that silently skip critical operations
2. Defensive checks must fail LOUDLY, not silently
3. If Drift unavailable, log warning but continue price checking
4. External closure detection is SUPPLEMENTARY - price checks are PRIMARY
5. Test with Telegram trades specifically (different code path than TradingView)
6. Verify monitoring with runtime logs, not just isMonitoring flag
- **Red Flags Indicating This Bug:**
* Position opened successfully
* Logs say "added to position manager for monitoring"
* Zero `🔍 Price check` logs appearing
* Zero condition check logs (TP1/TP2/SL/trailing)
* Database configSnapshot is NULL (PM state not saved)
* User forced to manually close despite "monitoring active"
- **Git Commit:** 01bd730 "critical: FIX Bug #77 - Position Manager monitoring stopped by Drift init check" (Dec 13, 2025)
- **Deployment:** Dec 13, 2025 21:46 UTC (container trading-bot-v4)
- **Status:** ✅ FIXED AND VERIFIED - Monitoring loop now runs correctly, test position shows price checks every 2s
- **Lesson Learned:** Defensive programming that silently skips critical operations is MORE dangerous than failing loudly. The early return was added to "protect" against Drift errors but instead created a 29-minute unmonitored gap that cost $1,000+. Always fail loudly, never silently skip monitoring.
**3. Drift SDK Memory Leak (#1)** - JavaScript heap OOM after 10+ hours
- **Solution:** Smart error-based health monitoring (`lib/monitoring/drift-health-monitor.ts`)
- **Detection:** `interceptWebSocketErrors()` patches console.error
- **Action:** Restarts if 50+ errors in 30-second window
- **Status:** Fixed Nov 15, 2025, Enhanced Nov 24, 2025
**3. Wrong RPC Provider (#2)** - Alchemy breaks Drift SDK subscriptions
**4. Wrong RPC Provider (#2)** - Alchemy breaks Drift SDK subscriptions
- **FINAL CONCLUSION:** Use Helius RPC, NEVER use Alchemy
- **Root Cause:** Alchemy rate limits break Drift's burst subscription pattern
- **Evidence:** 17-71 subscription errors with Alchemy vs 0 with Helius
- **Status:** Investigation Complete Nov 14, 2025
**4. P&L Compounding Race Condition (#48, #49, #59, #60, #61, #67)**
**5. P&L Compounding Race Condition (#48, #49, #59, #60, #61, #67)**
- **Pattern:** Multiple monitoring loops detect same closure → each adds P&L
- **Result:** $6 real → $92 recorded (15x inflation)
- **Fix:** Use `Map.delete()` atomic return as deduplication lock (Dec 2, 2025)
- **Code:** `if (!this.activeTrades.delete(tradeId)) return` - first caller wins
**5. Database-First Pattern (#29)** - Save DB before Position Manager
**6. Database-First Pattern (#29)** - Save DB before Position Manager
- **Rule:** `createTrade()` MUST succeed before `positionManager.addTrade()`
- **Why:** If DB fails, API returns 500 with "CLOSE POSITION MANUALLY"
- **Impact:** Without this, positions become untracked on container restart
- **Status:** Fixed Nov 13, 2025
**6. Container Deployment Verification (#31)**
**7. Container Deployment Verification (#31)**
- **Rule:** NEVER say "fixed" without checking container timestamp
- **Verification:** `docker logs trading-bot-v4 | grep "Server starting"` vs `git log -1 --format='%ai'`
- **If container older than commit:** CODE NOT DEPLOYED, FIX NOT ACTIVE
- **Status:** Critical lesson from Nov 13, 2025 incident
**7. Position.size Tokens vs USD (#24)** - SDK returns tokens, not USD
**8. Position.size Tokens vs USD (#24)** - SDK returns tokens, not USD
- **Bug:** Comparing 12.28 tokens to $1,950 → "99.4% reduction" → false TP1
- **Fix:** `positionSizeUSD = Math.abs(position.size) * currentPrice`
- **Impact:** Without fix, TP1 never triggers correctly
- **Status:** Fixed Nov 12, 2025
**8. Ghost Detection Atomic Lock (#67)** - Map.delete() as deduplication
**9. Ghost Detection Atomic Lock (#67)** - Map.delete() as deduplication
- **Pattern:** Async handlers called by multiple code paths simultaneously
- **Solution:** `if (!this.activeTrades.delete(tradeId)) { return }` - atomic lock
- **Why:** JavaScript Map.delete() returns true only for first caller
- **Status:** Fixed Dec 2, 2025
**9. Wrong Price in usdToBase() (#81 - ROOT CAUSE - Dec 10, 2025)** - CAUSED $1,000+ LOSSES
**10. Wrong Price in usdToBase() (#81 - ROOT CAUSE - Dec 10, 2025)** - CAUSED $1,000+ LOSSES
- **Bug:** Changed from `usdToBase(usd, price)` to `usdToBase(usd)` using entryPrice for ALL orders
- **Impact:** Wrong token quantities = Drift rejects orders = NULL database signatures = NO risk management
- **Example:** $8,000 at TP1 $141.20 needs 56.66 SOL, but code calculated 57.14 SOL (used $140 entry price)
@@ -3149,7 +3220,7 @@ This section contains the **TOP 10 MOST CRITICAL** pitfalls that every AI agent
- **Original commit:** 4cc294b (Oct 26, 2025) - "All 3 exit orders placed successfully on-chain" (100% success)
- **Status:** ✅ FIXED Dec 10, 2025 14:31 CET (commit 55d780c)
**10. Drift State Verifier Kills Active Positions (#82 - CRITICAL - Dec 10, 2025)** - Automatic retry close on wrong positions
**11. Drift State Verifier Kills Active Positions (#82 - CRITICAL - Dec 10, 2025)** - Automatic retry close on wrong positions
- **Bug:** Verifier detected 6 old closed positions (150-1064 min ago), all showed "15.45 tokens" (user's CURRENT trade!), automatically called closePosition()
- **Impact:** User's manual trade HAD working SL, then Telegram alert "⚠️ Retry close attempted automatically", SL orders immediately disappeared
- **Root Cause:** Lines 279-283 call closePosition() for every mismatch, no verification if Drift position is OLD (should close) vs NEW (active trade)