docs: Document closePosition retry logic bug (Common Pitfall #36)

CRITICAL BUG: Missing retry wrapper caused rate limit storm Real Incident (Nov 15, 16:49 CET): - Trade cmi0il8l30000r607l8aec701 triggered close attempt - closePosition() had NO retryWithBackoff() wrapper - Failed with 429 → Position Manager retried EVERY 2 SECONDS - 100+ close attempts exhausted Helius rate limit - On-chain TP2 filled during storm - External closure detected 8 times: $0.14 → $0.51 (compounding bug) Why This Was Missed: - placeExitOrders() got retry wrapper on Nov 14 - openPosition() still has no wrapper (less critical - runs once) - closePosition() overlooked - MOST CRITICAL because runs in monitoring loop - Position Manager executeExit() catches 429 and returns early - But monitoring continues, retries close every 2s = infinite loop The Fix: - Wrapped closePosition() placePerpOrder() with retryWithBackoff() - 8s base delay, 3 max retries (same as placeExitOrders) - Reduces RPC load by 30-50x during close operations - Container deployed 18:05 CET Nov 15 Impact: Prevents rate limit exhaustion + duplicate external closure updates Files: .github/copilot-instructions.md (added Common Pitfall #36)
2025-11-15 18:07:26 +01:00
parent 54c68b45d2
commit c6b34c45c4
1 changed files with 48 additions and 0 deletions
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -1448,6 +1448,54 @@ trade.realizedPnL += actualRealizedPnL  // NOT: result.realizedPnL from SDK
    - **Verification:** After restart, check logs for "Found 0 open trades" (not "Found 1 open trades to restore")
    - **Lesson:** status field is for classification, exitReason is for lifecycle management - both must be set on closure

+36. **closePosition() missing retry logic causes rate limit storm (CRITICAL - Fixed Nov 15, 2025):**
+    - **Symptom:** Position Manager tries to close trade, gets 429 error, retries EVERY 2 SECONDS → 100+ failed attempts → rate limit exhaustion
+    - **Root Cause:** `placeExitOrders()` has `retryWithBackoff()` wrapper (Nov 14 fix), but `closePosition()` did NOT
+    - **Real incident:** Trade cmi0il8l30000r607l8aec701 (Nov 15, 16:49 CET)
+      1. Position Manager tried to close (SL or TP trigger)
+      2. closePosition() called raw `placePerpOrder()` → 429 error
+      3. executeExit() caught 429, returned early (line 935-940)
+      4. Position Manager kept monitoring, retried close EVERY 2 seconds
+      5. Logs show 100+ "❌ Failed to close position: 429" + "⚠️ Rate limited while closing SOL-PERP"
+      6. Meanwhile: On-chain TP2 limit order filled (unaffected by SDK rate limits)
+      7. External closure detected, DB updated 8 TIMES: $0.14 → $0.20 → $0.26 → ... → $0.51
+      8. Container eventually restarted (likely from rate limit exhaustion)
+    - **Why duplicate updates:** Common Pitfall #27 fix (remove from Map before DB update) works UNLESS rate limits cause tons of retries before external closure detection
+    - **Impact:** User saw $0.51 profit in DB, $0.03 on Drift UI (8× compounding vs 1 actual fill)
+    - **Fix:** Wrapped closePosition() with retryWithBackoff() in lib/drift/orders.ts:
+    ```typescript
+    // Line ~567 (BEFORE):
+    const txSig = await driftClient.placePerpOrder(orderParams)
+    
+    // Line ~567 (AFTER):
+    const txSig = await retryWithBackoff(async () => {
+      return await driftClient.placePerpOrder(orderParams)
+    }, 3, 8000) // 8s base delay, 3 max retries (8s → 16s → 32s)
+    ```
+    - **Behavior now:** 3 SDK retries over 56s (8+16+32) + Position Manager natural retry on next monitoring cycle = robust without spam
+    - **RPC load reduction:** 30-50× fewer requests during close operations (3 retries vs 100+ attempts)
+    - **Verification:** Container restarted 18:05 CET Nov 15, code deployed
+    - **Lesson:** EVERY SDK order operation (open, close, cancel, place) MUST have retry wrapper - Position Manager monitoring creates infinite retry loop without it
+    - **Root Cause:** Phantom auto-closure sets `status='phantom'` but leaves `exitReason=NULL`
+    - **Bug:** Startup validator checks `exitReason !== null` (line 122 of init-position-manager.ts), ignores status field
+    - **Consequence:** Phantom trade with exitReason=NULL treated as "open" and restored to Position Manager
+    - **Real incident:** Nov 14 phantom trade (cmhy6xul20067nx077agh260n) caused 232% size mismatch, hundreds of false "🔴 RUNNER STOP LOSS" alerts
+    - **Fix:** When auto-closing phantom trades, MUST set exitReason:
+    ```typescript
+    // In app/api/trading/execute/route.ts (phantom detection):
+    await updateTradeExit({
+      tradeId: trade.id,
+      exitPrice: currentPrice,
+      exitReason: 'manual', // CRITICAL: Must set exitReason for cleanup
+      realizedPnL: actualPnL,
+      status: 'phantom'
+    })
+    ```
+    - **Manual cleanup:** If phantom already exists: `UPDATE "Trade" SET "exitReason" = 'manual' WHERE status = 'phantom' AND "exitReason" IS NULL`
+    - **Impact:** Without exitReason, phantom trades create ghost positions that trigger false alerts and pollute monitoring
+    - **Verification:** After restart, check logs for "Found 0 open trades" (not "Found 1 open trades to restore")
+    - **Lesson:** status field is for classification, exitReason is for lifecycle management - both must be set on closure
+
 ## File Conventions

 - **API routes:** `app/api/[feature]/[action]/route.ts` (Next.js 15 App Router)