docs: Document closePosition retry logic bug (Common Pitfall #36)

CRITICAL BUG: Missing retry wrapper caused rate limit storm

Real Incident (Nov 15, 16:49 CET):
- Trade cmi0il8l30000r607l8aec701 triggered close attempt
- closePosition() had NO retryWithBackoff() wrapper
- Failed with 429 → Position Manager retried EVERY 2 SECONDS
- 100+ close attempts exhausted Helius rate limit
- On-chain TP2 filled during storm
- External closure detected 8 times: $0.14 → $0.51 (compounding bug)

Why This Was Missed:
- placeExitOrders() got retry wrapper on Nov 14
- openPosition() still has no wrapper (less critical - runs once)
- closePosition() overlooked - MOST CRITICAL because runs in monitoring loop
- Position Manager executeExit() catches 429 and returns early
- But monitoring continues, retries close every 2s = infinite loop

The Fix:
- Wrapped closePosition() placePerpOrder() with retryWithBackoff()
- 8s base delay, 3 max retries (same as placeExitOrders)
- Reduces RPC load by 30-50x during close operations
- Container deployed 18:05 CET Nov 15

Impact: Prevents rate limit exhaustion + duplicate external closure updates

Files: .github/copilot-instructions.md (added Common Pitfall #36)
This commit is contained in:
mindesbunister
2025-11-15 18:07:26 +01:00
parent 54c68b45d2
commit c6b34c45c4

View File

@@ -1448,6 +1448,54 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
- **Verification:** After restart, check logs for "Found 0 open trades" (not "Found 1 open trades to restore")
- **Lesson:** status field is for classification, exitReason is for lifecycle management - both must be set on closure
36. **closePosition() missing retry logic causes rate limit storm (CRITICAL - Fixed Nov 15, 2025):**
- **Symptom:** Position Manager tries to close trade, gets 429 error, retries EVERY 2 SECONDS → 100+ failed attempts → rate limit exhaustion
- **Root Cause:** `placeExitOrders()` has `retryWithBackoff()` wrapper (Nov 14 fix), but `closePosition()` did NOT
- **Real incident:** Trade cmi0il8l30000r607l8aec701 (Nov 15, 16:49 CET)
1. Position Manager tried to close (SL or TP trigger)
2. closePosition() called raw `placePerpOrder()` → 429 error
3. executeExit() caught 429, returned early (line 935-940)
4. Position Manager kept monitoring, retried close EVERY 2 seconds
5. Logs show 100+ "❌ Failed to close position: 429" + "⚠️ Rate limited while closing SOL-PERP"
6. Meanwhile: On-chain TP2 limit order filled (unaffected by SDK rate limits)
7. External closure detected, DB updated 8 TIMES: $0.14 → $0.20 → $0.26 → ... → $0.51
8. Container eventually restarted (likely from rate limit exhaustion)
- **Why duplicate updates:** Common Pitfall #27 fix (remove from Map before DB update) works UNLESS rate limits cause tons of retries before external closure detection
- **Impact:** User saw $0.51 profit in DB, $0.03 on Drift UI (8× compounding vs 1 actual fill)
- **Fix:** Wrapped closePosition() with retryWithBackoff() in lib/drift/orders.ts:
```typescript
// Line ~567 (BEFORE):
const txSig = await driftClient.placePerpOrder(orderParams)
// Line ~567 (AFTER):
const txSig = await retryWithBackoff(async () => {
return await driftClient.placePerpOrder(orderParams)
}, 3, 8000) // 8s base delay, 3 max retries (8s → 16s → 32s)
```
- **Behavior now:** 3 SDK retries over 56s (8+16+32) + Position Manager natural retry on next monitoring cycle = robust without spam
- **RPC load reduction:** 30-50× fewer requests during close operations (3 retries vs 100+ attempts)
- **Verification:** Container restarted 18:05 CET Nov 15, code deployed
- **Lesson:** EVERY SDK order operation (open, close, cancel, place) MUST have retry wrapper - Position Manager monitoring creates infinite retry loop without it
- **Root Cause:** Phantom auto-closure sets `status='phantom'` but leaves `exitReason=NULL`
- **Bug:** Startup validator checks `exitReason !== null` (line 122 of init-position-manager.ts), ignores status field
- **Consequence:** Phantom trade with exitReason=NULL treated as "open" and restored to Position Manager
- **Real incident:** Nov 14 phantom trade (cmhy6xul20067nx077agh260n) caused 232% size mismatch, hundreds of false "🔴 RUNNER STOP LOSS" alerts
- **Fix:** When auto-closing phantom trades, MUST set exitReason:
```typescript
// In app/api/trading/execute/route.ts (phantom detection):
await updateTradeExit({
tradeId: trade.id,
exitPrice: currentPrice,
exitReason: 'manual', // CRITICAL: Must set exitReason for cleanup
realizedPnL: actualPnL,
status: 'phantom'
})
```
- **Manual cleanup:** If phantom already exists: `UPDATE "Trade" SET "exitReason" = 'manual' WHERE status = 'phantom' AND "exitReason" IS NULL`
- **Impact:** Without exitReason, phantom trades create ghost positions that trigger false alerts and pollute monitoring
- **Verification:** After restart, check logs for "Found 0 open trades" (not "Found 1 open trades to restore")
- **Lesson:** status field is for classification, exitReason is for lifecycle management - both must be set on closure
## File Conventions
- **API routes:** `app/api/[feature]/[action]/route.ts` (Next.js 15 App Router)