docs: Document ghost position death spiral fix as Common Pitfall #40

Added comprehensive documentation for 3-layer ghost prevention system:
- Root cause analysis (validation skipped during rate limiting)
- Death spiral explanation (ghosts → rate limits → skipped validation)
- User requirement context (must be fully autonomous)
- Real incident details (Nov 15, 2025)
- Complete solution with code examples for all 3 layers
- Impact and verification notes
- Key lesson: validation logic must never skip during errors

Files changed:
- .github/copilot-instructions.md (added Pitfall #40)
mindesbunister
2025-11-15 23:52:39 +01:00
parent 4779a9f732
commit bbab693cc1


@@ -1672,6 +1672,66 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
- **Alternative solution (NOT used):** Copy .env during Docker build with `COPY --chown=nextjs:nodejs`, but this breaks runtime config updates
- **Lesson:** Docker volume mounts retain host ownership - must plan for writability by setting host file ownership to match container user UID
40. **Ghost position death spiral from skipped validation (CRITICAL - Fixed Nov 15, 2025):**
- **Symptom:** Telegram /status shows 2 open positions while the database shows all positions closed, accompanied by massive rate limit storms (100+ RPC calls/minute)
- **Root Cause:** Periodic validation (every 5min) was SKIPPED whenever the Drift service was rate-limited: `⏳ Drift service not ready, skipping validation`
- **Death Spiral:** Ghosts → rate limits → validation skipped → more rate limits → more ghosts
- **Impact:** System unusable, requires manual container restart, user can't be away from laptop
- **User Requirement:** "bot has to work all the time especially when i am not on my laptop" - MUST be fully autonomous
- **Real Incident (Nov 15, 2025):**
* Position Manager tracking 2 ghost positions
* Both positions closed on Drift but still in memory
* Trying to close non-existent positions every 2 seconds
* Rate limit exhaustion prevented validation from running
* Only solution was container restart (not autonomous)
- **Solution: 3-layer protection system**
```typescript
// LAYER 1: Database-based age check (doesn't require RPC)
private async cleanupStalePositions(): Promise<void> {
  const now = Date.now()
  const sixHoursAgo = now - (6 * 60 * 60 * 1000)
  for (const [tradeId, trade] of this.activeTrades) {
    if (trade.entryTime < sixHoursAgo) {
      // Age in hours, used only for the log line below
      const hours = ((now - trade.entryTime) / (60 * 60 * 1000)).toFixed(1)
      console.log(`🔴 STALE GHOST DETECTED: ${trade.symbol} (age: ${hours}h)`)
      await this.handleExternalClosure(trade, 'Stale position cleanup (>6h old)')
    }
  }
}
// LAYER 2: Death spiral detector in executeExit()
if (errorMsg.includes('429')) {
  if (trade.priceCheckCount > 20) { // 20+ failed close attempts (40+ seconds)
    console.log(`🔴 DEATH SPIRAL DETECTED: ${trade.symbol}`)
    await this.handleExternalClosure(trade, 'Death spiral prevention')
    return // Force remove from monitoring
  }
}
// LAYER 3: Ghost check during normal monitoring (every 20 price updates)
if (trade.priceCheckCount % 20 === 0) {
  const position = await driftService.getPosition(marketConfig.driftMarketIndex)
  if (!position || Math.abs(position.size) < 0.01) {
    console.log(`🔴 GHOST DETECTED in monitoring loop`)
    await this.handleExternalClosure(trade, 'Ghost detected during monitoring')
    return
  }
}
```
- **Key Changes:**
* validatePositions() now runs database cleanup FIRST (Layer 1) before Drift RPC checks
* Changed skip message from "skipping validation" to "using database-only validation"
* Layer 1 ALWAYS runs (no RPC required) - prevents long-term ghost accumulation (>6h)
* Layer 2 breaks death spirals within 40 seconds of detection
* Layer 3 catches ghosts quickly during normal monitoring (every 40s vs 5min)
- **Impact:**
* System now self-healing - no manual intervention needed
* Ghost positions cleaned within about 40 seconds by Layers 2 and 3; Layer 1 removes anything older than 6 hours as a backstop
* Works even during severe rate limiting (Layer 1 doesn't need RPC)
* Telegram /status always accurate
* User can be away - bot handles itself autonomously
- **Verification:** After a container restart with the new code, ghost positions should no longer accumulate
- **Lesson:** Critical validation logic must NEVER skip during error conditions - use fallback methods that don't require the failing resource
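
The fallback idea in the lesson above can be sketched in isolation. This is a minimal illustration, not the Position Manager code: `TrackedTrade`, `findStaleTrades`, `planValidation`, and the `SOL-PERP` symbol are hypothetical names chosen for the example, and the real `validatePositions()` may be structured differently. The point it shows is that the age check runs on every pass, and only the RPC-backed part depends on the Drift service being ready.

```typescript
// Hypothetical shapes for illustration; the real trade type may differ.
interface TrackedTrade {
  id: string
  symbol: string
  entryTime: number // epoch milliseconds
}

interface ValidationPlan {
  mode: 'full' | 'database-only'
  staleIds: string[]
}

const MAX_AGE_MS = 6 * 60 * 60 * 1000 // 6 hours, matching Layer 1

// Age check that needs only locally held data, so it works during rate limiting.
function findStaleTrades(
  trades: readonly TrackedTrade[],
  now: number,
  maxAgeMs: number = MAX_AGE_MS
): string[] {
  const stale: string[] = []
  for (const trade of trades) {
    if (now - trade.entryTime > maxAgeMs) {
      stale.push(trade.id)
    }
  }
  return stale
}

// Validation never returns early: the age check always runs, and readiness
// of the rate-limited service only decides whether RPC checks would follow.
function planValidation(
  trades: readonly TrackedTrade[],
  driftReady: boolean,
  now: number
): ValidationPlan {
  const staleIds = findStaleTrades(trades, now)
  return { mode: driftReady ? 'full' : 'database-only', staleIds }
}

// Usage: one 7-hour-old trade and one 1-hour-old trade while Drift is not ready.
const now = Date.UTC(2025, 10, 15, 23, 0, 0)
const trades: TrackedTrade[] = [
  { id: 'a', symbol: 'SOL-PERP', entryTime: now - 7 * 60 * 60 * 1000 },
  { id: 'b', symbol: 'SOL-PERP', entryTime: now - 1 * 60 * 60 * 1000 },
]
const plan = planValidation(trades, false, now)
console.log(plan.mode, plan.staleIds) // database-only [ 'a' ]
```

Keeping the age check as a pure function of the tracked trades and the current time makes it easy to unit test and guarantees it cannot be blocked by the resource that is failing.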
## File Conventions
- **API routes:** `app/api/[feature]/[action]/route.ts` (Next.js 15 App Router)