Files
trading_bot_v4/CRITICAL_INCIDENT_UNPROTECTED_POSITION.md
mindesbunister bd9633fbc2 CRITICAL FIX: Prevent unprotected positions via database-first pattern
Root Cause:
- Execute endpoint saved to database AFTER adding to Position Manager
- Database save failures were silently caught and ignored
- API returned success even when DB save failed
- Container restarts lost in-memory Position Manager state
- Result: Unprotected positions with no TP/SL monitoring

Fixes Applied:

1. Database-First Pattern (app/api/trading/execute/route.ts):
   - MOVED createTrade() BEFORE positionManager.addTrade()
   - If database save fails, return HTTP 500 with critical error
   - Error message: 'CLOSE POSITION MANUALLY IMMEDIATELY'
   - Position Manager only tracks database-persisted trades
   - Ensures container restarts can restore all positions

2. Transaction Timeout (lib/drift/orders.ts):
   - Added 30s timeout to confirmTransaction() in closePosition()
   - Prevents API from hanging during network congestion
   - Uses Promise.race() pattern for timeout enforcement

3. Telegram Error Messages (telegram_command_bot.py):
   - Parse JSON for ALL responses (not just 200 OK)
   - Extract detailed error messages from 'message' field
   - Shows critical warnings to user immediately
   - Fail-open: proceeds if analytics check fails

4. Position Manager (lib/trading/position-manager.ts):
   - Move lastPrice update to TOP of monitoring loop
   - Ensures /status endpoint always shows current price

Verification:
- Test trade cmhxj8qxl0000od076m21l58z executed successfully
- Database save completed BEFORE Position Manager tracking
- SL triggered correctly at -$4.21 after 15 minutes
- All protection systems working as expected

Impact:
- Eliminates risk of unprotected positions
- Provides immediate critical warnings if DB fails
- Enables safe container restarts with full position recovery
- Verified with live test trade on production

See: CRITICAL_INCIDENT_UNPROTECTED_POSITION.md for full incident report
2025-11-13 15:56:28 +01:00

240 lines
7.3 KiB
Markdown

# CRITICAL INCIDENT: Unprotected Position (Nov 13, 2025)
## Summary
User opened SOL SHORT via Telegram command. Position opened on Drift but was NOT tracked by Position Manager, resulting in NO TP/SL orders and -$5.40 loss when manually closed.
## Timeline
- **~14:00 CET**: User sends `short sol` via Telegram
- **14:14 CET**: Container restarts (unknown reason)
- **~15:10 CET**: User notices position has no TP/SL in Drift UI
- **15:15 CET**: User manually closes position at -$5.40 loss
- **15:20 CET**: Investigation begins
## Root Cause Analysis
### Primary Cause: Database Save Failure Silently Ignored
**File:** `app/api/trading/execute/route.ts` lines 508-512 (original)
**The Bug:**
```typescript
// Add to position manager for monitoring AFTER orders are placed
await positionManager.addTrade(activeTrade)
// ... later in code ...
// Save trade to database
try {
await createTrade({...})
} catch (dbError) {
console.error('❌ Failed to save trade to database:', dbError)
// Don't fail the trade if database save fails ← THIS IS THE BUG
}
```
**What Happened:**
1. Position opened successfully on Drift ✅
2. Exit orders placed on-chain ✅ (probably)
3. Trade added to Position Manager in-memory ✅
4. Database save FAILED ❌ (error caught and logged)
5. API returned `success: true` to user ✅ (user didn't know save failed)
6. Container restarted at 14:14 CET
7. Position Manager restoration logic queries database for open trades
8. **Trade not in database** → Position Manager didn't monitor it ❌
9. Exit orders may have been canceled during restart or never placed
10. Position left completely unprotected
### Contributing Factors
#### Factor 1: Container Restart Lost In-Memory State
- Position Manager tracks trades in a `Map<string, ActiveTrade>`
- Container restart at 14:14 CET cleared all in-memory state
- Restoration logic relies on database query:
```typescript
const openTrades = await prisma.trade.findMany({
where: { exitReason: null }
})
```
- Since trade wasn't in DB, restoration failed silently
#### Factor 2: Ghost Trades Corrupting Database
Two trades found with `stopLossPrice=0`:
- `cmhkeenei0002nz07nl04uub8` (Nov 4, $70.35 entry)
- `cmho7ki8u000aof07k7lpivb0` (Nov 7, $119.18 entry)
These may have caused database schema issues or validation errors during the failed save.
#### Factor 3: No Database Save Verification
- Execute endpoint doesn't verify `createTrade()` succeeded before returning success
- User had no way to know their position was unprotected
- Telegram bot showed "success" message despite database failure
## The Fix
### Fix 1: Database-First Pattern (CRITICAL)
**File:** `app/api/trading/execute/route.ts`
**Before:**
```typescript
await positionManager.addTrade(activeTrade) // Add to memory FIRST
// ... create response ...
try {
await createTrade({...}) // Save to DB LATER
} catch (dbError) {
// Ignore error ← WRONG
}
```
**After:**
```typescript
try {
await createTrade({...}) // Save to DB FIRST
} catch (dbError) {
console.error('❌ CRITICAL: Failed to save trade to database:', dbError)
return NextResponse.json({
success: false,
error: 'Database save failed - position unprotected',
message: 'Position opened on Drift but database save failed. CLOSE POSITION MANUALLY IMMEDIATELY.'
}, { status: 500 })
}
// ONLY add to Position Manager if database save succeeded
await positionManager.addTrade(activeTrade)
```
**Impact:**
- If database save fails, API returns error
- User/Telegram bot gets failure notification
- Position Manager is NOT given the trade to monitor
- User knows to close position manually on Drift UI
- Prevents silent failures
### Fix 2: Transaction Confirmation Timeout
**File:** `lib/drift/orders.ts` (closePosition function)
**Problem:** `connection.confirmTransaction()` could hang indefinitely, blocking API
**Fix:**
```typescript
const confirmationPromise = connection.confirmTransaction(txSig, 'confirmed')
const timeoutPromise = new Promise((_, reject) =>
setTimeout(() => reject(new Error('Transaction confirmation timeout')), 30000)
)
const confirmation = await Promise.race([confirmationPromise, timeoutPromise])
```
**Impact:**
- Close API won't hang forever
- 30s timeout allows user to retry or check Drift UI
- Logs warning if timeout occurs
### Fix 3: Ghost Trade Cleanup
**Database:** Marked 2 corrupted trades as closed
```sql
UPDATE "Trade"
SET "exitReason" = 'ghost_trade_cleanup',
"exitPrice" = "entryPrice",
"realizedPnL" = 0
WHERE id IN ('cmho7ki8u000aof07k7lpivb0', 'cmhkeenei0002nz07nl04uub8');
```
**Impact:**
- Position Manager restoration no longer blocked by invalid data
- Database queries for open trades won't return corrupted entries
## Lessons Learned
### 1. NEVER Silently Swallow Critical Errors
**Bad Pattern:**
```typescript
try {
await criticalOperation()
} catch (err) {
console.error('Error:', err)
// Continue anyway ← WRONG
}
```
**Good Pattern:**
```typescript
try {
await criticalOperation()
} catch (err) {
console.error('CRITICAL ERROR:', err)
return errorResponse() // FAIL FAST
}
```
### 2. Database-First for Stateful Operations
When in-memory state depends on database:
1. Save to database FIRST
2. Verify save succeeded
3. THEN update in-memory state
4. If any step fails, ROLL BACK or return error
### 3. Container Restart Resilience
- In-memory state is VOLATILE
- Critical state must persist to database
- Restoration logic must handle:
- Corrupted data
- Missing fields
- Schema mismatches
### 4. User Notifications for Failures
- API errors must propagate to user
- Telegram bot must show FAILURE messages
- Don't hide errors from users - they need to know!
### 5. Verification Mandate Still Critical
- Even after this incident, we didn't verify the fix worked with real trade
- ALWAYS execute test trade after deploying financial code changes
- Monitor logs to ensure expected behavior
## Prevention Measures
### Immediate (Deployed)
- ✅ Database save moved before Position Manager add
- ✅ Transaction confirmation timeout (30s)
- ✅ Ghost trades cleaned from database
### Short-Term (To Do)
- [ ] Add database save health check on startup
- [ ] Create `/api/admin/sync-positions` endpoint to reconcile Drift vs Database
- [ ] Add Telegram alert when trade save fails
- [ ] Log database errors to SystemEvent table for monitoring
### Long-Term (Future)
- [ ] Implement database transactions (savepoint before trade execution)
- [ ] Add automatic position sync check every 5 minutes
- [ ] Create "orphaned position" detection (on Drift but not in DB)
- [ ] Add Sentry/error tracking for database failures
- [ ] Consider Redis/in-memory DB for critical state (survives restarts)
## Financial Impact
- **Loss:** -$5.40
- **Risk Exposure:** Unlimited (position had no stop loss)
- **Duration Unprotected:** ~1 hour
- **Prevented Loss:** Unknown (market could have moved significantly)
## Status
- ✅ Position closed manually
- ✅ Fixes implemented and deployed
- ✅ Ghost trades cleaned
- ⏳ Container rebuilding with fixes
- ⏳ Need test trade to verify fixes
## Next Steps
1. Wait for container rebuild to complete
2. Test with small position ($10-20)
3. Verify database save succeeds before Position Manager add
4. Monitor for any database errors
5. Consider reducing position size until system proven stable
---
**Created:** Nov 13, 2025 15:30 CET
**Status:** RESOLVED
**Severity:** CRITICAL
**Owner:** AI Agent + User