Commit Graph

6 Commits

copilot-swe-agent[bot]
63b94016fe fix: Implement critical risk management fixes for bugs #76, #77, #78, #80
Co-authored-by: mindesbunister <32161838+mindesbunister@users.noreply.github.com>
2025-12-09 22:23:43 +00:00
mindesbunister
1ed909c661 fix: Stop Drift verifier retry loop cancelling orders (Bug #80)
CRITICAL FIX (Dec 9, 2025): Drift state verifier now stops retry loop when close transaction confirms, preventing infinite retries that cancel orders.

Problem:
- Drift state verifier detected 'closed' positions still open on Drift
- Sent close transaction which CONFIRMED on-chain
- But Drift API still showed position (5-minute propagation delay)
- Verifier thought close failed, retried immediately
- Infinite loop: close → confirm → Drift still shows position → retry
- Eventually Position Manager gave up, cancelled ALL orders
- User's position left completely unprotected

Root Cause (Bug #80):
- Solana transaction confirms in ~400ms on-chain
- Drift.getPosition() caches state, takes 5+ minutes to update
- Verifier didn't account for propagation delay
- Kept retrying every 10 minutes because Drift API lagged behind
- Each retry attempt potentially cancelled orders as a side effect

Solution:
- Check configSnapshot.retryCloseTime before retrying (see the sketch after this list)
- If last retry was <5 minutes ago, SKIP (wait for Drift to catch up)
- Log: 'Skipping retry - last attempt Xs ago (Drift propagation delay)'
- Prevents retry loop while Drift state propagates
- After 5 minutes, can retry if position truly stuck
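
A minimal sketch of that check; shouldRetryClose and the constant name are illustrative assumptions, since the commit only names configSnapshot.retryCloseTime:

    // Hypothetical sketch of the 5-minute propagation skip window.
    const DRIFT_PROPAGATION_WINDOW_MS = 5 * 60 * 1000;

    function shouldRetryClose(retryCloseTime: number | null, now = Date.now()): boolean {
      if (retryCloseTime === null) return true; // never retried: safe to attempt
      const elapsedMs = now - retryCloseTime;
      if (elapsedMs < DRIFT_PROPAGATION_WINDOW_MS) {
        console.log(`Skipping retry - last attempt ${Math.round(elapsedMs / 1000)}s ago (Drift propagation delay)`);
        return false; // wait for Drift's cached state to catch up
      }
      return true; // position may be truly stuck; allow another attempt
    }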

Impact:
- Orders no longer disappear repeatedly due to retry loop
- Position stays protected with TP1/TP2/SL between retries
- User doesn't need to manually replace orders every 3 minutes
- System respects Drift API propagation delay

Testing:
- Deployed fix, orders placed successfully
- Database synced: tp1OrderTx and tp2OrderTx populated
- Monitoring logs for 'Skipping retry' messages on next verifier run
- Position tracking: 1 active trade, monitoring active

Note: This fixes the symptom (retry loop). The root cause is the Drift SDK caching getPosition() results. A real fix would query on-chain state directly or increase the cache TTL.

Files changed:
- lib/monitoring/drift-state-verifier.ts (added 5-minute skip window)
2025-12-09 21:04:29 +01:00
mindesbunister
4ab7bf58da feat: Drift state verifier double-checking system (WIP - build issues)
CRITICAL: Position Manager stops monitoring randomly
User had to manually close SOL-PERP position after PM stopped at 23:21.

Implemented double-checking system to detect when positions marked
closed in DB are still open on Drift (and vice versa):

1. DriftStateVerifier service (lib/monitoring/drift-state-verifier.ts)
   - Runs every 10 minutes automatically
   - Checks closed trades (24h) vs actual Drift positions
   - Retries close if mismatch found
   - Sends Telegram alerts

2. Manual verification API (app/api/monitoring/verify-drift-state)
   - POST: Force immediate verification check
   - GET: Service status

3. Integrated into startup (lib/startup/init-position-manager.ts)
   - Auto-starts on container boot
   - First check after 2min, then every 10min
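
A hedged sketch of how pieces 1 and 3 could fit together; every name and shape below is an assumption, since the commit only lists the files involved:

    // Hypothetical sketch of the verifier loop and its startup schedule.
    const FIRST_CHECK_DELAY_MS = 2 * 60 * 1000; // first check after 2 minutes
    const CHECK_INTERVAL_MS = 10 * 60 * 1000;   // then every 10 minutes

    interface ClosedTrade { id: string; market: string; }
    interface DriftPosition { market: string; baseSize: number; }

    export class DriftStateVerifier {
      constructor(
        private fetchRecentlyClosed: () => Promise<ClosedTrade[]>, // closed in DB, last 24h
        private fetchDriftPositions: () => Promise<DriftPosition[]>,
        private retryClose: (market: string) => Promise<void>,
        private alert: (msg: string) => Promise<void>,
      ) {}

      start(): void {
        setTimeout(() => {
          void this.verify();
          setInterval(() => void this.verify(), CHECK_INTERVAL_MS);
        }, FIRST_CHECK_DELAY_MS);
      }

      async verify(): Promise<void> {
        const [closed, positions] = await Promise.all([
          this.fetchRecentlyClosed(),
          this.fetchDriftPositions(),
        ]);
        for (const trade of closed) {
          const stillOpen = positions.some(
            (p) => p.market === trade.market && p.baseSize !== 0,
          );
          if (stillOpen) {
            // DB says closed but Drift still shows it: alert and retry the close.
            await this.alert(`Mismatch: ${trade.market} closed in DB but open on Drift`);
            await this.retryClose(trade.market);
          }
        }
      }
    }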

STATUS: Build failing due to TypeScript compilation timeout
Need to fix and deploy, then investigate WHY Position Manager stops.

This addresses symptom (stuck positions) but not root cause (PM stopping).
2025-12-07 02:28:10 +01:00
mindesbunister
302511293c feat: Add production logging gating (Phase 1, Task 1.1)
- Created logger utility with environment-based gating (lib/utils/logger.ts; see the sketch after this list)
- Replaced 517 of 731 console.log statements with logger.log (71% immediate reduction)
- Fixed import paths in 15 files (resolved comment-trapped imports)
- Added DEBUG_LOGS=false to .env
- Expected 90% reduction in production when deployed
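
A minimal sketch of such an environment-gated logger, assuming the real lib/utils/logger.ts reads DEBUG_LOGS the obvious way (only the flag name appears in this commit):

    // Hypothetical lib/utils/logger.ts: debug output gated on DEBUG_LOGS.
    const debugEnabled = process.env.DEBUG_LOGS === 'true';

    export const logger = {
      // Suppressed in production unless DEBUG_LOGS=true.
      log: (...args: unknown[]): void => {
        if (debugEnabled) console.log(...args);
      },
      // Errors always surface regardless of the flag.
      error: (...args: unknown[]): void => console.error(...args),
    };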

Impact: Reduced I/O blocking, lower log volume in production
Risk: LOW (easy rollback, non-invasive)
Phase: Phase 1, Task 1.1 (Quick Wins - Console.log Production Gating)

Files changed:
- NEW: lib/utils/logger.ts (production-safe logging)
- NEW: scripts/replace-console-logs.js (automation tool)
- Modified: 15 lib/*.ts files (console.log → logger.log)
- Modified: .env (DEBUG_LOGS=false)

Next: Task 1.2 (Image Size Optimization)
2025-12-05 00:32:41 +01:00
mindesbunister
f420d98d55 critical: Make health monitor 3-4x more aggressive to prevent heap crashes
PROBLEM (Nov 27, 2025 - 11:53 UTC):
- accountUnsubscribe errors accumulated 200+ times in 2 seconds
- JavaScript heap out of memory crash BEFORE health monitor could trigger
- Old settings: 50 errors / 30s window / check every 10s = too slow
- Container crashed from memory exhaustion, not clean restart

SOLUTION - 3-4x FASTER RESPONSE:
- Error window: 30s → 10s (3× faster detection)
- Error threshold: 50 → 20 errors (2.5× more sensitive)
- Check frequency: 10s → 3s intervals (3× more frequent)
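
Expressed as constants, the change amounts to roughly this (names are illustrative; the commit only cites lines 13-14 and 32 of drift-health-monitor.ts):

    // Before (hypothetical names):
    //   ERROR_WINDOW_MS = 30_000   // 30s sliding window
    //   ERROR_THRESHOLD = 50       // errors before restart
    //   CHECK_INTERVAL_MS = 10_000 // poll every 10s

    // After: 3-4x more aggressive.
    const ERROR_WINDOW_MS = 10_000;  // 3x faster detection
    const ERROR_THRESHOLD = 20;      // 2.5x more sensitive
    const CHECK_INTERVAL_MS = 3_000; // 3x more frequent checks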

IMPACT:
- Before: 10-40 seconds to trigger restart
- After: 3-13 seconds to trigger restart (3-4× faster)
- Catches rapid error accumulation BEFORE heap exhaustion
- Clean restart instead of crash-and-recover

REAL INCIDENT TIMELINE:
11:53:43 - Errors start accumulating
11:53:45.606 - FATAL: heap out of memory (2.2 seconds)
11:53:47.803 - Docker restart (not health monitor)

NEW BEHAVIOR:
- 20 errors within the 10s window trips the restart; at ~100ms/error that accumulates in ~2s
- 3s check interval catches problem in 3-13s MAX
- Clean restart before memory leak causes crash

Files Changed:
- lib/monitoring/drift-health-monitor.ts (lines 13-14, 32)
2025-11-27 13:04:14 +01:00
mindesbunister
dc197f52a4 feat: Replace blind 2-hour reconnect with error-based health monitoring
User Request: Replace blind 2-hour restart timer with smart monitoring that only restarts when accountUnsubscribe errors actually occur

Changes:
1. Health Monitor (NEW):
- Created lib/monitoring/drift-health-monitor.ts
- Tracks accountUnsubscribe errors in 30-second sliding window
- Triggers container restart via flag file when 50+ errors detected
- Prevents unnecessary restarts when SDK healthy
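
A hedged sketch of the sliding-window tracker and flag-file trigger; the flag path and function shape are assumptions:

    import { writeFileSync } from 'fs';

    const WINDOW_MS = 30_000; // 30-second sliding window
    const THRESHOLD = 50;     // restart at 50+ errors
    const RESTART_FLAG = '/tmp/drift-restart-requested'; // assumed path

    const timestamps: number[] = [];

    export function recordUnsubscribeError(now = Date.now()): void {
      timestamps.push(now);
      // Evict entries older than the window.
      while (timestamps.length > 0 && now - timestamps[0] > WINDOW_MS) {
        timestamps.shift();
      }
      if (timestamps.length >= THRESHOLD) {
        // A supervisor (e.g. Docker healthcheck) watches for this file.
        writeFileSync(RESTART_FLAG, new Date(now).toISOString());
      }
    }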

2. Drift Client:
- Removed blind scheduleReconnection() and 2-hour timer
- Added interceptWebSocketErrors() to catch SDK errors
- Patches console.error to monitor for accountUnsubscribe patterns
- Starts health monitor after successful initialization
- Removed unused reconnect() method and reconnectTimer field
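
The console.error patch might look roughly like this (a sketch; the actual matcher in lib/drift/client.ts is not shown in the commit):

    // Hypothetical sketch of interceptWebSocketErrors(): wrap console.error
    // so SDK-internal accountUnsubscribe failures feed the health monitor.
    function interceptWebSocketErrors(onError: () => void): void {
      const originalError = console.error.bind(console);
      console.error = (...args: unknown[]): void => {
        if (args.map(String).join(' ').includes('accountUnsubscribe')) {
          onError(); // count toward the sliding-window threshold
        }
        originalError(...args); // preserve normal error output
      };
    }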

3. Health API (NEW):
- GET /api/drift/health - Check current error count and health status
- Returns: healthy boolean, errorCount, threshold, message
- Useful for external monitoring and debugging
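
Given the described response shape, a Next.js route handler could look like this sketch (the monitor exports are assumed):

    // Hypothetical app/api/drift/health/route.ts
    import { NextResponse } from 'next/server';
    // Assumed exports: the monitor would expose its current count and threshold.
    import { getErrorCount, ERROR_THRESHOLD } from '@/lib/monitoring/drift-health-monitor';

    export async function GET() {
      const errorCount = getErrorCount();
      const healthy = errorCount < ERROR_THRESHOLD;
      return NextResponse.json({
        healthy,
        errorCount,
        threshold: ERROR_THRESHOLD,
        message: healthy
          ? 'Drift SDK healthy'
          : 'accountUnsubscribe errors above threshold',
      });
    }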

Impact:
- System only restarts when actual memory leak detected
- Prevents unnecessary downtime every 2 hours
- More targeted response to SDK issues
- Better operational stability

Files:
- lib/monitoring/drift-health-monitor.ts (NEW - 165 lines)
- lib/drift/client.ts (removed timer, added error interception)
- app/api/drift/health/route.ts (NEW - health check endpoint)

Testing:
- Health monitor starts on initialization: ✓
- API endpoint returns healthy status: ✓
- No blind reconnection scheduled: ✓
2025-11-24 16:49:10 +01:00