CRITICAL: Position Manager stops monitoring randomly
User had to manually close SOL-PERP position after PM stopped at 23:21.
Implemented double-checking system to detect when positions marked
closed in DB are still open on Drift (and vice versa):
1. DriftStateVerifier service (lib/monitoring/drift-state-verifier.ts)
- Runs every 10 minutes automatically
- Checks trades closed in the last 24h against actual Drift positions
- Retries close if mismatch found
- Sends Telegram alerts
2. Manual verification API (app/api/monitoring/verify-drift-state)
- POST: Force immediate verification check
- GET: Service status
3. Integrated into startup (lib/startup/init-position-manager.ts)
- Auto-starts on container boot
- First check after 2min, then every 10min
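The core comparison the verifier performs can be sketched as below. This is a minimal illustration, not the code in lib/monitoring/drift-state-verifier.ts: the ClosedTrade and OpenPosition shapes and the findStuckPositions name are hypothetical, and matching by market symbol alone is an assumption.

```typescript
// Hypothetical shapes; the real DB rows and Drift SDK position types will differ.
interface ClosedTrade {
  id: string;
  symbol: string;
}

interface OpenPosition {
  symbol: string;
  size: number; // signed base size; 0 means flat
}

interface Mismatch {
  tradeId: string;
  symbol: string;
  size: number;
}

// Return every trade the DB marks as closed whose market still shows a
// non-zero position on the exchange.
function findStuckPositions(
  closedTrades: ClosedTrade[],
  openPositions: OpenPosition[],
): Mismatch[] {
  // Index only the markets that actually hold a position.
  const sizeBySymbol = new Map<string, number>();
  for (const position of openPositions) {
    if (position.size !== 0) {
      sizeBySymbol.set(position.symbol, position.size);
    }
  }

  const mismatches: Mismatch[] = [];
  for (const trade of closedTrades) {
    const size = sizeBySymbol.get(trade.symbol);
    if (size !== undefined) {
      mismatches.push({ tradeId: trade.id, symbol: trade.symbol, size });
    }
  }
  return mismatches;
}
```

Matching on symbol alone would flag a false positive if a newer trade on the same market was opened after the closed one, so a real check would likely also need to exclude markets with a currently open trade in the DB.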
STATUS: Build failing due to TypeScript compilation timeout
Need to fix and deploy, then investigate WHY Position Manager stops.
This addresses symptom (stuck positions) but not root cause (PM stopping).
User Request: Replace blind 2-hour restart timer with smart monitoring that only restarts when accountUnsubscribe errors actually occur
Changes:
1. Health Monitor (NEW):
- Created lib/monitoring/drift-health-monitor.ts
- Tracks accountUnsubscribe errors in 30-second sliding window
- Triggers container restart via flag file when 50+ errors detected
- Prevents unnecessary restarts when SDK healthy
2. Drift Client:
- Removed blind scheduleReconnection() and 2-hour timer
- Added interceptWebSocketErrors() to catch SDK errors
- Patches console.error to monitor for accountUnsubscribe patterns
- Starts health monitor after successful initialization
- Removed unused reconnect() method and reconnectTimer field
3. Health API (NEW):
- GET /api/drift/health - Check current error count and health status
- Returns: healthy boolean, errorCount, threshold, message
- Useful for external monitoring and debugging
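A sliding-window counter of the kind described above might look like the following sketch. The class name, method names, and the onUnhealthy callback (standing in for writing the restart flag file) are all hypothetical; only the 30-second window, the 50-error threshold, and the status fields come from the notes above.

```typescript
// Values taken from the change notes; everything else is illustrative.
const WINDOW_MS = 30_000;
const ERROR_THRESHOLD = 50;

interface HealthStatus {
  healthy: boolean;
  errorCount: number;
  threshold: number;
  message: string;
}

class SlidingWindowErrorMonitor {
  private timestamps: number[] = [];
  private triggered = false;
  private readonly onUnhealthy: () => void;

  // onUnhealthy is a stand-in for the real restart trigger (e.g. a flag file).
  constructor(onUnhealthy: () => void) {
    this.onUnhealthy = onUnhealthy;
  }

  // Record one error and fire the callback once when the threshold is reached.
  recordError(now: number = Date.now()): void {
    this.timestamps.push(now);
    this.prune(now);
    if (!this.triggered && this.timestamps.length >= ERROR_THRESHOLD) {
      this.triggered = true;
      this.onUnhealthy();
    }
  }

  // Shape mirrors the fields listed for the health endpoint.
  getStatus(now: number = Date.now()): HealthStatus {
    this.prune(now);
    const errorCount = this.timestamps.length;
    const healthy = errorCount < ERROR_THRESHOLD;
    return {
      healthy,
      errorCount,
      threshold: ERROR_THRESHOLD,
      message: healthy
        ? "Error count below threshold"
        : "Error threshold reached within window",
    };
  }

  // Drop timestamps that have fallen out of the window.
  private prune(now: number): void {
    const cutoff = now - WINDOW_MS;
    while (this.timestamps.length > 0 && this.timestamps[0] <= cutoff) {
      this.timestamps.shift();
    }
  }
}
```

Passing the clock in as a parameter keeps the window logic testable without waiting 30 seconds; a route handler could simply return getStatus() as JSON.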
Impact:
- System restarts only when the accountUnsubscribe error threshold is hit (treated as the sign of a memory leak)
- Prevents unnecessary downtime every 2 hours
- More targeted response to SDK issues
- Better operational stability
Files:
- lib/monitoring/drift-health-monitor.ts (NEW - 165 lines)
- lib/drift/client.ts (removed timer, added error interception)
- app/api/drift/health/route.ts (NEW - health check endpoint)
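The error interception added to lib/drift/client.ts works by patching console.error. A rough sketch of that technique follows; the function name interceptConsoleErrors and the onMatch callback are hypothetical and are not the actual interceptWebSocketErrors() implementation.

```typescript
// Pattern the notes say is monitored; no global flag, so test() is stateless.
const ACCOUNT_UNSUBSCRIBE_PATTERN = /accountUnsubscribe/;

// Wrap console.error so matching messages are reported to onMatch.
// Returns a function that restores the original console.error.
function interceptConsoleErrors(onMatch: () => void): () => void {
  const original = console.error;

  console.error = (...args: unknown[]): void => {
    // Flatten arguments to text; Error objects contribute their message.
    const text = args
      .map((arg) => (arg instanceof Error ? arg.message : String(arg)))
      .join(" ");
    if (ACCOUNT_UNSUBSCRIBE_PATTERN.test(text)) {
      onMatch();
    }
    // Always forward to the original so normal logging is preserved.
    original.apply(console, args);
  };

  return () => {
    console.error = original;
  };
}
```

In this arrangement onMatch would feed the health monitor's error counter. Keeping a restore function makes the patch reversible, which helps in tests and on shutdown.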
Testing:
- Health monitor starts on initialization: ✅
- API endpoint returns healthy status: ✅
- No blind reconnection scheduled: ✅