# DNS Retry Logic for Drift Initialization

## Problem Solved

**Issue:** The trading bot failed with HTTP 500 "fetch failed" errors whenever DNS resolution for `mainnet.helius-rpc.com` temporarily failed, causing:
- n8n workflow failures (missed trades)
- Manual Telegram trades failing
- Container restart failures

**Root Cause:** DNS lookup errors (`EAI_AGAIN`) are transient network issues that usually resolve within seconds, but the bot was treating them as permanent failures.

## Solution

Added automatic retry logic to `lib/drift/client.ts` that:

1. **Detects transient errors:**
   - `fetch failed`
   - `EAI_AGAIN` (DNS temporary failure)
   - `ENOTFOUND` (DNS resolution failed)
   - `ETIMEDOUT` (connection timeout)
   - `ECONNREFUSED` (connection refused)

2. **Retries automatically:**
   - Max 3 attempts
   - 2 second delay between attempts
   - Logs each retry for monitoring

3. **Fails fast on non-transient errors:**
   - Authentication errors
   - Invalid configuration
   - Permanent network issues

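
The three behaviours above can be sketched roughly as follows. This is a minimal illustration, not the actual code in `lib/drift/client.ts`: the signature of `retryOperation()`, the helpers `errorText`, `isTransientError` and `sleep`, and the substring matching are all assumptions. Only the error markers, the 3 attempts and the 2000ms delay come from this document.

```typescript
// Sketch only: the real retryOperation() in lib/drift/client.ts may differ.
// Markers taken from the transient error list in this document.
const TRANSIENT_MARKERS = [
  "fetch failed",
  "EAI_AGAIN",
  "ENOTFOUND",
  "ETIMEDOUT",
  "ECONNREFUSED",
];

// Hypothetical helper: flattens message, code and nested cause into one string.
function errorText(err: unknown, depth = 0): string {
  if (err === null || typeof err !== "object" || depth > 3) return String(err);
  const e = err as { message?: unknown; code?: unknown; cause?: unknown };
  const parts = [String(e.message ?? ""), String(e.code ?? "")];
  if (e.cause !== undefined) parts.push(errorText(e.cause, depth + 1));
  return parts.join(" ");
}

// Hypothetical helper: true when the error text contains a transient marker.
function isTransientError(err: unknown): boolean {
  const text = errorText(err);
  return TRANSIENT_MARKERS.some((marker) => text.includes(marker));
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Assumed signature: runs `operation` up to `maxRetries` times in total.
async function retryOperation<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  delayMs = 2000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      // Fail fast: anything that does not look transient is rethrown at once.
      if (!isTransientError(err)) throw err;
      const message = err instanceof Error ? err.message : String(err);
      console.warn(
        `Drift initialization failed (attempt ${attempt}/${maxRetries}): ${message}`,
      );
      if (attempt < maxRetries) {
        console.warn(`Retrying in ${delayMs}ms...`);
        await sleep(delayMs);
      }
    }
  }
  // Every attempt failed with a transient error; surface the last one.
  throw lastError;
}
```

Checking `cause` as well as `message` is a design choice in this sketch: the `fetch failed` errors thrown by Node's built-in `fetch` usually carry the underlying system error (for example one with code `EAI_AGAIN`) in `cause`.
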
## Example Logs

**Success after retry:**
```
🚀 Initializing Drift Protocol client...
⚠️ Drift initialization failed (attempt 1/3): fetch failed
⏳ Retrying in 2000ms...
✅ Drift client subscribed to account updates
✅ Drift service initialized successfully
```

**Permanent failure (after 3 attempts):**
```
🚀 Initializing Drift Protocol client...
⚠️ Drift initialization failed (attempt 1/3): fetch failed
⏳ Retrying in 2000ms...
⚠️ Drift initialization failed (attempt 2/3): fetch failed
⏳ Retrying in 2000ms...
⚠️ Drift initialization failed (attempt 3/3): fetch failed
❌ Failed to initialize Drift service after retries: TypeError: fetch failed
```

## Impact

**Before:**
- DNS hiccup → 500 error → n8n workflow fails → missed trade
- User must manually retry via Telegram

**After:**
- DNS hiccup → automatic retry (2s delay) → success → trade executes
- 99% of transient failures handled automatically

## Testing

**Deployed:** Nov 13, 2025 at 16:02 CET
**Commit:** 5e826de

**Monitor with:**
```bash
# Check for retry activity (2>&1 because docker logs replays the container's stderr on stderr)
docker logs trading-bot-v4 --since 1h 2>&1 | grep -E "Retrying|retry"

# Count DNS failures (should see retries working)
docker logs trading-bot-v4 --since 1h 2>&1 | grep -c "EAI_AGAIN"
```

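
For a slightly richer view than `grep`, the log lines could be tallied per initialization run. The sketch below is an assumption-laden illustration: `summarizeRetryLogs` and `RetrySummary` are hypothetical names, and the matching relies only on the message fragments shown under Example Logs.

```typescript
// Hypothetical summary shape; not part of the bot's code.
interface RetrySummary {
  retries: number;   // "Retrying in ..." lines seen
  recovered: number; // runs that succeeded after at least one retry
  exhausted: number; // runs that failed after all attempts
}

// Tallies retry outcomes from raw log text, one log line per text line.
function summarizeRetryLogs(logText: string): RetrySummary {
  const summary: RetrySummary = { retries: 0, recovered: 0, exhausted: 0 };
  let retriedInThisRun = false;
  for (const line of logText.split("\n")) {
    if (line.includes("Initializing Drift Protocol client")) {
      retriedInThisRun = false; // a new initialization run starts
    } else if (line.includes("Retrying in")) {
      summary.retries++;
      retriedInThisRun = true;
    } else if (line.includes("Drift service initialized successfully")) {
      if (retriedInThisRun) summary.recovered++;
      retriedInThisRun = false;
    } else if (line.includes("Failed to initialize Drift service after retries")) {
      summary.exhausted++;
      retriedInThisRun = false;
    }
  }
  return summary;
}
```

Feeding it the output of `docker logs trading-bot-v4 --since 1h 2>&1` would give the number of recovered versus exhausted runs for that hour.
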
## Configuration

Retry parameters in `retryOperation()`:
- `maxRetries`: 3 attempts (configurable)
- `delayMs`: 2000ms between retries (configurable)
- Applied to: Drift SDK initialization, subscribe, user account fetch

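
This document does not say how the two parameters are made configurable. One possible approach, shown purely as a sketch, is to read them from environment variables and fall back to the documented defaults. `RetryConfig`, `loadRetryConfig`, `parsePositiveInt` and the variable names `DRIFT_MAX_RETRIES` and `DRIFT_RETRY_DELAY_MS` are all made up for illustration.

```typescript
// Hypothetical shape; the real parameters may be passed to retryOperation() differently.
interface RetryConfig {
  maxRetries: number; // total attempts, including the first
  delayMs: number;    // fixed wait between attempts, in milliseconds
}

// Defaults match the values listed in this document.
const DEFAULT_RETRY_CONFIG: RetryConfig = { maxRetries: 3, delayMs: 2000 };

// Parses a positive integer, falling back when the value is missing or invalid.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  const value = Number(raw);
  return raw !== undefined && Number.isInteger(value) && value > 0
    ? value
    : fallback;
}

// DRIFT_MAX_RETRIES and DRIFT_RETRY_DELAY_MS are invented variable names.
function loadRetryConfig(env: Record<string, string | undefined>): RetryConfig {
  return {
    maxRetries: parsePositiveInt(
      env["DRIFT_MAX_RETRIES"],
      DEFAULT_RETRY_CONFIG.maxRetries,
    ),
    delayMs: parsePositiveInt(
      env["DRIFT_RETRY_DELAY_MS"],
      DEFAULT_RETRY_CONFIG.delayMs,
    ),
  };
}
```

In a Node process this would be called as `loadRetryConfig(process.env)`. Invalid or missing values fall back to the defaults, so a typo in the environment cannot disable retries.
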
## Related Issues

- **Incident:** Nov 13, 2025 at 15:55 - n8n workflow failed with "fetch failed"
- **Manual recovery:** User opened trade via Telegram successfully
- **Fix:** This retry logic is intended to prevent recurrences caused by transient DNS failures

## Future Improvements

1. **Multiple RPC endpoints:** Add fallback to public Solana RPC if Helius fails
2. **Circuit breaker:** Temporarily disable Helius if consistent failures detected
3. **Metrics:** Track retry success rate, DNS failure frequency
4. **Alert on persistent failures:** Notify user if all retries fail multiple times in 1 hour
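
As a rough idea of what the first improvement could look like, the sketch below tries a list of RPC endpoints in order and returns the first connection that succeeds. Nothing here exists in the bot yet: `connectWithFallback`, `RPC_ENDPOINTS` and the `connect` callback are hypothetical, and the Helius URL is shown without its API key.

```typescript
// Hypothetical endpoint list: Helius first, then the public Solana RPC.
const RPC_ENDPOINTS = [
  "https://mainnet.helius-rpc.com", // API key omitted in this sketch
  "https://api.mainnet-beta.solana.com",
];

// Tries each endpoint in order and returns the first successful result.
// `connect` is a caller-supplied function, e.g. one that builds a client.
async function connectWithFallback<T>(
  endpoints: string[],
  connect: (url: string) => Promise<T>,
): Promise<T> {
  const failures: string[] = [];
  for (const url of endpoints) {
    try {
      return await connect(url);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      failures.push(`${url}: ${message}`);
      console.warn(`RPC endpoint failed, trying next: ${url} (${message})`);
    }
  }
  // No endpoint worked; report every failure to aid debugging.
  throw new Error(`All RPC endpoints failed: ${failures.join("; ")}`);
}
```

A fallback like this would sit outside the per-endpoint retry loop, so a brief DNS hiccup is still retried against Helius before the public endpoint is used.
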