trading_bot_v4/docs/DNS_RETRY_LOGIC.md
2025-11-13 16:06:26 +01:00

DNS Retry Logic for Drift Initialization

Problem Solved

Issue: The trading bot failed with HTTP 500 "fetch failed" errors whenever DNS resolution for mainnet.helius-rpc.com failed temporarily, causing:

  • n8n workflow failures (missed trades)
  • Manual Telegram trades failing
  • Container restart failures

Root Cause: DNS lookup errors (EAI_AGAIN) are transient network issues that usually resolve within seconds, but the bot treated them as permanent failures.

Solution

Added automatic retry logic to lib/drift/client.ts that:

  1. Detects transient errors:

    • fetch failed
    • EAI_AGAIN (DNS temporary failure)
    • ENOTFOUND (DNS resolution failed)
    • ETIMEDOUT (connection timeout)
    • ECONNREFUSED (connection refused)
  2. Retries automatically:

    • Max 3 attempts
    • 2-second delay between attempts
    • Logs each retry for monitoring
  3. Fails fast on non-transient errors:

    • Authentication errors
    • Invalid configuration
    • Permanent network issues
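
The three behaviours above can be sketched as follows. This is a minimal illustration, not the actual contents of lib/drift/client.ts: retryOperation, maxRetries and delayMs are named in this document, but the signature shown here, the isTransientError helper, the TRANSIENT_MARKERS list and the inspection of the error's cause chain are all assumptions. The log strings mirror the ones under Example Logs.

```typescript
// Sketch only: names other than retryOperation, maxRetries and delayMs are hypothetical.
const TRANSIENT_MARKERS = [
  "fetch failed",
  "EAI_AGAIN",
  "ENOTFOUND",
  "ETIMEDOUT",
  "ECONNREFUSED",
];

// Collects message and code from the error and up to two nested causes,
// then checks whether any transient marker appears in that text.
function isTransientError(error: unknown): boolean {
  const parts: string[] = [];
  let current: unknown = error;
  for (let depth = 0; depth < 3 && current != null; depth++) {
    if (typeof current === "object") {
      const e = current as { message?: unknown; code?: unknown; cause?: unknown };
      if (typeof e.message === "string") parts.push(e.message);
      if (typeof e.code === "string") parts.push(e.code);
      current = e.cause;
    } else {
      parts.push(String(current));
      current = undefined;
    }
  }
  const text = parts.join(" ");
  return TRANSIENT_MARKERS.some((marker) => text.includes(marker));
}

async function retryOperation<T>(
  operation: () => Promise<T>,
  maxRetries = 3, // total attempts, matching "attempt 1/3" in the logs
  delayMs = 2000,
): Promise<T> {
  const attempts = Math.max(1, maxRetries);
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Fail fast: non-transient errors are rethrown without retrying.
      if (!isTransientError(error)) throw error;
      const message = error instanceof Error ? error.message : String(error);
      console.warn(`⚠️ Drift initialization failed (attempt ${attempt}/${attempts}): ${message}`);
      if (attempt < attempts) {
        console.log(`⏳ Retrying in ${delayMs}ms...`);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // All attempts failed with transient errors; surface the last one.
  throw lastError;
}
```

Matching on substrings keeps the check simple, but it can misclassify an unrelated error whose message happens to contain one of the markers. A stricter variant could compare only the error code.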

Example Logs

Success after retry:

🚀 Initializing Drift Protocol client...
⚠️ Drift initialization failed (attempt 1/3): fetch failed
⏳ Retrying in 2000ms...
✅ Drift client subscribed to account updates
✅ Drift service initialized successfully

Permanent failure (after 3 attempts):

🚀 Initializing Drift Protocol client...
⚠️ Drift initialization failed (attempt 1/3): fetch failed
⏳ Retrying in 2000ms...
⚠️ Drift initialization failed (attempt 2/3): fetch failed
⏳ Retrying in 2000ms...
⚠️ Drift initialization failed (attempt 3/3): fetch failed
❌ Failed to initialize Drift service after retries: TypeError: fetch failed

Impact

Before:

  • DNS hiccup → 500 error → n8n workflow fails → missed trade
  • User must manually retry via Telegram

After:

  • DNS hiccup → automatic retry (2s delay) → success → trade executes
  • An estimated 99% of transient failures are handled automatically
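
The retry delay also bounds how long a caller such as the n8n workflow may wait before it gets a result. The sketch below is a rough estimate: worstCaseRetryDelayMs is a hypothetical helper, and it assumes that a delay follows every failed attempt except the last, as the Example Logs suggest. It does not count the time each attempt itself takes.

```typescript
// Hypothetical helper: total time spent sleeping between attempts
// when every attempt fails.
function worstCaseRetryDelayMs(maxRetries: number, delayMs: number): number {
  // No delay follows the final attempt.
  return Math.max(0, maxRetries - 1) * delayMs;
}

// With the defaults from this document: 3 attempts, 2000ms apart.
console.log(worstCaseRetryDelayMs(3, 2000)); // 4000
```

With the default parameters, retries add at most about 4 seconds on top of the attempts themselves, which is worth comparing against any timeout configured on the calling side.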

Testing

Deployed: Nov 13, 2025 at 16:02 CET

Commit: 5e826de

Monitor with:

# Check for retry activity (2>&1 also captures lines the container wrote to stderr)
docker logs trading-bot-v4 --since 1h 2>&1 | grep -E "Retrying|retry"

# Count DNS failures (each one should be followed by a retry)
docker logs trading-bot-v4 --since 1h 2>&1 | grep -c "EAI_AGAIN"

Configuration

Retry parameters in retryOperation():

  • maxRetries: 3 attempts (configurable)
  • delayMs: 2000ms between retries (configurable)
  • Applied to: Drift SDK initialization, subscribe, user account fetch

Related Incident

  • Incident: Nov 13, 2025 at 15:55 - n8n workflow failed with "fetch failed"
  • Manual recovery: User opened trade via Telegram successfully
  • Fix: This retry logic prevents future occurrences

Future Improvements

  1. Multiple RPC endpoints: Add fallback to public Solana RPC if Helius fails
  2. Circuit breaker: Temporarily disable Helius if consistent failures detected
  3. Metrics: Track retry success rate, DNS failure frequency
  4. Alert on persistent failures: Notify user if all retries fail multiple times in 1 hour
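
As a starting point for the first item, an endpoint fallback could look roughly like this. Nothing here exists in the codebase yet: withEndpointFallback and the connect callback are hypothetical names, and connect stands in for whatever initializes the Drift client against a given RPC URL.

```typescript
// Hypothetical helper: tries each RPC endpoint in order and returns the
// result of the first one that connects.
async function withEndpointFallback<T>(
  endpoints: readonly string[],
  connect: (endpoint: string) => Promise<T>,
): Promise<T> {
  if (endpoints.length === 0) {
    throw new Error("No RPC endpoints configured");
  }
  let lastError: unknown;
  for (let index = 0; index < endpoints.length; index++) {
    try {
      return await connect(endpoints[index]);
    } catch (error) {
      lastError = error;
      // Log the position only, since RPC URLs may carry an API key.
      console.warn(`⚠️ RPC endpoint ${index + 1}/${endpoints.length} failed`);
    }
  }
  // Every endpoint failed; surface the last error.
  throw lastError;
}
```

A caller might pass the Helius URL first and the public Solana RPC (https://api.mainnet-beta.solana.com) second, and wrap each connect call in the existing retry logic so that a brief DNS hiccup does not trigger a fallback by itself.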