docs: Document EPYC cluster SSH timeout fix in Common Pitfalls

- Added Common Pitfall #64: SSH timeout for nested hop scenarios
- Documented 30s→60s timeout increase rationale
- Explained SSH options: StrictHostKeyChecking, ConnectTimeout, ServerAliveInterval
- Included verification data: 23-24 processes per worker at 99% CPU
- Provided formula for calculating minimum timeouts for multi-hop SSH
- Cross-referenced commit ef371a1 (the actual code fix)
- Added future prevention guidance (timeout formulas, SSH multiplexing)

This documentation update accompanies the cluster fix deployed earlier.
Author: mindesbunister
Date:   2025-12-01 09:46:17 +01:00
Parent: ef371a19b9
Commit: c343daeb44


@@ -4746,6 +4746,67 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
* `app/api/trading/check-risk/route.ts` - Integration point (calls addSignal())
- **Lesson:** When building validation systems, use existing infrastructure (1-min data cache) instead of creating new dependencies. Confirmation via price action is more reliable than pre-filtering with strict thresholds. Balance between catching winners (0.3% confirms) and avoiding losers (0.4% abandons) requires tuning based on 50-100 validation outcomes.
64. **EPYC Cluster SSH Timeout - Nested Hop Requires Longer Timeouts (CRITICAL - Fixed Dec 1, 2025):**
- **Symptom:** Coordinator reports "operation failed" with "SSH command timed out for v9_chunk_000002 on worker1"
- **Root Cause:** 30-second subprocess timeout insufficient for nested SSH hop (master → worker1 → worker2)
- **Real incident (Dec 1, 2025):**
* EPYC cluster with 2 servers (64 cores total) failed to start workers
* Worker1 (direct SSH): Reachable instantly
* Worker2 (via SSH hop through worker1): Required 40+ seconds to start processes
* Coordinator timeout set to 30s → worker2 startup always failed
* Database showed chunks as "running" but no actual processes existed
- **Impact:** Distributed parameter exploration completely non-functional, 64 cores sitting idle
- **Three-part fix (cluster/distributed_coordinator.py):**
```python
# Excerpt from distributed_coordinator.py; ssh_cmd, chunk_id and worker_name
# are defined in the surrounding coordinator code.
import subprocess

# 1. Added SSH options for reliability (lines 302-318)
ssh_opts = "-o StrictHostKeyChecking=no -o ConnectTimeout=10 -o ServerAliveInterval=5"

# 2. Increased subprocess timeout (lines 396-418)
try:
    result = subprocess.run(
        ssh_cmd,
        shell=True,
        capture_output=True,
        text=True,
        timeout=60  # Increased from 30s to 60s
    )
# 3. Enhanced error messages
except subprocess.TimeoutExpired:
    print(f"⚠️ SSH command timed out for {chunk_id} on {worker_name}")
    print(f"   This usually means SSH hop is misconfigured or slow")
```
- **SSH Options Added** (composition sketch after this list):
* `-o StrictHostKeyChecking=no` → Skips the interactive host-key confirmation prompt on first connection to a new host
* `-o ConnectTimeout=10` → Fails fast if a host is unreachable instead of waiting indefinitely
* `-o ServerAliveInterval=5` → Sends a keepalive probe every 5 seconds to prevent silent connection drops during long operations
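For illustration, a minimal sketch of how these options could be composed into direct and nested-hop commands; the `build_ssh_cmd` helper, the `worker1`/`worker2` host aliases, and the use of OpenSSH's `-J` (ProxyJump) are assumptions for this example, not necessarily how `distributed_coordinator.py` builds its commands:
```python
from typing import Optional

# Illustrative only: the coordinator's real command construction may differ.
SSH_OPTS = "-o StrictHostKeyChecking=no -o ConnectTimeout=10 -o ServerAliveInterval=5"

def build_ssh_cmd(target: str, remote_cmd: str, jump_host: Optional[str] = None) -> str:
    """Compose an ssh command string, routing through a jump host when one is given."""
    jump = f"-J {jump_host} " if jump_host else ""
    return f'ssh {SSH_OPTS} {jump}{target} "{remote_cmd}"'

# Direct: master -> worker1
print(build_ssh_cmd("worker1", "nproc"))
# Nested hop: master -> worker1 -> worker2
print(build_ssh_cmd("worker2", "nproc", jump_host="worker1"))
```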
- **Verification (Dec 1, 2025)** (see the process-check sketch after this list):
* Worker1: 23 Python processes at 99% CPU processing chunk 0 (0-2000 combos)
* Worker2: 24 Python processes at 99% CPU processing chunk 1 (2000-4000 combos)
* Both workers loading 34,273 rows of SOL/USDT 5-minute data
* Database correctly showing 2 chunks "running", 1 chunk "pending"
* Coordinator running in background, monitoring every 60 seconds
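A hedged sketch of how such a process check could be run from the master; the `worker1`/`worker2` host names and the `distributed_worker` match pattern are placeholders, not the actual process names:
```python
import subprocess

SSH_OPTS = "-o StrictHostKeyChecking=no -o ConnectTimeout=10"

def count_remote_processes(host: str, pattern: str = "distributed_worker") -> int:
    """Count processes on `host` whose command line matches `pattern` (0 if none or unreachable)."""
    result = subprocess.run(
        f"ssh {SSH_OPTS} {host} 'pgrep -fc {pattern} || true'",
        shell=True, capture_output=True, text=True, timeout=20,
    )
    try:
        return int(result.stdout.strip() or 0)
    except ValueError:
        return 0

for host in ("worker1", "worker2"):
    print(f"{host}: {count_remote_processes(host)} matching processes")
```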
- **Files changed:** `cluster/distributed_coordinator.py` (13 insertions, 5 deletions)
- **Commit:** ef371a1 "fix: EPYC cluster SSH timeout - increase timeout 30s→60s + add SSH options"
- **Why 60 seconds:** Nested SSH hop has compounding latency (summed in the sketch after this list):
* Master → Worker1 connection: ~2-3 seconds
* Worker1 → Worker2 connection: ~2-3 seconds via hop
* Start Python process with multiprocessing: ~10-15 seconds
* Load 34K rows of data: ~5-10 seconds
* Initialize 22-24 worker processes: ~10-20 seconds
* Total: 30-50 seconds typical, 60s provides safety margin
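Quick sanity check of those estimates (values copied from the bullets above, not re-measured):
```python
# Stage estimates from the breakdown above (seconds, low/high).
stages = {
    "master -> worker1 connect":         (2, 3),
    "worker1 -> worker2 hop":            (2, 3),
    "start Python multiprocessing":      (10, 15),
    "load 34K rows of data":             (5, 10),
    "initialize 22-24 worker processes": (10, 20),
}
low = sum(lo for lo, _ in stages.values())   # 29
high = sum(hi for _, hi in stages.values())  # 51
print(f"typical startup {low}-{high}s; a 60s timeout leaves ~{60 - high}s headroom")
```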
- **Lessons Learned:**
1. **Nested SSH hops need 2× minimum timeout** - Latency compounds at each hop
2. **Always use StrictHostKeyChecking=no for automation** - Prevents interactive prompts
3. **ServerAliveInterval prevents silent hangs** - Especially important for long-running processes
4. **Database cleanup essential before retries** - Stale "running" chunks prevent new assignments (see the reset sketch after this list)
5. **Verify actual process existence, not just database status** - Database can lie if previous run failed
6. **Test SSH connectivity separately before distributed execution** - Catch auth issues early
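Lesson 4's cleanup could look like the sketch below if chunk state lives in SQLite; the `coordinator.db` path, `chunks` table, and `status`/`worker` columns are hypothetical and may not match the real schema:
```python
import sqlite3

def reset_stale_chunks(db_path: str = "cluster/coordinator.db") -> int:
    """Flip chunks stuck in 'running' back to 'pending' before a retry; returns rows changed."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "UPDATE chunks SET status = 'pending', worker = NULL WHERE status = 'running'"
        )
        conn.commit()
        return cur.rowcount
    finally:
        conn.close()

print(f"reset {reset_stale_chunks()} stale chunks")
```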
- **Future Prevention:**
* Document minimum timeout formula (a helper version is sketched after this list): `base_timeout = (num_hops × 5s) + (process_startup × 2) + safety_margin(10s)`
* For 2 hops with heavy startup: `(2 × 5) + (20 × 2) + 10 = 60 seconds`
* Monitor coordinator logs for timeout patterns, increase if needed
* Consider SSH multiplexing (ControlMaster) to speed up nested hops
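The timeout formula expressed as a small helper (a sketch only; the function name is invented, the constants mirror the bullet above), with the multiplexing options noted as a comment:
```python
def min_ssh_timeout(num_hops: int, process_startup_s: float, safety_margin_s: float = 10.0) -> float:
    """base_timeout = (num_hops × 5s) + (process_startup × 2) + safety_margin(10s)."""
    return num_hops * 5 + process_startup_s * 2 + safety_margin_s

# Two hops with ~20s of heavy startup -> 60s, matching the value shipped in ef371a1.
assert min_ssh_timeout(num_hops=2, process_startup_s=20) == 60

# SSH multiplexing (standard OpenSSH options) can speed up repeated nested hops, e.g.:
#   -o ControlMaster=auto -o ControlPath=~/.ssh/cm-%r@%h-%p -o ControlPersist=10m
```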
## File Conventions
- **API routes:** `app/api/[feature]/[action]/route.ts` (Next.js 15 App Router)