diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index 9002645..60c5ff5 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -4746,6 +4746,67 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
       * `app/api/trading/check-risk/route.ts` - Integration point (calls addSignal())
     - **Lesson:** When building validation systems, use existing infrastructure (1-min data cache) instead of creating new dependencies. Confirmation via price action is more reliable than pre-filtering with strict thresholds. Balance between catching winners (0.3% confirms) and avoiding losers (0.4% abandons) requires tuning based on 50-100 validation outcomes.
+64. **EPYC Cluster SSH Timeout - Nested Hop Requires Longer Timeouts (CRITICAL - Fixed Dec 1, 2025):**
+    - **Symptom:** Coordinator reports "operation failed" with "SSH command timed out for v9_chunk_000002 on worker1"
+    - **Root Cause:** The 30-second subprocess timeout was insufficient for the nested SSH hop (master → worker1 → worker2)
+    - **Real incident (Dec 1, 2025):**
+      * EPYC cluster with 2 servers (64 cores total) failed to start workers
+      * Worker1 (direct SSH): reachable instantly
+      * Worker2 (via SSH hop through worker1): required 40+ seconds to start processes
+      * Coordinator timeout set to 30s → worker2 startup always failed
+      * Database showed chunks as "running" but no actual processes existed
+    - **Impact:** Distributed parameter exploration completely non-functional, 64 cores sitting idle
+    - **Three-part fix (cluster/distributed_coordinator.py):**
+      ```python
+      # 1. Added SSH options for reliability (lines 302-318)
+      ssh_opts = "-o StrictHostKeyChecking=no -o ConnectTimeout=10 -o ServerAliveInterval=5"
+
+      # 2. Increased subprocess timeout (lines 396-418)
+      result = subprocess.run(
+          ssh_cmd,
+          shell=True,
+          capture_output=True,
+          text=True,
+          timeout=60  # Increased from 30s to 60s
+      )
+
+      # 3. Enhanced error messages (except clause of the try wrapping subprocess.run above)
+      except subprocess.TimeoutExpired:
+          print(f"⚠️ SSH command timed out for {chunk_id} on {worker_name}")
+          print(f"   This usually means SSH hop is misconfigured or slow")
+      ```
+    - **SSH Options Added:**
+      * `-o StrictHostKeyChecking=no` → Prevents interactive host-key confirmation prompts on first connection
+      * `-o ConnectTimeout=10` → Fails fast if a host is unreachable instead of hanging on the default TCP timeout
+      * `-o ServerAliveInterval=5` → Prevents silent connection drops during long operations
+    - **Verification (Dec 1, 2025):**
+      * Worker1: 23 Python processes at 99% CPU processing chunk 0 (0-2000 combos)
+      * Worker2: 24 Python processes at 99% CPU processing chunk 1 (2000-4000 combos)
+      * Both workers loading 34,273 rows of SOL/USDT 5-minute data
+      * Database correctly showing 2 chunks "running", 1 chunk "pending"
+      * Coordinator running in background, monitoring every 60 seconds
+    - **Files changed:** `cluster/distributed_coordinator.py` (13 insertions, 5 deletions)
+    - **Commit:** ef371a1 "fix: EPYC cluster SSH timeout - increase timeout 30s→60s + add SSH options"
+    - **Why 60 seconds:** Nested SSH hop has compounding latency:
+      * Master → Worker1 connection: ~2-3 seconds
+      * Worker1 → Worker2 connection: ~2-3 seconds via hop
+      * Start Python process with multiprocessing: ~10-15 seconds
+      * Load 34K rows of data: ~5-10 seconds
+      * Initialize 22-24 worker processes: ~10-20 seconds
+      * Total: 30-50 seconds typical, 60s provides a safety margin
+    - **Lessons Learned:**
+      1. **Nested SSH hops need 2× minimum timeout** - Latency compounds at each hop
+      2. **Always use StrictHostKeyChecking=no for automation** - Prevents interactive prompts
+      3. **ServerAliveInterval prevents silent hangs** - Especially important for long-running processes
+      4. **Database cleanup essential before retries** - Stale "running" chunks prevent new assignments (see the cleanup sketch below)
+      5. **Verify actual process existence, not just database status** - Database can lie if the previous run failed
+      6. **Test SSH connectivity separately before distributed execution** - Catch auth issues early (see the pre-flight sketch below)
+    - **Future Prevention:**
+      * Document minimum timeout formula: `base_timeout = (num_hops × 5s) + (process_startup × 2) + safety_margin(10s)` (see the timeout-budget sketch below)
+      * For 2 hops with heavy startup: `(2 × 5) + (20 × 2) + 10 = 60 seconds`
+      * Monitor coordinator logs for timeout patterns, increase if needed
+      * Consider SSH multiplexing (ControlMaster) to speed up nested hops
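+    - **Timeout-budget sketch (illustrative):** A minimal Python sketch of the documented formula above; the function name and defaults are assumptions for this doc, not code from `cluster/distributed_coordinator.py` or commit ef371a1:
+      ```python
+      # Hypothetical helper: keeps the timeout budget in one place instead of a
+      # hard-coded value passed to subprocess.run(..., timeout=...).
+      def ssh_timeout_budget(num_hops: int, process_startup_s: int, safety_margin_s: int = 10) -> int:
+          """base_timeout = (num_hops * 5s) + (process_startup * 2) + safety_margin(10s)"""
+          return (num_hops * 5) + (process_startup_s * 2) + safety_margin_s
+
+      # Incident numbers: 2 hops, ~20s of worker-process startup -> 60 seconds
+      assert ssh_timeout_budget(num_hops=2, process_startup_s=20) == 60
+      ```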
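+    - **Pre-flight check sketch (illustrative):** Covers lessons 5-6; assumes OpenSSH's `-J` (ProxyJump) for the nested hop and a hypothetical worker-process name pattern - the actual hop mechanism and process names in the cluster may differ:
+      ```python
+      import subprocess
+      from typing import Optional
+
+      SSH_OPTS = "-o StrictHostKeyChecking=no -o ConnectTimeout=10 -o ServerAliveInterval=5"
+      # Possible nested-hop speed-up via SSH multiplexing (assumption, not in the commit):
+      # "-o ControlMaster=auto -o ControlPath=~/.ssh/cm-%r@%h:%p -o ControlPersist=60"
+
+      def preflight_ssh(worker: str, hop: Optional[str] = None, timeout: int = 60) -> bool:
+          """Return True if `worker` answers a trivial command over SSH (optionally via a jump host)."""
+          jump = f"-J {hop} " if hop else ""
+          cmd = f"ssh {SSH_OPTS} {jump}{worker} 'echo ok'"
+          try:
+              result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
+              return result.returncode == 0 and "ok" in result.stdout
+          except subprocess.TimeoutExpired:
+              return False
+
+      def count_worker_processes(worker: str, pattern: str = "explore_params", hop: Optional[str] = None) -> int:
+          """Count matching processes on the worker instead of trusting the chunk status in the database."""
+          jump = f"-J {hop} " if hop else ""
+          cmd = f"ssh {SSH_OPTS} {jump}{worker} 'pgrep -fc {pattern} || true'"
+          result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=30)
+          return int(result.stdout.strip() or 0)
+      ```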
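+    - **Stale-chunk cleanup sketch (illustrative):** For lesson 4; assumes a SQLite chunk table with hypothetical `chunks(status, assigned_worker)` columns - the coordinator's real schema may differ:
+      ```python
+      import sqlite3
+
+      def reset_stale_chunks(db_path: str = "cluster/chunks.db") -> int:
+          """Reset 'running' chunks to 'pending' before a retry (only when no workers are actually running)."""
+          conn = sqlite3.connect(db_path)
+          with conn:  # commits on success
+              cur = conn.execute(
+                  "UPDATE chunks SET status = 'pending', assigned_worker = NULL "
+                  "WHERE status = 'running'"
+              )
+              freed = cur.rowcount
+          conn.close()
+          return freed
+      ```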
+
 ## File Conventions
 
 - **API routes:** `app/api/[feature]/[action]/route.ts` (Next.js 15 App Router)