diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index 9002645..60c5ff5 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -4746,6 +4746,67 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
       * `app/api/trading/check-risk/route.ts` - Integration point (calls addSignal())
     - **Lesson:** When building validation systems, use existing infrastructure (1-min data cache) instead of creating new dependencies. Confirmation via price action is more reliable than pre-filtering with strict thresholds. Balance between catching winners (0.3% confirms) and avoiding losers (0.4% abandons) requires tuning based on 50-100 validation outcomes.
+64. **EPYC Cluster SSH Timeout - Nested Hop Requires Longer Timeouts (CRITICAL - Fixed Dec 1, 2025):**
+    - **Symptom:** Coordinator reports "operation failed" with "SSH command timed out for v9_chunk_000002 on worker1"
+    - **Root Cause:** The 30-second subprocess timeout was insufficient for the nested SSH hop (master → worker1 → worker2)
+    - **Real incident (Dec 1, 2025):**
+      * EPYC cluster with 2 servers (64 cores total) failed to start workers
+      * Worker1 (direct SSH): reachable instantly
+      * Worker2 (via SSH hop through worker1): required 40+ seconds to start processes
+      * Coordinator timeout set to 30s → worker2 startup always failed
+      * Database showed chunks as "running" but no actual processes existed
+    - **Impact:** Distributed parameter exploration completely non-functional, 64 cores sitting idle
+    - **Three-part fix (cluster/distributed_coordinator.py):**
+      ```python
+      # 1. Added SSH options for reliability (lines 302-318)
+      ssh_opts = "-o StrictHostKeyChecking=no -o ConnectTimeout=10 -o ServerAliveInterval=5"
+
+      # 2. Increased subprocess timeout (lines 396-418)
+      result = subprocess.run(
+          ssh_cmd,
+          shell=True,
+          capture_output=True,
+          text=True,
+          timeout=60  # Increased from 30s to 60s
+      )
+
+      # 3. Enhanced error messages (except clause of the try wrapping subprocess.run above)
+      except subprocess.TimeoutExpired:
+          print(f"⚠️ SSH command timed out for {chunk_id} on {worker_name}")
+          print(f"   This usually means SSH hop is misconfigured or slow")
+      ```
+    - **SSH Options Added:**
+      * `-o StrictHostKeyChecking=no` → Prevents interactive host-key confirmation prompts on first connection
+      * `-o ConnectTimeout=10` → Fails fast if a host is unreachable instead of hanging on the default TCP timeout
+      * `-o ServerAliveInterval=5` → Prevents silent connection drops during long operations
+    - **Verification (Dec 1, 2025):**
+      * Worker1: 23 Python processes at 99% CPU processing chunk 0 (0-2000 combos)
+      * Worker2: 24 Python processes at 99% CPU processing chunk 1 (2000-4000 combos)
+      * Both workers loading 34,273 rows of SOL/USDT 5-minute data
+      * Database correctly showing 2 chunks "running", 1 chunk "pending"
+      * Coordinator running in background, monitoring every 60 seconds
+    - **Files changed:** `cluster/distributed_coordinator.py` (13 insertions, 5 deletions)
+    - **Commit:** ef371a1 "fix: EPYC cluster SSH timeout - increase timeout 30s→60s + add SSH options"
+    - **Why 60 seconds:** Nested SSH hop has compounding latency:
+      * Master → Worker1 connection: ~2-3 seconds
+      * Worker1 → Worker2 connection: ~2-3 seconds via hop
+      * Start Python process with multiprocessing: ~10-15 seconds
+      * Load 34K rows of data: ~5-10 seconds
+      * Initialize 22-24 worker processes: ~10-20 seconds
+      * Total: 30-50 seconds typical, 60s provides a safety margin
+    - **Lessons Learned:**
+      1. **Nested SSH hops need 2× minimum timeout** - Latency compounds at each hop
+      2. **Always use StrictHostKeyChecking=no for automation** - Prevents interactive prompts
+      3. **ServerAliveInterval prevents silent hangs** - Especially important for long-running processes
+      4. **Database cleanup essential before retries** - Stale "running" chunks prevent new assignments (see the cleanup sketch below)
+      5. **Verify actual process existence, not just database status** - Database can lie if the previous run failed
+      6. **Test SSH connectivity separately before distributed execution** - Catch auth issues early (see the pre-flight sketch below)
+    - **Future Prevention:**
+      * Document minimum timeout formula: `base_timeout = (num_hops × 5s) + (process_startup × 2) + safety_margin(10s)` (see the timeout-budget sketch below)
+      * For 2 hops with heavy startup: `(2 × 5) + (20 × 2) + 10 = 60 seconds`
+      * Monitor coordinator logs for timeout patterns, increase if needed
+      * Consider SSH multiplexing (ControlMaster) to speed up nested hops
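+    - **Timeout-budget sketch (illustrative):** A minimal Python sketch of the documented formula above; the function name and defaults are assumptions for this doc, not code from `cluster/distributed_coordinator.py` or commit ef371a1:
+      ```python
+      # Hypothetical helper: keeps the timeout budget in one place instead of a
+      # hard-coded value passed to subprocess.run(..., timeout=...).
+      def ssh_timeout_budget(num_hops: int, process_startup_s: int, safety_margin_s: int = 10) -> int:
+          """base_timeout = (num_hops * 5s) + (process_startup * 2) + safety_margin(10s)"""
+          return (num_hops * 5) + (process_startup_s * 2) + safety_margin_s
+
+      # Incident numbers: 2 hops, ~20s of worker-process startup -> 60 seconds
+      assert ssh_timeout_budget(num_hops=2, process_startup_s=20) == 60
+      ```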
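+    - **Pre-flight check sketch (illustrative):** Covers lessons 5-6; assumes OpenSSH's `-J` (ProxyJump) for the nested hop and a hypothetical worker-process name pattern - the actual hop mechanism and process names in the cluster may differ:
+      ```python
+      import subprocess
+      from typing import Optional
+
+      SSH_OPTS = "-o StrictHostKeyChecking=no -o ConnectTimeout=10 -o ServerAliveInterval=5"
+      # Possible nested-hop speed-up via SSH multiplexing (assumption, not in the commit):
+      # "-o ControlMaster=auto -o ControlPath=~/.ssh/cm-%r@%h:%p -o ControlPersist=60"
+
+      def preflight_ssh(worker: str, hop: Optional[str] = None, timeout: int = 60) -> bool:
+          """Return True if `worker` answers a trivial command over SSH (optionally via a jump host)."""
+          jump = f"-J {hop} " if hop else ""
+          cmd = f"ssh {SSH_OPTS} {jump}{worker} 'echo ok'"
+          try:
+              result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
+              return result.returncode == 0 and "ok" in result.stdout
+          except subprocess.TimeoutExpired:
+              return False
+
+      def count_worker_processes(worker: str, pattern: str = "explore_params", hop: Optional[str] = None) -> int:
+          """Count matching processes on the worker instead of trusting the chunk status in the database."""
+          jump = f"-J {hop} " if hop else ""
+          cmd = f"ssh {SSH_OPTS} {jump}{worker} 'pgrep -fc {pattern} || true'"
+          result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=30)
+          return int(result.stdout.strip() or 0)
+      ```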
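+    - **Stale-chunk cleanup sketch (illustrative):** For lesson 4; assumes a SQLite chunk table with hypothetical `chunks(status, assigned_worker)` columns - the coordinator's real schema may differ:
+      ```python
+      import sqlite3
+
+      def reset_stale_chunks(db_path: str = "cluster/chunks.db") -> int:
+          """Reset 'running' chunks to 'pending' before a retry (only when no workers are actually running)."""
+          conn = sqlite3.connect(db_path)
+          with conn:  # commits on success
+              cur = conn.execute(
+                  "UPDATE chunks SET status = 'pending', assigned_worker = NULL "
+                  "WHERE status = 'running'"
+              )
+              freed = cur.rowcount
+          conn.close()
+          return freed
+      ```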
+
 ## File Conventions
 
 - **API routes:** `app/api/[feature]/[action]/route.ts` (Next.js 15 App Router)