docs: Document EPYC cluster SSH timeout fix in Common Pitfalls

- Added Common Pitfall #64: SSH timeout for nested hop scenarios
- Documented 30s→60s timeout increase rationale
- Explained SSH options: StrictHostKeyChecking, ConnectTimeout, ServerAliveInterval
- Included verification data: 23-24 processes per worker at 99% CPU
- Provided formula for calculating minimum timeouts for multi-hop SSH
- Cross-referenced commit ef371a1 (the actual code fix)
- Added future prevention guidance (timeout formulas, SSH multiplexing)

This documentation update accompanies the cluster fix deployed earlier.
Author: mindesbunister
Date:   2025-12-01 09:46:17 +01:00
Parent: ef371a19b9
Commit: c343daeb44


@@ -4746,6 +4746,67 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
* `app/api/trading/check-risk/route.ts` - Integration point (calls addSignal())
- **Lesson:** When building validation systems, use existing infrastructure (1-min data cache) instead of creating new dependencies. Confirmation via price action is more reliable than pre-filtering with strict thresholds. Balance between catching winners (0.3% confirms) and avoiding losers (0.4% abandons) requires tuning based on 50-100 validation outcomes.
64. **EPYC Cluster SSH Timeout - Nested Hop Requires Longer Timeouts (CRITICAL - Fixed Dec 1, 2025):**
- **Symptom:** Coordinator reports "operation failed" with "SSH command timed out for v9_chunk_000002 on worker1"
- **Root Cause:** 30-second subprocess timeout insufficient for nested SSH hop (master → worker1 → worker2)
- **Real incident (Dec 1, 2025):**
* EPYC cluster with 2 servers (64 cores total) failed to start workers
* Worker1 (direct SSH): Reachable instantly
* Worker2 (via SSH hop through worker1): Required 40+ seconds to start processes
* Coordinator timeout set to 30s → worker2 startup always failed
* Database showed chunks as "running" but no actual processes existed
- **Impact:** Distributed parameter exploration completely non-functional, 64 cores sitting idle
- **Three-part fix (cluster/distributed_coordinator.py):**
```python
# Excerpt from distributed_coordinator.py; ssh_cmd, chunk_id and worker_name
# are defined in the surrounding coordinator code.
import subprocess

# 1. Added SSH options for reliability (lines 302-318)
ssh_opts = "-o StrictHostKeyChecking=no -o ConnectTimeout=10 -o ServerAliveInterval=5"

# 2. Increased subprocess timeout (lines 396-418)
try:
    result = subprocess.run(
        ssh_cmd,
        shell=True,
        capture_output=True,
        text=True,
        timeout=60  # Increased from 30s to 60s
    )
# 3. Enhanced error messages
except subprocess.TimeoutExpired:
    print(f"⚠️ SSH command timed out for {chunk_id} on {worker_name}")
    print(f"   This usually means SSH hop is misconfigured or slow")
```
- **SSH Options Added** (composition sketch after this list):
* `-o StrictHostKeyChecking=no` → Skips the interactive host-key confirmation prompt on first connection to a new host
* `-o ConnectTimeout=10` → Fails fast if a host is unreachable instead of waiting indefinitely
* `-o ServerAliveInterval=5` → Sends a keepalive probe every 5 seconds to prevent silent connection drops during long operations
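For illustration, a minimal sketch of how these options could be composed into direct and nested-hop commands; the `build_ssh_cmd` helper, the `worker1`/`worker2` host aliases, and the use of OpenSSH's `-J` (ProxyJump) are assumptions for this example, not necessarily how `distributed_coordinator.py` builds its commands:
```python
from typing import Optional

# Illustrative only: the coordinator's real command construction may differ.
SSH_OPTS = "-o StrictHostKeyChecking=no -o ConnectTimeout=10 -o ServerAliveInterval=5"

def build_ssh_cmd(target: str, remote_cmd: str, jump_host: Optional[str] = None) -> str:
    """Compose an ssh command string, routing through a jump host when one is given."""
    jump = f"-J {jump_host} " if jump_host else ""
    return f'ssh {SSH_OPTS} {jump}{target} "{remote_cmd}"'

# Direct: master -> worker1
print(build_ssh_cmd("worker1", "nproc"))
# Nested hop: master -> worker1 -> worker2
print(build_ssh_cmd("worker2", "nproc", jump_host="worker1"))
```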
- **Verification (Dec 1, 2025)** (see the process-check sketch after this list):
* Worker1: 23 Python processes at 99% CPU processing chunk 0 (0-2000 combos)
* Worker2: 24 Python processes at 99% CPU processing chunk 1 (2000-4000 combos)
* Both workers loading 34,273 rows of SOL/USDT 5-minute data
* Database correctly showing 2 chunks "running", 1 chunk "pending"
* Coordinator running in background, monitoring every 60 seconds
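A hedged sketch of how such a process check could be run from the master; the `worker1`/`worker2` host names and the `distributed_worker` match pattern are placeholders, not the actual process names:
```python
import subprocess

SSH_OPTS = "-o StrictHostKeyChecking=no -o ConnectTimeout=10"

def count_remote_processes(host: str, pattern: str = "distributed_worker") -> int:
    """Count processes on `host` whose command line matches `pattern` (0 if none or unreachable)."""
    result = subprocess.run(
        f"ssh {SSH_OPTS} {host} 'pgrep -fc {pattern} || true'",
        shell=True, capture_output=True, text=True, timeout=20,
    )
    try:
        return int(result.stdout.strip() or 0)
    except ValueError:
        return 0

for host in ("worker1", "worker2"):
    print(f"{host}: {count_remote_processes(host)} matching processes")
```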
- **Files changed:** `cluster/distributed_coordinator.py` (13 insertions, 5 deletions)
- **Commit:** ef371a1 "fix: EPYC cluster SSH timeout - increase timeout 30s→60s + add SSH options"
- **Why 60 seconds:** Nested SSH hop has compounding latency (summed in the sketch after this list):
* Master → Worker1 connection: ~2-3 seconds
* Worker1 → Worker2 connection: ~2-3 seconds via hop
* Start Python process with multiprocessing: ~10-15 seconds
* Load 34K rows of data: ~5-10 seconds
* Initialize 22-24 worker processes: ~10-20 seconds
* Total: 30-50 seconds typical, 60s provides safety margin
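Quick sanity check of those estimates (values copied from the bullets above, not re-measured):
```python
# Stage estimates from the breakdown above (seconds, low/high).
stages = {
    "master -> worker1 connect":         (2, 3),
    "worker1 -> worker2 hop":            (2, 3),
    "start Python multiprocessing":      (10, 15),
    "load 34K rows of data":             (5, 10),
    "initialize 22-24 worker processes": (10, 20),
}
low = sum(lo for lo, _ in stages.values())   # 29
high = sum(hi for _, hi in stages.values())  # 51
print(f"typical startup {low}-{high}s; a 60s timeout leaves ~{60 - high}s headroom")
```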
- **Lessons Learned:**
1. **Nested SSH hops need 2× minimum timeout** - Latency compounds at each hop
2. **Always use StrictHostKeyChecking=no for automation** - Prevents interactive prompts
3. **ServerAliveInterval prevents silent hangs** - Especially important for long-running processes
4. **Database cleanup essential before retries** - Stale "running" chunks prevent new assignments (see the reset sketch after this list)
5. **Verify actual process existence, not just database status** - Database can lie if previous run failed
6. **Test SSH connectivity separately before distributed execution** - Catch auth issues early
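Lesson 4's cleanup could look like the sketch below if chunk state lives in SQLite; the `coordinator.db` path, `chunks` table, and `status`/`worker` columns are hypothetical and may not match the real schema:
```python
import sqlite3

def reset_stale_chunks(db_path: str = "cluster/coordinator.db") -> int:
    """Flip chunks stuck in 'running' back to 'pending' before a retry; returns rows changed."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "UPDATE chunks SET status = 'pending', worker = NULL WHERE status = 'running'"
        )
        conn.commit()
        return cur.rowcount
    finally:
        conn.close()

print(f"reset {reset_stale_chunks()} stale chunks")
```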
- **Future Prevention:**
* Document minimum timeout formula (a helper version is sketched after this list): `base_timeout = (num_hops × 5s) + (process_startup × 2) + safety_margin(10s)`
* For 2 hops with heavy startup: `(2 × 5) + (20 × 2) + 10 = 60 seconds`
* Monitor coordinator logs for timeout patterns, increase if needed
* Consider SSH multiplexing (ControlMaster) to speed up nested hops
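The timeout formula expressed as a small helper (a sketch only; the function name is invented, the constants mirror the bullet above), with the multiplexing options noted as a comment:
```python
def min_ssh_timeout(num_hops: int, process_startup_s: float, safety_margin_s: float = 10.0) -> float:
    """base_timeout = (num_hops × 5s) + (process_startup × 2) + safety_margin(10s)."""
    return num_hops * 5 + process_startup_s * 2 + safety_margin_s

# Two hops with ~20s of heavy startup -> 60s, matching the value shipped in ef371a1.
assert min_ssh_timeout(num_hops=2, process_startup_s=20) == 60

# SSH multiplexing (standard OpenSSH options) can speed up repeated nested hops, e.g.:
#   -o ControlMaster=auto -o ControlPath=~/.ssh/cm-%r@%h-%p -o ControlPersist=10m
```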
## File Conventions
- **API routes:** `app/api/[feature]/[action]/route.ts` (Next.js 15 App Router)