docs: Document EPYC cluster SSH timeout fix in Common Pitfalls
- Added Common Pitfall #64: SSH timeout for nested hop scenarios
- Documented 30s→60s timeout increase rationale
- Explained SSH options: StrictHostKeyChecking, ConnectTimeout, ServerAliveInterval
- Included verification data: 23-24 processes per worker at 99% CPU
- Provided formula for calculating minimum timeouts for multi-hop SSH
- Cross-referenced commit ef371a1 (the actual code fix)
- Added future prevention guidance (timeout formulas, SSH multiplexing)
This documentation update accompanies the cluster fix deployed earlier.
.github/copilot-instructions.md (61 lines changed)
@@ -4746,6 +4746,67 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
     * `app/api/trading/check-risk/route.ts` - Integration point (calls addSignal())
   - **Lesson:** When building validation systems, use existing infrastructure (1-min data cache) instead of creating new dependencies. Confirmation via price action is more reliable than pre-filtering with strict thresholds. Balance between catching winners (0.3% confirms) and avoiding losers (0.4% abandons) requires tuning based on 50-100 validation outcomes.

64. **EPYC Cluster SSH Timeout - Nested Hop Requires Longer Timeouts (CRITICAL - Fixed Dec 1, 2025):**
   - **Symptom:** Coordinator reports "operation failed" with "SSH command timed out for v9_chunk_000002 on worker1"
   - **Root Cause:** The 30-second subprocess timeout was too short for the nested SSH hop (master → worker1 → worker2)
   - **Real incident (Dec 1, 2025):**
     * EPYC cluster with 2 servers (64 cores total) failed to start workers
     * Worker1 (direct SSH): Reachable instantly
     * Worker2 (via SSH hop through worker1): Needed 40+ seconds to start processes
     * Coordinator timeout was set to 30s → worker2 startup always failed
     * Database showed chunks as "running" but no actual processes existed
   - **Impact:** Distributed parameter exploration completely non-functional, 64 cores sitting idle
   - **Three-part fix (cluster/distributed_coordinator.py):**
     ```python
     import subprocess

     # 1. Added SSH options for reliability (lines 302-318)
     ssh_opts = "-o StrictHostKeyChecking=no -o ConnectTimeout=10 -o ServerAliveInterval=5"

     # 2. Increased subprocess timeout (lines 396-418)
     # ssh_cmd, chunk_id and worker_name come from the surrounding coordinator code (not shown)
     try:
         result = subprocess.run(
             ssh_cmd,
             shell=True,
             capture_output=True,
             text=True,
             timeout=60,  # Increased from 30s to 60s
         )
     # 3. Enhanced error messages
     except subprocess.TimeoutExpired:
         print(f"⚠️ SSH command timed out for {chunk_id} on {worker_name}")
         print("   This usually means SSH hop is misconfigured or slow")
     ```
   - **SSH Options Added:**
     * `-o StrictHostKeyChecking=no` → Skips the interactive host-key confirmation prompt on first connection (this also disables host-key verification)
     * `-o ConnectTimeout=10` → Fails after 10 seconds if the host is unreachable instead of waiting for the system TCP timeout
     * `-o ServerAliveInterval=5` → Sends a keepalive probe after 5 seconds of silence, so a dead connection is detected instead of hanging during long operations
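
To make the nested hop concrete, here is a minimal sketch of how a command for master → worker1 → worker2 could be assembled with the three options above. The helper name `build_nested_ssh` and the `nproc` payload are hypothetical; the actual command construction in `cluster/distributed_coordinator.py` is not shown in this commit and may differ.

```python
import shlex

# Options copied from the fix; everything else in this sketch is illustrative.
SSH_OPTS = [
    "-o", "StrictHostKeyChecking=no",
    "-o", "ConnectTimeout=10",
    "-o", "ServerAliveInterval=5",
]


def build_nested_ssh(hops, remote_cmd):
    """Wrap remote_cmd in one ssh invocation per hop.

    hops is ordered from the first host reached to the final target,
    e.g. ["worker1", "worker2"] for master -> worker1 -> worker2.
    """
    cmd = remote_cmd
    # Build from the innermost hop outwards so each layer is quoted once
    # for the shell that will run it.
    for host in reversed(hops):
        cmd = shlex.join(["ssh", *SSH_OPTS, host, cmd])
    return cmd


if __name__ == "__main__":
    print(build_nested_ssh(["worker1", "worker2"], "nproc"))
```

The resulting string is suitable for `subprocess.run(..., shell=True, timeout=60)` as in the fix. Each hop applies its own `ConnectTimeout`, which is one reason the outer subprocess timeout has to cover the sum of all hops plus process startup.
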
   - **Verification (Dec 1, 2025):**
     * Worker1: 23 Python processes at 99% CPU processing chunk 0 (0-2000 combos)
     * Worker2: 24 Python processes at 99% CPU processing chunk 1 (2000-4000 combos)
     * Both workers loading 34,273 rows of SOL/USDT 5-minute data
     * Database correctly showing 2 chunks "running", 1 chunk "pending"
     * Coordinator running in background, monitoring every 60 seconds
   - **Files changed:** `cluster/distributed_coordinator.py` (13 insertions, 5 deletions)
   - **Commit:** ef371a1 "fix: EPYC cluster SSH timeout - increase timeout 30s→60s + add SSH options"
   - **Why 60 seconds:** A nested SSH hop has compounding latency:
     * Master → Worker1 connection: ~2-3 seconds
     * Worker1 → Worker2 connection: ~2-3 seconds via hop
     * Start Python process with multiprocessing: ~10-15 seconds
     * Load 34K rows of data: ~5-10 seconds
     * Initialize 22-24 worker processes: ~10-20 seconds
     * Total: about 30-50 seconds typical (the ranges above sum to 29-51s), so 60s leaves a safety margin
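
The arithmetic behind the breakdown above can be checked with a short sketch. The (low, high) pairs are copied from the list; the dictionary and function names are illustrative only.

```python
# (low, high) estimates in seconds, copied from the latency breakdown.
STARTUP_BUDGET = {
    "master -> worker1 connect": (2, 3),
    "worker1 -> worker2 connect": (2, 3),
    "start Python with multiprocessing": (10, 15),
    "load 34K rows": (5, 10),
    "initialize worker processes": (10, 20),
}


def total_range(budget):
    """Sum the best-case and worst-case estimates separately."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


if __name__ == "__main__":
    low, high = total_range(STARTUP_BUDGET)
    print(f"estimated startup: {low}-{high}s")
    for timeout in (30, 60):
        # Negative headroom means the worst case exceeds the timeout.
        print(f"timeout={timeout}s, headroom over worst case: {timeout - high}s")
```

With these estimates the old 30s timeout is 21 seconds short of the worst case, while 60s leaves 9 seconds of headroom.
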
   - **Lessons Learned:**
     1. **Nested SSH hops need at least 2× the single-hop timeout** - Latency compounds at each hop
     2. **Always use StrictHostKeyChecking=no for automation** - Prevents interactive host-key prompts
     3. **ServerAliveInterval prevents silent hangs** - Especially important for long-running processes
     4. **Database cleanup is essential before retries** - Stale "running" chunks prevent new assignments
     5. **Verify actual process existence, not just database status** - The database can be stale if a previous run failed
     6. **Test SSH connectivity separately before distributed execution** - Catches auth issues early
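
Lessons 4 and 5 amount to reconciling what the database claims with what is actually running. A minimal sketch of that check follows, assuming the chunk status has already been read from the database and the per-worker process counts have already been collected (for example with `pgrep -fc` over SSH). The function name, data shapes, and example chunk IDs are hypothetical and not taken from the coordinator.

```python
def find_stale_chunks(db_status, live_counts):
    """Return chunk IDs marked "running" whose worker has no live process.

    db_status:   {chunk_id: (status, worker_name)} as read from the database
    live_counts: {worker_name: number of matching processes on that host}
    """
    return sorted(
        chunk_id
        for chunk_id, (status, worker) in db_status.items()
        if status == "running" and live_counts.get(worker, 0) == 0
    )


if __name__ == "__main__":
    # Hypothetical snapshot resembling the failed run: worker2 never started.
    db_status = {
        "v9_chunk_000000": ("running", "worker1"),
        "v9_chunk_000001": ("running", "worker2"),
        "v9_chunk_000002": ("pending", None),
    }
    live_counts = {"worker1": 23, "worker2": 0}
    print(find_stale_chunks(db_status, live_counts))
```

Chunks returned by this check are candidates for being reset before a retry, so that stale "running" entries do not block new assignments.
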
   - **Future Prevention:**
     * Document the minimum timeout formula: `base_timeout = (num_hops × 5s) + (process_startup × 2) + safety_margin(10s)`
     * For 2 hops with heavy startup (20s): `(2 × 5) + (20 × 2) + 10 = 60 seconds`
     * Monitor coordinator logs for timeout patterns and increase the timeout if they recur
     * Consider SSH multiplexing (ControlMaster) to speed up nested hops

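The timeout formula above translates directly into a small helper. This is a sketch only: the function name and keyword defaults are illustrative, with the 5s per hop and 10s safety margin taken from the formula as written.

```python
def min_ssh_timeout(num_hops, process_startup_s, per_hop_s=5, safety_margin_s=10):
    """base_timeout = (num_hops x 5s) + (process_startup x 2) + safety_margin(10s)."""
    return num_hops * per_hop_s + process_startup_s * 2 + safety_margin_s


if __name__ == "__main__":
    # Two hops with a 20-second process startup, as in the worked example.
    print(min_ssh_timeout(num_hops=2, process_startup_s=20))
```

For the two-hop, 20-second-startup case this reproduces the 60 seconds chosen in the fix.
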
## File Conventions

- **API routes:** `app/api/[feature]/[action]/route.ts` (Next.js 15 App Router)