mindesbunister
323ef03f5f
critical: Fix SSH timeout + resumption logic bugs
**SSH Command Fix:**
- CRITICAL: Removed the `&&` that followed the backgrounded command (`&`)
- Pattern: `cmd & echo Started` returns immediately; `cmd && echo Started` blocks until cmd exits, so the SSH call never came back
- Manually tested: works as expected over direct SSH
- Result: Chunk 0 now starts successfully on worker1 (24 processes running)
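A minimal sketch of the fixed launch pattern (the `start_remote_job` helper and its signature are illustrative, not the actual coordinator code):
```python
import subprocess

def start_remote_job(host: str, remote_cmd: str, timeout: int = 60) -> bool:
    # Background the job with '&' and echo a marker so SSH returns at once.
    # With '&&' the shell waits for remote_cmd to exit before echoing, so a
    # long-running backtest made the SSH call hang until the timeout fired.
    # Redirecting output matters too: a background child holding the stdout
    # pipe would otherwise keep the SSH session open.
    wrapped = f"nohup {remote_cmd} > /dev/null 2>&1 & echo Started"
    result = subprocess.run(
        ["ssh", host, wrapped],
        capture_output=True, text=True, timeout=timeout,
    )
    return "Started" in result.stdout
```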
**Resumption Logic Fix:**
- CRITICAL: Only count completed/running chunks, not pending
- Query: Added `AND status IN ('completed', 'running')` filter
- Result: Starts from chunk 0 when no chunks complete (was skipping to chunk 3)
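A sketch of the corrected resume logic, assuming a `chunks` table with a `status` column (schema names are assumptions):
```python
import sqlite3

def next_chunk_index(db_path: str) -> int:
    # Only chunks that completed or are actively running count as progress;
    # counting pending rows was what skipped the coordinator ahead to chunk 3.
    with sqlite3.connect(db_path) as con:
        (done,) = con.execute(
            "SELECT COUNT(*) FROM chunks"
            " WHERE status IN ('completed', 'running')"
        ).fetchone()
    return done  # 0 when nothing has run yet -> resume from chunk 0
```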
**Database Cleanup:**
- CRITICAL: Delete pending/failed chunks on coordinator start
- Prevents UNIQUE constraint errors on retry
- Result: Clean slate allows coordinator to assign chunks fresh
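Sketch of the startup cleanup, under the same assumed schema:
```python
import sqlite3

def clear_stale_chunks(con: sqlite3.Connection) -> None:
    # Leftover pending/failed rows collide with the fresh INSERTs on the
    # chunk's UNIQUE key, so drop them before reassigning chunks.
    con.execute("DELETE FROM chunks WHERE status IN ('pending', 'failed')")
    con.commit()
```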
**Verification:**
- ✅ Chunk v9_chunk_000000: status='running', assigned_worker='worker1'
- ✅ Worker1: 24 Python processes running backtester
- ✅ Database: Cleaned 3 pending chunks, created 1 running chunk
- ⚠️ Worker2: SSH hop still timing out (separate infrastructure issue)
Files changed:
- cluster/distributed_coordinator.py (3 critical fixes: lines 388-401, 514-533, 507-514)
2025-12-01 12:56:35 +01:00
mindesbunister
ef371a19b9
fix: EPYC cluster SSH timeout - increase timeout 30s→60s + add SSH options
CRITICAL FIX (Dec 1, 2025): Cluster start was failing with 'operation failed'
Problem:
- SSH commands timing out after 30s (too short for 2-hop SSH to worker2)
- Missing SSH options allowed interactive prompts (host-key checks) and connection delays
- Result: Coordinator failed to start worker processes
Solution:
- Increased timeout from 30s to 60s for nested SSH hops
- Added SSH options: -o StrictHostKeyChecking=no -o ConnectTimeout=10
- Applied options to both ssh_command() and worker startup commands
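Roughly what the hardened call looks like (the real `ssh_command()` in distributed_coordinator.py may have a different signature):
```python
import subprocess

SSH_OPTS = [
    "-o", "StrictHostKeyChecking=no",  # suppress the interactive host-key prompt
    "-o", "ConnectTimeout=10",         # fail fast on an unreachable hop
]

def ssh_command(host: str, cmd: str, timeout: int = 60) -> str:
    # The 60s outer timeout leaves headroom for the nested hop to worker2,
    # where each SSH leg adds its own connection latency.
    result = subprocess.run(
        ["ssh", *SSH_OPTS, host, cmd],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout
```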
Verification (Dec 1, 09:40):
- Worker1: 23 processes running (chunk 0-2000)
- Worker2: 24 processes running (chunk 2000-4000)
- Cluster status: ACTIVE with 2 workers
- Both chunks processing successfully
Files changed:
- cluster/distributed_coordinator.py (lines 302-314, 388-414)
2025-12-01 09:41:42 +01:00
mindesbunister
83b4915d98
fix: Reduce coordinator chunk_size from 10k to 2k for small explorations
- Changed default chunk_size from 10,000 to 2,000
- Fixes bug where the coordinator exited immediately for a 4,096-combo exploration
- Coordinator computed chunk 1's start offset as 10,000, saw it exceed the 4,096 total, and concluded 'all done'
- Now creates 2-3 appropriately-sized chunks for distribution (see the sketch below)
- Verified: Workers now start and process assigned chunks
- Status: ✅ Docker rebuilt and deployed to port 3001
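The chunking arithmetic behind the fix, sketched (function name illustrative):
```python
def make_chunks(total: int, chunk_size: int = 2_000) -> list[tuple[int, int]]:
    # Half-open [start, end) ranges over the combination space. With
    # chunk_size=10_000 the offset after chunk 0 was already past the
    # 4,096 total, which the resume check read as 'all done'.
    return [(s, min(s + chunk_size, total)) for s in range(0, total, chunk_size)]

# make_chunks(4_096) -> [(0, 2000), (2000, 4000), (4000, 4096)]
```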
2025-11-30 22:07:59 +01:00
mindesbunister
b77282b560
feat: Add EPYC cluster distributed sweep with web UI
New Features:
- Distributed coordinator orchestrates 2x AMD EPYC 16-core servers
- 64 total cores processing 12M parameter combinations (70% CPU limit)
- Worker1 (pve-nu-monitor01): Direct SSH access at 10.10.254.106
- Worker2 (bd-host01): 2-hop SSH through worker1 (10.20.254.100)
- Web UI at /cluster shows real-time status and AI recommendations
- API endpoint /api/cluster/status serves cluster metrics
- Auto-refresh every 30s with top strategies and actionable insights
Files Added:
- cluster/distributed_coordinator.py (510 lines) - Main orchestrator
- cluster/distributed_worker.py (271 lines) - Worker1 script
- cluster/distributed_worker_bd_clean.py (275 lines) - Worker2 script
- cluster/monitor_bd_host01.sh - Monitoring script
- app/api/cluster/status/route.ts (274 lines) - API endpoint
- app/cluster/page.tsx (258 lines) - Web UI
- cluster/CLUSTER_SETUP.md - Complete setup and access documentation
Technical Details:
- SQLite database tracks chunk assignments
- 10,000 combinations per chunk (1,195 total chunks)
- multiprocessing.Pool with a 70% CPU limit (22 workers per EPYC; see the sketch after this list)
- SSH/SCP for deployment and result collection
- Handles 2-hop SSH for bd-host01 access
- Results in CSV format with top strategies ranked
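The worker-pool sizing, sketched; `backtest_one` stands in for the real per-combination worker function, whose name is not shown here:
```python
import os
from multiprocessing import Pool

def backtest_one(combo):
    ...  # placeholder for the actual backtest of one parameter combo

def run_chunk(combos: list) -> list:
    # Cap the pool at ~70% of hardware threads so each EPYC host stays
    # responsive: int(0.70 * 32) = 22 workers per socket.
    workers = max(1, int((os.cpu_count() or 1) * 0.70))
    with Pool(processes=workers) as pool:
        return pool.map(backtest_one, combos)
```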
Access Documentation:
- Worker1: ssh root@10.10.254.106
- Worker2: ssh root@10.10.254.106 "ssh root@10.20.254.100" (2-hop; see the sketch below)
- Web UI: http://localhost:3001/cluster
- See CLUSTER_SETUP.md for complete guide
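Building the 2-hop invocation programmatically, as a sketch (`two_hop_cmd` is illustrative; shlex.quote keeps the inner command intact across both shell layers):
```python
import shlex

def two_hop_cmd(cmd: str) -> list[str]:
    # worker2 is only reachable through worker1, so nest one SSH call
    # inside another; the inner command needs one quoting layer per shell.
    inner = f"ssh root@10.20.254.100 {shlex.quote(cmd)}"
    return ["ssh", "root@10.10.254.106", inner]
```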
Status: Deployed and operational
2025-11-30 13:02:18 +01:00