Commit Graph

3 Commits

Author SHA1 Message Date
mindesbunister
ef371a19b9 fix: EPYC cluster SSH timeout - increase timeout 30s→60s + add SSH options
CRITICAL FIX (Dec 1, 2025): Cluster start was failing with 'operation failed'

Problem:
- SSH commands timing out after 30s (too short for 2-hop SSH to worker2)
- Missing SSH options caused prompts/delays
- Result: Coordinator failed to start worker processes

Solution:
- Increased timeout from 30s to 60s for nested SSH hops
- Added SSH options: -o StrictHostKeyChecking=no -o ConnectTimeout=10
- Applied options to both ssh_command() and worker startup commands

Verification (Dec 1, 09:40):
- Worker1: 23 processes running (chunk 0-2000)
- Worker2: 24 processes running (chunk 2000-4000)
- Cluster status: ACTIVE with 2 workers
- Both chunks processing successfully

Files changed:
- cluster/distributed_coordinator.py (lines 302-314, 388-414)
2025-12-01 09:41:42 +01:00
mindesbunister
83b4915d98 fix: Reduce coordinator chunk_size from 10k to 2k for small explorations
- Changed default chunk_size from 10,000 to 2,000
- Fixes bug where coordinator exited immediately for 4,096 combo exploration
- Coordinator was calculating: chunk 1 starts at 10,000 > 4,096 total = 'all done'
- Now creates 2-3 appropriately-sized chunks for distribution
- Verified: Workers now start and process assigned chunks
- Status:  Docker rebuilt and deployed to port 3001
2025-11-30 22:07:59 +01:00
mindesbunister
b77282b560 feat: Add EPYC cluster distributed sweep with web UI
New Features:
- Distributed coordinator orchestrates 2x AMD EPYC 16-core servers
- 64 total cores processing 12M parameter combinations (70% CPU limit)
- Worker1 (pve-nu-monitor01): Direct SSH access at 10.10.254.106
- Worker2 (bd-host01): 2-hop SSH through worker1 (10.20.254.100)
- Web UI at /cluster shows real-time status and AI recommendations
- API endpoint /api/cluster/status serves cluster metrics
- Auto-refresh every 30s with top strategies and actionable insights

Files Added:
- cluster/distributed_coordinator.py (510 lines) - Main orchestrator
- cluster/distributed_worker.py (271 lines) - Worker1 script
- cluster/distributed_worker_bd_clean.py (275 lines) - Worker2 script
- cluster/monitor_bd_host01.sh - Monitoring script
- app/api/cluster/status/route.ts (274 lines) - API endpoint
- app/cluster/page.tsx (258 lines) - Web UI
- cluster/CLUSTER_SETUP.md - Complete setup and access documentation

Technical Details:
- SQLite database tracks chunk assignments
- 10,000 combinations per chunk (1,195 total chunks)
- Multiprocessing.Pool with 70% CPU limit (22 cores per EPYC)
- SSH/SCP for deployment and result collection
- Handles 2-hop SSH for bd-host01 access
- Results in CSV format with top strategies ranked

Access Documentation:
- Worker1: ssh root@10.10.254.106
- Worker2: ssh root@10.10.254.106 "ssh root@10.20.254.100"
- Web UI: http://localhost:3001/cluster
- See CLUSTER_SETUP.md for complete guide

Status: Deployed and operational
2025-11-30 13:02:18 +01:00