mindesbunister
83b4915d98
fix: Reduce coordinator chunk_size from 10k to 2k for small explorations
...
- Changed default chunk_size from 10,000 to 2,000
- Fixes bug where coordinator exited immediately for 4,096 combo exploration
- Coordinator was calculating: chunk 1 starts at 10,000 > 4,096 total = 'all done'
- Now creates 2-3 appropriately-sized chunks for distribution
- Verified: Workers now start and process assigned chunks
- Status: ✅ Docker rebuilt and deployed to port 3001
2025-11-30 22:07:59 +01:00
mindesbunister
b77282b560
feat: Add EPYC cluster distributed sweep with web UI
...
New Features:
- Distributed coordinator orchestrates 2x AMD EPYC 16-core servers
- 64 total cores processing 12M parameter combinations (70% CPU limit)
- Worker1 (pve-nu-monitor01): Direct SSH access at 10.10.254.106
- Worker2 (bd-host01): 2-hop SSH through worker1 (10.20.254.100)
- Web UI at /cluster shows real-time status and AI recommendations
- API endpoint /api/cluster/status serves cluster metrics
- Auto-refresh every 30s with top strategies and actionable insights
Files Added:
- cluster/distributed_coordinator.py (510 lines) - Main orchestrator
- cluster/distributed_worker.py (271 lines) - Worker1 script
- cluster/distributed_worker_bd_clean.py (275 lines) - Worker2 script
- cluster/monitor_bd_host01.sh - Monitoring script
- app/api/cluster/status/route.ts (274 lines) - API endpoint
- app/cluster/page.tsx (258 lines) - Web UI
- cluster/CLUSTER_SETUP.md - Complete setup and access documentation
Technical Details:
- SQLite database tracks chunk assignments
- 10,000 combinations per chunk (1,195 total chunks)
- Multiprocessing.Pool with 70% CPU limit (22 cores per EPYC)
- SSH/SCP for deployment and result collection
- Handles 2-hop SSH for bd-host01 access
- Results in CSV format with top strategies ranked
Access Documentation:
- Worker1: ssh root@10.10.254.106
- Worker2: ssh root@10.10.254.106 "ssh root@10.20.254.100 "
- Web UI: http://localhost:3001/cluster
- See CLUSTER_SETUP.md for complete guide
Status: Deployed and operational
2025-11-30 13:02:18 +01:00
mindesbunister
2a8e04fe57
feat: Continuous optimization cluster for 2 EPYC servers
...
- Master controller with job queue and result aggregation
- Worker scripts for parallel backtesting (22 workers per server)
- SQLite database for strategy ranking and performance tracking
- File-based job queue (simple, robust, survives crashes)
- Auto-setup script for both EPYC servers
- Status dashboard for monitoring progress
- Comprehensive deployment guide
Architecture:
- Master: Job generation, worker coordination, result collection
- Worker 1 (pve-nu-monitor01): AMD EPYC 7282, 22 parallel jobs
- Worker 2 (srv-bd-host01): AMD EPYC 7302, 22 parallel jobs
- Total capacity: ~49,000 backtests/day (44 cores @ 70%)
Initial focus: v9 parameter refinement (27 configurations)
Target: Find strategies >00/1k P&L (current baseline 92/1k)
Files:
- cluster/master.py: Main controller (570 lines)
- cluster/worker.py: Worker execution script (220 lines)
- cluster/setup_cluster.sh: Automated deployment
- cluster/status.py: Real-time status dashboard
- cluster/README.md: Operational documentation
- cluster/DEPLOYMENT.md: Step-by-step deployment guide
2025-11-29 22:34:52 +01:00