trading_bot_v4/cluster/monitor_bd_host01.sh
mindesbunister b77282b560 feat: Add EPYC cluster distributed sweep with web UI
New Features:
- Distributed coordinator orchestrates 2x AMD EPYC 16-core servers
- 64 logical cores in total (32 threads per server) processing 12M parameter combinations, capped at 70% CPU
- Worker1 (pve-nu-monitor01): Direct SSH access at 10.10.254.106
- Worker2 (bd-host01): 2-hop SSH through worker1 to 10.20.254.100
- Web UI at /cluster shows real-time status and AI recommendations
- API endpoint /api/cluster/status serves cluster metrics
- Auto-refresh every 30s with top strategies and actionable insights

Files Added:
- cluster/distributed_coordinator.py (510 lines) - Main orchestrator
- cluster/distributed_worker.py (271 lines) - Worker1 script
- cluster/distributed_worker_bd_clean.py (275 lines) - Worker2 script
- cluster/monitor_bd_host01.sh - Monitoring script
- app/api/cluster/status/route.ts (274 lines) - API endpoint
- app/cluster/page.tsx (258 lines) - Web UI
- cluster/CLUSTER_SETUP.md - Complete setup and access documentation

Technical Details:
- SQLite database tracks chunk assignments
- 10,000 combinations per chunk (1,195 total chunks)
- multiprocessing.Pool capped at 70% CPU (22 worker processes per EPYC server)
- SSH/SCP for deployment and result collection
- Handles 2-hop SSH for bd-host01 access
- Results in CSV format with top strategies ranked
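The chunk and CPU figures above can be checked with a little integer arithmetic. This is a minimal sketch, not the coordinator's actual code: the function names `chunk_count` and `worker_limit` are hypothetical, the exact total of 11,950,000 combinations is an assumption chosen to match "12M" and 1,195 chunks, and 32 logical CPUs per server is an assumption (16 cores with SMT enabled).

```bash
#!/bin/bash
# Hypothetical helpers; names and inputs are illustrative assumptions.

# Ceiling division: number of chunks needed to cover all combinations.
chunk_count() {
    local total=$1 per_chunk=$2
    echo $(( (total + per_chunk - 1) / per_chunk ))
}

# Worker processes allowed under a percentage cap, rounded down, minimum 1.
worker_limit() {
    local logical_cpus=$1 percent=$2
    local n=$(( logical_cpus * percent / 100 ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

chunk_count 11950000 10000   # prints 1195 (assumed total, 10,000 per chunk)
worker_limit 32 70           # prints 22   (assumed 32 logical CPUs, 70% cap)
```

Rounding down keeps the pool at or below the cap; the minimum of 1 only matters on very small machines.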

Access Documentation:
- Worker1: ssh root@10.10.254.106
- Worker2: ssh root@10.10.254.106 "ssh root@10.20.254.100"
- Web UI: http://localhost:3001/cluster
- See CLUSTER_SETUP.md for complete guide
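As an alternative to nesting one `ssh` command inside another, OpenSSH's `-J` (ProxyJump) option can reach worker2 through worker1 in a single invocation. The sketch below is an assumption about how that could look, not the setup described in CLUSTER_SETUP.md: the helper name `bd_run` and the `SSH_BIN` override are hypothetical, and ProxyJump requires the local machine's key to be authorized on both hosts, which may not hold if worker2 only trusts worker1.

```bash
#!/bin/bash
# Hypothetical helper; host addresses are taken from the access notes above.
JUMP_HOST="root@10.10.254.106"    # worker1 (pve-nu-monitor01)
TARGET_HOST="root@10.20.254.100"  # worker2 (bd-host01)

# Run a command on worker2 via worker1 using ProxyJump (OpenSSH 7.3+).
# SSH_BIN can be overridden for a dry run that only prints the arguments.
bd_run() {
    "${SSH_BIN:-ssh}" -J "$JUMP_HOST" "$TARGET_HOST" "$@"
}

# Dry run: prints "-J root@10.10.254.106 root@10.20.254.100 uptime"
SSH_BIN=echo bd_run uptime
```

A real call such as `bd_run 'top -bn1 | grep "Cpu(s)"'` needs only one level of quoting, since there is a single remote shell to pass the command through.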

Status: Deployed and operational
2025-11-30 13:02:18 +01:00

37 lines | 1.2 KiB | Bash | Executable File

#!/bin/bash
# Monitor bd-host01 worker progress
echo "=================================="
echo "BD-HOST01 WORKER MONITOR"
echo "=================================="
echo
echo "=== CPU Usage ==="
ssh root@10.10.254.106 "ssh root@10.20.254.100 'top -bn1 | grep \"Cpu(s)\"'"
echo
echo "=== Load Average ==="
ssh root@10.10.254.106 "ssh root@10.20.254.100 'uptime'"
echo
echo "=== Worker Processes ==="
WORKER_COUNT=$(ssh root@10.10.254.106 "ssh root@10.20.254.100 'ps aux | grep distributed_worker | grep -v grep | wc -l'")
echo "Active workers: $WORKER_COUNT"
echo
echo "=== Output Files ==="
ssh root@10.10.254.106 "ssh root@10.20.254.100 'ls -lh /home/backtest_dual/backtest/chunk_*_results.csv 2>/dev/null || echo \"Still processing - no results file yet\"'"
echo
echo "=== Latest Log Lines ==="
ssh root@10.10.254.106 "ssh root@10.20.254.100 'tail -10 /tmp/v9_chunk_000000.log'"
echo
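# An empty count (for example after an SSH failure) is treated as zero workers,
# so the test below never fails with "integer expression expected".
if [ "${WORKER_COUNT:-0}" -eq 0 ]; then
    echo "⚠️ Worker finished or crashed (or host unreachable)!"
    echo "Check full log: ssh root@10.10.254.106 \"ssh root@10.20.254.100 'cat /tmp/v9_chunk_000000.log'\""
else
    echo "✅ Worker is running - processing 10,000 parameter combinations"
    echo "   This will take 10-30 minutes depending on complexity"
fi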
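
The final status check depends on the remote process count being a clean integer. One possible way to make that explicit is to classify the raw value before acting on it. This is a sketch only: the function name `worker_state` and the three state labels are hypothetical and not part of the script above.

```bash
#!/bin/bash
# Hypothetical helper: classify the raw output of the remote process count.
worker_state() {
    local count=$1
    case "$count" in
        ''|*[!0-9]*)
            # Empty or non-numeric: the SSH call probably failed.
            echo "unknown"
            return
            ;;
    esac
    if [ "$count" -eq 0 ]; then
        echo "stopped"
    else
        echo "running"
    fi
}

worker_state ""     # prints unknown
worker_state 0      # prints stopped
worker_state 22     # prints running
```

Separating "unknown" from "stopped" avoids reporting a crash when the host was simply unreachable.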