New Features:
- Distributed coordinator orchestrates 2x AMD EPYC 16-core servers
- 64 total cores processing 12M parameter combinations (70% CPU limit)
- Worker1 (pve-nu-monitor01): direct SSH access at 10.10.254.106
- Worker2 (bd-host01): 2-hop SSH through Worker1 (10.20.254.100)
- Web UI at /cluster shows real-time status and AI recommendations
- API endpoint /api/cluster/status serves cluster metrics
- Auto-refresh every 30s with top strategies and actionable insights

Files Added:
- cluster/distributed_coordinator.py (510 lines) - main orchestrator
- cluster/distributed_worker.py (271 lines) - Worker1 script
- cluster/distributed_worker_bd_clean.py (275 lines) - Worker2 script
- cluster/monitor_bd_host01.sh - monitoring script
- app/api/cluster/status/route.ts (274 lines) - API endpoint
- app/cluster/page.tsx (258 lines) - web UI
- cluster/CLUSTER_SETUP.md - complete setup and access documentation

Technical Details:
- SQLite database tracks chunk assignments
- 10,000 combinations per chunk (1,195 total chunks)
- multiprocessing.Pool with 70% CPU limit (22 cores per EPYC)
- SSH/SCP for deployment and result collection
- Handles 2-hop SSH for bd-host01 access
- Results in CSV format with top strategies ranked

Access Documentation:
- Worker1: ssh root@10.10.254.106
- Worker2: ssh root@10.10.254.106 "ssh root@10.20.254.100"
- Web UI: http://localhost:3001/cluster
- See CLUSTER_SETUP.md for complete guide

Status: Deployed and operational
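The chunk-size and CPU-cap arithmetic above can be sketched in a few lines. This is an illustrative sketch only, not the actual distributed_coordinator.py: the function names are hypothetical, and only the 10,000-per-chunk size and 70% cap come from the notes. Note that 1,195 chunks of 10,000 implies roughly 11.95M combinations, so the "12M" figure is presumably rounded.

```python
import math

CHUNK_SIZE = 10_000   # combinations per chunk, per the notes above
CPU_LIMIT = 0.70      # cap each worker host at 70% of its cores

def pool_size(total_cores: int, limit: float = CPU_LIMIT) -> int:
    """Number of worker processes to run on one host under the CPU cap."""
    return max(1, int(total_cores * limit))

def chunk_count(total_combinations: int, chunk_size: int = CHUNK_SIZE) -> int:
    """How many chunks a parameter sweep splits into."""
    return math.ceil(total_combinations / chunk_size)

if __name__ == "__main__":
    # 32 hardware threads per EPYC at a 70% cap -> 22 processes,
    # matching the "22 cores per EPYC" figure above.
    print(pool_size(32))            # 22
    # 1,195 chunks of 10,000 corresponds to ~11.95M combinations.
    print(chunk_count(11_950_000))  # 1195
```

A pool capped this way would then be created as `multiprocessing.Pool(processes=pool_size(32))` on each worker.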
#!/bin/bash
# Monitor bd-host01 worker progress

echo "=================================="
echo "BD-HOST01 WORKER MONITOR"
echo "=================================="
echo

echo "=== CPU Usage ==="
ssh root@10.10.254.106 "ssh root@10.20.254.100 'top -bn1 | grep \"Cpu(s)\"'"
echo

echo "=== Load Average ==="
ssh root@10.10.254.106 "ssh root@10.20.254.100 'uptime'"
echo

echo "=== Worker Processes ==="
WORKER_COUNT=$(ssh root@10.10.254.106 "ssh root@10.20.254.100 'ps aux | grep distributed_worker | grep -v grep | wc -l'")
echo "Active workers: $WORKER_COUNT"
echo

echo "=== Output Files ==="
ssh root@10.10.254.106 "ssh root@10.20.254.100 'ls -lh /home/backtest_dual/backtest/chunk_*_results.csv 2>/dev/null || echo \"Still processing - no results file yet\"'"
echo

echo "=== Latest Log Lines ==="
ssh root@10.10.254.106 "ssh root@10.20.254.100 'tail -10 /tmp/v9_chunk_000000.log'"
echo

if [ "$WORKER_COUNT" -eq 0 ]; then
    echo "⚠️  Worker finished or crashed!"
    echo "Check full log: ssh root@10.10.254.106 \"ssh root@10.20.254.100 'cat /tmp/v9_chunk_000000.log'\""
else
    echo "✅ Worker is running - processing 10,000 parameter combinations"
    echo "   This will take 10-30 minutes depending on complexity"
fi
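The nested `ssh root@... "ssh root@... '...'"` pattern in the script above is easy to get wrong by hand because the inner command must survive two layers of quoting. A small Python helper in the same spirit can build the two-hop command safely; this is a hypothetical sketch (the helper names are invented, only the host addresses come from the notes):

```python
import shlex
import subprocess

HOP1 = "root@10.10.254.106"  # Worker1 (pve-nu-monitor01)
HOP2 = "root@10.20.254.100"  # Worker2 (bd-host01), reachable only via Worker1

def two_hop_argv(remote_cmd: str) -> list[str]:
    """Build the argv for running remote_cmd on Worker2 via Worker1.

    shlex.quote protects the inner command through the second hop's
    shell, reproducing the nested-ssh pattern from the monitor script.
    """
    inner = f"ssh {HOP2} {shlex.quote(remote_cmd)}"
    return ["ssh", HOP1, inner]

def run_on_worker2(remote_cmd: str) -> str:
    """Execute remote_cmd on Worker2 and return its stdout."""
    result = subprocess.run(
        two_hop_argv(remote_cmd), capture_output=True, text=True, check=True
    )
    return result.stdout

if __name__ == "__main__":
    print(two_hop_argv('top -bn1 | grep "Cpu(s)"'))
```

If the local OpenSSH is recent enough, `ssh -J root@10.10.254.106 root@10.20.254.100 <cmd>` (ProxyJump) achieves the same two-hop access without any nested quoting.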