feat: Add EPYC cluster distributed sweep with web UI
New Features:
- Distributed coordinator orchestrates 2x AMD EPYC 16-core servers
- 64 total cores processing 12M parameter combinations (70% CPU limit)
- Worker1 (pve-nu-monitor01): direct SSH access at 10.10.254.106
- Worker2 (bd-host01): 2-hop SSH through worker1 (10.20.254.100)
- Web UI at /cluster shows real-time status and AI recommendations
- API endpoint /api/cluster/status serves cluster metrics
- Auto-refresh every 30s with top strategies and actionable insights

Files Added:
- cluster/distributed_coordinator.py (510 lines) - main orchestrator
- cluster/distributed_worker.py (271 lines) - worker1 script
- cluster/distributed_worker_bd_clean.py (275 lines) - worker2 script
- cluster/monitor_bd_host01.sh - monitoring script
- app/api/cluster/status/route.ts (274 lines) - API endpoint
- app/cluster/page.tsx (258 lines) - web UI
- cluster/CLUSTER_SETUP.md - complete setup and access documentation

Technical Details:
- SQLite database tracks chunk assignments
- 10,000 combinations per chunk (1,195 total chunks)
- multiprocessing.Pool with 70% CPU limit (22 cores per EPYC)
- SSH/SCP for deployment and result collection
- Handles 2-hop SSH for bd-host01 access
- Results in CSV format with top strategies ranked

Access Documentation:
- Worker1: ssh root@10.10.254.106
- Worker2: ssh root@10.10.254.106 "ssh root@10.20.254.100"
- Web UI: http://localhost:3001/cluster
- See CLUSTER_SETUP.md for the complete guide

Status: Deployed and operational
cluster/CLUSTER_SETUP.md (new file, 339 lines)
# EPYC Cluster Setup and Access Guide

## Overview

Two AMD EPYC 16-core servers running distributed parameter exploration for trading bot optimization.

**Total Capacity:** 64 cores processing 12M parameter combinations

---

## Server Access
### Worker1: pve-nu-monitor01 (Direct SSH)

```bash
# Direct access from srvdocker02
ssh root@10.10.254.106
```

**Specs:**
- Hostname: pve-nu-monitor01
- IP: 10.10.254.106
- CPU: AMD EPYC 7282 16-Core Processor (32 threads with SMT)
- Location: /home/comprehensive_sweep/backtester/
### Worker2: bd-host01 (SSH Hop Required)

```bash
# Access via 2-hop through worker1
ssh root@10.10.254.106 "ssh root@10.20.254.100 'COMMAND'"

# SCP via 2-hop
scp FILE root@10.10.254.106:/tmp/
ssh root@10.10.254.106 "scp /tmp/FILE root@10.20.254.100:/path/"
```

**Specs:**
- Hostname: bd-host01
- IP: 10.20.254.100 (only reachable from worker1)
- CPU: AMD EPYC 7282 16-Core Processor (32 threads with SMT)
- Location: /home/backtest_dual/backtest/
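When scripting against worker2, the 2-hop pattern is easy to get wrong because the inner command must survive two shells. A minimal Python sketch (the helper names are illustrative, not part of the actual worker scripts; hosts are the ones from this guide):

```python
import shlex
import subprocess

WORKER1 = "root@10.10.254.106"
WORKER2 = "root@10.20.254.100"

def two_hop_argv(command):
    """Build the argv for running `command` on bd-host01 via worker1."""
    # Quote the inner command so it survives the second SSH layer intact.
    return ["ssh", WORKER1, f"ssh {WORKER2} {shlex.quote(command)}"]

def run_on_worker2(command, timeout=60):
    """Run a command on bd-host01 by hopping through worker1."""
    result = subprocess.run(two_hop_argv(command),
                            capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        raise RuntimeError(f"2-hop SSH failed: {result.stderr.strip()}")
    return result.stdout
```

Usage would be e.g. `run_on_worker2("hostname")`; the quoting step is what allows commands containing spaces or pipes to pass through both SSH layers unchanged.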
### Coordinator: srvdocker02 (Local)

```bash
# Running on the trading bot server
cd /home/icke/traderv4/cluster/
```

**Specs:**
- Hostname: srvdocker02
- Role: Orchestrates the distributed sweep, hosts the trading bot
- Database: SQLite at /home/icke/traderv4/cluster/exploration.db

---

## Directory Structure
### Worker1 Structure
```
/home/comprehensive_sweep/backtester/
├── data/
│   └── solusdt_5m_aug_nov.csv      # OHLCV data
├── indicators/
│   └── money_line.py               # Money Line indicator
├── scripts/
│   └── distributed_worker.py       # Worker script
├── simulator.py                    # Backtesting engine
├── data_loader.py                  # Data loading utilities
└── .venv/                          # Python environment
```
### Worker2 Structure
```
/home/backtest_dual/backtest/
├── backtester/
│   ├── data/
│   │   └── solusdt_5m.csv          # OHLCV data (copied from worker1)
│   ├── indicators/
│   │   └── money_line.py
│   ├── scripts/
│   │   └── distributed_worker.py   # Modified for bd-host01
│   ├── simulator.py
│   └── data_loader.py
└── .venv/                          # Python environment
```
### Coordinator Structure
```
/home/icke/traderv4/cluster/
├── distributed_coordinator.py      # Main orchestrator
├── distributed_worker.py           # Worker script (template for worker1)
├── distributed_worker_bd_clean.py  # Worker script (template for worker2)
├── monitor_bd_host01.sh            # Monitoring script
├── exploration.db                  # Chunk tracking database
└── chunk_*.json                    # Chunk specifications
```

---

## How It Works
### 1. Coordinator (srvdocker02)
- Splits 12M parameter space into chunks (10,000 combos each)
- Stores chunk assignments in SQLite database
- Deploys chunk specs and worker scripts via SSH/SCP
- Starts workers via SSH with nohup (background execution)
- Monitors chunk completion and collects results
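The chunk-assignment bookkeeping described above can be sketched with the stdlib `sqlite3` module. This is an illustrative schema only; the real table layout lives in `distributed_coordinator.py` and may differ:

```python
import sqlite3

def init_db(path="exploration.db"):
    """Create the chunk-tracking table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            chunk_id   TEXT PRIMARY KEY,        -- e.g. v9_chunk_000001
            worker     TEXT,                    -- pve-nu-monitor01 / bd-host01
            status     TEXT DEFAULT 'pending',  -- pending / running / done
            result_csv TEXT                     -- path once collected
        )""")
    conn.commit()
    return conn

def assign_chunk(conn, worker):
    """Hand the next pending chunk to a worker; None when all are assigned."""
    row = conn.execute(
        "SELECT chunk_id FROM chunks WHERE status='pending' "
        "ORDER BY chunk_id LIMIT 1").fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE chunks SET status='running', worker=? WHERE chunk_id=?",
        (worker, row[0]))
    conn.commit()
    return row[0]
```

Because all state sits in this one database file, a restarted coordinator can resume from whatever `status` values it finds.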
### 2. Workers (EPYCs)
- Each processes assigned chunks independently
- Uses multiprocessing.Pool with **70% CPU limit** (22 cores)
- Outputs results to CSV files in their workspace
- Logs progress to /tmp/v9_chunk_XXXXXX.log
### 3. Results Collection
- Workers save to: `chunk_v9_chunk_XXXXXX_results.csv`
- Coordinator can fetch results via SCP
- Trading bot API endpoint serves results to the web UI
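Once chunk CSVs have been fetched locally, merging and ranking them is a short script. A sketch assuming each results CSV carries a `net_pnl` column (the column name is an assumption, not confirmed by the worker scripts):

```python
import csv
import glob

def top_strategies(pattern="chunk_v9_chunk_*_results.csv", n=5):
    """Merge all chunk result CSVs and return the n best rows by net_pnl."""
    rows = []
    for path in glob.glob(pattern):
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))       # one dict per strategy row
    rows.sort(key=lambda r: float(r["net_pnl"]), reverse=True)
    return rows[:n]
```

This is essentially what the API endpoint does server-side before handing the top strategies to the web UI.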
---

## Common Operations

### Start Distributed Sweep
```bash
cd /home/icke/traderv4/cluster/

# Clear old chunks and start fresh
rm -f exploration.db
nohup python3 distributed_coordinator.py > sweep.log 2>&1 &

# Monitor progress
tail -f sweep.log
```

### Monitor Worker Status
```bash
# Check worker1
ssh root@10.10.254.106 "top -bn1 | grep Cpu && ps aux | grep distributed_worker | wc -l"

# Check worker2 (via hop)
ssh root@10.10.254.106 "ssh root@10.20.254.100 'top -bn1 | grep Cpu && ps aux | grep distributed_worker | wc -l'"

# Use the monitoring script
/home/icke/traderv4/cluster/monitor_bd_host01.sh
```
### Fetch Results
```bash
# Worker1 results
scp root@10.10.254.106:/home/comprehensive_sweep/backtester/chunk_*_results.csv ./

# Worker2 results (2-hop)
ssh root@10.10.254.106 "scp root@10.20.254.100:/home/backtest_dual/backtest/chunk_*_results.csv /tmp/"
scp root@10.10.254.106:/tmp/chunk_*_results.csv ./
```

### View Results in Web UI
Access the cluster status page at http://localhost:3001/cluster or https://tradervone.v4.dedyn.io/cluster.

It shows:
- Real-time CPU usage and worker status
- Exploration progress
- Top 5 strategies with parameters
- AI recommendations for next actions

### Kill All Workers
```bash
# Kill worker1
ssh root@10.10.254.106 "pkill -f distributed_worker"

# Kill worker2
ssh root@10.10.254.106 "ssh root@10.20.254.100 'pkill -f distributed_worker'"

# Kill coordinator
pkill -f distributed_coordinator
```
---

## CPU Limit Configuration

### Why 70%?
- Prevents server overload
- Leaves headroom for system operations
- Balances throughput vs stability

### Implementation
Both worker scripts limit CPU via multiprocessing.Pool:
```python
# In distributed_worker.py and distributed_worker_bd_clean.py
import multiprocessing as mp

num_workers = mp.cpu_count()                  # 32 threads per EPYC
max_workers = max(1, int(num_workers * 0.7))  # 70% of 32 cores = 22

with mp.Pool(processes=max_workers) as pool:
    # run_backtest and combinations are defined earlier in the worker script
    results = pool.map(run_backtest, combinations)
```

**Expected CPU Usage:** 67-72% user time on each EPYC

---

## Troubleshooting
### Worker Not Starting
```bash
# Check worker logs
ssh root@10.10.254.106 "tail -100 /tmp/v9_chunk_*.log"
ssh root@10.10.254.106 "ssh root@10.20.254.100 'tail -100 /tmp/v9_chunk_*.log'"

# Common issues:
# 1. Import errors - check sys.path and module structure
# 2. Data file missing - verify solusdt_5m*.csv exists
# 3. Virtual env activation failed - check .venv/bin/activate path
```

### SSH Hop Issues (Worker2)
```bash
# Test 2-hop connectivity
ssh root@10.10.254.106 "ssh root@10.20.254.100 'echo SUCCESS'"

# If it fails, check:
# - Worker1 can reach worker2: ssh root@10.10.254.106 "ping -c 3 10.20.254.100"
# - SSH keys are set up between worker1 and worker2
```

### Python Bytecode Cache Issues
```bash
# Clear .pyc files if code changes don't take effect
find /home/icke/traderv4/cluster -name "*.pyc" -delete
find /home/icke/traderv4/cluster -name "__pycache__" -type d -exec rm -rf {} +
```

### Database Lock Issues
```bash
# If the coordinator fails to start due to a DB lock
cd /home/icke/traderv4/cluster/
pkill -f distributed_coordinator  # Kill any running coordinators
rm -f exploration.db              # Delete the database
# Then restart the coordinator
```

---
## Parameter Space

**Total Combinations:** 11,943,936

**14 Parameters:**
1. flip_threshold: 0.4, 0.5, 0.6, 0.7 (4 values)
2. ma_gap: 0.20, 0.30, 0.40, 0.50 (4 values)
3. adx_min: 18, 21, 24, 27 (4 values)
4. long_pos_max: 60, 65, 70, 75 (4 values)
5. short_pos_min: 20, 25, 30, 35 (4 values)
6. cooldown: 1, 2, 3, 4 (4 values)
7. position_size: 0.1-1.0 in 0.1 increments (10 values)
8. tp1_mult: 1.5-3.0 in 0.5 increments (4 values)
9. tp2_mult: 3.0-6.0 in 1.0 increments (4 values)
10. sl_mult: 2.0-4.0 in 0.5 increments (5 values)
11. tp1_close_pct: 0.5-0.8 in 0.1 increments (4 values)
12. trailing_mult: 1.0-2.5 in 0.5 increments (4 values)
13. vol_min: 0.8-1.4 in 0.2 increments (4 values)
14. max_bars: 10, 15, 20, 25 (4 values)

**Chunk Size:** 10,000 combinations
**Total Chunks:** 1,195
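The grid enumeration and chunking above can be sketched with `itertools`. Only three of the 14 parameters are shown to keep the example small; the chunking mechanics are the same at full scale:

```python
import itertools

# Illustrative subset of the 14-parameter grid listed above.
GRID = {
    "flip_threshold": [0.4, 0.5, 0.6, 0.7],
    "adx_min": [18, 21, 24, 27],
    "cooldown": [1, 2, 3, 4],
}

def combinations(grid):
    """Yield one dict per parameter combination (Cartesian product)."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def chunks(grid, size=10_000):
    """Group combinations into fixed-size chunks for worker assignment."""
    it = combinations(grid)
    while chunk := list(itertools.islice(it, size)):
        yield chunk
```

Because `combinations` is a generator, the full grid never sits in memory at once; the coordinator only materializes one chunk at a time before writing its spec to JSON.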
---

## Web UI Integration

### API Endpoint
```typescript
// GET /api/cluster/status
// Returns:
{
  cluster: {
    totalCores: 64,
    activeCores: 45,
    cpuUsage: 70.5,
    activeWorkers: 2,
    status: "active"
  },
  workers: [...],
  exploration: {
    totalCombinations: 11943936,
    chunksCompleted: 15,
    progress: 0.0126
  },
  topStrategies: [...],
  recommendation: "AI-generated action items"
}
```
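The endpoint can be polled from scripts as well as from the UI. A minimal stdlib client, assuming the local port from this guide and the response shape shown above:

```python
import json
import urllib.request

def cluster_status(base_url="http://localhost:3001"):
    """Fetch /api/cluster/status and return the decoded JSON payload."""
    with urllib.request.urlopen(f"{base_url}/api/cluster/status",
                                timeout=10) as resp:
        return json.load(resp)

def progress_pct(status):
    """Exploration progress as a percentage (the API reports a 0-1 fraction)."""
    return status["exploration"]["progress"] * 100
```

For example, a cron job could call `cluster_status()` and alert when `progress_pct` stalls between runs.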
### Frontend Page
- Location: `/home/icke/traderv4/app/cluster/page.tsx`
- Auto-refreshes every 30 seconds
- Shows real-time cluster status
- Displays top strategies with parameters
- Provides AI recommendations

---
## Files Created/Modified

**New Files:**
- `cluster/distributed_coordinator.py` - Main orchestrator (510 lines)
- `cluster/distributed_worker.py` - Worker script for worker1 (271 lines)
- `cluster/distributed_worker_bd_clean.py` - Worker script for worker2 (275 lines)
- `cluster/monitor_bd_host01.sh` - Monitoring script
- `app/api/cluster/status/route.ts` - API endpoint for the web UI (274 lines)
- `app/cluster/page.tsx` - Web UI page (258 lines)
- `cluster/CLUSTER_SETUP.md` - This documentation

**Modified:**
- Docker image rebuilt to include the new API endpoint and cluster page

---
## Next Steps

1. **Monitor first chunk completion** (~10-30 min)
2. **Analyze top strategies** via web UI at `/cluster`
3. **Scale to full sweep** - all 1,195 chunks across both EPYCs
4. **Implement best parameters** in production trading bot
5. **Iterate** - refine grid based on results

---
## Notes

- **70% CPU limit ensures system stability** while maximizing throughput
- **Coordinator keeps no in-memory state** - all state lives in SQLite, so it can be restarted anytime
- **Workers are autonomous** - they process chunks independently, with no inter-worker coordination needed
- **Results are immutable** - each chunk produces one CSV, never overwritten
- **Web UI provides actionable insights** - no manual CSV analysis needed

**Last Updated:** November 30, 2025