feat: Add EPYC cluster distributed sweep with web UI
New Features:
- Distributed coordinator orchestrates 2x AMD EPYC 16-core servers
- 64 total cores processing 12M parameter combinations (70% CPU limit)
- Worker1 (pve-nu-monitor01): direct SSH access at 10.10.254.106
- Worker2 (bd-host01): 2-hop SSH through worker1 (10.20.254.100)
- Web UI at /cluster shows real-time status and AI recommendations
- API endpoint /api/cluster/status serves cluster metrics
- Auto-refresh every 30s with top strategies and actionable insights

Files Added:
- cluster/distributed_coordinator.py (510 lines) - main orchestrator
- cluster/distributed_worker.py (271 lines) - worker1 script
- cluster/distributed_worker_bd_clean.py (275 lines) - worker2 script
- cluster/monitor_bd_host01.sh - monitoring script
- app/api/cluster/status/route.ts (274 lines) - API endpoint
- app/cluster/page.tsx (258 lines) - web UI
- cluster/CLUSTER_SETUP.md - complete setup and access documentation

Technical Details:
- SQLite database tracks chunk assignments
- 10,000 combinations per chunk (1,195 total chunks)
- multiprocessing.Pool with 70% CPU limit (22 cores per EPYC)
- SSH/SCP for deployment and result collection
- Handles 2-hop SSH for bd-host01 access
- Results in CSV format with top strategies ranked

Access Documentation:
- Worker1: ssh root@10.10.254.106
- Worker2: ssh root@10.10.254.106 "ssh root@10.20.254.100"
- Web UI: http://localhost:3001/cluster
- See CLUSTER_SETUP.md for the complete guide

Status: Deployed and operational
cluster/CLUSTER_SETUP.md (new file, 339 lines)
# EPYC Cluster Setup and Access Guide

## Overview

Two AMD EPYC 16-core servers running distributed parameter exploration for trading bot optimization.

**Total Capacity:** 64 cores processing 12M parameter combinations

---

## Server Access
### Worker1: pve-nu-monitor01 (Direct SSH)

```bash
# Direct access from srvdocker02
ssh root@10.10.254.106
```

**Specs:**
- Hostname: pve-nu-monitor01
- IP: 10.10.254.106
- CPU: AMD EPYC 7282 16-Core Processor (32 threads with SMT)
- Location: /home/comprehensive_sweep/backtester/
### Worker2: bd-host01 (SSH Hop Required)

```bash
# Access via 2-hop through worker1
ssh root@10.10.254.106 "ssh root@10.20.254.100 'COMMAND'"

# SCP via 2-hop
scp FILE root@10.10.254.106:/tmp/
ssh root@10.10.254.106 "scp /tmp/FILE root@10.20.254.100:/path/"
```

**Specs:**
- Hostname: bd-host01
- IP: 10.20.254.100 (only reachable from worker1)
- CPU: AMD EPYC 7282 16-Core Processor (32 threads with SMT)
- Location: /home/backtest_dual/backtest/
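When scripting against worker2, the 2-hop pattern is easy to get wrong because the inner command must survive two shells. A minimal Python sketch (the helper names are illustrative, not part of the actual worker scripts; hosts are the ones from this guide):

```python
import shlex
import subprocess

WORKER1 = "root@10.10.254.106"
WORKER2 = "root@10.20.254.100"

def two_hop_argv(command):
    """Build the argv for running `command` on bd-host01 via worker1."""
    # Quote the inner command so it survives the second SSH layer intact.
    return ["ssh", WORKER1, f"ssh {WORKER2} {shlex.quote(command)}"]

def run_on_worker2(command, timeout=60):
    """Run a command on bd-host01 by hopping through worker1."""
    result = subprocess.run(two_hop_argv(command),
                            capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        raise RuntimeError(f"2-hop SSH failed: {result.stderr.strip()}")
    return result.stdout
```

Usage would be e.g. `run_on_worker2("hostname")`; the quoting step is what allows commands containing spaces or pipes to pass through both SSH layers unchanged.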
### Coordinator: srvdocker02 (Local)

```bash
# Running on the trading bot server
cd /home/icke/traderv4/cluster/
```

**Specs:**
- Hostname: srvdocker02
- Role: Orchestrates the distributed sweep, hosts the trading bot
- Database: SQLite at /home/icke/traderv4/cluster/exploration.db

---

## Directory Structure
### Worker1 Structure
```
/home/comprehensive_sweep/backtester/
├── data/
│   └── solusdt_5m_aug_nov.csv      # OHLCV data
├── indicators/
│   └── money_line.py               # Money Line indicator
├── scripts/
│   └── distributed_worker.py       # Worker script
├── simulator.py                    # Backtesting engine
├── data_loader.py                  # Data loading utilities
└── .venv/                          # Python environment
```
### Worker2 Structure
```
/home/backtest_dual/backtest/
├── backtester/
│   ├── data/
│   │   └── solusdt_5m.csv          # OHLCV data (copied from worker1)
│   ├── indicators/
│   │   └── money_line.py
│   ├── scripts/
│   │   └── distributed_worker.py   # Modified for bd-host01
│   ├── simulator.py
│   └── data_loader.py
└── .venv/                          # Python environment
```
### Coordinator Structure
```
/home/icke/traderv4/cluster/
├── distributed_coordinator.py      # Main orchestrator
├── distributed_worker.py           # Worker script (template for worker1)
├── distributed_worker_bd_clean.py  # Worker script (template for worker2)
├── monitor_bd_host01.sh            # Monitoring script
├── exploration.db                  # Chunk tracking database
└── chunk_*.json                    # Chunk specifications
```

---

## How It Works
### 1. Coordinator (srvdocker02)
- Splits 12M parameter space into chunks (10,000 combos each)
- Stores chunk assignments in SQLite database
- Deploys chunk specs and worker scripts via SSH/SCP
- Starts workers via SSH with nohup (background execution)
- Monitors chunk completion and collects results
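The chunk-assignment bookkeeping described above can be sketched with the stdlib `sqlite3` module. This is an illustrative schema only; the real table layout lives in `distributed_coordinator.py` and may differ:

```python
import sqlite3

def init_db(path="exploration.db"):
    """Create the chunk-tracking table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            chunk_id   TEXT PRIMARY KEY,        -- e.g. v9_chunk_000001
            worker     TEXT,                    -- pve-nu-monitor01 / bd-host01
            status     TEXT DEFAULT 'pending',  -- pending / running / done
            result_csv TEXT                     -- path once collected
        )""")
    conn.commit()
    return conn

def assign_chunk(conn, worker):
    """Hand the next pending chunk to a worker; None when all are assigned."""
    row = conn.execute(
        "SELECT chunk_id FROM chunks WHERE status='pending' "
        "ORDER BY chunk_id LIMIT 1").fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE chunks SET status='running', worker=? WHERE chunk_id=?",
        (worker, row[0]))
    conn.commit()
    return row[0]
```

Because all state sits in this one database file, a restarted coordinator can resume from whatever `status` values it finds.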
### 2. Workers (EPYCs)
- Each processes assigned chunks independently
- Uses multiprocessing.Pool with **70% CPU limit** (22 cores)
- Outputs results to CSV files in their workspace
- Logs progress to /tmp/v9_chunk_XXXXXX.log
### 3. Results Collection
- Workers save to: `chunk_v9_chunk_XXXXXX_results.csv`
- Coordinator can fetch results via SCP
- Trading bot API endpoint serves results to the web UI
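Once chunk CSVs have been fetched locally, merging and ranking them is a short script. A sketch assuming each results CSV carries a `net_pnl` column (the column name is an assumption, not confirmed by the worker scripts):

```python
import csv
import glob

def top_strategies(pattern="chunk_v9_chunk_*_results.csv", n=5):
    """Merge all chunk result CSVs and return the n best rows by net_pnl."""
    rows = []
    for path in glob.glob(pattern):
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))       # one dict per strategy row
    rows.sort(key=lambda r: float(r["net_pnl"]), reverse=True)
    return rows[:n]
```

This is essentially what the API endpoint does server-side before handing the top strategies to the web UI.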
---

## Common Operations

### Start Distributed Sweep
```bash
cd /home/icke/traderv4/cluster/

# Clear old chunks and start fresh
rm -f exploration.db
nohup python3 distributed_coordinator.py > sweep.log 2>&1 &

# Monitor progress
tail -f sweep.log
```

### Monitor Worker Status
```bash
# Check worker1
ssh root@10.10.254.106 "top -bn1 | grep Cpu && ps aux | grep distributed_worker | wc -l"

# Check worker2 (via hop)
ssh root@10.10.254.106 "ssh root@10.20.254.100 'top -bn1 | grep Cpu && ps aux | grep distributed_worker | wc -l'"

# Use the monitoring script
/home/icke/traderv4/cluster/monitor_bd_host01.sh
```
### Fetch Results
```bash
# Worker1 results
scp root@10.10.254.106:/home/comprehensive_sweep/backtester/chunk_*_results.csv ./

# Worker2 results (2-hop)
ssh root@10.10.254.106 "scp root@10.20.254.100:/home/backtest_dual/backtest/chunk_*_results.csv /tmp/"
scp root@10.10.254.106:/tmp/chunk_*_results.csv ./
```

### View Results in Web UI
Access the cluster status page at http://localhost:3001/cluster or https://tradervone.v4.dedyn.io/cluster.

It shows:
- Real-time CPU usage and worker status
- Exploration progress
- Top 5 strategies with parameters
- AI recommendations for next actions

### Kill All Workers
```bash
# Kill worker1
ssh root@10.10.254.106 "pkill -f distributed_worker"

# Kill worker2
ssh root@10.10.254.106 "ssh root@10.20.254.100 'pkill -f distributed_worker'"

# Kill coordinator
pkill -f distributed_coordinator
```
---

## CPU Limit Configuration

### Why 70%?
- Prevents server overload
- Leaves headroom for system operations
- Balances throughput vs stability

### Implementation
Both worker scripts limit CPU via multiprocessing.Pool:
```python
# In distributed_worker.py and distributed_worker_bd_clean.py
import multiprocessing as mp

num_workers = mp.cpu_count()                  # 32 threads per EPYC
max_workers = max(1, int(num_workers * 0.7))  # 70% of 32 cores = 22

with mp.Pool(processes=max_workers) as pool:
    # run_backtest and combinations are defined earlier in the worker script
    results = pool.map(run_backtest, combinations)
```

**Expected CPU Usage:** 67-72% user time on each EPYC

---

## Troubleshooting
### Worker Not Starting
```bash
# Check worker logs
ssh root@10.10.254.106 "tail -100 /tmp/v9_chunk_*.log"
ssh root@10.10.254.106 "ssh root@10.20.254.100 'tail -100 /tmp/v9_chunk_*.log'"

# Common issues:
# 1. Import errors - check sys.path and module structure
# 2. Data file missing - verify solusdt_5m*.csv exists
# 3. Virtual env activation failed - check .venv/bin/activate path
```

### SSH Hop Issues (Worker2)
```bash
# Test 2-hop connectivity
ssh root@10.10.254.106 "ssh root@10.20.254.100 'echo SUCCESS'"

# If it fails, check:
# - Worker1 can reach worker2: ssh root@10.10.254.106 "ping -c 3 10.20.254.100"
# - SSH keys are set up between worker1 and worker2
```

### Python Bytecode Cache Issues
```bash
# Clear .pyc files if code changes don't take effect
find /home/icke/traderv4/cluster -name "*.pyc" -delete
find /home/icke/traderv4/cluster -name "__pycache__" -type d -exec rm -rf {} +
```

### Database Lock Issues
```bash
# If the coordinator fails to start due to a DB lock
cd /home/icke/traderv4/cluster/
pkill -f distributed_coordinator  # Kill any running coordinators
rm -f exploration.db              # Delete the database
# Then restart the coordinator
```

---
## Parameter Space

**Total Combinations:** 11,943,936

**14 Parameters:**
1. flip_threshold: 0.4, 0.5, 0.6, 0.7 (4 values)
2. ma_gap: 0.20, 0.30, 0.40, 0.50 (4 values)
3. adx_min: 18, 21, 24, 27 (4 values)
4. long_pos_max: 60, 65, 70, 75 (4 values)
5. short_pos_min: 20, 25, 30, 35 (4 values)
6. cooldown: 1, 2, 3, 4 (4 values)
7. position_size: 0.1-1.0 in 0.1 increments (10 values)
8. tp1_mult: 1.5-3.0 in 0.5 increments (4 values)
9. tp2_mult: 3.0-6.0 in 1.0 increments (4 values)
10. sl_mult: 2.0-4.0 in 0.5 increments (5 values)
11. tp1_close_pct: 0.5-0.8 in 0.1 increments (4 values)
12. trailing_mult: 1.0-2.5 in 0.5 increments (4 values)
13. vol_min: 0.8-1.4 in 0.2 increments (4 values)
14. max_bars: 10, 15, 20, 25 (4 values)

**Chunk Size:** 10,000 combinations
**Total Chunks:** 1,195
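The grid enumeration and chunking above can be sketched with `itertools`. Only three of the 14 parameters are shown to keep the example small; the chunking mechanics are the same at full scale:

```python
import itertools

# Illustrative subset of the 14-parameter grid listed above.
GRID = {
    "flip_threshold": [0.4, 0.5, 0.6, 0.7],
    "adx_min": [18, 21, 24, 27],
    "cooldown": [1, 2, 3, 4],
}

def combinations(grid):
    """Yield one dict per parameter combination (Cartesian product)."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def chunks(grid, size=10_000):
    """Group combinations into fixed-size chunks for worker assignment."""
    it = combinations(grid)
    while chunk := list(itertools.islice(it, size)):
        yield chunk
```

Because `combinations` is a generator, the full grid never sits in memory at once; the coordinator only materializes one chunk at a time before writing its spec to JSON.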
---

## Web UI Integration

### API Endpoint
```typescript
// GET /api/cluster/status
// Returns:
{
  cluster: {
    totalCores: 64,
    activeCores: 45,
    cpuUsage: 70.5,
    activeWorkers: 2,
    status: "active"
  },
  workers: [...],
  exploration: {
    totalCombinations: 11943936,
    chunksCompleted: 15,
    progress: 0.0126
  },
  topStrategies: [...],
  recommendation: "AI-generated action items"
}
```
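The endpoint can be polled from scripts as well as from the UI. A minimal stdlib client, assuming the local port from this guide and the response shape shown above:

```python
import json
import urllib.request

def cluster_status(base_url="http://localhost:3001"):
    """Fetch /api/cluster/status and return the decoded JSON payload."""
    with urllib.request.urlopen(f"{base_url}/api/cluster/status",
                                timeout=10) as resp:
        return json.load(resp)

def progress_pct(status):
    """Exploration progress as a percentage (the API reports a 0-1 fraction)."""
    return status["exploration"]["progress"] * 100
```

For example, a cron job could call `cluster_status()` and alert when `progress_pct` stalls between runs.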
### Frontend Page
- Location: `/home/icke/traderv4/app/cluster/page.tsx`
- Auto-refreshes every 30 seconds
- Shows real-time cluster status
- Displays top strategies with parameters
- Provides AI recommendations

---
## Files Created/Modified

**New Files:**
- `cluster/distributed_coordinator.py` - Main orchestrator (510 lines)
- `cluster/distributed_worker.py` - Worker script for worker1 (271 lines)
- `cluster/distributed_worker_bd_clean.py` - Worker script for worker2 (275 lines)
- `cluster/monitor_bd_host01.sh` - Monitoring script
- `app/api/cluster/status/route.ts` - API endpoint for the web UI (274 lines)
- `app/cluster/page.tsx` - Web UI page (258 lines)
- `cluster/CLUSTER_SETUP.md` - This documentation

**Modified:**
- Docker image rebuilt to include the new API endpoint and cluster page

---
## Next Steps

1. **Monitor first chunk completion** (~10-30 min)
2. **Analyze top strategies** via web UI at `/cluster`
3. **Scale to full sweep** - all 1,195 chunks across both EPYCs
4. **Implement best parameters** in production trading bot
5. **Iterate** - refine grid based on results

---
## Notes

- **70% CPU limit ensures system stability** while maximizing throughput
- **Coordinator keeps no in-memory state** - all state lives in SQLite, so it can be restarted anytime
- **Workers are autonomous** - they process chunks independently, with no inter-worker coordination needed
- **Results are immutable** - each chunk produces one CSV, never overwritten
- **Web UI provides actionable insights** - no manual CSV analysis needed

**Last Updated:** November 30, 2025