
EPYC Cluster Setup and Access Guide

Overview

Two AMD EPYC 16-core servers running distributed parameter exploration for trading bot optimization.

Total Capacity: 64 cores processing 12M parameter combinations


Server Access

Worker1: pve-nu-monitor01 (Direct SSH)

# Direct access from srvdocker02
ssh root@10.10.254.106

# Specs
- Hostname: pve-nu-monitor01
- IP: 10.10.254.106
- CPU: AMD EPYC 7282 16-Core Processor (32 cores with hyperthreading)
- Location: /home/comprehensive_sweep/backtester/

Worker2: bd-host01 (SSH Hop Required)

# Access via 2-hop through worker1
ssh root@10.10.254.106 "ssh root@10.20.254.100 'COMMAND'"

# SCP via 2-hop
scp FILE root@10.10.254.106:/tmp/
ssh root@10.10.254.106 "scp /tmp/FILE root@10.20.254.100:/path/"

# Specs
- Hostname: bd-host01
- IP: 10.20.254.100 (only accessible from worker1)
- CPU: AMD EPYC 7282 16-Core Processor (32 cores with hyperthreading)
- Location: /home/backtest_dual/backtest/

Coordinator: srvdocker02 (Local)

# Running on trading bot server
cd /home/icke/traderv4/cluster/

# Specs
- Hostname: srvdocker02
- Role: Orchestrates distributed sweep, hosts trading bot
- Database: SQLite at /home/icke/traderv4/cluster/exploration.db

Directory Structure

Worker1 Structure

/home/comprehensive_sweep/backtester/
├── data/
│   └── solusdt_5m_aug_nov.csv        # OHLCV data
├── indicators/
│   └── money_line.py                 # Money Line indicator
├── scripts/
│   └── distributed_worker.py         # Worker script
├── simulator.py                       # Backtesting engine
├── data_loader.py                     # Data loading utilities
└── .venv/                            # Python environment

Worker2 Structure

/home/backtest_dual/backtest/
├── backtester/
│   ├── data/
│   │   └── solusdt_5m.csv            # OHLCV data (copied from worker1)
│   ├── indicators/
│   │   └── money_line.py
│   ├── scripts/
│   │   └── distributed_worker.py     # Modified for bd-host01
│   ├── simulator.py
│   └── data_loader.py
└── .venv/                            # Python environment

Coordinator Structure

/home/icke/traderv4/cluster/
├── distributed_coordinator.py        # Main orchestrator
├── distributed_worker.py             # Worker script (template for worker1)
├── distributed_worker_bd_clean.py    # Worker script (template for worker2)
├── monitor_bd_host01.sh             # Monitoring script
├── exploration.db                    # Chunk tracking database
└── chunk_*.json                      # Chunk specifications

How It Works

1. Coordinator (srvdocker02)

  • Splits the ~12M-combination parameter space into chunks (10,000 combos each)
  • Stores chunk assignments in a SQLite database (see the sketch after this list)
  • Deploys chunk specs and worker scripts via SSH/SCP
  • Starts workers via SSH with nohup (background execution)
  • Monitors chunk completion and collects results
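
A minimal sketch of the chunking and tracking idea (table, column, and file names here are assumptions; the real schema in distributed_coordinator.py may differ):

# Illustrative sketch only - not the actual coordinator code
import itertools, json, math, sqlite3

PARAM_GRID = {
    "flip_threshold": [0.4, 0.5, 0.6, 0.7],
    "adx_min": [18, 21, 24, 27],
    # ... remaining parameters from the "Parameter Space" section below
}
CHUNK_SIZE = 10_000

db = sqlite3.connect("exploration.db")
db.execute("""CREATE TABLE IF NOT EXISTS chunks (
                  id INTEGER PRIMARY KEY,
                  spec_file TEXT,
                  status TEXT DEFAULT 'pending',
                  worker TEXT)""")

# For the full ~12M grid you would stream combinations rather than build a list
names = list(PARAM_GRID)
combos = [dict(zip(names, vals)) for vals in itertools.product(*PARAM_GRID.values())]
for chunk_id in range(math.ceil(len(combos) / CHUNK_SIZE)):
    spec_file = f"chunk_{chunk_id:06d}.json"
    with open(spec_file, "w") as f:
        json.dump(combos[chunk_id * CHUNK_SIZE:(chunk_id + 1) * CHUNK_SIZE], f)
    db.execute("INSERT OR IGNORE INTO chunks (id, spec_file) VALUES (?, ?)",
               (chunk_id, spec_file))
db.commit()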

2. Workers (EPYCs)

  • Each processes assigned chunks independently
  • Uses multiprocessing.Pool with a 70% CPU limit (22 of 32 cores); see the worker sketch after this list
  • Outputs results to CSV files in their workspace
  • Logs progress to /tmp/v9_chunk_XXXXXX.log
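
A hedged sketch of the worker loop (evaluate_combo, the CSV columns, and the file naming are illustrative stand-ins, not the actual distributed_worker.py code):

import csv, json, multiprocessing as mp, sys

def evaluate_combo(params):
    # Placeholder: the real worker runs the backtesting simulator here
    return {**params, "pnl": 0.0, "trades": 0}

if __name__ == "__main__":
    chunk_file = sys.argv[1]                         # e.g. chunk_000042.json
    with open(chunk_file) as f:
        combos = json.load(f)

    max_workers = max(1, int(mp.cpu_count() * 0.7))  # 70% CPU limit
    with mp.Pool(processes=max_workers) as pool:
        rows = pool.map(evaluate_combo, combos)

    out_file = chunk_file.replace(".json", "_results.csv")
    with open(out_file, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)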

3. Results Collection

  • Workers save to: chunk_v9_chunk_XXXXXX_results.csv
  • Coordinator can fetch results via SCP (see the collection sketch after this list)
  • Trading bot API endpoint serves results to web UI
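
A sketch of pulling and ranking worker1 results (host and path match this guide; the "pnl" column used for ranking is an assumption about the CSV schema):

import csv, glob, subprocess

# Worker1 pull; worker2 needs the extra hop shown under "Fetch Results" below
subprocess.run(
    ["scp", "root@10.10.254.106:/home/comprehensive_sweep/backtester/chunk_*_results.csv", "."],
    check=True,
)

rows = []
for path in glob.glob("chunk_*_results.csv"):
    with open(path) as f:
        rows.extend(csv.DictReader(f))

top5 = sorted(rows, key=lambda r: float(r["pnl"]), reverse=True)[:5]
for rank, row in enumerate(top5, start=1):
    print(rank, row)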

Common Operations

Start Distributed Sweep

cd /home/icke/traderv4/cluster/

# Clear old chunks and start fresh
rm -f exploration.db
nohup python3 distributed_coordinator.py > sweep.log 2>&1 &

# Monitor progress
tail -f sweep.log

Monitor Worker Status

# Check worker1
ssh root@10.10.254.106 "top -bn1 | grep Cpu && ps aux | grep distributed_worker | wc -l"

# Check worker2 (via hop)
ssh root@10.10.254.106 "ssh root@10.20.254.100 'top -bn1 | grep Cpu && ps aux | grep distributed_worker | wc -l'"

# Use monitoring script
/home/icke/traderv4/cluster/monitor_bd_host01.sh

Fetch Results

# Worker1 results
scp root@10.10.254.106:/home/comprehensive_sweep/backtester/chunk_*_results.csv ./

# Worker2 results (2-hop)
ssh root@10.10.254.106 "scp root@10.20.254.100:/home/backtest_dual/backtest/chunk_*_results.csv /tmp/"
scp root@10.10.254.106:/tmp/chunk_*_results.csv ./

View Results in Web UI

# Access cluster status page
http://localhost:3001/cluster
# or
https://tradervone.v4.dedyn.io/cluster

# Shows:
- Real-time CPU usage and worker status
- Exploration progress
- Top 5 strategies with parameters
- AI recommendations for next actions

Kill All Workers

# Kill worker1
ssh root@10.10.254.106 "pkill -f distributed_worker"

# Kill worker2
ssh root@10.10.254.106 "ssh root@10.20.254.100 'pkill -f distributed_worker'"

# Kill coordinator
pkill -f distributed_coordinator

CPU Limit Configuration

Why 70%?

  • Prevents server overload
  • Leaves headroom for system operations
  • Balances throughput vs stability

Implementation

Both worker scripts limit CPU via multiprocessing.Pool:

# In distributed_worker.py and distributed_worker_bd_clean.py
import multiprocessing as mp

num_workers = mp.cpu_count()                    # 32 logical cores per EPYC
max_workers = max(1, int(num_workers * 0.7))    # 70% of 32 = 22

with mp.Pool(processes=max_workers) as pool:
    ...  # processing happens here

Expected CPU Usage: 67-72% user time on each EPYC


Troubleshooting

Worker Not Starting

# Check worker logs
ssh root@10.10.254.106 "tail -100 /tmp/v9_chunk_*.log"
ssh root@10.10.254.106 "ssh root@10.20.254.100 'tail -100 /tmp/v9_chunk_*.log'"

# Common issues:
# 1. Import errors - check sys.path and module structure
# 2. Data file missing - verify solusdt_5m*.csv exists
# 3. Virtual env activation failed - check .venv/bin/activate path

SSH Hop Issues (Worker2)

# Test 2-hop connectivity
ssh root@10.10.254.106 "ssh root@10.20.254.100 'echo SUCCESS'"

# If fails, check:
# - Worker1 can reach worker2: ssh root@10.10.254.106 "ping -c 3 10.20.254.100"
# - SSH keys are set up between worker1 and worker2

Python Bytecode Cache Issues

# Clear .pyc files if code changes don't take effect
find /home/icke/traderv4/cluster -name "*.pyc" -delete
find /home/icke/traderv4/cluster -name "__pycache__" -type d -exec rm -rf {} +

Database Lock Issues

# If coordinator fails to start due to DB lock
cd /home/icke/traderv4/cluster/
pkill -f distributed_coordinator  # Kill any running coordinators
rm -f exploration.db               # Delete database (chunk progress is lost)
# Then restart coordinator

Parameter Space

Total Combinations: 11,943,936

14 Parameters:

  1. flip_threshold: 0.4, 0.5, 0.6, 0.7 (4 values)
  2. ma_gap: 0.20, 0.30, 0.40, 0.50 (4 values)
  3. adx_min: 18, 21, 24, 27 (4 values)
  4. long_pos_max: 60, 65, 70, 75 (4 values)
  5. short_pos_min: 20, 25, 30, 35 (4 values)
  6. cooldown: 1, 2, 3, 4 (4 values)
  7. position_size: 0.1-1.0 in 0.1 increments (10 values)
  8. tp1_mult: 1.5-3.0 in 0.5 increments (4 values)
  9. tp2_mult: 3.0-6.0 in 1.0 increments (4 values)
  10. sl_mult: 2.0-4.0 in 0.5 increments (5 values)
  11. tp1_close_pct: 0.5-0.8 in 0.1 increments (4 values)
  12. trailing_mult: 1.0-2.5 in 0.5 increments (4 values)
  13. vol_min: 0.8-1.4 in 0.2 increments (4 values)
  14. max_bars: 10, 15, 20, 25 (4 values)

Chunk Size: 10,000 combinations
Total Chunks: 1,195
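
A quick arithmetic check of the chunk count, taking the stated total as given:

import math

total_combinations = 11_943_936
chunk_size = 10_000
print(math.ceil(total_combinations / chunk_size))  # -> 1195 chunks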


Web UI Integration

API Endpoint

// GET /api/cluster/status
// Returns:
{
  cluster: {
    totalCores: 64,
    activeCores: 45,
    cpuUsage: 70.5,
    activeWorkers: 2,
    status: "active"
  },
  workers: [...],
  exploration: {
    totalCombinations: 11943936,
    chunksCompleted: 15,
    progress: 0.0126
  },
  topStrategies: [...],
  recommendation: "AI-generated action items"
}
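
A minimal sketch of polling this endpoint from Python (standard library only; field names as in the response above):

import json, time, urllib.request

URL = "http://localhost:3001/api/cluster/status"
while True:
    with urllib.request.urlopen(URL) as resp:
        status = json.load(resp)
    done = status["exploration"]["chunksCompleted"]
    cpu = status["cluster"]["cpuUsage"]
    print(f"chunks completed: {done}, cluster CPU: {cpu}%")
    time.sleep(30)  # same cadence as the web UI auto-refresh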

Frontend Page

  • Location: /home/icke/traderv4/app/cluster/page.tsx
  • Auto-refreshes every 30 seconds
  • Shows real-time cluster status
  • Displays top strategies with parameters
  • Provides AI recommendations

Files Created/Modified

New Files:

  • cluster/distributed_coordinator.py - Main orchestrator (510 lines)
  • cluster/distributed_worker.py - Worker script for worker1 (271 lines)
  • cluster/distributed_worker_bd_clean.py - Worker script for worker2 (275 lines)
  • cluster/monitor_bd_host01.sh - Monitoring script
  • app/api/cluster/status/route.ts - API endpoint for web UI (274 lines)
  • app/cluster/page.tsx - Web UI page (258 lines)
  • cluster/CLUSTER_SETUP.md - This documentation

Modified Files:

  • Docker image rebuilt to include the new API endpoint and cluster page

Next Steps

  1. Monitor first chunk completion (~10-30 min)
  2. Analyze top strategies via web UI at /cluster
  3. Scale to full sweep - all 1,195 chunks across both EPYCs
  4. Implement best parameters in production trading bot
  5. Iterate - refine grid based on results

Notes

  • 70% CPU limit ensures system stability while maximizing throughput
  • Coordinator keeps no in-process state - everything lives in SQLite, so it can be restarted anytime
  • Workers are autonomous - process chunks independently, no coordination needed
  • Results are immutable - each chunk produces one CSV, never overwritten
  • Web UI provides actionable insights - no manual CSV analysis needed

Last Updated: November 30, 2025