
EPYC Cluster Setup and Access Guide

Overview

Two AMD EPYC 16-core servers running distributed parameter exploration for trading bot optimization.

Total Capacity: 64 cores processing 12M parameter combinations


Server Access

Worker1: pve-nu-monitor01 (Direct SSH)

# Direct access from srvdocker02
ssh root@10.10.254.106

# Specs
- Hostname: pve-nu-monitor01
- IP: 10.10.254.106
- CPU: AMD EPYC 7282 16-Core Processor (32 cores with hyperthreading)
- Location: /home/comprehensive_sweep/backtester/

Worker2: bd-host01 (SSH Hop Required)

# Access via 2-hop through worker1
ssh root@10.10.254.106 "ssh root@10.20.254.100 'COMMAND'"

# SCP via 2-hop
scp FILE root@10.10.254.106:/tmp/
ssh root@10.10.254.106 "scp /tmp/FILE root@10.20.254.100:/path/"

# Specs
- Hostname: bd-host01
- IP: 10.20.254.100 (only accessible from worker1)
- CPU: AMD EPYC 7282 16-Core Processor (32 cores with hyperthreading)
- Location: /home/backtest_dual/backtest/

Coordinator: srvdocker02 (Local)

# Running on trading bot server
cd /home/icke/traderv4/cluster/

# Specs
- Hostname: srvdocker02
- Role: Orchestrates distributed sweep, hosts trading bot
- Database: SQLite at /home/icke/traderv4/cluster/exploration.db

Directory Structure

Worker1 Structure

/home/comprehensive_sweep/backtester/
├── data/
│   └── solusdt_5m_aug_nov.csv        # OHLCV data
├── indicators/
│   └── money_line.py                 # Money Line indicator
├── scripts/
│   └── distributed_worker.py         # Worker script
├── simulator.py                       # Backtesting engine
├── data_loader.py                     # Data loading utilities
└── .venv/                            # Python environment

Worker2 Structure

/home/backtest_dual/backtest/
├── backtester/
│   ├── data/
│   │   └── solusdt_5m.csv            # OHLCV data (copied from worker1)
│   ├── indicators/
│   │   └── money_line.py
│   ├── scripts/
│   │   └── distributed_worker.py     # Modified for bd-host01
│   ├── simulator.py
│   └── data_loader.py
└── .venv/                            # Python environment

Coordinator Structure

/home/icke/traderv4/cluster/
├── distributed_coordinator.py        # Main orchestrator
├── distributed_worker.py             # Worker script (template for worker1)
├── distributed_worker_bd_clean.py    # Worker script (template for worker2)
├── monitor_bd_host01.sh             # Monitoring script
├── exploration.db                    # Chunk tracking database
└── chunk_*.json                      # Chunk specifications

How It Works

1. Coordinator (srvdocker02)

  • Splits the ~12M-combination parameter space into chunks (10,000 combos each)
  • Stores chunk assignments in a SQLite database (see the sketch after this list)
  • Deploys chunk specs and worker scripts via SSH/SCP
  • Starts workers via SSH with nohup (background execution)
  • Monitors chunk completion and collects results
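
A minimal sketch of the chunking and tracking idea (table, column, and file names here are assumptions; the real schema in distributed_coordinator.py may differ):

# Illustrative sketch only - not the actual coordinator code
import itertools, json, math, sqlite3

PARAM_GRID = {
    "flip_threshold": [0.4, 0.5, 0.6, 0.7],
    "adx_min": [18, 21, 24, 27],
    # ... remaining parameters from the "Parameter Space" section below
}
CHUNK_SIZE = 10_000

db = sqlite3.connect("exploration.db")
db.execute("""CREATE TABLE IF NOT EXISTS chunks (
                  id INTEGER PRIMARY KEY,
                  spec_file TEXT,
                  status TEXT DEFAULT 'pending',
                  worker TEXT)""")

# For the full ~12M grid you would stream combinations rather than build a list
names = list(PARAM_GRID)
combos = [dict(zip(names, vals)) for vals in itertools.product(*PARAM_GRID.values())]
for chunk_id in range(math.ceil(len(combos) / CHUNK_SIZE)):
    spec_file = f"chunk_{chunk_id:06d}.json"
    with open(spec_file, "w") as f:
        json.dump(combos[chunk_id * CHUNK_SIZE:(chunk_id + 1) * CHUNK_SIZE], f)
    db.execute("INSERT OR IGNORE INTO chunks (id, spec_file) VALUES (?, ?)",
               (chunk_id, spec_file))
db.commit()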

2. Workers (EPYCs)

  • Each processes assigned chunks independently
  • Uses multiprocessing.Pool with a 70% CPU limit (22 of 32 cores); see the worker sketch after this list
  • Outputs results to CSV files in their workspace
  • Logs progress to /tmp/v9_chunk_XXXXXX.log
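
A hedged sketch of the worker loop (evaluate_combo, the CSV columns, and the file naming are illustrative stand-ins, not the actual distributed_worker.py code):

import csv, json, multiprocessing as mp, sys

def evaluate_combo(params):
    # Placeholder: the real worker runs the backtesting simulator here
    return {**params, "pnl": 0.0, "trades": 0}

if __name__ == "__main__":
    chunk_file = sys.argv[1]                         # e.g. chunk_000042.json
    with open(chunk_file) as f:
        combos = json.load(f)

    max_workers = max(1, int(mp.cpu_count() * 0.7))  # 70% CPU limit
    with mp.Pool(processes=max_workers) as pool:
        rows = pool.map(evaluate_combo, combos)

    out_file = chunk_file.replace(".json", "_results.csv")
    with open(out_file, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)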

3. Results Collection

  • Workers save to: chunk_v9_chunk_XXXXXX_results.csv
  • Coordinator can fetch results via SCP (see the collection sketch after this list)
  • Trading bot API endpoint serves results to web UI
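
A sketch of pulling and ranking worker1 results (host and path match this guide; the "pnl" column used for ranking is an assumption about the CSV schema):

import csv, glob, subprocess

# Worker1 pull; worker2 needs the extra hop shown under "Fetch Results" below
subprocess.run(
    ["scp", "root@10.10.254.106:/home/comprehensive_sweep/backtester/chunk_*_results.csv", "."],
    check=True,
)

rows = []
for path in glob.glob("chunk_*_results.csv"):
    with open(path) as f:
        rows.extend(csv.DictReader(f))

top5 = sorted(rows, key=lambda r: float(r["pnl"]), reverse=True)[:5]
for rank, row in enumerate(top5, start=1):
    print(rank, row)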

Common Operations

Start Distributed Sweep

cd /home/icke/traderv4/cluster/

# Clear old chunks and start fresh
rm -f exploration.db
nohup python3 distributed_coordinator.py > sweep.log 2>&1 &

# Monitor progress
tail -f sweep.log

Monitor Worker Status

# Check worker1
ssh root@10.10.254.106 "top -bn1 | grep Cpu && ps aux | grep distributed_worker | wc -l"

# Check worker2 (via hop)
ssh root@10.10.254.106 "ssh root@10.20.254.100 'top -bn1 | grep Cpu && ps aux | grep distributed_worker | wc -l'"

# Use monitoring script
/home/icke/traderv4/cluster/monitor_bd_host01.sh

Fetch Results

# Worker1 results
scp root@10.10.254.106:/home/comprehensive_sweep/backtester/chunk_*_results.csv ./

# Worker2 results (2-hop)
ssh root@10.10.254.106 "scp root@10.20.254.100:/home/backtest_dual/backtest/chunk_*_results.csv /tmp/"
scp root@10.10.254.106:/tmp/chunk_*_results.csv ./

View Results in Web UI

# Access cluster status page
http://localhost:3001/cluster
# or
https://tradervone.v4.dedyn.io/cluster

# Shows:
- Real-time CPU usage and worker status
- Exploration progress
- Top 5 strategies with parameters
- AI recommendations for next actions

Kill All Workers

# Kill worker1
ssh root@10.10.254.106 "pkill -f distributed_worker"

# Kill worker2
ssh root@10.10.254.106 "ssh root@10.20.254.100 'pkill -f distributed_worker'"

# Kill coordinator
pkill -f distributed_coordinator

CPU Limit Configuration

Why 70%?

  • Prevents server overload
  • Leaves headroom for system operations
  • Balances throughput vs stability

Implementation

Both worker scripts limit CPU via multiprocessing.Pool:

# In distributed_worker.py and distributed_worker_bd_clean.py
import multiprocessing as mp

num_workers = mp.cpu_count()                    # 32 logical cores per EPYC
max_workers = max(1, int(num_workers * 0.7))    # 70% of 32 = 22

with mp.Pool(processes=max_workers) as pool:
    ...  # processing happens here

Expected CPU Usage: 67-72% user time on each EPYC


Troubleshooting

Worker Not Starting

# Check worker logs
ssh root@10.10.254.106 "tail -100 /tmp/v9_chunk_*.log"
ssh root@10.10.254.106 "ssh root@10.20.254.100 'tail -100 /tmp/v9_chunk_*.log'"

# Common issues:
# 1. Import errors - check sys.path and module structure
# 2. Data file missing - verify solusdt_5m*.csv exists
# 3. Virtual env activation failed - check .venv/bin/activate path

SSH Hop Issues (Worker2)

# Test 2-hop connectivity
ssh root@10.10.254.106 "ssh root@10.20.254.100 'echo SUCCESS'"

# If fails, check:
# - Worker1 can reach worker2: ssh root@10.10.254.106 "ping -c 3 10.20.254.100"
# - SSH keys are set up between worker1 and worker2

Python Bytecode Cache Issues

# Clear .pyc files if code changes don't take effect
find /home/icke/traderv4/cluster -name "*.pyc" -delete
find /home/icke/traderv4/cluster -name "__pycache__" -type d -exec rm -rf {} +

Database Lock Issues

# If coordinator fails to start due to DB lock
cd /home/icke/traderv4/cluster/
pkill -f distributed_coordinator  # Kill any running coordinators
rm -f exploration.db               # Delete database (chunk progress is lost)
# Then restart coordinator

Parameter Space

Total Combinations: 11,943,936

14 Parameters:

  1. flip_threshold: 0.4, 0.5, 0.6, 0.7 (4 values)
  2. ma_gap: 0.20, 0.30, 0.40, 0.50 (4 values)
  3. adx_min: 18, 21, 24, 27 (4 values)
  4. long_pos_max: 60, 65, 70, 75 (4 values)
  5. short_pos_min: 20, 25, 30, 35 (4 values)
  6. cooldown: 1, 2, 3, 4 (4 values)
  7. position_size: 0.1-1.0 in 0.1 increments (10 values)
  8. tp1_mult: 1.5-3.0 in 0.5 increments (4 values)
  9. tp2_mult: 3.0-6.0 in 1.0 increments (4 values)
  10. sl_mult: 2.0-4.0 in 0.5 increments (5 values)
  11. tp1_close_pct: 0.5-0.8 in 0.1 increments (4 values)
  12. trailing_mult: 1.0-2.5 in 0.5 increments (4 values)
  13. vol_min: 0.8-1.4 in 0.2 increments (4 values)
  14. max_bars: 10, 15, 20, 25 (4 values)

Chunk Size: 10,000 combinations
Total Chunks: 1,195
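
A quick arithmetic check of the chunk count, taking the stated total as given:

import math

total_combinations = 11_943_936
chunk_size = 10_000
print(math.ceil(total_combinations / chunk_size))  # -> 1195 chunks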


Web UI Integration

API Endpoint

// GET /api/cluster/status
// Returns:
{
  cluster: {
    totalCores: 64,
    activeCores: 45,
    cpuUsage: 70.5,
    activeWorkers: 2,
    status: "active"
  },
  workers: [...],
  exploration: {
    totalCombinations: 11943936,
    chunksCompleted: 15,
    progress: 0.0126
  },
  topStrategies: [...],
  recommendation: "AI-generated action items"
}
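
A minimal sketch of polling this endpoint from Python (standard library only; field names as in the response above):

import json, time, urllib.request

URL = "http://localhost:3001/api/cluster/status"
while True:
    with urllib.request.urlopen(URL) as resp:
        status = json.load(resp)
    done = status["exploration"]["chunksCompleted"]
    cpu = status["cluster"]["cpuUsage"]
    print(f"chunks completed: {done}, cluster CPU: {cpu}%")
    time.sleep(30)  # same cadence as the web UI auto-refresh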

Frontend Page

  • Location: /home/icke/traderv4/app/cluster/page.tsx
  • Auto-refreshes every 30 seconds
  • Shows real-time cluster status
  • Displays top strategies with parameters
  • Provides AI recommendations

Files Created/Modified

New Files:

  • cluster/distributed_coordinator.py - Main orchestrator (510 lines)
  • cluster/distributed_worker.py - Worker script for worker1 (271 lines)
  • cluster/distributed_worker_bd_clean.py - Worker script for worker2 (275 lines)
  • cluster/monitor_bd_host01.sh - Monitoring script
  • app/api/cluster/status/route.ts - API endpoint for web UI (274 lines)
  • app/cluster/page.tsx - Web UI page (258 lines)
  • cluster/CLUSTER_SETUP.md - This documentation

Modified Files:

  • Docker image rebuilt to include the new API endpoint and cluster page

Next Steps

  1. Monitor first chunk completion (~10-30 min)
  2. Analyze top strategies via web UI at /cluster
  3. Scale to full sweep - all 1,195 chunks across both EPYCs
  4. Implement best parameters in production trading bot
  5. Iterate - refine grid based on results

Notes

  • 70% CPU limit ensures system stability while maximizing throughput
  • Coordinator keeps no in-process state - everything lives in SQLite, so it can be restarted anytime
  • Workers are autonomous - process chunks independently, no coordination needed
  • Results are immutable - each chunk produces one CSV, never overwritten
  • Web UI provides actionable insights - no manual CSV analysis needed

Last Updated: November 30, 2025