# Distributed Computing & EPYC Cluster

Infrastructure for large-scale parameter optimization and backtesting.

This directory contains documentation for the EPYC cluster setup, distributed backtesting coordination, and multi-server infrastructure.
## 🖥️ Cluster Documentation

### EPYC Server Setup

**EPYC_SETUP_COMPREHENSIVE.md** - Complete setup guide
- Hardware: AMD EPYC 7282 16-Core Processor (Debian 12 Bookworm)
- Python environment: 3.11.2 with pandas 2.3.3, numpy 2.3.5
- SSH configuration: Nested hop (master → worker1 → worker2)
- Package deployment: tar.gz transfer with virtual environment
- Status: ✅ OPERATIONAL (24 workers processing 65,536 combos)
### Distributed Architecture

**DUAL_SWEEP_README.md** - Parallel sweep execution
- Coordinator: Assigns chunks to workers
- Workers: Execute parameter combinations in parallel
- Database: SQLite exploration.db for state tracking
- Results: CSV files with top N configurations
- Use case: v9 exhaustive parameter optimization (Nov 28-29, 2025)
### Cluster Control

**CLUSTER_START_BUTTON_FIX.md** - Web UI integration
- Dashboard: http://localhost:3001/cluster
- Start/Stop buttons with status detection
- Database-first status (SSH supplementary)
- Real-time progress tracking
- Status: ✅ DEPLOYED (Nov 30, 2025)
## 🏗️ Cluster Architecture

### Physical Infrastructure

```
Master Server (local development machine)
├── Coordinator Process (assigns chunks)
├── Database (exploration.db)
└── Web Dashboard (Next.js)
        ↓ [SSH]
Worker1 (EPYC 10.10.254.106)
├── 12 worker processes
├── 64GB RAM
└── Direct SSH connection
        ↓ [SSH ProxyJump]
Worker2 (EPYC 10.20.254.100)
├── 12 worker processes
├── 64GB RAM
└── Via worker1 hop
```
### Data Flow

```
1. Coordinator creates chunks (2,000 combos each)
        ↓
2. Marks chunk status='pending' in database
        ↓
3. Worker queries database for pending chunks
        ↓
4. Coordinator assigns chunk to worker via SSH
        ↓
5. Worker updates status='running'
        ↓
6. Worker processes combinations in parallel
        ↓
7. Worker saves results to strategies table
        ↓
8. Worker updates status='completed'
        ↓
9. Coordinator assigns next pending chunk
        ↓
10. Dashboard shows real-time progress
```
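For illustration, steps 1-2 can be sketched in Python. This is a minimal sketch against the chunks schema documented in this README; the function name and exact `INSERT` statement are assumptions, not the actual coordinator code.

```python
import sqlite3

# Illustrative sketch: partition the sweep into 2,000-combo chunks and
# insert them as 'pending' rows (steps 1-2 of the data flow above).
CHUNK_SIZE = 2000

def create_chunks(db_path: str, total_combos: int) -> int:
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS chunks (
        id TEXT PRIMARY KEY, start_combo INTEGER, end_combo INTEGER,
        status TEXT, assigned_worker TEXT, started_at INTEGER, completed_at INTEGER)""")
    rows = [
        (f"v9_chunk_{start:06d}", start, min(start + CHUNK_SIZE, total_combos), "pending")
        for start in range(0, total_combos, CHUNK_SIZE)
    ]
    conn.executemany(
        "INSERT OR IGNORE INTO chunks (id, start_combo, end_combo, status) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    conn.close()
    return len(rows)

# 65,536 combos → 33 chunks (32 full chunks of 2,000 plus one of 1,536)
print(create_chunks(":memory:", 65536))
```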
### Database Schema

```sql
-- chunks table: Work distribution
CREATE TABLE chunks (
    id TEXT PRIMARY KEY,      -- v9_chunk_000000
    start_combo INTEGER,      -- 0, 2000, 4000, etc.
    end_combo INTEGER,        -- 2000, 4000, 6000, etc.
    status TEXT,              -- 'pending', 'running', 'completed'
    assigned_worker TEXT,     -- 'worker1', 'worker2'
    started_at INTEGER,
    completed_at INTEGER
);

-- strategies table: Results storage
CREATE TABLE strategies (
    id INTEGER PRIMARY KEY,
    chunk_id TEXT,
    params TEXT,              -- JSON of parameter values
    pnl REAL,
    win_rate REAL,
    profit_factor REAL,
    max_drawdown REAL,
    total_trades INTEGER
);
```
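A worker can claim a pending chunk atomically with a single `UPDATE`, which avoids two workers grabbing the same row. This is a hypothetical sketch against the schema above, not the repo's actual worker code; the `RETURNING` clause requires SQLite ≥ 3.35.

```python
import sqlite3, time

# Atomically flip one 'pending' chunk to 'running' and return its range.
# Worker name and helper are illustrative.
def claim_chunk(conn: sqlite3.Connection, worker: str):
    row = conn.execute(
        """UPDATE chunks
           SET status = 'running', assigned_worker = ?, started_at = ?
           WHERE id = (SELECT id FROM chunks WHERE status = 'pending' LIMIT 1)
           RETURNING id, start_combo, end_combo""",
        (worker, int(time.time())),
    ).fetchone()
    conn.commit()
    return row  # None when no pending chunks remain
```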
## 🚀 Using the Cluster

### Starting a Sweep

```bash
# 1. Prepare package on master
cd /home/icke/traderv4/backtester
tar -czf backtest_v9_sweep.tar.gz data/ backtester_core.py v9_moneyline_ma_gap.py moneyline_core.py

# 2. Transfer to EPYC workers
scp backtest_v9_sweep.tar.gz root@10.10.254.106:/home/backtest/
ssh root@10.10.254.106 "scp backtest_v9_sweep.tar.gz root@10.20.254.100:/home/backtest/"

# 3. Extract on workers
ssh root@10.10.254.106 "cd /home/backtest && tar -xzf backtest_v9_sweep.tar.gz"
ssh root@10.10.254.106 "ssh root@10.20.254.100 'cd /home/backtest && tar -xzf backtest_v9_sweep.tar.gz'"

# 4. Start via web dashboard or CLI
# Web: http://localhost:3001/cluster → Click "Start Cluster"
# CLI: cd /home/icke/traderv4/cluster && python v9_advanced_coordinator.py
```
### Monitoring Progress

```bash
# Dashboard
curl -s http://localhost:3001/api/cluster/status | jq

# Database query
sqlite3 cluster/exploration.db "SELECT status, COUNT(*) FROM chunks GROUP BY status;"

# Worker processes
ssh root@10.10.254.106 "ps aux | grep [p]ython | grep backtest | wc -l"
```
### Collecting Results

```bash
# Results saved to cluster/results/
ls -lh cluster/results/sweep_v9_*.csv

# Top 100 configurations by P&L
sqlite3 cluster/exploration.db "SELECT params, pnl, win_rate FROM strategies ORDER BY pnl DESC LIMIT 100;"
```
## 🔧 Configuration

### Coordinator Settings

```python
# cluster/v9_advanced_coordinator.py
WORKERS = {
    'worker1': {'host': '10.10.254.106', 'port': 22},
    'worker2': {'host': '10.20.254.100', 'port': 22, 'proxy_jump': '10.10.254.106'}
}
CHUNK_SIZE = 2000      # Combinations per chunk
MAX_WORKERS = 24       # 12 per server
CHECK_INTERVAL = 60    # Status check frequency (seconds)
```

### Worker Settings

```python
# cluster/distributed_worker.py
NUM_PROCESSES = 12     # Parallel backtests
BATCH_SIZE = 100       # Save results every N combos
TIMEOUT = 120          # Per-combo timeout (seconds)
```
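The fan-out implied by these settings can be sketched as follows: the worker takes its combo range, runs `NUM_PROCESSES` backtests in parallel, and flushes results every `BATCH_SIZE` combos. `run_backtest` is a stand-in for the real per-combo backtest, and the `save` callback is likewise hypothetical, not the repo's actual worker code.

```python
from multiprocessing import Pool

NUM_PROCESSES = 12
BATCH_SIZE = 100

def run_backtest(combo_id: int) -> dict:
    return {"combo": combo_id, "pnl": 0.0}  # placeholder for the real backtest

def process_range(start: int, end: int, save=print):
    batch = []
    with Pool(NUM_PROCESSES) as pool:
        # imap_unordered streams results back as each process finishes
        for result in pool.imap_unordered(run_backtest, range(start, end)):
            batch.append(result)
            if len(batch) >= BATCH_SIZE:
                save(batch)  # e.g. INSERT into the strategies table
                batch = []
    if batch:
        save(batch)  # flush the final partial batch
```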
### SSH Configuration

```
# ~/.ssh/config
Host worker1
    HostName 10.10.254.106
    User root
    StrictHostKeyChecking no
    ServerAliveInterval 30

Host worker2
    HostName 10.20.254.100
    User root
    ProxyJump worker1
    StrictHostKeyChecking no
    ServerAliveInterval 30
```
## 🐛 Common Issues

### SSH Timeout Errors

**Symptom:** "SSH command timed out for worker2"
**Root cause:** The nested hop to worker2 needs a 60s subprocess timeout, not 30s.
**Fix** (Common Pitfall #64): increase the subprocess timeout:

```python
result = subprocess.run(ssh_cmd, timeout=60)  # Not 30: worker2 sits behind a ProxyJump
```
### Database Lock Errors

**Symptom:** "database is locked"
**Root cause:** Multiple workers writing simultaneously.
**Fix:** Use WAL mode and increase `busy_timeout`:

```python
connection.execute('PRAGMA journal_mode=WAL')
connection.execute('PRAGMA busy_timeout=10000')  # Wait up to 10s before raising
```
### Worker Not Processing

**Symptom:** Chunk status='running' but no worker processes.
**Root cause:** Worker crashed or the SSH session died.
**Fix:** Reset stuck chunks in the database, then restart:

```bash
# Mark chunks stuck for >10 minutes as pending again
sqlite3 exploration.db "UPDATE chunks SET status='pending', assigned_worker=NULL WHERE status='running' AND started_at < (strftime('%s','now') - 600);"
```
### Status Shows "Idle" When Running

**Symptom:** Dashboard shows idle despite workers running.
**Root cause:** SSH detection times out before the database is consulted.
**Fix** (Common Pitfall #71): database-first status detection:

```javascript
// Check database BEFORE SSH
const hasRunningChunks = explorationData.chunks.running > 0
if (hasRunningChunks) clusterStatus = 'active'
```
## 📊 Performance Metrics

**v9 Exhaustive Sweep** (65,536 combos):
- Duration: ~29 hours (24 workers)
- Speed: 1.60s per combo (4× faster than 6 local workers)
- Throughput: ~37 combos/minute across the cluster
- Data processed: 139,678 OHLCV rows × 65,536 combos ≈ 9.15B row evaluations
- Results: top 100 saved to CSV (~10KB file)

**Cost Analysis:**
- Local (6 workers): 72 hours estimated
- EPYC (24 workers): 29 hours actual
- Time savings: 43 hours (60% faster)
- Resource utilization: 64 cores utilized vs 6 local
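The headline figures are mutually consistent; the sketch below just re-derives them from the per-combo rate (no new measurements, only arithmetic on the numbers above):

```python
# Back-of-envelope check of the sweep metrics above.
combos = 65_536
sec_per_combo = 1.60           # effective cluster-wide rate
rows = 139_678                 # OHLCV rows per backtest

duration_h = combos * sec_per_combo / 3600
throughput = 60 / sec_per_combo          # combos per minute
evaluations = rows * combos              # row evaluations across the sweep

print(f"{duration_h:.1f} h")             # 29.1 h
print(f"{throughput:.1f}/min")           # 37.5/min
print(f"{evaluations / 1e9:.2f}B")       # 9.15B
```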
## 📝 Adding Cluster Features

**When to use the cluster:**
- Parameter sweeps >10,000 combinations
- Backtests requiring >24 hours on the local machine
- Multi-strategy comparison (parallel execution needed)
- Production validation (test many configs simultaneously)

**Scaling guidelines:**
- <1,000 combos: local machine sufficient
- 1,000-10,000: single EPYC server (12 workers)
- 10,000-100,000: both EPYC servers (24 workers)
- 100,000+: consider cloud scaling (AWS Batch, etc.)
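The guidelines above reduce to a simple threshold lookup. This is a hypothetical helper encoding that list, not part of the repo:

```python
# Map a sweep size to the recommended setup from the scaling guidelines.
# Thresholds mirror the list above; the function itself is illustrative.
def recommended_setup(combos: int) -> str:
    if combos < 1_000:
        return "local machine"
    if combos <= 10_000:
        return "single EPYC server (12 workers)"
    if combos <= 100_000:
        return "both EPYC servers (24 workers)"
    return "cloud scaling (AWS Batch, etc.)"

print(recommended_setup(65_536))  # both EPYC servers (24 workers)
```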
## ⚠️ Important Notes

**Data transfer:**
- Always compress packages: `tar -czf` (1.9MB → 1.1MB)
- Verify checksums after transfer
- Use rsync for incremental updates

**Process management:**
- Always use `nohup` or `screen` for long-running coordinators
- Workers auto-terminate when chunks complete
- Coordinator sends a Telegram notification on completion

**Database safety:**
- SQLite WAL mode prevents most lock errors
- Back up exploration.db before major sweeps
- Never edit the chunks table manually while the coordinator is running

**SSH reliability:**
- ServerAliveInterval prevents silent disconnects
- StrictHostKeyChecking=no avoids interactive prompts
- ProxyJump handles nested hops automatically
See ../README.md for overall documentation structure.