# Distributed Computing & EPYC Cluster

**Infrastructure for large-scale parameter optimization and backtesting.**

This directory contains documentation for the EPYC cluster setup, distributed backtesting coordination, and multi-server infrastructure.

---

## 🖥️ Cluster Documentation

### **EPYC Server Setup**
- `EPYC_SETUP_COMPREHENSIVE.md` - **Complete setup guide**
  - Hardware: AMD EPYC 7282 16-Core Processor (Debian 12 Bookworm)
  - Python environment: 3.11.2 with pandas 2.3.3, numpy 2.3.5
  - SSH configuration: Nested hop (master → worker1 → worker2)
  - Package deployment: tar.gz transfer with virtual environment
  - **Status:** ✅ OPERATIONAL (24 workers processing 65,536 combos)

### **Distributed Architecture**
- `DUAL_SWEEP_README.md` - **Parallel sweep execution**
  - Coordinator: Assigns chunks to workers
  - Workers: Execute parameter combinations in parallel
  - Database: SQLite exploration.db for state tracking
  - Results: CSV files with top N configurations
  - **Use case:** v9 exhaustive parameter optimization (Nov 28-29, 2025)

### **Cluster Control**
- `CLUSTER_START_BUTTON_FIX.md` - **Web UI integration**
  - Dashboard: http://localhost:3001/cluster
  - Start/Stop buttons with status detection
  - Database-first status (SSH supplementary)
  - Real-time progress tracking
  - **Status:** ✅ DEPLOYED (Nov 30, 2025)

---

## 🏗️ Cluster Architecture

### **Physical Infrastructure**
```
Master Server (local development machine)
├── Coordinator Process (assigns chunks)
├── Database (exploration.db)
└── Web Dashboard (Next.js)
        ↓ [SSH]
Worker1 (EPYC 10.10.254.106)
├── 12 worker processes
├── 64GB RAM
└── Direct SSH connection
        ↓ [SSH ProxyJump]
Worker2 (EPYC 10.20.254.100)
├── 12 worker processes
├── 64GB RAM
└── Via worker1 hop
```

### **Data Flow**
```
1. Coordinator creates chunks (2,000 combos each)
   ↓
2. Marks chunk status='pending' in database
   ↓
3. Worker queries database for pending chunks
   ↓
4. Coordinator assigns chunk to worker via SSH
   ↓
5. Worker updates status='running'
   ↓
6. Worker processes combinations in parallel
   ↓
7. Worker saves results to strategies table
   ↓
8. Worker updates status='completed'
   ↓
9. Coordinator assigns next pending chunk
   ↓
10. Dashboard shows real-time progress
```

### **Database Schema**
```sql
-- chunks table: Work distribution
CREATE TABLE chunks (
    id TEXT PRIMARY KEY,       -- v9_chunk_000000
    start_combo INTEGER,       -- 0, 2000, 4000, etc.
    end_combo INTEGER,         -- 2000, 4000, 6000, etc.
    status TEXT,               -- 'pending', 'running', 'completed'
    assigned_worker TEXT,      -- 'worker1', 'worker2'
    started_at INTEGER,
    completed_at INTEGER
);

-- strategies table: Results storage
CREATE TABLE strategies (
    id INTEGER PRIMARY KEY,
    chunk_id TEXT,
    params TEXT,               -- JSON of parameter values
    pnl REAL,
    win_rate REAL,
    profit_factor REAL,
    max_drawdown REAL,
    total_trades INTEGER
);
```
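
### **Worker Claim Protocol (Sketch)**

The data flow and schema above imply a small claim/complete protocol against `exploration.db`. Below is a minimal sketch of the worker side of that protocol, assuming the schema exactly as written; the function names (`claim_chunk`, `save_result`, `complete_chunk`) and the database path are illustrative, not the actual `distributed_worker.py` API. The guarded `UPDATE ... WHERE status='pending'` is what keeps two workers from grabbing the same chunk.

```python
import json
import sqlite3
import time

DB_PATH = "exploration.db"  # illustrative path; the real deployment may differ

def connect(db_path: str = DB_PATH) -> sqlite3.Connection:
    """Open the exploration DB with the lock-avoidance settings recommended below."""
    conn = sqlite3.connect(db_path, timeout=10)
    conn.execute("PRAGMA journal_mode=WAL")    # readers no longer block the writer
    conn.execute("PRAGMA busy_timeout=10000")  # wait up to 10s instead of erroring
    return conn

def claim_chunk(conn: sqlite3.Connection, worker: str):
    """Atomically move one pending chunk to 'running'; return (id, start, end) or None."""
    while True:
        row = conn.execute(
            "SELECT id, start_combo, end_combo FROM chunks "
            "WHERE status='pending' ORDER BY start_combo LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # nothing left to do
        with conn:  # transaction: commit on success, rollback on exception
            cur = conn.execute(
                "UPDATE chunks SET status='running', assigned_worker=?, started_at=? "
                "WHERE id=? AND status='pending'",
                (worker, int(time.time()), row[0]),
            )
        if cur.rowcount == 1:
            return row  # claim succeeded
        # another worker claimed it first; loop and try the next pending chunk

def save_result(conn: sqlite3.Connection, chunk_id: str, params: dict, m: dict) -> None:
    """Insert one backtest result row into the strategies table."""
    with conn:
        conn.execute(
            "INSERT INTO strategies "
            "(chunk_id, params, pnl, win_rate, profit_factor, max_drawdown, total_trades) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (chunk_id, json.dumps(params), m["pnl"], m["win_rate"],
             m["profit_factor"], m["max_drawdown"], m["total_trades"]),
        )

def complete_chunk(conn: sqlite3.Connection, chunk_id: str) -> None:
    """Mark a chunk finished so the coordinator can hand out the next one."""
    with conn:
        conn.execute(
            "UPDATE chunks SET status='completed', completed_at=? WHERE id=?",
            (int(time.time()), chunk_id),
        )
```

In this sketch the worker pulls its own work; in the documented flow the coordinator also pushes assignments over SSH, so the real system likely combines both.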
---

## 🚀 Using the Cluster

### **Starting a Sweep**
```bash
# 1. Prepare package on master
cd /home/icke/traderv4/backtester
tar -czf backtest_v9_sweep.tar.gz data/ backtester_core.py v9_moneyline_ma_gap.py moneyline_core.py

# 2. Transfer to EPYC workers
scp backtest_v9_sweep.tar.gz root@10.10.254.106:/home/backtest/
ssh root@10.10.254.106 "scp backtest_v9_sweep.tar.gz root@10.20.254.100:/home/backtest/"

# 3. Extract on workers
ssh root@10.10.254.106 "cd /home/backtest && tar -xzf backtest_v9_sweep.tar.gz"
ssh root@10.10.254.106 "ssh root@10.20.254.100 'cd /home/backtest && tar -xzf backtest_v9_sweep.tar.gz'"

# 4. Start via web dashboard or CLI
# Web: http://localhost:3001/cluster → Click "Start Cluster"
# CLI: cd /home/icke/traderv4/cluster && python v9_advanced_coordinator.py
```

### **Monitoring Progress**
```bash
# Dashboard
curl -s http://localhost:3001/api/cluster/status | jq

# Database query
sqlite3 cluster/exploration.db "SELECT status, COUNT(*) FROM chunks GROUP BY status;"

# Worker processes
ssh root@10.10.254.106 "ps aux | grep [p]ython | grep backtest | wc -l"
```

### **Collecting Results**
```bash
# Results saved to cluster/results/
ls -lh cluster/results/sweep_v9_*.csv

# Top 100 configurations by P&L
sqlite3 cluster/exploration.db "SELECT params, pnl, win_rate FROM strategies ORDER BY pnl DESC LIMIT 100;"
```

---

## 🔧 Configuration

### **Coordinator Settings**
```python
# cluster/v9_advanced_coordinator.py
WORKERS = {
    'worker1': {'host': '10.10.254.106', 'port': 22},
    'worker2': {'host': '10.20.254.100', 'port': 22, 'proxy_jump': '10.10.254.106'}
}
CHUNK_SIZE = 2000      # Combinations per chunk
MAX_WORKERS = 24       # 12 per server
CHECK_INTERVAL = 60    # Status check frequency (seconds)
```

### **Worker Settings**
```python
# cluster/distributed_worker.py
NUM_PROCESSES = 12     # Parallel backtests
BATCH_SIZE = 100       # Save results every N combos
TIMEOUT = 120          # Per-combo timeout (seconds)
```

### **SSH Configuration**
```bash
# ~/.ssh/config
Host worker1
    HostName 10.10.254.106
    User root
    StrictHostKeyChecking no
    ServerAliveInterval 30

Host worker2
    HostName 10.20.254.100
    User root
    ProxyJump worker1
    StrictHostKeyChecking no
    ServerAliveInterval 30
```

---

## 🐛 Common Issues

### **SSH Timeout Errors**
**Symptom:** "SSH command timed out for worker2"
**Root Cause:** Nested hop requires 60s timeout (not 30s)
**Fix:** Common Pitfall #64 - Increase subprocess timeout
```python
result = subprocess.run(ssh_cmd, timeout=60)  # Not 30
```

### **Database Lock Errors**
**Symptom:** "database is locked"
**Root Cause:** Multiple workers writing simultaneously
**Fix:** Use WAL mode + increase busy_timeout
```python
connection.execute('PRAGMA journal_mode=WAL')
connection.execute('PRAGMA busy_timeout=10000')
```

### **Worker Not Processing**
**Symptom:** Chunk status='running' but no worker processes
**Root Cause:** Worker crashed or SSH session died
**Fix:** Clean up the database and restart (see the requeue sketch after this section)
```bash
# Mark stuck chunks as pending
sqlite3 exploration.db "UPDATE chunks SET status='pending', assigned_worker=NULL WHERE status='running' AND started_at < (strftime('%s','now') - 600);"
```

### **Status Shows "Idle" When Running**
**Symptom:** Dashboard shows idle despite workers running
**Root Cause:** SSH detection timing out, database not queried first
**Fix:** Database-first status detection (Common Pitfall #71)
```typescript
// Check database BEFORE SSH
const hasRunningChunks = explorationData.chunks.running > 0
if (hasRunningChunks) clusterStatus = 'active'
```
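
### **Stale-Chunk Requeue (Sketch)**

The "Worker Not Processing" and "SSH Timeout" fixes above can be folded into a periodic health check on the coordinator. A minimal sketch, assuming the `chunks` schema shown earlier and the same 10-minute staleness cutoff as the SQL fix; `requeue_stale_chunks` and `count_worker_processes` are illustrative names, not the actual coordinator API.

```python
import sqlite3
import subprocess
import time

STALE_AFTER_S = 600  # matches the 10-minute cutoff in the SQL fix above

def requeue_stale_chunks(conn: sqlite3.Connection) -> int:
    """Return chunks stuck in 'running' past the cutoff back to 'pending'."""
    cutoff = int(time.time()) - STALE_AFTER_S
    with conn:
        cur = conn.execute(
            "UPDATE chunks SET status='pending', assigned_worker=NULL "
            "WHERE status='running' AND started_at < ?",
            (cutoff,),
        )
    return cur.rowcount  # number of chunks requeued

def count_worker_processes(host: str) -> int:
    """Count live backtest processes on a worker over SSH (60s timeout per Pitfall #64)."""
    cmd = ["ssh", host, "ps aux | grep [p]ython | grep backtest | wc -l"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
        return int(out.stdout.strip() or 0)
    except (subprocess.TimeoutExpired, ValueError):
        return 0  # treat an unreachable worker as having no live processes
```

Run something like this every `CHECK_INTERVAL` seconds from the coordinator loop; the dashboard's database-first status check (Pitfall #71) should still run before any SSH probe, exactly as the TypeScript snippet above does.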
---

## 📊 Performance Metrics

**v9 Exhaustive Sweep (65,536 combos):**
- **Duration:** ~29 hours (24 workers)
- **Speed:** ~1.6s per combo across the cluster (vs ~4s estimated on 6 local workers, i.e. ~2.5× faster)
- **Throughput:** ~37 combos/minute across cluster
- **Data processed:** 139,678 OHLCV rows × 65,536 combos ≈ 9.15B row-level calculations
- **Results:** Top 100 saved to CSV (~10KB file)

**Cost Analysis:**
- Local (6 workers): 72 hours estimated
- EPYC (24 workers): 29 hours actual
- **Time savings:** 43 hours (~60% faster)
- **Resource utilization:** 2× 16-core EPYC (32 cores / 64 threads) vs 6 local workers

---

## 📝 Adding Cluster Features

**When to Use Cluster:**
- Parameter sweeps >10,000 combinations
- Backtests requiring >24 hours on the local machine
- Multi-strategy comparison (needs parallel execution)
- Production validation (test many configs simultaneously)

**Scaling Guidelines:**
- **<1,000 combos:** Local machine sufficient
- **1,000-10,000:** Single EPYC server (12 workers)
- **10,000-100,000:** Both EPYC servers (24 workers)
- **100,000+:** Consider cloud scaling (AWS Batch, etc.)

---

## ⚠️ Important Notes

**Data Transfer:**
- Always compress packages: `tar -czf` (1.9MB → 1.1MB)
- Verify checksums after transfer
- Use rsync for incremental updates

**Process Management:**
- Always use `nohup` or `screen` for long-running coordinators
- Workers auto-terminate when their chunks complete
- Coordinator sends a Telegram notification on completion

**Database Safety:**
- SQLite WAL mode prevents most lock errors
- Back up exploration.db before major sweeps (see the sketch at the end of this file)
- Never edit the chunks table manually while the coordinator is running

**SSH Reliability:**
- ServerAliveInterval prevents silent disconnects
- StrictHostKeyChecking=no avoids interactive prompts
- ProxyJump handles nested hops automatically

---

See `../README.md` for overall documentation structure.
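
---

## 📎 Appendix: Pre-Sweep Database Backup (Sketch)

The "Database Safety" note above recommends backing up `exploration.db` before major sweeps. A minimal sketch using Python's built-in `sqlite3.Connection.backup()`, which produces a consistent copy even while the coordinator holds the database open; the paths and naming convention are illustrative assumptions, not part of the documented tooling.

```python
import sqlite3
import time
from pathlib import Path

def backup_exploration_db(db_path: str = "cluster/exploration.db",
                          backup_dir: str = "cluster/backups") -> Path:
    """Copy the live exploration DB to a timestamped file before starting a sweep."""
    Path(backup_dir).mkdir(parents=True, exist_ok=True)
    dest = Path(backup_dir) / f"exploration_{time.strftime('%Y%m%d_%H%M%S')}.db"
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(dest)
    try:
        src.backup(dst)  # SQLite online backup: safe alongside open connections
    finally:
        dst.close()
        src.close()
    return dest

if __name__ == "__main__":
    print(f"Backup written to {backup_exploration_db()}")
```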