# EPYC Cluster Setup and Access Guide

## Overview

Two AMD EPYC 16-core servers running distributed parameter exploration for trading bot optimization.

**Total Capacity:** 64 logical cores processing ~12M parameter combinations

---

## Server Access

### Worker1: pve-nu-monitor01 (Direct SSH)

```bash
# Direct access from srvdocker02
ssh root@10.10.254.106
```

**Specs:**
- Hostname: pve-nu-monitor01
- IP: 10.10.254.106
- CPU: AMD EPYC 7282 16-Core Processor (32 threads with SMT)
- Location: /home/comprehensive_sweep/backtester/

### Worker2: bd-host01 (SSH Hop Required)

```bash
# Access via 2-hop through worker1
ssh root@10.10.254.106 "ssh root@10.20.254.100 'COMMAND'"

# SCP via 2-hop
scp FILE root@10.10.254.106:/tmp/
ssh root@10.10.254.106 "scp /tmp/FILE root@10.20.254.100:/path/"
```

**Specs:**
- Hostname: bd-host01
- IP: 10.20.254.100 (only reachable from worker1)
- CPU: AMD EPYC 7282 16-Core Processor (32 threads with SMT)
- Location: /home/backtest_dual/backtest/

### Coordinator: srvdocker02 (Local)

```bash
# Running on the trading bot server
cd /home/icke/traderv4/cluster/
```

**Specs:**
- Hostname: srvdocker02
- Role: Orchestrates the distributed sweep, hosts the trading bot
- Database: SQLite at /home/icke/traderv4/cluster/exploration.db

---

## Directory Structure

### Worker1 Structure

```
/home/comprehensive_sweep/backtester/
├── data/
│   └── solusdt_5m_aug_nov.csv   # OHLCV data
├── indicators/
│   └── money_line.py            # Money Line indicator
├── scripts/
│   └── distributed_worker.py    # Worker script
├── simulator.py                 # Backtesting engine
├── data_loader.py               # Data loading utilities
└── .venv/                       # Python environment
```

### Worker2 Structure

```
/home/backtest_dual/backtest/
├── backtester/
│   ├── data/
│   │   └── solusdt_5m.csv           # OHLCV data (copied from worker1)
│   ├── indicators/
│   │   └── money_line.py
│   ├── scripts/
│   │   └── distributed_worker.py    # Modified for bd-host01
│   ├── simulator.py
│   └── data_loader.py
└── .venv/                           # Python environment
```

### Coordinator Structure

```
/home/icke/traderv4/cluster/
├── distributed_coordinator.py       # Main orchestrator
├── distributed_worker.py            # Worker script (template for worker1)
├── distributed_worker_bd_clean.py   # Worker script (template for worker2)
├── monitor_bd_host01.sh             # Monitoring script
├── exploration.db                   # Chunk tracking database
└── chunk_*.json                     # Chunk specifications
```

---

## How It Works

### 1. Coordinator (srvdocker02)
- Splits the ~12M-combination parameter space into chunks (10,000 combos each)
- Stores chunk assignments in the SQLite database
- Deploys chunk specs and worker scripts via SSH/SCP
- Starts workers via SSH with nohup (background execution)
- Monitors chunk completion and collects results

### 2. Workers (EPYCs)
- Each processes its assigned chunks independently (see the sketch below)
- Uses multiprocessing.Pool with a **70% CPU limit** (22 of 32 threads)
- Outputs results to CSV files in its workspace
- Logs progress to /tmp/v9_chunk_XXXXXX.log
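The loop below is a minimal sketch of a worker's chunk processing, not the actual distributed_worker.py: the chunk-spec layout (a `combos` list) and the `run_backtest` helper are illustrative assumptions, while the 70% pool sizing and the one-CSV-per-chunk output convention come from this guide.

```python
import csv
import json
import multiprocessing as mp
import os

def run_backtest(params):
    """Hypothetical stand-in for the real simulator.py call."""
    return {**params, "pnl": 0.0}  # placeholder result row

def process_chunk(spec_path):
    # Assumed chunk-spec layout: {"combos": [{param: value, ...}, ...]}
    with open(spec_path) as f:
        combos = json.load(f)["combos"]

    # 70% CPU limit, as in distributed_worker.py: int(32 * 0.7) = 22
    max_workers = max(1, int((os.cpu_count() or 1) * 0.7))
    with mp.Pool(processes=max_workers) as pool:
        results = pool.map(run_backtest, combos)
    if not results:
        return

    # One immutable CSV per chunk, e.g. chunk_v9_chunk_000001_results.csv
    out_path = spec_path.replace(".json", "_results.csv")
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(results[0].keys()))
        writer.writeheader()
        writer.writerows(results)

if __name__ == "__main__":
    process_chunk("chunk_v9_chunk_000001.json")
```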
### 3. Results Collection
- Workers save results to: `chunk_v9_chunk_XXXXXX_results.csv`
- The coordinator can fetch results via SCP
- The trading bot API endpoint serves results to the web UI

---

## Common Operations

### Start Distributed Sweep

```bash
cd /home/icke/traderv4/cluster/

# Clear old chunks and start fresh
rm -f exploration.db
nohup python3 distributed_coordinator.py > sweep.log 2>&1 &

# Monitor progress
tail -f sweep.log
```

### Monitor Worker Status

```bash
# Check worker1
ssh root@10.10.254.106 "top -bn1 | grep Cpu && ps aux | grep distributed_worker | wc -l"

# Check worker2 (via hop)
ssh root@10.10.254.106 "ssh root@10.20.254.100 'top -bn1 | grep Cpu && ps aux | grep distributed_worker | wc -l'"

# Or use the monitoring script
/home/icke/traderv4/cluster/monitor_bd_host01.sh
```

### Fetch Results

```bash
# Worker1 results
scp root@10.10.254.106:/home/comprehensive_sweep/backtester/chunk_*_results.csv ./

# Worker2 results (2-hop)
ssh root@10.10.254.106 "scp root@10.20.254.100:/home/backtest_dual/backtest/chunk_*_results.csv /tmp/"
scp root@10.10.254.106:/tmp/chunk_*_results.csv ./
```

### View Results in Web UI

Access the cluster status page at http://localhost:3001/cluster or https://tradervone.v4.dedyn.io/cluster.

It shows:
- Real-time CPU usage and worker status
- Exploration progress
- Top 5 strategies with parameters
- AI recommendations for next actions

### Kill All Workers

```bash
# Kill worker1
ssh root@10.10.254.106 "pkill -f distributed_worker"

# Kill worker2
ssh root@10.10.254.106 "ssh root@10.20.254.100 'pkill -f distributed_worker'"

# Kill coordinator
pkill -f distributed_coordinator
```

---

## CPU Limit Configuration

### Why 70%?
- Prevents server overload
- Leaves headroom for system operations
- Balances throughput against stability

### Implementation

Both worker scripts limit CPU via multiprocessing.Pool:

```python
# In distributed_worker.py and distributed_worker_bd_clean.py
max_workers = max(1, int(num_workers * 0.7))  # 70% of 32 threads = 22
with mp.Pool(processes=max_workers) as pool:
    ...  # processing happens here
```

**Expected CPU Usage:** 67-72% user time on each EPYC

---

## Troubleshooting

### Worker Not Starting

```bash
# Check worker logs
ssh root@10.10.254.106 "tail -100 /tmp/v9_chunk_*.log"
ssh root@10.10.254.106 "ssh root@10.20.254.100 'tail -100 /tmp/v9_chunk_*.log'"
```

Common issues:
1. Import errors - check sys.path and the module structure
2. Data file missing - verify solusdt_5m*.csv exists
3. Virtual env activation failed - check the .venv/bin/activate path

### SSH Hop Issues (Worker2)

```bash
# Test 2-hop connectivity
ssh root@10.10.254.106 "ssh root@10.20.254.100 'echo SUCCESS'"
```

If this fails, check that:
- Worker1 can reach worker2: `ssh root@10.10.254.106 "ping -c 3 10.20.254.100"`
- SSH keys are set up between worker1 and worker2

### Python Bytecode Cache Issues

```bash
# Clear .pyc files if code changes don't take effect
find /home/icke/traderv4/cluster -name "*.pyc" -delete
find /home/icke/traderv4/cluster -name "__pycache__" -type d -exec rm -rf {} +
```

### Database Lock Issues

```bash
# If the coordinator fails to start due to a DB lock
cd /home/icke/traderv4/cluster/
pkill -f distributed_coordinator  # Kill any running coordinators
rm -f exploration.db              # Delete the database (resets all chunk tracking)
# Then restart the coordinator
```
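Before deleting exploration.db, it can be worth checking what state it holds. A minimal inspection sketch, assuming the coordinator tracks chunks in a table with a status column; the table and column names (`chunks`, `status`) are guesses for illustration, so check distributed_coordinator.py for the real schema:

```python
import sqlite3

DB = "/home/icke/traderv4/cluster/exploration.db"

with sqlite3.connect(DB, timeout=5) as conn:
    # Schema-agnostic: list whatever tables the coordinator created.
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    print("tables:", tables)

    # Assumed table/column names - adjust to the actual schema.
    for status, count in conn.execute(
            "SELECT status, COUNT(*) FROM chunks GROUP BY status"):
        print(f"{status}: {count}")
```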
---

## Parameter Space

**Total Combinations:** 11,943,936

**14 Parameters:**
1. flip_threshold: 0.4, 0.5, 0.6, 0.7 (4 values)
2. ma_gap: 0.20, 0.30, 0.40, 0.50 (4 values)
3. adx_min: 18, 21, 24, 27 (4 values)
4. long_pos_max: 60, 65, 70, 75 (4 values)
5. short_pos_min: 20, 25, 30, 35 (4 values)
6. cooldown: 1, 2, 3, 4 (4 values)
7. position_size: 0.1-1.0 in 0.1 increments (10 values)
8. tp1_mult: 1.5-3.0 in 0.5 increments (4 values)
9. tp2_mult: 3.0-6.0 in 1.0 increments (4 values)
10. sl_mult: 2.0-4.0 in 0.5 increments (5 values)
11. tp1_close_pct: 0.5-0.8 in 0.1 increments (4 values)
12. trailing_mult: 1.0-2.5 in 0.5 increments (4 values)
13. vol_min: 0.8-1.4 in 0.2 increments (4 values)
14. max_bars: 10, 15, 20, 25 (4 values)

**Chunk Size:** 10,000 combinations
**Total Chunks:** 1,195

---

## Web UI Integration

### API Endpoint

```typescript
// GET /api/cluster/status
// Returns:
{
  cluster: {
    totalCores: 64,
    activeCores: 45,
    cpuUsage: 70.5,
    activeWorkers: 2,
    status: "active"
  },
  workers: [...],
  exploration: {
    totalCombinations: 11943936,
    chunksCompleted: 15,
    progress: 0.0126
  },
  topStrategies: [...],
  recommendation: "AI-generated action items"
}
```

### Frontend Page
- Location: `/home/icke/traderv4/app/cluster/page.tsx`
- Auto-refreshes every 30 seconds
- Shows real-time cluster status
- Displays top strategies with parameters
- Provides AI recommendations

---

## Files Created/Modified

**New Files:**
- `cluster/distributed_coordinator.py` - Main orchestrator (510 lines)
- `cluster/distributed_worker.py` - Worker script for worker1 (271 lines)
- `cluster/distributed_worker_bd_clean.py` - Worker script for worker2 (275 lines)
- `cluster/monitor_bd_host01.sh` - Monitoring script
- `app/api/cluster/status/route.ts` - API endpoint for the web UI (274 lines)
- `app/cluster/page.tsx` - Web UI page (258 lines)
- `cluster/CLUSTER_SETUP.md` - This documentation

**Modified Files:**
- Docker image rebuilt with the new API endpoint and cluster page

---

## Next Steps

1. **Monitor first chunk completion** (~10-30 min)
2. **Analyze top strategies** via the web UI at `/cluster`
3. **Scale to the full sweep** - all 1,195 chunks across both EPYCs
4. **Implement the best parameters** in the production trading bot
5. **Iterate** - refine the grid based on results

---

## Notes

- **The 70% CPU limit ensures system stability** while maximizing throughput
- **The coordinator keeps no in-memory state** - everything lives in SQLite, so it can be restarted at any time
- **Workers are autonomous** - they process chunks independently, with no inter-worker coordination
- **Results are immutable** - each chunk produces one CSV that is never overwritten
- **The web UI provides actionable insights** - no manual CSV analysis needed

**Last Updated:** November 30, 2025
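---

For scripted monitoring outside the web UI, the status endpoint described above can also be polled directly. A minimal standard-library sketch; the base URL comes from the "View Results in Web UI" section and the field names from the response shape under "API Endpoint":

```python
import json
import time
import urllib.request

URL = "http://localhost:3001/api/cluster/status"

# Poll every 30 seconds, matching the web UI's refresh interval,
# and print a one-line progress summary.
while True:
    with urllib.request.urlopen(URL, timeout=10) as resp:
        status = json.load(resp)
    cluster = status["cluster"]
    exploration = status["exploration"]
    print(
        f"{cluster['status']}: {cluster['activeWorkers']} workers, "
        f"CPU {cluster['cpuUsage']:.1f}%, "
        f"{exploration['chunksCompleted']} chunks done "
        f"({exploration['progress'] * 100:.2f}%)"
    )
    time.sleep(30)
```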