# Distributed Continuous Optimization Cluster

**24/7 automated strategy discovery** across 2 EPYC servers (64 threads total). Explores the entire indicator/parameter space to find the best-performing trading approach.

## 🏗️ Architecture

**Three-Component Distributed System:**

1. **Coordinator** (`distributed_coordinator.py`)
   - Master orchestrator running on srvdocker02
   - Defines the parameter grid (14 dimensions, ~500k combinations)
   - Splits the work into chunks (e.g., 10,000 combos per chunk)
   - Deploys the worker script to the EPYC servers via SSH/SCP
   - Assigns chunks to idle workers dynamically
   - Collects CSV results and imports them into the SQLite database
   - Tracks progress (completed/running/pending chunks)

2. **Worker** (`distributed_worker.py`)
   - Runs on the EPYC servers
   - Integrates with the existing `/home/comprehensive_sweep/backtester/` infrastructure
   - Uses the proven `simulator.py` vectorized engine and the `MoneyLineInputs` class
   - Loads a chunk spec (start_idx, end_idx into the total parameter grid)
   - Generates parameter combinations via `itertools.product()` (see the indexing sketch after Quick Start step 1)
   - Runs a multiprocessing sweep with `mp.cpu_count()` workers
   - Saves results to CSV (same format as comprehensive_sweep.py)

3. **Monitor** (`exploration_status.py`)
   - Real-time status dashboard
   - SSH worker health checks (active distributed_worker.py processes)
   - Chunk progress tracking (total/completed/running/pending)
   - Top 10 strategies leaderboard (P&L, trades, WR, PF, DD)
   - Best configuration details (full parameters)
   - Watch mode for continuous monitoring (30s refresh)

**Infrastructure:**

- **Worker 1:** pve-nu-monitor01 (10.10.254.106) - EPYC 7282, 32 threads, 62GB RAM
- **Worker 2:** pve-srvmon01 (10.20.254.100, reached via a 2-hop SSH through Worker 1) - EPYC 7302, 32 threads, 31GB RAM
- **Combined:** 64 threads, ~108,000 backtests/day capacity (proven: 65,536 in 29h)
- **Existing backtester:** `/home/comprehensive_sweep/backtester/` with simulator.py, indicators/, data/
- **Data:** `solusdt_5m.csv` - Binance 5-minute OHLCV (Nov 2024 - Nov 2025)
- **Database:** `exploration.db` - SQLite with strategies/chunks/phases tables

## 🚀 Quick Start

### 1. Test with a Small Chunk (RECOMMENDED FIRST)

Verify the system works before large-scale deployment:

```bash
cd /home/icke/traderv4/cluster

# Modify distributed_coordinator.py temporarily (lines 120-135):
# reduce parameter ranges to 2-3 values per dimension,
# giving ~500-1000 combinations for testing

# Run test
python3 distributed_coordinator.py --chunk-size 100

# Monitor in a separate terminal
python3 exploration_status.py --watch
```

**Expected:** 5-10 chunks complete in 30-60 minutes, all results in `exploration.db`

**Verify:**

- SSH commands execute successfully
- Worker script deploys to `/home/comprehensive_sweep/backtester/scripts/`
- CSV results appear in `cluster/distributed_results/`
- Database is populated with strategies (check with `sqlite3 exploration.db "SELECT COUNT(*) FROM strategies"`)
- Monitoring dashboard shows accurate worker/chunk status
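The chunk mechanism described in the Architecture section works because `itertools.product()` enumerates the parameter grid in a fixed, deterministic order, so a worker can reproduce exactly the combinations `[start_idx, end_idx)` from those two indices alone. A minimal sketch of that idea (the reduced `PARAM_GRID` and the helper name `combos_for_chunk` are illustrative, not the actual `distributed_worker.py` code):

```python
import itertools

# Illustrative 3-dimension grid; the real grid has 14 dimensions (~500k combos)
PARAM_GRID = {
    "flip_threshold": [0.4, 0.5, 0.6, 0.7],
    "tp1_multiplier": [1.5, 2.0, 2.5],
    "sl_multiplier": [2.0, 3.0, 4.0],
}

def combos_for_chunk(start_idx: int, end_idx: int):
    """Yield the parameter combinations [start_idx, end_idx) of the full grid."""
    keys = list(PARAM_GRID)
    full_grid = itertools.product(*(PARAM_GRID[k] for k in keys))
    for values in itertools.islice(full_grid, start_idx, end_idx):
        yield dict(zip(keys, values))

if __name__ == "__main__":
    # First 5 combinations of a hypothetical chunk assignment
    for combo in combos_for_chunk(0, 5):
        print(combo)
```

Because the enumeration is deterministic, the coordinator only has to ship two integers per chunk; no parameter lists ever cross the wire.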
### 2. Run Full v9 Parameter Sweep

After the test succeeds, explore the full parameter space:

```bash
cd /home/icke/traderv4/cluster

# Restore the full parameter ranges in distributed_coordinator.py
# Total: ~500,000 combinations

# Start exploration (runs in background)
nohup python3 distributed_coordinator.py --chunk-size 10000 > sweep.log 2>&1 &

# Monitor progress
python3 exploration_status.py --watch
# OR:
watch -n 60 'python3 exploration_status.py'

# Check logs
tail -f sweep.log
```

**Expected Results:**

- Duration: ~3.5 hours with 64 threads
- Find 5-10 configurations with P&L > $250/1k (baseline: $192/1k)
- Quality filters: 700+ trades, 50-70% WR, PF ≥ 1.2

### 3. Query Top Strategies

```bash
# Top 20 performers
sqlite3 cluster/exploration.db <<EOF
SELECT pnl_per_1k, trades, win_rate, profit_factor, max_drawdown, params_json
FROM strategies
WHERE trades >= 700
  AND win_rate >= 0.50
  AND win_rate <= 0.70
  AND profit_factor >= 1.2
ORDER BY pnl_per_1k DESC
LIMIT 20;
EOF
```

## 📊 Parameter Space (14 Dimensions)

**v9 Money Line Configuration:**

```python
ParameterGrid(
    flip_thresholds=[0.4, 0.5, 0.6, 0.7],      # EMA flip confirmation (4 values)
    ma_gaps=[0.20, 0.30, 0.40, 0.50],          # MA50-MA200 convergence bonus (4 values)
    adx_mins=[18, 21, 24, 27],                 # ADX requirement for momentum filter (4 values)
    long_pos_maxs=[60, 65, 70, 75],            # Price position for LONG momentum (4 values)
    short_pos_mins=[20, 25, 30, 35],           # Price position for SHORT momentum (4 values)
    cooldowns=[1, 2, 3, 4],                    # Bars between signals (4 values)
    position_sizes=[1.0],                      # Full position (1 value, fixed)
    tp1_multipliers=[1.5, 2.0, 2.5],           # TP1 as ATR multiple (3 values)
    tp2_multipliers=[3.0, 4.0, 5.0],           # TP2 as ATR multiple (3 values)
    sl_multipliers=[2.0, 3.0, 4.0],            # SL as ATR multiple (3 values)
    tp1_close_percents=[0.5, 0.6, 0.7, 0.75],  # TP1 close % (4 values)
    trailing_multipliers=[1.0, 1.5, 2.0],      # Trailing stop multiplier (3 values)
    vol_mins=[0.8, 1.0, 1.2],                  # Minimum volume ratio (3 values)
    max_bars_list=[100, 150, 200]              # Max bars in position (3 values)
)

# Total: 4×4×4×4×4×4×1×3×3×3×4×3×3×3 ≈ 497,664 combinations
```

## 🎯 Quality Filters

**Applied to all strategy results:**

- **Minimum trades:** 700+ (statistical significance)
- **Win rate range:** 50-70% (realistic; avoids overfitting)
- **Profit factor:** ≥ 1.2 (solid edge)
- **Max drawdown:** tracked but no hard limit (informational)

**Why these filters:**

- The trade count validates statistical robustness
- The WR range prevents curve-fitting (>70% = overfit, <50% = coin flip)
- The PF threshold ensures the strategy has a genuine edge
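These filters are simple column comparisons, so they can be applied directly to any worker CSV or to a DataFrame pulled from `exploration.db`. A minimal pandas sketch, assuming the CSV columns mirror the strategies table (`trades`, `win_rate`, `profit_factor`, `pnl_per_1k`); the helper name is illustrative:

```python
import pandas as pd

def apply_quality_filters(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only statistically robust, non-overfit strategies and rank them."""
    mask = (
        (df["trades"] >= 700)            # statistical significance
        & (df["win_rate"] >= 0.50)       # better than a coin flip
        & (df["win_rate"] <= 0.70)       # suspiciously high WR usually means overfitting
        & (df["profit_factor"] >= 1.2)   # demonstrable edge
    )
    return df[mask].sort_values("pnl_per_1k", ascending=False)

# Example: rank one worker chunk's results
results = pd.read_csv("cluster/distributed_results/worker1_chunk_0.csv")
print(apply_quality_filters(results).head(10))
```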
## 📈 Expected Results

**Current Baseline (v9 default parameters):**

- P&L: $192 per $1k capital
- Trades: ~700
- Win Rate: ~61%
- Profit Factor: ~1.4

**Optimization Goals:**

- **Target:** >$250/1k P&L (30% improvement)
- **Stretch:** >$300/1k P&L (56% improvement)
- **Expected:** find 5-10 configurations meeting the quality filters with P&L > $250/1k

**Why this is achievable:**

- 500k combinations vs the 27 tested in the earlier narrow sweep
- Full parameter-space exploration vs a limited grid
- Proven infrastructure (65,536 backtests completed successfully)

## 🔄 Continuous Exploration Roadmap

**Phase 1: v9 Money Line Parameter Optimization (~500k combos, 3.5h)**

- Status: READY TO RUN
- Goal: find optimal flip_threshold, ma_gap, and momentum filters
- Expected: >$250/1k P&L

**Phase 2: RSI Divergence Integration (~100k combos, 45min)**

- Add RSI divergence detection
- Combine with the v9 momentum filter
- Parameters: RSI lookback, divergence strength threshold
- Goal: catch trend reversals early

**Phase 3: Volume Profile Analysis (~200k combos, 1.5h)**

- Volume profile zones (POC, VAH, VAL)
- Order flow imbalance detection
- Parameters: profile window, entry threshold, confirmation bars
- Goal: better entry timing

**Phase 4: Multi-Timeframe Confirmation (~150k combos, 1h)**

- 5min + 15min + 1H alignment
- Higher-timeframe trend filter
- Parameters: timeframes to use, alignment strictness
- Goal: reduce false signals

**Phase 5: Hybrid Indicators (~50k combos, 30min)**

- Combine the best performers from Phases 1-4
- Test cross-strategy synergy
- Goal: break the $300/1k barrier

**Phase 6: ML-Based Optimization (~100k+ combos, 1h+)**

- Feature engineering from the top strategies
- Gradient boosting / random forest
- Genetic-algorithm parameter tuning
- Goal: discover non-obvious patterns

## 📁 File Structure

```
cluster/
├── distributed_coordinator.py   # Master orchestrator (650 lines)
├── distributed_worker.py        # Worker script (350 lines)
├── exploration_status.py        # Monitoring dashboard (200 lines)
├── exploration.db               # SQLite results database
├── distributed_results/         # CSV results from workers
│   ├── worker1_chunk_0.csv
│   ├── worker1_chunk_1.csv
│   └── worker2_chunk_0.csv
└── README.md                    # This file

/home/comprehensive_sweep/backtester/   (on the EPYC servers)
├── simulator.py                 # Core vectorized engine
├── indicators/
│   ├── money_line.py            # MoneyLineInputs class
│   └── ...
├── data/
│   └── solusdt_5m.csv           # Binance 5-minute OHLCV
├── scripts/
│   ├── comprehensive_sweep.py   # Original multiprocessing sweep
│   └── distributed_worker.py    # Deployed by the coordinator
└── .venv/                       # Python 3.11.2, pandas, numpy
```

## 💾 Database Schema

### strategies table

```sql
CREATE TABLE strategies (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    phase_id INTEGER,            -- Which exploration phase (1=v9, 2=RSI, etc.)
    params_json TEXT NOT NULL,   -- JSON parameter configuration
    pnl_per_1k REAL,             -- Performance metric ($ PnL per $1k)
    trades INTEGER,              -- Total trades in backtest
    win_rate REAL,               -- Decimal win rate (0.61 = 61%)
    profit_factor REAL,          -- Gross profit / gross loss
    max_drawdown REAL,           -- Largest peak-to-trough decline (decimal)
    tested_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (phase_id) REFERENCES phases(id)
);

CREATE INDEX idx_strategies_pnl ON strategies(pnl_per_1k DESC);
CREATE INDEX idx_strategies_trades ON strategies(trades);
```

### chunks table

```sql
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    phase_id INTEGER,
    worker_id TEXT,                 -- 'worker1' or 'worker2'
    start_idx INTEGER,              -- Start index in parameter grid
    end_idx INTEGER,                -- End index (exclusive)
    total_combos INTEGER,           -- Total in this chunk
    status TEXT DEFAULT 'pending',  -- pending/running/completed/failed
    assigned_at TIMESTAMP,
    completed_at TIMESTAMP,
    result_file TEXT,               -- Path to CSV result file
    FOREIGN KEY (phase_id) REFERENCES phases(id)
);

CREATE INDEX idx_chunks_status ON chunks(status);
```

### phases table

```sql
CREATE TABLE phases (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,          -- 'v9_optimization', 'rsi_divergence', etc.
    description TEXT,
    total_combinations INTEGER,  -- Total parameter combinations
    started_at TIMESTAMP,
    completed_at TIMESTAMP
);
```
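The coordinator's CSV import boils down to reading a chunk result file and inserting one row per tested configuration into the `strategies` table above. A minimal sketch of that step, assuming the CSV carries the metric columns shown in the schema plus one column per swept parameter (`import_chunk_results` is an illustrative name, not the actual coordinator function):

```python
import json
import sqlite3

import pandas as pd

METRIC_COLS = ["pnl_per_1k", "trades", "win_rate", "profit_factor", "max_drawdown"]

def import_chunk_results(csv_path: str, db_path: str = "exploration.db", phase_id: int = 1) -> int:
    """Insert one worker CSV into the strategies table; returns the number of rows imported."""
    df = pd.read_csv(csv_path)
    # Everything that is not a metric column is treated as a swept parameter
    param_cols = [c for c in df.columns if c not in METRIC_COLS]
    rows = []
    for _, row in df.iterrows():
        params = {c: row[c].item() if hasattr(row[c], "item") else row[c] for c in param_cols}
        rows.append((
            phase_id,
            json.dumps(params),               # parameter combo stored as JSON
            float(row["pnl_per_1k"]),
            int(row["trades"]),
            float(row["win_rate"]),
            float(row["profit_factor"]),
            float(row["max_drawdown"]),
        ))
    with sqlite3.connect(db_path) as conn:
        conn.executemany(
            "INSERT INTO strategies (phase_id, params_json, pnl_per_1k, trades, "
            "win_rate, profit_factor, max_drawdown) VALUES (?, ?, ?, ?, ?, ?, ?)",
            rows,
        )
    return len(rows)
```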
## 🔧 Troubleshooting

### SSH Connection Issues

**Symptom:** "Connection refused" or timeout errors

**Solutions:**

```bash
# Test Worker 1 connectivity
ssh root@10.10.254.106 'echo "Worker 1 OK"'

# Test Worker 2 (2-hop) connectivity
ssh root@10.10.254.106 'ssh root@10.20.254.100 "echo Worker 2 OK"'

# Check SSH keys
ssh-add -l

# Verify authorized_keys on workers
ssh root@10.10.254.106 'cat ~/.ssh/authorized_keys'
```

### Path/Import Errors on Workers

**Symptom:** "ModuleNotFoundError" or "FileNotFoundError"

**Solutions:**

```bash
# Verify the backtester exists on Worker 1
ssh root@10.10.254.106 'ls -lah /home/comprehensive_sweep/backtester/'

# Check the Python environment
ssh root@10.10.254.106 'cd /home/comprehensive_sweep && source .venv/bin/activate && python --version'

# Verify the data file
ssh root@10.10.254.106 'ls -lh /home/comprehensive_sweep/backtester/data/solusdt_5m.csv'

# Check the distributed_worker.py deployment
ssh root@10.10.254.106 'ls -lh /home/comprehensive_sweep/backtester/scripts/distributed_worker.py'
```

### Worker Processes Stuck/Hung

**Symptom:** exploration_status.py shows "running" but no progress

**Solutions:**

```bash
# Check worker processes
ssh root@10.10.254.106 'ps aux | grep distributed_worker'

# Check worker CPU usage (should be near 100% across all 32 threads)
ssh root@10.10.254.106 'top -bn1 | head -20'

# Kill a hung worker (the coordinator will reassign the chunk)
ssh root@10.10.254.106 'pkill -f distributed_worker.py'

# Check worker logs
ssh root@10.10.254.106 'tail -50 /home/comprehensive_sweep/backtester/scripts/worker_*.log'
```

### Database Locked/Corrupt

**Symptom:** "database is locked" errors

**Solutions:**

```bash
# Check for stale locks
cd /home/icke/traderv4/cluster
fuser exploration.db

# Backup and rebuild
cp exploration.db exploration.db.backup
sqlite3 exploration.db "VACUUM;"

# Verify integrity
sqlite3 exploration.db "PRAGMA integrity_check;"
```

### Results Not Importing

**Symptom:** CSVs appear in distributed_results/ but the database stays empty

**Solutions:**

```bash
# Check the CSV format
head -20 cluster/distributed_results/worker1_chunk_0.csv

# Manual import test
python3 -c "
import sqlite3
import pandas as pd
df = pd.read_csv('cluster/distributed_results/worker1_chunk_0.csv')
print(f'Loaded {len(df)} results')
print(df.columns.tolist())
print(df.head())
"

# Check the coordinator logs for import errors
grep -i "error\|exception" sweep.log | tail -20
```

## ⚡ Performance Tuning

### Chunk Size Trade-offs

**Small chunks (1,000-5,000):**

- ✅ Better load balancing
- ✅ Faster feedback loop
- ❌ More SSH/SCP overhead
- ❌ More database writes

**Large chunks (10,000-20,000):**

- ✅ Less overhead
- ✅ Fewer database transactions
- ❌ Less granular progress tracking
- ❌ Wasted work if a chunk fails

**Recommended:** 10,000 combos per chunk (a good balance)

### Worker Concurrency

**Current:** uses `mp.cpu_count()` (32 workers per EPYC)

**To reduce CPU load:**

```python
# In distributed_worker.py, around line 280
# Change from:
workers = mp.cpu_count()
# To:
workers = int(mp.cpu_count() * 0.7)  # 70% utilization (22 workers)
```

### Database Optimization

**For large result sets (>100k strategies):**

```bash
# Add indexes if queries slow down
sqlite3 cluster/exploration.db < …
```
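Beyond indexes, the "database is locked" errors from the Troubleshooting section usually come from concurrent access (the coordinator importing results while the dashboard or an ad-hoc query reads). A common SQLite mitigation, sketched here as a suggestion rather than something the coordinator currently does, is to enable WAL mode and a busy timeout:

```python
import sqlite3

# Open with a busy timeout so writes wait up to 30s instead of failing immediately
conn = sqlite3.connect("cluster/exploration.db", timeout=30)

conn.execute("PRAGMA journal_mode=WAL;")    # readers no longer block the single writer
conn.execute("PRAGMA synchronous=NORMAL;")  # reasonable durability/speed trade-off for bulk imports
conn.close()
```

WAL mode is persistent (it sticks to the database file), so running this once is enough; the busy timeout has to be set on every connection.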
## 🔮 Future Enhancements

5. **Walk-Forward Analysis** - Test strategies on rolling time windows
6. **Multi-Asset Support** - Extend to ETH, BTC, and other Drift markets
7. **Auto-Deployment** - Push top strategies to production after validation

---

**Questions?** Check the main project documentation or ask in the development chat.

**Ready to start?** Run the test sweep first: `python3 cluster/distributed_coordinator.py --chunk-size 100`
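As a companion to the Auto-Deployment idea above: pulling the current best quality-filtered configuration back out of `exploration.db` is a single query. An illustrative sketch (not part of the current tooling), using the column names from the strategies table:

```python
import json
import sqlite3

with sqlite3.connect("cluster/exploration.db") as conn:
    row = conn.execute(
        "SELECT params_json, pnl_per_1k, trades, win_rate, profit_factor "
        "FROM strategies "
        "WHERE trades >= 700 AND win_rate BETWEEN 0.50 AND 0.70 AND profit_factor >= 1.2 "
        "ORDER BY pnl_per_1k DESC LIMIT 1"
    ).fetchone()

if row is None:
    print("No strategy passes the quality filters yet")
else:
    params_json, pnl, trades, wr, pf = row
    print(f"Best so far: ${pnl:.0f}/1k over {trades} trades (WR {wr:.0%}, PF {pf:.2f})")
    print(json.dumps(json.loads(params_json), indent=2))
```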