CRITICAL FIX (Dec 1, 2025): Cluster start was failing with "operation failed"

Problem:
- SSH commands were timing out after 30s (too short for the 2-hop SSH to worker2)
- Missing SSH options caused prompts and delays
- Result: the coordinator failed to start worker processes

Solution:
- Increased the timeout from 30s to 60s for nested SSH hops
- Added SSH options: -o StrictHostKeyChecking=no -o ConnectTimeout=10
- Applied the options to both ssh_command() and the worker startup commands

Verification (Dec 1, 09:40):
- Worker1: 23 processes running (chunk 0-2000)
- Worker2: 24 processes running (chunk 2000-4000)
- Cluster status: ACTIVE with 2 workers
- Both chunks processing successfully

Files changed:
- cluster/distributed_coordinator.py (lines 302-314, 388-414)
Distributed Continuous Optimization Cluster
24/7 automated strategy discovery across 2 EPYC servers (64 cores total). Explores entire indicator/parameter space to find the absolute best trading approach.
🏗️ Architecture
Three-Component Distributed System:
- Coordinator (distributed_coordinator.py) - Master orchestrator running on srvdocker02 (its scheduling loop is sketched after this list)
  - Defines the parameter grid (14 dimensions, ~500k combinations)
  - Splits work into chunks (e.g., 10,000 combos per chunk)
  - Deploys the worker script to the EPYC servers via SSH/SCP
  - Assigns chunks to idle workers dynamically
  - Collects CSV results and imports them into the SQLite database
  - Tracks progress (completed/running/pending chunks)
- Worker (distributed_worker.py) - Runs on the EPYC servers
  - Integrates with the existing /home/comprehensive_sweep/backtester/ infrastructure
  - Uses the proven simulator.py vectorized engine and the MoneyLineInputs class
  - Loads a chunk spec (start_idx, end_idx into the total parameter grid)
  - Generates parameter combinations via itertools.product()
  - Runs a multiprocessing sweep with mp.cpu_count() workers
  - Saves results to CSV (same format as comprehensive_sweep.py)
- Monitor (exploration_status.py) - Real-time status dashboard
  - SSH worker health checks (active distributed_worker.py processes)
  - Chunk progress tracking (total/completed/running/pending)
  - Top 10 strategies leaderboard (P&L, trades, WR, PF, DD)
  - Best configuration details (full parameters)
  - Watch mode for continuous monitoring (30s refresh)
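As referenced above, the coordinator's scheduling behaviour reduces to a simple loop: split the flat parameter-grid index space into chunks, hand chunks to idle workers, and import results as chunks finish. The sketch below only illustrates that loop; the stubbed helpers (worker_is_idle, deploy_chunk, chunk_finished, import_results) are placeholders for the real SSH/SCP and SQLite logic in distributed_coordinator.py, not the actual implementation.

```python
# Illustrative coordinator loop; the stub helpers stand in for the real
# SSH/SCP and SQLite logic in cluster/distributed_coordinator.py.
import time

WORKERS = ["worker1", "worker2"]

def make_chunks(total: int, size: int) -> list[tuple[int, int]]:
    """Split the flat parameter-grid index space into (start_idx, end_idx) chunks."""
    return [(start, min(start + size, total)) for start in range(0, total, size)]

# --- stubs standing in for the real implementation ---
def worker_is_idle(worker: str) -> bool:                          # real: SSH + pgrep
    return True

def deploy_chunk(worker: str, chunk: tuple[int, int]) -> None:    # real: SCP + nohup
    print(f"deploy {chunk} -> {worker}")

def chunk_finished(worker: str, chunk: tuple[int, int]) -> bool:  # real: check for result CSV
    return True

def import_results(worker: str, chunk: tuple[int, int]) -> None:  # real: CSV -> exploration.db
    print(f"import {chunk} from {worker}")

def run(total_combos: int = 497_664, chunk_size: int = 10_000, poll_s: int = 30) -> None:
    pending = make_chunks(total_combos, chunk_size)
    running: dict[str, tuple[int, int]] = {}
    while pending or running:
        # Hand the next pending chunk to every idle worker.
        for worker in WORKERS:
            if worker not in running and pending and worker_is_idle(worker):
                chunk = pending.pop(0)
                deploy_chunk(worker, chunk)
                running[worker] = chunk
        # Collect finished chunks and free their workers.
        for worker, chunk in list(running.items()):
            if chunk_finished(worker, chunk):
                import_results(worker, chunk)
                del running[worker]
        time.sleep(poll_s)
```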
Infrastructure:
- Worker 1: pve-nu-monitor01 (10.10.254.106) - EPYC 7282, 32 threads, 62GB RAM
- Worker 2: pve-srvmon01 (10.20.254.100, reached via 2-hop SSH through Worker 1) - EPYC 7302, 32 threads, 31GB RAM
- Combined: 64 cores, ~108,000 backtests/day capacity (proven: 65,536 in 29h)
- Existing backtester: /home/comprehensive_sweep/backtester/ with simulator.py, indicators/, data/
- Data: solusdt_5m.csv - Binance 5-minute OHLCV (Nov 2024 - Nov 2025)
- Database: exploration.db - SQLite with strategies/chunks/phases tables
🚀 Quick Start
1. Test with Small Chunk (RECOMMENDED FIRST)
Verify system works before large-scale deployment:
cd /home/icke/traderv4/cluster
# Modify distributed_coordinator.py temporarily (lines 120-135)
# Reduce parameter ranges to 2-3 values per dimension
# Total: ~500-1000 combinations for testing
# Run test
python3 distributed_coordinator.py --chunk-size 100
# Monitor in separate terminal
python3 exploration_status.py --watch
Expected: 5-10 chunks complete in 30-60 minutes, all results in exploration.db
Verify:
- SSH commands execute successfully
- Worker script deploys to /home/comprehensive_sweep/backtester/scripts/
- CSV results appear in cluster/distributed_results/
- Database populated with strategies (check with sqlite3 exploration.db "SELECT COUNT(*) FROM strategies")
- Monitoring dashboard shows accurate worker/chunk status
2. Run Full v9 Parameter Sweep
After test succeeds, explore full parameter space:
cd /home/icke/traderv4/cluster
# Restore full parameter ranges in distributed_coordinator.py
# Total: ~500,000 combinations (see the Parameter Space section below)
# Start exploration (runs in background)
nohup python3 distributed_coordinator.py --chunk-size 10000 > sweep.log 2>&1 &
# Monitor progress
python3 exploration_status.py --watch
# OR
watch -n 60 'python3 exploration_status.py'
# Check logs
tail -f sweep.log
Expected Results:
- Duration: ~3.5 hours with 64 cores
- Find 5-10 configurations with P&L > $250/1k (baseline: $192/1k)
- Quality filters: 700+ trades, 50-70% WR, PF ≥ 1.2
3. Query Top Strategies
# Top 20 performers
sqlite3 cluster/exploration.db <<EOF
SELECT
params_json,
printf('$%.2f', pnl_per_1k) as pnl,
trades,
printf('%.1f%%', win_rate * 100) as wr,
printf('%.2f', profit_factor) as pf,
printf('%.1f%%', max_drawdown * 100) as dd,
DATE(tested_at) as tested
FROM strategies
WHERE trades >= 700
AND win_rate >= 0.50
AND win_rate <= 0.70
AND profit_factor >= 1.2
ORDER BY pnl_per_1k DESC
LIMIT 20;
EOF
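If you prefer working in Python, the same query can be loaded into pandas for further analysis. This is a sketch against the schema documented below; run it from the repository root so the cluster/exploration.db path resolves.

```python
# Load the top filtered strategies into a DataFrame (assumes the documented schema).
import sqlite3
import pandas as pd

QUERY = """
SELECT params_json, pnl_per_1k, trades, win_rate, profit_factor, max_drawdown, tested_at
FROM strategies
WHERE trades >= 700
  AND win_rate BETWEEN 0.50 AND 0.70
  AND profit_factor >= 1.2
ORDER BY pnl_per_1k DESC
LIMIT 20
"""

with sqlite3.connect("cluster/exploration.db") as conn:
    top = pd.read_sql_query(QUERY, conn)

print(top[["pnl_per_1k", "trades", "win_rate", "profit_factor"]])
```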
📊 Parameter Space (14 Dimensions)
v9 Money Line Configuration:
ParameterGrid(
flip_thresholds=[0.4, 0.5, 0.6, 0.7], # EMA flip confirmation (4 values)
ma_gaps=[0.20, 0.30, 0.40, 0.50], # MA50-MA200 convergence bonus (4 values)
adx_mins=[18, 21, 24, 27], # ADX requirement for momentum filter (4 values)
long_pos_maxs=[60, 65, 70, 75], # Price position for LONG momentum (4 values)
short_pos_mins=[20, 25, 30, 35], # Price position for SHORT momentum (4 values)
cooldowns=[1, 2, 3, 4], # Bars between signals (4 values)
position_sizes=[1.0], # Full position (1 value fixed)
tp1_multipliers=[1.5, 2.0, 2.5], # TP1 as ATR multiple (3 values)
tp2_multipliers=[3.0, 4.0, 5.0], # TP2 as ATR multiple (3 values)
sl_multipliers=[2.0, 3.0, 4.0], # SL as ATR multiple (3 values)
tp1_close_percents=[0.5, 0.6, 0.7, 0.75], # TP1 close % (4 values)
trailing_multipliers=[1.0, 1.5, 2.0], # Trailing stop multiplier (3 values)
vol_mins=[0.8, 1.0, 1.2], # Minimum volume ratio (3 values)
max_bars_list=[100, 150, 200] # Max bars in position (3 values)
)
# Total: ≈ 497,664 combinations (~500k)
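The worker enumerates this grid with itertools.product() and only evaluates the [start_idx, end_idx) slice assigned to it by the coordinator. Below is a minimal sketch of that indexing, using a toy two-dimensional grid in place of the full 14-dimensional one.

```python
# Toy illustration of how a worker slices its chunk out of the full grid.
import itertools

# Stand-in grid; the real grid has 14 dimensions (see ParameterGrid above).
grid = {
    "flip_threshold": [0.4, 0.5, 0.6, 0.7],
    "tp1_multiplier": [1.5, 2.0, 2.5],
}

keys = list(grid)
all_combos = itertools.product(*(grid[k] for k in keys))

start_idx, end_idx = 4, 8            # chunk spec handed out by the coordinator
chunk = itertools.islice(all_combos, start_idx, end_idx)

for values in chunk:
    params = dict(zip(keys, values))
    print(params)                     # the real worker runs a backtest here
```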
🎯 Quality Filters
Applied to all strategy results:
- Minimum trades: 700+ (statistical significance)
- Win rate range: 50-70% (realistic, avoids overfitting)
- Profit factor: ≥ 1.2 (solid edge)
- Max drawdown: Tracked but no hard limit (informational)
Why these filters:
- Trade count validates statistical robustness
- WR range prevents curve-fitting (>70% is likely overfit, <50% is worse than a coin flip)
- PF threshold ensures strategy has actual edge
📈 Expected Results
Current Baseline (v9 default parameters):
- P&L: $192 per $1k capital
- Trades: ~700
- Win Rate: ~61%
- Profit Factor: ~1.4
Optimization Goals:
- Target: >$250/1k P&L (30% improvement)
- Stretch: >$300/1k P&L (56% improvement)
- Expected: Find 5-10 configurations meeting quality filters with P&L > $250/1k
Why achievable:
- 500k combinations vs 27 tested in narrow sweep
- Full parameter space exploration vs limited grid
- Proven infrastructure (65,536 backtests completed successfully)
🔄 Continuous Exploration Roadmap
Phase 1: v9 Money Line Parameter Optimization (~500k combos, 3.5h)
- Status: READY TO RUN
- Goal: Find optimal flip_threshold, ma_gap, momentum filters
- Expected: >$250/1k P&L
Phase 2: RSI Divergence Integration (~100k combos, 45min)
- Add RSI divergence detection
- Combine with v9 momentum filter
- Parameters: RSI lookback, divergence strength threshold
- Goal: Catch trend reversals early
Phase 3: Volume Profile Analysis (~200k combos, 1.5h)
- Volume profile zones (POC, VAH, VAL)
- Order flow imbalance detection
- Parameters: Profile window, entry threshold, confirmation bars
- Goal: Better entry timing
Phase 4: Multi-Timeframe Confirmation (~150k combos, 1h)
- 5min + 15min + 1H alignment
- Higher timeframe trend filter
- Parameters: Timeframes to use, alignment strictness
- Goal: Reduce false signals
Phase 5: Hybrid Indicators (~50k combos, 30min)
- Combine best performers from Phase 1-4
- Test cross-strategy synergy
- Goal: Break $300/1k barrier
Phase 6: ML-Based Optimization (~100k+ combos, 1h+)
- Feature engineering from top strategies
- Gradient boosting / random forest
- Genetic algorithm parameter tuning
- Goal: Discover non-obvious patterns
📁 File Structure
cluster/
├── distributed_coordinator.py # Master orchestrator (650 lines)
├── distributed_worker.py # Worker script (350 lines)
├── exploration_status.py # Monitoring dashboard (200 lines)
├── exploration.db # SQLite results database
├── distributed_results/ # CSV results from workers
│ ├── worker1_chunk_0.csv
│ ├── worker1_chunk_1.csv
│ └── worker2_chunk_0.csv
└── README.md # This file
/home/comprehensive_sweep/backtester/ (on EPYC servers)
├── simulator.py # Core vectorized engine
├── indicators/
│ ├── money_line.py # MoneyLineInputs class
│ └── ...
├── data/
│ └── solusdt_5m.csv # Binance 5-minute OHLCV
├── scripts/
│ ├── comprehensive_sweep.py # Original multiprocessing sweep
│ └── distributed_worker.py # Deployed by coordinator
└── .venv/ # Python 3.11.2, pandas, numpy
💾 Database Schema
strategies table
CREATE TABLE strategies (
id INTEGER PRIMARY KEY AUTOINCREMENT,
phase_id INTEGER, -- Which exploration phase (1=v9, 2=RSI, etc.)
params_json TEXT NOT NULL, -- JSON parameter configuration
pnl_per_1k REAL, -- Performance metric ($ PnL per $1k)
trades INTEGER, -- Total trades in backtest
win_rate REAL, -- Decimal win rate (0.61 = 61%)
profit_factor REAL, -- Gross profit / gross loss
max_drawdown REAL, -- Largest peak-to-trough decline (decimal)
tested_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (phase_id) REFERENCES phases(id)
);
CREATE INDEX idx_strategies_pnl ON strategies(pnl_per_1k DESC);
CREATE INDEX idx_strategies_trades ON strategies(trades);
chunks table
CREATE TABLE chunks (
id INTEGER PRIMARY KEY AUTOINCREMENT,
phase_id INTEGER,
worker_id TEXT, -- 'worker1' or 'worker2'
start_idx INTEGER, -- Start index in parameter grid
end_idx INTEGER, -- End index (exclusive)
total_combos INTEGER, -- Total in this chunk
status TEXT DEFAULT 'pending', -- pending/running/completed/failed
assigned_at TIMESTAMP,
completed_at TIMESTAMP,
result_file TEXT, -- Path to CSV result file
FOREIGN KEY (phase_id) REFERENCES phases(id)
);
CREATE INDEX idx_chunks_status ON chunks(status);
phases table
CREATE TABLE phases (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT NOT NULL, -- 'v9_optimization', 'rsi_divergence', etc.
description TEXT,
total_combinations INTEGER, -- Total parameter combinations
started_at TIMESTAMP,
completed_at TIMESTAMP
);
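The coordinator's import step maps each worker CSV row onto this schema. The following is a rough sketch of that mapping, also handy for re-importing a result CSV by hand; it assumes the CSV columns mirror the table columns (params_json, pnl_per_1k, trades, win_rate, profit_factor, max_drawdown), so check the actual CSV header first.

```python
# Hedged sketch: import a worker result CSV into the strategies table.
import sqlite3
import pandas as pd

def import_csv(csv_path: str, db_path: str = "cluster/exploration.db",
               phase_id: int = 1) -> int:
    df = pd.read_csv(csv_path)
    rows = [
        (phase_id, r.params_json, r.pnl_per_1k, r.trades,
         r.win_rate, r.profit_factor, r.max_drawdown)
        for r in df.itertuples(index=False)
    ]
    with sqlite3.connect(db_path) as conn:
        conn.executemany(
            """INSERT INTO strategies
               (phase_id, params_json, pnl_per_1k, trades,
                win_rate, profit_factor, max_drawdown)
               VALUES (?, ?, ?, ?, ?, ?, ?)""",
            rows,
        )
    return len(rows)

# Example:
# print(import_csv("cluster/distributed_results/worker1_chunk_0.csv"))
```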
🔧 Troubleshooting
SSH Connection Issues
Symptom: "Connection refused" or timeout errors
Solutions:
# Test Worker 1 connectivity
ssh root@10.10.254.106 'echo "Worker 1 OK"'
# Test Worker 2 (2-hop) connectivity
ssh root@10.10.254.106 'ssh root@10.20.254.100 "echo Worker 2 OK"'
# Check SSH keys
ssh-add -l
# Verify authorized_keys on workers
ssh root@10.10.254.106 'cat ~/.ssh/authorized_keys'
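The critical fix noted at the top of this file (60s timeout, StrictHostKeyChecking=no, ConnectTimeout=10) is essentially about how these hops are wrapped. Below is a minimal sketch of the 2-hop pattern; the helper name and defaults are illustrative, not the actual ssh_command() in distributed_coordinator.py.

```python
# Illustrative 2-hop SSH wrapper; names and defaults here are assumptions,
# not the real ssh_command() helper in cluster/distributed_coordinator.py.
import subprocess

SSH_OPTS = ["-o", "StrictHostKeyChecking=no", "-o", "ConnectTimeout=10"]

def run_remote(command: str, host: str, jump_host: str | None = None,
               timeout: int = 60) -> str:
    """Run a command on a worker, optionally via a jump host (2-hop SSH)."""
    if jump_host:
        # Wrap the inner ssh call so it executes from the jump host.
        inner = f"ssh {' '.join(SSH_OPTS)} root@{host} {command!r}"
        args = ["ssh", *SSH_OPTS, f"root@{jump_host}", inner]
    else:
        args = ["ssh", *SSH_OPTS, f"root@{host}", command]
    result = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    result.check_returncode()
    return result.stdout

# Example: check worker2 (10.20.254.100) through worker1 (10.10.254.106)
# print(run_remote("echo Worker 2 OK", "10.20.254.100", jump_host="10.10.254.106"))
```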
Path/Import Errors on Workers
Symptom: "ModuleNotFoundError" or "FileNotFoundError"
Solutions:
# Verify backtester exists on Worker 1
ssh root@10.10.254.106 'ls -lah /home/comprehensive_sweep/backtester/'
# Check Python environment
ssh root@10.10.254.106 'cd /home/comprehensive_sweep && source .venv/bin/activate && python --version'
# Verify data file
ssh root@10.10.254.106 'ls -lh /home/comprehensive_sweep/backtester/data/solusdt_5m.csv'
# Check distributed_worker.py deployment
ssh root@10.10.254.106 'ls -lh /home/comprehensive_sweep/backtester/scripts/distributed_worker.py'
Worker Processes Stuck/Hung
Symptom: exploration_status.py shows "running" but no progress
Solutions:
# Check worker processes
ssh root@10.10.254.106 'ps aux | grep distributed_worker'
# Check worker CPU usage (should be near 100% on 32 cores)
ssh root@10.10.254.106 'top -bn1 | head -20'
# Kill hung worker (coordinator will reassign chunk)
ssh root@10.10.254.106 'pkill -f distributed_worker.py'
# Check worker logs
ssh root@10.10.254.106 'tail -50 /home/comprehensive_sweep/backtester/scripts/worker_*.log'
Database Locked/Corrupt
Symptom: "database is locked" errors
Solutions:
# Check for stale locks
cd /home/icke/traderv4/cluster
fuser exploration.db
# Backup and rebuild
cp exploration.db exploration.db.backup
sqlite3 exploration.db "VACUUM;"
# Verify integrity
sqlite3 exploration.db "PRAGMA integrity_check;"
Results Not Importing
Symptom: CSVs in distributed_results/ but database empty
Solutions:
# Check CSV format
head -20 cluster/distributed_results/worker1_chunk_0.csv
# Manual import test
python3 -c "
import sqlite3
import pandas as pd
df = pd.read_csv('cluster/distributed_results/worker1_chunk_0.csv')
print(f'Loaded {len(df)} results')
print(df.columns.tolist())
print(df.head())
"
# Check coordinator logs for import errors
grep -i "error\|exception" sweep.log | tail -20
⚡ Performance Tuning
Chunk Size Trade-offs
Small chunks (1,000-5,000):
- ✅ Better load balancing
- ✅ Faster feedback loop
- ❌ More SSH/SCP overhead
- ❌ More database writes
Large chunks (10,000-20,000):
- ✅ Less overhead
- ✅ Fewer database transactions
- ❌ Less granular progress tracking
- ❌ Wasted work if chunk fails
Recommended: 10,000 combos per chunk (a good balance; the ~500k-combination v9 sweep splits into roughly 50 chunks)
Worker Concurrency
Current: Uses mp.cpu_count() (32 workers per EPYC)
To reduce CPU load:
# In distributed_worker.py line ~280
# Change from:
workers = mp.cpu_count()
# To:
workers = int(mp.cpu_count() * 0.7) # 70% utilization (22 workers)
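For context, the worker's sweep is a standard multiprocessing pool map over the chunk's parameter combinations. The sketch below shows that pattern under stated assumptions; run_backtest is a stand-in, not the actual function in distributed_worker.py.

```python
# Minimal sketch of the worker's multiprocessing sweep; run_backtest is a
# stand-in for the real simulator call in distributed_worker.py.
import multiprocessing as mp

def run_backtest(params: dict) -> dict:
    # Real version: build MoneyLineInputs from params and run the simulator.
    return {"params": params, "pnl_per_1k": 0.0}

def sweep(param_list: list[dict], utilization: float = 1.0) -> list[dict]:
    workers = max(1, int(mp.cpu_count() * utilization))
    with mp.Pool(processes=workers) as pool:
        return pool.map(run_backtest, param_list)

if __name__ == "__main__":
    demo = [{"flip_threshold": t} for t in (0.4, 0.5, 0.6, 0.7)]
    print(sweep(demo, utilization=0.7)[:2])
```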
Database Optimization
For large result sets (>100k strategies):
# Add indexes if queries slow
sqlite3 cluster/exploration.db <<EOF
CREATE INDEX IF NOT EXISTS idx_strategies_phase ON strategies(phase_id);
CREATE INDEX IF NOT EXISTS idx_strategies_wr ON strategies(win_rate);
CREATE INDEX IF NOT EXISTS idx_strategies_pf ON strategies(profit_factor);
ANALYZE;
EOF
✅ Best Practices
- Always test with small chunk first (100-1000 combos) before full sweep
- Monitor regularly with exploration_status.py --watch during runs
- Backup the database before major changes: cp exploration.db exploration.db.backup
- Review top strategies after each phase completion
- Archive old results if disk space runs low (CSV files can be deleted after import)
- Validate the quality filters - adjust if they prove too strict or too lenient
- Check worker logs if progress stalls: ssh root@10.10.254.106 'tail -f /home/comprehensive_sweep/backtester/scripts/worker_*.log'
🔗 Integration with Production Bot
After finding top strategy:
1. Extract parameters from the database:
sqlite3 cluster/exploration.db <<EOF
SELECT params_json FROM strategies
WHERE id = (SELECT id FROM strategies ORDER BY pnl_per_1k DESC LIMIT 1);
EOF
2. Update the TradingView indicator (workflows/trading/moneyline_v9_ma_gap.pinescript):
   - Set flip_threshold, ma_gap, momentum_adx, etc. to the optimal values
   - Test in replay mode with historical data
3. Update the bot configuration (.env file):
   - Adjust MIN_SIGNAL_QUALITY_SCORE if needed
   - Update position sizing if the strategy has a different risk profile
4. Forward test (50-100 trades) before increasing capital:
   - Use SOLANA_POSITION_SIZE=10 (10% of capital)
   - Monitor win rate, P&L, and drawdown
   - If metrics match the backtest within ±10%, increase to full size
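To pull the winning configuration into Python for mapping onto the indicator and bot settings, something like the following works. It is a sketch against the schema above; the keys inside params_json depend on what the worker wrote.

```python
# Fetch the single best configuration and print its parameters.
import json
import sqlite3

with sqlite3.connect("cluster/exploration.db") as conn:
    row = conn.execute(
        "SELECT params_json, pnl_per_1k FROM strategies "
        "ORDER BY pnl_per_1k DESC LIMIT 1"
    ).fetchone()

params = json.loads(row[0])
print(f"Best P&L per $1k: ${row[1]:.2f}")
for key, value in sorted(params.items()):
    print(f"  {key} = {value}")
```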
📚 Support & Documentation
- Main project docs: /home/icke/traderv4/.github/copilot-instructions.md (5,181 lines)
- Trading goals: TRADING_GOALS.md (8-phase $106→$100k+ roadmap)
- v9 indicator: INDICATOR_V9_MA_GAP_ROADMAP.md
- Optimization roadmaps: SIGNAL_QUALITY_OPTIMIZATION_ROADMAP.md, POSITION_SCALING_ROADMAP.md
- Adaptive leverage: ADAPTIVE_LEVERAGE_SYSTEM.md
🚀 Future Enhancements
Potential additions:
- Genetic Algorithm Optimization - Breed top performers, test offspring
- Bayesian Optimization - Guide search toward promising parameter regions
- Web Dashboard - Real-time browser-based monitoring (Flask/FastAPI)
- Telegram Alerts - Notify when exceptional strategies found (P&L > threshold)
- Walk-Forward Analysis - Test strategies on rolling time windows
- Multi-Asset Support - Extend to ETH, BTC, other Drift markets
- Auto-Deployment - Push top strategies to production after validation
Questions? Check main project documentation or ask in development chat.
Ready to start? Run test sweep first: python3 cluster/distributed_coordinator.py --chunk-size 100