# Distributed Continuous Optimization Cluster
**24/7 automated strategy discovery** across 2 EPYC servers (64 cores total). Explores the entire indicator/parameter space to find the best-performing trading configurations.
## 🏗️ Architecture
**Three-Component Distributed System:**
1. **Coordinator** (`distributed_coordinator.py`) - Master orchestrator running on srvdocker02
   - Defines the parameter grid (14 dimensions, ~500k combinations)
   - Splits work into chunks (e.g., 10,000 combos per chunk; see the sketch after this list)
   - Deploys the worker script to the EPYC servers via SSH/SCP
   - Assigns chunks to idle workers dynamically
   - Collects CSV results and imports them into a SQLite database
   - Tracks progress (completed/running/pending chunks)
2. **Worker** (`distributed_worker.py`) - Runs on the EPYC servers
   - Integrates with the existing `/home/comprehensive_sweep/backtester/` infrastructure
   - Uses the proven `simulator.py` vectorized engine and `MoneyLineInputs` class
   - Loads a chunk spec (start_idx, end_idx within the total parameter grid)
   - Generates parameter combinations via `itertools.product()`
   - Runs a multiprocessing sweep with `mp.cpu_count()` workers
   - Saves results to CSV (same format as comprehensive_sweep.py)
3. **Monitor** (`exploration_status.py`) - Real-time status dashboard
   - SSH worker health checks (active distributed_worker.py processes)
   - Chunk progress tracking (total/completed/running/pending)
   - Top 10 strategies leaderboard (P&L, trades, WR, PF, DD)
   - Best configuration details (full parameters)
   - Watch mode for continuous monitoring (30s refresh)
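To make the chunking step concrete, here is a minimal sketch of how a grid of ~500k combinations could be split into `(start_idx, end_idx)` chunks. This is illustrative only, not the actual `distributed_coordinator.py` code:

```python
# Minimal sketch of the coordinator's chunk-splitting step.
# Illustrative only; the real distributed_coordinator.py may differ.

def make_chunks(total_combos: int, chunk_size: int):
    """Yield (start_idx, end_idx) pairs covering [0, total_combos)."""
    for start in range(0, total_combos, chunk_size):
        yield start, min(start + chunk_size, total_combos)

# ~497,664 combinations at 10,000 combos per chunk -> 50 chunks
chunks = list(make_chunks(497_664, 10_000))
print(len(chunks), chunks[0], chunks[-1])  # 50 (0, 10000) (490000, 497664)
```

Half-open index ranges mean a failed chunk can be reassigned to any idle worker without overlap.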
**Infrastructure:**
- **Worker 1:** pve-nu-monitor01 (10.10.254.106) - EPYC 7282 32 threads, 62GB RAM
- **Worker 2:** pve-srvmon01 (10.20.254.100 via worker1 2-hop SSH) - EPYC 7302 32 threads, 31GB RAM
- **Combined:** 64 cores, ~108,000 backtests/day capacity (proven: 65,536 in 29h)
- **Existing Backtester:** `/home/comprehensive_sweep/backtester/` with simulator.py, indicators/, data/
- **Data:** `solusdt_5m.csv` - Binance 5-minute OHLCV (Nov 2024 - Nov 2025)
- **Database:** `exploration.db` SQLite with strategies/chunks/phases tables
## 🚀 Quick Start
### 1. Test with Small Chunk (RECOMMENDED FIRST)
Verify system works before large-scale deployment:
```bash
cd /home/icke/traderv4/cluster
# Modify distributed_coordinator.py temporarily (lines 120-135)
# Reduce parameter ranges to 2-3 values per dimension
# Total: ~500-1000 combinations for testing
# Run test
python3 distributed_coordinator.py --chunk-size 100
# Monitor in separate terminal
python3 exploration_status.py --watch
```
**Expected:** 5-10 chunks complete in 30-60 minutes, all results in `exploration.db`

**Verify:**
- SSH commands execute successfully
- Worker script deploys to `/home/comprehensive_sweep/backtester/scripts/`
- CSV results appear in `cluster/distributed_results/`
- Database populated with strategies (check with `sqlite3 exploration.db "SELECT COUNT(*) FROM strategies"`)
- Monitoring dashboard shows accurate worker/chunk status
### 2. Run Full v9 Parameter Sweep
After test succeeds, explore full parameter space:
```bash
cd /home/icke/traderv4/cluster
# Restore full parameter ranges in distributed_coordinator.py
# Total: ~500,000 combinations
# Start exploration (runs in background)
nohup python3 distributed_coordinator.py --chunk-size 10000 > sweep.log 2>&1 &
# Monitor progress
python3 exploration_status.py --watch
# OR
watch -n 60 'python3 exploration_status.py'
# Check logs
tail -f sweep.log
```
**Expected Results:**
- Duration: ~3.5 hours with 64 cores
- Find 5-10 configurations with P&L > $250/1k (baseline: $192/1k)
- Quality filters: 700+ trades, 50-70% WR, PF ≥ 1.2
### 3. Query Top Strategies
```bash
# Top 20 performers
sqlite3 cluster/exploration.db <<EOF
SELECT
    params_json,
    printf('$%.2f', pnl_per_1k) AS pnl,
    trades,
    printf('%.1f%%', win_rate * 100) AS wr,
    printf('%.2f', profit_factor) AS pf,
    printf('%.1f%%', max_drawdown * 100) AS dd,
    DATE(tested_at) AS tested
FROM strategies
WHERE trades >= 700
  AND win_rate >= 0.50
  AND win_rate <= 0.70
  AND profit_factor >= 1.2
ORDER BY pnl_per_1k DESC
LIMIT 20;
EOF
```
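If you prefer working in Python, the same query can be pulled into a pandas DataFrame for further analysis. A minimal sketch, assuming the `strategies` schema documented below:

```python
# Load the top filtered strategies into a DataFrame for analysis.
# Assumes the strategies table documented in the Database Schema section.
import sqlite3
import pandas as pd

con = sqlite3.connect("cluster/exploration.db")
top = pd.read_sql_query(
    """
    SELECT params_json, pnl_per_1k, trades, win_rate, profit_factor, max_drawdown
    FROM strategies
    WHERE trades >= 700
      AND win_rate BETWEEN 0.50 AND 0.70
      AND profit_factor >= 1.2
    ORDER BY pnl_per_1k DESC
    LIMIT 20
    """,
    con,
)
con.close()
print(top.to_string(index=False))
```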
## 📊 Parameter Space (14 Dimensions)
**v9 Money Line Configuration:**
```python
ParameterGrid(
    flip_thresholds=[0.4, 0.5, 0.6, 0.7],       # EMA flip confirmation (4 values)
    ma_gaps=[0.20, 0.30, 0.40, 0.50],           # MA50-MA200 convergence bonus (4 values)
    adx_mins=[18, 21, 24, 27],                  # ADX requirement for momentum filter (4 values)
    long_pos_maxs=[60, 65, 70, 75],             # Price position for LONG momentum (4 values)
    short_pos_mins=[20, 25, 30, 35],            # Price position for SHORT momentum (4 values)
    cooldowns=[1, 2, 3, 4],                     # Bars between signals (4 values)
    position_sizes=[1.0],                       # Full position (1 value, fixed)
    tp1_multipliers=[1.5, 2.0, 2.5],            # TP1 as ATR multiple (3 values)
    tp2_multipliers=[3.0, 4.0, 5.0],            # TP2 as ATR multiple (3 values)
    sl_multipliers=[2.0, 3.0, 4.0],             # SL as ATR multiple (3 values)
    tp1_close_percents=[0.5, 0.6, 0.7, 0.75],   # TP1 close % (4 values)
    trailing_multipliers=[1.0, 1.5, 2.0],       # Trailing stop multiplier (3 values)
    vol_mins=[0.8, 1.0, 1.2],                   # Minimum volume ratio (3 values)
    max_bars_list=[100, 150, 200],              # Max bars in position (3 values)
)
# Total: ≈ 497,664 combinations (~500k)
```
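Each worker only receives `(start_idx, end_idx)`, so every flat index must map deterministically to one combination of the grid above. A minimal sketch of that mapping with `itertools` (shown on a reduced 3-dimension grid for brevity; the real worker enumerates all 14 dimensions in a fixed order):

```python
# Map a chunk's flat index range [start_idx, end_idx) to parameter dicts.
# Reduced grid for illustration; dimension order must match the coordinator.
from itertools import islice, product

grid = {
    "flip_threshold": [0.4, 0.5, 0.6, 0.7],
    "tp1_multiplier": [1.5, 2.0, 2.5],
    "sl_multiplier": [2.0, 3.0, 4.0],
}

def combos_for_chunk(grid: dict, start_idx: int, end_idx: int):
    """Yield one parameter dict per flat index in [start_idx, end_idx)."""
    keys = list(grid)
    for values in islice(product(*grid.values()), start_idx, end_idx):
        yield dict(zip(keys, values))

for params in combos_for_chunk(grid, 10, 13):
    print(params)  # three consecutive combinations out of 4*3*3 = 36
```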
## 🎯 Quality Filters
**Applied to all strategy results:**
- **Minimum trades:** 700+ (statistical significance)
- **Win rate range:** 50-70% (realistic, avoids overfitting)
- **Profit factor:** ≥ 1.2 (solid edge)
- **Max drawdown:** Tracked but no hard limit (informational)

**Why these filters:**
- Trade count validates statistical robustness
- WR range prevents curve-fitting (>70% = overfit, <50% = coin flip)
- PF threshold ensures strategy has actual edge
## 📈 Expected Results
**Current Baseline (v9 default parameters):**
- P&L: $192 per $1k capital
- Trades: ~700
- Win Rate: ~61%
- Profit Factor: ~1.4

**Optimization Goals:**
- **Target:** >$250/1k P&L (30% improvement)
- **Stretch:** >$300/1k P&L (56% improvement)
- **Expected:** Find 5-10 configurations meeting quality filters with P&L > $250/1k

**Why achievable:**
- 500k combinations vs 27 tested in narrow sweep
- Full parameter space exploration vs limited grid
- Proven infrastructure (65,536 backtests completed successfully)
## 🔄 Continuous Exploration Roadmap
**Phase 1: v9 Money Line Parameter Optimization (~500k combos, 3.5h)**
- Status: READY TO RUN
- Goal: Find optimal flip_threshold, ma_gap, momentum filters
- Expected: >$250/1k P&L

**Phase 2: RSI Divergence Integration (~100k combos, 45min)**
- Add RSI divergence detection
- Combine with v9 momentum filter
- Parameters: RSI lookback, divergence strength threshold
- Goal: Catch trend reversals early

**Phase 3: Volume Profile Analysis (~200k combos, 1.5h)**
- Volume profile zones (POC, VAH, VAL)
- Order flow imbalance detection
- Parameters: Profile window, entry threshold, confirmation bars
- Goal: Better entry timing

**Phase 4: Multi-Timeframe Confirmation (~150k combos, 1h)**
- 5min + 15min + 1H alignment
- Higher timeframe trend filter
- Parameters: Timeframes to use, alignment strictness
- Goal: Reduce false signals

**Phase 5: Hybrid Indicators (~50k combos, 30min)**
- Combine best performers from Phases 1-4
- Test cross-strategy synergy
- Goal: Break the $300/1k barrier

**Phase 6: ML-Based Optimization (~100k+ combos, 1h+)**
- Feature engineering from top strategies
- Gradient boosting / random forest
- Genetic algorithm parameter tuning
- Goal: Discover non-obvious patterns
## 📁 File Structure
```
cluster/
├── distributed_coordinator.py   # Master orchestrator (650 lines)
├── distributed_worker.py        # Worker script (350 lines)
├── exploration_status.py        # Monitoring dashboard (200 lines)
├── exploration.db               # SQLite results database
├── distributed_results/         # CSV results from workers
│   ├── worker1_chunk_0.csv
│   ├── worker1_chunk_1.csv
│   └── worker2_chunk_0.csv
└── README.md                    # This file

/home/comprehensive_sweep/backtester/   (on EPYC servers)
├── simulator.py                 # Core vectorized engine
├── indicators/
│   ├── money_line.py            # MoneyLineInputs class
│   └── ...
├── data/
│   └── solusdt_5m.csv           # Binance 5-minute OHLCV
├── scripts/
│   ├── comprehensive_sweep.py   # Original multiprocessing sweep
│   └── distributed_worker.py    # Deployed by coordinator
└── .venv/                       # Python 3.11.2, pandas, numpy
```
## 💾 Database Schema
### strategies table
```sql
CREATE TABLE strategies (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    phase_id INTEGER,               -- Which exploration phase (1=v9, 2=RSI, etc.)
    params_json TEXT NOT NULL,      -- JSON parameter configuration
    pnl_per_1k REAL,                -- Performance metric ($ PnL per $1k)
    trades INTEGER,                 -- Total trades in backtest
    win_rate REAL,                  -- Decimal win rate (0.61 = 61%)
    profit_factor REAL,             -- Gross profit / gross loss
    max_drawdown REAL,              -- Largest peak-to-trough decline (decimal)
    tested_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (phase_id) REFERENCES phases(id)
);

CREATE INDEX idx_strategies_pnl ON strategies(pnl_per_1k DESC);
CREATE INDEX idx_strategies_trades ON strategies(trades);
```
### chunks table
```sql
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    phase_id INTEGER,
    worker_id TEXT,                 -- 'worker1' or 'worker2'
    start_idx INTEGER,              -- Start index in parameter grid
    end_idx INTEGER,                -- End index (exclusive)
    total_combos INTEGER,           -- Total in this chunk
    status TEXT DEFAULT 'pending',  -- pending/running/completed/failed
    assigned_at TIMESTAMP,
    completed_at TIMESTAMP,
    result_file TEXT,               -- Path to CSV result file
    FOREIGN KEY (phase_id) REFERENCES phases(id)
);

CREATE INDEX idx_chunks_status ON chunks(status);
```
### phases table
```sql
CREATE TABLE phases (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,             -- 'v9_optimization', 'rsi_divergence', etc.
    description TEXT,
    total_combinations INTEGER,     -- Total parameter combinations
    started_at TIMESTAMP,
    completed_at TIMESTAMP
);
```
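For reference, the import step the coordinator performs can be reproduced by hand. This is a minimal sketch, assuming the worker CSVs use column names matching the `strategies` table above (the authoritative logic lives in `distributed_coordinator.py`):

```python
# Manually import one worker CSV into the strategies table.
# Assumes CSV column names match the schema above; adjust if they differ.
import sqlite3
import pandas as pd

df = pd.read_csv("cluster/distributed_results/worker1_chunk_0.csv")

con = sqlite3.connect("cluster/exploration.db")
df.to_sql("strategies", con, if_exists="append", index=False)
con.commit()
con.close()
print(f"Imported {len(df)} strategies")
```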
## 🔧 Troubleshooting
### SSH Connection Issues
**Symptom:** "Connection refused" or timeout errors

**Solutions:**
```bash
# Test Worker 1 connectivity
ssh root@10.10.254.106 'echo "Worker 1 OK"'
# Test Worker 2 (2-hop) connectivity
ssh root@10.10.254.106 'ssh root@10.20.254.100 "echo Worker 2 OK"'
# Check SSH keys
ssh-add -l
# Verify authorized_keys on workers
ssh root@10.10.254.106 'cat ~/.ssh/authorized_keys'
```
### Path/Import Errors on Workers
**Symptom:** "ModuleNotFoundError" or "FileNotFoundError"

**Solutions:**
```bash
# Verify backtester exists on Worker 1
ssh root@10.10.254.106 'ls -lah /home/comprehensive_sweep/backtester/'
# Check Python environment
ssh root@10.10.254.106 'cd /home/comprehensive_sweep && source .venv/bin/activate && python --version'
# Verify data file
ssh root@10.10.254.106 'ls -lh /home/comprehensive_sweep/backtester/data/solusdt_5m.csv'
# Check distributed_worker.py deployment
ssh root@10.10.254.106 'ls -lh /home/comprehensive_sweep/backtester/scripts/distributed_worker.py'
```
### Worker Processes Stuck/Hung
**Symptom:** exploration_status.py shows "running" but no progress

**Solutions:**
```bash
# Check worker processes
ssh root@10.10.254.106 'ps aux | grep distributed_worker'
# Check worker CPU usage (should be near 100% on 32 cores)
ssh root@10.10.254.106 'top -bn1 | head -20'
# Kill hung worker (coordinator will reassign chunk)
ssh root@10.10.254.106 'pkill -f distributed_worker.py'
# Check worker logs
ssh root@10.10.254.106 'tail -50 /home/comprehensive_sweep/backtester/scripts/worker_*.log'
```
### Database Locked/Corrupt
**Symptom:** "database is locked" errors

**Solutions:**
```bash
# Check for stale locks
cd /home/icke/traderv4/cluster
fuser exploration.db
# Backup and rebuild
cp exploration.db exploration.db.backup
sqlite3 exploration.db "VACUUM;"
# Verify integrity
sqlite3 exploration.db "PRAGMA integrity_check;"
```
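If lock errors recur, a standard SQLite mitigation is write-ahead logging (WAL), which lets the status dashboard read while the coordinator writes. This is not part of the coordinator code described here, so treat it as an optional tweak:

```python
# Switch exploration.db to WAL journal mode; the setting persists in the file.
import sqlite3

con = sqlite3.connect("cluster/exploration.db")
mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]
con.close()
print(f"journal_mode = {mode}")  # expect 'wal'
```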
### Results Not Importing
**Symptom:** CSVs in distributed_results/ but database empty

**Solutions:**
```bash
# Check CSV format
head -20 cluster/distributed_results/worker1_chunk_0.csv
# Manual import test
python3 -c "
import sqlite3
import pandas as pd
df = pd.read_csv('cluster/distributed_results/worker1_chunk_0.csv')
print(f'Loaded {len(df)} results')
print(df.columns.tolist())
print(df.head())
"
# Check coordinator logs for import errors
grep -i "error\|exception" sweep.log | tail -20
```
## ⚡ Performance Tuning
### Chunk Size Trade-offs
**Small chunks (1,000-5,000):**
- ✅ Better load balancing
- ✅ Faster feedback loop
- ❌ More SSH/SCP overhead
- ❌ More database writes

**Large chunks (10,000-20,000):**
- ✅ Less overhead
- ✅ Fewer database transactions
- ❌ Less granular progress tracking
- ❌ Wasted work if a chunk fails

**Recommended:** 10,000 combos per chunk (a good balance)
### Worker Concurrency
**Current:** Uses `mp.cpu_count()` (32 workers per EPYC)

**To reduce CPU load:**
```python
# In distributed_worker.py line ~280
# Change from:
workers = mp.cpu_count()
# To:
workers = int(mp.cpu_count() * 0.7) # 70% utilization (22 workers)
```
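For context, the worker's sweep is conceptually a `multiprocessing.Pool` mapped over the chunk's combinations. A rough sketch with a hypothetical `run_backtest()` stand-in (the real `distributed_worker.py` calls into `simulator.py` instead):

```python
# Conceptual sketch of the worker's multiprocessing sweep.
# run_backtest() is a hypothetical stand-in for the simulator.py call.
import multiprocessing as mp

def run_backtest(params: dict) -> dict:
    # Placeholder: the real worker runs the vectorized backtest here.
    return {"params": params, "pnl_per_1k": 0.0}

if __name__ == "__main__":
    combos = [{"flip_threshold": t} for t in (0.4, 0.5, 0.6, 0.7)]
    workers = int(mp.cpu_count() * 0.7)  # throttled, as shown above
    with mp.Pool(processes=workers) as pool:
        results = pool.map(run_backtest, combos)
    print(f"Completed {len(results)} backtests")
```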
### Database Optimization
**For large result sets (>100k strategies):**
```bash
# Add indexes if queries slow
sqlite3 cluster/exploration.db <<EOF
CREATE INDEX IF NOT EXISTS idx_strategies_phase ON strategies(phase_id);
CREATE INDEX IF NOT EXISTS idx_strategies_wr ON strategies(win_rate);
CREATE INDEX IF NOT EXISTS idx_strategies_pf ON strategies(profit_factor);
ANALYZE;
EOF
```
## ✅ Best Practices
1. **Always test with small chunk first** (100-1000 combos) before full sweep
2. **Monitor regularly** with `exploration_status.py --watch` during runs
3. **Backup database** before major changes: `cp exploration.db exploration.db.backup`
4. **Review top strategies** after each phase completion
5. **Archive old results** if disk space low (CSV files can be deleted after import)
6. **Validate quality filters** - adjust if too strict/lenient based on results
7. **Check worker logs** if progress stalls: `ssh root@10.10.254.106 'tail -f /home/comprehensive_sweep/backtester/scripts/worker_*.log'`
## 🔗 Integration with Production Bot
**After finding top strategy:**
1. **Extract parameters from database:**
```bash
sqlite3 cluster/exploration.db <<EOF
SELECT params_json FROM strategies
WHERE id = (SELECT id FROM strategies ORDER BY pnl_per_1k DESC LIMIT 1);
EOF
```
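From Python, the same row can be fetched and parsed with `json.loads()` before copying values into the indicator and bot config. Key names are assumed to follow the ParameterGrid fields above:

```python
# Fetch and parse the best strategy's parameter JSON.
# Key names are assumed to match the ParameterGrid fields in this README.
import json
import sqlite3

con = sqlite3.connect("cluster/exploration.db")
row = con.execute(
    "SELECT params_json FROM strategies ORDER BY pnl_per_1k DESC LIMIT 1"
).fetchone()
con.close()

params = json.loads(row[0])
print(params.get("flip_threshold"), params.get("tp1_multiplier"))
```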
2. **Update TradingView indicator** (`workflows/trading/moneyline_v9_ma_gap.pinescript`):
   - Set `flip_threshold`, `ma_gap`, `momentum_adx`, etc. to the optimal values
   - Test in replay mode with historical data
3. **Update bot configuration** (`.env` file):
   - Adjust `MIN_SIGNAL_QUALITY_SCORE` if needed
   - Update position sizing if the strategy has a different risk profile
4. **Forward test** (50-100 trades) before increasing capital:
   - Use `SOLANA_POSITION_SIZE=10` (10% of capital)
   - Monitor win rate, P&L, drawdown
   - If metrics match the backtest within ±10%, increase to full size
## 📚 Support & Documentation
- **Main project docs:** `/home/icke/traderv4/.github/copilot-instructions.md` (5,181 lines)
- **Trading goals:** `TRADING_GOALS.md` (8-phase $106→$100k+ roadmap)
- **v9 indicator:** `INDICATOR_V9_MA_GAP_ROADMAP.md`
- **Optimization roadmaps:** `SIGNAL_QUALITY_OPTIMIZATION_ROADMAP.md`, `POSITION_SCALING_ROADMAP.md`
- **Adaptive leverage:** `ADAPTIVE_LEVERAGE_SYSTEM.md`
## 🚀 Future Enhancements
**Potential additions:**
1. **Genetic Algorithm Optimization** - Breed top performers, test offspring
2. **Bayesian Optimization** - Guide search toward promising parameter regions
3. **Web Dashboard** - Real-time browser-based monitoring (Flask/FastAPI)
4. **Telegram Alerts** - Notify when exceptional strategies found (P&L > threshold)
5. **Walk-Forward Analysis** - Test strategies on rolling time windows
6. **Multi-Asset Support** - Extend to ETH, BTC, other Drift markets
7. **Auto-Deployment** - Push top strategies to production after validation
---
**Questions?** Check main project documentation or ask in development chat.
**Ready to start?** Run test sweep first: `python3 cluster/distributed_coordinator.py --chunk-size 100`