> **Critical fix (Nov 30, 2025):** The cluster dashboard showed `idle` even though 22+ worker processes were running; the root cause was SSH-based worker detection timing out. `app/api/cluster/status/route.ts` now queries the exploration database for running chunks *before* SSH detection, marks workers `active` whenever running chunks exist (overriding `offline`), and logs `✅ Cluster status: ACTIVE (database shows running chunks)`. The database is the source of truth; SSH supplies only supplementary metrics. No changes were needed in `app/cluster/page.tsx`: the Stop button was already rendered conditionally (Start when status is `idle`, Stop when `active`), so it appears automatically once status detection is correct. The dashboard now shows `ACTIVE` with 2 workers, worker status reads `active` instead of `offline`, and the system is resilient to SSH timeouts and network issues. Verified after a container restart on Nov 30 21:18 UTC: the API returns `status='active'` with `activeWorkers=2`, logs confirm the database-first logic, and 22+ processes are running on worker1 with additional workers on worker2.

# Distributed Continuous Optimization Cluster

**24/7 automated strategy discovery** across 2 EPYC servers (64 cores total). Explores the entire indicator/parameter space to find the absolute best trading approach.

## 🏗️ Architecture

**Three-Component Distributed System:**

1. **Coordinator** (`distributed_coordinator.py`) - Master orchestrator running on srvdocker02
   - Defines the parameter grid (14 dimensions, ~500k combinations)
   - Splits work into chunks (e.g., 10,000 combos per chunk)
   - Deploys the worker script to the EPYC servers via SSH/SCP
   - Assigns chunks to idle workers dynamically
   - Collects CSV results and imports them into the SQLite database
   - Tracks progress (completed/running/pending chunks)

2. **Worker** (`distributed_worker.py`) - Runs on the EPYC servers
   - Integrates with the existing `/home/comprehensive_sweep/backtester/` infrastructure
   - Uses the proven `simulator.py` vectorized engine and `MoneyLineInputs` class
   - Loads a chunk spec (start_idx, end_idx within the total parameter grid)
   - Generates parameter combinations via `itertools.product()` (see the sketch after this list)
   - Runs a multiprocessing sweep with `mp.cpu_count()` workers
   - Saves results to CSV (same format as comprehensive_sweep.py)

3. **Monitor** (`exploration_status.py`) - Real-time status dashboard
   - SSH worker health checks (active distributed_worker.py processes)
   - Chunk progress tracking (total/completed/running/pending)
   - Top 10 strategies leaderboard (P&L, trades, WR, PF, DD)
   - Best configuration details (full parameters)
   - Watch mode for continuous monitoring (30s refresh)
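
How a worker turns its chunk spec into concrete backtests is easiest to see in code. The sketch below is illustrative only (`PARAM_RANGES`, `combos_for_chunk`, and `run_backtest` are not the real `distributed_worker.py` API); it shows the general pattern of slicing `itertools.product()` by `start_idx`/`end_idx` and fanning the work out over `multiprocessing`:

```python
import itertools
from multiprocessing import Pool

# Illustrative subset of the grid; the real ordering lives in distributed_coordinator.py.
PARAM_RANGES = {
    "flip_threshold": [0.4, 0.5, 0.6, 0.7],
    "ma_gap": [0.20, 0.30, 0.40, 0.50],
    "adx_min": [18, 21, 24, 27],
}

def combos_for_chunk(start_idx: int, end_idx: int):
    """Yield parameter dicts for the half-open grid slice [start_idx, end_idx)."""
    grid = itertools.product(*PARAM_RANGES.values())
    for values in itertools.islice(grid, start_idx, end_idx):
        yield dict(zip(PARAM_RANGES.keys(), values))

def run_backtest(params: dict) -> dict:
    # Placeholder for the real simulator.py call on solusdt_5m.csv.
    return {**params, "pnl_per_1k": 0.0}

if __name__ == "__main__":
    with Pool() as pool:  # defaults to mp.cpu_count() processes
        results = pool.map(run_backtest, combos_for_chunk(0, 100))
    print(f"{len(results)} combinations backtested")
```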

**Infrastructure:**

- **Worker 1:** pve-nu-monitor01 (10.10.254.106) - EPYC 7282, 32 threads, 62GB RAM
- **Worker 2:** pve-srvmon01 (10.20.254.100, reached via a 2-hop SSH through worker1) - EPYC 7302, 32 threads, 31GB RAM
- **Combined:** 64 cores, ~108,000 backtests/day capacity (proven: 65,536 in 29h)
- **Existing Backtester:** `/home/comprehensive_sweep/backtester/` with simulator.py, indicators/, data/
- **Data:** `solusdt_5m.csv` - Binance 5-minute OHLCV (Nov 2024 - Nov 2025)
- **Database:** `exploration.db` SQLite with strategies/chunks/phases tables

## 🚀 Quick Start

### 1. Test with Small Chunk (RECOMMENDED FIRST)

Verify the system works before large-scale deployment:

```bash
cd /home/icke/traderv4/cluster

# Modify distributed_coordinator.py temporarily (lines 120-135)
# Reduce parameter ranges to 2-3 values per dimension
# Total: ~500-1000 combinations for testing (see the example grid below)

# Run test
python3 distributed_coordinator.py --chunk-size 100

# Monitor in separate terminal
python3 exploration_status.py --watch
```
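
For reference, a shrunk test grid might look like the following. This is only a sketch: the field names mirror the `ParameterGrid` shown under "Parameter Space" below, and the exact values you keep are up to you.

```python
# Temporary test grid: 2 values for nine dimensions, 1 for the rest = 2^9 = 512 combinations
# (5-6 chunks at --chunk-size 100). Restore the full ranges after the test.
ParameterGrid(
    flip_thresholds=[0.5, 0.6],
    ma_gaps=[0.30, 0.40],
    adx_mins=[21, 24],
    long_pos_maxs=[65, 70],
    short_pos_mins=[25, 30],
    cooldowns=[2, 3],
    position_sizes=[1.0],
    tp1_multipliers=[2.0, 2.5],
    tp2_multipliers=[3.0, 4.0],
    sl_multipliers=[2.0, 3.0],
    tp1_close_percents=[0.6],
    trailing_multipliers=[1.5],
    vol_mins=[1.0],
    max_bars_list=[150],
)
```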

**Expected:** 5-10 chunks complete in 30-60 minutes, with all results in `exploration.db`.

**Verify:**

- SSH commands execute successfully
- Worker script deploys to `/home/comprehensive_sweep/backtester/scripts/`
- CSV results appear in `cluster/distributed_results/`
- Database populated with strategies (check with `sqlite3 exploration.db "SELECT COUNT(*) FROM strategies"`)
- Monitoring dashboard shows accurate worker/chunk status

### 2. Run Full v9 Parameter Sweep

After the test succeeds, explore the full parameter space:

```bash
cd /home/icke/traderv4/cluster

# Restore the full parameter ranges in distributed_coordinator.py
# Total: ~500,000 combinations

# Start exploration (runs in background)
nohup python3 distributed_coordinator.py --chunk-size 10000 > sweep.log 2>&1 &

# Monitor progress
python3 exploration_status.py --watch
# OR
watch -n 60 'python3 exploration_status.py'

# Check logs
tail -f sweep.log
```
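
If the dashboard is unavailable, chunk progress can also be read straight from the database. A minimal sketch (table and column names as in the schema further below):

```python
import sqlite3

# Count chunks by status (pending/running/completed/failed).
con = sqlite3.connect("exploration.db")
for status, count in con.execute(
    "SELECT status, COUNT(*) FROM chunks GROUP BY status ORDER BY status"
):
    print(f"{status:>10}: {count}")
con.close()
```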

**Expected Results:**

- Duration: ~3.5 hours with 64 cores
- Find 5-10 configurations with P&L > $250/1k (baseline: $192/1k)
- Quality filters: 700+ trades, 50-70% WR, PF ≥ 1.2

### 3. Query Top Strategies

```bash
# Top 20 performers
sqlite3 cluster/exploration.db <<EOF
SELECT
  params_json,
  printf('$%.2f', pnl_per_1k) AS pnl,
  trades,
  printf('%.1f%%', win_rate * 100) AS wr,
  printf('%.2f', profit_factor) AS pf,
  printf('%.1f%%', max_drawdown * 100) AS dd,
  DATE(tested_at) AS tested
FROM strategies
WHERE trades >= 700
  AND win_rate >= 0.50
  AND win_rate <= 0.70
  AND profit_factor >= 1.2
ORDER BY pnl_per_1k DESC
LIMIT 20;
EOF
```

## 📊 Parameter Space (14 Dimensions)

**v9 Money Line Configuration:**

```python
ParameterGrid(
    flip_thresholds=[0.4, 0.5, 0.6, 0.7],       # EMA flip confirmation (4 values)
    ma_gaps=[0.20, 0.30, 0.40, 0.50],           # MA50-MA200 convergence bonus (4 values)
    adx_mins=[18, 21, 24, 27],                  # ADX requirement for momentum filter (4 values)
    long_pos_maxs=[60, 65, 70, 75],             # Price position for LONG momentum (4 values)
    short_pos_mins=[20, 25, 30, 35],            # Price position for SHORT momentum (4 values)
    cooldowns=[1, 2, 3, 4],                     # Bars between signals (4 values)
    position_sizes=[1.0],                       # Full position (1 value, fixed)
    tp1_multipliers=[1.5, 2.0, 2.5],            # TP1 as ATR multiple (3 values)
    tp2_multipliers=[3.0, 4.0, 5.0],            # TP2 as ATR multiple (3 values)
    sl_multipliers=[2.0, 3.0, 4.0],             # SL as ATR multiple (3 values)
    tp1_close_percents=[0.5, 0.6, 0.7, 0.75],   # TP1 close % (4 values)
    trailing_multipliers=[1.0, 1.5, 2.0],       # Trailing stop multiplier (3 values)
    vol_mins=[0.8, 1.0, 1.2],                   # Minimum volume ratio (3 values)
    max_bars_list=[100, 150, 200],              # Max bars in position (3 values)
)

# Total: ≈ 497,664 combinations (~500k)
```
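
Before launching a multi-hour sweep, it is worth confirming that the configured grid size matches what you expect. A minimal sketch, assuming each `ParameterGrid` field is a plain Python list as shown above:

```python
import math

# List the configured value lists here (two shown; fill in all 14 dimensions).
dimensions = {
    "flip_thresholds": [0.4, 0.5, 0.6, 0.7],
    "ma_gaps": [0.20, 0.30, 0.40, 0.50],
    # ...
    "max_bars_list": [100, 150, 200],
}

total = math.prod(len(values) for values in dimensions.values())
print(f"Total combinations: {total:,}")
```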

## 🎯 Quality Filters

**Applied to all strategy results** (see the pandas example at the end of this section):

- **Minimum trades:** 700+ (statistical significance)
- **Win rate range:** 50-70% (realistic, avoids overfitting)
- **Profit factor:** ≥ 1.2 (solid edge)
- **Max drawdown:** Tracked but no hard limit (informational)

**Why these filters:**

- Trade count validates statistical robustness
- WR range prevents curve-fitting (>70% = overfit, <50% = coin flip)
- PF threshold ensures the strategy has an actual edge
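
The same filters can be applied offline in pandas when digging into results; a sketch assuming the column names from the strategies table in the schema below:

```python
import sqlite3
import pandas as pd

# Load all results, then apply the quality filters client-side.
con = sqlite3.connect("cluster/exploration.db")
df = pd.read_sql_query("SELECT * FROM strategies", con)
con.close()

qualified = df[
    (df["trades"] >= 700)
    & (df["win_rate"].between(0.50, 0.70))
    & (df["profit_factor"] >= 1.2)
].sort_values("pnl_per_1k", ascending=False)

print(qualified[["pnl_per_1k", "trades", "win_rate", "profit_factor", "max_drawdown"]].head(20))
```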

## 📈 Expected Results

**Current Baseline (v9 default parameters):**

- P&L: $192 per $1k capital
- Trades: ~700
- Win Rate: ~61%
- Profit Factor: ~1.4

**Optimization Goals:**

- **Target:** >$250/1k P&L (30% improvement)
- **Stretch:** >$300/1k P&L (56% improvement)
- **Expected:** Find 5-10 configurations meeting the quality filters with P&L > $250/1k

**Why achievable:**

- 500k combinations vs the 27 tested in the earlier narrow sweep
- Full parameter-space exploration vs a limited grid
- Proven infrastructure (65,536 backtests completed successfully)

## 🔄 Continuous Exploration Roadmap

**Phase 1: v9 Money Line Parameter Optimization (~500k combos, 3.5h)**

- Status: READY TO RUN
- Goal: Find optimal flip_threshold, ma_gap, momentum filters
- Expected: >$250/1k P&L

**Phase 2: RSI Divergence Integration (~100k combos, 45min)**

- Add RSI divergence detection
- Combine with the v9 momentum filter
- Parameters: RSI lookback, divergence strength threshold
- Goal: Catch trend reversals early

**Phase 3: Volume Profile Analysis (~200k combos, 1.5h)**

- Volume profile zones (POC, VAH, VAL)
- Order flow imbalance detection
- Parameters: Profile window, entry threshold, confirmation bars
- Goal: Better entry timing

**Phase 4: Multi-Timeframe Confirmation (~150k combos, 1h)**

- 5min + 15min + 1H alignment
- Higher-timeframe trend filter
- Parameters: Timeframes to use, alignment strictness
- Goal: Reduce false signals

**Phase 5: Hybrid Indicators (~50k combos, 30min)**

- Combine the best performers from Phases 1-4
- Test cross-strategy synergy
- Goal: Break the $300/1k barrier

**Phase 6: ML-Based Optimization (~100k+ combos, 1h+)**

- Feature engineering from top strategies
- Gradient boosting / random forest models
- Genetic-algorithm parameter tuning
- Goal: Discover non-obvious patterns

## 📁 File Structure

```
cluster/
├── distributed_coordinator.py    # Master orchestrator (650 lines)
├── distributed_worker.py         # Worker script (350 lines)
├── exploration_status.py         # Monitoring dashboard (200 lines)
├── exploration.db                # SQLite results database
├── distributed_results/          # CSV results from workers
│   ├── worker1_chunk_0.csv
│   ├── worker1_chunk_1.csv
│   └── worker2_chunk_0.csv
└── README.md                     # This file

/home/comprehensive_sweep/backtester/   (on EPYC servers)
├── simulator.py                  # Core vectorized engine
├── indicators/
│   ├── money_line.py             # MoneyLineInputs class
│   └── ...
├── data/
│   └── solusdt_5m.csv            # Binance 5-minute OHLCV
├── scripts/
│   ├── comprehensive_sweep.py    # Original multiprocessing sweep
│   └── distributed_worker.py     # Deployed by coordinator
└── .venv/                        # Python 3.11.2, pandas, numpy
```

## 💾 Database Schema

### strategies table

```sql
CREATE TABLE strategies (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    phase_id       INTEGER,               -- Which exploration phase (1=v9, 2=RSI, etc.)
    params_json    TEXT NOT NULL,         -- JSON parameter configuration
    pnl_per_1k     REAL,                  -- Performance metric ($ P&L per $1k)
    trades         INTEGER,               -- Total trades in backtest
    win_rate       REAL,                  -- Decimal win rate (0.61 = 61%)
    profit_factor  REAL,                  -- Gross profit / gross loss
    max_drawdown   REAL,                  -- Largest peak-to-trough decline (decimal)
    tested_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (phase_id) REFERENCES phases(id)
);

CREATE INDEX idx_strategies_pnl ON strategies(pnl_per_1k DESC);
CREATE INDEX idx_strategies_trades ON strategies(trades);
```

### chunks table

```sql
CREATE TABLE chunks (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    phase_id      INTEGER,
    worker_id     TEXT,                   -- 'worker1' or 'worker2'
    start_idx     INTEGER,                -- Start index in parameter grid
    end_idx       INTEGER,                -- End index (exclusive)
    total_combos  INTEGER,                -- Total combinations in this chunk
    status        TEXT DEFAULT 'pending', -- pending/running/completed/failed
    assigned_at   TIMESTAMP,
    completed_at  TIMESTAMP,
    result_file   TEXT,                   -- Path to CSV result file
    FOREIGN KEY (phase_id) REFERENCES phases(id)
);

CREATE INDEX idx_chunks_status ON chunks(status);
```

### phases table

```sql
CREATE TABLE phases (
    id                  INTEGER PRIMARY KEY AUTOINCREMENT,
    name                TEXT NOT NULL,    -- 'v9_optimization', 'rsi_divergence', etc.
    description         TEXT,
    total_combinations  INTEGER,          -- Total parameter combinations
    started_at          TIMESTAMP,
    completed_at        TIMESTAMP
);
```
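
For reference, importing a worker CSV into the `strategies` table looks roughly like this. The coordinator already does this automatically; the CSV column names used here are assumptions, so adjust them to the actual worker output:

```python
import json
import sqlite3
import pandas as pd

# Assumed metric columns; everything else in the CSV is treated as a parameter.
METRICS = ["pnl_per_1k", "trades", "win_rate", "profit_factor", "max_drawdown"]

df = pd.read_csv("distributed_results/worker1_chunk_0.csv")
con = sqlite3.connect("exploration.db")
with con:  # commits on success, rolls back on error
    for _, row in df.iterrows():
        params = {
            k: (v.item() if hasattr(v, "item") else v)  # numpy scalars -> plain Python
            for k, v in row.items() if k not in METRICS
        }
        con.execute(
            "INSERT INTO strategies (phase_id, params_json, pnl_per_1k, trades,"
            " win_rate, profit_factor, max_drawdown) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (1,  # phase_id 1 = v9 optimization
             json.dumps(params), float(row["pnl_per_1k"]), int(row["trades"]),
             float(row["win_rate"]), float(row["profit_factor"]), float(row["max_drawdown"])),
        )
con.close()
```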

## 🔧 Troubleshooting

### SSH Connection Issues

**Symptom:** "Connection refused" or timeout errors

**Solutions:**

```bash
# Test Worker 1 connectivity
ssh root@10.10.254.106 'echo "Worker 1 OK"'

# Test Worker 2 (2-hop) connectivity
ssh root@10.10.254.106 'ssh root@10.20.254.100 "echo Worker 2 OK"'

# Check SSH keys
ssh-add -l

# Verify authorized_keys on workers
ssh root@10.10.254.106 'cat ~/.ssh/authorized_keys'
```

### Path/Import Errors on Workers

**Symptom:** "ModuleNotFoundError" or "FileNotFoundError"

**Solutions:**

```bash
# Verify backtester exists on Worker 1
ssh root@10.10.254.106 'ls -lah /home/comprehensive_sweep/backtester/'

# Check Python environment
ssh root@10.10.254.106 'cd /home/comprehensive_sweep && source .venv/bin/activate && python --version'

# Verify data file
ssh root@10.10.254.106 'ls -lh /home/comprehensive_sweep/backtester/data/solusdt_5m.csv'

# Check distributed_worker.py deployment
ssh root@10.10.254.106 'ls -lh /home/comprehensive_sweep/backtester/scripts/distributed_worker.py'
```

### Worker Processes Stuck/Hung

**Symptom:** `exploration_status.py` shows "running" but no progress

**Solutions:**

```bash
# Check worker processes
ssh root@10.10.254.106 'ps aux | grep distributed_worker'

# Check worker CPU usage (should be near 100% on 32 cores)
ssh root@10.10.254.106 'top -bn1 | head -20'

# Kill hung worker (coordinator will reassign the chunk)
ssh root@10.10.254.106 'pkill -f distributed_worker.py'

# Check worker logs
ssh root@10.10.254.106 'tail -50 /home/comprehensive_sweep/backtester/scripts/worker_*.log'
```

### Database Locked/Corrupt

**Symptom:** "database is locked" errors

**Solutions:**

```bash
# Check for stale locks
cd /home/icke/traderv4/cluster
fuser exploration.db

# Backup and rebuild
cp exploration.db exploration.db.backup
sqlite3 exploration.db "VACUUM;"

# Verify integrity
sqlite3 exploration.db "PRAGMA integrity_check;"
```

### Results Not Importing

**Symptom:** CSVs in `distributed_results/` but the database is empty

**Solutions:**

```bash
# Check CSV format
head -20 cluster/distributed_results/worker1_chunk_0.csv

# Manual import test
python3 -c "
import pandas as pd

df = pd.read_csv('cluster/distributed_results/worker1_chunk_0.csv')
print(f'Loaded {len(df)} results')
print(df.columns.tolist())
print(df.head())
"

# Check coordinator logs for import errors
grep -i "error\|exception" sweep.log | tail -20
```

## ⚡ Performance Tuning

### Chunk Size Trade-offs

**Small chunks (1,000-5,000):**

- ✅ Better load balancing
- ✅ Faster feedback loop
- ❌ More SSH/SCP overhead
- ❌ More database writes

**Large chunks (10,000-20,000):**

- ✅ Less overhead
- ✅ Fewer database transactions
- ❌ Less granular progress tracking
- ❌ More wasted work if a chunk fails

**Recommended:** 10,000 combos per chunk (a good balance)

### Worker Concurrency

**Current:** Uses `mp.cpu_count()` (32 workers per EPYC)

**To reduce CPU load:**

```python
# In distributed_worker.py, around line 280
# Change from:
workers = mp.cpu_count()
# To:
workers = int(mp.cpu_count() * 0.7)   # 70% utilization (22 workers on a 32-thread EPYC)
```

### Database Optimization

**For large result sets (>100k strategies):**

```bash
# Add indexes if queries are slow
sqlite3 cluster/exploration.db <<EOF
CREATE INDEX IF NOT EXISTS idx_strategies_phase ON strategies(phase_id);
CREATE INDEX IF NOT EXISTS idx_strategies_wr ON strategies(win_rate);
CREATE INDEX IF NOT EXISTS idx_strategies_pf ON strategies(profit_factor);
ANALYZE;
EOF
```

## ✅ Best Practices

1. **Always test with a small chunk first** (100-1,000 combos) before a full sweep
2. **Monitor regularly** with `exploration_status.py --watch` during runs
3. **Back up the database** before major changes: `cp exploration.db exploration.db.backup`
4. **Review top strategies** after each phase completes
5. **Archive old results** if disk space runs low (CSV files can be deleted after import)
6. **Validate quality filters** - adjust if they prove too strict or too lenient
7. **Check worker logs** if progress stalls: `ssh root@10.10.254.106 'tail -f /home/comprehensive_sweep/backtester/scripts/worker_*.log'`

## 🔗 Integration with Production Bot

**After finding a top strategy:**

1. **Extract parameters from the database** (a Python variant is sketched after this list):

   ```bash
   sqlite3 cluster/exploration.db <<EOF
   SELECT params_json FROM strategies
   WHERE id = (SELECT id FROM strategies ORDER BY pnl_per_1k DESC LIMIT 1);
   EOF
   ```

2. **Update TradingView indicator** (`workflows/trading/moneyline_v9_ma_gap.pinescript`):
   - Set `flip_threshold`, `ma_gap`, `momentum_adx`, etc. to the optimal values
   - Test in replay mode with historical data

3. **Update bot configuration** (`.env` file):
   - Adjust `MIN_SIGNAL_QUALITY_SCORE` if needed
   - Update position sizing if the strategy has a different risk profile

4. **Forward test** (50-100 trades) before increasing capital:
   - Use `SOLANA_POSITION_SIZE=10` (10% of capital)
   - Monitor win rate, P&L, drawdown
   - If metrics match the backtest within ±10%, increase to full size
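
A Python variant of step 1, restricted to configurations that pass the quality filters (a sketch; the SQL above works just as well):

```python
import json
import sqlite3

con = sqlite3.connect("cluster/exploration.db")
row = con.execute(
    "SELECT params_json, pnl_per_1k FROM strategies"
    " WHERE trades >= 700 AND win_rate BETWEEN 0.50 AND 0.70 AND profit_factor >= 1.2"
    " ORDER BY pnl_per_1k DESC LIMIT 1"
).fetchone()  # assumes at least one qualifying strategy exists
con.close()

params, pnl = json.loads(row[0]), row[1]
print(f"Best qualified strategy: ${pnl:.2f} per $1k")
for name, value in sorted(params.items()):
    print(f"  {name} = {value}")
```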

## 📚 Support & Documentation

- **Main project docs:** `/home/icke/traderv4/.github/copilot-instructions.md` (5,181 lines)
- **Trading goals:** `TRADING_GOALS.md` (8-phase $106→$100k+ roadmap)
- **v9 indicator:** `INDICATOR_V9_MA_GAP_ROADMAP.md`
- **Optimization roadmaps:** `SIGNAL_QUALITY_OPTIMIZATION_ROADMAP.md`, `POSITION_SCALING_ROADMAP.md`
- **Adaptive leverage:** `ADAPTIVE_LEVERAGE_SYSTEM.md`

## 🚀 Future Enhancements

**Potential additions:**

1. **Genetic Algorithm Optimization** - Breed top performers, test offspring
2. **Bayesian Optimization** - Guide the search toward promising parameter regions
3. **Web Dashboard** - Real-time browser-based monitoring (Flask/FastAPI)
4. **Telegram Alerts** - Notify when exceptional strategies are found (P&L > threshold)
5. **Walk-Forward Analysis** - Test strategies on rolling time windows
6. **Multi-Asset Support** - Extend to ETH, BTC, and other Drift markets
7. **Auto-Deployment** - Push top strategies to production after validation

---

**Questions?** Check the main project documentation or ask in the development chat.

**Ready to start?** Run the test sweep first: `python3 cluster/distributed_coordinator.py --chunk-size 100`