# Distributed Continuous Optimization Cluster
**24/7 automated strategy discovery** across 2 EPYC servers (64 cores total). Explores the entire indicator/parameter space to find the best-performing trading configurations.
## 🏗️ Architecture
**Three-Component Distributed System:**
1. **Coordinator** (`distributed_coordinator.py`) - Master orchestrator running on srvdocker02
   - Defines the parameter grid (14 dimensions, ~500k combinations)
   - Splits work into chunks (e.g., 10,000 combos per chunk; see the sketch after this list)
   - Deploys the worker script to the EPYC servers via SSH/SCP
   - Assigns chunks to idle workers dynamically
   - Collects CSV results and imports them into a SQLite database
   - Tracks progress (completed/running/pending chunks)
2. **Worker** (`distributed_worker.py`) - Runs on the EPYC servers
   - Integrates with the existing `/home/comprehensive_sweep/backtester/` infrastructure
   - Uses the proven `simulator.py` vectorized engine and `MoneyLineInputs` class
   - Loads a chunk spec (start_idx, end_idx within the total parameter grid)
   - Generates parameter combinations via `itertools.product()`
   - Runs a multiprocessing sweep with `mp.cpu_count()` workers
   - Saves results to CSV (same format as comprehensive_sweep.py)
3. **Monitor** (`exploration_status.py`) - Real-time status dashboard
   - SSH worker health checks (active distributed_worker.py processes)
   - Chunk progress tracking (total/completed/running/pending)
   - Top 10 strategies leaderboard (P&L, trades, WR, PF, DD)
   - Best configuration details (full parameters)
   - Watch mode for continuous monitoring (30s refresh)
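To make the chunking step concrete, here is a minimal sketch of how a grid of ~500k combinations could be split into `(start_idx, end_idx)` chunks. This is illustrative only, not the actual `distributed_coordinator.py` code:

```python
# Minimal sketch of the coordinator's chunk-splitting step.
# Illustrative only; the real distributed_coordinator.py may differ.

def make_chunks(total_combos: int, chunk_size: int):
    """Yield (start_idx, end_idx) pairs covering [0, total_combos)."""
    for start in range(0, total_combos, chunk_size):
        yield start, min(start + chunk_size, total_combos)

# ~497,664 combinations at 10,000 combos per chunk -> 50 chunks
chunks = list(make_chunks(497_664, 10_000))
print(len(chunks), chunks[0], chunks[-1])  # 50 (0, 10000) (490000, 497664)
```

Half-open index ranges mean a failed chunk can be reassigned to any idle worker without overlap.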
**Infrastructure:**
- **Worker 1:** pve-nu-monitor01 (10.10.254.106) - EPYC 7282 32 threads, 62GB RAM
- **Worker 2:** pve-srvmon01 (10.20.254.100 via worker1 2-hop SSH) - EPYC 7302 32 threads, 31GB RAM
- **Combined:** 64 cores, ~108,000 backtests/day capacity (proven: 65,536 in 29h)
- **Existing Backtester:** `/home/comprehensive_sweep/backtester/` with simulator.py, indicators/, data/
- **Data:** `solusdt_5m.csv` - Binance 5-minute OHLCV (Nov 2024 - Nov 2025)
- **Database:** `exploration.db` SQLite with strategies/chunks/phases tables
## 🚀 Quick Start
### 1. Test with Small Chunk (RECOMMENDED FIRST)
Verify system works before large-scale deployment:
```bash
cd /home/icke/traderv4/cluster
# Modify distributed_coordinator.py temporarily (lines 120-135)
# Reduce parameter ranges to 2-3 values per dimension
# Total: ~500-1000 combinations for testing
# Run test
python3 distributed_coordinator.py --chunk-size 100
# Monitor in separate terminal
python3 exploration_status.py --watch
```
**Expected:** 5-10 chunks complete in 30-60 minutes, all results in `exploration.db`

**Verify:**
- SSH commands execute successfully
- Worker script deploys to `/home/comprehensive_sweep/backtester/scripts/`
- CSV results appear in `cluster/distributed_results/`
- Database populated with strategies (check with `sqlite3 exploration.db "SELECT COUNT(*) FROM strategies"`)
- Monitoring dashboard shows accurate worker/chunk status
### 2. Run Full v9 Parameter Sweep
After test succeeds, explore full parameter space:
```bash
cd /home/icke/traderv4/cluster
# Restore full parameter ranges in distributed_coordinator.py
# Total: ~500,000 combinations
# Start exploration (runs in background)
nohup python3 distributed_coordinator.py --chunk-size 10000 > sweep.log 2>&1 &
# Monitor progress
python3 exploration_status.py --watch
# OR
watch -n 60 'python3 exploration_status.py'
# Check logs
tail -f sweep.log
```
**Expected Results:**
- Duration: ~3.5 hours with 64 cores
- Find 5-10 configurations with P&L > $250/1k (baseline: $192/1k)
- Quality filters: 700+ trades, 50-70% WR, PF ≥ 1.2
### 3. Query Top Strategies
```bash
# Top 20 performers
sqlite3 cluster/exploration.db <<EOF
SELECT
    params_json,
    printf('$%.2f', pnl_per_1k) AS pnl,
    trades,
    printf('%.1f%%', win_rate * 100) AS wr,
    printf('%.2f', profit_factor) AS pf,
    printf('%.1f%%', max_drawdown * 100) AS dd,
    DATE(tested_at) AS tested
FROM strategies
WHERE trades >= 700
  AND win_rate >= 0.50
  AND win_rate <= 0.70
  AND profit_factor >= 1.2
ORDER BY pnl_per_1k DESC
LIMIT 20;
EOF
```
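If you prefer working in Python, the same query can be pulled into a pandas DataFrame for further analysis. A minimal sketch, assuming the `strategies` schema documented below:

```python
# Load the top filtered strategies into a DataFrame for analysis.
# Assumes the strategies table documented in the Database Schema section.
import sqlite3
import pandas as pd

con = sqlite3.connect("cluster/exploration.db")
top = pd.read_sql_query(
    """
    SELECT params_json, pnl_per_1k, trades, win_rate, profit_factor, max_drawdown
    FROM strategies
    WHERE trades >= 700
      AND win_rate BETWEEN 0.50 AND 0.70
      AND profit_factor >= 1.2
    ORDER BY pnl_per_1k DESC
    LIMIT 20
    """,
    con,
)
con.close()
print(top.to_string(index=False))
```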
## 📊 Parameter Space (14 Dimensions)
**v9 Money Line Configuration:**
```python
ParameterGrid(
    flip_thresholds=[0.4, 0.5, 0.6, 0.7],       # EMA flip confirmation (4 values)
    ma_gaps=[0.20, 0.30, 0.40, 0.50],           # MA50-MA200 convergence bonus (4 values)
    adx_mins=[18, 21, 24, 27],                  # ADX requirement for momentum filter (4 values)
    long_pos_maxs=[60, 65, 70, 75],             # Price position for LONG momentum (4 values)
    short_pos_mins=[20, 25, 30, 35],            # Price position for SHORT momentum (4 values)
    cooldowns=[1, 2, 3, 4],                     # Bars between signals (4 values)
    position_sizes=[1.0],                       # Full position (1 value, fixed)
    tp1_multipliers=[1.5, 2.0, 2.5],            # TP1 as ATR multiple (3 values)
    tp2_multipliers=[3.0, 4.0, 5.0],            # TP2 as ATR multiple (3 values)
    sl_multipliers=[2.0, 3.0, 4.0],             # SL as ATR multiple (3 values)
    tp1_close_percents=[0.5, 0.6, 0.7, 0.75],   # TP1 close % (4 values)
    trailing_multipliers=[1.0, 1.5, 2.0],       # Trailing stop multiplier (3 values)
    vol_mins=[0.8, 1.0, 1.2],                   # Minimum volume ratio (3 values)
    max_bars_list=[100, 150, 200],              # Max bars in position (3 values)
)
# Total: ≈ 497,664 combinations (~500k)
```
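Each worker only receives `(start_idx, end_idx)`, so every flat index must map deterministically to one combination of the grid above. A minimal sketch of that mapping with `itertools` (shown on a reduced 3-dimension grid for brevity; the real worker enumerates all 14 dimensions in a fixed order):

```python
# Map a chunk's flat index range [start_idx, end_idx) to parameter dicts.
# Reduced grid for illustration; dimension order must match the coordinator.
from itertools import islice, product

grid = {
    "flip_threshold": [0.4, 0.5, 0.6, 0.7],
    "tp1_multiplier": [1.5, 2.0, 2.5],
    "sl_multiplier": [2.0, 3.0, 4.0],
}

def combos_for_chunk(grid: dict, start_idx: int, end_idx: int):
    """Yield one parameter dict per flat index in [start_idx, end_idx)."""
    keys = list(grid)
    for values in islice(product(*grid.values()), start_idx, end_idx):
        yield dict(zip(keys, values))

for params in combos_for_chunk(grid, 10, 13):
    print(params)  # three consecutive combinations out of 4*3*3 = 36
```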
## 🎯 Quality Filters
**Applied to all strategy results:**
- **Minimum trades:** 700+ (statistical significance)
- **Win rate range:** 50-70% (realistic, avoids overfitting)
- **Profit factor:** ≥ 1.2 (solid edge)
- **Max drawdown:** Tracked but no hard limit (informational)

**Why these filters:**
- Trade count validates statistical robustness
- WR range prevents curve-fitting (>70% = overfit, <50% = coin flip)
- PF threshold ensures strategy has actual edge
## 📈 Expected Results
**Current Baseline (v9 default parameters):**
- P&L: $192 per $1k capital
- Trades: ~700
- Win Rate: ~61%
- Profit Factor: ~1.4

**Optimization Goals:**
- **Target:** >$250/1k P&L (30% improvement)
- **Stretch:** >$300/1k P&L (56% improvement)
- **Expected:** Find 5-10 configurations meeting quality filters with P&L > $250/1k

**Why achievable:**
- 500k combinations vs 27 tested in narrow sweep
- Full parameter space exploration vs limited grid
- Proven infrastructure (65,536 backtests completed successfully)
## 🔄 Continuous Exploration Roadmap
**Phase 1: v9 Money Line Parameter Optimization (~500k combos, 3.5h)**
- Status: READY TO RUN
- Goal: Find optimal flip_threshold, ma_gap, momentum filters
- Expected: >$250/1k P&L

**Phase 2: RSI Divergence Integration (~100k combos, 45min)**
- Add RSI divergence detection
- Combine with v9 momentum filter
- Parameters: RSI lookback, divergence strength threshold
- Goal: Catch trend reversals early

**Phase 3: Volume Profile Analysis (~200k combos, 1.5h)**
- Volume profile zones (POC, VAH, VAL)
- Order flow imbalance detection
- Parameters: Profile window, entry threshold, confirmation bars
- Goal: Better entry timing

**Phase 4: Multi-Timeframe Confirmation (~150k combos, 1h)**
- 5min + 15min + 1H alignment
- Higher timeframe trend filter
- Parameters: Timeframes to use, alignment strictness
- Goal: Reduce false signals

**Phase 5: Hybrid Indicators (~50k combos, 30min)**
- Combine best performers from Phases 1-4
- Test cross-strategy synergy
- Goal: Break the $300/1k barrier

**Phase 6: ML-Based Optimization (~100k+ combos, 1h+)**
- Feature engineering from top strategies
- Gradient boosting / random forest
- Genetic algorithm parameter tuning
- Goal: Discover non-obvious patterns
## 📁 File Structure
```
cluster/
├── distributed_coordinator.py   # Master orchestrator (650 lines)
├── distributed_worker.py        # Worker script (350 lines)
├── exploration_status.py        # Monitoring dashboard (200 lines)
├── exploration.db               # SQLite results database
├── distributed_results/         # CSV results from workers
│   ├── worker1_chunk_0.csv
│   ├── worker1_chunk_1.csv
│   └── worker2_chunk_0.csv
└── README.md                    # This file

/home/comprehensive_sweep/backtester/   (on EPYC servers)
├── simulator.py                 # Core vectorized engine
├── indicators/
│   ├── money_line.py            # MoneyLineInputs class
│   └── ...
├── data/
│   └── solusdt_5m.csv           # Binance 5-minute OHLCV
├── scripts/
│   ├── comprehensive_sweep.py   # Original multiprocessing sweep
│   └── distributed_worker.py    # Deployed by coordinator
└── .venv/                       # Python 3.11.2, pandas, numpy
```
## 💾 Database Schema
### strategies table
```sql
CREATE TABLE strategies (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    phase_id INTEGER,               -- Which exploration phase (1=v9, 2=RSI, etc.)
    params_json TEXT NOT NULL,      -- JSON parameter configuration
    pnl_per_1k REAL,                -- Performance metric ($ PnL per $1k)
    trades INTEGER,                 -- Total trades in backtest
    win_rate REAL,                  -- Decimal win rate (0.61 = 61%)
    profit_factor REAL,             -- Gross profit / gross loss
    max_drawdown REAL,              -- Largest peak-to-trough decline (decimal)
    tested_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (phase_id) REFERENCES phases(id)
);

CREATE INDEX idx_strategies_pnl ON strategies(pnl_per_1k DESC);
CREATE INDEX idx_strategies_trades ON strategies(trades);
```
### chunks table
```sql
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    phase_id INTEGER,
    worker_id TEXT,                 -- 'worker1' or 'worker2'
    start_idx INTEGER,              -- Start index in parameter grid
    end_idx INTEGER,                -- End index (exclusive)
    total_combos INTEGER,           -- Total in this chunk
    status TEXT DEFAULT 'pending',  -- pending/running/completed/failed
    assigned_at TIMESTAMP,
    completed_at TIMESTAMP,
    result_file TEXT,               -- Path to CSV result file
    FOREIGN KEY (phase_id) REFERENCES phases(id)
);

CREATE INDEX idx_chunks_status ON chunks(status);
```
### phases table
```sql
CREATE TABLE phases (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,             -- 'v9_optimization', 'rsi_divergence', etc.
    description TEXT,
    total_combinations INTEGER,     -- Total parameter combinations
    started_at TIMESTAMP,
    completed_at TIMESTAMP
);
```
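For reference, the import step the coordinator performs can be reproduced by hand. This is a minimal sketch, assuming the worker CSVs use column names matching the `strategies` table above (the authoritative logic lives in `distributed_coordinator.py`):

```python
# Manually import one worker CSV into the strategies table.
# Assumes CSV column names match the schema above; adjust if they differ.
import sqlite3
import pandas as pd

df = pd.read_csv("cluster/distributed_results/worker1_chunk_0.csv")

con = sqlite3.connect("cluster/exploration.db")
df.to_sql("strategies", con, if_exists="append", index=False)
con.commit()
con.close()
print(f"Imported {len(df)} strategies")
```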
## 🔧 Troubleshooting
### SSH Connection Issues
**Symptom:** "Connection refused" or timeout errors

**Solutions:**
```bash
# Test Worker 1 connectivity
ssh root@10.10.254.106 'echo "Worker 1 OK"'
# Test Worker 2 (2-hop) connectivity
ssh root@10.10.254.106 'ssh root@10.20.254.100 "echo Worker 2 OK"'
# Check SSH keys
ssh-add -l
# Verify authorized_keys on workers
ssh root@10.10.254.106 'cat ~/.ssh/authorized_keys'
```
### Path/Import Errors on Workers
**Symptom:** "ModuleNotFoundError" or "FileNotFoundError"

**Solutions:**
```bash
# Verify backtester exists on Worker 1
ssh root@10.10.254.106 'ls -lah /home/comprehensive_sweep/backtester/'
# Check Python environment
ssh root@10.10.254.106 'cd /home/comprehensive_sweep && source .venv/bin/activate && python --version'
# Verify data file
ssh root@10.10.254.106 'ls -lh /home/comprehensive_sweep/backtester/data/solusdt_5m.csv'
# Check distributed_worker.py deployment
ssh root@10.10.254.106 'ls -lh /home/comprehensive_sweep/backtester/scripts/distributed_worker.py'
```
### Worker Processes Stuck/Hung
**Symptom:** exploration_status.py shows "running" but no progress

**Solutions:**
```bash
# Check worker processes
ssh root@10.10.254.106 'ps aux | grep distributed_worker'
# Check worker CPU usage (should be near 100% on 32 cores)
ssh root@10.10.254.106 'top -bn1 | head -20'
# Kill hung worker (coordinator will reassign chunk)
ssh root@10.10.254.106 'pkill -f distributed_worker.py'
# Check worker logs
ssh root@10.10.254.106 'tail -50 /home/comprehensive_sweep/backtester/scripts/worker_*.log'
```
### Database Locked/Corrupt
**Symptom:** "database is locked" errors

**Solutions:**
```bash
# Check for stale locks
cd /home/icke/traderv4/cluster
fuser exploration.db
# Backup and rebuild
cp exploration.db exploration.db.backup
sqlite3 exploration.db "VACUUM;"
# Verify integrity
sqlite3 exploration.db "PRAGMA integrity_check;"
```
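If lock errors recur, a standard SQLite mitigation is write-ahead logging (WAL), which lets the status dashboard read while the coordinator writes. This is not part of the coordinator code described here, so treat it as an optional tweak:

```python
# Switch exploration.db to WAL journal mode; the setting persists in the file.
import sqlite3

con = sqlite3.connect("cluster/exploration.db")
mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]
con.close()
print(f"journal_mode = {mode}")  # expect 'wal'
```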
### Results Not Importing
**Symptom:** CSVs in distributed_results/ but database empty

**Solutions:**
```bash
# Check CSV format
head -20 cluster/distributed_results/worker1_chunk_0.csv
# Manual import test
python3 -c "
import sqlite3
import pandas as pd
df = pd.read_csv('cluster/distributed_results/worker1_chunk_0.csv')
print(f'Loaded {len(df)} results')
print(df.columns.tolist())
print(df.head())
"
# Check coordinator logs for import errors
grep -i "error\|exception" sweep.log | tail -20
```
## ⚡ Performance Tuning
### Chunk Size Trade-offs
**Small chunks (1,000-5,000):**
- ✅ Better load balancing
- ✅ Faster feedback loop
- ❌ More SSH/SCP overhead
- ❌ More database writes

**Large chunks (10,000-20,000):**
- ✅ Less overhead
- ✅ Fewer database transactions
- ❌ Less granular progress tracking
- ❌ Wasted work if a chunk fails

**Recommended:** 10,000 combos per chunk (a good balance)
### Worker Concurrency
**Current:** Uses `mp.cpu_count()` (32 workers per EPYC)

**To reduce CPU load:**
```python
# In distributed_worker.py line ~280
# Change from:
workers = mp.cpu_count()
# To:
workers = int(mp.cpu_count() * 0.7) # 70% utilization (22 workers)
```
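For context, the worker's sweep is conceptually a `multiprocessing.Pool` mapped over the chunk's combinations. A rough sketch with a hypothetical `run_backtest()` stand-in (the real `distributed_worker.py` calls into `simulator.py` instead):

```python
# Conceptual sketch of the worker's multiprocessing sweep.
# run_backtest() is a hypothetical stand-in for the simulator.py call.
import multiprocessing as mp

def run_backtest(params: dict) -> dict:
    # Placeholder: the real worker runs the vectorized backtest here.
    return {"params": params, "pnl_per_1k": 0.0}

if __name__ == "__main__":
    combos = [{"flip_threshold": t} for t in (0.4, 0.5, 0.6, 0.7)]
    workers = int(mp.cpu_count() * 0.7)  # throttled, as shown above
    with mp.Pool(processes=workers) as pool:
        results = pool.map(run_backtest, combos)
    print(f"Completed {len(results)} backtests")
```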
### Database Optimization
**For large result sets (>100k strategies):**
```bash
# Add indexes if queries slow
sqlite3 cluster/exploration.db <<EOF
CREATE INDEX IF NOT EXISTS idx_strategies_phase ON strategies(phase_id);
CREATE INDEX IF NOT EXISTS idx_strategies_wr ON strategies(win_rate);
CREATE INDEX IF NOT EXISTS idx_strategies_pf ON strategies(profit_factor);
ANALYZE;
EOF
```
## ✅ Best Practices
1. **Always test with small chunk first** (100-1000 combos) before full sweep
2. **Monitor regularly** with `exploration_status.py --watch` during runs
3. **Backup database** before major changes: `cp exploration.db exploration.db.backup`
4. **Review top strategies** after each phase completion
5. **Archive old results** if disk space low (CSV files can be deleted after import)
6. **Validate quality filters** - adjust if too strict/lenient based on results
7. **Check worker logs** if progress stalls: `ssh root@10.10.254.106 'tail -f /home/comprehensive_sweep/backtester/scripts/worker_*.log'`
## 🔗 Integration with Production Bot
**After finding top strategy:**
1. **Extract parameters from database:**
```bash
sqlite3 cluster/exploration.db <<EOF
SELECT params_json FROM strategies
WHERE id = (SELECT id FROM strategies ORDER BY pnl_per_1k DESC LIMIT 1);
EOF
```
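From Python, the same row can be fetched and parsed with `json.loads()` before copying values into the indicator and bot config. Key names are assumed to follow the ParameterGrid fields above:

```python
# Fetch and parse the best strategy's parameter JSON.
# Key names are assumed to match the ParameterGrid fields in this README.
import json
import sqlite3

con = sqlite3.connect("cluster/exploration.db")
row = con.execute(
    "SELECT params_json FROM strategies ORDER BY pnl_per_1k DESC LIMIT 1"
).fetchone()
con.close()

params = json.loads(row[0])
print(params.get("flip_threshold"), params.get("tp1_multiplier"))
```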
2. **Update TradingView indicator** (`workflows/trading/moneyline_v9_ma_gap.pinescript`):
   - Set `flip_threshold`, `ma_gap`, `momentum_adx`, etc. to the optimal values
   - Test in replay mode with historical data
3. **Update bot configuration** (`.env` file):
   - Adjust `MIN_SIGNAL_QUALITY_SCORE` if needed
   - Update position sizing if the strategy has a different risk profile
4. **Forward test** (50-100 trades) before increasing capital:
   - Use `SOLANA_POSITION_SIZE=10` (10% of capital)
   - Monitor win rate, P&L, drawdown
   - If metrics match the backtest within ±10%, increase to full size
## 📚 Support & Documentation
- **Main project docs:** `/home/icke/traderv4/.github/copilot-instructions.md` (5,181 lines)
- **Trading goals:** `TRADING_GOALS.md` (8-phase $106→$100k+ roadmap)
- **v9 indicator:** `INDICATOR_V9_MA_GAP_ROADMAP.md`
- **Optimization roadmaps:** `SIGNAL_QUALITY_OPTIMIZATION_ROADMAP.md`, `POSITION_SCALING_ROADMAP.md`
- **Adaptive leverage:** `ADAPTIVE_LEVERAGE_SYSTEM.md`
## 🚀 Future Enhancements
**Potential additions:**
1. **Genetic Algorithm Optimization** - Breed top performers, test offspring
2. **Bayesian Optimization** - Guide search toward promising parameter regions
3. **Web Dashboard** - Real-time browser-based monitoring (Flask/FastAPI)
4. **Telegram Alerts** - Notify when exceptional strategies found (P&L > threshold)
5. **Walk-Forward Analysis** - Test strategies on rolling time windows
6. **Multi-Asset Support** - Extend to ETH, BTC, other Drift markets
7. **Auto-Deployment** - Push top strategies to production after validation
---
**Questions?** Check main project documentation or ask in development chat.
**Ready to start?** Run test sweep first: `python3 cluster/distributed_coordinator.py --chunk-size 100`