fix: Database-first cluster status detection + Stop button clarification
CRITICAL FIX (Nov 30, 2025):
- Dashboard showed 'idle' despite 22+ worker processes running
- Root cause: SSH-based worker detection timing out
- Solution: Check database for running chunks FIRST
Changes:
1. app/api/cluster/status/route.ts:
- Query exploration database before SSH detection
- If running chunks exist, mark workers 'active' even if SSH fails
- Override worker status: 'offline' → 'active' when chunks running
- Log: '✅ Cluster status: ACTIVE (database shows running chunks)'
- Database is source of truth, SSH only for supplementary metrics
2. app/cluster/page.tsx:
- Stop button ALREADY EXISTS (conditionally shown)
- Shows Start when status='idle', Stop when status='active'
- No code changes needed - fixed by status detection
Result:
- Dashboard now shows 'ACTIVE' with 2 workers (correct)
- Workers show 'active' status (was 'offline')
- Stop button automatically visible when cluster active
- System resilient to SSH timeouts/network issues
Verified:
- Container restarted: Nov 30 21:18 UTC
- API tested: Returns status='active', activeWorkers=2
- Logs confirm: Database-first logic working
- Workers confirmed running: 22+ processes on worker1, additional processes on worker2
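
For illustration, a minimal Python sketch of the database-first rule described above; the actual implementation is TypeScript in app/api/cluster/status/route.ts, and the field names here are hypothetical stand-ins:

```python
import sqlite3

def cluster_status(db_path: str, ssh_worker_count: int | None) -> dict:
    """Database-first status: running chunks in exploration.db mean ACTIVE,
    even when SSH-based worker detection times out (ssh_worker_count=None)."""
    conn = sqlite3.connect(db_path)
    running_chunks = conn.execute(
        "SELECT COUNT(*) FROM chunks WHERE status = 'running'"
    ).fetchone()[0]
    conn.close()

    # The database is the source of truth; SSH only adds supplementary metrics.
    active = running_chunks > 0 or bool(ssh_worker_count)
    return {
        "status": "active" if active else "idle",
        "runningChunks": running_chunks,
        "activeWorkers": ssh_worker_count or 0,
    }

# Example: SSH timed out (None) but chunks are running -> 'active'
print(cluster_status("cluster/exploration.db", None))
```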

---

# Distributed Continuous Optimization Cluster

**24/7 automated strategy discovery** across 2 EPYC servers (64 cores total). Explores the entire indicator/parameter space to find the absolute best trading approach.

## 🏗️ Architecture

```
Coordinator (srvdocker02)
        ↓ chunk assignments via SSH/SCP
Worker 1: pve-nu-monitor01 (22 processes @ 70% CPU)
Worker 2: pve-srvmon01 (22 processes @ 70% CPU, 2-hop SSH)
        ↓ CSV results
Results Database (exploration.db, SQLite)
        ↓
Top Strategies (auto-deployment ready)
```

**Three-Component Distributed System:**

1. **Coordinator** (`distributed_coordinator.py`) - Master orchestrator running on srvdocker02
   - Defines parameter grid (14 dimensions, ~500k combinations)
   - Splits work into chunks (e.g., 10,000 combos per chunk)
   - Deploys worker script to EPYC servers via SSH/SCP
   - Assigns chunks to idle workers dynamically
   - Collects CSV results and imports them into the SQLite database
   - Tracks progress (completed/running/pending chunks)

2. **Worker** (`distributed_worker.py`) - Runs on EPYC servers
   - Integrates with existing `/home/comprehensive_sweep/backtester/` infrastructure
   - Uses the proven `simulator.py` vectorized engine and `MoneyLineInputs` class
   - Loads chunk spec (start_idx, end_idx within the total parameter grid)
   - Generates parameter combinations via `itertools.product()` (see the sketch after this list)
   - Runs a multiprocessing sweep with `mp.cpu_count()` workers
   - Saves results to CSV (same format as comprehensive_sweep.py)

3. **Monitor** (`exploration_status.py`) - Real-time status dashboard
   - SSH worker health checks (active distributed_worker.py processes)
   - Chunk progress tracking (total/completed/running/pending)
   - Top 10 strategies leaderboard (P&L, trades, WR, PF, DD)
   - Best configuration details (full parameters)
   - Watch mode for continuous monitoring (30s refresh)
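
To make the chunk mechanics concrete, here is a minimal sketch of how a worker can expand a `(start_idx, end_idx)` chunk spec into parameter dicts; the three-dimension grid below is a hypothetical stand-in for the real 14-dimension grid in `distributed_coordinator.py`:

```python
import itertools

# Hypothetical grid: each dimension is a list of candidate values.
GRID = {
    "flip_threshold": [0.4, 0.5, 0.6, 0.7],
    "ma_gap": [0.20, 0.30, 0.40, 0.50],
    "adx_min": [18, 21, 24, 27],
}

def chunk_combos(grid: dict, start_idx: int, end_idx: int):
    """Yield parameter dicts for combinations start_idx..end_idx-1."""
    keys = list(grid.keys())
    combos = itertools.product(*(grid[k] for k in keys))
    # islice skips ahead to start_idx and stops before end_idx, so every
    # worker processes a disjoint, reproducible slice of the same grid.
    for values in itertools.islice(combos, start_idx, end_idx):
        yield dict(zip(keys, values))

# Example: the chunk covering combinations 10..19 of this 64-combo grid
for params in chunk_combos(GRID, 10, 20):
    print(params)
```

Because `itertools.product()` is deterministic for a fixed grid, the coordinator only needs to ship two integers per chunk instead of the combinations themselves.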

**Infrastructure:**
- **Worker 1:** pve-nu-monitor01 (10.10.254.106) - EPYC 7282, 32 threads, 62GB RAM
- **Worker 2:** pve-srvmon01 (10.20.254.100 via worker1, 2-hop SSH) - EPYC 7302, 32 threads, 31GB RAM
- **Combined:** 64 cores, ~108,000 backtests/day capacity (proven: 65,536 backtests in 29h)
- **Existing Backtester:** `/home/comprehensive_sweep/backtester/` with simulator.py, indicators/, data/
- **Data:** `solusdt_5m.csv` - Binance 5-minute OHLCV (Nov 2024 - Nov 2025)
- **Database:** `exploration.db` SQLite with strategies/chunks/phases tables

## 🚀 Quick Start

### 1. Test with Small Chunk (RECOMMENDED FIRST)

Verify the system works before large-scale deployment:

```bash
cd /home/icke/traderv4/cluster

# Temporarily modify distributed_coordinator.py (lines 120-135):
# reduce parameter ranges to 2-3 values per dimension
# Total: ~500-1000 combinations for testing

# Run test
python3 distributed_coordinator.py --chunk-size 100

# Monitor in separate terminal
python3 exploration_status.py --watch
```

**Expected:** 5-10 chunks complete in 30-60 minutes, all results in `exploration.db`

**Verify:**
- SSH commands execute successfully
- Worker script deploys to `/home/comprehensive_sweep/backtester/scripts/`
- CSV results appear in `cluster/distributed_results/`
- Database populated with strategies (check with `sqlite3 exploration.db "SELECT COUNT(*) FROM strategies"`)
- Monitoring dashboard shows accurate worker/chunk status

### 2. Run Full v9 Parameter Sweep

After the test succeeds, explore the full parameter space:

```bash
cd /home/icke/traderv4/cluster

# Restore full parameter ranges in distributed_coordinator.py
# Total: ~500,000 combinations

# Start exploration (runs in background)
nohup python3 distributed_coordinator.py --chunk-size 10000 > sweep.log 2>&1 &

# Monitor progress
python3 exploration_status.py --watch
# OR
watch -n 60 'python3 exploration_status.py'

# Check logs
tail -f sweep.log
```

**Expected Results:**
- Duration: ~3.5 hours with 64 cores
- Find 5-10 configurations with P&L > $250/1k (baseline: $192/1k)
- Quality filters: 700+ trades, 50-70% WR, PF ≥ 1.2

### 3. Query Top Strategies

```bash
# Top 20 performers
sqlite3 cluster/exploration.db <<EOF
SELECT
    params_json,
    printf('$%.2f', pnl_per_1k) as pnl,
    trades,
    printf('%.1f%%', win_rate * 100) as wr,
    printf('%.2f', profit_factor) as pf,
    printf('%.1f%%', max_drawdown * 100) as dd,
    DATE(tested_at) as tested
FROM strategies
WHERE trades >= 700
  AND win_rate >= 0.50
  AND win_rate <= 0.70
  AND profit_factor >= 1.2
ORDER BY pnl_per_1k DESC
LIMIT 20;
EOF
```
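
The same filter can be applied from Python when deeper analysis is needed; a small sketch, assuming pandas is installed on the machine holding `exploration.db`:

```python
import json
import sqlite3

import pandas as pd

conn = sqlite3.connect("cluster/exploration.db")
df = pd.read_sql_query(
    """
    SELECT params_json, pnl_per_1k, trades, win_rate, profit_factor
    FROM strategies
    WHERE trades >= 700 AND profit_factor >= 1.2
    ORDER BY pnl_per_1k DESC
    LIMIT 20
    """,
    conn,
)
conn.close()

# Expand the JSON parameter blob into one column per parameter.
params = df["params_json"].apply(json.loads).apply(pd.Series)
print(pd.concat([params, df.drop(columns="params_json")], axis=1).head())
```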

## 📊 Parameter Space (14 Dimensions)

**v9 Money Line Configuration:**

```python
ParameterGrid(
    flip_thresholds=[0.4, 0.5, 0.6, 0.7],       # EMA flip confirmation (4 values)
    ma_gaps=[0.20, 0.30, 0.40, 0.50],           # MA50-MA200 convergence bonus (4 values)
    adx_mins=[18, 21, 24, 27],                  # ADX requirement for momentum filter (4 values)
    long_pos_maxs=[60, 65, 70, 75],             # Price position for LONG momentum (4 values)
    short_pos_mins=[20, 25, 30, 35],            # Price position for SHORT momentum (4 values)
    cooldowns=[1, 2, 3, 4],                     # Bars between signals (4 values)
    position_sizes=[1.0],                       # Full position (1 value, fixed)
    tp1_multipliers=[1.5, 2.0, 2.5],            # TP1 as ATR multiple (3 values)
    tp2_multipliers=[3.0, 4.0, 5.0],            # TP2 as ATR multiple (3 values)
    sl_multipliers=[2.0, 3.0, 4.0],             # SL as ATR multiple (3 values)
    tp1_close_percents=[0.5, 0.6, 0.7, 0.75],   # TP1 close % (4 values)
    trailing_multipliers=[1.0, 1.5, 2.0],       # Trailing stop multiplier (3 values)
    vol_mins=[0.8, 1.0, 1.2],                   # Minimum volume ratio (3 values)
    max_bars_list=[100, 150, 200]               # Max bars in position (3 values)
)

# Total: ≈ 497,664 combinations (~500k)
```

## 🎯 Quality Filters

**Applied to all strategy results:**

- **Minimum trades:** 700+ (statistical significance)
- **Win rate range:** 50-70% (realistic, avoids overfitting)
- **Profit factor:** ≥ 1.2 (solid edge)
- **Max drawdown:** Tracked but no hard limit (informational)

**Why these filters:**
- Trade count validates statistical robustness
- WR range prevents curve-fitting (>70% = overfit, <50% = coin flip)
- PF threshold ensures the strategy has a real edge

## 📈 Expected Results

**Current Baseline (v9 default parameters):**
- P&L: $192 per $1k capital
- Trades: ~700
- Win Rate: ~61%
- Profit Factor: ~1.4

**Optimization Goals:**
- **Target:** >$250/1k P&L (30% improvement)
- **Stretch:** >$300/1k P&L (56% improvement)
- **Expected:** Find 5-10 configurations meeting quality filters with P&L > $250/1k

**Why achievable:**
- 500k combinations vs 27 tested in the earlier narrow sweep
- Full parameter space exploration vs a limited grid
- Proven infrastructure (65,536 backtests completed successfully)

## 🔄 Continuous Exploration Roadmap

**Phase 1: v9 Money Line Parameter Optimization (~500k combos, 3.5h)**
- Status: READY TO RUN
- Goal: Find optimal flip_threshold, ma_gap, momentum filters
- Expected: >$250/1k P&L

**Phase 2: RSI Divergence Integration (~100k combos, 45min)**
- Add RSI divergence detection
- Combine with v9 momentum filter
- Parameters: RSI lookback, divergence strength threshold
- Goal: Catch trend reversals early

**Phase 3: Volume Profile Analysis (~200k combos, 1.5h)**
- Volume profile zones (POC, VAH, VAL)
- Order flow imbalance detection
- Parameters: Profile window, entry threshold, confirmation bars
- Goal: Better entry timing

**Phase 4: Multi-Timeframe Confirmation (~150k combos, 1h)**
- 5min + 15min + 1H alignment
- Higher timeframe trend filter
- Parameters: Timeframes to use, alignment strictness
- Goal: Reduce false signals

**Phase 5: Hybrid Indicators (~50k combos, 30min)**
- Combine best performers from Phases 1-4
- Test cross-strategy synergy
- Goal: Break the $300/1k barrier

**Phase 6: ML-Based Optimization (~100k+ combos, 1h+)**
- Feature engineering from top strategies
- Gradient boosting / random forest
- Genetic algorithm parameter tuning
- Goal: Discover non-obvious patterns

## 📁 File Structure

```
cluster/
├── distributed_coordinator.py      # Master orchestrator (650 lines)
├── distributed_worker.py           # Worker script (350 lines)
├── exploration_status.py           # Monitoring dashboard (200 lines)
├── exploration.db                  # SQLite results database
├── distributed_results/            # CSV results from workers
│   ├── worker1_chunk_0.csv
│   ├── worker1_chunk_1.csv
│   └── worker2_chunk_0.csv
└── README.md                       # This file

/home/comprehensive_sweep/backtester/   (on EPYC servers)
├── simulator.py                    # Core vectorized engine
├── indicators/
│   ├── money_line.py               # MoneyLineInputs class
│   └── ...
├── data/
│   └── solusdt_5m.csv              # Binance 5-minute OHLCV
├── scripts/
│   ├── comprehensive_sweep.py      # Original multiprocessing sweep
│   └── distributed_worker.py       # Deployed by coordinator
└── .venv/                          # Python 3.11.2, pandas, numpy
```

## 💾 Database Schema

### strategies table
```sql
CREATE TABLE strategies (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    phase_id INTEGER,              -- Which exploration phase (1=v9, 2=RSI, etc.)
    params_json TEXT NOT NULL,     -- JSON parameter configuration
    pnl_per_1k REAL,               -- Performance metric ($ PnL per $1k)
    trades INTEGER,                -- Total trades in backtest
    win_rate REAL,                 -- Decimal win rate (0.61 = 61%)
    profit_factor REAL,            -- Gross profit / gross loss
    max_drawdown REAL,             -- Largest peak-to-trough decline (decimal)
    tested_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (phase_id) REFERENCES phases(id)
);
CREATE INDEX idx_strategies_pnl ON strategies(pnl_per_1k DESC);
CREATE INDEX idx_strategies_trades ON strategies(trades);
```

### chunks table
```sql
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    phase_id INTEGER,
    worker_id TEXT,                -- 'worker1' or 'worker2'
    start_idx INTEGER,             -- Start index in parameter grid
    end_idx INTEGER,               -- End index (exclusive)
    total_combos INTEGER,          -- Total in this chunk
    status TEXT DEFAULT 'pending', -- pending/running/completed/failed
    assigned_at TIMESTAMP,
    completed_at TIMESTAMP,
    result_file TEXT,              -- Path to CSV result file
    FOREIGN KEY (phase_id) REFERENCES phases(id)
);
CREATE INDEX idx_chunks_status ON chunks(status);
```

### phases table
```sql
CREATE TABLE phases (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,            -- 'v9_optimization', 'rsi_divergence', etc.
    description TEXT,
    total_combinations INTEGER,    -- Total parameter combinations
    started_at TIMESTAMP,
    completed_at TIMESTAMP
);
```
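
For reference, a minimal sketch of how one worker CSV could be appended to this schema; the coordinator's actual import code may differ, and the CSV column names below are assumptions based on the table definition:

```python
import sqlite3

import pandas as pd

def import_chunk_csv(db_path: str, csv_path: str, phase_id: int) -> int:
    """Append one worker result CSV to the strategies table."""
    df = pd.read_csv(csv_path)
    conn = sqlite3.connect(db_path)
    with conn:  # one transaction per chunk keeps the write cheap
        conn.executemany(
            "INSERT INTO strategies "
            "(phase_id, params_json, pnl_per_1k, trades, win_rate, "
            " profit_factor, max_drawdown) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (
                (phase_id, row.params_json, row.pnl_per_1k, row.trades,
                 row.win_rate, row.profit_factor, row.max_drawdown)
                for row in df.itertuples(index=False)
            ),
        )
    conn.close()
    return len(df)

# Example:
# import_chunk_csv("cluster/exploration.db",
#                  "cluster/distributed_results/worker1_chunk_0.csv", phase_id=1)
```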

## 🔧 Troubleshooting

### SSH Connection Issues

**Symptom:** "Connection refused" or timeout errors

**Solutions:**
```bash
# Test Worker 1 connectivity
ssh root@10.10.254.106 'echo "Worker 1 OK"'

# Test Worker 2 (2-hop) connectivity
ssh root@10.10.254.106 'ssh root@10.20.254.100 "echo Worker 2 OK"'

# Check SSH keys
ssh-add -l

# Verify authorized_keys on workers
ssh root@10.10.254.106 'cat ~/.ssh/authorized_keys'
```

### Path/Import Errors on Workers

**Symptom:** "ModuleNotFoundError" or "FileNotFoundError"

**Solutions:**
```bash
# Verify backtester exists on Worker 1
ssh root@10.10.254.106 'ls -lah /home/comprehensive_sweep/backtester/'

# Check Python environment
ssh root@10.10.254.106 'cd /home/comprehensive_sweep && source .venv/bin/activate && python --version'

# Verify data file
ssh root@10.10.254.106 'ls -lh /home/comprehensive_sweep/backtester/data/solusdt_5m.csv'

# Check distributed_worker.py deployment
ssh root@10.10.254.106 'ls -lh /home/comprehensive_sweep/backtester/scripts/distributed_worker.py'
```

### Worker Processes Stuck/Hung

**Symptom:** exploration_status.py shows "running" but no progress

**Solutions:**
```bash
# Check worker processes
ssh root@10.10.254.106 'ps aux | grep distributed_worker'

# Check worker CPU usage (should be near 100% on 32 cores)
ssh root@10.10.254.106 'top -bn1 | head -20'

# Kill hung worker (coordinator will reassign chunk)
ssh root@10.10.254.106 'pkill -f distributed_worker.py'

# Check worker logs
ssh root@10.10.254.106 'tail -50 /home/comprehensive_sweep/backtester/scripts/worker_*.log'
```

### Database Locked/Corrupt

**Symptom:** "database is locked" errors

**Solutions:**
```bash
# Check for stale locks
cd /home/icke/traderv4/cluster
fuser exploration.db

# Backup and rebuild
cp exploration.db exploration.db.backup
sqlite3 exploration.db "VACUUM;"

# Verify integrity
sqlite3 exploration.db "PRAGMA integrity_check;"
```

### Results Not Importing

**Symptom:** CSVs in distributed_results/ but database empty

**Solutions:**
```bash
# Check CSV format
head -20 cluster/distributed_results/worker1_chunk_0.csv

# Manual import test
python3 -c "
import pandas as pd

df = pd.read_csv('cluster/distributed_results/worker1_chunk_0.csv')
print(f'Loaded {len(df)} results')
print(df.columns.tolist())
print(df.head())
"

# Check coordinator logs for import errors
grep -i "error\|exception" sweep.log | tail -20
```

## ⚡ Performance Tuning

### Chunk Size Trade-offs

**Small chunks (1,000-5,000):**
- ✅ Better load balancing
- ✅ Faster feedback loop
- ❌ More SSH/SCP overhead
- ❌ More database writes

**Large chunks (10,000-20,000):**
- ✅ Less overhead
- ✅ Fewer database transactions
- ❌ Less granular progress tracking
- ❌ Wasted work if a chunk fails

**Recommended:** 10,000 combos per chunk (good balance); see the sketch below for the resulting chunk count.
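
The arithmetic behind that recommendation is easy to check; a tiny sketch (hypothetical helper, not the coordinator's actual code):

```python
def make_chunks(total: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split [0, total) into (start_idx, end_idx) chunk specs."""
    return [(start, min(start + chunk_size, total))
            for start in range(0, total, chunk_size)]

# ~500k combos at 10,000 per chunk -> ~50 chunks,
# plenty for balanced assignment across 2 workers
print(len(make_chunks(497_664, 10_000)))  # 50
```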

### Worker Concurrency

**Current:** Uses `mp.cpu_count()` (32 workers per EPYC)

**To reduce CPU load:**
```python
# In distributed_worker.py, around line 280, change:
workers = mp.cpu_count()
# to:
workers = int(mp.cpu_count() * 0.7)  # 70% utilization (22 workers)
```

### Database Optimization

**For large result sets (>100k strategies):**
```bash
# Add indexes if queries are slow
sqlite3 cluster/exploration.db <<EOF
CREATE INDEX IF NOT EXISTS idx_strategies_phase ON strategies(phase_id);
CREATE INDEX IF NOT EXISTS idx_strategies_wr ON strategies(win_rate);
CREATE INDEX IF NOT EXISTS idx_strategies_pf ON strategies(profit_factor);
ANALYZE;
EOF
```

## ✅ Best Practices

1. **Always test with a small chunk first** (100-1000 combos) before a full sweep
2. **Monitor regularly** with `exploration_status.py --watch` during runs
3. **Backup the database** before major changes: `cp exploration.db exploration.db.backup`
4. **Review top strategies** after each phase completes
5. **Archive old results** if disk space runs low (CSV files can be deleted after import)
6. **Validate quality filters** - adjust if too strict or too lenient based on results
7. **Check worker logs** if progress stalls: `ssh root@10.10.254.106 'tail -f /home/comprehensive_sweep/backtester/scripts/worker_*.log'`

## 🔗 Integration with Production Bot

**After finding a top strategy:**

1. **Extract parameters from the database** (see the sketch after this list for turning the JSON into config values):

```bash
sqlite3 cluster/exploration.db <<EOF
SELECT params_json FROM strategies
WHERE id = (SELECT id FROM strategies ORDER BY pnl_per_1k DESC LIMIT 1);
EOF
```

2. **Update TradingView indicator** (`workflows/trading/moneyline_v9_ma_gap.pinescript`):
   - Set `flip_threshold`, `ma_gap`, `momentum_adx`, etc. to the optimal values
   - Test in replay mode with historical data

3. **Update bot configuration** (`.env` file):
   - Adjust `MIN_SIGNAL_QUALITY_SCORE` if needed
   - Update position sizing if the strategy has a different risk profile

4. **Forward test** (50-100 trades) before increasing capital:
   - Use `SOLANA_POSITION_SIZE=10` (10% of capital)
   - Monitor win rate, P&L, drawdown
   - If metrics match the backtest ±10%, increase to full size
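
To turn the extracted `params_json` blob into configuration values, a small sketch (the KEY=VALUE output is a hypothetical convention; map the names onto whatever the indicator and bot actually read):

```python
import json
import sqlite3

conn = sqlite3.connect("cluster/exploration.db")
row = conn.execute(
    "SELECT params_json FROM strategies ORDER BY pnl_per_1k DESC LIMIT 1"
).fetchone()
conn.close()

params = json.loads(row[0])
for key, value in params.items():
    print(f"{key.upper()}={value}")  # e.g. FLIP_THRESHOLD=0.6
```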

## 📚 Support & Documentation

- **Main project docs:** `/home/icke/traderv4/.github/copilot-instructions.md` (5,181 lines)
- **Trading goals:** `TRADING_GOALS.md` (8-phase $106→$100k+ roadmap)
- **v9 indicator:** `INDICATOR_V9_MA_GAP_ROADMAP.md`
- **Optimization roadmaps:** `SIGNAL_QUALITY_OPTIMIZATION_ROADMAP.md`, `POSITION_SCALING_ROADMAP.md`
- **Adaptive leverage:** `ADAPTIVE_LEVERAGE_SYSTEM.md`

## 🚀 Future Enhancements

**Potential additions:**

1. **Genetic Algorithm Optimization** - Breed top performers, test offspring
2. **Bayesian Optimization** - Guide the search toward promising parameter regions
3. **Web Dashboard** - Real-time browser-based monitoring (Flask/FastAPI)
4. **Telegram Alerts** - Notify when exceptional strategies are found (P&L > threshold)
5. **Walk-Forward Analysis** - Test strategies on rolling time windows
6. **Multi-Asset Support** - Extend to ETH, BTC, other Drift markets
7. **Auto-Deployment** - Push top strategies to production after validation

---

**Questions?** Check the main project documentation or ask in the development chat.

**Ready to start?** Run the test sweep first: `python3 cluster/distributed_coordinator.py --chunk-size 100`