CRITICAL FIX (Dec 1, 2025): Cluster start was failing with "operation failed"

Problem:
- SSH commands were timing out after 30s (too short for the 2-hop SSH to worker2)
- Missing SSH options caused prompts and delays
- Result: the coordinator failed to start worker processes

Solution:
- Increased the timeout from 30s to 60s for nested SSH hops
- Added SSH options: -o StrictHostKeyChecking=no -o ConnectTimeout=10
- Applied the options to both ssh_command() and the worker startup commands

Verification (Dec 1, 09:40):
- Worker1: 23 processes running (chunk 0-2000)
- Worker2: 24 processes running (chunk 2000-4000)
- Cluster status: ACTIVE with 2 workers
- Both chunks processing successfully

Files changed:
- cluster/distributed_coordinator.py (lines 302-314, 388-414)
Distributed Continuous Optimization Cluster
24/7 automated strategy discovery across 2 EPYC servers (64 cores total). Explores entire indicator/parameter space to find the absolute best trading approach.
🏗️ Architecture
Three-Component Distributed System:
- Coordinator (distributed_coordinator.py) - Master orchestrator running on srvdocker02 (its scheduling loop is sketched after this list)
  - Defines the parameter grid (14 dimensions, ~500k combinations)
  - Splits work into chunks (e.g., 10,000 combos per chunk)
  - Deploys the worker script to the EPYC servers via SSH/SCP
  - Assigns chunks to idle workers dynamically
  - Collects CSV results and imports them into the SQLite database
  - Tracks progress (completed/running/pending chunks)
- Worker (distributed_worker.py) - Runs on the EPYC servers
  - Integrates with the existing /home/comprehensive_sweep/backtester/ infrastructure
  - Uses the proven simulator.py vectorized engine and the MoneyLineInputs class
  - Loads a chunk spec (start_idx, end_idx into the total parameter grid)
  - Generates parameter combinations via itertools.product()
  - Runs a multiprocessing sweep with mp.cpu_count() workers
  - Saves results to CSV (same format as comprehensive_sweep.py)
- Monitor (exploration_status.py) - Real-time status dashboard
  - SSH worker health checks (active distributed_worker.py processes)
  - Chunk progress tracking (total/completed/running/pending)
  - Top 10 strategies leaderboard (P&L, trades, WR, PF, DD)
  - Best configuration details (full parameters)
  - Watch mode for continuous monitoring (30s refresh)
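As referenced above, the coordinator's scheduling behaviour reduces to a simple loop: split the flat parameter-grid index space into chunks, hand chunks to idle workers, and import results as chunks finish. The sketch below only illustrates that loop; the stubbed helpers (worker_is_idle, deploy_chunk, chunk_finished, import_results) are placeholders for the real SSH/SCP and SQLite logic in distributed_coordinator.py, not the actual implementation.

```python
# Illustrative coordinator loop; the stub helpers stand in for the real
# SSH/SCP and SQLite logic in cluster/distributed_coordinator.py.
import time

WORKERS = ["worker1", "worker2"]

def make_chunks(total: int, size: int) -> list[tuple[int, int]]:
    """Split the flat parameter-grid index space into (start_idx, end_idx) chunks."""
    return [(start, min(start + size, total)) for start in range(0, total, size)]

# --- stubs standing in for the real implementation ---
def worker_is_idle(worker: str) -> bool:                          # real: SSH + pgrep
    return True

def deploy_chunk(worker: str, chunk: tuple[int, int]) -> None:    # real: SCP + nohup
    print(f"deploy {chunk} -> {worker}")

def chunk_finished(worker: str, chunk: tuple[int, int]) -> bool:  # real: check for result CSV
    return True

def import_results(worker: str, chunk: tuple[int, int]) -> None:  # real: CSV -> exploration.db
    print(f"import {chunk} from {worker}")

def run(total_combos: int = 497_664, chunk_size: int = 10_000, poll_s: int = 30) -> None:
    pending = make_chunks(total_combos, chunk_size)
    running: dict[str, tuple[int, int]] = {}
    while pending or running:
        # Hand the next pending chunk to every idle worker.
        for worker in WORKERS:
            if worker not in running and pending and worker_is_idle(worker):
                chunk = pending.pop(0)
                deploy_chunk(worker, chunk)
                running[worker] = chunk
        # Collect finished chunks and free their workers.
        for worker, chunk in list(running.items()):
            if chunk_finished(worker, chunk):
                import_results(worker, chunk)
                del running[worker]
        time.sleep(poll_s)
```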
Infrastructure:
- Worker 1: pve-nu-monitor01 (10.10.254.106) - EPYC 7282, 32 threads, 62GB RAM
- Worker 2: pve-srvmon01 (10.20.254.100, reached via 2-hop SSH through Worker 1) - EPYC 7302, 32 threads, 31GB RAM
- Combined: 64 cores, ~108,000 backtests/day capacity (proven: 65,536 in 29h)
- Existing backtester: /home/comprehensive_sweep/backtester/ with simulator.py, indicators/, data/
- Data: solusdt_5m.csv - Binance 5-minute OHLCV (Nov 2024 - Nov 2025)
- Database: exploration.db - SQLite with strategies/chunks/phases tables
🚀 Quick Start
1. Test with Small Chunk (RECOMMENDED FIRST)
Verify system works before large-scale deployment:
cd /home/icke/traderv4/cluster
# Modify distributed_coordinator.py temporarily (lines 120-135)
# Reduce parameter ranges to 2-3 values per dimension
# Total: ~500-1000 combinations for testing
# Run test
python3 distributed_coordinator.py --chunk-size 100
# Monitor in separate terminal
python3 exploration_status.py --watch
Expected: 5-10 chunks complete in 30-60 minutes, all results in exploration.db
Verify:
- SSH commands execute successfully
- Worker script deploys to /home/comprehensive_sweep/backtester/scripts/
- CSV results appear in cluster/distributed_results/
- Database populated with strategies (check with sqlite3 exploration.db "SELECT COUNT(*) FROM strategies")
- Monitoring dashboard shows accurate worker/chunk status
2. Run Full v9 Parameter Sweep
After test succeeds, explore full parameter space:
cd /home/icke/traderv4/cluster
# Restore full parameter ranges in distributed_coordinator.py
# Total: ~500,000 combinations (see the Parameter Space section below)
# Start exploration (runs in background)
nohup python3 distributed_coordinator.py --chunk-size 10000 > sweep.log 2>&1 &
# Monitor progress
python3 exploration_status.py --watch
# OR
watch -n 60 'python3 exploration_status.py'
# Check logs
tail -f sweep.log
Expected Results:
- Duration: ~3.5 hours with 64 cores
- Find 5-10 configurations with P&L > $250/1k (baseline: $192/1k)
- Quality filters: 700+ trades, 50-70% WR, PF ≥ 1.2
3. Query Top Strategies
# Top 20 performers
sqlite3 cluster/exploration.db <<EOF
SELECT
params_json,
printf('$%.2f', pnl_per_1k) as pnl,
trades,
printf('%.1f%%', win_rate * 100) as wr,
printf('%.2f', profit_factor) as pf,
printf('%.1f%%', max_drawdown * 100) as dd,
DATE(tested_at) as tested
FROM strategies
WHERE trades >= 700
AND win_rate >= 0.50
AND win_rate <= 0.70
AND profit_factor >= 1.2
ORDER BY pnl_per_1k DESC
LIMIT 20;
EOF
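If you prefer working in Python, the same query can be loaded into pandas for further analysis. This is a sketch against the schema documented below; run it from the repository root so the cluster/exploration.db path resolves.

```python
# Load the top filtered strategies into a DataFrame (assumes the documented schema).
import sqlite3
import pandas as pd

QUERY = """
SELECT params_json, pnl_per_1k, trades, win_rate, profit_factor, max_drawdown, tested_at
FROM strategies
WHERE trades >= 700
  AND win_rate BETWEEN 0.50 AND 0.70
  AND profit_factor >= 1.2
ORDER BY pnl_per_1k DESC
LIMIT 20
"""

with sqlite3.connect("cluster/exploration.db") as conn:
    top = pd.read_sql_query(QUERY, conn)

print(top[["pnl_per_1k", "trades", "win_rate", "profit_factor"]])
```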
📊 Parameter Space (14 Dimensions)
v9 Money Line Configuration:
ParameterGrid(
flip_thresholds=[0.4, 0.5, 0.6, 0.7], # EMA flip confirmation (4 values)
ma_gaps=[0.20, 0.30, 0.40, 0.50], # MA50-MA200 convergence bonus (4 values)
adx_mins=[18, 21, 24, 27], # ADX requirement for momentum filter (4 values)
long_pos_maxs=[60, 65, 70, 75], # Price position for LONG momentum (4 values)
short_pos_mins=[20, 25, 30, 35], # Price position for SHORT momentum (4 values)
cooldowns=[1, 2, 3, 4], # Bars between signals (4 values)
position_sizes=[1.0], # Full position (1 value fixed)
tp1_multipliers=[1.5, 2.0, 2.5], # TP1 as ATR multiple (3 values)
tp2_multipliers=[3.0, 4.0, 5.0], # TP2 as ATR multiple (3 values)
sl_multipliers=[2.0, 3.0, 4.0], # SL as ATR multiple (3 values)
tp1_close_percents=[0.5, 0.6, 0.7, 0.75], # TP1 close % (4 values)
trailing_multipliers=[1.0, 1.5, 2.0], # Trailing stop multiplier (3 values)
vol_mins=[0.8, 1.0, 1.2], # Minimum volume ratio (3 values)
max_bars_list=[100, 150, 200] # Max bars in position (3 values)
)
# Total: ≈ 497,664 combinations (~500k)
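The worker enumerates this grid with itertools.product() and only evaluates the [start_idx, end_idx) slice assigned to it by the coordinator. Below is a minimal sketch of that indexing, using a toy two-dimensional grid in place of the full 14-dimensional one.

```python
# Toy illustration of how a worker slices its chunk out of the full grid.
import itertools

# Stand-in grid; the real grid has 14 dimensions (see ParameterGrid above).
grid = {
    "flip_threshold": [0.4, 0.5, 0.6, 0.7],
    "tp1_multiplier": [1.5, 2.0, 2.5],
}

keys = list(grid)
all_combos = itertools.product(*(grid[k] for k in keys))

start_idx, end_idx = 4, 8            # chunk spec handed out by the coordinator
chunk = itertools.islice(all_combos, start_idx, end_idx)

for values in chunk:
    params = dict(zip(keys, values))
    print(params)                     # the real worker runs a backtest here
```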
🎯 Quality Filters
Applied to all strategy results:
- Minimum trades: 700+ (statistical significance)
- Win rate range: 50-70% (realistic, avoids overfitting)
- Profit factor: ≥ 1.2 (solid edge)
- Max drawdown: Tracked but no hard limit (informational)
Why these filters:
- Trade count validates statistical robustness
- WR range prevents curve-fitting (>70% is likely overfit, <50% is worse than a coin flip)
- PF threshold ensures strategy has actual edge
📈 Expected Results
Current Baseline (v9 default parameters):
- P&L: $192 per $1k capital
- Trades: ~700
- Win Rate: ~61%
- Profit Factor: ~1.4
Optimization Goals:
- Target: >$250/1k P&L (30% improvement)
- Stretch: >$300/1k P&L (56% improvement)
- Expected: Find 5-10 configurations meeting quality filters with P&L > $250/1k
Why achievable:
- 500k combinations vs 27 tested in narrow sweep
- Full parameter space exploration vs limited grid
- Proven infrastructure (65,536 backtests completed successfully)
🔄 Continuous Exploration Roadmap
Phase 1: v9 Money Line Parameter Optimization (~500k combos, 3.5h)
- Status: READY TO RUN
- Goal: Find optimal flip_threshold, ma_gap, momentum filters
- Expected: >$250/1k P&L
Phase 2: RSI Divergence Integration (~100k combos, 45min)
- Add RSI divergence detection
- Combine with v9 momentum filter
- Parameters: RSI lookback, divergence strength threshold
- Goal: Catch trend reversals early
Phase 3: Volume Profile Analysis (~200k combos, 1.5h)
- Volume profile zones (POC, VAH, VAL)
- Order flow imbalance detection
- Parameters: Profile window, entry threshold, confirmation bars
- Goal: Better entry timing
Phase 4: Multi-Timeframe Confirmation (~150k combos, 1h)
- 5min + 15min + 1H alignment
- Higher timeframe trend filter
- Parameters: Timeframes to use, alignment strictness
- Goal: Reduce false signals
Phase 5: Hybrid Indicators (~50k combos, 30min)
- Combine best performers from Phase 1-4
- Test cross-strategy synergy
- Goal: Break $300/1k barrier
Phase 6: ML-Based Optimization (~100k+ combos, 1h+)
- Feature engineering from top strategies
- Gradient boosting / random forest
- Genetic algorithm parameter tuning
- Goal: Discover non-obvious patterns
📁 File Structure
cluster/
├── distributed_coordinator.py # Master orchestrator (650 lines)
├── distributed_worker.py # Worker script (350 lines)
├── exploration_status.py # Monitoring dashboard (200 lines)
├── exploration.db # SQLite results database
├── distributed_results/ # CSV results from workers
│ ├── worker1_chunk_0.csv
│ ├── worker1_chunk_1.csv
│ └── worker2_chunk_0.csv
└── README.md # This file
/home/comprehensive_sweep/backtester/ (on EPYC servers)
├── simulator.py # Core vectorized engine
├── indicators/
│ ├── money_line.py # MoneyLineInputs class
│ └── ...
├── data/
│ └── solusdt_5m.csv # Binance 5-minute OHLCV
├── scripts/
│ ├── comprehensive_sweep.py # Original multiprocessing sweep
│ └── distributed_worker.py # Deployed by coordinator
└── .venv/ # Python 3.11.2, pandas, numpy
💾 Database Schema
strategies table
CREATE TABLE strategies (
id INTEGER PRIMARY KEY AUTOINCREMENT,
phase_id INTEGER, -- Which exploration phase (1=v9, 2=RSI, etc.)
params_json TEXT NOT NULL, -- JSON parameter configuration
pnl_per_1k REAL, -- Performance metric ($ PnL per $1k)
trades INTEGER, -- Total trades in backtest
win_rate REAL, -- Decimal win rate (0.61 = 61%)
profit_factor REAL, -- Gross profit / gross loss
max_drawdown REAL, -- Largest peak-to-trough decline (decimal)
tested_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (phase_id) REFERENCES phases(id)
);
CREATE INDEX idx_strategies_pnl ON strategies(pnl_per_1k DESC);
CREATE INDEX idx_strategies_trades ON strategies(trades);
chunks table
CREATE TABLE chunks (
id INTEGER PRIMARY KEY AUTOINCREMENT,
phase_id INTEGER,
worker_id TEXT, -- 'worker1' or 'worker2'
start_idx INTEGER, -- Start index in parameter grid
end_idx INTEGER, -- End index (exclusive)
total_combos INTEGER, -- Total in this chunk
status TEXT DEFAULT 'pending', -- pending/running/completed/failed
assigned_at TIMESTAMP,
completed_at TIMESTAMP,
result_file TEXT, -- Path to CSV result file
FOREIGN KEY (phase_id) REFERENCES phases(id)
);
CREATE INDEX idx_chunks_status ON chunks(status);
phases table
CREATE TABLE phases (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT NOT NULL, -- 'v9_optimization', 'rsi_divergence', etc.
description TEXT,
total_combinations INTEGER, -- Total parameter combinations
started_at TIMESTAMP,
completed_at TIMESTAMP
);
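The coordinator's import step maps each worker CSV row onto this schema. The following is a rough sketch of that mapping, also handy for re-importing a result CSV by hand; it assumes the CSV columns mirror the table columns (params_json, pnl_per_1k, trades, win_rate, profit_factor, max_drawdown), so check the actual CSV header first.

```python
# Hedged sketch: import a worker result CSV into the strategies table.
import sqlite3
import pandas as pd

def import_csv(csv_path: str, db_path: str = "cluster/exploration.db",
               phase_id: int = 1) -> int:
    df = pd.read_csv(csv_path)
    rows = [
        (phase_id, r.params_json, r.pnl_per_1k, r.trades,
         r.win_rate, r.profit_factor, r.max_drawdown)
        for r in df.itertuples(index=False)
    ]
    with sqlite3.connect(db_path) as conn:
        conn.executemany(
            """INSERT INTO strategies
               (phase_id, params_json, pnl_per_1k, trades,
                win_rate, profit_factor, max_drawdown)
               VALUES (?, ?, ?, ?, ?, ?, ?)""",
            rows,
        )
    return len(rows)

# Example:
# print(import_csv("cluster/distributed_results/worker1_chunk_0.csv"))
```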
🔧 Troubleshooting
SSH Connection Issues
Symptom: "Connection refused" or timeout errors
Solutions:
# Test Worker 1 connectivity
ssh root@10.10.254.106 'echo "Worker 1 OK"'
# Test Worker 2 (2-hop) connectivity
ssh root@10.10.254.106 'ssh root@10.20.254.100 "echo Worker 2 OK"'
# Check SSH keys
ssh-add -l
# Verify authorized_keys on workers
ssh root@10.10.254.106 'cat ~/.ssh/authorized_keys'
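The critical fix noted at the top of this file (60s timeout, StrictHostKeyChecking=no, ConnectTimeout=10) is essentially about how these hops are wrapped. Below is a minimal sketch of the 2-hop pattern; the helper name and defaults are illustrative, not the actual ssh_command() in distributed_coordinator.py.

```python
# Illustrative 2-hop SSH wrapper; names and defaults here are assumptions,
# not the real ssh_command() helper in cluster/distributed_coordinator.py.
import subprocess

SSH_OPTS = ["-o", "StrictHostKeyChecking=no", "-o", "ConnectTimeout=10"]

def run_remote(command: str, host: str, jump_host: str | None = None,
               timeout: int = 60) -> str:
    """Run a command on a worker, optionally via a jump host (2-hop SSH)."""
    if jump_host:
        # Wrap the inner ssh call so it executes from the jump host.
        inner = f"ssh {' '.join(SSH_OPTS)} root@{host} {command!r}"
        args = ["ssh", *SSH_OPTS, f"root@{jump_host}", inner]
    else:
        args = ["ssh", *SSH_OPTS, f"root@{host}", command]
    result = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    result.check_returncode()
    return result.stdout

# Example: check worker2 (10.20.254.100) through worker1 (10.10.254.106)
# print(run_remote("echo Worker 2 OK", "10.20.254.100", jump_host="10.10.254.106"))
```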
Path/Import Errors on Workers
Symptom: "ModuleNotFoundError" or "FileNotFoundError"
Solutions:
# Verify backtester exists on Worker 1
ssh root@10.10.254.106 'ls -lah /home/comprehensive_sweep/backtester/'
# Check Python environment
ssh root@10.10.254.106 'cd /home/comprehensive_sweep && source .venv/bin/activate && python --version'
# Verify data file
ssh root@10.10.254.106 'ls -lh /home/comprehensive_sweep/backtester/data/solusdt_5m.csv'
# Check distributed_worker.py deployment
ssh root@10.10.254.106 'ls -lh /home/comprehensive_sweep/backtester/scripts/distributed_worker.py'
Worker Processes Stuck/Hung
Symptom: exploration_status.py shows "running" but no progress
Solutions:
# Check worker processes
ssh root@10.10.254.106 'ps aux | grep distributed_worker'
# Check worker CPU usage (should be near 100% on 32 cores)
ssh root@10.10.254.106 'top -bn1 | head -20'
# Kill hung worker (coordinator will reassign chunk)
ssh root@10.10.254.106 'pkill -f distributed_worker.py'
# Check worker logs
ssh root@10.10.254.106 'tail -50 /home/comprehensive_sweep/backtester/scripts/worker_*.log'
Database Locked/Corrupt
Symptom: "database is locked" errors
Solutions:
# Check for stale locks
cd /home/icke/traderv4/cluster
fuser exploration.db
# Backup and rebuild
cp exploration.db exploration.db.backup
sqlite3 exploration.db "VACUUM;"
# Verify integrity
sqlite3 exploration.db "PRAGMA integrity_check;"
Results Not Importing
Symptom: CSVs in distributed_results/ but database empty
Solutions:
# Check CSV format
head -20 cluster/distributed_results/worker1_chunk_0.csv
# Manual import test
python3 -c "
import sqlite3
import pandas as pd
df = pd.read_csv('cluster/distributed_results/worker1_chunk_0.csv')
print(f'Loaded {len(df)} results')
print(df.columns.tolist())
print(df.head())
"
# Check coordinator logs for import errors
grep -i "error\|exception" sweep.log | tail -20
⚡ Performance Tuning
Chunk Size Trade-offs
Small chunks (1,000-5,000):
- ✅ Better load balancing
- ✅ Faster feedback loop
- ❌ More SSH/SCP overhead
- ❌ More database writes
Large chunks (10,000-20,000):
- ✅ Less overhead
- ✅ Fewer database transactions
- ❌ Less granular progress tracking
- ❌ Wasted work if chunk fails
Recommended: 10,000 combos per chunk (a good balance; the ~500k-combination v9 sweep splits into roughly 50 chunks)
Worker Concurrency
Current: Uses mp.cpu_count() (32 workers per EPYC)
To reduce CPU load:
# In distributed_worker.py line ~280
# Change from:
workers = mp.cpu_count()
# To:
workers = int(mp.cpu_count() * 0.7) # 70% utilization (22 workers)
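For context, the worker's sweep is a standard multiprocessing pool map over the chunk's parameter combinations. The sketch below shows that pattern under stated assumptions; run_backtest is a stand-in, not the actual function in distributed_worker.py.

```python
# Minimal sketch of the worker's multiprocessing sweep; run_backtest is a
# stand-in for the real simulator call in distributed_worker.py.
import multiprocessing as mp

def run_backtest(params: dict) -> dict:
    # Real version: build MoneyLineInputs from params and run the simulator.
    return {"params": params, "pnl_per_1k": 0.0}

def sweep(param_list: list[dict], utilization: float = 1.0) -> list[dict]:
    workers = max(1, int(mp.cpu_count() * utilization))
    with mp.Pool(processes=workers) as pool:
        return pool.map(run_backtest, param_list)

if __name__ == "__main__":
    demo = [{"flip_threshold": t} for t in (0.4, 0.5, 0.6, 0.7)]
    print(sweep(demo, utilization=0.7)[:2])
```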
Database Optimization
For large result sets (>100k strategies):
# Add indexes if queries slow
sqlite3 cluster/exploration.db <<EOF
CREATE INDEX IF NOT EXISTS idx_strategies_phase ON strategies(phase_id);
CREATE INDEX IF NOT EXISTS idx_strategies_wr ON strategies(win_rate);
CREATE INDEX IF NOT EXISTS idx_strategies_pf ON strategies(profit_factor);
ANALYZE;
EOF
✅ Best Practices
- Always test with small chunk first (100-1000 combos) before full sweep
- Monitor regularly with exploration_status.py --watch during runs
- Backup the database before major changes: cp exploration.db exploration.db.backup
- Review top strategies after each phase completion
- Archive old results if disk space runs low (CSV files can be deleted after import)
- Validate the quality filters - adjust if they prove too strict or too lenient
- Check worker logs if progress stalls: ssh root@10.10.254.106 'tail -f /home/comprehensive_sweep/backtester/scripts/worker_*.log'
🔗 Integration with Production Bot
After finding top strategy:
1. Extract parameters from the database:
sqlite3 cluster/exploration.db <<EOF
SELECT params_json FROM strategies
WHERE id = (SELECT id FROM strategies ORDER BY pnl_per_1k DESC LIMIT 1);
EOF
2. Update the TradingView indicator (workflows/trading/moneyline_v9_ma_gap.pinescript):
   - Set flip_threshold, ma_gap, momentum_adx, etc. to the optimal values
   - Test in replay mode with historical data
3. Update the bot configuration (.env file):
   - Adjust MIN_SIGNAL_QUALITY_SCORE if needed
   - Update position sizing if the strategy has a different risk profile
4. Forward test (50-100 trades) before increasing capital:
   - Use SOLANA_POSITION_SIZE=10 (10% of capital)
   - Monitor win rate, P&L, and drawdown
   - If metrics match the backtest within ±10%, increase to full size
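To pull the winning configuration into Python for mapping onto the indicator and bot settings, something like the following works. It is a sketch against the schema above; the keys inside params_json depend on what the worker wrote.

```python
# Fetch the single best configuration and print its parameters.
import json
import sqlite3

with sqlite3.connect("cluster/exploration.db") as conn:
    row = conn.execute(
        "SELECT params_json, pnl_per_1k FROM strategies "
        "ORDER BY pnl_per_1k DESC LIMIT 1"
    ).fetchone()

params = json.loads(row[0])
print(f"Best P&L per $1k: ${row[1]:.2f}")
for key, value in sorted(params.items()):
    print(f"  {key} = {value}")
```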
📚 Support & Documentation
- Main project docs: /home/icke/traderv4/.github/copilot-instructions.md (5,181 lines)
- Trading goals: TRADING_GOALS.md (8-phase $106→$100k+ roadmap)
- v9 indicator: INDICATOR_V9_MA_GAP_ROADMAP.md
- Optimization roadmaps: SIGNAL_QUALITY_OPTIMIZATION_ROADMAP.md, POSITION_SCALING_ROADMAP.md
- Adaptive leverage: ADAPTIVE_LEVERAGE_SYSTEM.md
🚀 Future Enhancements
Potential additions:
- Genetic Algorithm Optimization - Breed top performers, test offspring
- Bayesian Optimization - Guide search toward promising parameter regions
- Web Dashboard - Real-time browser-based monitoring (Flask/FastAPI)
- Telegram Alerts - Notify when exceptional strategies found (P&L > threshold)
- Walk-Forward Analysis - Test strategies on rolling time windows
- Multi-Asset Support - Extend to ETH, BTC, other Drift markets
- Auto-Deployment - Push top strategies to production after validation
Questions? Check main project documentation or ask in development chat.
Ready to start? Run test sweep first: python3 cluster/distributed_coordinator.py --chunk-size 100