mindesbunister b77282b560 feat: Add EPYC cluster distributed sweep with web UI
New Features:
- Distributed coordinator orchestrates 2x AMD EPYC 16-core servers
- 64 total cores processing 12M parameter combinations (70% CPU limit)
- Worker1 (pve-nu-monitor01): Direct SSH access at 10.10.254.106
- Worker2 (bd-host01): 2-hop SSH through worker1 (10.20.254.100)
- Web UI at /cluster shows real-time status and AI recommendations
- API endpoint /api/cluster/status serves cluster metrics
- Auto-refresh every 30s with top strategies and actionable insights

Files Added:
- cluster/distributed_coordinator.py (510 lines) - Main orchestrator
- cluster/distributed_worker.py (271 lines) - Worker1 script
- cluster/distributed_worker_bd_clean.py (275 lines) - Worker2 script
- cluster/monitor_bd_host01.sh - Monitoring script
- app/api/cluster/status/route.ts (274 lines) - API endpoint
- app/cluster/page.tsx (258 lines) - Web UI
- cluster/CLUSTER_SETUP.md - Complete setup and access documentation

Technical Details:
- SQLite database tracks chunk assignments
- 10,000 combinations per chunk (1,195 total chunks)
- Multiprocessing.Pool with 70% CPU limit (22 cores per EPYC)
- SSH/SCP for deployment and result collection
- Handles 2-hop SSH for bd-host01 access
- Results in CSV format with top strategies ranked

Access Documentation:
- Worker1: ssh root@10.10.254.106
- Worker2: ssh root@10.10.254.106 "ssh root@10.20.254.100"
- Web UI: http://localhost:3001/cluster
- See CLUSTER_SETUP.md for complete guide

Status: Deployed and operational
2025-11-30 13:02:18 +01:00

Continuous Optimization Cluster

24/7 automated strategy optimization across 2 EPYC servers (64 cores total).

🏗️ Architecture

Master (your local machine)
  ↓ Job Queue (file-based)
  ↓
Worker 1: pve-nu-monitor01 (22 workers @ 70% CPU)
Worker 2: srv-bd-host01 (22 workers @ 70% CPU)
  ↓
Results Database (SQLite)
  ↓
Top Strategies (auto-deployment ready)
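The file-based job queue in the diagram can be sketched as follows. The directory name matches the `cluster/queue/*.json` convention used in the monitoring commands below; the job-file layout and the rename-free claim logic are assumptions, not the actual master/worker protocol.

```python
import json
import uuid
from pathlib import Path
from typing import Optional

QUEUE_DIR = Path("cluster/queue")  # assumed location, matches the watch command below

def enqueue_job(indicator_type: str, params: dict, priority: int = 2) -> Path:
    """Master side: write one job as a JSON file into the queue directory."""
    QUEUE_DIR.mkdir(parents=True, exist_ok=True)
    job = {
        "id": uuid.uuid4().hex,
        "indicator_type": indicator_type,
        "params": params,
        "priority": priority,  # 1 = high, 2 = medium, 3 = low
        "status": "queued",
    }
    # Priority prefix makes a plain filename sort yield priority order.
    path = QUEUE_DIR / f"{priority}_{job['id']}.json"
    path.write_text(json.dumps(job, indent=2))
    return path

def claim_next_job() -> Optional[dict]:
    """Worker side: take the lowest-priority-number job off the queue."""
    for path in sorted(QUEUE_DIR.glob("*.json")):  # prefix sorts priority 1 first
        job = json.loads(path.read_text())
        path.unlink()  # naive claim; a real worker would rename atomically first
        return job
    return None
```

The priority-in-filename trick keeps the queue inspectable with plain `ls`, which fits the `watch`-based monitoring shown later.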

🚀 Quick Start

1. Setup Cluster

cd /home/icke/traderv4/cluster
chmod +x setup_cluster.sh
./setup_cluster.sh

This will:

  • Create /root/optimization-cluster on both EPYC servers
  • Install Python dependencies (pandas, numpy)
  • Copy backtester code and OHLCV data
  • Install worker scripts

2. Start Master Controller

python3 master.py

Master will:

  • Generate initial job queue (v9 parameter sweep: 27 combinations)
  • Monitor both workers every 60 seconds
  • Assign jobs to idle workers
  • Collect and rank results
  • Display top performers
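The initial 27-job queue is a 3x3x3 grid over the three v9 parameters named in the roadmap (flip_threshold, ma_gap, momentum_adx). A sketch of how master.py might generate it; the concrete value grids below are illustrative assumptions, not the values the sweep actually uses.

```python
from itertools import product

# Hypothetical value grids -- 3 values per parameter gives the 27 combinations.
FLIP_THRESHOLDS = [0.5, 0.6, 0.7]
MA_GAPS = [0.25, 0.35, 0.45]
MOMENTUM_ADX = [20, 23, 26]

jobs = []
for flip, ma_gap, adx in product(FLIP_THRESHOLDS, MA_GAPS, MOMENTUM_ADX):
    jobs.append({
        # Name follows the scheme shown in the schema section, e.g. v9_flip0.6_ma0.35_adx23
        "name": f"v9_flip{flip}_ma{ma_gap}_adx{adx}",
        "indicator_type": "v9_moneyline",
        "params": {"flip_threshold": flip, "ma_gap": ma_gap, "momentum_adx": adx},
        "priority": 1,  # v9 refinements are priority 1 (known good strategies)
    })

print(len(jobs))  # 27
```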

3. Monitor Progress

Terminal 1 - Master logs:

cd /home/icke/traderv4/cluster
python3 master.py

Terminal 2 - Job queue:

watch -n 5 'ls -1 cluster/queue/*.json 2>/dev/null | wc -l'

Terminal 3 - Results:

watch -n 10 'sqlite3 cluster/strategies.db "SELECT name, pnl_per_1k, trade_count, win_rate FROM strategies ORDER BY pnl_per_1k DESC LIMIT 5"'

📊 Database Schema

strategies table

  • name: Strategy identifier (e.g., "v9_flip0.6_ma0.35_adx23")
  • indicator_type: Indicator family (v9_moneyline, volume_profile, etc.)
  • params: JSON parameter configuration
  • pnl_per_1k: Performance metric ($ PnL per $1k capital)
  • trade_count: Total trades in backtest
  • win_rate: Percentage of winning trades
  • profit_factor: Gross profit / gross loss
  • max_drawdown: Largest peak-to-trough decline
  • status: pending/running/completed/deployed

jobs table

  • job_file: Filename in queue directory
  • priority: 1 (high), 2 (medium), 3 (low)
  • worker_id: Which worker is processing
  • status: queued/running/completed/failed
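A minimal SQLite schema matching the columns above might look like this. Exact column types are assumptions; `started_at` is included because the troubleshooting section's stale-job query relies on it.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS strategies (
    name           TEXT PRIMARY KEY,       -- e.g. 'v9_flip0.6_ma0.35_adx23'
    indicator_type TEXT NOT NULL,          -- e.g. 'v9_moneyline', 'volume_profile'
    params         TEXT NOT NULL,          -- JSON parameter configuration
    pnl_per_1k     REAL,                   -- $ PnL per $1k capital
    trade_count    INTEGER,
    win_rate       REAL,                   -- percent
    profit_factor  REAL,                   -- gross profit / gross loss
    max_drawdown   REAL,                   -- percent, peak-to-trough
    status         TEXT DEFAULT 'pending'  -- pending/running/completed/deployed
);
CREATE TABLE IF NOT EXISTS jobs (
    job_file   TEXT PRIMARY KEY,           -- filename in the queue directory
    priority   INTEGER DEFAULT 2,          -- 1 high, 2 medium, 3 low
    worker_id  TEXT,                       -- which worker is processing
    status     TEXT DEFAULT 'queued',      -- queued/running/completed/failed
    started_at TEXT                        -- lets stale 'running' jobs be reset
);
"""

conn = sqlite3.connect(":memory:")  # use cluster/strategies.db in practice
conn.executescript(SCHEMA)
```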

🎯 Job Priorities

Priority 1 (HIGH): Known good strategies

  • v9 refinements (flip_threshold, ma_gap, momentum_adx)
  • Proven concepts with minor tweaks

Priority 2 (MEDIUM): New concepts

  • Volume profile integration
  • Order flow analysis
  • Market structure detection

Priority 3 (LOW): Experimental

  • ML-based indicators
  • Neural network predictions
  • Complex multi-timeframe logic
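When a worker goes idle, the master hands out queued work lowest priority number first. A self-contained sketch of that dispatch against the jobs table (column names from the schema section; the ordering and claim logic are my assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_file TEXT, priority INTEGER, worker_id TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, NULL, 'queued')",
    [("ml_signal_quality.json", 3),   # LOW: experimental
     ("v9_flip0.7.json", 1),          # HIGH: known good strategy
     ("volume_profile_w50.json", 2)], # MEDIUM: new concept
)

def next_job(conn, worker_id):
    """Claim the highest-priority queued job (priority 1 beats 2 beats 3)."""
    row = conn.execute(
        "SELECT job_file FROM jobs WHERE status = 'queued' ORDER BY priority LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE jobs SET status = 'running', worker_id = ? WHERE job_file = ?",
        (worker_id, row[0]),
    )
    return row[0]

print(next_job(conn, "pve-nu-monitor01"))  # v9_flip0.7.json
```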

📈 Adding New Strategies

Example: Test volume profile indicator

from cluster.master import ClusterMaster

master = ClusterMaster()

# Add volume profile jobs
for profile_window in [20, 50, 100]:
    for entry_threshold in [0.6, 0.7, 0.8]:
        params = {
            'profile_window': profile_window,
            'entry_threshold': entry_threshold,
            'stop_loss_atr': 3.0
        }
        
        master.queue.create_job(
            'volume_profile',
            params,
            priority=2  # MEDIUM priority
        )

🔒 Safety Features

  1. Resource Limits: Each worker respects 70% CPU cap
  2. Memory Management: 4GB per worker, prevents OOM
  3. Disk Monitoring: Auto-cleanup old results when space low
  4. Error Recovery: Failed jobs automatically requeued
  5. Manual Approval: Top strategies wait for user deployment

🏆 Auto-Deployment Gates

Strategy must pass ALL checks before auto-deployment:

  1. Trade Count: Minimum 700 trades (statistical significance)
  2. Win Rate: 63-68% (realistic range)
  3. Profit Factor: ≥1.5 (solid edge)
  4. Max Drawdown: <20% (manageable risk)
  5. Sharpe Ratio: ≥1.0 (risk-adjusted returns)
  6. Consistency: Top 3 in rolling 7-day window
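The six gates translate directly into a predicate. Thresholds below come straight from the list; the rolling-window consistency check is represented as a pre-computed boolean, since it needs history this sketch doesn't model.

```python
def passes_deployment_gates(s: dict) -> bool:
    """Return True only if a strategy clears ALL auto-deployment gates."""
    return (
        s["trade_count"] >= 700            # 1. statistical significance
        and 63.0 <= s["win_rate"] <= 68.0  # 2. realistic win-rate range
        and s["profit_factor"] >= 1.5      # 3. solid edge
        and s["max_drawdown"] < 20.0       # 4. manageable risk
        and s["sharpe"] >= 1.0             # 5. risk-adjusted returns
        and s["top3_rolling_7d"]           # 6. top 3 in rolling 7-day window
    )

# Hypothetical candidate that clears every gate:
candidate = {
    "trade_count": 842, "win_rate": 64.7, "profit_factor": 1.62,
    "max_drawdown": 14.3, "sharpe": 1.21, "top3_rolling_7d": True,
}
print(passes_deployment_gates(candidate))  # True
```

Note the win-rate gate is two-sided: a backtest reporting 75% wins fails it, on the premise that implausibly high win rates signal overfitting rather than edge.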

📋 Operational Commands

View Queue Status

ls -lh cluster/queue/

Check Worker Health

ssh root@10.10.254.106 'pgrep -f backtester'
ssh root@10.10.254.106 'ssh root@10.20.254.100 "pgrep -f backtester"'

View Top 10 Strategies

sqlite3 cluster/strategies.db <<'EOF'
SELECT 
    name, 
    printf('$%.2f', pnl_per_1k) as pnl,
    trade_count as trades,
    printf('%.1f%%', win_rate) as wr,
    printf('%.2f', profit_factor) as pf
FROM strategies 
WHERE status = 'completed'
ORDER BY pnl_per_1k DESC 
LIMIT 10;
EOF

Force Job Priority

# Make specific job high priority
sqlite3 cluster/strategies.db "UPDATE jobs SET priority = 1 WHERE job_file LIKE '%v9_flip0.7%'"

Restart Master (safe)

# Ctrl+C in master.py terminal
# Jobs remain in queue, workers continue
# Restart: python3 master.py

🔧 Troubleshooting

Workers not picking up jobs

# Check worker logs
ssh root@10.10.254.106 'tail -f /root/optimization-cluster/logs/worker.log'

Jobs stuck in "running"

# Reset stale jobs (>30 min)
sqlite3 cluster/strategies.db <<EOF
UPDATE jobs 
SET status = 'queued', worker_id = NULL 
WHERE status = 'running' 
  AND started_at < datetime('now', '-30 minutes');
EOF

Disk space low

# Archive old results
cd cluster/results
tar -czf archive_$(date +%Y%m%d).tar.gz archive/
mv archive_$(date +%Y%m%d).tar.gz ~/backups/
rm -rf archive/*

📈 Expected Performance

Current baseline (v9): $192 P&L per $1k capital

Cluster capacity:

  • 64 cores total; 44 usable at the 70% CPU cap (22 per server)
  • ~22 parallel backtests per server
  • ~1.6s per backtest (v9 on EPYC)
  • ~49,000 backtests per day

Optimization potential:

  • Test 100,000+ parameter combinations per week
  • Discover strategies beyond manual optimization
  • Continuous adaptation to market regime changes

🎯 Roadmap

Phase 1 (Week 1): v9 refinement

  • Exhaustive parameter sweep
  • Find optimal flip_threshold, ma_gap, momentum_adx
  • Target: >$200/1k P&L

Phase 2 (Week 2-3): Volume integration

  • Volume profile entries
  • Order flow imbalance detection
  • Target: >$250/1k P&L

Phase 3 (Week 4+): Advanced concepts

  • Multi-timeframe confirmation
  • Market structure analysis
  • ML-based signal quality scoring
  • Target: >$300/1k P&L

📞 Contact

Questions? Check copilot-instructions.md or ask in main project chat.