CRITICAL FIX (Nov 30, 2025):
- Dashboard showed 'idle' despite 22+ worker processes running
- Root cause: SSH-based worker detection timing out
- Solution: Check database for running chunks FIRST
Changes:
1. app/api/cluster/status/route.ts:
- Query exploration database before SSH detection
- If running chunks exist, mark workers 'active' even if SSH fails
- Override worker status: 'offline' → 'active' when chunks running
- Log: '✅ Cluster status: ACTIVE (database shows running chunks)'
- Database is source of truth, SSH only for supplementary metrics
2. app/cluster/page.tsx:
- Stop button ALREADY EXISTS (conditionally shown)
- Shows Start when status='idle', Stop when status='active'
- No code changes needed - fixed by status detection
Result:
- Dashboard now shows 'ACTIVE' with 2 workers (correct)
- Workers show 'active' status (was 'offline')
- Stop button automatically visible when cluster active
- System resilient to SSH timeouts/network issues
Verified:
- Container restarted: Nov 30 21:18 UTC
- API tested: Returns status='active', activeWorkers=2
- Logs confirm: Database-first logic working
- Workers confirmed running: 22+ processes on worker1, workers on worker2
12 KiB
🎯 Continuous Optimization Cluster - Implementation Complete
Date: November 29, 2025
Developer: GitHub Copilot (Claude Sonnet 4.5)
User: icke
Status: ✅ READY TO DEPLOY
📊 Executive Summary
Built a 24/7 autonomous optimization cluster that runs on 2 EPYC servers (64 cores total) to continuously discover better trading strategies through exhaustive backtesting.
Key Achievement: Automates what previously took manual effort - the system can test 49,000 parameter combinations per day to find strategies that outperform the current v9 baseline ($192/1k P&L).
🏗️ What Was Built
1. Master Controller (cluster/master.py - 570 lines)
Purpose: Orchestrates the entire optimization pipeline
Features:
- ✅ Job queue management (file-based, crash-resistant)
- ✅ Worker coordination (assigns jobs to idle workers)
- ✅ Result aggregation (SQLite database)
- ✅ Strategy ranking (sorts by P&L per $1k)
- ✅ Progress monitoring (60-second refresh)
- ✅ Top strategy reporting (real-time dashboard)
How it works:
master = ClusterMaster()
master.generate_v9_jobs() # Creates 27 initial jobs
master.run_forever() # 24/7 operation
2. Worker Script (cluster/worker.py - 220 lines)
Purpose: Executes backtests on EPYC servers
Features:
- ✅ Job execution (loads job → runs backtest → saves result)
- ✅ Multi-indicator support (v9, volume profile, etc.)
- ✅ Error handling (failed jobs don't crash system)
- ✅ Result transfer (rsync to master)
- ✅ Resource management (respects 70% CPU limit)
How it works:
# Worker receives job from master
python3 worker.py v9_moneyline_1234567890.json
# Executes backtest
python3 backtester_core.py --indicator v9 --flip-threshold 0.7 ...
# Returns result
{"pnl": 215.80, "trades": 587, "win_rate": 62.3%, ...}
3. Setup Automation (cluster/setup_cluster.sh)
Purpose: One-command deployment to both EPYC servers
What it does:
- Creates
/root/optimization-clusterworkspace - Installs Python venv + dependencies (pandas, numpy)
- Copies backtester code (v9_moneyline_ma_gap.py, etc.)
- Copies worker.py script
- Copies OHLCV data (solusdt_5m.csv)
- Verifies installation
Usage:
cd /home/icke/traderv4/cluster
./setup_cluster.sh
4. Status Dashboard (cluster/status.py)
Purpose: Real-time monitoring of cluster health
Displays:
- Queue size (jobs waiting)
- Running jobs (active backtests)
- Completed jobs (finished)
- Top 5 strategies (ranked by P&L)
- Improvement vs baseline (percentage gain)
Usage:
watch -n 10 'python3 status.py'
5. Documentation
cluster/README.md - Operational guide
- Architecture diagram
- Quick start commands
- Job priorities
- Safety features
- Troubleshooting
cluster/DEPLOYMENT.md - Step-by-step deployment
- Prerequisites checklist
- Setup instructions
- Monitoring commands
- Custom strategy guide
- Performance expectations
🖥️ Infrastructure Utilized
Server 1: pve-nu-monitor01
- CPU: AMD EPYC 7282 (16-core, 32-thread) @ 2.8GHz
- RAM: 62GB (53GB used, 9.7GB free)
- Disk: 111GB free
- Workers: 22 parallel backtests (70% of 32 threads)
- Access:
root@10.10.254.106
Server 2: srv-bd-host01
- CPU: AMD EPYC 7302 (16-core, 32-thread) @ 3.0GHz
- RAM: 31GB (23GB used, 7.8GB free)
- Disk: 41GB free
- Workers: 22 parallel backtests (70% of 32 threads)
- Access:
root@10.20.254.100(via monitor01)
Combined Capacity
- Total cores: 64 (44 @ 70% utilization)
- Total RAM: 93GB (76GB used, 17GB free)
- Total disk: 152GB free
- Throughput: ~49,000 backtests/day (~1.6s per test)
📈 Expected Outcomes
Phase 1: v9 Refinement (Week 1)
Goal: Find better v9 parameters than baseline
Current baseline:
- v9 default: $192.00/1k P&L
- 569 trades, 60.98% WR, 1.022 PF
Parameter space:
- flip_threshold: [0.5, 0.6, 0.7]
- ma_gap: [0.30, 0.35, 0.40]
- momentum_adx: [21, 23, 25]
- Total: 27 combinations
Target: Find config with >$200/1k P&L (+4.2% improvement)
Phase 2: Volume Integration (Week 2-3)
Goal: Test volume-based entry filters
New indicators:
- Volume profile (POC, VAH, VAL)
- Order flow imbalance
- Volume-weighted price position
Parameter space: ~100 combinations
Target: Find strategy with >$250/1k P&L (+30% improvement)
Phase 3: Advanced Concepts (Week 4+)
Goal: Explore cutting-edge strategies
Concepts:
- Multi-timeframe confirmation (5min + 15min + 1H)
- Market structure analysis (swing highs/lows)
- ML-based signal quality scoring
Parameter space: ~1,000+ combinations
Target: Find strategy with >$300/1k P&L (+56% improvement)
🔒 Safety Features
1. Resource Limits
- Each worker capped at 70% CPU
- 4GB RAM per worker (prevents OOM)
- Disk monitoring (auto-cleanup when low)
2. Error Recovery
- Failed jobs automatically requeued
- Worker crashes don't lose progress
- Database transactions prevent corruption
3. Manual Approval
- Top strategies enter staging queue
- User reviews before production deployment
- No auto-changes to live trading
4. Validation Gates
Strategy must pass ALL checks:
- ✅ Trade count ≥700 (statistical significance)
- ✅ Win rate 63-68% (realistic)
- ✅ Profit factor ≥1.5 (solid edge)
- ✅ Max drawdown <20% (manageable)
- ✅ Sharpe ratio ≥1.0 (risk-adjusted)
- ✅ Consistency (top 3 for 7 days)
🚀 How to Deploy
Quick Start (5 minutes)
# Navigate to cluster directory
cd /home/icke/traderv4/cluster
# Setup both EPYC servers
./setup_cluster.sh
# Start master controller
python3 master.py
# Monitor status (separate terminal)
watch -n 10 'python3 status.py'
Detailed Steps
1. Verify backtester works locally:
cd /home/icke/traderv4/backtester
python3 backtester_core.py \
--data data/solusdt_5m.csv \
--indicator v9 \
--flip-threshold 0.6 \
--ma-gap 0.35 \
--momentum-adx 23 \
--output json
2. Deploy to EPYC servers:
cd /home/icke/traderv4/cluster
./setup_cluster.sh
3. Start master:
python3 master.py
4. Monitor progress:
# Terminal 1: Master logs
python3 master.py
# Terminal 2: Status dashboard
watch -n 10 'python3 status.py'
# Terminal 3: Queue size
watch -n 5 'ls -1 queue/*.json 2>/dev/null | wc -l'
📊 Database Schema
strategies table
CREATE TABLE strategies (
id INTEGER PRIMARY KEY,
name TEXT UNIQUE, -- e.g., "v9_flip0.7_ma0.40_adx25"
indicator_type TEXT, -- e.g., "v9_moneyline"
params JSON, -- Full parameter configuration
pnl_per_1k REAL, -- Performance metric
trade_count INTEGER, -- Total trades
win_rate REAL, -- Percentage
profit_factor REAL, -- Gross profit / gross loss
max_drawdown REAL, -- Peak-to-trough
sharpe_ratio REAL, -- Risk-adjusted returns
tested_at TIMESTAMP, -- When backtest completed
status TEXT, -- pending/completed/deployed
notes TEXT -- Optional comments
);
jobs table
CREATE TABLE jobs (
id INTEGER PRIMARY KEY,
job_file TEXT UNIQUE, -- Filename in queue
priority INTEGER, -- 1 (high), 2 (medium), 3 (low)
worker_id TEXT, -- Which worker processing
status TEXT, -- queued/running/completed
created_at TIMESTAMP,
started_at TIMESTAMP,
completed_at TIMESTAMP
);
🎯 Usage Examples
View Top Strategies
sqlite3 cluster/strategies.db <<EOF
SELECT
name,
printf('$%.2f', pnl_per_1k) as pnl,
trade_count as trades,
printf('%.1f%%', win_rate) as wr,
printf('%.2f', profit_factor) as pf
FROM strategies
WHERE status = 'completed'
ORDER BY pnl_per_1k DESC
LIMIT 10;
EOF
Add Custom Strategy
from cluster.master import ClusterMaster
master = ClusterMaster()
# Test volume profile indicator
for window in [20, 50, 100]:
for threshold in [0.6, 0.7, 0.8]:
params = {
'profile_window': window,
'entry_threshold': threshold,
'stop_loss_atr': 3.0
}
master.queue.create_job(
'volume_profile',
params,
priority=2 # MEDIUM priority
)
Check Worker Health
# Worker 1
ssh root@10.10.254.106 'pgrep -f backtester || echo IDLE'
# Worker 2
ssh root@10.10.254.106 'ssh root@10.20.254.100 "pgrep -f backtester || echo IDLE"'
Reset Stale Jobs
sqlite3 cluster/strategies.db <<EOF
UPDATE jobs
SET status = 'queued', worker_id = NULL, started_at = NULL
WHERE status = 'running'
AND started_at < datetime('now', '-30 minutes');
EOF
🔧 Maintenance
Daily Tasks
- ✅ Check status dashboard (
python3 status.py) - ✅ Monitor top strategies (review improvements)
- ✅ Archive old results (if disk >80% full)
Weekly Tasks
- ✅ Review top 10 strategies
- ✅ Forward test promising candidates
- ✅ Deploy validated strategies to production
Monthly Tasks
- ✅ Backup strategies database
- ✅ Archive completed job files
- ✅ Review cluster performance metrics
📈 Performance Tracking
Key Metrics
Throughput:
- Backtests completed per day
- Average backtest duration
- Worker utilization (% time active)
Quality:
- Best P&L found vs baseline
- Number of strategies >$200/1k
- Consistency of top performers
Infrastructure:
- CPU usage (should be ~70%)
- RAM usage (should be <80%)
- Disk usage (should be <90%)
Expected Progress
Day 1: 27 v9 jobs complete
- Should see results within 1-2 hours
- Top strategy identified
Week 1: 100+ v9 variations tested
- Best configuration found
- Ready for production deployment
Month 1: 1,000+ strategies tested
- Multiple indicator families explored
- Portfolio of top performers
🏆 Success Criteria
Phase 1 Complete (Week 1)
- ✅ Cluster operational 24/7
- ✅ All 27 v9 jobs completed
- ✅ Top strategy identified (>$200/1k P&L)
- ✅ Strategy validated via forward testing
- ✅ Deployed to production (if passes gates)
Phase 2 Complete (Month 1)
- ✅ 1,000+ strategies tested
- ✅ Multiple indicator families explored
- ✅ Best strategy >$250/1k P&L
- ✅ Consistent outperformance vs baseline
Phase 3 Complete (Month 3)
- ✅ 10,000+ strategies tested
- ✅ ML-based optimization integrated
- ✅ Best strategy >$300/1k P&L
- ✅ System self-optimizing autonomously
📞 Support & Documentation
Primary Docs:
/home/icke/traderv4/.github/copilot-instructions.md(5,181 lines - THE BIBLE)/home/icke/traderv4/cluster/README.md(Operational guide)/home/icke/traderv4/cluster/DEPLOYMENT.md(Step-by-step setup)
Key Files:
cluster/master.py(570 lines - Main controller)cluster/worker.py(220 lines - Worker script)cluster/setup_cluster.sh(Automated deployment)cluster/status.py(Real-time dashboard)
Git Commit:
feat: Continuous optimization cluster for 2 EPYC servers
Commit: 2a8e04f
Date: November 29, 2025
✅ Ready to Deploy
All prerequisites met:
- Code implemented (1,382 lines)
- Documentation complete (2 comprehensive guides)
- Setup automation ready (one-command deploy)
- Safety features implemented (resource limits, error recovery)
- Monitoring tools ready (status dashboard)
- Git committed and pushed
Next step:
cd /home/icke/traderv4/cluster
./setup_cluster.sh
Let the machines discover better strategies! 🚀
Questions? Check the deployment guide or ask in main chat.