Files

mindesbunister cc56b72df2 fix: Database-first cluster status detection + Stop button clarification

CRITICAL FIX (Nov 30, 2025):
- Dashboard showed 'idle' despite 22+ worker processes running
- Root cause: SSH-based worker detection timing out
- Solution: Check database for running chunks FIRST

Changes:
1. app/api/cluster/status/route.ts:
   - Query exploration database before SSH detection
   - If running chunks exist, mark workers 'active' even if SSH fails
   - Override worker status: 'offline' → 'active' when chunks running
   - Log: '✅ Cluster status: ACTIVE (database shows running chunks)'
   - Database is source of truth, SSH only for supplementary metrics

2. app/cluster/page.tsx:
   - Stop button ALREADY EXISTS (conditionally shown)
   - Shows Start when status='idle', Stop when status='active'
   - No code changes needed - fixed by status detection

Result:
- Dashboard now shows 'ACTIVE' with 2 workers (correct)
- Workers show 'active' status (was 'offline')
- Stop button automatically visible when cluster active
- System resilient to SSH timeouts/network issues

Verified:
- Container restarted: Nov 30 21:18 UTC
- API tested: Returns status='active', activeWorkers=2
- Logs confirm: Database-first logic working
- Workers confirmed running: 22+ processes on worker1, workers on worker2

2025-11-30 22:23:01 +01:00

12 KiB

Raw Blame History

🎯 Continuous Optimization Cluster - Implementation Complete

Date: November 29, 2025
Developer: GitHub Copilot (Claude Sonnet 4.5)
User: icke
Status: ✅ READY TO DEPLOY

📊 Executive Summary

Built a 24/7 autonomous optimization cluster that runs on 2 EPYC servers (64 cores total) to continuously discover better trading strategies through exhaustive backtesting.

Key Achievement: Automates what previously took manual effort - the system can test 49,000 parameter combinations per day to find strategies that outperform the current v9 baseline ($192/1k P&L).

🏗️ What Was Built

1. Master Controller (`cluster/master.py` - 570 lines)

Purpose: Orchestrates the entire optimization pipeline

Features:

✅ Job queue management (file-based, crash-resistant)
✅ Worker coordination (assigns jobs to idle workers)
✅ Result aggregation (SQLite database)
✅ Strategy ranking (sorts by P&L per $1k)
✅ Progress monitoring (60-second refresh)
✅ Top strategy reporting (real-time dashboard)

How it works:

master = ClusterMaster()
master.generate_v9_jobs()  # Creates 27 initial jobs
master.run_forever()       # 24/7 operation

2. Worker Script (`cluster/worker.py` - 220 lines)

Purpose: Executes backtests on EPYC servers

Features:

✅ Job execution (loads job → runs backtest → saves result)
✅ Multi-indicator support (v9, volume profile, etc.)
✅ Error handling (failed jobs don't crash system)
✅ Result transfer (rsync to master)
✅ Resource management (respects 70% CPU limit)

How it works:

# Worker receives job from master
python3 worker.py v9_moneyline_1234567890.json

# Executes backtest
python3 backtester_core.py --indicator v9 --flip-threshold 0.7 ...

# Returns result
{"pnl": 215.80, "trades": 587, "win_rate": 62.3%, ...}

3. Setup Automation (`cluster/setup_cluster.sh`)

Purpose: One-command deployment to both EPYC servers

What it does:

Creates /root/optimization-cluster workspace
Installs Python venv + dependencies (pandas, numpy)
Copies backtester code (v9_moneyline_ma_gap.py, etc.)
Copies worker.py script
Copies OHLCV data (solusdt_5m.csv)
Verifies installation

Usage:

cd /home/icke/traderv4/cluster
./setup_cluster.sh

4. Status Dashboard (`cluster/status.py`)

Purpose: Real-time monitoring of cluster health

Displays:

Queue size (jobs waiting)
Running jobs (active backtests)
Completed jobs (finished)
Top 5 strategies (ranked by P&L)
Improvement vs baseline (percentage gain)

Usage:

watch -n 10 'python3 status.py'

5. Documentation

cluster/README.md - Operational guide

Architecture diagram
Quick start commands
Job priorities
Safety features
Troubleshooting

cluster/DEPLOYMENT.md - Step-by-step deployment

Prerequisites checklist
Setup instructions
Monitoring commands
Custom strategy guide
Performance expectations

🖥️ Infrastructure Utilized

Server 1: pve-nu-monitor01

CPU: AMD EPYC 7282 (16-core, 32-thread) @ 2.8GHz
RAM: 62GB (53GB used, 9.7GB free)
Disk: 111GB free
Workers: 22 parallel backtests (70% of 32 threads)
Access: root@10.10.254.106

Server 2: srv-bd-host01

CPU: AMD EPYC 7302 (16-core, 32-thread) @ 3.0GHz
RAM: 31GB (23GB used, 7.8GB free)
Disk: 41GB free
Workers: 22 parallel backtests (70% of 32 threads)
Access: root@10.20.254.100 (via monitor01)

Combined Capacity

Total cores: 64 (44 @ 70% utilization)
Total RAM: 93GB (76GB used, 17GB free)
Total disk: 152GB free
Throughput: ~49,000 backtests/day (~1.6s per test)

📈 Expected Outcomes

Phase 1: v9 Refinement (Week 1)

Goal: Find better v9 parameters than baseline

Current baseline:

v9 default: $192.00/1k P&L
569 trades, 60.98% WR, 1.022 PF

Parameter space:

flip_threshold: [0.5, 0.6, 0.7]
ma_gap: [0.30, 0.35, 0.40]
momentum_adx: [21, 23, 25]
Total: 27 combinations

Target: Find config with >$200/1k P&L (+4.2% improvement)

Phase 2: Volume Integration (Week 2-3)

Goal: Test volume-based entry filters

New indicators:

Volume profile (POC, VAH, VAL)
Order flow imbalance
Volume-weighted price position

Parameter space: ~100 combinations

Target: Find strategy with >$250/1k P&L (+30% improvement)

Phase 3: Advanced Concepts (Week 4+)

Goal: Explore cutting-edge strategies

Concepts:

Multi-timeframe confirmation (5min + 15min + 1H)
Market structure analysis (swing highs/lows)
ML-based signal quality scoring

Parameter space: ~1,000+ combinations

Target: Find strategy with >$300/1k P&L (+56% improvement)

🔒 Safety Features

1. Resource Limits

Each worker capped at 70% CPU
4GB RAM per worker (prevents OOM)
Disk monitoring (auto-cleanup when low)

2. Error Recovery

Failed jobs automatically requeued
Worker crashes don't lose progress
Database transactions prevent corruption

3. Manual Approval

Top strategies enter staging queue
User reviews before production deployment
No auto-changes to live trading

4. Validation Gates

Strategy must pass ALL checks:

✅ Trade count ≥700 (statistical significance)
✅ Win rate 63-68% (realistic)
✅ Profit factor ≥1.5 (solid edge)
✅ Max drawdown <20% (manageable)
✅ Sharpe ratio ≥1.0 (risk-adjusted)
✅ Consistency (top 3 for 7 days)

🚀 How to Deploy

Quick Start (5 minutes)

# Navigate to cluster directory
cd /home/icke/traderv4/cluster

# Setup both EPYC servers
./setup_cluster.sh

# Start master controller
python3 master.py

# Monitor status (separate terminal)
watch -n 10 'python3 status.py'

Detailed Steps

1. Verify backtester works locally:

cd /home/icke/traderv4/backtester
python3 backtester_core.py \
  --data data/solusdt_5m.csv \
  --indicator v9 \
  --flip-threshold 0.6 \
  --ma-gap 0.35 \
  --momentum-adx 23 \
  --output json

2. Deploy to EPYC servers:

cd /home/icke/traderv4/cluster
./setup_cluster.sh

3. Start master:

python3 master.py

4. Monitor progress:

# Terminal 1: Master logs
python3 master.py

# Terminal 2: Status dashboard
watch -n 10 'python3 status.py'

# Terminal 3: Queue size
watch -n 5 'ls -1 queue/*.json 2>/dev/null | wc -l'

📊 Database Schema

strategies table

CREATE TABLE strategies (
  id INTEGER PRIMARY KEY,
  name TEXT UNIQUE,              -- e.g., "v9_flip0.7_ma0.40_adx25"
  indicator_type TEXT,            -- e.g., "v9_moneyline"
  params JSON,                    -- Full parameter configuration
  pnl_per_1k REAL,                -- Performance metric
  trade_count INTEGER,            -- Total trades
  win_rate REAL,                  -- Percentage
  profit_factor REAL,             -- Gross profit / gross loss
  max_drawdown REAL,              -- Peak-to-trough
  sharpe_ratio REAL,              -- Risk-adjusted returns
  tested_at TIMESTAMP,            -- When backtest completed
  status TEXT,                    -- pending/completed/deployed
  notes TEXT                      -- Optional comments
);

jobs table

CREATE TABLE jobs (
  id INTEGER PRIMARY KEY,
  job_file TEXT UNIQUE,           -- Filename in queue
  priority INTEGER,               -- 1 (high), 2 (medium), 3 (low)
  worker_id TEXT,                 -- Which worker processing
  status TEXT,                    -- queued/running/completed
  created_at TIMESTAMP,
  started_at TIMESTAMP,
  completed_at TIMESTAMP
);

🎯 Usage Examples

View Top Strategies

sqlite3 cluster/strategies.db <<EOF
SELECT 
  name, 
  printf('$%.2f', pnl_per_1k) as pnl,
  trade_count as trades,
  printf('%.1f%%', win_rate) as wr,
  printf('%.2f', profit_factor) as pf
FROM strategies 
WHERE status = 'completed'
ORDER BY pnl_per_1k DESC 
LIMIT 10;
EOF

Add Custom Strategy

from cluster.master import ClusterMaster

master = ClusterMaster()

# Test volume profile indicator
for window in [20, 50, 100]:
    for threshold in [0.6, 0.7, 0.8]:
        params = {
            'profile_window': window,
            'entry_threshold': threshold,
            'stop_loss_atr': 3.0
        }
        
        master.queue.create_job(
            'volume_profile',
            params,
            priority=2  # MEDIUM priority
        )

Check Worker Health

# Worker 1
ssh root@10.10.254.106 'pgrep -f backtester || echo IDLE'

# Worker 2
ssh root@10.10.254.106 'ssh root@10.20.254.100 "pgrep -f backtester || echo IDLE"'

Reset Stale Jobs

sqlite3 cluster/strategies.db <<EOF
UPDATE jobs 
SET status = 'queued', worker_id = NULL, started_at = NULL
WHERE status = 'running' 
  AND started_at < datetime('now', '-30 minutes');
EOF

🔧 Maintenance

Daily Tasks

✅ Check status dashboard (python3 status.py)
✅ Monitor top strategies (review improvements)
✅ Archive old results (if disk >80% full)

Weekly Tasks

✅ Review top 10 strategies
✅ Forward test promising candidates
✅ Deploy validated strategies to production

Monthly Tasks

✅ Backup strategies database
✅ Archive completed job files
✅ Review cluster performance metrics

📈 Performance Tracking

Key Metrics

Throughput:

Backtests completed per day
Average backtest duration
Worker utilization (% time active)

Quality:

Best P&L found vs baseline
Number of strategies >$200/1k
Consistency of top performers

Infrastructure:

CPU usage (should be ~70%)
RAM usage (should be <80%)
Disk usage (should be <90%)

Expected Progress

Day 1: 27 v9 jobs complete

Should see results within 1-2 hours
Top strategy identified

Week 1: 100+ v9 variations tested

Best configuration found
Ready for production deployment

Month 1: 1,000+ strategies tested

Multiple indicator families explored
Portfolio of top performers

🏆 Success Criteria

Phase 1 Complete (Week 1)

✅ Cluster operational 24/7
✅ All 27 v9 jobs completed
✅ Top strategy identified (>$200/1k P&L)
✅ Strategy validated via forward testing
✅ Deployed to production (if passes gates)

Phase 2 Complete (Month 1)

✅ 1,000+ strategies tested
✅ Multiple indicator families explored
✅ Best strategy >$250/1k P&L
✅ Consistent outperformance vs baseline

Phase 3 Complete (Month 3)

✅ 10,000+ strategies tested
✅ ML-based optimization integrated
✅ Best strategy >$300/1k P&L
✅ System self-optimizing autonomously

📞 Support & Documentation

Primary Docs:

/home/icke/traderv4/.github/copilot-instructions.md (5,181 lines - THE BIBLE)
/home/icke/traderv4/cluster/README.md (Operational guide)
/home/icke/traderv4/cluster/DEPLOYMENT.md (Step-by-step setup)

Key Files:

cluster/master.py (570 lines - Main controller)
cluster/worker.py (220 lines - Worker script)
cluster/setup_cluster.sh (Automated deployment)
cluster/status.py (Real-time dashboard)

Git Commit:

feat: Continuous optimization cluster for 2 EPYC servers
Commit: 2a8e04f
Date: November 29, 2025

✅ Ready to Deploy

All prerequisites met:

Code implemented (1,382 lines)
Documentation complete (2 comprehensive guides)
Setup automation ready (one-command deploy)
Safety features implemented (resource limits, error recovery)
Monitoring tools ready (status dashboard)
Git committed and pushed

Next step:

cd /home/icke/traderv4/cluster
./setup_cluster.sh

Let the machines discover better strategies! 🚀

Questions? Check the deployment guide or ask in main chat.

12 KiB Raw Blame History