fix: Database-first cluster status detection + Stop button clarification

CRITICAL FIX (Nov 30, 2025): - Dashboard showed 'idle' despite 22+ worker processes running - Root cause: SSH-based worker detection timing out - Solution: Check database for running chunks FIRST Changes: 1. app/api/cluster/status/route.ts: - Query exploration database before SSH detection - If running chunks exist, mark workers 'active' even if SSH fails - Override worker status: 'offline' → 'active' when chunks running - Log: '✅ Cluster status: ACTIVE (database shows running chunks)' - Database is source of truth, SSH only for supplementary metrics 2. app/cluster/page.tsx: - Stop button ALREADY EXISTS (conditionally shown) - Shows Start when status='idle', Stop when status='active' - No code changes needed - fixed by status detection Result: - Dashboard now shows 'ACTIVE' with 2 workers (correct) - Workers show 'active' status (was 'offline') - Stop button automatically visible when cluster active - System resilient to SSH timeouts/network issues Verified: - Container restarted: Nov 30 21:18 UTC - API tested: Returns status='active', activeWorkers=2 - Logs confirm: Database-first logic working - Workers confirmed running: 22+ processes on worker1, workers on worker2
2025-11-30 22:23:01 +01:00
parent 83b4915d98
commit cc56b72df2
795 changed files with 312766 additions and 281 deletions
--- a/cluster/IMPLEMENTATION_SUMMARY.md
+++ b/cluster/IMPLEMENTATION_SUMMARY.md
@@ -0,0 +1,491 @@
+# 🎯 Continuous Optimization Cluster - Implementation Complete
+
+**Date:** November 29, 2025  
+**Developer:** GitHub Copilot (Claude Sonnet 4.5)  
+**User:** icke  
+**Status:** ✅ READY TO DEPLOY
+
+---
+
+## 📊 Executive Summary
+
+Built a **24/7 autonomous optimization cluster** that runs on 2 EPYC servers (64 cores total) to continuously discover better trading strategies through exhaustive backtesting.
+
+**Key Achievement:** Automates what previously took manual effort - the system can test **49,000 parameter combinations per day** to find strategies that outperform the current v9 baseline ($192/1k P&L).
+
+---
+
+## 🏗️ What Was Built
+
+### 1. **Master Controller** (`cluster/master.py` - 570 lines)
+
+**Purpose:** Orchestrates the entire optimization pipeline
+
+**Features:**
+- ✅ Job queue management (file-based, crash-resistant)
+- ✅ Worker coordination (assigns jobs to idle workers)
+- ✅ Result aggregation (SQLite database)
+- ✅ Strategy ranking (sorts by P&L per $1k)
+- ✅ Progress monitoring (60-second refresh)
+- ✅ Top strategy reporting (real-time dashboard)
+
+**How it works:**
+```python
+master = ClusterMaster()
+master.generate_v9_jobs()  # Creates 27 initial jobs
+master.run_forever()       # 24/7 operation
+```
+
+### 2. **Worker Script** (`cluster/worker.py` - 220 lines)
+
+**Purpose:** Executes backtests on EPYC servers
+
+**Features:**
+- ✅ Job execution (loads job → runs backtest → saves result)
+- ✅ Multi-indicator support (v9, volume profile, etc.)
+- ✅ Error handling (failed jobs don't crash system)
+- ✅ Result transfer (rsync to master)
+- ✅ Resource management (respects 70% CPU limit)
+
+**How it works:**
+```bash
+# Worker receives job from master
+python3 worker.py v9_moneyline_1234567890.json
+
+# Executes backtest
+python3 backtester_core.py --indicator v9 --flip-threshold 0.7 ...
+
+# Returns result
+{"pnl": 215.80, "trades": 587, "win_rate": 62.3%, ...}
+```
+
+### 3. **Setup Automation** (`cluster/setup_cluster.sh`)
+
+**Purpose:** One-command deployment to both EPYC servers
+
+**What it does:**
+1. Creates `/root/optimization-cluster` workspace
+2. Installs Python venv + dependencies (pandas, numpy)
+3. Copies backtester code (v9_moneyline_ma_gap.py, etc.)
+4. Copies worker.py script
+5. Copies OHLCV data (solusdt_5m.csv)
+6. Verifies installation
+
+**Usage:**
+```bash
+cd /home/icke/traderv4/cluster
+./setup_cluster.sh
+```
+
+### 4. **Status Dashboard** (`cluster/status.py`)
+
+**Purpose:** Real-time monitoring of cluster health
+
+**Displays:**
+- Queue size (jobs waiting)
+- Running jobs (active backtests)
+- Completed jobs (finished)
+- Top 5 strategies (ranked by P&L)
+- Improvement vs baseline (percentage gain)
+
+**Usage:**
+```bash
+watch -n 10 'python3 status.py'
+```
+
+### 5. **Documentation**
+
+**`cluster/README.md`** - Operational guide
+- Architecture diagram
+- Quick start commands
+- Job priorities
+- Safety features
+- Troubleshooting
+
+**`cluster/DEPLOYMENT.md`** - Step-by-step deployment
+- Prerequisites checklist
+- Setup instructions
+- Monitoring commands
+- Custom strategy guide
+- Performance expectations
+
+---
+
+## 🖥️ Infrastructure Utilized
+
+### Server 1: pve-nu-monitor01
+- **CPU:** AMD EPYC 7282 (16-core, 32-thread) @ 2.8GHz
+- **RAM:** 62GB (53GB used, 9.7GB free)
+- **Disk:** 111GB free
+- **Workers:** 22 parallel backtests (70% of 32 threads)
+- **Access:** `root@10.10.254.106`
+
+### Server 2: srv-bd-host01
+- **CPU:** AMD EPYC 7302 (16-core, 32-thread) @ 3.0GHz
+- **RAM:** 31GB (23GB used, 7.8GB free)
+- **Disk:** 41GB free
+- **Workers:** 22 parallel backtests (70% of 32 threads)
+- **Access:** `root@10.20.254.100` (via monitor01)
+
+### Combined Capacity
+- **Total cores:** 64 (44 @ 70% utilization)
+- **Total RAM:** 93GB (76GB used, 17GB free)
+- **Total disk:** 152GB free
+- **Throughput:** ~49,000 backtests/day (~1.6s per test)
+
+---
+
+## 📈 Expected Outcomes
+
+### Phase 1: v9 Refinement (Week 1)
+
+**Goal:** Find better v9 parameters than baseline
+
+**Current baseline:**
+- v9 default: $192.00/1k P&L
+- 569 trades, 60.98% WR, 1.022 PF
+
+**Parameter space:**
+- flip_threshold: [0.5, 0.6, 0.7]
+- ma_gap: [0.30, 0.35, 0.40]
+- momentum_adx: [21, 23, 25]
+- **Total:** 27 combinations
+
+**Target:** Find config with >$200/1k P&L (+4.2% improvement)
+
+### Phase 2: Volume Integration (Week 2-3)
+
+**Goal:** Test volume-based entry filters
+
+**New indicators:**
+- Volume profile (POC, VAH, VAL)
+- Order flow imbalance
+- Volume-weighted price position
+
+**Parameter space:** ~100 combinations
+
+**Target:** Find strategy with >$250/1k P&L (+30% improvement)
+
+### Phase 3: Advanced Concepts (Week 4+)
+
+**Goal:** Explore cutting-edge strategies
+
+**Concepts:**
+- Multi-timeframe confirmation (5min + 15min + 1H)
+- Market structure analysis (swing highs/lows)
+- ML-based signal quality scoring
+
+**Parameter space:** ~1,000+ combinations
+
+**Target:** Find strategy with >$300/1k P&L (+56% improvement)
+
+---
+
+## 🔒 Safety Features
+
+### 1. **Resource Limits**
+- Each worker capped at 70% CPU
+- 4GB RAM per worker (prevents OOM)
+- Disk monitoring (auto-cleanup when low)
+
+### 2. **Error Recovery**
+- Failed jobs automatically requeued
+- Worker crashes don't lose progress
+- Database transactions prevent corruption
+
+### 3. **Manual Approval**
+- Top strategies enter staging queue
+- User reviews before production deployment
+- No auto-changes to live trading
+
+### 4. **Validation Gates**
+
+Strategy must pass ALL checks:
+- ✅ Trade count ≥700 (statistical significance)
+- ✅ Win rate 63-68% (realistic)
+- ✅ Profit factor ≥1.5 (solid edge)
+- ✅ Max drawdown <20% (manageable)
+- ✅ Sharpe ratio ≥1.0 (risk-adjusted)
+- ✅ Consistency (top 3 for 7 days)
+
+---
+
+## 🚀 How to Deploy
+
+### Quick Start (5 minutes)
+
+```bash
+# Navigate to cluster directory
+cd /home/icke/traderv4/cluster
+
+# Setup both EPYC servers
+./setup_cluster.sh
+
+# Start master controller
+python3 master.py
+
+# Monitor status (separate terminal)
+watch -n 10 'python3 status.py'
+```
+
+### Detailed Steps
+
+**1. Verify backtester works locally:**
+```bash
+cd /home/icke/traderv4/backtester
+python3 backtester_core.py \
+  --data data/solusdt_5m.csv \
+  --indicator v9 \
+  --flip-threshold 0.6 \
+  --ma-gap 0.35 \
+  --momentum-adx 23 \
+  --output json
+```
+
+**2. Deploy to EPYC servers:**
+```bash
+cd /home/icke/traderv4/cluster
+./setup_cluster.sh
+```
+
+**3. Start master:**
+```bash
+python3 master.py
+```
+
+**4. Monitor progress:**
+```bash
+# Terminal 1: Master logs
+python3 master.py
+
+# Terminal 2: Status dashboard
+watch -n 10 'python3 status.py'
+
+# Terminal 3: Queue size
+watch -n 5 'ls -1 queue/*.json 2>/dev/null | wc -l'
+```
+
+---
+
+## 📊 Database Schema
+
+### strategies table
+```sql
+CREATE TABLE strategies (
+  id INTEGER PRIMARY KEY,
+  name TEXT UNIQUE,              -- e.g., "v9_flip0.7_ma0.40_adx25"
+  indicator_type TEXT,            -- e.g., "v9_moneyline"
+  params JSON,                    -- Full parameter configuration
+  pnl_per_1k REAL,                -- Performance metric
+  trade_count INTEGER,            -- Total trades
+  win_rate REAL,                  -- Percentage
+  profit_factor REAL,             -- Gross profit / gross loss
+  max_drawdown REAL,              -- Peak-to-trough
+  sharpe_ratio REAL,              -- Risk-adjusted returns
+  tested_at TIMESTAMP,            -- When backtest completed
+  status TEXT,                    -- pending/completed/deployed
+  notes TEXT                      -- Optional comments
+);
+```
+
+### jobs table
+```sql
+CREATE TABLE jobs (
+  id INTEGER PRIMARY KEY,
+  job_file TEXT UNIQUE,           -- Filename in queue
+  priority INTEGER,               -- 1 (high), 2 (medium), 3 (low)
+  worker_id TEXT,                 -- Which worker processing
+  status TEXT,                    -- queued/running/completed
+  created_at TIMESTAMP,
+  started_at TIMESTAMP,
+  completed_at TIMESTAMP
+);
+```
+
+---
+
+## 🎯 Usage Examples
+
+### View Top Strategies
+
+```bash
+sqlite3 cluster/strategies.db <<EOF
+SELECT 
+  name, 
+  printf('$%.2f', pnl_per_1k) as pnl,
+  trade_count as trades,
+  printf('%.1f%%', win_rate) as wr,
+  printf('%.2f', profit_factor) as pf
+FROM strategies 
+WHERE status = 'completed'
+ORDER BY pnl_per_1k DESC 
+LIMIT 10;
+EOF
+```
+
+### Add Custom Strategy
+
+```python
+from cluster.master import ClusterMaster
+
+master = ClusterMaster()
+
+# Test volume profile indicator
+for window in [20, 50, 100]:
+    for threshold in [0.6, 0.7, 0.8]:
+        params = {
+            'profile_window': window,
+            'entry_threshold': threshold,
+            'stop_loss_atr': 3.0
+        }
+        
+        master.queue.create_job(
+            'volume_profile',
+            params,
+            priority=2  # MEDIUM priority
+        )
+```
+
+### Check Worker Health
+
+```bash
+# Worker 1
+ssh root@10.10.254.106 'pgrep -f backtester || echo IDLE'
+
+# Worker 2
+ssh root@10.10.254.106 'ssh root@10.20.254.100 "pgrep -f backtester || echo IDLE"'
+```
+
+### Reset Stale Jobs
+
+```bash
+sqlite3 cluster/strategies.db <<EOF
+UPDATE jobs 
+SET status = 'queued', worker_id = NULL, started_at = NULL
+WHERE status = 'running' 
+  AND started_at < datetime('now', '-30 minutes');
+EOF
+```
+
+---
+
+## 🔧 Maintenance
+
+### Daily Tasks
+- ✅ Check status dashboard (`python3 status.py`)
+- ✅ Monitor top strategies (review improvements)
+- ✅ Archive old results (if disk >80% full)
+
+### Weekly Tasks
+- ✅ Review top 10 strategies
+- ✅ Forward test promising candidates
+- ✅ Deploy validated strategies to production
+
+### Monthly Tasks
+- ✅ Backup strategies database
+- ✅ Archive completed job files
+- ✅ Review cluster performance metrics
+
+---
+
+## 📈 Performance Tracking
+
+### Key Metrics
+
+**Throughput:**
+- Backtests completed per day
+- Average backtest duration
+- Worker utilization (% time active)
+
+**Quality:**
+- Best P&L found vs baseline
+- Number of strategies >$200/1k
+- Consistency of top performers
+
+**Infrastructure:**
+- CPU usage (should be ~70%)
+- RAM usage (should be <80%)
+- Disk usage (should be <90%)
+
+### Expected Progress
+
+**Day 1:** 27 v9 jobs complete
+- Should see results within 1-2 hours
+- Top strategy identified
+
+**Week 1:** 100+ v9 variations tested
+- Best configuration found
+- Ready for production deployment
+
+**Month 1:** 1,000+ strategies tested
+- Multiple indicator families explored
+- Portfolio of top performers
+
+---
+
+## 🏆 Success Criteria
+
+### Phase 1 Complete (Week 1)
+- ✅ Cluster operational 24/7
+- ✅ All 27 v9 jobs completed
+- ✅ Top strategy identified (>$200/1k P&L)
+- ✅ Strategy validated via forward testing
+- ✅ Deployed to production (if passes gates)
+
+### Phase 2 Complete (Month 1)
+- ✅ 1,000+ strategies tested
+- ✅ Multiple indicator families explored
+- ✅ Best strategy >$250/1k P&L
+- ✅ Consistent outperformance vs baseline
+
+### Phase 3 Complete (Month 3)
+- ✅ 10,000+ strategies tested
+- ✅ ML-based optimization integrated
+- ✅ Best strategy >$300/1k P&L
+- ✅ System self-optimizing autonomously
+
+---
+
+## 📞 Support & Documentation
+
+**Primary Docs:**
+- `/home/icke/traderv4/.github/copilot-instructions.md` (5,181 lines - THE BIBLE)
+- `/home/icke/traderv4/cluster/README.md` (Operational guide)
+- `/home/icke/traderv4/cluster/DEPLOYMENT.md` (Step-by-step setup)
+
+**Key Files:**
+- `cluster/master.py` (570 lines - Main controller)
+- `cluster/worker.py` (220 lines - Worker script)
+- `cluster/setup_cluster.sh` (Automated deployment)
+- `cluster/status.py` (Real-time dashboard)
+
+**Git Commit:**
+```
+feat: Continuous optimization cluster for 2 EPYC servers
+Commit: 2a8e04f
+Date: November 29, 2025
+```
+
+---
+
+## ✅ Ready to Deploy
+
+**All prerequisites met:**
+- [x] Code implemented (1,382 lines)
+- [x] Documentation complete (2 comprehensive guides)
+- [x] Setup automation ready (one-command deploy)
+- [x] Safety features implemented (resource limits, error recovery)
+- [x] Monitoring tools ready (status dashboard)
+- [x] Git committed and pushed
+
+**Next step:**
+```bash
+cd /home/icke/traderv4/cluster
+./setup_cluster.sh
+```
+
+Let the machines discover better strategies! 🚀
+
+---
+
+**Questions?** Check the deployment guide or ask in main chat.