2 Commits

Author SHA1 Message Date
mindesbunister
cc56b72df2 fix: Database-first cluster status detection + Stop button clarification
CRITICAL FIX (Nov 30, 2025):
- Dashboard showed 'idle' despite 22+ worker processes running
- Root cause: SSH-based worker detection timing out
- Solution: Check database for running chunks FIRST

Changes:
1. app/api/cluster/status/route.ts:
   - Query exploration database before SSH detection
   - If running chunks exist, mark workers 'active' even if SSH fails
   - Override worker status: 'offline' → 'active' when chunks running
   - Log: ' Cluster status: ACTIVE (database shows running chunks)'
   - Database is source of truth, SSH only for supplementary metrics

2. app/cluster/page.tsx:
   - Stop button ALREADY EXISTS (conditionally shown)
   - Shows Start when status='idle', Stop when status='active'
   - No code changes needed - fixed by status detection

Result:
- Dashboard now shows 'ACTIVE' with 2 workers (correct)
- Workers show 'active' status (was 'offline')
- Stop button automatically visible when cluster active
- System resilient to SSH timeouts/network issues

Verified:
- Container restarted: Nov 30 21:18 UTC
- API tested: Returns status='active', activeWorkers=2
- Logs confirm: Database-first logic working
- Workers confirmed running: 22+ processes on worker1, workers on worker2
2025-11-30 22:23:01 +01:00
mindesbunister
2a8e04fe57 feat: Continuous optimization cluster for 2 EPYC servers
- Master controller with job queue and result aggregation
- Worker scripts for parallel backtesting (22 workers per server)
- SQLite database for strategy ranking and performance tracking
- File-based job queue (simple, robust, survives crashes)
- Auto-setup script for both EPYC servers
- Status dashboard for monitoring progress
- Comprehensive deployment guide

Architecture:
- Master: Job generation, worker coordination, result collection
- Worker 1 (pve-nu-monitor01): AMD EPYC 7282, 22 parallel jobs
- Worker 2 (srv-bd-host01): AMD EPYC 7302, 22 parallel jobs
- Total capacity: ~49,000 backtests/day (44 cores @ 70%)

Initial focus: v9 parameter refinement (27 configurations)
Target: Find strategies >00/1k P&L (current baseline 92/1k)

Files:
- cluster/master.py: Main controller (570 lines)
- cluster/worker.py: Worker execution script (220 lines)
- cluster/setup_cluster.sh: Automated deployment
- cluster/status.py: Real-time status dashboard
- cluster/README.md: Operational documentation
- cluster/DEPLOYMENT.md: Step-by-step deployment guide
2025-11-29 22:34:52 +01:00