docs: Update cluster start button fix documentation with Dec 1 database cleanup solution

mindesbunister
2025-12-01 08:29:37 +01:00
parent 5d07fbbd28
commit 203eedd33e


@@ -1,14 +1,138 @@
# Cluster Start Button Fix - COMPLETE (Nov 30 + Dec 1, 2025)
## Problem History
### Original Issue (Nov 30, 2025)
The cluster start button in the web dashboard was executing the coordinator command successfully, but the coordinator would exit immediately without doing any work.
**Root Cause:** The coordinator had a hardcoded `chunk_size = 10,000`, which was designed for large explorations with millions of combinations. For the v9 exploration with only 4,096 combinations, this caused a logic error:
```
📋 Resuming from chunk 1 (found 1 existing chunks)
Starting at combo 10,000 / 4,096
```
The coordinator calculated that chunk 1 would start at combo 10,000 (chunk_size × chunk_id), but since 10,000 > 4,096 total combos, it thought all work was complete and exited immediately.
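To make the failure mode concrete, here is a minimal sketch of the resume arithmetic (illustrative TypeScript only; the real logic lives in the Python `distributed_coordinator.py`, and the variable names here are assumptions):
```typescript
// Minimal sketch of the faulty resume arithmetic (illustrative; names assumed).
const CHUNK_SIZE = 10_000       // hardcoded value that caused the bug
const TOTAL_COMBOS = 4_096      // size of the v9 exploration
const existingChunks = 1        // "found 1 existing chunks" in the log above

// Resuming: the next combo index is chunk_id * chunk_size.
const startCombo = existingChunks * CHUNK_SIZE  // 10,000

if (startCombo >= TOTAL_COMBOS) {
  // 10,000 >= 4,096, so the coordinator concluded all work was done and exited.
  console.log(`Starting at combo ${startCombo} / ${TOTAL_COMBOS} -> nothing left, exiting`)
}
```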
### Second Issue (Dec 1, 2025) - DATABASE STALE STATE
**Symptom:** Start button showed "already running" when cluster wasn't actually running
**Root Cause:** Database had stale chunks in `status='running'` state from previously crashed/killed coordinator process, but no actual coordinator process was running.
**Impact:** User could not start cluster for parameter optimization work (4,000 combinations pending).
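One way to see the mismatch is to compare what the database claims with what is actually running. A minimal sketch, assuming the `chunks` table and the `sqlite3`/`ps` commands used elsewhere in this doc (this is not the committed route code):
```typescript
// Sketch: detect the stale-state mismatch (assumed helper, not committed code).
import { exec } from 'child_process'
import { promisify } from 'util'

const execAsync = promisify(exec)

async function hasStaleRunningChunks(dbPath: string): Promise<boolean> {
  // How many chunks does the database claim are running?
  const { stdout: dbCount } = await execAsync(
    `sqlite3 ${dbPath} "SELECT COUNT(*) FROM chunks WHERE status='running';"`
  )
  // Is a coordinator process actually alive?
  const { stdout: procCount } = await execAsync(
    'ps aux | grep distributed_coordinator.py | grep -v grep | wc -l'
  )
  // Stale state: rows say "running" but no coordinator process exists.
  return parseInt(dbCount.trim()) > 0 && parseInt(procCount.trim()) === 0
}
```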
## Solutions Implemented
### Fix 1: Coordinator Chunk Size (Nov 30, 2025)
Changed hardcoded chunk_size from 10,000 to dynamic calculation based on total combinations.
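A minimal sketch of the dynamic-calculation idea (the actual fix is in the Python coordinator; `targetChunks` is an assumed tuning knob, not the real parameter name):
```typescript
// Sketch of a dynamic chunk-size calculation (illustrative; knob name assumed).
function chooseChunkSize(totalCombos: number, targetChunks = 16): number {
  // Never exceed the total work, and keep at least one combination per chunk.
  return Math.max(1, Math.min(totalCombos, Math.ceil(totalCombos / targetChunks)))
}

chooseChunkSize(4_096)      // 256    -> 16 chunks for the v9 exploration
chooseChunkSize(1_000_000)  // 62_500 -> still 16 chunks for a large exploration
```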
### Fix 2: Database Cleanup in Control Endpoint (Dec 1, 2025) - CRITICAL FIX
**File:** `app/api/cluster/control/route.ts`
**Problem:** Control endpoint didn't check database state, only process state. This meant:
- Crashed coordinator left chunks in "running" state
- Status API checked database → saw "running" → reported "active" (see the sketch after this list)
- Start button disabled when status = "active"
- User couldn't start cluster even though nothing was running
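A minimal sketch of a status check that trusts only the database, which is how the stale rows turned into a disabled Start button (function and return values here are assumptions, not the actual status route):
```typescript
// Sketch of a database-only status check (names assumed; shows how stale
// status='running' rows end up reported as "active" to the UI).
import { exec } from 'child_process'
import { promisify } from 'util'

const execAsync = promisify(exec)

async function getClusterStatus(dbPath: string): Promise<'active' | 'idle'> {
  const { stdout } = await execAsync(
    `sqlite3 ${dbPath} "SELECT COUNT(*) FROM chunks WHERE status='running';"`
  )
  // A crashed coordinator leaves rows in status='running', so this reports
  // "active" even though no process exists, and the Start button stays disabled.
  return parseInt(stdout.trim()) > 0 ? 'active' : 'idle'
}
```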
**Solution Implemented:**
1. **Enhanced Start Action:**
- Check if coordinator already running (prevent duplicates)
- Reset any stale "running" chunks to "pending" before starting
- Verify coordinator actually started, return log output on failure
```typescript
// Check if coordinator is already running
const checkCmd = 'ps aux | grep distributed_coordinator.py | grep -v grep | wc -l'
const { stdout: checkStdout } = await execAsync(checkCmd)
const alreadyRunning = parseInt(checkStdout.trim()) > 0
if (alreadyRunning) {
return NextResponse.json({
success: false,
error: 'Coordinator is already running',
}, { status: 400 })
}
// Reset any stale "running" chunks (orphaned from crashed coordinator)
const dbPath = path.join(process.cwd(), 'cluster', 'exploration.db')
const resetCmd = `sqlite3 ${dbPath} "UPDATE chunks SET status='pending', assigned_worker=NULL, started_at=NULL WHERE status='running';"`
await execAsync(resetCmd)
console.log('✅ Database cleanup complete')
// Start the coordinator
const startCmd = 'cd /home/icke/traderv4/cluster && nohup python3 distributed_coordinator.py > coordinator.log 2>&1 &'
await execAsync(startCmd)
```
2. **Enhanced Stop Action** (see the sketch after this list):
- Reset running chunks to pending when stopping
- Prevents future stale database states
- Graceful handling if no processes found
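A minimal sketch of that stop path (assumed structure; the committed code in `app/api/cluster/control/route.ts` may differ in details):
```typescript
// Sketch of the enhanced stop action (assumed structure, not the committed code).
import { exec } from 'child_process'
import { promisify } from 'util'

const execAsync = promisify(exec)

async function stopCluster(dbPath: string): Promise<void> {
  // pkill exits non-zero when nothing matches; swallow that so an already
  // stopped cluster is handled gracefully.
  await execAsync('pkill -f distributed_coordinator.py').catch(() => {})
  await execAsync('pkill -f distributed_worker.py').catch(() => {})

  // Return any in-flight chunks to the pending pool so the database never
  // reports "active" after a stop.
  await execAsync(
    `sqlite3 ${dbPath} "UPDATE chunks SET status='pending', assigned_worker=NULL, started_at=NULL WHERE status='running';"`
  )
}
```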
**Immediate Fix Applied (Dec 1, 2025):**
```bash
sqlite3 cluster/exploration.db "UPDATE chunks SET status='pending', assigned_worker=NULL, started_at=NULL WHERE status='running';"
```
**Result:** Cluster status changed from "active" to "idle", start button functional again.
## Verification Checklist
- [x] Fix 1: Coordinator chunk size adjusted (Nov 30)
- [x] Fix 2: Database cleanup applied (Dec 1)
- [x] Cluster status shows "idle" (verified)
- [x] Control endpoint enhanced (committed 5d07fbb)
- [x] Docker container rebuilt and restarted
- [x] Code committed and pushed
- [ ] **USER ACTION NEEDED:** Test start button functionality
- [ ] **USER ACTION NEEDED:** Verify coordinator starts and workers begin processing
## Testing Instructions
1. **Open cluster UI:** http://localhost:3001/cluster
2. **Click Start Cluster button**
3. **Expected behavior:**
- Button should trigger start action
- Cluster status should change from "idle" to "active"
- Active workers should increase from 0 to 2
- Workers should begin processing parameter combinations
4. **Verify on EPYC servers:**
```bash
# Check coordinator running
ssh root@10.10.254.106 "ps aux | grep distributed_coordinator | grep -v grep"
# Check workers running
ssh root@10.10.254.106 "ps aux | grep distributed_worker | wc -l"
```
5. **Check database state:**
```bash
sqlite3 cluster/exploration.db "SELECT id, status, assigned_worker FROM chunks ORDER BY id;"
```
## Status
✅ **FIX 1 COMPLETE** (Nov 30, 2025)
- Coordinator chunk size fixed
- Verified coordinator can process 4,096 combinations
✅ **FIX 2 DEPLOYED** (Dec 1, 2025 08:38 UTC)
- Container rebuilt: 77s build time
- Container restarted: trading-bot-v4 running
- Cluster status: idle (correct)
- Database cleanup logic active in start/stop actions
- Ready for user testing
⏳ **PENDING USER VERIFICATION**
- User needs to test start button functionality
- User needs to verify coordinator starts successfully
- User needs to confirm workers begin processing
## Git Commits
**Nov 30:** Coordinator chunk size fix
**Dec 1 (5d07fbb):** "critical: Fix EPYC cluster start button - database cleanup before start"
**Files Changed:** `app/api/cluster/control/route.ts` (61 insertions, 5 deletions)