docs: Update cluster start button fix documentation with Dec 1 database cleanup solution

2025-12-01 08:29:37 +01:00
parent 5d07fbbd28
commit 203eedd33e
1 changed files with 130 additions and 6 deletions
--- a/CLUSTER_START_BUTTON_FIX.md
+++ b/CLUSTER_START_BUTTON_FIX.md
@@ -1,14 +1,138 @@
-# Cluster Start Button Fix - Nov 30, 2025
+# Cluster Start Button Fix - COMPLETE (Nov 30 + Dec 1, 2025)

-## Problem
+## Problem History
+
+### Original Issue (Nov 30, 2025)
 The cluster start button in the web dashboard was executing the coordinator command successfully, but the coordinator would exit immediately without doing any work.

-## Root Cause
-The coordinator had a hardcoded `chunk_size = 10,000` which was designed for large explorations with millions of combinations. For the v9 exploration with only 4,096 combinations, this caused a logic error:
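+**Root Cause:** The coordinator had a hardcoded `chunk_size = 10,000`, which was designed for large explorations with millions of combinations. For the v9 exploration with only 4,096 combinations, this caused a logic error: chunk 1 was computed to start at combo 10,000, which is past the 4,096 total, so the coordinator treated all work as complete and exited.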

+### Second Issue (Dec 1, 2025) - DATABASE STALE STATE
+**Symptom:** Start button showed "already running" when cluster wasn't actually running
+
+**Root Cause:** The database held stale chunks with `status='running'`, left behind by a previously crashed or killed coordinator process, while no coordinator process was actually running.
+
+**Impact:** User could not start cluster for parameter optimization work (4,000 combinations pending).
+
+## Solutions Implemented
+
+### Fix 1: Coordinator Chunk Size (Nov 30, 2025)
+Replaced the hardcoded `chunk_size` of 10,000 with a dynamic calculation based on the total number of combinations.
+
+### Fix 2: Database Cleanup in Control Endpoint (Dec 1, 2025) - CRITICAL FIX
+
+**File:** `app/api/cluster/control/route.ts`
+
+**Problem:** Control endpoint didn't check database state, only process state. This meant:
+- Crashed coordinator left chunks in "running" state
+- Status API checked database → saw "running" → reported "active"
+- Start button disabled when status = "active"
+- User couldn't start cluster even though nothing was running
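
The mismatch described in the list above can be illustrated with a small sketch. It is not taken from `route.ts`: the function name `classifyClusterState` and its two inputs are hypothetical, and it assumes only the two signals the text mentions, namely the number of chunks marked `running` in the database and the number of live coordinator processes.

```typescript
// Hypothetical sketch, not code from app/api/cluster/control/route.ts.
type ClusterState = 'idle' | 'active' | 'stale'

function classifyClusterState(
  runningChunks: number,        // rows with status='running' in the chunks table
  coordinatorProcesses: number  // live distributed_coordinator.py processes
): ClusterState {
  // A live coordinator means the cluster really is active.
  if (coordinatorProcesses > 0) return 'active'
  // No process, but the database still claims work is in flight: orphaned rows.
  if (runningChunks > 0) return 'stale'
  return 'idle'
}
```

Under this reading, the Dec 1 symptom corresponds to the `'stale'` case: a status check that looks only at the database would report it as active, which is why the start action needs to reset those rows first.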
+
+**Solution Implemented:**
+
+1. **Enhanced Start Action:**
+   - Check if coordinator already running (prevent duplicates)
+   - Reset any stale "running" chunks to "pending" before starting
+   - Verify coordinator actually started, return log output on failure
+
+```typescript
+// Check if coordinator is already running
+const checkCmd = 'ps aux | grep distributed_coordinator.py | grep -v grep | wc -l'
+const { stdout: checkStdout } = await execAsync(checkCmd)
+const alreadyRunning = parseInt(checkStdout.trim(), 10) > 0
+
+if (alreadyRunning) {
+  return NextResponse.json({
+    success: false,
+    error: 'Coordinator is already running',
+  }, { status: 400 })
+}
+
+// Reset any stale "running" chunks (orphaned from crashed coordinator)
+const dbPath = path.join(process.cwd(), 'cluster', 'exploration.db')
+const resetCmd = `sqlite3 "${dbPath}" "UPDATE chunks SET status='pending', assigned_worker=NULL, started_at=NULL WHERE status='running';"`
+await execAsync(resetCmd)
+console.log('✅ Database cleanup complete')
+
+// Start the coordinator
+const startCmd = 'cd /home/icke/traderv4/cluster && nohup python3 distributed_coordinator.py > coordinator.log 2>&1 &'
+await execAsync(startCmd)
 ```
-📋 Resuming from chunk 1 (found 1 existing chunks)
-   Starting at combo 10,000 / 4,096
+
+2. **Enhanced Stop Action:**
+   - Reset running chunks to pending when stopping
+   - Prevents future stale database states
+   - Graceful handling if no processes found
+
+**Immediate Fix Applied (Dec 1):**
+```bash
+sqlite3 cluster/exploration.db "UPDATE chunks SET status='pending', assigned_worker=NULL, started_at=NULL WHERE status='running';"
+```
+
+**Result:** Cluster status changed from "active" to "idle", start button functional again.
+
+## Verification Checklist
+
+- [x] Fix 1: Coordinator chunk size adjusted (Nov 30)
+- [x] Fix 2: Database cleanup applied (Dec 1)
+- [x] Cluster status shows "idle" (verified)
+- [x] Control endpoint enhanced (committed 5d07fbb)
+- [x] Docker container rebuilt and restarted
+- [x] Code committed and pushed
+- [ ] **USER ACTION NEEDED:** Test start button functionality
+- [ ] **USER ACTION NEEDED:** Verify coordinator starts and workers begin processing
+
+## Testing Instructions
+
+1. **Open cluster UI:** http://localhost:3001/cluster
+2. **Click Start Cluster button**
+3. **Expected behavior:**
+   - Button should trigger start action
+   - Cluster status should change from "idle" to "active"
+   - Active workers should increase from 0 to 2
+   - Workers should begin processing parameter combinations
+
+4. **Verify on EPYC servers:**
+   ```bash
+   # Check coordinator running
+   ssh root@10.10.254.106 "ps aux | grep distributed_coordinator | grep -v grep"
+
+   # Check workers running (exclude the grep process itself from the count)
+   ssh root@10.10.254.106 "ps aux | grep distributed_worker | grep -v grep | wc -l"
+   ```
+
+5. **Check database state:**
+   ```bash
+   sqlite3 cluster/exploration.db "SELECT id, status, assigned_worker FROM chunks ORDER BY id;"
+   ```
+
+## Status
+
+✅ **FIX 1 COMPLETE** (Nov 30, 2025)
+- Coordinator chunk size fixed
+- Verified coordinator can process 4,096 combinations
+
+✅ **FIX 2 DEPLOYED** (Dec 1, 2025 08:38 UTC)
+- Container rebuilt: 77s build time
+- Container restarted: trading-bot-v4 running
+- Cluster status: idle (correct)
+- Database cleanup logic active in start/stop actions
+- Ready for user testing
+
+⏳ **PENDING USER VERIFICATION**
+- User needs to test start button functionality
+- User needs to verify coordinator starts successfully
+- User needs to confirm workers begin processing
+
+## Git Commits
+
+- **Nov 30:** Coordinator chunk size fix
+- **Dec 1 (5d07fbb):** "critical: Fix EPYC cluster start button - database cleanup before start"
+- **Files Changed:** `app/api/cluster/control/route.ts` (61 insertions, 5 deletions)
+
 ```

 The coordinator calculated that chunk 1 would start at combo 10,000 (chunk_size × chunk_id), but since 10,000 > 4,096 total combos, it thought all work was complete and exited immediately.
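
The arithmetic behind this early exit can be reproduced with a minimal sketch. The real coordinator is a Python script that is not shown here, so the function names below are hypothetical. The sizing rule in `dynamicChunkSize` is an assumption: the source says only that the chunk size is now derived from the total number of combinations.

```typescript
// Hypothetical sketch of the resume arithmetic; not the coordinator's actual code.
function resumeOffset(chunkSize: number, nextChunkId: number): number {
  // Offset of the first combo in the chunk to resume from (chunk_size × chunk_id).
  return chunkSize * nextChunkId
}

function hasRemainingWork(chunkSize: number, nextChunkId: number, totalCombos: number): boolean {
  return resumeOffset(chunkSize, nextChunkId) < totalCombos
}

// Assumed sizing rule (not confirmed by the source): split the total evenly
// across a target number of chunks, capped at the old 10,000 limit.
function dynamicChunkSize(totalCombos: number, targetChunks: number, maxChunkSize = 10000): number {
  const perChunk = Math.ceil(totalCombos / Math.max(1, targetChunks))
  return Math.max(1, Math.min(maxChunkSize, perChunk))
}

// With the hardcoded size, chunk 1 starts at 10,000, past the 4,096 total.
console.log(hasRemainingWork(10000, 1, 4096)) // false

// With a size derived from the total, chunk 1 starts inside the range.
const size = dynamicChunkSize(4096, 4) // 1024
console.log(hasRemainingWork(size, 1, 4096)) // true
```

The target of 4 chunks is an arbitrary illustration value. Any rule that keeps the chunk size below the total number of combinations avoids the early exit.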