Commit Graph

2 Commits

Author SHA1 Message Date
mindesbunister
5d07fbbd28 critical: Fix EPYC cluster start button - database cleanup before start
Problem:
- Start button showed 'already running' when cluster wasn't actually running
- Database had stale chunks in 'running' state from crashed/killed coordinator
- Control endpoint checked process but not database state

Solution:
1. Reset stale 'running' chunks to 'pending' before starting coordinator
2. Verify coordinator not running before starting (prevent duplicates)
3. Add database cleanup to stop action as well (prevent future stale states)
4. Enhanced error reporting with coordinator log output

Changes:
- app/api/cluster/control/route.ts
  - Added database cleanup in start action (reset running chunks)
  - Added process check before start (prevent duplicates)
  - Added database cleanup in stop action (cleanup orphaned state)
  - Added coordinator log output on start failure
  - Improved error messages and logging

Impact:
- Start button now works correctly even after unclean coordinator shutdown
- Prevents false 'already running' reports
- Automatic cleanup of stale database state
- Better error diagnostics

Verified:
- Container rebuilt and restarted successfully
- Cluster status shows 'idle' after database cleanup
- Ready for user to test start button functionality
2025-12-01 08:28:05 +01:00
mindesbunister
cc56b72df2 fix: Database-first cluster status detection + Stop button clarification
CRITICAL FIX (Nov 30, 2025):
- Dashboard showed 'idle' despite 22+ worker processes running
- Root cause: SSH-based worker detection timing out
- Solution: Check database for running chunks FIRST

Changes:
1. app/api/cluster/status/route.ts:
   - Query exploration database before SSH detection
   - If running chunks exist, mark workers 'active' even if SSH fails
   - Override worker status: 'offline' → 'active' when chunks running
   - Log: ' Cluster status: ACTIVE (database shows running chunks)'
   - Database is source of truth, SSH only for supplementary metrics

2. app/cluster/page.tsx:
   - Stop button ALREADY EXISTS (conditionally shown)
   - Shows Start when status='idle', Stop when status='active'
   - No code changes needed - fixed by status detection

Result:
- Dashboard now shows 'ACTIVE' with 2 workers (correct)
- Workers show 'active' status (was 'offline')
- Stop button automatically visible when cluster active
- System resilient to SSH timeouts/network issues

Verified:
- Container restarted: Nov 30 21:18 UTC
- API tested: Returns status='active', activeWorkers=2
- Logs confirm: Database-first logic working
- Workers confirmed running: 22+ processes on worker1, workers on worker2
2025-11-30 22:23:01 +01:00