CRITICAL FIX (Nov 30, 2025):
- Dashboard showed 'idle' despite 22+ worker processes running
- Root cause: SSH-based worker detection timing out
- Solution: Check database for running chunks FIRST
Changes:
1. app/api/cluster/status/route.ts:
- Query exploration database before SSH detection
- If running chunks exist, mark workers 'active' even if SSH fails
- Override worker status: 'offline' → 'active' when chunks running
- Log: '✅ Cluster status: ACTIVE (database shows running chunks)'
- Database is the source of truth; SSH is used only for supplementary metrics (see the sketch after this list)
2. app/cluster/page.tsx:
- Stop button ALREADY EXISTS (conditionally shown)
- Shows Start when status='idle', Stop when status='active'
- No code changes needed - fixed by the corrected status detection (conditional rendering sketched below)
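A minimal sketch of the database-first logic in change 1, assuming hypothetical helpers (countRunningChunks, probeWorkersOverSsh, and an '@/lib/cluster' module) in place of the project's actual database and SSH utilities; the real route.ts may differ in shape:

// app/api/cluster/status/route.ts -- sketch only; countRunningChunks and
// probeWorkersOverSsh are hypothetical stand-ins for the real helpers.
import { NextResponse } from 'next/server';
import { countRunningChunks, probeWorkersOverSsh } from '@/lib/cluster'; // assumed module

const KNOWN_WORKERS = ['worker1', 'worker2'];

type WorkerStatus = { id: string; status: 'active' | 'offline' };

export async function GET() {
  // Database first: running chunks in the exploration database are the source of truth.
  const runningChunks = await countRunningChunks();

  // SSH detection is supplementary; a timeout must not flip the dashboard to 'idle'.
  let workers: WorkerStatus[];
  try {
    workers = await probeWorkersOverSsh(KNOWN_WORKERS);
  } catch {
    workers = KNOWN_WORKERS.map((id) => ({ id, status: 'offline' as const }));
  }

  // Override: while chunks are running, workers reported 'offline' are treated as 'active'.
  if (runningChunks > 0) {
    workers = workers.map((w) => ({ ...w, status: 'active' as const }));
    console.log('✅ Cluster status: ACTIVE (database shows running chunks)');
  }

  return NextResponse.json({
    status: runningChunks > 0 ? 'active' : 'idle',
    activeWorkers: workers.filter((w) => w.status === 'active').length,
    workers,
  });
}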
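The conditional Start/Stop rendering from change 2, reduced to its core; onStart/onStop are placeholders for the page's real handlers, not the actual component:

// app/cluster/page.tsx -- simplified excerpt.
type Props = {
  status: 'active' | 'idle';
  onStart: () => void;
  onStop: () => void;
};

export function ClusterControls({ status, onStart, onStop }: Props) {
  // The button is picked purely from the reported status, so correcting the
  // status detection is enough to surface the Stop button.
  return status === 'active'
    ? <button onClick={onStop}>Stop</button>
    : <button onClick={onStart}>Start</button>;
}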
Result:
- Dashboard now shows 'ACTIVE' with 2 workers (correct)
- Workers show 'active' status (was 'offline')
- Stop button automatically visible when cluster active
- System resilient to SSH timeouts/network issues
Verified:
- Container restarted: Nov 30 21:18 UTC
- API tested: Returns status='active', activeWorkers=2
- Logs confirm: Database-first logic working
- Workers confirmed running: 22+ processes on worker1, additional worker processes on worker2
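The API check can be repeated with a short script; the host and port are an assumption (Next.js default), and only the status and activeWorkers fields are confirmed by the notes above:

// verify-status.ts (hypothetical) -- re-run the check from the Verified list.
async function checkClusterStatus(): Promise<void> {
  // Route path follows from app/api/cluster/status/route.ts; localhost:3000 is assumed.
  const res = await fetch('http://localhost:3000/api/cluster/status');
  const body = await res.json();
  console.log(`status=${body.status}, activeWorkers=${body.activeWorkers}`); // expect: active, 2
}

checkClusterStatus().catch(console.error);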
Helper script (Python, 25 lines) for a one-time collection of results from completed chunks:
#!/usr/bin/env python3
"""One-time script to collect results from completed chunks"""

import sys

sys.path.insert(0, '.')

from distributed_coordinator import DistributedCoordinator

completed_chunks = {
    'worker1': ['v9_chunk_000006', 'v9_chunk_000008'],
    'worker2': ['v9_chunk_000000', 'v9_chunk_000007', 'v9_chunk_000009']
}

coordinator = DistributedCoordinator()

for worker_id, chunks in completed_chunks.items():
    for chunk_id in chunks:
        print(f"📥 Collecting {chunk_id} from {worker_id}...")
        try:
            coordinator.collect_results(worker_id, chunk_id)
            print(f"✅ Successfully collected {chunk_id}")
        except Exception as e:
            print(f"⚠️ Error collecting {chunk_id}: {e}")

print("\n✅ Collection complete!")