CRITICAL FIX (Nov 30, 2025):
- Dashboard showed 'idle' despite 22+ worker processes running
- Root cause: SSH-based worker detection timing out
- Solution: Check database for running chunks FIRST
Changes:
1. app/api/cluster/status/route.ts:
- Query exploration database before SSH detection
- If running chunks exist, mark workers 'active' even if SSH fails
- Override worker status: 'offline' → 'active' when chunks running
- Log: '✅ Cluster status: ACTIVE (database shows running chunks)'
- Database is the source of truth; SSH is used only for supplementary metrics
2. app/cluster/page.tsx:
- Stop button ALREADY EXISTS (conditionally shown)
- Shows Start when status='idle', Stop when status='active'
- No code changes needed; the button appears once status detection is fixed
Result:
- Dashboard now shows 'ACTIVE' with 2 workers (correct)
- Workers show 'active' status (was 'offline')
- Stop button automatically visible when cluster active
- System resilient to SSH timeouts/network issues
Verified:
- Container restarted: Nov 30 21:18 UTC
- API tested: Returns status='active', activeWorkers=2
- Logs confirm: Database-first logic working
- Workers confirmed running: 22+ processes on worker1, workers on worker2
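
The database-first ordering described under Changes can be sketched as a small decision function. This is only an illustration in shell, not the TypeScript code in app/api/cluster/status/route.ts: the function name resolve_status, its two arguments, and the fallback behavior when no chunks are running are all assumptions made for this sketch.

```shell
#!/bin/sh
# Hypothetical helper (not from the codebase): resolve_status RUNNING_CHUNKS SSH_RESULT
#   RUNNING_CHUNKS  number of running chunks reported by the database
#   SSH_RESULT      "active", "idle", or "timeout" from SSH-based detection
resolve_status() {
    running_chunks=$1
    ssh_result=$2

    # Database first: running chunks mean the cluster is active,
    # whatever SSH reported (including a timeout).
    if [ "$running_chunks" -gt 0 ]; then
        echo "active"
        return
    fi

    # Assumption: with no running chunks, fall back to the SSH result
    # and treat a timeout the same as idle.
    case "$ssh_result" in
        active) echo "active" ;;
        *)      echo "idle" ;;
    esac
}

resolve_status 3 timeout   # database shows running chunks, SSH timed out
resolve_status 0 idle      # nothing running anywhere
```

The first call prints active even though SSH timed out, which is the case the dashboard previously reported as idle.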
22 lines
706 B
Bash
Executable File
#!/bin/bash
# Compare quoting patterns for running a command through an intermediate SSH host.

echo "Testing different SSH command patterns..."
echo ""

# What coordinator currently does (BROKEN): the local shell strips the single
# quotes, so the intermediate host's shell interprets '>' and '&&' itself and
# the file is written on 10.10.254.106 instead of 10.20.254.100.
echo "=== Test 1: Single quotes (current) ==="
ssh root@10.10.254.106 ssh root@10.20.254.100 'echo test1 > /tmp/test1.txt && cat /tmp/test1.txt'
echo "Exit code: $?"
echo ""

# What should work (double-nested quotes): the inner single quotes survive the
# first hop, so the whole command line runs on 10.20.254.100.
echo "=== Test 2: Double-nested quotes ==="
ssh root@10.10.254.106 "ssh root@10.20.254.100 'echo test2 > /tmp/test2.txt && cat /tmp/test2.txt'"
echo "Exit code: $?"
echo ""

# Verify which files were created on the target host:
echo "=== Checking which test files exist ==="
ssh root@10.10.254.106 "ssh root@10.20.254.100 'ls -la /tmp/test*.txt 2>/dev/null || echo \"No test files found\"'"
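
The difference between the two patterns can be reproduced without any network access. The sketch below is a local emulation, not the real hosts: fake_ssh is a hypothetical stand-in written for this example that, like ssh, joins its command arguments with spaces and hands the resulting string to a fresh shell. Each "host" is just a scratch directory (hop and target are made-up names), which makes it easy to see where the redirect lands.

```shell
#!/bin/sh
# Local emulation of how ssh forwards a command; no network access is used.
# fake_ssh is a hypothetical stand-in for this sketch, not a real tool.
SANDBOX=$(mktemp -d)
export SANDBOX
trap 'rm -rf "$SANDBOX"' EXIT
mkdir -p "$SANDBOX/bin" "$SANDBOX/hop" "$SANDBOX/target"

# Like ssh, fake_ssh joins its command arguments with spaces and passes the
# resulting string to a new shell running on the "remote" side.
cat > "$SANDBOX/bin/fake_ssh" <<'EOF'
#!/bin/sh
host=$1
shift
cd "$SANDBOX/$host" && exec sh -c "$*"
EOF
chmod +x "$SANDBOX/bin/fake_ssh"
PATH="$SANDBOX/bin:$PATH"
export PATH

# Pattern 1: the local shell strips the single quotes, so the hop's shell
# sees '>' and '&&' and applies them itself.
fake_ssh hop fake_ssh target 'echo test1 > test1.txt && cat test1.txt'

# Pattern 2: the outer double quotes keep the inner single quotes intact
# until the hop, so the redirect is evaluated on the target.
fake_ssh hop "fake_ssh target 'echo test2 > test2.txt && cat test2.txt'"

hop_files=$(ls "$SANDBOX/hop")
target_files=$(ls "$SANDBOX/target")
echo "hop:    $hop_files"
echo "target: $target_files"
```

Both patterns print their test string and exit successfully, so the exit code alone does not reveal the problem. Only the final listing shows that test1.txt ended up on the hop while test2.txt ended up on the target.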