mindesbunister
1f83a7d7c4
feat: Add coordinator log viewer to cluster UI
- Created /api/cluster/logs endpoint to read coordinator.log
- Added real-time log display in cluster UI (updates every 3s)
- Shows last 100 lines of coordinator.log in terminal-style display
- Includes manual refresh button
- Improves debugging experience - no need to SSH for logs
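A minimal sketch of what this route handler could look like (file path, response shape, and error handling are assumptions, not the actual implementation):

    import { NextResponse } from 'next/server'
    import { readFile } from 'fs/promises'

    const LOG_PATH = 'cluster/coordinator.log'   // assumed location of the log
    const MAX_LINES = 100                        // commit: show last 100 lines

    export async function GET() {
      try {
        const raw = await readFile(LOG_PATH, 'utf-8')
        // Keep only the last 100 lines so the payload stays small.
        return NextResponse.json({ lines: raw.split('\n').slice(-MAX_LINES) })
      } catch (err) {
        // A missing log file just means the coordinator has not run yet.
        return NextResponse.json({ lines: [], error: String(err) })
      }
    }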
User feedback: 'why dont we add the output of the log at the bottom of the page so i know whats going on'
This addresses poor visibility into coordinator errors and failures.
Next step: Fix SSH timeout issue blocking worker execution.
2025-12-01 11:49:23 +01:00
mindesbunister
db33af9f17
fix: Stop button database reset + UI state display (DATABASE-FIRST ARCHITECTURE)
CRITICAL FIXES:
1. Stop button now resets database FIRST (before pkill)
- Database cleanup happens even if coordinator crashed
- Prevents stale 'running' chunks from blocking a restart
- Uses a Node.js SQLite library (not the sqlite3 CLI), so it works inside the Docker container
2. UI enhancement - 4-state display
- ⚡ Processing (running > 0)
- ⏳ Pending (pending > 0, running = 0)
- ✅ Complete (all completed)
- ⏸️ Idle (no work queued) [NEW]
- Shows pending chunk count when present
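A minimal sketch of the 4-state mapping (function and field names are illustrative, not the actual code in app/cluster/page.tsx):

    type ClusterState = 'processing' | 'pending' | 'complete' | 'idle'

    function deriveState(c: { running: number; pending: number; completed: number }): ClusterState {
      if (c.running > 0) return 'processing'   // ⚡ chunks actively running
      if (c.pending > 0) return 'pending'      // ⏳ queued but nothing running
      if (c.completed > 0) return 'complete'   // ✅ all queued work finished
      return 'idle'                            // ⏸️ no work queued
    }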
TECHNICAL DETAILS:
- Replaced sqlite3 CLI calls with proper Node.js API
- Fixed permissions: chown 1001:1001 cluster/ so the container user can write to the database
- Database-first logic: reset → pkill → verify
- Detailed logging for each operation step
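A minimal sketch of the database-first stop sequence described above (the commit does not name the SQLite package; better-sqlite3 and a chunks table are assumed here for illustration):

    import Database from 'better-sqlite3'
    import { execSync } from 'child_process'

    export function stopCluster(dbPath = 'cluster/exploration.db'): boolean {
      const db = new Database(dbPath)
      try {
        // 1. Reset first, so stale 'running' chunks are cleared even if the
        //    coordinator already crashed.
        const reset = db.prepare("UPDATE chunks SET status = 'pending' WHERE status = 'running'").run()
        console.log(`Reset ${reset.changes} running chunk(s) to pending`)

        // 2. Then kill any coordinator processes that remain.
        try { execSync('pkill -f distributed_coordinator.py') } catch { /* nothing to kill */ }

        // 3. Verify nothing is still marked running.
        const left = db.prepare("SELECT COUNT(*) AS n FROM chunks WHERE status = 'running'").get() as { n: number }
        return left.n === 0
      } finally {
        db.close()
      }
    }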
FILES CHANGED:
- app/api/cluster/control/route.ts (database operations refactored)
- app/cluster/page.tsx (4-state UI display)
VERIFIED:
- Stop button successfully reset 3 'running' chunks → 'pending'
- UI correctly shows Idle state after Stop
- Container logs show detailed operation flow
- Database operations work in Docker environment
DEPLOYMENT:
- Container rebuilt with fixed code
- Tested with real stale database (3 running chunks)
- All operations working correctly
2025-12-01 11:34:47 +01:00
mindesbunister
cc56b72df2
fix: Database-first cluster status detection + Stop button clarification
CRITICAL FIX (Nov 30, 2025):
- Dashboard showed 'idle' despite 22+ worker processes running
- Root cause: SSH-based worker detection timing out
- Solution: Check database for running chunks FIRST
Changes:
1. app/api/cluster/status/route.ts:
- Query exploration database before SSH detection
- If running chunks exist, mark workers 'active' even if SSH fails
- Override worker status: 'offline' → 'active' when chunks running
- Log: '✅ Cluster status: ACTIVE (database shows running chunks)'
- Database is the source of truth; SSH is used only for supplementary metrics (see the sketch after this list)
2. app/cluster/page.tsx:
- Stop button ALREADY EXISTS (conditionally shown)
- Shows Start when status='idle', Stop when status='active'
- No code changes needed - fixed by status detection
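A minimal sketch of the database-first detection (table/column names and the SSH helper are assumptions, not the real route code):

    import Database from 'better-sqlite3'

    type WorkerStatus = 'active' | 'offline'

    export function clusterStatus(dbPath = 'cluster/exploration.db') {
      const db = new Database(dbPath)
      const running = (db.prepare(
        "SELECT COUNT(*) AS n FROM chunks WHERE status = 'running'"
      ).get() as { n: number }).n
      db.close()

      // SSH probing can time out, so treat it as supplementary only.
      let workers: WorkerStatus[] = probeWorkersOverSsh()   // hypothetical helper

      if (running > 0) {
        // Database is the source of truth: running chunks mean the cluster is
        // active even when every SSH probe failed.
        workers = workers.map((): WorkerStatus => 'active')
        console.log('✅ Cluster status: ACTIVE (database shows running chunks)')
      }
      return { status: running > 0 ? 'active' : 'idle', workers }
    }

    // Stand-in for the SSH-based detection that can time out.
    function probeWorkersOverSsh(): WorkerStatus[] {
      return ['offline', 'offline']
    }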
Result:
- Dashboard now shows 'ACTIVE' with 2 workers (correct)
- Workers show 'active' status (was 'offline')
- Stop button automatically visible when cluster active
- System resilient to SSH timeouts/network issues
Verified:
- Container restarted: Nov 30 21:18 UTC
- API tested: Returns status='active', activeWorkers=2
- Logs confirm: Database-first logic working
- Workers confirmed running: 22+ processes on worker1, additional worker processes on worker2
2025-11-30 22:23:01 +01:00
mindesbunister
8a3141e793
feat: Add cluster page navigation
- Add EPYC Cluster card to landing page (first position, purple/pink gradient)
- Add back button to cluster page (animated left arrow, links to dashboard)
- Update landing page grid layout (lg:grid-cols-3 xl:grid-cols-4 for 7 cards)
- Complete bidirectional navigation: dashboard ↔ cluster monitoring
Navigation features:
- Cluster card: 🖥️ icon, "Monitor distributed parameter exploration" description
- Back button: Animated hover effect (arrow slides left, color transitions)
- Responsive grid: 2 cols (mobile), 3 cols (tablet), 4 cols (desktop)
- Consistent styling with existing navigation cards
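A minimal sketch of the responsive grid (the lg:/xl: classes are from this commit; the base grid-cols-2 class and component name are assumptions):

    import type { ReactNode } from 'react'

    // 2 columns on mobile, 3 from lg (tablet), 4 from xl (desktop).
    export function LandingGrid({ children }: { children: ReactNode }) {
      return (
        <div className="grid grid-cols-2 gap-6 lg:grid-cols-3 xl:grid-cols-4">
          {children}
        </div>
      )
    }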
2025-11-30 13:18:03 +01:00
mindesbunister
b77282b560
feat: Add EPYC cluster distributed sweep with web UI
New Features:
- Distributed coordinator orchestrates 2x AMD EPYC 16-core servers
- 64 total cores processing 12M parameter combinations (70% CPU limit)
- Worker1 (pve-nu-monitor01): Direct SSH access at 10.10.254.106
- Worker2 (bd-host01): 2-hop SSH through worker1 (10.20.254.100)
- Web UI at /cluster shows real-time status and AI recommendations
- API endpoint /api/cluster/status serves cluster metrics
- Auto-refresh every 30s with top strategies and actionable insights
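A minimal sketch of the 30-second polling loop behind the /cluster page (hook name and response shape are assumptions):

    'use client'
    import { useEffect, useState } from 'react'

    export function useClusterStatus() {
      const [status, setStatus] = useState<unknown>(null)
      useEffect(() => {
        const load = () =>
          fetch('/api/cluster/status')
            .then(r => r.json())
            .then(setStatus)
            .catch(() => { /* keep last known status on network errors */ })
        load()
        const id = setInterval(load, 30_000)   // auto-refresh every 30s
        return () => clearInterval(id)
      }, [])
      return status
    }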
Files Added:
- cluster/distributed_coordinator.py (510 lines) - Main orchestrator
- cluster/distributed_worker.py (271 lines) - Worker1 script
- cluster/distributed_worker_bd_clean.py (275 lines) - Worker2 script
- cluster/monitor_bd_host01.sh - Monitoring script
- app/api/cluster/status/route.ts (274 lines) - API endpoint
- app/cluster/page.tsx (258 lines) - Web UI
- cluster/CLUSTER_SETUP.md - Complete setup and access documentation
Technical Details:
- SQLite database tracks chunk assignments
- 10,000 combinations per chunk (1,195 total chunks)
- multiprocessing.Pool with 70% CPU limit (22 cores per EPYC)
- SSH/SCP for deployment and result collection
- Handles 2-hop SSH for bd-host01 access
- Results in CSV format with top strategies ranked
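The coordinator itself is Python; purely to illustrate the chunk bookkeeping, here is a TypeScript sketch with assumed table and column names (1,195 chunks × 10,000 combinations ≈ 11.95M, which the commit rounds to 12M):

    import Database from 'better-sqlite3'

    const TOTAL_COMBINATIONS = 11_950_000   // ≈12M parameter combinations
    const CHUNK_SIZE = 10_000               // combinations per chunk
    const totalChunks = Math.ceil(TOTAL_COMBINATIONS / CHUNK_SIZE)   // 1,195

    const db = new Database('cluster/exploration.db')
    db.exec(`
      CREATE TABLE IF NOT EXISTS chunks (
        id INTEGER PRIMARY KEY,
        start_index INTEGER NOT NULL,
        end_index INTEGER NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending',   -- pending | running | completed
        worker TEXT                                -- host that claimed the chunk
      )
    `)
    const insert = db.prepare('INSERT OR IGNORE INTO chunks (id, start_index, end_index) VALUES (?, ?, ?)')
    for (let i = 0; i < totalChunks; i++) {
      insert.run(i, i * CHUNK_SIZE, Math.min((i + 1) * CHUNK_SIZE, TOTAL_COMBINATIONS))
    }
    db.close()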
Access Documentation:
- Worker1: ssh root@10.10.254.106
- Worker2: ssh root@10.10.254.106 "ssh root@10.20.254.100 "
- Web UI: http://localhost:3001/cluster
- See CLUSTER_SETUP.md for complete guide
Status: Deployed and operational
2025-11-30 13:02:18 +01:00