- Changed default chunk_size from 10,000 to 2,000
- Fixes bug where coordinator exited immediately for 4,096 combo exploration
- Coordinator was calculating: chunk 1 starts at 10,000 > 4,096 total = 'all done'
- Now creates 2-3 appropriately-sized chunks for distribution
- Verified: Workers now start and process assigned chunks
- Status: ✅ Docker rebuilt and deployed to port 3001
2.1 KiB
Cluster Start Button Fix - Nov 30, 2025
Problem
The cluster start button in the web dashboard was executing the coordinator command successfully, but the coordinator would exit immediately without doing any work.
Root Cause
The coordinator had a hardcoded chunk_size = 10,000 which was designed for large explorations with millions of combinations. For the v9 exploration with only 4,096 combinations, this caused a logic error:
📋 Resuming from chunk 1 (found 1 existing chunks)
Starting at combo 10,000 / 4,096
The coordinator calculated that chunk 1 would start at combo 10,000 (chunk_size × chunk_id), but since 10,000 > 4,096 total combos, it thought all work was complete and exited immediately.
Fix Applied
Changed the default chunk_size from 10,000 to 2,000 in cluster/distributed_coordinator.py:
# Before:
parser.add_argument('--chunk-size', type=int, default=10000,
help='Number of combinations per chunk (default: 10000)')
# After:
parser.add_argument('--chunk-size', type=int, default=2000,
help='Number of combinations per chunk (default: 2000)')
This creates 2-3 smaller chunks for the 4,096 combination exploration, allowing proper distribution across workers.
Verification
- ✅ Manual coordinator run created chunks successfully
- ✅ Both workers (worker1 and worker2) started processing
- ✅ Docker image rebuilt with fix
- ✅ Container deployed and running
Result
The start button now works correctly:
- Coordinator creates appropriate-sized chunks
- Workers are assigned work
- Exploration runs to completion
- Progress is tracked in the database
Next Steps
You can now use the start button in the web dashboard at http://10.0.0.48:3001/cluster to start explorations. The system will:
- Create 2-3 chunks of ~2,000 combinations each
- Distribute to worker1 and worker2
- Run for ~30-60 minutes to complete 4,096 combinations
- Save top 100 results to CSV
- Update dashboard with live progress
Files Modified
cluster/distributed_coordinator.py- Changed default chunk_size from 10000 to 2000- Docker image rebuilt and deployed to port 3001