- Changed default chunk_size from 10,000 to 2,000
- Fixes bug where coordinator exited immediately for 4,096 combo exploration
- Coordinator computed the start of chunk 1 as combo 10,000, which exceeds the 4,096 total, so it treated the exploration as already done
- Now creates 2-3 appropriately-sized chunks for distribution
- Verified: Workers now start and process assigned chunks
- Status: ✅ Docker rebuilt and deployed to port 3001
# Cluster Start Button Fix - Nov 30, 2025

## Problem

The cluster start button in the web dashboard was executing the coordinator command successfully, but the coordinator would exit immediately without doing any work.

## Root Cause

The coordinator's default `chunk_size` of 10,000 was designed for large explorations with millions of combinations. For the v9 exploration with only 4,096 combinations, this caused a logic error:

```
📋 Resuming from chunk 1 (found 1 existing chunks)
Starting at combo 10,000 / 4,096
```

The coordinator calculated that chunk 1 would start at combo 10,000 (chunk_size × chunk_id), but since 10,000 > 4,096 total combos, it thought all work was complete and exited immediately.
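
The arithmetic behind that early exit can be reproduced in a few lines. This is a minimal sketch, not the coordinator's actual code: the function names `resume_offset` and `has_remaining_work` are hypothetical, and the only thing taken from the note is the `chunk_size × chunk_id` formula and the 4,096 total.

```python
def resume_offset(chunk_size: int, next_chunk_id: int) -> int:
    """Start combo index of the chunk to resume from (chunk_size x chunk_id)."""
    return chunk_size * next_chunk_id


def has_remaining_work(chunk_size: int, next_chunk_id: int, total_combos: int) -> bool:
    """True if the resume offset still falls inside the combination space."""
    return resume_offset(chunk_size, next_chunk_id) < total_combos


TOTAL_COMBOS = 4096  # size of the v9 exploration

# Old default: chunk 1 starts at 10,000, past the end of the 4,096 combos.
old_start = resume_offset(10_000, 1)
old_has_work = has_remaining_work(10_000, 1, TOTAL_COMBOS)

# New default: chunk 1 starts at 2,000, still inside the 4,096 combos.
new_start = resume_offset(2_000, 1)
new_has_work = has_remaining_work(2_000, 1, TOTAL_COMBOS)

print(old_start, old_has_work)  # 10000 False
print(new_start, new_has_work)  # 2000 True
```

With the old default the check reports no remaining work, which matches the immediate exit seen in the log.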

## Fix Applied

Changed the default chunk_size from 10,000 to 2,000 in `cluster/distributed_coordinator.py`:

```python
# Before:
parser.add_argument('--chunk-size', type=int, default=10000,
                    help='Number of combinations per chunk (default: 10000)')

# After:
parser.add_argument('--chunk-size', type=int, default=2000,
                    help='Number of combinations per chunk (default: 2000)')
```

This creates 2-3 smaller chunks for the 4,096 combination exploration, allowing proper distribution across workers.
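
To see how the chunk size affects the split, the sketch below lists the chunk boundaries for both defaults. It assumes chunks are cut at fixed `chunk_size` boundaries with a shorter final chunk; `plan_chunks` is a hypothetical helper, and the real coordinator may partition the work differently.

```python
def plan_chunks(total_combos: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) ranges that cover range(total_combos)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total_combos))
        for start in range(0, total_combos, chunk_size)
    ]


chunks_new = plan_chunks(4096, 2000)   # new default
chunks_old = plan_chunks(4096, 10000)  # old default

print(chunks_new)  # [(0, 2000), (2000, 4000), (4000, 4096)]
print(chunks_old)  # [(0, 4096)]
```

Under this assumption the old default puts the whole exploration in a single chunk, so there is nothing to spread across two workers, while the new default leaves a small 96-combination remainder chunk. Passing `--chunk-size` explicitly overrides the default for explorations of a different size.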

## Verification

1. ✅ Manual coordinator run created chunks successfully
2. ✅ Both workers (worker1 and worker2) started processing
3. ✅ Docker image rebuilt with fix
4. ✅ Container deployed and running

## Result

The start button now works correctly:

- Coordinator creates appropriate-sized chunks
- Workers are assigned work
- Exploration runs to completion
- Progress is tracked in the database

## Next Steps

You can now use the start button in the web dashboard at http://10.0.0.48:3001/cluster to start explorations. The system will:

1. Create 2-3 chunks of ~2,000 combinations each
2. Distribute to worker1 and worker2
3. Run for ~30-60 minutes to complete 4,096 combinations
4. Save top 100 results to CSV
5. Update dashboard with live progress
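
The distribution step above could look something like the following. This note does not say how the coordinator picks a worker for each chunk, so the round-robin policy and the `assign_round_robin` helper are assumptions made purely for illustration; only the worker names come from the note.

```python
from itertools import cycle


def assign_round_robin(chunk_ids, workers):
    """Map each worker name to the chunk ids it receives, in rotation."""
    assignment = {worker: [] for worker in workers}
    for chunk_id, worker in zip(chunk_ids, cycle(workers)):
        assignment[worker].append(chunk_id)
    return assignment


# Three chunks (ids 0-2) spread over the two workers named in this note.
assignment = assign_round_robin([0, 1, 2], ["worker1", "worker2"])
print(assignment)  # {'worker1': [0, 2], 'worker2': [1]}
```

A coordinator that hands out chunks as workers become free would balance uneven chunk sizes better than a fixed rotation; the sketch only shows that several chunks are needed before both workers have anything to do.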

## Files Modified

- `cluster/distributed_coordinator.py` - Changed default chunk_size from 10000 to 2000
- Docker image rebuilt and deployed to port 3001