docs: Major documentation reorganization + ENV variable reference

**Documentation Structure:**
- Created docs/ subdirectory organization (analysis/, architecture/, bugs/, cluster/, deployments/, roadmaps/, setup/, archived/)
- Moved 68 root markdown files to appropriate categories
- Root directory now clean (only README.md remains)
- Total: 83 markdown files now organized by purpose

**New Content:**
- Added comprehensive Environment Variable Reference to copilot-instructions.md
- 100+ ENV variables documented with types, defaults, purpose, notes
- Organized by category: Required (Drift/RPC/Pyth), Trading Config (quality/leverage/sizing), ATR System, Runner System, Risk Limits, Notifications, etc.
- Includes usage examples (correct vs wrong patterns)

**File Distribution:**
- docs/analysis/ - Performance analyses, blocked signals, profit projections
- docs/architecture/ - Adaptive leverage, ATR trailing, indicator tracking
- docs/bugs/ - CRITICAL_*.md, FIXES_*.md bug reports (7 files)
- docs/cluster/ - EPYC setup, distributed computing docs (3 files)
- docs/deployments/ - *_COMPLETE.md, DEPLOYMENT_*.md status (12 files)
- docs/roadmaps/ - All *ROADMAP*.md strategic planning files (7 files)
- docs/setup/ - TradingView guides, signal quality, n8n setup (8 files)
- docs/archived/2025_pre_nov/ - Obsolete verification checklist (1 file)

**Key Improvements:**
- ENV variable reference: Single source of truth for all configuration
- Common Pitfalls #68-71: Already complete, verified during audit
- Better findability: Category-based navigation vs 68 files in root
- Preserves history: All files git mv (rename), not copy/delete
- Zero broken functionality: Only documentation moved, no code changes

**Verification:**
- 83 markdown files now in docs/ subdirectories
- Root directory cleaned: 68 files → 0 files (except README.md)
- Git history preserved for all moved files
- Container running: trading-bot-v4 (no restart needed)

**Next Steps:**
- Create README.md files in each docs subdirectory
- Add navigation index
- Update main README.md with new structure
- Consolidate duplicate deployment docs
- Archive truly obsolete files (old SQL backups)

See: docs/analysis/CLEANUP_PLAN.md for complete reorganization strategy
docs/cluster/CLUSTER_START_BUTTON_FIX.md
@@ -0,0 +1,178 @@
# Cluster Start Button Fix - COMPLETE (Nov 30 + Dec 1, 2025)

## Problem History

### Original Issue (Nov 30, 2025)
The cluster start button in the web dashboard was executing the coordinator command successfully, but the coordinator would exit immediately without doing any work.

**Root Cause:** The coordinator had a hardcoded `chunk_size = 10,000`, which was designed for large explorations with millions of combinations. For the v9 exploration with only 4,096 combinations, the next chunk's start offset (10,000) was already past the end of the work, so the coordinator treated the exploration as finished.

### Second Issue (Dec 1, 2025) - DATABASE STALE STATE
**Symptom:** Start button showed "already running" when the cluster was not actually running.

**Root Cause:** The database had stale chunks in `status='running'` state left over from a previously crashed or killed coordinator process, but no coordinator process was actually running.

**Impact:** User could not start cluster for parameter optimization work (4,000 combinations pending).

## Solutions Implemented

### Fix 1: Coordinator Chunk Size (Nov 30, 2025)
Changed the default `chunk_size` in `cluster/distributed_coordinator.py` from 10,000 to 2,000 so that the 4,096-combination exploration is split into several chunks.

### Fix 2: Database Cleanup in Control Endpoint (Dec 1, 2025) - CRITICAL FIX

**File:** `app/api/cluster/control/route.ts`

**Problem:** Control endpoint didn't check database state, only process state. This meant:
- Crashed coordinator left chunks in "running" state
- Status API checked database → saw "running" → reported "active"
- Start button disabled when status = "active"
- User couldn't start cluster even though nothing was running
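
The mismatch described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: it assumes a `chunks` table with the `status` column used elsewhere in this document, and the helper names `db_says_running` and `cluster_status` are hypothetical. It shows why a database-only check keeps reporting work in progress after a crash, and how combining it with a process check avoids that.

```python
import sqlite3


def db_says_running(conn: sqlite3.Connection) -> bool:
    """Return True if any chunk row is marked 'running' (database view only)."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM chunks WHERE status = 'running'"
    ).fetchone()
    return count > 0


def cluster_status(conn: sqlite3.Connection, coordinator_alive: bool) -> str:
    """Hypothetical combined check: 'active' requires a live process too."""
    if coordinator_alive:
        return "active"
    if db_says_running(conn):
        return "stale"  # rows claim work is running, but no process exists
    return "idle"


# In-memory stand-in for cluster/exploration.db (the schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO chunks (status) VALUES (?)",
    [("completed",), ("running",), ("pending",)],
)

print(db_says_running(conn))                          # database-only view
print(cluster_status(conn, coordinator_alive=False))  # combined view
```

With these sample rows the database-only view is `True` even though no coordinator is alive, which is the state that kept the start button disabled; the combined view reports `stale` instead.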

**Solution Implemented:**

1. **Enhanced Start Action:**
   - Check if coordinator already running (prevent duplicates)
   - Reset any stale "running" chunks to "pending" before starting
   - Verify coordinator actually started, return log output on failure

   ```typescript
   // Excerpt from the start action. Assumes execAsync = promisify(exec) and that
   // path and NextResponse are imported at the top of route.ts.

   // Check if coordinator is already running
   const checkCmd = 'ps aux | grep distributed_coordinator.py | grep -v grep | wc -l'
   const { stdout: checkStdout } = await execAsync(checkCmd)
   const alreadyRunning = parseInt(checkStdout.trim(), 10) > 0

   if (alreadyRunning) {
     return NextResponse.json({
       success: false,
       error: 'Coordinator is already running',
     }, { status: 400 })
   }

   // Reset any stale "running" chunks (orphaned from crashed coordinator)
   const dbPath = path.join(process.cwd(), 'cluster', 'exploration.db')
   // Quote the path so a working directory containing spaces does not break the command
   const resetCmd = `sqlite3 "${dbPath}" "UPDATE chunks SET status='pending', assigned_worker=NULL, started_at=NULL WHERE status='running';"`
   await execAsync(resetCmd)
   console.log('✅ Database cleanup complete')

   // Start the coordinator
   const startCmd = 'cd /home/icke/traderv4/cluster && nohup python3 distributed_coordinator.py > coordinator.log 2>&1 &'
   await execAsync(startCmd)
   ```

2. **Enhanced Stop Action:**
   - Reset running chunks to pending when stopping
   - Prevents future stale database states
   - Graceful handling if no processes found
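
The start action is described as returning log output when the coordinator fails to come up. The sketch below illustrates that idea in Python only; the real endpoint is TypeScript, and the helper `tail_log`, the function `start_result`, and the response shape are assumptions made for this example.

```python
from collections import deque
from pathlib import Path
import tempfile


def tail_log(path: Path, max_lines: int = 20) -> str:
    """Return the last max_lines lines of a log file, or '' if it is missing."""
    if not path.exists():
        return ""
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        # deque with maxlen keeps only the most recent lines while streaming
        return "".join(deque(handle, maxlen=max_lines))


def start_result(coordinator_alive: bool, log_path: Path) -> dict:
    """Build a hypothetical response: include the log tail only on failure."""
    if coordinator_alive:
        return {"success": True}
    return {
        "success": False,
        "error": "Coordinator failed to start",
        "log": tail_log(log_path),
    }


# Demonstration with a temporary file standing in for coordinator.log.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "coordinator.log"
    log_path.write_text(
        "line 1\nline 2\nTraceback: example failure\n", encoding="utf-8"
    )
    failed = start_result(coordinator_alive=False, log_path=log_path)
    print(failed["log"])
```

Returning only the tail keeps the response small while still surfacing the most recent error, which is usually what a user needs when the button appears to do nothing.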

**Immediate Fix Applied (Nov 30):**
```bash
sqlite3 cluster/exploration.db "UPDATE chunks SET status='pending', assigned_worker=NULL, started_at=NULL WHERE status='running';"
```

**Result:** Cluster status changed from "active" to "idle", start button functional again.
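
The same reset can be tried safely against a throwaway database before touching the real one. This is a minimal sketch using Python's standard `sqlite3` module: the column names come from the command above, while the column types, the sample rows, and the helper `reset_stale_chunks` are assumptions for illustration.

```python
import sqlite3

RESET_SQL = """
UPDATE chunks
SET status = 'pending', assigned_worker = NULL, started_at = NULL
WHERE status = 'running'
"""


def reset_stale_chunks(conn: sqlite3.Connection) -> int:
    """Move every 'running' chunk back to 'pending'; return rows changed."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(RESET_SQL)
    return cursor.rowcount


# In-memory stand-in for cluster/exploration.db; column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks ("
    "id INTEGER PRIMARY KEY, status TEXT, assigned_worker TEXT, started_at TEXT)"
)
conn.executemany(
    "INSERT INTO chunks (status, assigned_worker, started_at) VALUES (?, ?, ?)",
    [
        ("running", "worker1", "2025-11-30T22:00:00"),    # sample data only
        ("running", "worker2", "2025-11-30T22:00:05"),    # sample data only
        ("completed", "worker1", "2025-11-30T21:00:00"),  # sample data only
    ],
)

reset_count = reset_stale_chunks(conn)
print(f"reset {reset_count} stale chunk(s)")
```

Completed chunks are left untouched because the `WHERE` clause matches only rows in the `running` state, so finished work is not re-queued.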

## Verification Checklist

- [x] Fix 1: Coordinator chunk size adjusted (Nov 30)
- [x] Fix 2: Database cleanup applied (Dec 1)
- [x] Cluster status shows "idle" (verified)
- [x] Control endpoint enhanced (committed 5d07fbb)
- [x] Docker container rebuilt and restarted
- [x] Code committed and pushed
- [ ] **USER ACTION NEEDED:** Test start button functionality
- [ ] **USER ACTION NEEDED:** Verify coordinator starts and workers begin processing

## Testing Instructions

1. **Open cluster UI:** http://localhost:3001/cluster
2. **Click Start Cluster button**
3. **Expected behavior:**
   - Button should trigger start action
   - Cluster status should change from "idle" to "active"
   - Active workers should increase from 0 to 2
   - Workers should begin processing parameter combinations

4. **Verify on EPYC servers:**
   ```bash
   # Check coordinator running
   ssh root@10.10.254.106 "ps aux | grep distributed_coordinator | grep -v grep"

   # Check workers running (exclude the grep process itself from the count)
   ssh root@10.10.254.106 "ps aux | grep distributed_worker | grep -v grep | wc -l"
   ```

5. **Check database state:**
   ```bash
   sqlite3 cluster/exploration.db "SELECT id, status, assigned_worker FROM chunks ORDER BY id;"
   ```

## Status

✅ **FIX 1 COMPLETE** (Nov 30, 2025)
- Coordinator chunk size fixed
- Verified coordinator can process 4,096 combinations

✅ **FIX 2 DEPLOYED** (Dec 1, 2025 08:38 UTC)
- Container rebuilt: 77s build time
- Container restarted: trading-bot-v4 running
- Cluster status: idle (correct)
- Database cleanup logic active in start/stop actions
- Ready for user testing

⏳ **PENDING USER VERIFICATION**
- User needs to test start button functionality
- User needs to verify coordinator starts successfully
- User needs to confirm workers begin processing

## Git Commits

**Nov 30:** Coordinator chunk size fix
**Dec 1 (5d07fbb):** "critical: Fix EPYC cluster start button - database cleanup before start"
**Files Changed:** `app/api/cluster/control/route.ts` (61 insertions, 5 deletions)

## Fix 1 Details (Nov 30, 2025)

With the old default `chunk_size` of 10,000, the coordinator calculated that chunk 1 would start at combo 10,000 (`chunk_size × chunk_id`). Because 10,000 is greater than the 4,096 total combos, it concluded that all work was complete and exited immediately.

## Fix Applied
Changed the default chunk_size from 10,000 to 2,000 in `cluster/distributed_coordinator.py`:

```python
# Before:
parser.add_argument('--chunk-size', type=int, default=10000,
                    help='Number of combinations per chunk (default: 10000)')

# After:
parser.add_argument('--chunk-size', type=int, default=2000,
                    help='Number of combinations per chunk (default: 2000)')
```

This creates 2-3 smaller chunks for the 4,096 combination exploration, allowing proper distribution across workers.
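
The effect of the two defaults can be checked with a little arithmetic. The sketch below assumes the coordinator splits the search space into consecutive ranges of `chunk_size` combinations; the helper `chunk_ranges` is hypothetical and is not taken from `distributed_coordinator.py`. Under that assumption, a chunk size of 2,000 yields three chunks for 4,096 combinations, while 10,000 yields a single chunk whose successor would start past the end of the work.

```python
import math


def chunk_ranges(total: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split [0, total) into consecutive half-open ranges of at most chunk_size."""
    count = math.ceil(total / chunk_size)
    return [
        (i * chunk_size, min((i + 1) * chunk_size, total))
        for i in range(count)
    ]


TOTAL_COMBOS = 4_096

for size in (10_000, 2_000):
    ranges = chunk_ranges(TOTAL_COMBOS, size)
    next_start = size * 1  # start offset of chunk 1: chunk_size x chunk_id
    print(f"chunk_size={size}: {len(ranges)} chunk(s), "
          f"chunk 1 would start at {next_start}")
    for chunk_id, (start, end) in enumerate(ranges):
        print(f"  chunk {chunk_id}: combos {start}..{end - 1} ({end - start} combos)")
```

With two workers, the smaller default gives each worker a chunk to process at the same time, with a short final chunk covering the remaining combinations.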

## Verification
1. ✅ Manual coordinator run created chunks successfully
2. ✅ Both workers (worker1 and worker2) started processing
3. ✅ Docker image rebuilt with fix
4. ✅ Container deployed and running

## Result
The start button now works correctly:
- Coordinator creates appropriate-sized chunks
- Workers are assigned work
- Exploration runs to completion
- Progress is tracked in the database

## Next Steps
You can now use the start button in the web dashboard at http://10.0.0.48:3001/cluster to start explorations. The system will:
1. Create 2-3 chunks of ~2,000 combinations each
2. Distribute to worker1 and worker2
3. Run for ~30-60 minutes to complete 4,096 combinations
4. Save top 100 results to CSV
5. Update dashboard with live progress

## Files Modified
- `cluster/distributed_coordinator.py` - Changed default chunk_size from 10000 to 2000
- Docker image rebuilt and deployed to port 3001