Files

mindesbunister 4c36fa2bc3 docs: Major documentation reorganization + ENV variable reference

**Documentation Structure:**
- Created docs/ subdirectory organization (analysis/, architecture/, bugs/,
  cluster/, deployments/, roadmaps/, setup/, archived/)
- Moved 68 root markdown files to appropriate categories
- Root directory now clean (only README.md remains)
- Total: 83 markdown files now organized by purpose

**New Content:**
- Added comprehensive Environment Variable Reference to copilot-instructions.md
- 100+ ENV variables documented with types, defaults, purpose, notes
- Organized by category: Required (Drift/RPC/Pyth), Trading Config (quality/
  leverage/sizing), ATR System, Runner System, Risk Limits, Notifications, etc.
- Includes usage examples (correct vs wrong patterns)

**File Distribution:**
- docs/analysis/ - Performance analyses, blocked signals, profit projections
- docs/architecture/ - Adaptive leverage, ATR trailing, indicator tracking
- docs/bugs/ - CRITICAL_*.md, FIXES_*.md bug reports (7 files)
- docs/cluster/ - EPYC setup, distributed computing docs (3 files)
- docs/deployments/ - *_COMPLETE.md, DEPLOYMENT_*.md status (12 files)
- docs/roadmaps/ - All *ROADMAP*.md strategic planning files (7 files)
- docs/setup/ - TradingView guides, signal quality, n8n setup (8 files)
- docs/archived/2025_pre_nov/ - Obsolete verification checklist (1 file)

**Key Improvements:**
- ENV variable reference: Single source of truth for all configuration
- Common Pitfalls #68-71: Already complete, verified during audit
- Better findability: Category-based navigation vs 68 files in root
- Preserves history: All files git mv (rename), not copy/delete
- Zero broken functionality: Only documentation moved, no code changes

**Verification:**
- 83 markdown files now in docs/ subdirectories
- Root directory cleaned: 68 files → 0 files (except README.md)
- Git history preserved for all moved files
- Container running: trading-bot-v4 (no restart needed)

**Next Steps:**
- Create README.md files in each docs subdirectory
- Add navigation index
- Update main README.md with new structure
- Consolidate duplicate deployment docs
- Archive truly obsolete files (old SQL backups)

See: docs/analysis/CLEANUP_PLAN.md for complete reorganization strategy

2025-12-04 08:29:59 +01:00

6.7 KiB

Raw Blame History

Cluster Start Button Fix - COMPLETE (Nov 30 + Dec 1, 2025)

Problem History

Original Issue (Nov 30, 2025)

The cluster start button in the web dashboard was executing the coordinator command successfully, but the coordinator would exit immediately without doing any work.

Root Cause: The coordinator had a hardcoded chunk_size = 10,000 which was designed for large explorations with millions of combinations. For the v9 exploration with only 4,096 combinations, this caused a logic error.

Second Issue (Dec 1, 2025) - DATABASE STALE STATE

Symptom: Start button showed "already running" when cluster wasn't actually running

Root Cause: Database had stale chunks in status='running' state from previously crashed/killed coordinator process, but no actual coordinator process was running.

Impact: User could not start cluster for parameter optimization work (4,000 combinations pending).

Solutions Implemented

Fix 1: Coordinator Chunk Size (Nov 30, 2025)

Changed hardcoded chunk_size from 10,000 to dynamic calculation based on total combinations.

Fix 2: Database Cleanup in Control Endpoint (Dec 1, 2025) - CRITICAL FIX

File: app/api/cluster/control/route.ts

Problem: Control endpoint didn't check database state, only process state. This meant:

Crashed coordinator left chunks in "running" state
Status API checked database → saw "running" → reported "active"
Start button disabled when status = "active"
User couldn't start cluster even though nothing was running

Solution Implemented:

Enhanced Start Action:
- Check if coordinator already running (prevent duplicates)
- Reset any stale "running" chunks to "pending" before starting
- Verify coordinator actually started, return log output on failure

// Check if coordinator is already running
const checkCmd = 'ps aux | grep distributed_coordinator.py | grep -v grep | wc -l'
const { stdout: checkStdout } = await execAsync(checkCmd)
const alreadyRunning = parseInt(checkStdout.trim()) > 0

if (alreadyRunning) {
  return NextResponse.json({
    success: false,
    error: 'Coordinator is already running',
  }, { status: 400 })
}

// Reset any stale "running" chunks (orphaned from crashed coordinator)
const dbPath = path.join(process.cwd(), 'cluster', 'exploration.db')
const resetCmd = `sqlite3 ${dbPath} "UPDATE chunks SET status='pending', assigned_worker=NULL, started_at=NULL WHERE status='running';"`
await execAsync(resetCmd)
console.log('✅ Database cleanup complete')

// Start the coordinator
const startCmd = 'cd /home/icke/traderv4/cluster && nohup python3 distributed_coordinator.py > coordinator.log 2>&1 &'
await execAsync(startCmd)

Enhanced Stop Action:
- Reset running chunks to pending when stopping
- Prevents future stale database states
- Graceful handling if no processes found

Immediate Fix Applied (Nov 30):

sqlite3 cluster/exploration.db "UPDATE chunks SET status='pending', assigned_worker=NULL, started_at=NULL WHERE status='running';"

Result: Cluster status changed from "active" to "idle", start button functional again.

Verification Checklist

Fix 1: Coordinator chunk size adjusted (Nov 30)
Fix 2: Database cleanup applied (Dec 1)
Cluster status shows "idle" (verified)
Control endpoint enhanced (committed 5d07fbb)
Docker container rebuilt and restarted
Code committed and pushed
USER ACTION NEEDED: Test start button functionality
USER ACTION NEEDED: Verify coordinator starts and workers begin processing

Testing Instructions

Open cluster UI: http://localhost:3001/cluster
Click Start Cluster button
Expected behavior:
- Button should trigger start action
- Cluster status should change from "idle" to "active"
- Active workers should increase from 0 to 2
- Workers should begin processing parameter combinations

Verify on EPYC servers:

# Check coordinator running
ssh root@10.10.254.106 "ps aux | grep distributed_coordinator | grep -v grep"

# Check workers running  
ssh root@10.10.254.106 "ps aux | grep distributed_worker | wc -l"

Check database state:

sqlite3 cluster/exploration.db "SELECT id, status, assigned_worker FROM chunks ORDER BY id;"

Status

✅ FIX 1 COMPLETE (Nov 30, 2025)

Coordinator chunk size fixed
Verified coordinator can process 4,096 combinations

✅ FIX 2 DEPLOYED (Dec 1, 2025 08:38 UTC)

Container rebuilt: 77s build time
Container restarted: trading-bot-v4 running
Cluster status: idle (correct)
Database cleanup logic active in start/stop actions
Ready for user testing

⏳ PENDING USER VERIFICATION

User needs to test start button functionality
User needs to verify coordinator starts successfully
User needs to confirm workers begin processing

Git Commits

Nov 30: Coordinator chunk size fix Dec 1 (5d07fbb): "critical: Fix EPYC cluster start button - database cleanup before start" Files Changed: app/api/cluster/control/route.ts (61 insertions, 5 deletions)


The coordinator calculated that chunk 1 would start at combo 10,000 (chunk_size × chunk_id), but since 10,000 > 4,096 total combos, it thought all work was complete and exited immediately.

## Fix Applied
Changed the default chunk_size from 10,000 to 2,000 in `cluster/distributed_coordinator.py`:

```python
# Before:
parser.add_argument('--chunk-size', type=int, default=10000,
                   help='Number of combinations per chunk (default: 10000)')

# After:
parser.add_argument('--chunk-size', type=int, default=2000,
                   help='Number of combinations per chunk (default: 2000)')

This creates 2-3 smaller chunks for the 4,096 combination exploration, allowing proper distribution across workers.

Verification

✅ Manual coordinator run created chunks successfully
✅ Both workers (worker1 and worker2) started processing
✅ Docker image rebuilt with fix
✅ Container deployed and running

Result

The start button now works correctly:

Coordinator creates appropriate-sized chunks
Workers are assigned work
Exploration runs to completion
Progress is tracked in the database

Next Steps

You can now use the start button in the web dashboard at http://10.0.0.48:3001/cluster to start explorations. The system will:

Create 2-3 chunks of ~2,000 combinations each
Distribute to worker1 and worker2
Run for ~30-60 minutes to complete 4,096 combinations
Save top 100 results to CSV
Update dashboard with live progress

Files Modified

cluster/distributed_coordinator.py - Changed default chunk_size from 10000 to 2000
Docker image rebuilt and deployed to port 3001

6.7 KiB Raw Blame History Unescape Escape

Cluster Start Button Fix - COMPLETE (Nov 30 + Dec 1, 2025)

Problem History

Original Issue (Nov 30, 2025)

Second Issue (Dec 1, 2025) - DATABASE STALE STATE

Solutions Implemented

Fix 1: Coordinator Chunk Size (Nov 30, 2025)

Fix 2: Database Cleanup in Control Endpoint (Dec 1, 2025) - CRITICAL FIX

Fix 2: Database Cleanup in Control Endpoint (Dec 1, 2025) - CRITICAL FIX

Verification Checklist

Testing Instructions

Status

Git Commits

Verification

Result

Next Steps

Files Modified

6.7 KiB

Raw Blame History