docs: Major documentation reorganization + ENV variable reference

**Documentation Structure:** - Created docs/ subdirectory organization (analysis/, architecture/, bugs/, cluster/, deployments/, roadmaps/, setup/, archived/) - Moved 68 root markdown files to appropriate categories - Root directory now clean (only README.md remains) - Total: 83 markdown files now organized by purpose **New Content:** - Added comprehensive Environment Variable Reference to copilot-instructions.md - 100+ ENV variables documented with types, defaults, purpose, notes - Organized by category: Required (Drift/RPC/Pyth), Trading Config (quality/ leverage/sizing), ATR System, Runner System, Risk Limits, Notifications, etc. - Includes usage examples (correct vs wrong patterns) **File Distribution:** - docs/analysis/ - Performance analyses, blocked signals, profit projections - docs/architecture/ - Adaptive leverage, ATR trailing, indicator tracking - docs/bugs/ - CRITICAL_*.md, FIXES_*.md bug reports (7 files) - docs/cluster/ - EPYC setup, distributed computing docs (3 files) - docs/deployments/ - *_COMPLETE.md, DEPLOYMENT_*.md status (12 files) - docs/roadmaps/ - All *ROADMAP*.md strategic planning files (7 files) - docs/setup/ - TradingView guides, signal quality, n8n setup (8 files) - docs/archived/2025_pre_nov/ - Obsolete verification checklist (1 file) **Key Improvements:** - ENV variable reference: Single source of truth for all configuration - Common Pitfalls #68-71: Already complete, verified during audit - Better findability: Category-based navigation vs 68 files in root - Preserves history: All files git mv (rename), not copy/delete - Zero broken functionality: Only documentation moved, no code changes **Verification:** - 83 markdown files now in docs/ subdirectories - Root directory cleaned: 68 files → 0 files (except README.md) - Git history preserved for all moved files - Container running: trading-bot-v4 (no restart needed) **Next Steps:** - Create README.md files in each docs subdirectory - Add navigation index - Update main README.md with new structure - Consolidate duplicate deployment docs - Archive truly obsolete files (old SQL backups) See: docs/analysis/CLEANUP_PLAN.md for complete reorganization strategy
2025-12-04 08:29:59 +01:00
parent e48332e347
commit 4c36fa2bc3
61 changed files with 520 additions and 37 deletions
--- a/docs/cluster/CLUSTER_START_BUTTON_FIX.md
+++ b/docs/cluster/CLUSTER_START_BUTTON_FIX.md
@@ -0,0 +1,178 @@
+# Cluster Start Button Fix - COMPLETE (Nov 30 + Dec 1, 2025)
+
+## Problem History
+
+### Original Issue (Nov 30, 2025)
+The cluster start button in the web dashboard was executing the coordinator command successfully, but the coordinator would exit immediately without doing any work.
+
+**Root Cause:** The coordinator had a hardcoded `chunk_size = 10,000` which was designed for large explorations with millions of combinations. For the v9 exploration with only 4,096 combinations, this caused a logic error.
+
+### Second Issue (Dec 1, 2025) - DATABASE STALE STATE
+**Symptom:** Start button showed "already running" when cluster wasn't actually running
+
+**Root Cause:** Database had stale chunks in `status='running'` state from previously crashed/killed coordinator process, but no actual coordinator process was running.
+
+**Impact:** User could not start cluster for parameter optimization work (4,000 combinations pending).
+
+## Solutions Implemented
+
+### Fix 1: Coordinator Chunk Size (Nov 30, 2025)
+Changed hardcoded chunk_size from 10,000 to dynamic calculation based on total combinations.
+
+### Fix 2: Database Cleanup in Control Endpoint (Dec 1, 2025) - CRITICAL FIX
+
+### Fix 2: Database Cleanup in Control Endpoint (Dec 1, 2025) - CRITICAL FIX
+
+**File:** `app/api/cluster/control/route.ts`
+
+**Problem:** Control endpoint didn't check database state, only process state. This meant:
+- Crashed coordinator left chunks in "running" state
+- Status API checked database → saw "running" → reported "active"
+- Start button disabled when status = "active"
+- User couldn't start cluster even though nothing was running
+
+**Solution Implemented:**
+
+1. **Enhanced Start Action:**
+   - Check if coordinator already running (prevent duplicates)
+   - Reset any stale "running" chunks to "pending" before starting
+   - Verify coordinator actually started, return log output on failure
+
+```typescript
+// Check if coordinator is already running
+const checkCmd = 'ps aux | grep distributed_coordinator.py | grep -v grep | wc -l'
+const { stdout: checkStdout } = await execAsync(checkCmd)
+const alreadyRunning = parseInt(checkStdout.trim()) > 0
+
+if (alreadyRunning) {
+  return NextResponse.json({
+    success: false,
+    error: 'Coordinator is already running',
+  }, { status: 400 })
+}
+
+// Reset any stale "running" chunks (orphaned from crashed coordinator)
+const dbPath = path.join(process.cwd(), 'cluster', 'exploration.db')
+const resetCmd = `sqlite3 ${dbPath} "UPDATE chunks SET status='pending', assigned_worker=NULL, started_at=NULL WHERE status='running';"`
+await execAsync(resetCmd)
+console.log('✅ Database cleanup complete')
+
+// Start the coordinator
+const startCmd = 'cd /home/icke/traderv4/cluster && nohup python3 distributed_coordinator.py > coordinator.log 2>&1 &'
+await execAsync(startCmd)
+```
+
+2. **Enhanced Stop Action:**
+   - Reset running chunks to pending when stopping
+   - Prevents future stale database states
+   - Graceful handling if no processes found
+
+**Immediate Fix Applied (Nov 30):**
+```bash
+sqlite3 cluster/exploration.db "UPDATE chunks SET status='pending', assigned_worker=NULL, started_at=NULL WHERE status='running';"
+```
+
+**Result:** Cluster status changed from "active" to "idle", start button functional again.
+
+## Verification Checklist
+
+- [x] Fix 1: Coordinator chunk size adjusted (Nov 30)
+- [x] Fix 2: Database cleanup applied (Dec 1)
+- [x] Cluster status shows "idle" (verified)
+- [x] Control endpoint enhanced (committed 5d07fbb)
+- [x] Docker container rebuilt and restarted
+- [x] Code committed and pushed
+- [ ] **USER ACTION NEEDED:** Test start button functionality
+- [ ] **USER ACTION NEEDED:** Verify coordinator starts and workers begin processing
+
+## Testing Instructions
+
+1. **Open cluster UI:** http://localhost:3001/cluster
+2. **Click Start Cluster button**
+3. **Expected behavior:**
+   - Button should trigger start action
+   - Cluster status should change from "idle" to "active"
+   - Active workers should increase from 0 to 2
+   - Workers should begin processing parameter combinations
+
+4. **Verify on EPYC servers:**
+   ```bash
+   # Check coordinator running
+   ssh root@10.10.254.106 "ps aux | grep distributed_coordinator | grep -v grep"
+   
+   # Check workers running  
+   ssh root@10.10.254.106 "ps aux | grep distributed_worker | wc -l"
+   ```
+
+5. **Check database state:**
+   ```bash
+   sqlite3 cluster/exploration.db "SELECT id, status, assigned_worker FROM chunks ORDER BY id;"
+   ```
+
+## Status
+
+✅ **FIX 1 COMPLETE** (Nov 30, 2025)
+- Coordinator chunk size fixed
+- Verified coordinator can process 4,096 combinations
+
+✅ **FIX 2 DEPLOYED** (Dec 1, 2025 08:38 UTC)
+- Container rebuilt: 77s build time
+- Container restarted: trading-bot-v4 running
+- Cluster status: idle (correct)
+- Database cleanup logic active in start/stop actions
+- Ready for user testing
+
+⏳ **PENDING USER VERIFICATION**
+- User needs to test start button functionality
+- User needs to verify coordinator starts successfully
+- User needs to confirm workers begin processing
+
+## Git Commits
+
+**Nov 30:** Coordinator chunk size fix
+**Dec 1 (5d07fbb):** "critical: Fix EPYC cluster start button - database cleanup before start"
+**Files Changed:** `app/api/cluster/control/route.ts` (61 insertions, 5 deletions)
+
+```
+
+The coordinator calculated that chunk 1 would start at combo 10,000 (chunk_size × chunk_id), but since 10,000 > 4,096 total combos, it thought all work was complete and exited immediately.
+
+## Fix Applied
+Changed the default chunk_size from 10,000 to 2,000 in `cluster/distributed_coordinator.py`:
+
+```python
+# Before:
+parser.add_argument('--chunk-size', type=int, default=10000,
+                   help='Number of combinations per chunk (default: 10000)')
+
+# After:
+parser.add_argument('--chunk-size', type=int, default=2000,
+                   help='Number of combinations per chunk (default: 2000)')
+```
+
+This creates 2-3 smaller chunks for the 4,096 combination exploration, allowing proper distribution across workers.
+
+## Verification
+1. ✅ Manual coordinator run created chunks successfully
+2. ✅ Both workers (worker1 and worker2) started processing
+3. ✅ Docker image rebuilt with fix
+4. ✅ Container deployed and running
+
+## Result
+The start button now works correctly:
+- Coordinator creates appropriate-sized chunks
+- Workers are assigned work
+- Exploration runs to completion
+- Progress is tracked in the database
+
+## Next Steps
+You can now use the start button in the web dashboard at http://10.0.0.48:3001/cluster to start explorations. The system will:
+1. Create 2-3 chunks of ~2,000 combinations each
+2. Distribute to worker1 and worker2
+3. Run for ~30-60 minutes to complete 4,096 combinations
+4. Save top 100 results to CSV
+5. Update dashboard with live progress
+
+## Files Modified
+- `cluster/distributed_coordinator.py` - Changed default chunk_size from 10000 to 2000
+- Docker image rebuilt and deployed to port 3001
--- a/docs/cluster/DUAL_SWEEP_README.md
+++ b/docs/cluster/DUAL_SWEEP_README.md
@@ -0,0 +1,167 @@
+# Dual v9 Parameter Sweep Package
+
+**Purpose:** Run two INDEPENDENT parameter sweeps to compare which performs better
+
+## What This Tests
+
+**TWO SEPARATE SWEEPS** (not combined):
+
+1. **Raw v9 Sweep**: v9 Money Line indicator WITHOUT any filter  
+   - Baseline performance across all parameters
+   - File: `scripts/run_backtest_sweep.py`
+   - Output: `sweep_v9_raw.csv`
+
+2. **RSI Filtered Sweep**: v9 Money Line indicator WITH RSI divergence filter  
+   - Same parameters, but only trades with RSI divergence
+   - File: `scripts/run_backtest_sweep_rsi.py`
+   - Output: `sweep_v9_rsi_divergence.csv`
+
+**Both test 65,536 parameter combinations independently, then we compare best results.**
+
+## Package Contents
+- `data/solusdt_5m.csv` - OHLCV data (Aug 1 - Nov 28, 2024, 34,273 candles)
+- `backtester/` - Core backtesting modules
+- `scripts/run_backtest_sweep.py` - Vanilla v9 sweep
+- `scripts/run_backtest_sweep_rsi.py` - RSI divergence filtered sweep
+- `setup_dual_sweep.sh` - Setup script
+- `run_sweep_vanilla_epyc.sh` - Launch vanilla sweep
+- `run_sweep_rsi_epyc.sh` - Launch RSI sweep
+
+## Quick Start (EPYC Servers)
+
+### EPYC Server 1 - Raw v9 Sweep (No Filter)
+```bash
+# Extract package
+tar -xzf backtest_v9_dual_sweep.tar.gz
+cd backtest
+
+# Setup environment
+./setup_dual_sweep.sh
+
+# Run raw v9 sweep (65,536 combinations, ~12-13h with 24 workers)
+./run_sweep_vanilla_epyc.sh
+
+# Monitor progress
+tail -f v9_vanilla_sweep.log
+```
+
+### EPYC Server 2 - RSI Filtered v9 Sweep
+```bash
+# Extract package
+tar -xzf backtest_v9_dual_sweep.tar.gz
+cd backtest
+
+# Setup environment
+./setup_dual_sweep.sh
+
+# Run RSI sweep (65,536 combinations, ~12-13h with 24 workers)
+./run_sweep_rsi_epyc.sh
+
+# Monitor progress
+tail -f v9_rsi_sweep.log
+```
+
+## Parameter Grid
+Both sweeps test the same 8 parameters (4 values each = 65,536 combinations):
+
+- **flip_threshold:** 0.4, 0.5, 0.6, 0.7
+- **ma_gap:** 0.20, 0.30, 0.40, 0.50
+- **momentum_adx:** 18, 21, 24, 27
+- **momentum_long_pos:** 60, 65, 70, 75
+- **momentum_short_pos:** 20, 25, 30, 35
+- **cooldown_bars:** 1, 2, 3, 4
+- **momentum_spacing:** 2, 3, 4, 5
+- **momentum_cooldown:** 1, 2, 3, 4
+
+## Expected Outputs
+
+### Vanilla Sweep
+- **File:** `sweep_v9_vanilla_epyc.csv`
+- **Columns:** All 8 parameters + trades, total_pnl, win_rate, avg_pnl, max_drawdown, profit_factor
+- **Sorted by:** total_pnl (descending)
+- **Baseline Performance (default params):** $405.88, 569 trades, 60.98% WR
+
+### RSI Divergence Sweep
+- **File:** `sweep_v9_rsi_divergence.csv`
+- **Columns:** Same as vanilla
+- **Sorted by:** total_pnl (descending)
+- **Filter:** Only trades with RSI divergence (bullish/bearish patterns, 20-bar lookback)
+- **Top:** 100 results only (to keep file size manageable)
+- **Baseline Performance:** $423.46, 224 trades, 63.39% WR (39% fewer trades but better quality)
+
+## Key Differences
+
+### Vanilla v9
+- All signals execute (no post-filter)
+- Tests which parameters maximize profit across all market conditions
+- Higher trade frequency
+
+### RSI Divergence v9
+- Post-simulation filter: only keeps trades with RSI divergence detected
+- Tests which parameters work best when combined with divergence confirmation
+- Lower trade frequency but potentially higher win rate
+
+## Performance Estimates
+- **Hardware:** AMD EPYC 7282 (16-core) or similar
+- **Workers:** 24 parallel processes
+- **Speed:** ~1.6s per combination
+- **Total Time:** ~29 hours for 65,536 combinations
+- **Output Size:** ~5-10 MB per CSV (vanilla full results, RSI top 100)
+
+## Comparison Strategy
+After both sweeps complete:
+1. Find best vanilla result: `head -1 sweep_v9_vanilla_epyc.csv`
+2. Find best RSI result: `head -1 sweep_v9_rsi_divergence.csv`
+3. Compare total P&L, trade count, win rate
+4. Decision: Implement whichever yields highest total profit
+
+## Monitoring Commands
+```bash
+# Check sweep status
+ps aux | grep run_backtest_sweep
+
+# Watch progress (vanilla)
+tail -f v9_vanilla_sweep.log
+
+# Watch progress (RSI)
+tail -f v9_rsi_sweep.log
+
+# Check completion
+ls -lh sweep_v9_*.csv
+
+# Kill sweep if needed
+pkill -f run_backtest_sweep
+```
+
+## Troubleshooting
+
+### Import Errors
+- Ensure .venv is activated: `source .venv/bin/activate`
+- Check pandas/numpy installed: `pip list | grep -E 'pandas|numpy'`
+
+### Memory Issues
+- Reduce workers: Edit run script, change `--workers 24` to `--workers 16`
+- Monitor: `htop` or `free -h`
+
+### Slow Progress
+- Check CPU usage: `htop` (should see 24 python processes at 100%)
+- Check I/O: `iostat -x 1` (shouldn't be bottleneck with CSV in memory)
+
+## Expected Results Format
+
+### Vanilla CSV Example
+```
+flip_threshold,ma_gap,momentum_adx,momentum_long_pos,momentum_short_pos,cooldown_bars,momentum_spacing,momentum_cooldown,trades,total_pnl,win_rate,avg_pnl,max_drawdown,profit_factor
+0.6,0.35,23,70,25,2,3,2,569,405.88,60.98,0.71,1360.58,1.022
+```
+
+### RSI Divergence CSV Example
+```
+flip_threshold,ma_gap,momentum_adx,momentum_long_pos,momentum_short_pos,cooldown_bars,momentum_spacing,momentum_cooldown,total_pnl,num_trades,win_rate,profit_factor,max_drawdown,avg_win,avg_loss
+0.6,0.35,23,70,25,2,3,2,423.46,224,63.39,1.087,1124.33,5.23,-3.42
+```
+
+## Package Info
+- **Size:** 1.1 MB compressed
+- **MD5:** d540906b1a9a3eaa0404bbd800349c59
+- **Created:** November 29, 2025
--- a/docs/cluster/EPYC_SETUP_COMPREHENSIVE.md
+++ b/docs/cluster/EPYC_SETUP_COMPREHENSIVE.md
@@ -0,0 +1,149 @@
+# Running Comprehensive Sweep on EPYC Server
+
+## Transfer Package to EPYC
+
+```bash
+# From your local machine
+scp comprehensive_sweep_package.tar.gz root@72.62.39.24:/root/
+```
+
+## Setup on EPYC
+
+```bash
+# SSH to EPYC
+ssh root@72.62.39.24
+
+# Extract package
+cd /root
+tar -xzf comprehensive_sweep_package.tar.gz
+cd comprehensive_sweep
+
+# Setup Python environment
+python3 -m venv .venv
+source .venv/bin/activate
+pip install pandas numpy
+
+# Create logs directory
+mkdir -p backtester/logs
+
+# Make scripts executable
+chmod +x run_comprehensive_sweep.sh
+chmod +x backtester/scripts/comprehensive_sweep.py
+```
+
+## Run the Sweep
+
+```bash
+# Start the sweep in background
+./run_comprehensive_sweep.sh
+
+# Or manually with more control:
+cd /root/comprehensive_sweep
+source .venv/bin/activate
+nohup python3 backtester/scripts/comprehensive_sweep.py > sweep.log 2>&1 &
+
+# Get the PID
+echo $! > sweep.pid
+```
+
+## Monitor Progress
+
+```bash
+# Watch live progress (updates every 100 configs)
+tail -f backtester/logs/sweep_comprehensive_*.log
+
+# Or if using manual method:
+tail -f sweep.log
+
+# See current best result
+grep 'Best so far' backtester/logs/sweep_comprehensive_*.log | tail -5
+
+# Check if still running
+ps aux | grep comprehensive_sweep
+
+# Check CPU usage
+htop
+```
+
+## Stop if Needed
+
+```bash
+# Using PID file:
+kill $(cat sweep.pid)
+
+# Or by name:
+pkill -f comprehensive_sweep
+```
+
+## EPYC Performance Estimate
+
+- **Your EPYC:** 16 cores/32 threads
+- **Local Server:** 6 cores
+- **Speedup:** ~5-6× faster on EPYC
+
+**Total combinations:** 14,929,920
+
+**Estimated times:**
+- Local (6 cores): ~30-40 hours
+- EPYC (16 cores): ~6-8 hours 🚀
+
+## Retrieve Results
+
+```bash
+# After completion, download results
+scp root@72.62.39.24:/root/comprehensive_sweep/sweep_comprehensive.csv .
+
+# Check top results on server first:
+head -21 /root/comprehensive_sweep/sweep_comprehensive.csv
+```
+
+## Results Format
+
+CSV columns:
+- rank
+- trades
+- win_rate
+- total_pnl
+- pnl_per_1k (most important - profitability per $1000)
+- flip_threshold
+- ma_gap
+- adx_min
+- long_pos_max
+- short_pos_min
+- cooldown
+- position_size
+- tp1_mult
+- tp2_mult
+- sl_mult
+- tp1_close_pct
+- trailing_mult
+- vol_min
+- max_bars
+
+## Quick Test
+
+Before running full sweep, test that everything works:
+
+```bash
+cd /root/comprehensive_sweep
+source .venv/bin/activate
+
+# Quick test with just 10 combinations
+python3 -c "
+from pathlib import Path
+from backtester.data_loader import load_csv
+from backtester.simulator import simulate_money_line, TradeConfig
+from backtester.indicators.money_line import MoneyLineInputs
+
+data_slice = load_csv(Path('backtester/data/solusdt_5m_aug_nov.csv'), 'SOL-PERP', '5m')
+print(f'Loaded {len(data_slice.data)} candles')
+
+inputs = MoneyLineInputs(flip_threshold_percent=0.6)
+config = TradeConfig(position_size=210.0)
+results = simulate_money_line(data_slice.data, 'SOL-PERP', inputs, config)
+print(f'Test: {len(results.trades)} trades, {results.win_rate*100:.1f}% WR, \${results.total_pnl:.2f} P&L')
+print('✅ Everything working!')
+"
+```
+
+If test passes, run the full sweep!