CRITICAL FIX (Nov 30, 2025):
- Dashboard showed 'idle' despite 22+ worker processes running
- Root cause: SSH-based worker detection timing out
- Solution: Check database for running chunks FIRST
Changes:
1. app/api/cluster/status/route.ts:
- Query exploration database before SSH detection
- If running chunks exist, mark workers 'active' even if SSH fails
- Override worker status: 'offline' → 'active' when chunks running
- Log: '✅ Cluster status: ACTIVE (database shows running chunks)'
- Database is source of truth, SSH only for supplementary metrics
2. app/cluster/page.tsx:
- Stop button ALREADY EXISTS (conditionally shown)
- Shows Start when status='idle', Stop when status='active'
- No code changes needed - fixed by status detection
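The database-first override described in change 1 can be sketched as follows. This is an illustrative Python sketch only (the real implementation is TypeScript in app/api/cluster/status/route.ts); the function and parameter names are hypothetical:

```python
def resolve_cluster_status(running_chunks, ssh_worker_status):
    """Database-first status resolution: if the exploration database
    reports running chunks, the cluster is ACTIVE regardless of SSH.

    `running_chunks` is a list of chunk rows from the database;
    `ssh_worker_status` maps worker name -> status string from the
    (possibly timed-out) SSH probes. Both names are illustrative.
    """
    if running_chunks:
        # Override any 'offline' verdicts caused by SSH timeouts.
        workers = {name: 'active' if status == 'offline' else status
                   for name, status in ssh_worker_status.items()}
        return 'active', workers
    # No running chunks: fall back to whatever SSH reported.
    return 'idle', ssh_worker_status
```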
Result:
- Dashboard now shows 'ACTIVE' with 2 workers (correct)
- Workers show 'active' status (was 'offline')
- Stop button automatically visible when cluster active
- System resilient to SSH timeouts/network issues
Verified:
- Container restarted: Nov 30 21:18 UTC
- API tested: Returns status='active', activeWorkers=2
- Logs confirm: Database-first logic working
- Workers confirmed running: 22+ processes on worker1, additional worker processes on worker2
Running Comprehensive Sweep on EPYC Server
Transfer Package to EPYC
# From your local machine
scp comprehensive_sweep_package.tar.gz root@72.62.39.24:/root/
Setup on EPYC
# SSH to EPYC
ssh root@72.62.39.24
# Extract package
cd /root
tar -xzf comprehensive_sweep_package.tar.gz
cd comprehensive_sweep
# Setup Python environment
python3 -m venv .venv
source .venv/bin/activate
pip install pandas numpy
# Create logs directory
mkdir -p backtester/logs
# Make scripts executable
chmod +x run_comprehensive_sweep.sh
chmod +x backtester/scripts/comprehensive_sweep.py
Run the Sweep
# Start the sweep in background
./run_comprehensive_sweep.sh
# Or manually with more control:
cd /root/comprehensive_sweep
source .venv/bin/activate
nohup python3 backtester/scripts/comprehensive_sweep.py > sweep.log 2>&1 &
# Get the PID
echo $! > sweep.pid
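The manual nohup/PID steps above can also be wrapped in a small Python helper. This is a sketch, not part of the package; `launch_sweep` is an illustrative name, and the default paths are the ones used in this guide:

```python
import subprocess
from pathlib import Path

def launch_sweep(script='backtester/scripts/comprehensive_sweep.py',
                 log_path='sweep.log', pid_path='sweep.pid'):
    """Start the sweep detached from the terminal, redirect all output
    to a log file, and record the PID so the process can later be
    stopped with `kill $(cat sweep.pid)`."""
    log = open(log_path, 'ab')
    proc = subprocess.Popen(
        ['python3', script],
        stdout=log, stderr=subprocess.STDOUT,
        start_new_session=True,  # survive the SSH session, like nohup
    )
    Path(pid_path).write_text(str(proc.pid))
    return proc.pid
```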
Monitor Progress
# Watch live progress (updates every 100 configs)
tail -f backtester/logs/sweep_comprehensive_*.log
# Or if using manual method:
tail -f sweep.log
# See current best result
grep 'Best so far' backtester/logs/sweep_comprehensive_*.log | tail -5
# Check if still running
ps aux | grep comprehensive_sweep
# Check CPU usage
htop
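The `grep ... | tail -5` step above can also be done programmatically when post-processing a log. A minimal sketch, assuming progress lines contain the literal text 'Best so far' as in the grep pattern:

```python
def latest_best_lines(log_text, n=5):
    """Return the last `n` lines containing 'Best so far', mirroring
    `grep 'Best so far' <log> | tail -5`."""
    matches = [line for line in log_text.splitlines() if 'Best so far' in line]
    return matches[-n:]
```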
Stop if Needed
# Using PID file:
kill $(cat sweep.pid)
# Or by name:
pkill -f comprehensive_sweep
EPYC Performance Estimate
- Your EPYC: 16 cores/32 threads
- Local Server: 6 cores
- Speedup: ~5-6× faster on EPYC
Total combinations: 14,929,920
Estimated times:
- Local (6 cores): ~30-40 hours
- EPYC (16 cores): ~6-8 hours 🚀
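The estimates above follow from simple throughput scaling. Here is the arithmetic, taking the midpoints of the local-runtime and speedup ranges given above (the ~5-6x factor presumably accounts for the extra SMT threads, since a pure 16-vs-6 core ratio would only give ~2.7x):

```python
total = 14_929_920   # parameter combinations in the sweep
local_hours = 35     # midpoint of the 30-40 h local (6-core) estimate
speedup = 5.5        # midpoint of the ~5-6x EPYC speedup

epyc_hours = local_hours / speedup
local_rate = total / (local_hours * 3600)  # configs per second locally
print(f"EPYC: ~{epyc_hours:.1f} h at ~{local_rate * speedup:.0f} configs/s")
```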
Retrieve Results
# After completion, download results
scp root@72.62.39.24:/root/comprehensive_sweep/sweep_comprehensive.csv .
# Check top results on the server first (header row + top 20 rows):
head -21 /root/comprehensive_sweep/sweep_comprehensive.csv
Results Format
CSV columns:
- rank
- trades
- win_rate
- total_pnl
- pnl_per_1k (the key metric: profitability per $1,000 of capital)
- flip_threshold
- ma_gap
- adx_min
- long_pos_max
- short_pos_min
- cooldown
- position_size
- tp1_mult
- tp2_mult
- sl_mult
- tp1_close_pct
- trailing_mult
- vol_min
- max_bars
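Once downloaded, the CSV can be inspected with pandas (already installed in the venv above). A sketch using the column names listed here; the explicit sort is a safeguard in case the file is not already ordered by rank:

```python
import pandas as pd

def top_configs(csv_path, n=20, metric='pnl_per_1k'):
    """Load sweep results and return the n best rows by the given
    metric (pnl_per_1k = profitability per $1,000, the key column)."""
    df = pd.read_csv(csv_path)
    return df.sort_values(metric, ascending=False).head(n)
```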
Quick Test
Before running the full sweep, test that everything works:
cd /root/comprehensive_sweep
source .venv/bin/activate
# Quick test with just 10 combinations
python3 -c "
from pathlib import Path
from backtester.data_loader import load_csv
from backtester.simulator import simulate_money_line, TradeConfig
from backtester.indicators.money_line import MoneyLineInputs
data_slice = load_csv(Path('backtester/data/solusdt_5m_aug_nov.csv'), 'SOL-PERP', '5m')
print(f'Loaded {len(data_slice.data)} candles')
inputs = MoneyLineInputs(flip_threshold_percent=0.6)
config = TradeConfig(position_size=210.0)
results = simulate_money_line(data_slice.data, 'SOL-PERP', inputs, config)
print(f'Test: {len(results.trades)} trades, {results.win_rate*100:.1f}% WR, \${results.total_pnl:.2f} P&L')
print('✅ Everything working!')
"
If test passes, run the full sweep!