feat: Continuous optimization cluster for 2 EPYC servers

- Master controller with job queue and result aggregation
- Worker scripts for parallel backtesting (22 workers per server)
- SQLite database for strategy ranking and performance tracking
- File-based job queue (simple, robust, survives crashes)
- Auto-setup script for both EPYC servers
- Status dashboard for monitoring progress
- Comprehensive deployment guide

Architecture:
- Master: Job generation, worker coordination, result collection
- Worker 1 (pve-nu-monitor01): AMD EPYC 7282, 22 parallel jobs
- Worker 2 (srv-bd-host01): AMD EPYC 7302, 22 parallel jobs
- Total capacity: ~49,000 backtests/day (44 cores @ 70%)

Initial focus: v9 parameter refinement (27 configurations)
Target: Find strategies >$200/1k P&L (current baseline $192/1k)

Files:
- cluster/master.py: Main controller (570 lines)
- cluster/worker.py: Worker execution script (220 lines)
- cluster/setup_cluster.sh: Automated deployment
- cluster/status.py: Real-time status dashboard
- cluster/README.md: Operational documentation
- cluster/DEPLOYMENT.md: Step-by-step deployment guide

cluster/DEPLOYMENT.md (new file, 415 lines)

# 🚀 Continuous Optimization Cluster - Deployment Guide

## ⚡ Quick Deploy (5 minutes)

```bash
cd /home/icke/traderv4/cluster

# 1. Set up both EPYC servers
./setup_cluster.sh

# 2. Start master controller
python3 master.py

# 3. Monitor status (separate terminal)
watch -n 10 'python3 status.py'
```

---

## 📋 Prerequisites Checklist

- [x] **SSH Access:** Keys configured for both EPYC servers
  - `root@10.10.254.106` (pve-nu-monitor01)
  - `root@10.20.254.100` (srv-bd-host01 via monitor01)

- [x] **Python 3.7+** installed on all servers

- [x] **OHLCV Data:** `backtester/data/solusdt_5m.csv` (139,678 rows)

- [ ] **Backtester Code:** In `/home/icke/traderv4/backtester/`
  - `backtester_core.py`
  - `v9_moneyline_ma_gap.py`
  - `moneyline_core.py`

---

## 🏗️ Step-by-Step Setup

### Step 1: Verify Backtester Works Locally

```bash
cd /home/icke/traderv4/backtester

# Test v9 backtest
python3 backtester_core.py \
  --data data/solusdt_5m.csv \
  --indicator v9 \
  --flip-threshold 0.6 \
  --ma-gap 0.35 \
  --momentum-adx 23 \
  --output json
```

**Expected output:**
```json
{
  "pnl": 192.50,
  "trades": 569,
  "win_rate": 60.98,
  "profit_factor": 1.022,
  "max_drawdown": 1360.58
}
```
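
One way to confirm that the local run produced the fields shown above is to parse the JSON and check its keys. The sketch below is only an illustration: `check_backtest_output` is a hypothetical helper, and the required key set is copied from the sample output, not from `backtester_core.py` itself.

```python
import json

# Keys taken from the sample output above; the real backtester may emit more.
REQUIRED_KEYS = {"pnl", "trades", "win_rate", "profit_factor", "max_drawdown"}


def check_backtest_output(raw: str) -> dict:
    """Parse backtester JSON and fail loudly if an expected field is missing."""
    result = json.loads(raw)
    missing = REQUIRED_KEYS - result.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return result


sample = """
{"pnl": 192.50, "trades": 569, "win_rate": 60.98,
 "profit_factor": 1.022, "max_drawdown": 1360.58}
"""
result = check_backtest_output(sample)
print(f"{result['trades']} trades, P&L {result['pnl']:.2f}")
```

Running the same check against the output copied back from each worker would catch a partial or truncated result before it reaches the ranking database.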

### Step 2: Deploy to EPYC Servers

```bash
cd /home/icke/traderv4/cluster
./setup_cluster.sh
```

**This will:**
1. Create `/root/optimization-cluster` on both servers
2. Install Python venv + pandas/numpy
3. Copy backtester code
4. Copy worker.py script
5. Copy OHLCV data
6. Verify installation

**Expected output:**
```
🚀 Setting up optimization cluster...

Setting up Worker 1 (root@10.10.254.106)...
📦 Installing Python packages...
📁 Copying backtester modules...
📄 Installing worker script...
📊 Copying OHLCV data...
✅ Verifying installation...
✅ Worker 1 (pve-nu-monitor01) setup complete

Setting up Worker 2 (root@10.20.254.100)...
📦 Installing Python packages...
📁 Copying backtester modules...
📄 Installing worker script...
📊 Copying OHLCV data...
✅ Verifying installation...
✅ Worker 2 (srv-bd-host01) setup complete

🎉 Cluster setup complete!
```

### Step 3: Start Master Controller

```bash
python3 master.py
```

**Master will:**
1. Initialize SQLite database (`strategies.db`)
2. Generate initial v9 parameter sweep (27 jobs)
3. Start monitoring loop (60-second intervals)
4. Assign jobs to idle workers
5. Collect and rank results
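
The 27-job sweep in item 2 is consistent with a full grid over three values for each of the three v9 parameters. The sketch below shows that idea with `itertools.product`. The grid values are assumptions chosen to match names that appear elsewhere in this guide, and `generate_v9_sweep` is a hypothetical name, not the function in `master.py`.

```python
from itertools import product

# Assumed grids: three values per parameter gives 3 x 3 x 3 = 27 jobs.
FLIP_THRESHOLDS = [0.5, 0.6, 0.7]
MA_GAPS = [0.30, 0.35, 0.40]
MOMENTUM_ADX = [21, 23, 25]


def generate_v9_sweep() -> list:
    """Return one parameter dict per grid point."""
    return [
        {"flip_threshold": flip, "ma_gap": gap, "momentum_adx": adx}
        for flip, gap, adx in product(FLIP_THRESHOLDS, MA_GAPS, MOMENTUM_ADX)
    ]


jobs = generate_v9_sweep()
print(len(jobs))  # 27
```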

**Expected output:**
```
🚀 Cluster Master starting...
📊 Workers: 2
💾 Database: /home/icke/traderv4/cluster/strategies.db
📁 Queue: /home/icke/traderv4/cluster/queue

🔧 Generating v9 parameter sweep jobs...
✅ Created job: v9_moneyline_1701234567890 (priority 1)
✅ Created job: v9_moneyline_1701234567891 (priority 1)
...
✅ Created 27 v9 refinement jobs

============================================================
🔄 Iteration 1 - 2025-11-29 15:30:00

📤 Assigning v9_moneyline_1701234567890.json to worker1...
✅ Job started on worker1

📤 Assigning v9_moneyline_1701234567891.json to worker2...
✅ Job started on worker2

📊 Status: 25 queued | 2 running | 0 completed
```
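
The log above shows jobs living as `.json` files in a queue directory. This guide does not show how `master.py` implements that queue, so the following is only a sketch of one common pattern: write each job to a temporary file, rename it into place atomically, and claim a job by renaming it into a second directory. The `running` directory and the `enqueue` and `claim_next` functions are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def enqueue(queue_dir: Path, job_id: str, payload: dict) -> Path:
    """Write the job to a temp file, then rename it into place atomically."""
    queue_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=queue_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle)
    final_path = queue_dir / f"{job_id}.json"
    os.replace(tmp_name, final_path)  # atomic on the same filesystem
    return final_path


def claim_next(queue_dir: Path, running_dir: Path) -> Optional[Path]:
    """Move the first queued job (in file-name order) into the running directory."""
    running_dir.mkdir(parents=True, exist_ok=True)
    for job_file in sorted(queue_dir.glob("*.json")):
        target = running_dir / job_file.name
        try:
            os.replace(job_file, target)
        except FileNotFoundError:
            continue  # another process claimed it first
        return target
    return None


with tempfile.TemporaryDirectory() as root:
    queue, running = Path(root, "queue"), Path(root, "running")
    enqueue(queue, "v9_moneyline_1701234567890", {"indicator": "v9_moneyline"})
    claimed = claim_next(queue, running)
    claimed_name = claimed.name
    remaining = len(list(queue.glob("*.json")))
    print(claimed_name, remaining)
```

Because a reader only ever sees complete `.json` files, a crash in the middle of a write leaves at most a stray `.tmp` file behind, which is one plausible reading of the "survives crashes" property claimed for the queue.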

### Step 4: Monitor Progress

**Terminal 1 - Master logs:**
```bash
cd /home/icke/traderv4/cluster
python3 master.py
```

**Terminal 2 - Status dashboard:**
```bash
cd /home/icke/traderv4/cluster
watch -n 10 'python3 status.py'
```

**Terminal 3 - Queue size:**
```bash
watch -n 5 'ls -1 /home/icke/traderv4/cluster/queue/*.json 2>/dev/null | wc -l'
```

---

## 📊 Understanding Results

### Status Dashboard Output

```
======================================================================
🎯 OPTIMIZATION CLUSTER STATUS
======================================================================
📅 2025-11-29 15:45:00

📋 Queue: 15 jobs waiting
   Running: 2
   Completed: 10

🏆 TOP 5 STRATEGIES:
----------------------------------------------------------------------
Rank  Strategy                  P&L/1k    Trades  WR%    PF
----------------------------------------------------------------------
1     v9_flip0.7_ma0.40_adx25   $215.80   587     62.3%  1.18
2     v9_flip0.6_ma0.35_adx23   $208.40   569     61.5%  1.12
3     v9_flip0.7_ma0.35_adx25   $205.20   601     60.8%  1.09
4     v9_flip0.6_ma0.40_adx21   $198.70   553     61.2%  1.07
5     v9_flip0.5_ma0.35_adx23   $192.50   569     60.9%  1.02

📊 BASELINE COMPARISON:
   v9 baseline: $192.00/1k (current production)
   Best found:  $215.80/1k (+12.4% improvement) ✅

======================================================================
```
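
The baseline comparison is the relative gain of the best P&L over the production baseline. The sketch below reproduces the `+12.4%` figure from an in-memory SQLite table. The `strategies` schema and the `pnl_per_1k` column name are assumptions made for this example; the real `strategies.db` layout may differ.

```python
import sqlite3

# Hypothetical schema; the real strategies.db layout may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strategies (name TEXT PRIMARY KEY, pnl_per_1k REAL)")
conn.executemany(
    "INSERT INTO strategies VALUES (?, ?)",
    [
        ("v9_flip0.7_ma0.40_adx25", 215.80),
        ("v9_flip0.6_ma0.35_adx23", 208.40),
        ("v9_flip0.5_ma0.35_adx23", 192.50),
    ],
)

BASELINE_PNL = 192.00  # v9 baseline from the dashboard above

best_name, best_pnl = conn.execute(
    "SELECT name, pnl_per_1k FROM strategies ORDER BY pnl_per_1k DESC LIMIT 1"
).fetchone()
improvement_pct = (best_pnl - BASELINE_PNL) / BASELINE_PNL * 100
print(f"{best_name}: {improvement_pct:+.1f}% vs baseline")
```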

### Strategy Naming Convention

Format: `{indicator}_{param1}_{param2}_{param3}...`

Example: `v9_flip0.7_ma0.40_adx25`
- `v9`: Money Line indicator
- `flip0.7`: flip_threshold = 0.7 (70% EMA flip confirmation)
- `ma0.40`: ma_gap = 0.40 (MA50-MA200 gap bonus threshold)
- `adx25`: momentum_adx = 25 (ADX requirement for momentum filter)
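
The convention above can be made mechanical with a pair of small helpers. This is a sketch that assumes the number formatting seen in the example (one decimal for `flip`, two for `ma`, an integer for `adx`); `build_name` and `parse_name` are hypothetical names, not functions from the cluster code.

```python
import re

NAME_PATTERN = re.compile(
    r"^(?P<indicator>[a-z0-9]+)_flip(?P<flip>\d+\.\d+)"
    r"_ma(?P<ma>\d+\.\d+)_adx(?P<adx>\d+)$"
)


def build_name(indicator: str, flip: float, ma_gap: float, adx: int) -> str:
    """Format parameters the way the example names are written."""
    return f"{indicator}_flip{flip:.1f}_ma{ma_gap:.2f}_adx{adx}"


def parse_name(name: str) -> dict:
    """Recover the parameter values from a v9-style strategy name."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized strategy name: {name}")
    return {
        "indicator": match["indicator"],
        "flip_threshold": float(match["flip"]),
        "ma_gap": float(match["ma"]),
        "momentum_adx": int(match["adx"]),
    }


name = build_name("v9", 0.7, 0.40, 25)
print(name)
print(parse_name(name))
```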

---

## 🔧 Troubleshooting

### Problem: Workers not picking up jobs

**Check worker health:**
```bash
# The [b] keeps pgrep -f from matching the remote shell that runs this command
ssh root@10.10.254.106 'pgrep -f "[b]acktester" || echo IDLE'
ssh root@10.10.254.106 'ssh root@10.20.254.100 "pgrep -f \"[b]acktester\" || echo IDLE"'
```

**View worker logs:**
```bash
ssh root@10.10.254.106 'tail -f /root/optimization-cluster/logs/worker.log'
```

**Restart worker manually:**
```bash
ssh root@10.10.254.106 'cd /root/optimization-cluster && python3 worker.py jobs/v9_moneyline_*.json'
```

### Problem: Jobs stuck in "running" status

**Reset stale jobs (>30 minutes):**
```bash
sqlite3 /home/icke/traderv4/cluster/strategies.db <<EOF
UPDATE jobs
SET status = 'queued', worker_id = NULL, started_at = NULL
WHERE status = 'running'
AND started_at < datetime('now', '-30 minutes');
EOF
```
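
To see what the reset does before running it against the live database, the same `UPDATE` can be tried on a throwaway in-memory database. The table below is a hypothetical minimal schema containing only the columns the statement touches. Note that `datetime('now', ...)` returns UTC text in `YYYY-MM-DD HH:MM:SS` form, so the comparison only behaves as intended if `started_at` is stored the same way, which is an assumption here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema with only the columns the UPDATE touches.
conn.execute(
    "CREATE TABLE jobs (id TEXT PRIMARY KEY, status TEXT, "
    "worker_id TEXT, started_at TEXT)"
)
conn.execute(
    "INSERT INTO jobs VALUES ('stale', 'running', 'worker1', "
    "datetime('now', '-45 minutes'))"
)
conn.execute(
    "INSERT INTO jobs VALUES ('fresh', 'running', 'worker2', "
    "datetime('now', '-5 minutes'))"
)

cursor = conn.execute(
    """
    UPDATE jobs
    SET status = 'queued', worker_id = NULL, started_at = NULL
    WHERE status = 'running'
      AND started_at < datetime('now', '-30 minutes')
    """
)
reset_count = cursor.rowcount
statuses = dict(conn.execute("SELECT id, status FROM jobs"))
print(reset_count, statuses)
```

Only the job that started 45 minutes ago is returned to the queue; the one that started 5 minutes ago keeps running.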

### Problem: Database locked

**Kill master process:**
```bash
pkill -f 'python3 master.py'
```

**Recover a leftover rollback journal:**
```bash
# strategies.db-journal is SQLite's rollback journal, not a lock file.
# Deleting it can corrupt the database; opening the database lets SQLite roll it back.
sqlite3 /home/icke/traderv4/cluster/strategies.db 'PRAGMA integrity_check;'
```

**Restart master:**
```bash
python3 master.py
```

### Problem: Out of disk space

**Archive old results:**
```bash
cd /home/icke/traderv4/cluster/results
archive_name="archive_$(date +%Y%m%d).tar.gz"
mkdir -p ~/backups
# Delete the originals only if the tarball was created and moved successfully
tar -czf "$archive_name" archive/ \
  && mv "$archive_name" ~/backups/ \
  && rm -rf archive/*
```

**Clean worker temp files:**
```bash
ssh root@10.10.254.106 'rm -rf /root/optimization-cluster/results/archive/*'
ssh root@10.10.254.106 'ssh root@10.20.254.100 "rm -rf /root/optimization-cluster/results/archive/*"'
```

---

## 🎯 Adding Custom Strategies

### Example: Volume Profile Indicator

1. **Implement backtester module:**
```bash
vim backtester/volume_profile.py
```

2. **Add job generation in master.py:**
```python
def generate_volume_jobs(self):
    """Generate volume profile jobs"""
    for window in [20, 50, 100]:
        for threshold in [0.6, 0.7, 0.8]:
            params = {
                'profile_window': window,
                'entry_threshold': threshold,
                'stop_loss_atr': 3.0
            }

            self.queue.create_job(
                'volume_profile',
                params,
                PRIORITY_MEDIUM
            )
```

3. **Update worker.py to handle new indicator:**
```python
def execute(self):
    if self.job['indicator'] == 'v9_moneyline':
        result = self.run_v9_backtest()
    elif self.job['indicator'] == 'volume_profile':
        result = self.run_volume_backtest()
    # ...
```
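
As more indicators are added, the `if`/`elif` chain can be swapped for a lookup table so that each new indicator is a single dictionary entry. This is an alternative sketch, not the structure of the actual `worker.py`: the `Worker` class below is a toy stand-in, and both backtest methods return placeholder results.

```python
from typing import Callable, Dict


class Worker:
    """Toy stand-in for the worker class; the real backtest methods are not shown."""

    def __init__(self, job: dict) -> None:
        self.job = job
        # Map indicator names to bound methods; add one entry per new indicator.
        self.handlers: Dict[str, Callable[[], dict]] = {
            "v9_moneyline": self.run_v9_backtest,
            "volume_profile": self.run_volume_backtest,
        }

    def run_v9_backtest(self) -> dict:
        return {"indicator": "v9_moneyline"}  # placeholder result

    def run_volume_backtest(self) -> dict:
        return {"indicator": "volume_profile"}  # placeholder result

    def execute(self) -> dict:
        indicator = self.job["indicator"]
        try:
            handler = self.handlers[indicator]
        except KeyError:
            raise ValueError(f"unknown indicator: {indicator}") from None
        return handler()


print(Worker({"indicator": "volume_profile"}).execute())
```

An unknown indicator name raises a `ValueError` instead of falling through silently, which makes a job file with a typo fail fast.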

4. **Deploy updates:**
```bash
./setup_cluster.sh # Redeploy with new code
```

---

## 📈 Performance Expectations

### Cluster Capacity

**Hardware:**
- 64 total cores (32 per server)
- 44 cores used for backtesting (~70% utilization)
- 22 parallel backtests per server (44 total)

**Throughput:**
- ~1.6s per v9 backtest (EPYC 7282/7302)
- ~49,000 backtests per day
- ~1.47 million backtests per month
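
The throughput figures can be cross-checked with a little arithmetic. The sketch below derives the monthly number from the quoted daily one and, separately, computes the theoretical ceiling for 44 slots at 1.6 s per backtest. The 44-slot count comes from the 22-jobs-per-server figure in the commit message; the ceiling assumes zero scheduling, transfer, or startup overhead, which is an assumption of this example and not a measurement.

```python
SECONDS_PER_DAY = 24 * 60 * 60

backtests_per_day = 49_000   # figure quoted above
seconds_per_backtest = 1.6   # figure quoted above
parallel_slots = 44          # 22 per server, two servers

backtests_per_month = backtests_per_day * 30
ideal_per_day = round(parallel_slots * SECONDS_PER_DAY / seconds_per_backtest)

print(f"{backtests_per_month:,} per 30-day month")
print(f"{ideal_per_day:,} per day if all slots ran back to back")
```

The quoted ~49,000 backtests per day sits far below that ceiling, so it is best read as a conservative planning figure; this guide does not itemize where the difference goes.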

### Optimization Timeline

**Week 1 (v9 refinement):**
- 27 initial jobs (flip × ma_gap × adx)
- Expand to 81 jobs (add long_pos, short_pos variations)
- Target: Find >$200/1k P&L configuration

**Week 2-3 (New indicators):**
- Volume profile: 27 configurations
- Order flow: 27 configurations
- Market structure: 27 configurations
- Target: Find >$250/1k P&L strategy

**Week 4+ (Advanced):**
- Multi-timeframe: 81 configurations
- ML-based scoring: 100+ hyperparameter combinations
- Target: Find >$300/1k P&L strategy

---

## 🔒 Safety & Deployment

### Validation Gates

Before deploying a strategy to production, confirm that it passes every gate:

1. **Trade Count:** ≥700 trades (statistical significance)
2. **Win Rate:** 63-68% (realistic range)
3. **Profit Factor:** ≥1.5 (solid edge)
4. **Max Drawdown:** <20% (manageable)
5. **Sharpe Ratio:** ≥1.0 (risk-adjusted)
6. **Consistency:** Top 3 for 7 consecutive days
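
The six gates translate directly into a checklist function. The sketch below is illustrative only: `StrategyStats` and `failed_gates` are hypothetical names, the win-rate gate is read as "must fall inside 63-68%", and the sample candidate reuses the trade count, win rate, and profit factor of the top dashboard row with made-up values for drawdown, Sharpe ratio, and days in the top 3.

```python
from dataclasses import dataclass


@dataclass
class StrategyStats:
    trades: int
    win_rate_pct: float
    profit_factor: float
    max_drawdown_pct: float
    sharpe: float
    days_in_top3: int


def failed_gates(stats: StrategyStats) -> list:
    """Return the names of the gates a strategy fails (empty list = pass)."""
    checks = {
        "trade_count": stats.trades >= 700,
        "win_rate": 63.0 <= stats.win_rate_pct <= 68.0,
        "profit_factor": stats.profit_factor >= 1.5,
        "max_drawdown": stats.max_drawdown_pct < 20.0,
        "sharpe": stats.sharpe >= 1.0,
        "consistency": stats.days_in_top3 >= 7,
    }
    return [name for name, passed in checks.items() if not passed]


# Drawdown, Sharpe, and days-in-top-3 values are invented for illustration.
candidate = StrategyStats(587, 62.3, 1.18, 12.0, 1.1, 7)
print(failed_gates(candidate))
```

With these inputs the candidate fails the trade-count, win-rate, and profit-factor gates, which shows that ranking first on P&L alone is not enough to clear validation.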

### Manual Deployment Process

```bash
# 1. Review top strategy
sqlite3 /home/icke/traderv4/cluster/strategies.db "SELECT * FROM strategies WHERE name = 'v9_flip0.7_ma0.40_adx25'"

# 2. Extract parameters
# flip_threshold: 0.7
# ma_gap: 0.40
# momentum_adx: 25

# 3. Update TradingView indicator
# Edit moneyline_v9_ma_gap.pinescript
# Change parameters to winning values

# 4. Update TradingView alerts
# Verify alerts fire with new parameters

# 5. Monitor first 10 trades
# Ensure behavior matches backtest

# 6. Full deployment after validation
```

### Auto-Deployment (Future)

Once confident in the system:

1. Master marks top strategy as `status = 'staging'`
2. User reviews via web dashboard
3. User clicks "Deploy" button
4. System auto-updates TradingView via API
5. Alerts regenerated with new parameters
6. First 10 trades monitored closely

---

## 📞 Support

- **Main docs:** `/home/icke/traderv4/.github/copilot-instructions.md`
- **Cluster README:** `/home/icke/traderv4/cluster/README.md`
- **Backtester docs:** `/home/icke/traderv4/backtester/README.md` (if it exists)

---

**Ready to deploy?**

```bash
cd /home/icke/traderv4/cluster
./setup_cluster.sh
```

Let the machines find better strategies while you sleep! 🚀