# Distributed Computing & EPYC Cluster

**Infrastructure for large-scale parameter optimization and backtesting.**

This directory contains documentation for the EPYC cluster setup, distributed backtesting coordination, and multi-server infrastructure.

---

## 🖥️ Cluster Documentation
### **EPYC Server Setup**
- `EPYC_SETUP_COMPREHENSIVE.md` - **Complete setup guide**
  - Hardware: AMD EPYC 7282 16-Core Processor (Debian 12 Bookworm)
  - Python environment: 3.11.2 with pandas 2.3.3, numpy 2.3.5
  - SSH configuration: Nested hop (master → worker1 → worker2)
  - Package deployment: tar.gz transfer with virtual environment
  - **Status:** ✅ OPERATIONAL (24 workers processing 65,536 combos)
### **Distributed Architecture**
- `DUAL_SWEEP_README.md` - **Parallel sweep execution**
  - Coordinator: Assigns chunks to workers
  - Workers: Execute parameter combinations in parallel
  - Database: SQLite `exploration.db` for state tracking
  - Results: CSV files with top N configurations
  - **Use case:** v9 exhaustive parameter optimization (Nov 28-29, 2025)
### **Cluster Control**
- `CLUSTER_START_BUTTON_FIX.md` - **Web UI integration**
  - Dashboard: http://localhost:3001/cluster
  - Start/Stop buttons with status detection
  - Database-first status (SSH supplementary)
  - Real-time progress tracking
  - **Status:** ✅ DEPLOYED (Nov 30, 2025)

---

## 🏗️ Cluster Architecture
### **Physical Infrastructure**
```
Master Server (local development machine)
├── Coordinator Process (assigns chunks)
├── Database (exploration.db)
└── Web Dashboard (Next.js)
        ↓ [SSH]
Worker1 (EPYC 10.10.254.106)
├── 12 worker processes
├── 64GB RAM
└── Direct SSH connection
        ↓ [SSH ProxyJump]
Worker2 (EPYC 10.20.254.100)
├── 12 worker processes
├── 64GB RAM
└── Via worker1 hop
```
### **Data Flow**
```
1. Coordinator creates chunks (2,000 combos each)
        ↓
2. Marks chunk status='pending' in database
        ↓
3. Worker queries database for pending chunks
        ↓
4. Coordinator assigns chunk to worker via SSH
        ↓
5. Worker updates status='running'
        ↓
6. Worker processes combinations in parallel
        ↓
7. Worker saves results to strategies table
        ↓
8. Worker updates status='completed'
        ↓
9. Coordinator assigns next pending chunk
        ↓
10. Dashboard shows real-time progress
```
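Most of these steps are plain reads and writes against the `chunks` table (schema in the next section). A minimal sketch of the claim step, assuming the documented column names; `launch_on_worker()` is a hypothetical stand-in for the coordinator's actual SSH dispatch:

```python
import sqlite3

conn = sqlite3.connect('cluster/exploration.db')
conn.execute('PRAGMA journal_mode=WAL')  # see Database Safety notes below

def claim_next_chunk(worker):
    """Pick the next pending chunk and mark it running for the given worker."""
    row = conn.execute(
        "SELECT id FROM chunks WHERE status='pending' ORDER BY start_combo LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # no pending chunks left: the sweep is finished
    conn.execute(
        "UPDATE chunks SET status='running', assigned_worker=?, "
        "started_at=strftime('%s','now') WHERE id=?",
        (worker, row[0]),
    )
    conn.commit()
    return row[0]

chunk_id = claim_next_chunk('worker1')
# launch_on_worker('worker1', chunk_id) would then perform the SSH dispatch (step 4)
```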
### **Database Schema**
```sql
-- chunks table: Work distribution
CREATE TABLE chunks (
    id TEXT PRIMARY KEY,        -- v9_chunk_000000
    start_combo INTEGER,        -- 0, 2000, 4000, etc.
    end_combo INTEGER,          -- 2000, 4000, 6000, etc.
    status TEXT,                -- 'pending', 'running', 'completed'
    assigned_worker TEXT,       -- 'worker1', 'worker2'
    started_at INTEGER,
    completed_at INTEGER
);

-- strategies table: Results storage
CREATE TABLE strategies (
    id INTEGER PRIMARY KEY,
    chunk_id TEXT,
    params TEXT,                -- JSON of parameter values
    pnl REAL,
    win_rate REAL,
    profit_factor REAL,
    max_drawdown REAL,
    total_trades INTEGER
);
```
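On the worker side, saving a batch of results into `strategies` is a straightforward insert. A minimal sketch, assuming the schema above; the `(params, metrics)` pairs are whatever the backtester produces and are illustrative only:

```python
import json
import sqlite3

conn = sqlite3.connect('exploration.db')
conn.execute('PRAGMA journal_mode=WAL')       # avoid "database is locked" (see Common Issues)
conn.execute('PRAGMA busy_timeout=10000')

def save_batch(chunk_id, results):
    """results: list of (params_dict, metrics_dict) pairs for one batch of combos."""
    rows = [
        (chunk_id, json.dumps(params), m['pnl'], m['win_rate'],
         m['profit_factor'], m['max_drawdown'], m['total_trades'])
        for params, m in results
    ]
    conn.executemany(
        "INSERT INTO strategies (chunk_id, params, pnl, win_rate, "
        "profit_factor, max_drawdown, total_trades) VALUES (?, ?, ?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
```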
---

## 🚀 Using the Cluster

### **Starting a Sweep**
```bash
# 1. Prepare package on master
cd /home/icke/traderv4/backtester
tar -czf backtest_v9_sweep.tar.gz data/ backtester_core.py v9_moneyline_ma_gap.py moneyline_core.py

# 2. Transfer to EPYC workers
scp backtest_v9_sweep.tar.gz root@10.10.254.106:/home/backtest/
ssh root@10.10.254.106 "scp /home/backtest/backtest_v9_sweep.tar.gz root@10.20.254.100:/home/backtest/"

# 3. Extract on workers
ssh root@10.10.254.106 "cd /home/backtest && tar -xzf backtest_v9_sweep.tar.gz"
ssh root@10.10.254.106 "ssh root@10.20.254.100 'cd /home/backtest && tar -xzf backtest_v9_sweep.tar.gz'"

# 4. Start via web dashboard or CLI
# Web: http://localhost:3001/cluster → Click "Start Cluster"
# CLI: cd /home/icke/traderv4/cluster && python v9_advanced_coordinator.py
```
### **Monitoring Progress**
```bash
# Dashboard
curl -s http://localhost:3001/api/cluster/status | jq

# Database query
sqlite3 cluster/exploration.db "SELECT status, COUNT(*) FROM chunks GROUP BY status;"

# Worker processes
ssh root@10.10.254.106 "ps aux | grep [p]ython | grep backtest | wc -l"
```
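For a rough ETA from those chunk counts, a minimal sketch (assumes the `chunks` schema above and a roughly constant sweep rate; this is not part of the dashboard code):

```python
import sqlite3
import time

conn = sqlite3.connect('cluster/exploration.db')
counts = dict(conn.execute("SELECT status, COUNT(*) FROM chunks GROUP BY status"))

done = counts.get('completed', 0)
total = sum(counts.values())
first_start = conn.execute(
    "SELECT MIN(started_at) FROM chunks WHERE started_at IS NOT NULL"
).fetchone()[0]

if done and first_start:
    elapsed_h = (time.time() - first_start) / 3600
    eta_h = elapsed_h * (total - done) / done
    print(f"{done}/{total} chunks done, ~{eta_h:.1f} h remaining at the current rate")
```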
### **Collecting Results**
```bash
# Results saved to cluster/results/
ls -lh cluster/results/sweep_v9_*.csv

# Top 100 configurations by P&L
sqlite3 cluster/exploration.db "SELECT params, pnl, win_rate FROM strategies ORDER BY pnl DESC LIMIT 100;"
```
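The same top-100 query can be pulled into a DataFrame and written to CSV from Python (pandas is already part of the cluster environment; the output path here is only an example, not the sweep's own naming scheme):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect('cluster/exploration.db')
top = pd.read_sql_query(
    "SELECT params, pnl, win_rate, profit_factor, max_drawdown "
    "FROM strategies ORDER BY pnl DESC LIMIT 100",
    conn,
)
top.to_csv('cluster/results/top100_example.csv', index=False)  # example output path
print(top.head())
```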
---

## 🔧 Configuration

### **Coordinator Settings**
```python
# cluster/v9_advanced_coordinator.py
WORKERS = {
    'worker1': {'host': '10.10.254.106', 'port': 22},
    'worker2': {'host': '10.20.254.100', 'port': 22, 'proxy_jump': '10.10.254.106'}
}

CHUNK_SIZE = 2000       # Combinations per chunk
MAX_WORKERS = 24        # 12 per server
CHECK_INTERVAL = 60     # Status check frequency (seconds)
```
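A simplified sketch of how a `proxy_jump` entry can be turned into a nested-hop command; the real dispatch logic lives in `v9_advanced_coordinator.py`, and `ssh_command()` here is illustrative:

```python
import subprocess

def ssh_command(worker_cfg, remote_cmd):
    """Build an ssh invocation, adding a ProxyJump hop when the config asks for one."""
    cmd = ['ssh', '-o', 'StrictHostKeyChecking=no']
    if 'proxy_jump' in worker_cfg:
        cmd += ['-J', f"root@{worker_cfg['proxy_jump']}"]
    return cmd + [f"root@{worker_cfg['host']}", remote_cmd]

worker2 = {'host': '10.20.254.100', 'port': 22, 'proxy_jump': '10.10.254.106'}
# 60 s timeout: the nested hop needs more than 30 s (see Common Pitfall #64 below)
result = subprocess.run(ssh_command(worker2, 'uptime'),
                        capture_output=True, text=True, timeout=60)
print(result.stdout)
```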
### **Worker Settings**
```python
# cluster/distributed_worker.py
NUM_PROCESSES = 12      # Parallel backtests
BATCH_SIZE = 100        # Save results every N combos
TIMEOUT = 120           # Per-combo timeout (seconds)
```
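These three settings interact roughly as in the sketch below: a pool of `NUM_PROCESSES` processes, results flushed every `BATCH_SIZE` combos, and a per-combo guard of `TIMEOUT` seconds. `run_backtest` and `save_batch` are placeholders rather than the actual worker functions, and the timeout handling is a simplification of whatever `distributed_worker.py` really does:

```python
from multiprocessing import Pool, TimeoutError

NUM_PROCESSES = 12
BATCH_SIZE = 100
TIMEOUT = 120

def run_backtest(combo):
    ...  # placeholder: evaluate one parameter combination

def process_chunk(combos, save_batch):
    results = []
    with Pool(NUM_PROCESSES) as pool:
        handles = [(c, pool.apply_async(run_backtest, (c,))) for c in combos]
        for combo, handle in handles:
            try:
                results.append((combo, handle.get(timeout=TIMEOUT)))
            except TimeoutError:
                continue  # skip a slow/hung combo instead of stalling the chunk
            if len(results) >= BATCH_SIZE:
                save_batch(results)
                results = []
    if results:
        save_batch(results)
```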
### **SSH Configuration**
```bash
# ~/.ssh/config
Host worker1
    HostName 10.10.254.106
    User root
    StrictHostKeyChecking no
    ServerAliveInterval 30

Host worker2
    HostName 10.20.254.100
    User root
    ProxyJump worker1
    StrictHostKeyChecking no
    ServerAliveInterval 30
```
---

## 🐛 Common Issues

### **SSH Timeout Errors**
**Symptom:** "SSH command timed out for worker2"

**Root Cause:** Nested hop requires 60s timeout (not 30s)

**Fix:** Common Pitfall #64 - Increase subprocess timeout
```python
result = subprocess.run(ssh_cmd, timeout=60)  # Not 30
```
### **Database Lock Errors**
**Symptom:** "database is locked"

**Root Cause:** Multiple workers writing simultaneously

**Fix:** Use WAL mode + increase busy_timeout
```python
connection.execute('PRAGMA journal_mode=WAL')
connection.execute('PRAGMA busy_timeout=10000')
```
### **Worker Not Processing**
**Symptom:** Chunk status='running' but no worker processes

**Root Cause:** Worker crashed or SSH session died

**Fix:** Clean up database + restart
```bash
# Re-queue chunks stuck in 'running' for more than 10 minutes
sqlite3 exploration.db "UPDATE chunks SET status='pending', assigned_worker=NULL WHERE status='running' AND started_at < (strftime('%s','now') - 600);"
```
### **Status Shows "Idle" When Running**
**Symptom:** Dashboard shows idle despite workers running

**Root Cause:** SSH detection timing out, database not queried first

**Fix:** Database-first status detection (Common Pitfall #71)
```typescript
// Check database BEFORE SSH
const hasRunningChunks = explorationData.chunks.running > 0
if (hasRunningChunks) clusterStatus = 'active'
```
---

## 📊 Performance Metrics

**v9 Exhaustive Sweep (65,536 combos):**
- **Duration:** ~29 hours (24 workers)
- **Speed:** 1.60s per combo effective (vs ~4s per combo for the 72-hour, 6-worker local estimate)
- **Throughput:** ~37 combos/minute across the cluster
- **Data processed:** 139,678 OHLCV rows × 65,536 combos ≈ 9.15B calculations
- **Results:** Top 100 saved to CSV (~10KB file)

**Cost Analysis:**
- Local (6 workers): 72 hours estimated
- EPYC (24 workers): 29 hours actual
- **Time savings:** 43 hours (60% faster)
- **Resource utilization:** 64 cores across the cluster vs 6 local workers
---

## 📝 Adding Cluster Features

**When to Use Cluster:**
- Parameter sweeps >10,000 combinations
- Backtests requiring >24 hours on local machine
- Multi-strategy comparison (need parallel execution)
- Production validation (test many configs simultaneously)

**Scaling Guidelines:**
- **<1,000 combos:** Local machine sufficient
- **1,000-10,000:** Single EPYC server (12 workers)
- **10,000-100,000:** Both EPYC servers (24 workers)
- **100,000+:** Consider cloud scaling (AWS Batch, etc.)
---

## ⚠️ Important Notes

**Data Transfer:**
- Always compress packages: `tar -czf` (1.9MB → 1.1MB)
- Verify checksums after transfer (see the sketch after this list)
- Use rsync for incremental updates
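
One way to run that checksum check from the master, using the paths and host from "Starting a Sweep" above (this is a sketch, not an existing script; worker2 would additionally need the ProxyJump hop):

```python
import hashlib
import subprocess

LOCAL = '/home/icke/traderv4/backtester/backtest_v9_sweep.tar.gz'
REMOTE = '/home/backtest/backtest_v9_sweep.tar.gz'

# Local digest
local_sha = hashlib.sha256(open(LOCAL, 'rb').read()).hexdigest()

# Remote digest via ssh to worker1
out = subprocess.run(
    ['ssh', 'root@10.10.254.106', f'sha256sum {REMOTE}'],
    capture_output=True, text=True, timeout=60, check=True,
)
remote_sha = out.stdout.split()[0]

print('match' if local_sha == remote_sha else 'MISMATCH - re-transfer the package')
```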
**Process Management:**
- Always use `nohup` or `screen` for long-running coordinators
- Workers auto-terminate when their chunks complete
- Coordinator sends a Telegram notification on completion

**Database Safety:**
- SQLite WAL mode prevents most lock errors
- Back up exploration.db before major sweeps
- Never edit the chunks table manually while the coordinator is running

**SSH Reliability:**
- ServerAliveInterval prevents silent disconnects
- StrictHostKeyChecking=no avoids interactive prompts
- ProxyJump handles nested hops automatically

---

See `../README.md` for overall documentation structure.