# Distributed Computing & EPYC Cluster
**Infrastructure for large-scale parameter optimization and backtesting.**
This directory contains documentation for the EPYC cluster setup, distributed backtesting coordination, and multi-server infrastructure.
---
## 🖥️ Cluster Documentation
### **EPYC Server Setup**
- `EPYC_SETUP_COMPREHENSIVE.md` - **Complete setup guide**
  - Hardware: AMD EPYC 7282 16-Core Processor (Debian 12 Bookworm)
  - Python environment: 3.11.2 with pandas 2.3.3, numpy 2.3.5
  - SSH configuration: Nested hop (master → worker1 → worker2)
  - Package deployment: tar.gz transfer with virtual environment
  - **Status:** ✅ OPERATIONAL (24 workers processing 65,536 combos)
### **Distributed Architecture**
- `DUAL_SWEEP_README.md` - **Parallel sweep execution**
  - Coordinator: Assigns chunks to workers
  - Workers: Execute parameter combinations in parallel
  - Database: SQLite exploration.db for state tracking
  - Results: CSV files with top N configurations
  - **Use case:** v9 exhaustive parameter optimization (Nov 28-29, 2025)
### **Cluster Control**
- `CLUSTER_START_BUTTON_FIX.md` - **Web UI integration**
  - Dashboard: http://localhost:3001/cluster
  - Start/Stop buttons with status detection
  - Database-first status (SSH supplementary)
  - Real-time progress tracking
  - **Status:** ✅ DEPLOYED (Nov 30, 2025)
---
## 🏗️ Cluster Architecture
### **Physical Infrastructure**
```
Master Server (local development machine)
├── Coordinator Process (assigns chunks)
├── Database (exploration.db)
└── Web Dashboard (Next.js)
    ↓ [SSH]
Worker1 (EPYC 10.10.254.106)
├── 12 worker processes
├── 64GB RAM
└── Direct SSH connection
    ↓ [SSH ProxyJump]
Worker2 (EPYC 10.20.254.100)
├── 12 worker processes
├── 64GB RAM
└── Via worker1 hop
```
### **Data Flow**
```
1. Coordinator creates chunks (2,000 combos each)
2. Marks chunk status='pending' in database
3. Worker queries database for pending chunks
4. Coordinator assigns chunk to worker via SSH
5. Worker updates status='running'
6. Worker processes combinations in parallel
7. Worker saves results to strategies table
8. Worker updates status='completed'
9. Coordinator assigns next pending chunk
10. Dashboard shows real-time progress
```
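A minimal sketch of what this loop might look like on the coordinator side, assuming the `chunks` schema shown in the next section. The `dispatch` helper and the remote command it launches are illustrative, not the actual internals of `v9_advanced_coordinator.py`:
```python
import sqlite3
import subprocess
import time

DB_PATH = "cluster/exploration.db"   # same path used elsewhere in this README
WORKERS = ["worker1", "worker2"]     # SSH host aliases from ~/.ssh/config

def claim_next_chunk(conn: sqlite3.Connection, worker: str):
    """Pick one pending chunk and mark it running for this worker (steps 2-5)."""
    row = conn.execute(
        "SELECT id FROM chunks WHERE status = 'pending' ORDER BY start_combo LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE chunks SET status = 'running', assigned_worker = ?, "
        "started_at = strftime('%s','now') WHERE id = ? AND status = 'pending'",
        (worker, row[0]),
    )
    conn.commit()
    return row[0]

def dispatch(worker: str, chunk_id: str) -> None:
    """Launch the remote worker for one chunk (hypothetical remote command)."""
    remote = f"cd /home/backtest && nohup python distributed_worker.py {chunk_id} > /dev/null 2>&1 &"
    subprocess.run(["ssh", worker, remote], timeout=60, check=True)  # 60s: nested hop

def main() -> None:
    conn = sqlite3.connect(DB_PATH)
    while conn.execute("SELECT COUNT(*) FROM chunks WHERE status = 'pending'").fetchone()[0]:
        for worker in WORKERS:
            busy = conn.execute(
                "SELECT COUNT(*) FROM chunks WHERE status = 'running' AND assigned_worker = ?",
                (worker,),
            ).fetchone()[0]
            if busy == 0:                      # step 9: assign the next chunk once free
                chunk_id = claim_next_chunk(conn, worker)
                if chunk_id:
                    dispatch(worker, chunk_id)
        time.sleep(60)                         # CHECK_INTERVAL

if __name__ == "__main__":
    main()
```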
### **Database Schema**
```sql
-- chunks table: Work distribution
CREATE TABLE chunks (
    id TEXT PRIMARY KEY,      -- v9_chunk_000000
    start_combo INTEGER,      -- 0, 2000, 4000, etc.
    end_combo INTEGER,        -- 2000, 4000, 6000, etc.
    status TEXT,              -- 'pending', 'running', 'completed'
    assigned_worker TEXT,     -- 'worker1', 'worker2'
    started_at INTEGER,
    completed_at INTEGER
);
-- strategies table: Results storage
CREATE TABLE strategies (
    id INTEGER PRIMARY KEY,
    chunk_id TEXT,
    params TEXT,              -- JSON of parameter values
    pnl REAL,
    win_rate REAL,
    profit_factor REAL,
    max_drawdown REAL,
    total_trades INTEGER
);
```
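For illustration, seeding the `chunks` table for the 65,536-combo sweep at 2,000 combos per chunk could look like this (a sketch against the schema above; the ID format follows the `v9_chunk_000000` comment):
```python
import sqlite3

TOTAL_COMBOS = 65_536
CHUNK_SIZE = 2_000

conn = sqlite3.connect("cluster/exploration.db")
for i, start in enumerate(range(0, TOTAL_COMBOS, CHUNK_SIZE)):
    end = min(start + CHUNK_SIZE, TOTAL_COMBOS)
    conn.execute(
        "INSERT OR IGNORE INTO chunks (id, start_combo, end_combo, status) "
        "VALUES (?, ?, ?, 'pending')",
        (f"v9_chunk_{i:06d}", start, end),
    )
conn.commit()
# 65,536 / 2,000 -> 33 chunks; the last chunk covers the remaining 1,536 combos
```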
---
## 🚀 Using the Cluster
### **Starting a Sweep**
```bash
# 1. Prepare package on master
cd /home/icke/traderv4/backtester
tar -czf backtest_v9_sweep.tar.gz data/ backtester_core.py v9_moneyline_ma_gap.py moneyline_core.py
# 2. Transfer to EPYC workers
scp backtest_v9_sweep.tar.gz root@10.10.254.106:/home/backtest/
ssh root@10.10.254.106 "scp backtest_v9_sweep.tar.gz root@10.20.254.100:/home/backtest/"
# 3. Extract on workers
ssh root@10.10.254.106 "cd /home/backtest && tar -xzf backtest_v9_sweep.tar.gz"
ssh root@10.10.254.106 "ssh root@10.20.254.100 'cd /home/backtest && tar -xzf backtest_v9_sweep.tar.gz'"
# 4. Start via web dashboard or CLI
# Web: http://localhost:3001/cluster → Click "Start Cluster"
# CLI: cd /home/icke/traderv4/cluster && python v9_advanced_coordinator.py
```
### **Monitoring Progress**
```bash
# Dashboard
curl -s http://localhost:3001/api/cluster/status | jq
# Database query
sqlite3 cluster/exploration.db "SELECT status, COUNT(*) FROM chunks GROUP BY status;"
# Worker processes
ssh root@10.10.254.106 "ps aux | grep [p]ython | grep backtest | wc -l"
```
### **Collecting Results**
```bash
# Results saved to cluster/results/
ls -lh cluster/results/sweep_v9_*.csv
# Top 100 configurations by P&L
sqlite3 cluster/exploration.db "SELECT params, pnl, win_rate FROM strategies ORDER BY pnl DESC LIMIT 100;"
```
---
## 🔧 Configuration
### **Coordinator Settings**
```python
# cluster/v9_advanced_coordinator.py
WORKERS = {
    'worker1': {'host': '10.10.254.106', 'port': 22},
    'worker2': {'host': '10.20.254.100', 'port': 22, 'proxy_jump': '10.10.254.106'}
}
CHUNK_SIZE = 2000 # Combinations per chunk
MAX_WORKERS = 24 # 12 per server
CHECK_INTERVAL = 60 # Status check frequency (seconds)
```
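The `proxy_jump` entry means worker2 is only reachable through worker1. A sketch of how these settings could be translated into an SSH invocation (illustrative; not necessarily how the coordinator builds its commands):
```python
def build_ssh_cmd(worker: dict, remote_cmd: str) -> list[str]:
    """Build an ssh argument list, adding -J for hosts that need a jump host."""
    cmd = ["ssh", "-p", str(worker["port"])]
    if "proxy_jump" in worker:
        cmd += ["-J", f"root@{worker['proxy_jump']}"]  # nested hop via worker1
    cmd += [f"root@{worker['host']}", remote_cmd]
    return cmd

# worker2 -> ssh -p 22 -J root@10.10.254.106 root@10.20.254.100 uptime
print(build_ssh_cmd(
    {"host": "10.20.254.100", "port": 22, "proxy_jump": "10.10.254.106"}, "uptime"
))
```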
### **Worker Settings**
```python
# cluster/distributed_worker.py
NUM_PROCESSES = 12 # Parallel backtests
BATCH_SIZE = 100 # Save results every N combos
TIMEOUT = 120 # Per-combo timeout (seconds)
```
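A sketch of how these settings might drive the worker's main loop: a 12-process pool with results flushed to the `strategies` table every `BATCH_SIZE` combos. `run_backtest` is a hypothetical stand-in for the real per-combo backtest in `distributed_worker.py`:
```python
import json
import sqlite3
from multiprocessing import Pool

NUM_PROCESSES = 12   # Parallel backtests
BATCH_SIZE = 100     # Save results every N combos

def run_backtest(params: dict) -> dict:
    """Hypothetical stand-in: run one backtest and return the metrics stored per strategy."""
    return {"params": params, "pnl": 0.0, "win_rate": 0.0,
            "profit_factor": 0.0, "max_drawdown": 0.0, "total_trades": 0}

def save_batch(conn: sqlite3.Connection, chunk_id: str, batch: list[dict]) -> None:
    conn.executemany(
        "INSERT INTO strategies (chunk_id, params, pnl, win_rate, profit_factor, "
        "max_drawdown, total_trades) VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(chunk_id, json.dumps(r["params"]), r["pnl"], r["win_rate"],
          r["profit_factor"], r["max_drawdown"], r["total_trades"]) for r in batch],
    )
    conn.commit()

def process_chunk(chunk_id: str, combos: list[dict], db_path: str = "exploration.db") -> None:
    conn = sqlite3.connect(db_path)
    batch = []
    with Pool(NUM_PROCESSES) as pool:
        for result in pool.imap_unordered(run_backtest, combos):
            batch.append(result)
            if len(batch) >= BATCH_SIZE:   # flush every BATCH_SIZE combos
                save_batch(conn, chunk_id, batch)
                batch = []
    if batch:                              # flush the final partial batch
        save_batch(conn, chunk_id, batch)
```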
### **SSH Configuration**
```bash
# ~/.ssh/config
Host worker1
    HostName 10.10.254.106
    User root
    StrictHostKeyChecking no
    ServerAliveInterval 30

Host worker2
    HostName 10.20.254.100
    User root
    ProxyJump worker1
    StrictHostKeyChecking no
    ServerAliveInterval 30
```
---
## 🐛 Common Issues
### **SSH Timeout Errors**
**Symptom:** "SSH command timed out for worker2"
**Root Cause:** Nested hop requires 60s timeout (not 30s)
**Fix:** Common Pitfall #64 - Increase subprocess timeout
```python
result = subprocess.run(ssh_cmd, timeout=60) # Not 30
```
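A slightly fuller sketch that also catches the timeout instead of letting it crash the coordinator (an assumed helper, not the actual coordinator code):
```python
import subprocess

def run_ssh(ssh_cmd: list[str], timeout: int = 60) -> str | None:
    """Run an SSH command with the longer timeout the nested hop needs."""
    try:
        result = subprocess.run(ssh_cmd, capture_output=True, text=True, timeout=timeout)
        return result.stdout
    except subprocess.TimeoutExpired:
        return None  # treat as "worker unreachable" and retry later
```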
### **Database Lock Errors**
**Symptom:** "database is locked"
**Root Cause:** Multiple workers writing simultaneously
**Fix:** Use WAL mode + increase busy_timeout
```python
connection.execute('PRAGMA journal_mode=WAL')
connection.execute('PRAGMA busy_timeout=10000')
```
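One way to apply both pragmas consistently is a small connection helper used wherever the database is opened (a sketch; the real code may set these pragmas elsewhere):
```python
import sqlite3

def open_db(path: str = "exploration.db") -> sqlite3.Connection:
    """Open the exploration database with settings that tolerate concurrent writers."""
    conn = sqlite3.connect(path, timeout=10)   # Python-side wait for locks
    conn.execute("PRAGMA journal_mode=WAL")    # readers no longer block writers
    conn.execute("PRAGMA busy_timeout=10000")  # wait up to 10s before "database is locked"
    return conn
```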
### **Worker Not Processing**
**Symptom:** Chunk status='running' but no worker processes
**Root Cause:** Worker crashed or SSH session died
**Fix:** Cleanup database + restart
```bash
# Mark stuck chunks as pending
sqlite3 exploration.db "UPDATE chunks SET status='pending', assigned_worker=NULL WHERE status='running' AND started_at < (strftime('%s','now') - 600);"
```
### **Status Shows "Idle" When Running**
**Symptom:** Dashboard shows idle despite workers running
**Root Cause:** SSH detection timing out, database not queried first
**Fix:** Database-first status detection (Common Pitfall #71)
```typescript
// Check database BEFORE SSH
const hasRunningChunks = explorationData.chunks.running > 0
if (hasRunningChunks) clusterStatus = 'active'
```
---
## 📊 Performance Metrics
**v9 Exhaustive Sweep (65,536 combos):**
- **Duration:** ~29 hours (24 workers)
- **Speed:** 1.60 s per combo (effective cluster-wide rate; ~4× faster than 6 local workers)
- **Throughput:** ~37 combos/minute across cluster
- **Data processed:** 139,678 OHLCV rows × 65,536 combos ≈ 9.15B calculations (re-derived below)
- **Results:** Top 100 saved to CSV (~10KB file)
**Cost Analysis:**
- Local (6 workers): 72 hours estimated
- EPYC (24 workers): 29 hours actual
- **Time savings:** 43 hours (60% faster)
- **Resource utilization:** 64 hardware threads (2× 16-core EPYC with SMT) vs 6 local workers
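A quick consistency check of the figures above, using only the numbers already listed:
```python
combos = 65_536
rows = 139_678
secs_per_combo = 1.60                    # effective cluster-wide rate

print(combos * secs_per_combo / 3600)    # ~29.1 hours total wall clock
print(60 / secs_per_combo)               # ~37.5 combos/minute
print(rows * combos / 1e9)               # ~9.15 billion calculations
print((72 - 29) / 72 * 100)              # ~59.7% time saved vs the local estimate
```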
---
## 📝 Adding Cluster Features
**When to Use Cluster:**
- Parameter sweeps >10,000 combinations
- Backtests requiring >24 hours on local machine
- Multi-strategy comparison (need parallel execution)
- Production validation (test many configs simultaneously)
**Scaling Guidelines:**
- **<1,000 combos:** Local machine sufficient
- **1,000-10,000:** Single EPYC server (12 workers)
- **10,000-100,000:** Both EPYC servers (24 workers)
- **100,000+:** Consider cloud scaling (AWS Batch, etc.)
---
## ⚠️ Important Notes
**Data Transfer:**
- Always compress packages: `tar -czf` (1.9MB → 1.1MB)
- Verify checksums after transfer
- Use rsync for incremental updates
**Process Management:**
- Always use `nohup` or `screen` for long-running coordinators
- Workers auto-terminate when chunks complete
- Coordinator sends Telegram notification on completion
**Database Safety:**
- SQLite WAL mode prevents most lock errors
- Backup exploration.db before major sweeps (see the sketch below)
- Never edit chunks table manually while coordinator running
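For the backup step, SQLite's online backup API (built into Python's `sqlite3` module) copies the live database safely even mid-sweep; a minimal sketch with an assumed output path:
```python
import sqlite3
import time

# Timestamped copy next to the live DB; safe with WAL and a running coordinator.
src = sqlite3.connect("cluster/exploration.db")
dst = sqlite3.connect(f"cluster/exploration_{time.strftime('%Y%m%d_%H%M%S')}.db")
src.backup(dst)
dst.close()
src.close()
```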
**SSH Reliability:**
- ServerAliveInterval prevents silent disconnects
- StrictHostKeyChecking=no avoids interactive prompts
- ProxyJump handles nested hops automatically
---
See `../README.md` for overall documentation structure.