docs: Add 1-minute simplified price feed to reduce TradingView alert queue pressure

- Create moneyline_1min_price_feed.pinescript (70% smaller payload)
- Remove ATR/ADX/RSI/VOL/POS from 1-minute alerts (not used for decisions)
- Keep only price + symbol + timeframe for market data cache
- Document rationale in docs/1MIN_SIMPLIFIED_FEED.md
- Fix: 5-minute trading signals being dropped due to 1-minute flood (60/hour)
- Impact: Preserve priority for actual trading signals
Commit dc674ec6d5 (parent 4c36fa2bc3), mindesbunister, 2025-12-04 11:19:04 +01:00. 12 changed files, 2,476 additions, 0 deletions.

docs/cluster/README.md (new file, 291 lines)
# Distributed Computing & EPYC Cluster
**Infrastructure for large-scale parameter optimization and backtesting.**
This directory contains documentation for the EPYC cluster setup, distributed backtesting coordination, and multi-server infrastructure.
---
## 🖥️ Cluster Documentation
### **EPYC Server Setup**
- `EPYC_SETUP_COMPREHENSIVE.md` - **Complete setup guide**
  - Hardware: AMD EPYC 7282 16-Core Processor (Debian 12 Bookworm)
  - Python environment: 3.11.2 with pandas 2.3.3, numpy 2.3.5
  - SSH configuration: Nested hop (master → worker1 → worker2)
  - Package deployment: tar.gz transfer with virtual environment
  - **Status:** ✅ OPERATIONAL (24 workers processing 65,536 combos)
### **Distributed Architecture**
- `DUAL_SWEEP_README.md` - **Parallel sweep execution**
  - Coordinator: Assigns chunks to workers
  - Workers: Execute parameter combinations in parallel
  - Database: SQLite exploration.db for state tracking
  - Results: CSV files with top N configurations
  - **Use case:** v9 exhaustive parameter optimization (Nov 28-29, 2025)
### **Cluster Control**
- `CLUSTER_START_BUTTON_FIX.md` - **Web UI integration**
  - Dashboard: http://localhost:3001/cluster
  - Start/Stop buttons with status detection
  - Database-first status (SSH supplementary)
  - Real-time progress tracking
  - **Status:** ✅ DEPLOYED (Nov 30, 2025)
---
## 🏗️ Cluster Architecture
### **Physical Infrastructure**
```
Master Server (local development machine)
├── Coordinator Process (assigns chunks)
├── Database (exploration.db)
└── Web Dashboard (Next.js)
↓ [SSH]
Worker1 (EPYC 10.10.254.106)
├── 12 worker processes
├── 64GB RAM
└── Direct SSH connection
↓ [SSH ProxyJump]
Worker2 (EPYC 10.20.254.100)
├── 12 worker processes
├── 64GB RAM
└── Via worker1 hop
```
### **Data Flow**
```
1. Coordinator creates chunks (2,000 combos each)
2. Marks chunk status='pending' in database
3. Worker queries database for pending chunks
4. Coordinator assigns chunk to worker via SSH
5. Worker updates status='running'
6. Worker processes combinations in parallel
7. Worker saves results to strategies table
8. Worker updates status='completed'
9. Coordinator assigns next pending chunk
10. Dashboard shows real-time progress
```
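As a rough illustration of steps 2 through 5, the sketch below claims one pending chunk for a worker inside a single write transaction. `dispatch_to_worker()` is a hypothetical stand-in for the SSH launch performed by the real `v9_advanced_coordinator.py`, idle-worker tracking and notifications are omitted, and the table layout is the one shown in the next section.

```python
import sqlite3
import time

DB_PATH = "cluster/exploration.db"
CHECK_INTERVAL = 60  # seconds between passes, mirroring the coordinator setting further down

def claim_next_chunk(conn, worker):
    """Steps 2-5: atomically move one pending chunk to 'running' for the given worker."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading so two claims can't race
    row = conn.execute(
        "SELECT id, start_combo, end_combo FROM chunks "
        "WHERE status='pending' ORDER BY start_combo LIMIT 1").fetchone()
    if row is None:
        conn.execute("ROLLBACK")
        return None
    conn.execute(
        "UPDATE chunks SET status='running', assigned_worker=?, "
        "started_at=strftime('%s','now') WHERE id=?", (worker, row[0]))
    conn.execute("COMMIT")
    return row

def assignment_loop(dispatch_to_worker, workers=("worker1", "worker2")):
    """Keep workers fed with chunks; steps 6-8 run on the workers themselves."""
    conn = sqlite3.connect(DB_PATH, isolation_level=None)  # autocommit; transactions handled above
    while True:
        for worker in workers:
            chunk = claim_next_chunk(conn, worker)
            if chunk is not None:
                dispatch_to_worker(worker, chunk)  # hypothetical: launches the remote backtest via SSH
        time.sleep(CHECK_INTERVAL)
```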
### **Database Schema**
```sql
-- chunks table: Work distribution
CREATE TABLE chunks (
    id TEXT PRIMARY KEY,       -- v9_chunk_000000
    start_combo INTEGER,       -- 0, 2000, 4000, etc.
    end_combo INTEGER,         -- 2000, 4000, 6000, etc.
    status TEXT,               -- 'pending', 'running', 'completed'
    assigned_worker TEXT,      -- 'worker1', 'worker2'
    started_at INTEGER,
    completed_at INTEGER
);
-- strategies table: Results storage
CREATE TABLE strategies (
    id INTEGER PRIMARY KEY,
    chunk_id TEXT,
    params TEXT,               -- JSON of parameter values
    pnl REAL,
    win_rate REAL,
    profit_factor REAL,
    max_drawdown REAL,
    total_trades INTEGER
);
```
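A worker persisting a batch of results into the `strategies` table could do so roughly as follows; the shape of each result dict is an assumption, and the PRAGMAs anticipate the lock-handling advice under Common Issues below.

```python
import json
import sqlite3

def save_batch(db_path, chunk_id, results):
    """Append one batch of backtest results (a list of dicts) to the strategies table."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")      # see 'Database Lock Errors' below
    conn.execute("PRAGMA busy_timeout=10000")
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT INTO strategies "
            "(chunk_id, params, pnl, win_rate, profit_factor, max_drawdown, total_trades) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            [(chunk_id, json.dumps(r["params"]), r["pnl"], r["win_rate"],
              r["profit_factor"], r["max_drawdown"], r["total_trades"]) for r in results])
    conn.close()
```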
---
## 🚀 Using the Cluster
### **Starting a Sweep**
```bash
# 1. Prepare package on master
cd /home/icke/traderv4/backtester
tar -czf backtest_v9_sweep.tar.gz data/ backtester_core.py v9_moneyline_ma_gap.py moneyline_core.py
# 2. Transfer to EPYC workers
scp backtest_v9_sweep.tar.gz root@10.10.254.106:/home/backtest/
ssh root@10.10.254.106 "scp backtest_v9_sweep.tar.gz root@10.20.254.100:/home/backtest/"
# 3. Extract on workers
ssh root@10.10.254.106 "cd /home/backtest && tar -xzf backtest_v9_sweep.tar.gz"
ssh root@10.10.254.106 "ssh root@10.20.254.100 'cd /home/backtest && tar -xzf backtest_v9_sweep.tar.gz'"
# 4. Start via web dashboard or CLI
# Web: http://localhost:3001/cluster → Click "Start Cluster"
# CLI: cd /home/icke/traderv4/cluster && python v9_advanced_coordinator.py
```
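When the transfer is driven from a script rather than typed by hand, a thin wrapper such as the sketch below (paths and hosts copied from the commands above) keeps a failed `scp` from going unnoticed; it is an illustration, not the project's actual deployment code.

```python
import subprocess

PACKAGE = "backtest_v9_sweep.tar.gz"
WORKER1 = "root@10.10.254.106"
WORKER2 = "root@10.20.254.100"

def run(cmd):
    """Run one local command, echo it, and abort on any non-zero exit."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True, timeout=600)

# Step 2: push to worker1, then hop the same file on to worker2
run(["scp", PACKAGE, f"{WORKER1}:/home/backtest/"])
run(["ssh", WORKER1, f"scp /home/backtest/{PACKAGE} {WORKER2}:/home/backtest/"])

# Step 3: extract on both workers
run(["ssh", WORKER1, f"cd /home/backtest && tar -xzf {PACKAGE}"])
run(["ssh", WORKER1, f"ssh {WORKER2} 'cd /home/backtest && tar -xzf {PACKAGE}'"])
```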
### **Monitoring Progress**
```bash
# Dashboard
curl -s http://localhost:3001/api/cluster/status | jq
# Database query
sqlite3 cluster/exploration.db "SELECT status, COUNT(*) FROM chunks GROUP BY status;"
# Worker processes
ssh root@10.10.254.106 "ps aux | grep [p]ython | grep backtest | wc -l"
```
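When the dashboard is unavailable, the same chunk counts can be polled from Python; a minimal sketch against the schema above:

```python
import sqlite3
import time

DB_PATH = "cluster/exploration.db"

def print_progress(conn):
    """Print chunk counts by status plus a rough percent-complete figure."""
    counts = dict(conn.execute("SELECT status, COUNT(*) FROM chunks GROUP BY status"))
    total = sum(counts.values()) or 1
    done = counts.get("completed", 0)
    print(f"pending={counts.get('pending', 0)} running={counts.get('running', 0)} "
          f"completed={done} ({100.0 * done / total:.1f}%)")

if __name__ == "__main__":
    conn = sqlite3.connect(DB_PATH)
    while True:
        print_progress(conn)
        time.sleep(60)
```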
### **Collecting Results**
```bash
# Results saved to cluster/results/
ls -lh cluster/results/sweep_v9_*.csv
# Top 100 configurations by P&L
sqlite3 cluster/exploration.db "SELECT params, pnl, win_rate FROM strategies ORDER BY pnl DESC LIMIT 100;"
```
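Since pandas is already part of the environment (see the EPYC setup notes above), the same query can be pulled straight into a DataFrame; the output filename below is illustrative.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("cluster/exploration.db")
top = pd.read_sql_query(
    "SELECT params, pnl, win_rate, profit_factor, max_drawdown, total_trades "
    "FROM strategies ORDER BY pnl DESC LIMIT 100",
    conn)
top.to_csv("cluster/results/sweep_v9_top100.csv", index=False)  # illustrative output path
print(top.head())
```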
---
## 🔧 Configuration
### **Coordinator Settings**
```python
# cluster/v9_advanced_coordinator.py
WORKERS = {
    'worker1': {'host': '10.10.254.106', 'port': 22},
    'worker2': {'host': '10.20.254.100', 'port': 22, 'proxy_jump': '10.10.254.106'}
}
CHUNK_SIZE = 2000 # Combinations per chunk
MAX_WORKERS = 24 # 12 per server
CHECK_INTERVAL = 60 # Status check frequency (seconds)
```
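One way the WORKERS entries can be turned into concrete SSH invocations (the real coordinator may do this differently) is to translate the optional `proxy_jump` field into a `-J` hop:

```python
def ssh_command(name, workers, remote_cmd):
    """Build the ssh argv for a worker, adding -J when it sits behind the worker1 hop."""
    w = workers[name]
    cmd = ["ssh", "-o", "StrictHostKeyChecking=no", "-p", str(w["port"])]
    if "proxy_jump" in w:
        cmd += ["-J", f"root@{w['proxy_jump']}"]  # worker2 is only reachable via worker1
    cmd += [f"root@{w['host']}", remote_cmd]
    return cmd

# e.g. ssh_command('worker2', WORKERS, "cd /home/backtest && nohup python distributed_worker.py &")
```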
### **Worker Settings**
```python
# cluster/distributed_worker.py
NUM_PROCESSES = 12 # Parallel backtests
BATCH_SIZE = 100 # Save results every N combos
TIMEOUT = 120 # Per-combo timeout (seconds)
```
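Internally, a chunk can be worked through with a process pool along these lines; `run_one_combo` and `save_batch` are hypothetical hooks standing in for the actual backtest and the database write, and the per-combo TIMEOUT handling is omitted for brevity.

```python
import multiprocessing as mp

NUM_PROCESSES = 12   # mirrors the worker setting above
BATCH_SIZE = 100

def run_chunk(start_combo, end_combo, run_one_combo, save_batch):
    """Backtest every combo index in [start_combo, end_combo) and flush results in batches."""
    batch = []
    with mp.Pool(processes=NUM_PROCESSES) as pool:
        # run_one_combo must be a top-level (picklable) function for the pool to dispatch it
        for result in pool.imap_unordered(run_one_combo, range(start_combo, end_combo)):
            batch.append(result)
            if len(batch) >= BATCH_SIZE:
                save_batch(batch)   # e.g. the insert shown in the Database Schema section
                batch = []
    if batch:
        save_batch(batch)           # flush the final partial batch
```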
### **SSH Configuration**
```bash
# ~/.ssh/config
Host worker1
    HostName 10.10.254.106
    User root
    StrictHostKeyChecking no
    ServerAliveInterval 30

Host worker2
    HostName 10.20.254.100
    User root
    ProxyJump worker1
    StrictHostKeyChecking no
    ServerAliveInterval 30
```
---
## 🐛 Common Issues
### **SSH Timeout Errors**
**Symptom:** "SSH command timed out for worker2"
**Root Cause:** Nested hop requires 60s timeout (not 30s)
**Fix:** Common Pitfall #64 - Increase subprocess timeout
```python
result = subprocess.run(ssh_cmd, timeout=60) # Not 30
```
### **Database Lock Errors**
**Symptom:** "database is locked"
**Root Cause:** Multiple workers writing simultaneously
**Fix:** Use WAL mode + increase busy_timeout
```python
connection.execute('PRAGMA journal_mode=WAL')    # readers no longer block the single writer
connection.execute('PRAGMA busy_timeout=10000')  # wait up to 10s for the lock instead of raising
```
### **Worker Not Processing**
**Symptom:** Chunk status='running' but no worker processes
**Root Cause:** Worker crashed or SSH session died
**Fix:** Cleanup database + restart
```bash
# Mark stuck chunks as pending
sqlite3 exploration.db "UPDATE chunks SET status='pending', assigned_worker=NULL WHERE status='running' AND started_at < (strftime('%s','now') - 600);"
```
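The same cleanup can be folded into the coordinator as a housekeeping call, roughly like this (the 600-second threshold matches the SQL above):

```python
import sqlite3

STALE_AFTER_S = 600  # a 'running' chunk older than this with no completion is assumed dead

def requeue_stale_chunks(db_path="cluster/exploration.db"):
    """Reset abandoned chunks to 'pending' so the next assignment pass picks them up again."""
    conn = sqlite3.connect(db_path)
    with conn:
        cur = conn.execute(
            "UPDATE chunks SET status='pending', assigned_worker=NULL "
            "WHERE status='running' AND started_at < (strftime('%s','now') - ?)",
            (STALE_AFTER_S,))
    print(f"requeued {cur.rowcount} stale chunk(s)")
    conn.close()
```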
### **Status Shows "Idle" When Running**
**Symptom:** Dashboard shows idle despite workers running
**Root Cause:** SSH detection timing out, database not queried first
**Fix:** Database-first status detection (Common Pitfall #71)
```typescript
// Check database BEFORE SSH
const hasRunningChunks = explorationData.chunks.running > 0
if (hasRunningChunks) clusterStatus = 'active'
```
---
## 📊 Performance Metrics
**v9 Exhaustive Sweep (65,536 combos):**
- **Duration:** ~29 hours (24 workers)
- **Speed:** 1.60s per combo, cluster-wide wall-clock (≈2.5× faster than the 72-hour local estimate with 6 workers)
- **Throughput:** ~37 combos/minute across cluster
- **Data processed:** 139,678 OHLCV rows × 65,536 combos ≈ 9.15B calculations
- **Results:** Top 100 saved to CSV (~10KB file)
**Cost Analysis:**
- Local (6 workers): 72 hours estimated
- EPYC (24 workers): 29 hours actual
- **Time savings:** 43 hours (60% faster)
- **Resource utilization:** 64 hardware threads across two 16-core EPYC servers vs 6 local workers
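The headline figures are easy to re-derive from one another; the snippet below is just that arithmetic, using the numbers quoted above.

```python
combos = 65_536
ohlcv_rows = 139_678
wall_clock_s_per_combo = 1.60                          # cluster-wide, not per worker

calculations = combos * ohlcv_rows                     # 9,153,937,408 ≈ 9.15B row evaluations
duration_h = combos * wall_clock_s_per_combo / 3600    # ≈ 29.1 hours
throughput_per_min = 60 / wall_clock_s_per_combo       # ≈ 37.5 combos/minute

print(f"{calculations/1e9:.2f}B calculations, {duration_h:.1f} h, {throughput_per_min:.1f} combos/min")
```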
---
## 📝 Adding Cluster Features
**When to Use Cluster:**
- Parameter sweeps >10,000 combinations
- Backtests requiring >24 hours on local machine
- Multi-strategy comparison (need parallel execution)
- Production validation (test many configs simultaneously)
**Scaling Guidelines:**
- **<1,000 combos:** Local machine sufficient
- **1,000-10,000:** Single EPYC server (12 workers)
- **10,000-100,000:** Both EPYC servers (24 workers)
- **100,000+:** Consider cloud scaling (AWS Batch, etc.)
---
## ⚠️ Important Notes
**Data Transfer:**
- Always compress packages: `tar -czf` (1.9MB → 1.1MB)
- Verify checksums after transfer (see the sketch after this list)
- Use rsync for incremental updates
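A minimal checksum check, assuming `sha256sum` is available on the workers (it is part of coreutils on Debian); the real deployment may verify differently.

```python
import hashlib
import subprocess

def local_sha256(path):
    """Hash the package on the master before shipping it."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def remote_sha256(host, remote_path):
    """Hash the same file on a worker over SSH."""
    out = subprocess.run(["ssh", host, f"sha256sum {remote_path}"],
                         capture_output=True, text=True, check=True, timeout=60)
    return out.stdout.split()[0]

pkg = "backtest_v9_sweep.tar.gz"
assert local_sha256(pkg) == remote_sha256("root@10.10.254.106", f"/home/backtest/{pkg}"), \
    "checksum mismatch - re-run the transfer"
```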
**Process Management:**
- Always use `nohup` or `screen` for long-running coordinators
- Workers auto-terminate when chunks complete
- Coordinator sends Telegram notification on completion
**Database Safety:**
- SQLite WAL mode prevents most lock errors
- Backup exploration.db before major sweeps
- Never edit chunks table manually while coordinator running
**SSH Reliability:**
- ServerAliveInterval prevents silent disconnects
- StrictHostKeyChecking=no avoids interactive prompts
- ProxyJump handles nested hops automatically
---
See `../README.md` for overall documentation structure.