Distributed Computing & EPYC Cluster

Infrastructure for large-scale parameter optimization and backtesting.

This directory contains documentation for the EPYC cluster setup, distributed backtesting coordination, and multi-server infrastructure.


🖥️ Cluster Documentation

EPYC Server Setup

  • EPYC_SETUP_COMPREHENSIVE.md - Complete setup guide
    • Hardware: AMD EPYC 7282 16-Core Processor (Debian 12 Bookworm)
    • Python environment: 3.11.2 with pandas 2.3.3, numpy 2.3.5
    • SSH configuration: Nested hop (master → worker1 → worker2)
    • Package deployment: tar.gz transfer with virtual environment
    • Status: OPERATIONAL (24 workers processing 65,536 combos)

Distributed Architecture

  • DUAL_SWEEP_README.md - Parallel sweep execution
    • Coordinator: Assigns chunks to workers
    • Workers: Execute parameter combinations in parallel
    • Database: SQLite exploration.db for state tracking
    • Results: CSV files with top N configurations
    • Use case: v9 exhaustive parameter optimization (Nov 28-29, 2025)

Cluster Control

  • CLUSTER_START_BUTTON_FIX.md - Web UI integration
    • Dashboard: http://localhost:3001/cluster
    • Start/Stop buttons with status detection
    • Database-first status (SSH supplementary)
    • Real-time progress tracking
    • Status: DEPLOYED (Nov 30, 2025)

🏗️ Cluster Architecture

Physical Infrastructure

Master Server (local development machine)
  ├── Coordinator Process (assigns chunks)
  ├── Database (exploration.db)
  └── Web Dashboard (Next.js)
       ↓ [SSH]
Worker1 (EPYC 10.10.254.106)
  ├── 12 worker processes
  ├── 64GB RAM
  └── Direct SSH connection
       ↓ [SSH ProxyJump]
Worker2 (EPYC 10.20.254.100)
  ├── 12 worker processes
  ├── 64GB RAM
  └── Via worker1 hop

Data Flow

1. Coordinator creates chunks (2,000 combos each)
   ↓
2. Marks chunk status='pending' in database
   ↓
3. Worker queries database for pending chunks
   ↓
4. Coordinator assigns chunk to worker via SSH
   ↓
5. Worker updates status='running'
   ↓
6. Worker processes combinations in parallel
   ↓
7. Worker saves results to strategies table
   ↓
8. Worker updates status='completed'
   ↓
9. Coordinator assigns next pending chunk
   ↓
10. Dashboard shows real-time progress
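
The flow above can be condensed into a minimal coordinator-loop sketch against exploration.db. This is illustrative only: dispatch_to_worker and the worker command line are placeholders; the production logic lives in cluster/v9_advanced_coordinator.py.

# Minimal coordinator-loop sketch (placeholders, not the production code)
import sqlite3
import subprocess
import time

DB_PATH = "cluster/exploration.db"
WORKER_HOSTS = ["worker1", "worker2"]          # SSH aliases from ~/.ssh/config below

def dispatch_to_worker(worker, chunk_id):
    # Hypothetical remote launch; the actual worker command may differ
    subprocess.Popen(
        ["ssh", worker, f"cd /home/backtest && python distributed_worker.py --chunk {chunk_id}"]
    )

def coordinator_loop():
    conn = sqlite3.connect(DB_PATH)
    conn.execute("PRAGMA journal_mode=WAL")    # see "Database Lock Errors" below
    while True:
        row = conn.execute(
            "SELECT id FROM chunks WHERE status='pending' ORDER BY start_combo LIMIT 1"
        ).fetchone()
        if row is None:
            break                              # no pending chunks left; sweep finishing
        chunk_id = row[0]
        # Pick the worker with the fewest running chunks
        busy = dict(conn.execute(
            "SELECT assigned_worker, COUNT(*) FROM chunks "
            "WHERE status='running' GROUP BY assigned_worker").fetchall())
        worker = min(WORKER_HOSTS, key=lambda w: busy.get(w, 0))
        conn.execute(
            "UPDATE chunks SET status='running', assigned_worker=?, "
            "started_at=strftime('%s','now') WHERE id=?", (worker, chunk_id))
        conn.commit()
        dispatch_to_worker(worker, chunk_id)   # step 4: assign via SSH
        time.sleep(60)                         # CHECK_INTERVAL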

Database Schema

-- chunks table: Work distribution
CREATE TABLE chunks (
  id TEXT PRIMARY KEY,           -- v9_chunk_000000
  start_combo INTEGER,           -- 0, 2000, 4000, etc.
  end_combo INTEGER,             -- 2000, 4000, 6000, etc.
  status TEXT,                   -- 'pending', 'running', 'completed'
  assigned_worker TEXT,          -- 'worker1', 'worker2'
  started_at INTEGER,
  completed_at INTEGER
);

-- strategies table: Results storage
CREATE TABLE strategies (
  id INTEGER PRIMARY KEY,
  chunk_id TEXT,
  params TEXT,                   -- JSON of parameter values
  pnl REAL,
  win_rate REAL,
  profit_factor REAL,
  max_drawdown REAL,
  total_trades INTEGER
);
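
Workers save results to the strategies table with the parameter set serialized as JSON, per the schema above. A minimal sketch of such an insert (metric values and parameter names are placeholders):

# Illustrative insert into the strategies table
import json
import sqlite3

conn = sqlite3.connect("cluster/exploration.db")
conn.execute("PRAGMA busy_timeout=10000")      # workers write concurrently

row = {
    "chunk_id": "v9_chunk_000000",
    "params": json.dumps({"ma_gap": 0.5, "atr_mult": 1.8}),   # hypothetical parameter names
    "pnl": 1234.5,
    "win_rate": 0.62,
    "profit_factor": 1.9,
    "max_drawdown": -480.0,
    "total_trades": 87,
}
conn.execute(
    "INSERT INTO strategies (chunk_id, params, pnl, win_rate, profit_factor, max_drawdown, total_trades) "
    "VALUES (:chunk_id, :params, :pnl, :win_rate, :profit_factor, :max_drawdown, :total_trades)",
    row,
)
conn.commit()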

🚀 Using the Cluster

Starting a Sweep

# 1. Prepare package on master
cd /home/icke/traderv4/backtester
tar -czf backtest_v9_sweep.tar.gz data/ backtester_core.py v9_moneyline_ma_gap.py moneyline_core.py

# 2. Transfer to EPYC workers
scp backtest_v9_sweep.tar.gz root@10.10.254.106:/home/backtest/
ssh root@10.10.254.106 "scp backtest_v9_sweep.tar.gz root@10.20.254.100:/home/backtest/"

# 3. Extract on workers
ssh root@10.10.254.106 "cd /home/backtest && tar -xzf backtest_v9_sweep.tar.gz"
ssh root@10.10.254.106 "ssh root@10.20.254.100 'cd /home/backtest && tar -xzf backtest_v9_sweep.tar.gz'"

# 4. Start via web dashboard or CLI
# Web: http://localhost:3001/cluster → Click "Start Cluster"
# CLI: cd /home/icke/traderv4/cluster && python v9_advanced_coordinator.py

Monitoring Progress

# Dashboard
curl -s http://localhost:3001/api/cluster/status | jq

# Database query
sqlite3 cluster/exploration.db "SELECT status, COUNT(*) FROM chunks GROUP BY status;"

# Worker processes
ssh root@10.10.254.106 "ps aux | grep [p]ython | grep backtest | wc -l"

Collecting Results

# Results saved to cluster/results/
ls -lh cluster/results/sweep_v9_*.csv

# Top 100 configurations by P&L
sqlite3 cluster/exploration.db "SELECT params, pnl, win_rate FROM strategies ORDER BY pnl DESC LIMIT 100;"

🔧 Configuration

Coordinator Settings

# cluster/v9_advanced_coordinator.py
WORKERS = {
    'worker1': {'host': '10.10.254.106', 'port': 22},
    'worker2': {'host': '10.20.254.100', 'port': 22, 'proxy_jump': '10.10.254.106'}
}

CHUNK_SIZE = 2000          # Combinations per chunk
MAX_WORKERS = 24           # 12 per server
CHECK_INTERVAL = 60        # Status check frequency (seconds)
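
For reference, a sketch of how a WORKERS entry with proxy_jump could be turned into an OpenSSH command line (-J is OpenSSH's ProxyJump flag); the real coordinator may build this differently:

# Build an ssh command from a WORKERS entry (illustrative)
def build_ssh_cmd(name, remote_cmd):
    w = WORKERS[name]
    cmd = ["ssh", "-p", str(w["port"])]
    if "proxy_jump" in w:
        cmd += ["-J", f"root@{w['proxy_jump']}"]   # nested hop via worker1
    cmd += [f"root@{w['host']}", remote_cmd]
    return cmd

# build_ssh_cmd("worker2", "hostname")
# -> ["ssh", "-p", "22", "-J", "root@10.10.254.106", "root@10.20.254.100", "hostname"]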

Worker Settings

# cluster/distributed_worker.py
NUM_PROCESSES = 12         # Parallel backtests
BATCH_SIZE = 100           # Save results every N combos
TIMEOUT = 120              # Per-combo timeout (seconds)
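
How these settings come together on a worker can be sketched with a multiprocessing pool; run_backtest and save_batch are placeholders for the real entry points in the backtester package:

# Worker-side sketch: process one chunk with NUM_PROCESSES parallel backtests
from multiprocessing import Pool

def run_backtest(combo):
    # Placeholder for the real backtest in backtester_core.py / v9_moneyline_ma_gap.py
    return {"params": combo, "pnl": 0.0}

def save_batch(batch):
    # Placeholder: the real worker writes these rows to the strategies table
    print(f"saving {len(batch)} results")

def process_chunk(combos):
    batch = []
    with Pool(NUM_PROCESSES) as pool:
        # imap_unordered keeps all 12 processes busy regardless of per-combo runtime
        for result in pool.imap_unordered(run_backtest, combos):
            batch.append(result)
            if len(batch) >= BATCH_SIZE:       # flush every 100 combos
                save_batch(batch)
                batch = []
    if batch:
        save_batch(batch)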

SSH Configuration

# ~/.ssh/config
Host worker1
    HostName 10.10.254.106
    User root
    StrictHostKeyChecking no
    ServerAliveInterval 30

Host worker2
    HostName 10.20.254.100
    User root
    ProxyJump worker1
    StrictHostKeyChecking no
    ServerAliveInterval 30

🐛 Common Issues

SSH Timeout Errors

Symptom: "SSH command timed out for worker2" Root Cause: Nested hop requires 60s timeout (not 30s) Fix: Common Pitfall #64 - Increase subprocess timeout

result = subprocess.run(ssh_cmd, timeout=60)  # Not 30
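
A slightly fuller sketch of the same fix, handling the timeout instead of letting the coordinator crash (function name illustrative):

# Run an SSH command with the 60s timeout needed for worker2's nested hop
import subprocess

def run_ssh(ssh_cmd, timeout=60):
    try:
        return subprocess.run(ssh_cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None    # caller treats None as "worker unreachable"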

Database Lock Errors

Symptom: "database is locked" Root Cause: Multiple workers writing simultaneously Fix: Use WAL mode + increase busy_timeout

connection.execute('PRAGMA journal_mode=WAL')
connection.execute('PRAGMA busy_timeout=10000')
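
In practice every process should open the database the same way; a small helper sketch (path and function name illustrative):

# Shared connection helper: WAL + generous busy timeout
import sqlite3

def open_db(path="cluster/exploration.db"):
    conn = sqlite3.connect(path, timeout=10)     # Python-side wait for a lock, seconds
    conn.execute("PRAGMA journal_mode=WAL")      # readers no longer block the writer
    conn.execute("PRAGMA busy_timeout=10000")    # SQLite-side wait, milliseconds
    return conn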

Worker Not Processing

Symptom: Chunk status='running' but no worker processes
Root Cause: Worker crashed or SSH session died
Fix: Clean up the database and restart

# Mark stuck chunks as pending
sqlite3 exploration.db "UPDATE chunks SET status='pending', assigned_worker=NULL WHERE status='running' AND started_at < (strftime('%s','now') - 600);"

Status Shows "Idle" When Running

Symptom: Dashboard shows idle despite workers running
Root Cause: SSH detection timing out, database not queried first
Fix: Database-first status detection (Common Pitfall #71)

// Check database BEFORE SSH
const hasRunningChunks = explorationData.chunks.running > 0
if (hasRunningChunks) clusterStatus = 'active'

📊 Performance Metrics

v9 Exhaustive Sweep (65,536 combos):

  • Duration: ~29 hours (24 workers)
  • Speed: ~1.6s of wall-clock time per combo (vs ~4s estimated with 6 local workers, ≈2.5× faster)
  • Throughput: ~37 combos/minute across cluster
  • Data processed: 139,678 OHLCV rows × 65,536 combos ≈ 9.15B calculations
  • Results: Top 100 saved to CSV (~10KB file)

Cost Analysis:

  • Local (6 workers): 72 hours estimated
  • EPYC (24 workers): 29 hours actual
  • Time savings: 43 hours (60% faster)
  • Resource utilization: 2 × 16-core EPYC servers (64 hardware threads) vs 6 local workers

📝 Adding Cluster Features

When to Use Cluster:

  • Parameter sweeps >10,000 combinations
  • Backtests requiring >24 hours on local machine
  • Multi-strategy comparison (need parallel execution)
  • Production validation (test many configs simultaneously)

Scaling Guidelines:

  • <1,000 combos: Local machine sufficient
  • 1,000-10,000: Single EPYC server (12 workers)
  • 10,000-100,000: Both EPYC servers (24 workers)
  • 100,000+: Consider cloud scaling (AWS Batch, etc.)

⚠️ Important Notes

Data Transfer:

  • Always compress packages: tar -czf (1.9MB → 1.1MB)
  • Verify checksums after transfer
  • Use rsync for incremental updates

Process Management:

  • Always use nohup or screen for long-running coordinators
  • Workers auto-terminate when chunks complete
  • Coordinator sends Telegram notification on completion

Database Safety:

  • SQLite WAL mode prevents most lock errors
  • Backup exploration.db before major sweeps
  • Never edit chunks table manually while coordinator running

SSH Reliability:

  • ServerAliveInterval prevents silent disconnects
  • StrictHostKeyChecking=no avoids interactive prompts
  • ProxyJump handles nested hops automatically

See ../README.md for overall documentation structure.