Commit Graph

8 Commits

Author SHA1 Message Date
mindesbunister
1f83a7d7c4 feat: Add coordinator log viewer to cluster UI
- Created /api/cluster/logs endpoint to read coordinator.log
- Added real-time log display in cluster UI (updates every 3s)
- Shows last 100 lines of coordinator.log in terminal-style display
- Includes manual refresh button
- Improves the debugging experience: no need to SSH into the host to read logs

User feedback: 'why dont we add the output of the log at the bottom of the page so i know whats going on'

This addresses poor visibility into coordinator errors and failures; the tail logic is sketched below.
Next step: fix the SSH timeout issue blocking worker execution.
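
The endpoint itself is a Next.js route; as a minimal sketch of the behavior described above (assuming a local coordinator.log path and the 100-line window from this commit), the tail logic reduces to:

```python
from pathlib import Path

LOG_PATH = Path("coordinator.log")  # assumed location; the real route resolves its own path

def tail_log(max_lines: int = 100) -> str:
    """Return the last `max_lines` lines of the coordinator log, or '' if absent."""
    if not LOG_PATH.exists():
        return ""
    lines = LOG_PATH.read_text(errors="replace").splitlines()
    return "\n".join(lines[-max_lines:])
```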
2025-12-01 11:49:23 +01:00
mindesbunister
ef371a19b9 fix: EPYC cluster SSH timeout - increase timeout 30s→60s + add SSH options
CRITICAL FIX (Dec 1, 2025): Cluster start was failing with 'operation failed'

Problem:
- SSH commands timing out after 30s (too short for 2-hop SSH to worker2)
- Missing SSH options caused prompts/delays
- Result: Coordinator failed to start worker processes

Solution:
- Increased timeout from 30s to 60s for nested SSH hops
- Added SSH options: -o StrictHostKeyChecking=no -o ConnectTimeout=10
- Applied options to both ssh_command() and worker startup commands (see the sketch below)

Verification (Dec 1, 09:40):
- Worker1: 23 processes running (chunk 0-2000)
- Worker2: 24 processes running (chunk 2000-4000)
- Cluster status: ACTIVE with 2 workers
- Both chunks processing successfully

Files changed:
- cluster/distributed_coordinator.py (lines 302-314, 388-414)
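
A minimal sketch of the fixed command shape; ssh_command() is named in the commit, but the exact signature in distributed_coordinator.py may differ:

```python
import subprocess

SSH_OPTS = ["-o", "StrictHostKeyChecking=no", "-o", "ConnectTimeout=10"]
SSH_TIMEOUT = 60  # raised from 30s, which was too short for the 2-hop path to worker2

def ssh_command(host: str, command: str, timeout: int = SSH_TIMEOUT) -> str:
    """Run a command on a worker over SSH with non-interactive options."""
    result = subprocess.run(
        ["ssh", *SSH_OPTS, f"root@{host}", command],
        capture_output=True, text=True, timeout=timeout,
    )
    result.check_returncode()  # surface non-zero exits instead of failing silently
    return result.stdout
```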
2025-12-01 09:41:42 +01:00
mindesbunister
67ef5b1ac6 feat: Add direction-specific quality thresholds and dynamic collateral display
- Split QUALITY_LEVERAGE_THRESHOLD into separate LONG and SHORT variants
- Added /api/drift/account-health endpoint for real-time collateral data
- Updated settings UI to show separate controls for LONG/SHORT thresholds
- Position size calculations now use dynamic collateral from the Drift account
- Updated .env and docker-compose.yml with new environment variables
- LONG threshold: 95, SHORT threshold: 90, each configurable independently (selection sketched below)

Files changed:
- app/api/drift/account-health/route.ts (NEW) - Account health API endpoint
- app/settings/page.tsx - Added collateral state, separate threshold inputs
- app/api/settings/route.ts - GET/POST handlers for LONG/SHORT thresholds
- .env - Added QUALITY_LEVERAGE_THRESHOLD_LONG/SHORT variables
- docker-compose.yml - Added new env vars with fallback defaults

Impact:
- Users can now configure quality thresholds independently for LONG vs SHORT signals
- Position size display dynamically updates based on actual Drift account collateral
- More flexible risk management with direction-specific leverage tiers
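
The production code is TypeScript (the settings page and route handlers), but the direction-based selection is small enough to sketch in Python using the env var names and defaults from this commit:

```python
import os

def quality_threshold(direction: str) -> float:
    """Return the quality/leverage threshold for a signal's direction.
    Env var names and defaults are taken from this commit."""
    if direction.upper() == "LONG":
        return float(os.getenv("QUALITY_LEVERAGE_THRESHOLD_LONG", "95"))
    return float(os.getenv("QUALITY_LEVERAGE_THRESHOLD_SHORT", "90"))
```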
2025-12-01 09:09:30 +01:00
mindesbunister
c5a8f5e32d docs: Add comprehensive status detection fix documentation 2025-11-30 22:27:08 +01:00
mindesbunister
cc56b72df2 fix: Database-first cluster status detection + Stop button clarification
CRITICAL FIX (Nov 30, 2025):
- Dashboard showed 'idle' despite 22+ worker processes running
- Root cause: SSH-based worker detection timing out
- Solution: Check database for running chunks FIRST

Changes:
1. app/api/cluster/status/route.ts:
   - Query exploration database before SSH detection
   - If running chunks exist, mark workers 'active' even if SSH fails
   - Override worker status: 'offline' → 'active' when chunks running
   - Log: 'Cluster status: ACTIVE (database shows running chunks)'
   - Database is the source of truth; SSH is used only for supplementary metrics (see the sketch below)

2. app/cluster/page.tsx:
   - Stop button ALREADY EXISTS (conditionally shown)
   - Shows Start when status='idle', Stop when status='active'
   - No code changes needed - fixed by status detection

Result:
- Dashboard now shows 'ACTIVE' with 2 workers (correct)
- Workers show 'active' status (was 'offline')
- Stop button automatically visible when cluster active
- System resilient to SSH timeouts/network issues

Verified:
- Container restarted: Nov 30 21:18 UTC
- API tested: Returns status='active', activeWorkers=2
- Logs confirm: Database-first logic working
- Workers confirmed running: 22+ processes on worker1, worker processes active on worker2
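
The real implementation is app/api/cluster/status/route.ts; here is a minimal Python sketch of the database-first rule, with table and column names assumed rather than taken from the commit:

```python
import sqlite3

def cluster_status(db_path: str) -> str:
    """Database-first detection: any running chunk means the cluster is
    active, even when SSH probes to the workers time out."""
    con = sqlite3.connect(db_path)
    try:
        (running,) = con.execute(
            "SELECT COUNT(*) FROM chunks WHERE status = 'running'"  # schema assumed
        ).fetchone()
    finally:
        con.close()
    return "active" if running > 0 else "idle"
```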
2025-11-30 22:23:01 +01:00
mindesbunister
83b4915d98 fix: Reduce coordinator chunk_size from 10k to 2k for small explorations
- Changed default chunk_size from 10,000 to 2,000
- Fixes a bug where the coordinator exited immediately for a 4,096-combination exploration
- Coordinator was calculating: chunk 1 starts at offset 10,000 > 4,096 total = 'all done'
- Now creates three appropriately sized chunks for distribution (see the sketch below)
- Verified: Workers now start and process assigned chunks
- Status: Docker rebuilt and deployed to port 3001
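
A minimal sketch of the corrected chunking arithmetic (the function name is hypothetical; the real logic lives in the coordinator):

```python
def make_chunks(total: int, chunk_size: int = 2_000) -> list[tuple[int, int]]:
    """Split `total` combinations into [start, end) ranges for distribution."""
    return [(start, min(start + chunk_size, total))
            for start in range(0, total, chunk_size)]

# A 4,096-combination exploration now yields three chunks instead of none:
assert make_chunks(4_096) == [(0, 2_000), (2_000, 4_000), (4_000, 4_096)]
```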
2025-11-30 22:07:59 +01:00
mindesbunister
b77282b560 feat: Add EPYC cluster distributed sweep with web UI
New Features:
- Distributed coordinator orchestrates 2x AMD EPYC servers (16 cores / 32 threads each)
- 64 hardware threads processing 12M parameter combinations (70% CPU limit)
- Worker1 (pve-nu-monitor01): Direct SSH access at 10.10.254.106
- Worker2 (bd-host01): 2-hop SSH through worker1 (10.20.254.100)
- Web UI at /cluster shows real-time status and AI recommendations
- API endpoint /api/cluster/status serves cluster metrics
- Auto-refresh every 30s with top strategies and actionable insights

Files Added:
- cluster/distributed_coordinator.py (510 lines) - Main orchestrator
- cluster/distributed_worker.py (271 lines) - Worker1 script
- cluster/distributed_worker_bd_clean.py (275 lines) - Worker2 script
- cluster/monitor_bd_host01.sh - Monitoring script
- app/api/cluster/status/route.ts (274 lines) - API endpoint
- app/cluster/page.tsx (258 lines) - Web UI
- cluster/CLUSTER_SETUP.md - Complete setup and access documentation

Technical Details:
- SQLite database tracks chunk assignments
- 10,000 combinations per chunk (1,195 total chunks)
- Multiprocessing.Pool with 70% CPU limit (22 worker processes per EPYC; sizing sketched below)
- SSH/SCP for deployment and result collection
- Handles 2-hop SSH for bd-host01 access
- Results in CSV format with top strategies ranked
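
A minimal sketch of that pool sizing, assuming the 70% cap is applied to the 32 hardware threads of each server (the function name is hypothetical):

```python
import multiprocessing as mp

def make_pool(cpu_fraction: float = 0.70):
    """Cap the worker pool at a fraction of hardware threads:
    int(32 * 0.70) = 22 processes on each 32-thread EPYC."""
    processes = max(1, int(mp.cpu_count() * cpu_fraction))
    return mp.Pool(processes=processes)
```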

Access Documentation:
- Worker1: ssh root@10.10.254.106
- Worker2: ssh root@10.10.254.106 "ssh root@10.20.254.100"
- Web UI: http://localhost:3001/cluster
- See CLUSTER_SETUP.md for complete guide

Status: Deployed and operational
2025-11-30 13:02:18 +01:00
mindesbunister
2a8e04fe57 feat: Continuous optimization cluster for 2 EPYC servers
- Master controller with job queue and result aggregation
- Worker scripts for parallel backtesting (22 workers per server)
- SQLite database for strategy ranking and performance tracking
- File-based job queue (simple, robust, survives crashes; claim logic sketched below)
- Auto-setup script for both EPYC servers
- Status dashboard for monitoring progress
- Comprehensive deployment guide
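
A minimal sketch of why a file-based queue survives crashes, assuming a pending/running directory layout that the commit does not specify:

```python
import json
from pathlib import Path

QUEUE = Path("queue")  # assumed layout: queue/pending/, queue/running/

def claim_job() -> dict | None:
    """Claim the next pending job by atomically renaming its file into
    running/. rename() is atomic on POSIX, so two workers can never
    claim the same job, and a crashed claim remains visible on disk."""
    for job_file in sorted((QUEUE / "pending").glob("*.json")):
        claimed = QUEUE / "running" / job_file.name
        try:
            job_file.rename(claimed)
        except OSError:
            continue  # another worker claimed it first
        return json.loads(claimed.read_text())
    return None
```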

Architecture:
- Master: Job generation, worker coordination, result collection
- Worker 1 (pve-nu-monitor01): AMD EPYC 7282, 22 parallel jobs
- Worker 2 (srv-bd-host01): AMD EPYC 7302, 22 parallel jobs
- Total capacity: ~49,000 backtests/day (44 parallel workers, ~70% of the servers' 64 threads)

Initial focus: v9 parameter refinement (27 configurations)
Target: Find strategies >00/1k P&L (current baseline 92/1k)

Files:
- cluster/master.py: Main controller (570 lines)
- cluster/worker.py: Worker execution script (220 lines)
- cluster/setup_cluster.sh: Automated deployment
- cluster/status.py: Real-time status dashboard
- cluster/README.md: Operational documentation
- cluster/DEPLOYMENT.md: Step-by-step deployment guide
2025-11-29 22:34:52 +01:00