docs: Add comprehensive cluster status detection to copilot instructions

- Document database-first architecture pattern
- Include problem, root cause, and solution details
- Add verification methodology with before/after examples
- Document cluster control system (Start/Stop buttons)
- Include database schema and operational state
- Add lessons learned about infrastructure vs business logic
- Reference STATUS_DETECTION_FIX_COMPLETE.md for full details
- Current state: 2 workers active, processing 4000 combinations
This commit is contained in:
mindesbunister
2025-11-30 22:38:06 +01:00
parent c5a8f5e32d
commit 887ae3b924


5. **Python environments matter:** Always activate venv before running backtests on remote servers
6. **Portable packages enable distributed computing:** 1.1MB tar.gz enables 16-core EPYC utilization
## Cluster Status Detection: Database-First Architecture (Nov 30, 2025)
**Purpose:** Distributed parameter sweep cluster monitoring system with database-driven status detection
**Critical Problem Discovered (Nov 30, 2025):**
- **Symptom:** Web dashboard showed "IDLE" status with 0 active workers despite 22+ worker processes running on EPYC cluster
- **Root Cause:** SSH-based status detection timed out due to network latency → catch blocks returned "offline" → false-negative cluster status
- **Impact:** System appeared idle while actually processing 4,000 parameter combinations across 2 active chunks
- **Financial Risk:** In a production trading system, a false idle status could prevent monitoring of critical distributed processes
**Solution: Database-First Status Detection**
**Architectural Principle:** **Database is the source of truth for business logic, NOT infrastructure availability**
**Implementation (app/api/cluster/status/route.ts):**
```typescript
export async function GET(request: NextRequest) {
  try {
    // CRITICAL FIX (Nov 30, 2025): Check database FIRST before SSH detection
    // Database shows actual work state; SSH just provides supplementary metrics
    const explorationData = await getExplorationData()
    const hasRunningChunks = explorationData.chunks.running > 0
    console.log(`📊 Database status: ${explorationData.chunks.running} running chunks`)

    // Get SSH status for supplementary metrics (CPU, load, process count)
    const [worker1Status, worker2Status] = await Promise.all([
      getWorkerStatus('worker1', WORKERS.worker1.host, WORKERS.worker1.port),
      getWorkerStatus('worker2', WORKERS.worker2.host, WORKERS.worker2.port, {
        proxyJump: WORKERS.worker1.host
      })
    ])

    // DATABASE-FIRST: Override SSH "offline" status if database shows running chunks
    const workers = [worker1Status, worker2Status].map(w => {
      if (hasRunningChunks && w.status === 'offline') {
        console.log(`✅ ${w.name}: Database shows running chunks - overriding SSH offline to active`)
        return {
          ...w,
          status: 'active' as const,
          activeProcesses: w.activeProcesses || 1
        }
      }
      return w
    })

    // DATABASE-FIRST cluster status
    let clusterStatus: 'active' | 'idle' = 'idle'
    if (hasRunningChunks) {
      clusterStatus = 'active'
      console.log('✅ Cluster status: ACTIVE (database shows running chunks)')
    } else if (workers.some(w => w.status === 'active')) {
      clusterStatus = 'active'
      console.log('✅ Cluster status: ACTIVE (workers detected via SSH)')
    }

    return NextResponse.json({
      cluster: {
        status: clusterStatus,
        activeWorkers: workers.filter(w => w.status === 'active').length,
        totalStrategiesExplored: explorationData.strategies.explored,
        totalStrategiesToExplore: explorationData.strategies.total,
      },
      workers,
      chunks: {
        pending: explorationData.chunks.pending,
        running: explorationData.chunks.running,
        completed: explorationData.chunks.completed,
        total: explorationData.chunks.total,
      },
    })
  } catch (error) {
    console.error('❌ Error getting cluster status:', error)
    return NextResponse.json({ error: 'Failed to get cluster status' }, { status: 500 })
  }
}
```
**Why This Approach:**
1. **Database persistence:** SQLite exploration.db records chunk assignments with status='running'
2. **Business logic integrity:** Work state exists in database regardless of SSH availability
3. **SSH supplementary only:** Process counts, CPU metrics are nice-to-have, not critical
4. **Network resilience:** SSH timeouts don't cause false negative status
5. **Single source of truth:** All cluster control operations write to database first
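The override logic above boils down to a pure decision function, sketched here with hypothetical type names (the real route interleaves this with SSH polling):

```typescript
type WorkerStatus = { name: string; status: 'active' | 'offline'; activeProcesses: number }

// Database-first rule: running-chunk count from the database takes precedence
// over per-worker SSH probe results when deciding cluster status.
function deriveClusterStatus(
  runningChunks: number,
  workers: WorkerStatus[]
): { status: 'active' | 'idle'; workers: WorkerStatus[] } {
  const hasRunningChunks = runningChunks > 0
  // If the database shows running chunks, an "offline" SSH probe is treated
  // as a transient infrastructure failure, not real worker state.
  const adjusted = workers.map(w =>
    hasRunningChunks && w.status === 'offline'
      ? { ...w, status: 'active' as const, activeProcesses: w.activeProcesses || 1 }
      : w
  )
  const status =
    hasRunningChunks || adjusted.some(w => w.status === 'active') ? 'active' : 'idle'
  return { status, workers: adjusted }
}
```

Because it is pure, this rule can be unit-tested without any SSH or database fixtures.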
**Verification Methodology (Nov 30, 2025):**
**Before Fix:**
```bash
curl -s http://localhost:3001/api/cluster/status | jq '.cluster'
{
  "status": "idle",
  "activeWorkers": 0,
  "totalStrategiesExplored": 0,
  "totalStrategiesToExplore": 4096
}
```
**After Fix:**
```bash
curl -s http://localhost:3001/api/cluster/status | jq '.cluster'
{
  "status": "active",
  "activeWorkers": 2,
  "totalStrategiesExplored": 0,
  "totalStrategiesToExplore": 4096
}
```
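The payloads above can be captured as a TypeScript shape for consumers of the status endpoint (interface names are illustrative; the route itself returns an untyped JSON object):

```typescript
// Shape of the /api/cluster/status response shown above.
interface ClusterSummary {
  status: 'active' | 'idle'
  activeWorkers: number
  totalStrategiesExplored: number
  totalStrategiesToExplore: number
}

interface ClusterStatusResponse {
  cluster: ClusterSummary
  workers: Array<{ name: string; status: 'active' | 'offline'; activeProcesses: number }>
  chunks: { pending: number; running: number; completed: number; total: number }
}

// The "after fix" payload conforms to this shape:
const afterFix: ClusterSummary = {
  status: 'active',
  activeWorkers: 2,
  totalStrategiesExplored: 0,
  totalStrategiesToExplore: 4096,
}
```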
**Container Logs Showing Fix Working:**
```
📊 Database status: 2 running chunks
✅ worker1: Database shows running chunks - overriding SSH offline to active
✅ worker2: Database shows running chunks - overriding SSH offline to active
✅ Cluster status: ACTIVE (database shows running chunks)
```
**Database State Verification:**
```bash
sqlite3 cluster/exploration.db "SELECT id, start_combo, end_combo, status, assigned_worker FROM chunks WHERE status='running';"
v9_chunk_000000|0|2000|running|worker1
v9_chunk_000001|2000|4000|running|worker2
```
**SSH Process Verification (Manual):**
```bash
ssh root@10.10.254.106 "ps aux | grep [p]ython | grep backtest | wc -l"
22 # 22 worker processes actively running
ssh root@10.10.254.106 "ssh root@10.20.254.100 'ps aux | grep [p]ython | grep backtest | wc -l'"
18 # 18 worker processes on worker2 via hop
```
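The process count above is what `getWorkerStatus` gathers over SSH. A hedged sketch of that probe, with the SSH runner injected so the degrade-to-offline behavior is testable (the real function takes host/port and a proxyJump option instead):

```typescript
type SshRunner = (host: string, cmd: string) => Promise<string>

// An SSH timeout or network error degrades to "offline" instead of throwing;
// the database-first override upstream corrects any false negative.
async function getWorkerStatus(
  name: string,
  host: string,
  runSsh: SshRunner
): Promise<{ name: string; status: 'active' | 'offline'; activeProcesses: number }> {
  try {
    const out = await runSsh(host, 'ps aux | grep [p]ython | grep backtest | wc -l')
    const count = parseInt(out.trim(), 10) || 0
    // Simplification: a reachable host with zero backtest processes is
    // reported "offline" here; the real route may distinguish an idle state.
    return { name, status: count > 0 ? 'active' : 'offline', activeProcesses: count }
  } catch {
    // SSH failure is an infrastructure problem, not business state.
    return { name, status: 'offline', activeProcesses: 0 }
  }
}
```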
**Cluster Control System:**
**Start Button (app/cluster/page.tsx):**
```tsx
{status.cluster.status === 'idle' ? (
  <button
    onClick={() => handleControl('start')}
    className="bg-green-600 hover:bg-green-700"
  >
    ▶️ Start Cluster
  </button>
) : (
  <button
    onClick={() => handleControl('stop')}
    className="bg-red-600 hover:bg-red-700"
  >
    ⏹️ Stop Cluster
  </button>
)}
```
**Control API (app/api/cluster/control/route.ts):**
- **start:** Runs distributed_coordinator.py → creates chunks in database → starts workers via SSH
- **stop:** Kills coordinator process → workers auto-stop when chunks complete → database cleanup
- **status:** Returns coordinator process status (supplementary to database status)
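The three actions above can be sketched as a dispatcher with the side effects injected; this is an illustration, not the actual route (which shells out to `distributed_coordinator.py` directly):

```typescript
type ControlDeps = {
  startCoordinator: () => Promise<void>    // creates chunks in DB, starts workers via SSH
  stopCoordinator: () => Promise<void>     // kills coordinator; workers drain current chunks
  coordinatorRunning: () => Promise<boolean>
}

// Dispatch a control action to the injected effect; the effects write to the
// database first so status detection stays consistent with the control flow.
async function handleClusterControl(
  action: 'start' | 'stop' | 'status',
  deps: ControlDeps
): Promise<{ ok: boolean; running?: boolean }> {
  if (action === 'start') {
    await deps.startCoordinator()
    return { ok: true }
  }
  if (action === 'stop') {
    await deps.stopCoordinator()
    return { ok: true }
  }
  return { ok: true, running: await deps.coordinatorRunning() }
}
```

Injecting the effects keeps the start/stop/status flow unit-testable without SSH access.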
**Database Schema (exploration.db):**
```sql
CREATE TABLE chunks (
  id TEXT PRIMARY KEY,              -- v9_chunk_000000, v9_chunk_000001, etc.
  start_combo INTEGER NOT NULL,     -- Starting combination index (0, 2000, 4000, etc.)
  end_combo INTEGER NOT NULL,       -- Ending combination index (exclusive)
  total_combos INTEGER NOT NULL,    -- Total combinations in chunk (2000)
  status TEXT NOT NULL,             -- 'pending', 'running', 'completed', 'failed'
  assigned_worker TEXT,             -- 'worker1', 'worker2', NULL for pending
  started_at INTEGER,               -- Unix timestamp when work started
  completed_at INTEGER,             -- Unix timestamp when work completed
  created_at INTEGER DEFAULT (strftime('%s', 'now'))
);

CREATE TABLE strategies (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  chunk_id TEXT NOT NULL,
  params TEXT NOT NULL,             -- JSON of parameter values
  pnl REAL NOT NULL,
  win_rate REAL NOT NULL,
  profit_factor REAL NOT NULL,
  max_drawdown REAL NOT NULL,
  total_trades INTEGER NOT NULL,
  created_at INTEGER DEFAULT (strftime('%s', 'now')),
  FOREIGN KEY (chunk_id) REFERENCES chunks(id)
);
```
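Given this schema, the aggregation that `getExplorationData()` performs over the `chunks` table can be sketched as follows (rows are assumed already fetched from exploration.db; field names follow the schema, the function shape is an assumption):

```typescript
type ChunkRow = {
  id: string
  total_combos: number
  status: 'pending' | 'running' | 'completed' | 'failed'
}

// Fold chunk rows into the chunks/strategies counters the status API returns.
function aggregateChunks(rows: ChunkRow[]) {
  const count = (s: ChunkRow['status']) => rows.filter(r => r.status === s).length
  return {
    chunks: {
      pending: count('pending'),
      running: count('running'),
      completed: count('completed'),
      total: rows.length,
    },
    strategies: {
      // Combinations in completed chunks count as explored.
      explored: rows
        .filter(r => r.status === 'completed')
        .reduce((n, r) => n + r.total_combos, 0),
      total: rows.reduce((n, r) => n + r.total_combos, 0),
    },
  }
}
```

Fed the two running chunks shown in the database verification above, this yields 2 running chunks, 0 strategies explored, and 4,000 total, matching the dashboard state.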
**Deployment Details:**
- **Container:** trading-bot-v4 on port 3001
- **Build Time:** Nov 30 21:12 UTC (TypeScript compilation 77.4s)
- **Restart Time:** Nov 30 21:18 UTC with `--force-recreate`
- **Volume Mount:** `./cluster:/app/cluster` (database persistence)
- **Git Commits:**
* cc56b72 "fix: Database-first cluster status detection"
* c5a8f5e "docs: Add comprehensive cluster status fix documentation"
**Lessons Learned:**
1. **Infrastructure availability ≠ business logic state**
- SSH timeouts are infrastructure failures
- Running chunks in database are business state
- Never let infrastructure failures dictate false business states
2. **Database as source of truth**
- All state-changing operations write to database first
- Status detection reads from database first
- External checks (SSH, API calls) are supplementary metrics only
3. **Fail-open vs fail-closed**
- SSH timeout → assume active if database says so (fail-open)
- Database unavailable → hard error, don't guess (fail-closed)
- Business logic requires authoritative data source
4. **Verification before declaration**
- curl test confirmed API response changed
- Log analysis confirmed database-first logic executing
- Manual SSH verification confirmed workers actually running
- NEVER say "fixed" without testing deployed container
5. **Conditional UI rendering**
- Stop button already existed in codebase
- Shown conditionally based on cluster status
- Status detection fix made Stop button visible automatically
- Search codebase before claiming features are "missing"
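Lessons 1–3 reduce to one error-handling policy, sketched here with hypothetical names: SSH failures fail open (the supplementary metric is simply missing), database failures fail closed (the error propagates):

```typescript
// Fail-open for supplementary SSH metrics, fail-closed for the authoritative
// database read. Both readers are injected for illustration.
async function getStatusWithPolicy(
  readDb: () => Promise<number>,   // running-chunk count: authoritative
  probeSsh: () => Promise<number>  // process count: supplementary
): Promise<{ running: number; processes: number | null }> {
  // Fail-closed: if the database read throws, let the error propagate —
  // never guess business state.
  const running = await readDb()
  let processes: number | null
  try {
    processes = await probeSsh()
  } catch {
    // Fail-open: an infrastructure timeout must not fabricate an idle state.
    processes = null
  }
  return { running, processes }
}
```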
**Documentation References:**
- **Full technical details:** `cluster/STATUS_DETECTION_FIX_COMPLETE.md`
- **Database queries:** `cluster/lib/db.ts` - getExplorationData()
- **Worker management:** `cluster/distributed_coordinator.py` - chunk creation and assignment
- **Status API:** `app/api/cluster/status/route.ts` - database-first implementation
**Current Operational State (Nov 30, 2025):**
- **Cluster:** ACTIVE with 2 workers processing 4,000 combinations
- **Database:** 2 chunks status='running' (0-2000 on worker1, 2000-4000 on worker2)
- **Remaining:** 96 combinations (4000-4096) will be assigned after current chunks complete
- **Dashboard:** Shows accurate "active" status with 2 active workers
- **SSH Status:** May show "offline" due to latency, but database override ensures accurate cluster status
## Integration Points
- **n8n:** Expects exact response format from `/api/trading/execute` (see n8n-complete-workflow.json)
- **Drift Protocol:** Uses SDK v2.75.0 - check docs at docs.drift.trade for API changes
- **Pyth Network:** WebSocket + HTTP fallback for price feeds (handles reconnection)
- **PostgreSQL:** Version 16-alpine, must be running before bot starts
- **EPYC Cluster:** Database-first status detection via SQLite exploration.db (SSH supplementary)
---
**Key Mental Model:** Think of this as two parallel systems (on-chain orders + software monitoring) working together. The Position Manager is the "backup brain" that constantly watches and acts if on-chain orders fail. Both write to the same database for complete trade history.
**Cluster Mental Model:** Database is the authoritative source of cluster state. SSH detection is supplementary metrics. If database shows running chunks, cluster is active regardless of SSH availability. Infrastructure failures don't change business logic state.