Files
trading_bot_v4/docs/deployments/CLUSTER_STOP_BUTTON_FIX_COMPLETE.md
mindesbunister 4c36fa2bc3 docs: Major documentation reorganization + ENV variable reference
**Documentation Structure:**
- Created docs/ subdirectory organization (analysis/, architecture/, bugs/,
  cluster/, deployments/, roadmaps/, setup/, archived/)
- Moved 68 root markdown files to appropriate categories
- Root directory now clean (only README.md remains)
- Total: 83 markdown files now organized by purpose

**New Content:**
- Added comprehensive Environment Variable Reference to copilot-instructions.md
- 100+ ENV variables documented with types, defaults, purpose, notes
- Organized by category: Required (Drift/RPC/Pyth), Trading Config (quality/
  leverage/sizing), ATR System, Runner System, Risk Limits, Notifications, etc.
- Includes usage examples (correct vs wrong patterns)

**File Distribution:**
- docs/analysis/ - Performance analyses, blocked signals, profit projections
- docs/architecture/ - Adaptive leverage, ATR trailing, indicator tracking
- docs/bugs/ - CRITICAL_*.md, FIXES_*.md bug reports (7 files)
- docs/cluster/ - EPYC setup, distributed computing docs (3 files)
- docs/deployments/ - *_COMPLETE.md, DEPLOYMENT_*.md status (12 files)
- docs/roadmaps/ - All *ROADMAP*.md strategic planning files (7 files)
- docs/setup/ - TradingView guides, signal quality, n8n setup (8 files)
- docs/archived/2025_pre_nov/ - Obsolete verification checklist (1 file)

**Key Improvements:**
- ENV variable reference: Single source of truth for all configuration
- Common Pitfalls #68-71: Already complete, verified during audit
- Better findability: Category-based navigation vs 68 files in root
- Preserves history: All files git mv (rename), not copy/delete
- Zero broken functionality: Only documentation moved, no code changes

**Verification:**
- 83 markdown files now in docs/ subdirectories
- Root directory cleaned: 68 files → 0 files (except README.md)
- Git history preserved for all moved files
- Container running: trading-bot-v4 (no restart needed)

**Next Steps:**
- Create README.md files in each docs subdirectory
- Add navigation index
- Update main README.md with new structure
- Consolidate duplicate deployment docs
- Archive truly obsolete files (old SQL backups)

See: docs/analysis/CLEANUP_PLAN.md for complete reorganization strategy
2025-12-04 08:29:59 +01:00

18 KiB

Cluster Stop Button Fix - COMPLETE

Date: December 1, 2025
Status: DEPLOYED AND VERIFIED
Commit: db33af9


Executive Summary

Successfully fixed two critical cluster management issues:

  1. Stop button database reset - Now works even when coordinator crashed
  2. Stale metrics display - UI now shows accurate 4-state system

Key Achievement: Database-first architecture ensures clean cluster state regardless of process crashes.


Issues Resolved

Issue #1: Stop Button Appears Broken

Original Problem:

  • User clicks Stop button
  • Button appears to fail (shows error in UI)
  • Database still shows chunks as "running"
  • Can't restart cluster cleanly

Root Cause:

  • Database already in stale state BEFORE Stop clicked
  • Old logic: pkill processes → wait → reset database
  • If coordinator crashed earlier, database never got reset
  • Stop button tried to reset but stale data made it look failed

Fix Applied:

// NEW ORDER: Database reset FIRST, then pkill
if (action === 'stop') {
  // 1. Reset database state FIRST (even if coordinator already gone)
  const db = await open({ filename: dbPath, driver: sqlite3.Database })
  await db.run(`UPDATE chunks SET status='pending', assigned_worker=NULL, started_at=NULL WHERE status='running'`)
  const pendingCount = await db.get(`SELECT COUNT(*) as count FROM chunks WHERE status='pending'`)
  await db.close()
  
  // 2. THEN try to stop any running processes
  const stopCmd = 'pkill -9 -f distributed_coordinator; pkill -9 -f distributed_worker'
  try {
    await execAsync(stopCmd)
  } catch (err) {
    console.log('📝 No processes to kill (already stopped)')
  }
}

Verification:

# Before fix:
curl POST /api/cluster/control '{"action":"stop"}'
# → {"success": false, "error": "sqlite3: not found"}

# After fix:
curl POST /api/cluster/control '{"action":"stop"}'
# → {"success": true, "message": "Cluster stopped and database reset to pending"}

# Database state:
sqlite3 exploration.db "SELECT status, COUNT(*) FROM chunks GROUP BY status;"
# Before: running|3
# After: pending|3 ✅

Issue #2: Stale Metrics Display

Original Problem:

  • UI shows "ACTIVE" status with 3 running chunks
  • Progress bar shows 0.00% (no actual work happening)
  • Confusing state: looks active but nothing running
  • Missing "Idle" state when no work queued

Root Cause:

  • Coordinator crashed without updating database
  • Status API trusts database without verification
  • UI only showed 3 states (Processing/Pending/Complete)
  • No visual indicator for "no work at all" state

Fix Applied:

// UI Enhancement - 4-state display system
{status.exploration.chunks.running > 0 ? (
  <span className="text-yellow-400"> Processing</span>
) : status.exploration.chunks.pending > 0 ? (
  <span className="text-blue-400"> Pending</span>
) : status.exploration.chunks.completed === status.exploration.chunks.total ? (
  <span className="text-green-400"> Complete</span>
) : (
  <span className="text-gray-400">⏸️ Idle</span>  // NEW STATE
)}

// Show pending chunk count
{status.exploration.chunks.pending > 0 && (
  <span className="text-gray-400">({status.exploration.chunks.pending} pending)</span>
)}

Verification:

curl -s http://localhost:3001/api/cluster/status | jq '.exploration'
# After Stop button:
{
  "totalCombinations": 4096,
  "testedCombinations": 0,
  "progress": 0,
  "chunks": {
    "total": 3,
    "completed": 0,
    "running": 0,    # ✅ Was 3 before fix
    "pending": 3     # ✅ Correctly reset
  }
}

Technical Implementation

Database Operations Refactor

Problem: Original code used sqlite3 CLI commands

// ❌ DOESN'T WORK IN DOCKER
const resetCmd = `sqlite3 ${dbPath} "UPDATE chunks SET status='pending'..."`
await execAsync(resetCmd)
// Error: /bin/sh: sqlite3: not found

Solution: Use Node.js sqlite library

// ✅ WORKS IN DOCKER
const db = await open({
  filename: dbPath,
  driver: sqlite3.Database
})
await db.run(`UPDATE chunks SET status=?, assigned_worker=NULL WHERE status=?`, 
  ['pending', 'running'])
const result = await db.get(`SELECT COUNT(*) as count FROM chunks WHERE status=?`, 
  ['pending'])
await db.close()

Why This Matters:

  • Docker container uses node:20-alpine (minimal Linux)
  • Alpine doesn't include sqlite3 CLI by default
  • Node.js sqlite3 library already installed (used in status API)
  • Cleaner code, better error handling, no shell dependencies

File Permissions Fix

Problem: Database readonly in container

SQLITE_READONLY: attempt to write a readonly database

Root Cause:

  • Database file owned by root (UID 0)
  • Container runs as nextjs user (UID 1001)
  • SQLite needs write access + directory write for lock files

Solution:

# Fix database file ownership
chown 1001:1001 /home/icke/traderv4/cluster/exploration.db
chmod 664 /home/icke/traderv4/cluster/exploration.db

# Fix directory permissions (for lock files)
chown 1001:1001 /home/icke/traderv4/cluster
chmod 775 /home/icke/traderv4/cluster

# Verification:
ls -la /home/icke/traderv4/cluster/
drwxrwxr-x 4 1001 1001   30 Dec  1 10:02 .
-rw-rw-r-- 1 1001 1001 40960 Dec  1 10:02 exploration.db

Deployment Process

Build & Deploy Steps

  1. Code Changes:

    • Updated control/route.ts with database-first logic
    • Enhanced page.tsx with 4-state display
    • Replaced sqlite3 CLI with Node.js API
  2. Docker Build:

    docker compose build trading-bot
    # Build time: ~73 seconds
    # Image: sha256:7b830abb...
    
  3. Container Restart:

    docker compose up -d --force-recreate trading-bot
    # Container: trading-bot-v4 started successfully
    
  4. Permission Fix:

    chown 1001:1001 /home/icke/traderv4/cluster/exploration.db
    chmod 664 /home/icke/traderv4/cluster/exploration.db
    chown 1001:1001 /home/icke/traderv4/cluster
    chmod 775 /home/icke/traderv4/cluster
    
  5. Testing:

    # Test Stop button
    curl -X POST http://localhost:3001/api/cluster/control \
      -H "Content-Type: application/json" \
      -d '{"action":"stop"}' | jq .
    
    # Result: {"success": true, "message": "Cluster stopped..."}
    
  6. Verification:

    # Check database state
    sqlite3 exploration.db "SELECT status, COUNT(*) FROM chunks GROUP BY status;"
    # Result: pending|3 ✅
    
    # Check status API
    curl -s http://localhost:3001/api/cluster/status | jq .exploration
    # Result: running=0, pending=3, progress=0% ✅
    

Container Logs

Stop Button Operation (Successful):

🛑 Stopping cluster...
🔧 Resetting database chunks to pending...
✅ Database cleanup complete - 3 chunks reset to pending (total pending: 3)
📝 No processes to kill (already stopped)

Key Log Messages:

  • 🔧 Resetting database chunks to pending... - Database operation started
  • ✅ Database cleanup complete - 3 chunks reset - Success confirmation
  • Shows count of chunks reset and total pending
  • Gracefully handles "no processes to kill" (already crashed scenario)

Testing Results

Test Case: Stale Database from Coordinator Crash

Initial State:

SELECT status, COUNT(*) FROM chunks GROUP BY status;
-- Result: running|3

Stop Button Action:

curl -X POST http://localhost:3001/api/cluster/control \
  -H "Content-Type: application/json" \
  -d '{"action":"stop"}' | jq .

Response:

{
  "success": true,
  "message": "Cluster stopped and database reset to pending",
  "isRunning": false,
  "note": "All processes stopped, chunks reset"
}

Final State:

SELECT status, COUNT(*) FROM chunks GROUP BY status;
-- Result: pending|3 ✅

Status API After Stop:

{
  "totalCombinations": 4096,
  "testedCombinations": 0,
  "progress": 0,
  "chunks": {
    "total": 3,
    "completed": 0,
    "running": 0,    // Was 3 before
    "pending": 3     // Correctly reset
  }
}

Architecture Decision: Database-First

Why Database Reset Comes First

Old Approach (WRONG):

Stop Button → pkill processes → wait → reset database
Problem: If processes already crashed, database never resets

New Approach (CORRECT):

Stop Button → reset database → pkill processes → verify
Benefit: Database always clean regardless of process state

Rationale:

  1. Idempotency: Database reset safe to run multiple times
  2. Crash Recovery: Works even when coordinator already dead
  3. User Intent: "Stop" means "clean up everything" not just "kill processes"
  4. Restart Readiness: Fresh database state enables immediate restart
  5. Error Isolation: Process kill failure doesn't block database cleanup

Real-World Scenario:

  • Coordinator crashes at 2 AM (out of memory, network issue, etc.)
  • Database left with 3 chunks in "running" state
  • User wakes up at 9 AM, sees stale "ACTIVE" status
  • Clicks Stop button
  • OLD: Would fail because processes already gone
  • NEW: Succeeds because database reset happens first

Files Changed

app/api/cluster/control/route.ts (Major Refactor)

Before:

// Start action - shell command (doesn't work in Docker)
const resetCmd = `sqlite3 ${dbPath} "UPDATE chunks..."`
await execAsync(resetCmd)

// Stop action - pkill first, database second
const stopCmd = 'pkill -9 -f distributed_coordinator'
await execAsync(stopCmd)
// ... then reset database

After:

// Imports added:
import sqlite3 from 'sqlite3'
import { open } from 'sqlite'

// Start action - Node.js API
const db = await open({ filename: dbPath, driver: sqlite3.Database })
await db.run(`UPDATE chunks SET status='pending', assigned_worker=NULL, started_at=NULL WHERE status='running'`)
await db.close()

// Stop action - DATABASE FIRST
// 1. Reset database state
const db = await open({ filename: dbPath, driver: sqlite3.Database })
const result = await db.run(`UPDATE chunks SET status='pending'...`)
const pendingCount = await db.get(`SELECT COUNT(*) as count FROM chunks WHERE status='pending'`)
await db.close()

// 2. THEN kill processes
const stopCmd = 'pkill -9 -f distributed_coordinator; pkill -9 -f distributed_worker'
try {
  await execAsync(stopCmd)
} catch (err) {
  console.log('📝 No processes to kill (already stopped)')
}

Changes:

  • Added sqlite3/sqlite imports (lines 5-6)
  • Replaced 3 sqlite3 CLI calls with Node.js API
  • Reordered stop logic (database first, pkill second)
  • Enhanced error handling and logging
  • Added count verification for reset chunks

app/cluster/page.tsx (UI Enhancement)

Before:

// Only 3 states
{status.exploration.chunks.running > 0 ? (
  <span> Processing</span>
) : status.exploration.chunks.pending > 0 ? (
  <span> Pending</span>
) : (
  <span> Complete</span>
)}

After:

// 4 states + pending count display
{status.exploration.chunks.running > 0 ? (
  <span className="text-yellow-400"> Processing</span>
) : status.exploration.chunks.pending > 0 ? (
  <span className="text-blue-400"> Pending</span>
) : status.exploration.chunks.completed === status.exploration.chunks.total && 
   status.exploration.chunks.total > 0 ? (
  <span className="text-green-400"> Complete</span>
) : (
  <span className="text-gray-400">⏸️ Idle</span>  // NEW
)}

// Show pending count when present
{status.exploration.chunks.pending > 0 && status.exploration.chunks.running === 0 && (
  <span className="text-gray-400 ml-2">({status.exploration.chunks.pending} pending)</span>
)}

Changes:

  • Added 4th state: "⏸️ Idle" (no work queued)
  • Shows pending chunk count when work queued but not running
  • Better color coding (yellow/blue/green/gray)
  • More precise state logic (checks total > 0 for Complete)

Lessons Learned

1. Docker Environment Constraints

Discovery: Shell commands that work on host may not exist in Docker container

Example:

  • Host: sqlite3 CLI installed system-wide
  • Container: node:20-alpine minimal image (no sqlite3)
  • Solution: Use native libraries already in node_modules

Takeaway: Always test in target environment (container), not just on host


2. Database-First Architecture

Principle: Critical state cleanup should happen BEFORE process management

Example:

  • OLD: Kill processes → reset database (fails if processes already dead)
  • NEW: Reset database → kill processes (always works)

Takeaway: State cleanup operations should be idempotent and order matters


3. Container User Permissions

Discovery: Container runs as UID 1001 (nextjs), not root

Impact:

  • Files created by root (UID 0) are readonly to container
  • SQLite needs write access to both file AND directory (for lock files)
  • Permission 644 not enough, need 664 (group write)

Solution:

chown 1001:1001 cluster/exploration.db  # Match container user
chmod 664 cluster/exploration.db        # Group write
chown 1001:1001 cluster/                # Directory ownership
chmod 775 cluster/                      # Directory write for locks

Takeaway: Always match host file permissions to container user UID


4. Comprehensive Logging

Before: Simple success/failure messages

After: Detailed operation flow with counts

console.log('🔧 Resetting database chunks to pending...')
console.log(`✅ Database cleanup complete - ${result.changes} chunks reset to pending (total pending: ${pendingCount?.count})`)
console.log('📝 No processes to kill (already stopped)')

Benefits:

  • User sees exactly what happened
  • Debugging issues easier (know which step failed)
  • Confirms operation success with verification counts
  • Distinguishes between "no processes found" vs "kill failed"

Takeaway: Verbose logging in infrastructure operations pays off during troubleshooting


Future Enhancements

1. Automatic Permission Handling

Current: Manual chown/chmod required for database

Proposed: Docker entrypoint script that fixes permissions on startup

#!/bin/sh
# Fix cluster directory permissions
chown -R nextjs:nodejs /app/cluster
chmod -R u+rw,g+rw /app/cluster
exec "$@"

2. Database Lock File Cleanup

Current: SQLite may leave lock files on crash

Proposed: Check for stale lock files on Stop

// In stop action:
const lockFiles = fs.readdirSync(clusterDir).filter(f => f.includes('.db-'))
if (lockFiles.length > 0) {
  console.log(`🗑️ Removing ${lockFiles.length} stale lock files`)
  lockFiles.forEach(f => fs.unlinkSync(path.join(clusterDir, f)))
}

3. Status API Health Checks

Current: Status API trusts database without verification

Proposed: Cross-check process existence

// In status API:
if (runningChunks > 0) {
  const psOutput = await execAsync('ps aux | grep -c "[d]istributed_coordinator"')
  const processCount = parseInt(psOutput.stdout)
  if (processCount === 0) {
    console.warn('⚠️ Database shows running chunks but no coordinator process!')
    // Auto-fix: Reset database to pending
  }
}

Verification Checklist

Stop Button Functionality

  • Resets database chunks from "running" to "pending"
  • Works even when coordinator already crashed
  • Returns success response with counts
  • Logs detailed operation flow
  • Handles "no processes to kill" gracefully

UI State Display

  • Shows " Processing" when chunks running
  • Shows " Pending" when work queued but not running
  • Shows " Complete" when all chunks done
  • Shows "⏸️ Idle" when no work at all (NEW)
  • Displays pending chunk count when present

Database Operations

  • Uses Node.js sqlite library (not CLI)
  • Works inside Docker container
  • Proper error handling for database failures
  • Returns verification counts after operations
  • Handles readonly database errors with clear messages

Container Deployment

  • Docker build completes successfully
  • Container starts with new code
  • Trading bot service unaffected
  • No errors in startup logs
  • Database operations work in production

File Permissions

  • Database file owned by UID 1001 (nextjs)
  • Directory owned by UID 1001 (nextjs)
  • File permissions 664 (group write)
  • Directory permissions 775 (group write + execute)
  • SQLite can create lock files

Git History

Commit: db33af9
Date: December 1, 2025
Message: fix: Stop button database reset + UI state display (DATABASE-FIRST ARCHITECTURE)

Changes:

  • app/api/cluster/control/route.ts (55 insertions, 17 deletions)
  • app/cluster/page.tsx (enhanced state display)

Verified:

  • Stop button successfully reset 3 'running' chunks → 'pending'
  • UI correctly shows Idle state after Stop
  • Container logs show detailed operation flow
  • Database operations work in Docker environment

Summary

What We Fixed

  1. Stop Button Database Reset

    • Reordered logic: database cleanup FIRST, process kill second
    • Replaced sqlite3 CLI with Node.js API (Docker compatible)
    • Fixed file permissions for container write access
    • Added comprehensive logging and error handling
  2. Stale Metrics Display

    • Added 4th UI state: "⏸️ Idle" (no work queued)
    • Show pending chunk count when work queued
    • Better visual differentiation (colors, emojis)
    • Accurate state after Stop button operation

Why It Matters

User Impact:

  • Can now confidently restart cluster after crashes
  • Clear visual feedback of cluster state
  • No confusion from stale "ACTIVE" displays
  • Reliable cleanup operation

Technical Impact:

  • Database-first architecture prevents state corruption
  • Container-compatible implementation (no shell dependencies)
  • Proper error handling and verification
  • Comprehensive logging for debugging

Research Impact:

  • Reliable parameter exploration infrastructure
  • Can recover from crashes without manual intervention
  • Clean database state enables systematic experimentation
  • No wasted compute from stuck "running" chunks

Status: COMPLETE

All issues resolved and verified in production environment.

Next Steps:

  • Monitor cluster operations for additional edge cases
  • Consider implementing automated permission handling
  • Add health checks to status API for process verification

No Further Action Required: System working correctly with database-first architecture.