trading_bot_v4/cluster/STATUS_REPORT_DEC4_2025.md
mindesbunister c4cc16ede2 docs: EPYC cluster status report Dec 4, 2025
- Worker2 time restriction implementation complete
- Stuck chunk 14 resolved
- Performance impact analysis
- Monitoring commands and verification tests
- Expected behavior documentation
2025-12-04 15:19:21 +01:00


EPYC Cluster Status Report - December 4, 2025

Report Time: 15:18 CET (office hours)
Issue: Noise constraint on Worker2 (Node 2) during office hours (06:00-19:00)
Status: RESOLVED - Time-restricted scheduling implemented


Summary

Time-based worker scheduling is now in place to handle the noise constraint on Worker2 (bd-host01). Worker2 is assigned parameter sweep chunks only during off-hours (19:00-06:00), while Worker1 continues to run 24/7.


Current Cluster Status

Sweep Progress

  • Total Chunks: 1,693
  • Completed: 64 (3.8%)
  • Running: 1 (worker1: chunk 14)
  • Pending: 1,628

Worker Status (15:18 - Office Hours)

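| Worker  | Status  | Current Load        | Restriction      | Notes                 |
|---------|---------|---------------------|------------------|-----------------------|
| Worker1 | Active  | Processing chunk 14 | None             | 24/7 operation        |
| Worker2 | ⏸️ Idle | 0 processes         | 19:00-06:00 only | Waiting for off-hours |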

Changes Implemented

1. Time-Restricted Worker Configuration

Added to v9_advanced_coordinator.py:

WORKERS = {
    'worker1': {
        'host': 'root@10.10.254.106',
        'workspace': '/home/comprehensive_sweep',
        # No restriction - runs 24/7
    },
    'worker2': {
        'host': 'root@10.20.254.100', 
        'workspace': '/home/backtest_dual/backtest',
        'ssh_hop': 'root@10.10.254.106',
        'time_restricted': True,      # Enable time control
        'allowed_start_hour': 19,     # 7 PM start
        'allowed_end_hour': 6,        # 6 AM end
    }
}

2. Time Validation Function

from datetime import datetime

def is_worker_allowed_to_run(worker_name: str) -> bool:
    """Check if worker is allowed to run based on time restrictions"""
    worker = WORKERS[worker_name]

    # Workers without the flag are unrestricted
    if not worker.get('time_restricted', False):
        return True

    current_hour = datetime.now().hour  # local time on the coordinator host
    start_hour = worker['allowed_start_hour']
    end_hour = worker['allowed_end_hour']

    # Overnight window (e.g. 19:00-06:00) wraps past midnight
    if start_hour > end_hour:
        return current_hour >= start_hour or current_hour < end_hour

    # Same-day window
    return start_hour <= current_hour < end_hour
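
Because the function reads the clock directly, its boundary behaviour is hard to check outside the relevant hours. The following is a minimal sketch of the same comparison with the hour passed in as an argument. The helper name `hour_in_window` is hypothetical and not part of the coordinator.

```python
def hour_in_window(hour: int, start_hour: int, end_hour: int) -> bool:
    """Return True if `hour` lies in the half-open window [start_hour, end_hour).

    A window with start_hour > end_hour is treated as wrapping past midnight.
    """
    if start_hour > end_hour:
        # Overnight window, e.g. 19 -> 6: evening part OR early-morning part
        return hour >= start_hour or hour < end_hour
    return start_hour <= hour < end_hour


# Boundary checks for the 19:00-06:00 window used by worker2
for hour in (18, 19, 23, 0, 5, 6):
    print(hour, hour_in_window(hour, 19, 6))
```

With this comparison the start hour is inclusive and the end hour is exclusive, and a window whose start and end hours are equal never matches.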

3. Coordinator Integration

Worker assignment loop now checks time restrictions:

for worker_name in WORKERS.keys():
    # Check if worker is allowed to run
    if not is_worker_allowed_to_run(worker_name):
        if iteration % 10 == 0:  # Log every 10 iterations
            print(f"⏰ {worker_name} not allowed (office hours, noise restriction)")
        continue
    
    # ... proceed with work assignment ...

Issues Resolved

Stuck Chunk Problem

  • Chunk ID: v9_advanced_chunk_0014
  • Issue: Stuck in "running" state since Dec 2, 15:14 (46+ hours)
  • Cause: Worker2 process failed but database wasn't updated
  • Resolution: Reset to pending status
  • Current Status: Reassigned to worker1, processing now

SQL used for the reset:

UPDATE v9_advanced_chunks
SET status='pending', assigned_worker=NULL
WHERE id='v9_advanced_chunk_0014';
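
The reset above was done by hand. As a rough sketch of how such stale chunks could be found and reset automatically, the example below uses an in-memory database. It assumes a `started_at` text column in `YYYY-MM-DD HH:MM:SS` format and a 24-hour threshold; neither is confirmed by this report, and the second demo row is made up.

```python
import sqlite3
from datetime import datetime, timedelta


def reset_stale_chunks(conn, now, max_age_hours=24):
    """Reset chunks 'running' for longer than max_age_hours; return their ids."""
    cutoff = (now - timedelta(hours=max_age_hours)).strftime("%Y-%m-%d %H:%M:%S")
    rows = conn.execute(
        "SELECT id FROM v9_advanced_chunks "
        "WHERE status='running' AND started_at < ?",  # started_at is assumed
        (cutoff,),
    ).fetchall()
    stale_ids = [row[0] for row in rows]
    conn.executemany(
        "UPDATE v9_advanced_chunks SET status='pending', assigned_worker=NULL "
        "WHERE id=?",
        [(chunk_id,) for chunk_id in stale_ids],
    )
    conn.commit()
    return stale_ids


# Demo on a throwaway in-memory database (schema and second row are illustrative)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE v9_advanced_chunks "
    "(id TEXT PRIMARY KEY, status TEXT, assigned_worker TEXT, started_at TEXT)"
)
conn.executemany(
    "INSERT INTO v9_advanced_chunks VALUES (?, ?, ?, ?)",
    [
        ("v9_advanced_chunk_0014", "running", "worker2", "2025-12-02 15:14:00"),
        ("demo_recent_chunk", "running", "worker1", "2025-12-04 14:30:00"),
    ],
)
stale = reset_stale_chunks(conn, now=datetime(2025, 12, 4, 15, 18))
print("Reset:", stale)
```

The threshold would need to stay well above the longest legitimate chunk runtime, otherwise healthy chunks would be reset as well.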

Performance Impact

Operating Hours Comparison

Before:

  • Worker1: 32 cores × 24h = 768 core-hours/day
  • Worker2: 32 cores × 24h = 768 core-hours/day
  • Total: 1,536 core-hours/day

After:

  • Worker1: 32 cores × 24h = 768 core-hours/day
  • Worker2: 32 cores × 11h = 352 core-hours/day (19:00-06:00)
  • Total: 1,120 core-hours/day

Impact: -27% daily throughput (acceptable for quiet office hours)

Estimated Completion Time

  • Original Estimate: ~40 days (both workers 24/7)
  • With Restriction: ~55 days (worker2 time-limited; 40 × 1,536 / 1,120 ≈ 54.9)
  • Delta: +15 days
  • Trade-off: Worth it for quiet work environment
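
The figures in this section can be reproduced with a few lines of arithmetic. This sketch assumes that throughput scales linearly with available core-hours and that both workers process chunks at the same rate per core, which the report does not verify.

```python
CORES_PER_WORKER = 32

# Hours per day each worker is allowed to run
worker1_hours = 24
worker2_hours_before = 24
worker2_hours_after = 11  # 19:00-06:00

before = CORES_PER_WORKER * (worker1_hours + worker2_hours_before)
after = CORES_PER_WORKER * (worker1_hours + worker2_hours_after)
throughput_change = after / before - 1

# Scale the ~40-day estimate by the loss in daily core-hours
original_days = 40
restricted_days = original_days * before / after

print(f"Core-hours/day: {before} -> {after}")
print(f"Throughput change: {throughput_change:.1%}")
print(f"Estimated completion: {restricted_days:.1f} days")
```

Under these assumptions the restricted schedule comes out at just under 55 days.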

Verification Tests

Time Restriction Logic Test (14:18 CET)

$ python3 -c "
from datetime import datetime
current_hour = datetime.now().hour
allowed = current_hour >= 19 or current_hour < 6
print(f'Current hour: {current_hour}')
print(f'Worker2 allowed: {allowed}')
"

Current hour: 14
Worker2 allowed: False  ✅ CORRECT
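
The one-liner above checks only the hour at which it was run. As a complementary sketch, the same expression can be evaluated for all 24 hours to confirm that the window covers 11 hours, the figure used in the Performance Impact section.

```python
# Same expression as in the one-liner, evaluated for every hour of the day
allowed_hours = [h for h in range(24) if h >= 19 or h < 6]
blocked_hours = [h for h in range(24) if h not in allowed_hours]

print(f"Allowed hours ({len(allowed_hours)}): {allowed_hours}")
print(f"Blocked hours ({len(blocked_hours)}): {blocked_hours}")
```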

Worker Assignment Check

$ sqlite3 exploration.db "
SELECT assigned_worker, COUNT(*) 
FROM v9_advanced_chunks 
WHERE status='running' 
GROUP BY assigned_worker;
"

worker1|1  ✅ Only worker1 active during office hours

Expected Behavior

During Office Hours (06:00 - 19:00)

  • Worker1: Processing chunks continuously
  • ⏸️ Worker2: Idle (no processes, no noise)
  • 📋 Coordinator: Logs "⏰ worker2 not allowed (office hours, noise restriction)" every 10 iterations

During Off-Hours (19:00 - 06:00)

  • Worker1: Processing chunks continuously
  • Worker2: Processing chunks at full capacity
  • 🚀 Both workers: Maximum throughput (64 cores combined)

Transition Times

  • 19:00 (7 PM): Worker2 becomes active, starts processing pending chunks
  • 06:00 (6 AM): Worker2 is assigned no new chunks; it finishes its current chunk, then stays idle until 19:00
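
For monitoring around these transitions it can help to know how long it is until the next window opens. The sketch below uses a hypothetical helper `next_window_start` that is not part of the coordinator. It works on naive local datetimes and ignores daylight-saving changes.

```python
from datetime import datetime, timedelta


def next_window_start(now: datetime, start_hour: int = 19) -> datetime:
    """Return the next time the clock reaches start_hour:00 strictly after `now`."""
    candidate = now.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        # Today's start has already passed, so the next one is tomorrow
        candidate += timedelta(days=1)
    return candidate


report_time = datetime(2025, 12, 4, 15, 18)
wait = next_window_start(report_time) - report_time
print(f"Worker2 window opens in {wait}")
```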

Monitoring Commands

Check Current Worker Status

# Worker1 processes (the [v] pattern keeps grep from counting itself)
ssh root@10.10.254.106 "ps aux | grep -c '[v]9_advanced_worker'"

# Worker2 processes (should be 0 during office hours)
ssh root@10.10.254.106 "ssh root@10.20.254.100 'ps aux | grep -c \"[v]9_advanced_worker\"'"

Check Sweep Progress

cd /home/icke/traderv4/cluster
sqlite3 exploration.db "
SELECT 
    status, 
    COUNT(*) as chunks,
    ROUND(100.0 * COUNT(*) / 1693, 1) as percent
FROM v9_advanced_chunks 
GROUP BY status 
ORDER BY status;
"
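
The query above hardcodes the total of 1,693 chunks. A possible Python equivalent that derives the total from the table itself is sketched here against an in-memory database filled with the counts from this report. The function name `sweep_progress` is illustrative, and only the `status` column is modelled.

```python
import sqlite3


def sweep_progress(conn):
    """Return {status: (chunk_count, percent_of_total)} for v9_advanced_chunks."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM v9_advanced_chunks "
        "GROUP BY status ORDER BY status"
    ).fetchall()
    total = sum(count for _, count in rows)
    return {
        status: (count, round(100.0 * count / total, 1))
        for status, count in rows
    }


# Demo: in-memory table populated with the counts from this report
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE v9_advanced_chunks (status TEXT)")
for status, count in (("completed", 64), ("running", 1), ("pending", 1628)):
    conn.executemany(
        "INSERT INTO v9_advanced_chunks VALUES (?)", [(status,)] * count
    )

progress = sweep_progress(conn)
for status, (count, percent) in progress.items():
    print(f"{status:<10} {count:>5} {percent:>5}%")
```

Pointing `sqlite3.connect` at `exploration.db` instead of `:memory:` would report on the live sweep, assuming the table layout matches.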

Watch Coordinator Logs

# Real-time monitoring
tail -f /home/icke/traderv4/cluster/v9_advanced_coordinator.log

# Watch for time restriction messages
tail -f /home/icke/traderv4/cluster/v9_advanced_coordinator.log | grep "⏰"

Check Worker Assignments

sqlite3 exploration.db "
SELECT 
    assigned_worker,
    status,
    COUNT(*) as chunks
FROM v9_advanced_chunks
WHERE assigned_worker IS NOT NULL
GROUP BY assigned_worker, status;
"

Manual Overrides (If Needed)

Temporarily Disable Time Restriction

If an urgent sweep is needed during office hours:

  1. Edit /home/icke/traderv4/cluster/v9_advanced_coordinator.py
  2. Comment out time restriction:
'worker2': {
    # 'time_restricted': True,  # DISABLED FOR URGENT SWEEP
    'allowed_start_hour': 19,
    'allowed_end_hour': 6,
}
  3. Restart coordinator: pkill -f v9_advanced_coordinator && nohup python3 -u v9_advanced_coordinator.py >> v9_advanced_coordinator.log 2>&1 &

Adjust Operating Hours

To change worker2 hours (e.g., 20:00-05:00):

'worker2': {
    'time_restricted': True,
    'allowed_start_hour': 20,  # 8 PM
    'allowed_end_hour': 5,     # 5 AM
}

Then restart coordinator.
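
Since the hours are edited by hand, a small sanity check before restarting may catch typos. The following is a sketch only. `validate_time_window` is a hypothetical helper that does not exist in the coordinator, and the config keys are the ones shown above.

```python
def validate_time_window(name: str, worker: dict) -> None:
    """Raise ValueError if a time-restricted worker has an unusable window."""
    if not worker.get("time_restricted", False):
        return  # unrestricted workers need no window

    for key in ("allowed_start_hour", "allowed_end_hour"):
        hour = worker.get(key)
        if not isinstance(hour, int) or not 0 <= hour <= 23:
            raise ValueError(f"{name}: {key} must be an integer 0-23, got {hour!r}")

    if worker["allowed_start_hour"] == worker["allowed_end_hour"]:
        # With the coordinator's comparison, start == end never matches
        raise ValueError(f"{name}: start and end hour are equal, window is empty")


validate_time_window("worker2", {
    "time_restricted": True,
    "allowed_start_hour": 20,  # 8 PM
    "allowed_end_hour": 5,     # 5 AM
})
print("worker2 window OK")
```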


Git Commits

Changes committed and pushed to repository:

  1. f40fd66 - feat: Add time-restricted scheduling for worker2

    • Worker configuration with time restrictions
    • Coordinator loop integration
  2. f2f2992 - fix: Add is_worker_allowed_to_run function definition

    • Time validation function implementation
  3. 0babd1e - docs: Add worker2 time restriction documentation

    • Complete guide (WORKER2_TIME_RESTRICTION.md)

Next Steps

Tonight (19:00 - After Hours)

  • Worker2 will automatically activate at 19:00
  • Monitor first few chunks to ensure smooth operation
  • Check logs: tail -f v9_advanced_coordinator.log | grep worker2

Tomorrow Morning (06:00)

  • Verify worker2 received no new chunks after 06:00 (a chunk started before 06:00 may still be finishing)
  • Check overnight progress: How many chunks completed?
  • Confirm office remains quiet during work hours

Weekly Monitoring

  • Track worker2 contribution rate (chunks/night)
  • Compare worker1 24/7 vs worker2 11h/day productivity
  • Adjust hours if needed based on office schedule

Support

Files Modified:

  • /home/icke/traderv4/cluster/v9_advanced_coordinator.py - Time restriction logic
  • /home/icke/traderv4/cluster/exploration.db - Reset stuck chunk 14

Documentation:

  • /home/icke/traderv4/cluster/WORKER2_TIME_RESTRICTION.md - Complete guide
  • /home/icke/traderv4/cluster/STATUS_REPORT_DEC4_2025.md - This report

Key Personnel:

  • Implementation: AI Agent (Dec 4, 2025)
  • Requirement: User (noise constraint 06:00-19:00)

Conclusion

  • Worker2 time restriction successfully implemented
  • Stuck chunk 14 resolved
  • Worker1 continues 24/7 processing
  • ⏸️ Worker2 waiting for 19:00 to start
  • 📊 Sweep progress: 3.8% (64/1693 chunks)

System operating as expected. Worker2 will automatically activate tonight at 19:00 and process chunks until 06:00 tomorrow morning. Office hours remain quiet while maintaining sweep progress through worker1.