docs: EPYC cluster status report Dec 4, 2025

- Worker2 time restriction implementation complete
- Stuck chunk 14 resolved
- Performance impact analysis
- Monitoring commands and verification tests
- Expected behavior documentation
mindesbunister
2025-12-04 15:19:21 +01:00
parent f2f2992a98
commit c4cc16ede2


@@ -0,0 +1,323 @@
# EPYC Cluster Status Report - December 4, 2025
**Report Time:** 15:18 CET (office hours)
**Issue:** Node 2 noise constraint during office hours (06:00-19:00)
**Status:** ✅ RESOLVED - Time-restricted scheduling implemented
---
## Summary
Implemented time-based worker scheduling to honor the noise constraint on Worker2 (bd-host01). Worker2 now processes parameter sweep chunks only during off-hours (19:00-06:00), while Worker1 continues to run 24/7.
---
## Current Cluster Status
### Sweep Progress
- **Total Chunks:** 1,693
- **Completed:** 64 (3.8%)
- **Running:** 1 (worker1: chunk 14)
- **Pending:** 1,628
### Worker Status (15:18 - Office Hours)
| Worker | Status | Current Load | Restriction | Notes |
|--------|--------|--------------|-------------|-------|
| Worker1 | ✅ Active | Processing chunk 14 | None | 24/7 operation |
| Worker2 | ⏸️ Idle | 0 processes | **19:00-06:00 only** | Waiting for off-hours |
---
## Changes Implemented
### 1. Time-Restricted Worker Configuration
Added to `v9_advanced_coordinator.py`:
```python
WORKERS = {
    'worker1': {
        'host': 'root@10.10.254.106',
        'workspace': '/home/comprehensive_sweep',
        # No restriction - runs 24/7
    },
    'worker2': {
        'host': 'root@10.20.254.100',
        'workspace': '/home/backtest_dual/backtest',
        'ssh_hop': 'root@10.10.254.106',
        'time_restricted': True,    # Enable time control
        'allowed_start_hour': 19,   # 7 PM start
        'allowed_end_hour': 6,      # 6 AM end
    }
}
```
### 2. Time Validation Function
```python
from datetime import datetime


def is_worker_allowed_to_run(worker_name: str) -> bool:
    """Check if worker is allowed to run based on time restrictions"""
    worker = WORKERS[worker_name]
    if not worker.get('time_restricted', False):
        return True
    # Hour (0-23) in the coordinator host's local time
    current_hour = datetime.now().hour
    start_hour = worker['allowed_start_hour']
    end_hour = worker['allowed_end_hour']
    # Handle overnight range (19:00-06:00)
    if start_hour > end_hour:
        allowed = current_hour >= start_hour or current_hour < end_hour
    else:
        allowed = start_hour <= current_hour < end_hour
    return allowed
```
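Because `is_worker_allowed_to_run` reads the clock internally, its behavior at the 19:00 and 06:00 boundaries is awkward to test. A minimal sketch of a testable variant follows; it assumes a hypothetical pure helper, `is_hour_allowed`, which is not part of the coordinator and simply takes the hour as an argument:

```python
def is_hour_allowed(hour: int, start_hour: int, end_hour: int) -> bool:
    """Return True if `hour` lies in the half-open window [start_hour, end_hour).

    Hypothetical helper for illustration; the coordinator does not define it.
    """
    if start_hour > end_hour:
        # Window wraps past midnight, e.g. 19:00-06:00.
        return hour >= start_hour or hour < end_hour
    return start_hour <= hour < end_hour


# Hours of the day at which worker2's 19:00-06:00 window is open.
open_hours = [h for h in range(24) if is_hour_allowed(h, 19, 6)]
print(open_hours)
```

The window is half-open: hour 19 is allowed and hour 6 is not, which gives the 11 allowed hours per day used in the performance figures.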
### 3. Coordinator Integration
Worker assignment loop now checks time restrictions:
```python
for worker_name in WORKERS.keys():
    # Check if worker is allowed to run
    if not is_worker_allowed_to_run(worker_name):
        if iteration % 10 == 0:  # Log every 10 iterations
            print(f"⏰ {worker_name} not allowed (office hours, noise restriction)")
        continue
    # ... proceed with work assignment ...
```
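The gating and log throttling in this loop can be exercised without a real clock or SSH connection. The following is a sketch only: `eligible_workers`, its parameters, and the log wording are hypothetical, and the time check is passed in as a callable.

```python
from typing import Callable, Iterable, List


def eligible_workers(
    worker_names: Iterable[str],
    is_allowed: Callable[[str], bool],
    iteration: int,
    log: Callable[[str], None] = print,
    log_every: int = 10,
) -> List[str]:
    """Return the workers that may receive work in this iteration.

    Hypothetical helper: a skipped worker is logged only on every
    `log_every`-th iteration so the log is not flooded.
    """
    allowed = []
    for name in worker_names:
        if is_allowed(name):
            allowed.append(name)
        elif iteration % log_every == 0:
            log(f"{name} skipped: outside allowed hours")
    return allowed


def office_hours_check(name: str) -> bool:
    """Stand-in for the real time check: only worker1 may run."""
    return name == "worker1"


# Simulate 20 coordinator iterations during office hours.
messages: List[str] = []
for iteration in range(20):
    selected = eligible_workers(["worker1", "worker2"], office_hours_check,
                                iteration, log=messages.append)
print(selected, len(messages))
```

Over 20 iterations worker2 is skipped every time but logged only at iterations 0 and 10.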
---
## Issues Resolved
### Stuck Chunk Problem
- **Chunk ID:** v9_advanced_chunk_0014
- **Issue:** Stuck in "running" state since Dec 2, 15:14 (46+ hours)
- **Cause:** Worker2 process failed but database wasn't updated
- **Resolution:** Reset to pending status
- **Current Status:** Reassigned to worker1, processing now
```sql
UPDATE v9_advanced_chunks
SET status='pending', assigned_worker=NULL
WHERE id='v9_advanced_chunk_0014';
```
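A chunk stranded in `running` after a worker crash could also be detected automatically rather than reset by hand. The sketch below is an illustration under several assumptions: it runs against an in-memory SQLite database, and the `started_at` column, the `reset_stale_chunks` helper, the 24-hour threshold, and the second demo row are all hypothetical. Only the table name and the `id`, `status`, and `assigned_worker` columns come from the query above.

```python
import sqlite3
from datetime import datetime, timedelta


def reset_stale_chunks(conn: sqlite3.Connection, now: datetime,
                       max_age: timedelta) -> int:
    """Reset chunks that have been 'running' for longer than max_age.

    Assumes a hypothetical text column `started_at` in
    'YYYY-MM-DD HH:MM:SS' format. Returns the number of chunks reset.
    """
    cutoff = (now - max_age).isoformat(sep=" ")
    cur = conn.execute(
        "UPDATE v9_advanced_chunks "
        "SET status='pending', assigned_worker=NULL "
        "WHERE status='running' AND started_at < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount


# Demo on an in-memory database; the second row is made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE v9_advanced_chunks "
    "(id TEXT PRIMARY KEY, status TEXT, assigned_worker TEXT, started_at TEXT)"
)
conn.executemany(
    "INSERT INTO v9_advanced_chunks VALUES (?, ?, ?, ?)",
    [
        ("v9_advanced_chunk_0014", "running", "worker2", "2025-12-02 15:14:00"),
        ("demo_recent_chunk", "running", "worker1", "2025-12-04 14:00:00"),
    ],
)
reset_count = reset_stale_chunks(
    conn, now=datetime(2025, 12, 4, 15, 18), max_age=timedelta(hours=24)
)
print(reset_count)
```

The threshold would need to exceed the longest legitimate chunk runtime, which this report does not state.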
---
## Performance Impact
### Operating Hours Comparison
**Before:**
- Worker1: 32 cores × 24h = 768 core-hours/day
- Worker2: 32 cores × 24h = 768 core-hours/day
- **Total:** 1,536 core-hours/day
**After:**
- Worker1: 32 cores × 24h = 768 core-hours/day
- Worker2: 32 cores × 11h = 352 core-hours/day (19:00-06:00)
- **Total:** 1,120 core-hours/day
**Impact:** -27% daily throughput (acceptable for quiet office hours)
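The core-hour figures follow directly from the length of the allowed window. A short sketch of the arithmetic, assuming a hypothetical `window_hours` helper and the 32 cores per worker stated above:

```python
def window_hours(start_hour: int, end_hour: int) -> int:
    """Length in hours of a daily window, allowing wrap past midnight.

    Hypothetical helper; assumes start_hour != end_hour.
    """
    return (end_hour - start_hour) % 24


CORES_PER_WORKER = 32

before = 2 * CORES_PER_WORKER * 24                      # both workers 24/7
worker2_after = CORES_PER_WORKER * window_hours(19, 6)  # 11 h per night
after = CORES_PER_WORKER * 24 + worker2_after
change_pct = 100 * (after - before) / before

print(before, worker2_after, after, round(change_pct, 1))
# 1536 352 1120 -27.1
```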
### Estimated Completion Time
- **Original Estimate:** ~40 days (both workers 24/7)
- **With Restriction:** ~55 days (worker2 time-limited; 40 × 1,536 / 1,120 ≈ 54.9)
- **Delta:** +15 days
- **Trade-off:** Worth it for quiet work environment
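As a rough cross-check, the restricted estimate can be derived by scaling the original one by the ratio of daily core-hours. This assumes that sweep throughput is proportional to core-hours and that both workers process chunks at the same per-core rate, neither of which this report measures.

```python
ORIGINAL_DAYS = 40         # estimate with both workers running 24/7
CORE_HOURS_BEFORE = 1536   # per day, both workers unrestricted
CORE_HOURS_AFTER = 1120    # per day, worker2 limited to 11 h

# Same total work, less capacity per day: duration grows by the inverse ratio.
restricted_days = ORIGINAL_DAYS * CORE_HOURS_BEFORE / CORE_HOURS_AFTER
delta_days = restricted_days - ORIGINAL_DAYS

print(f"{restricted_days:.1f} days, +{delta_days:.1f} days")
```

Since the 40-day baseline is itself approximate, the result is good to about a day at best.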
---
## Verification Tests
### Time Restriction Logic Test (14:18 CET)
```bash
$ python3 -c "
from datetime import datetime
current_hour = datetime.now().hour
allowed = current_hour >= 19 or current_hour < 6
print(f'Current hour: {current_hour}')
print(f'Worker2 allowed: {allowed}')
"
Current hour: 14
Worker2 allowed: False ✅ CORRECT
```
### Worker Assignment Check
```bash
$ sqlite3 exploration.db "
SELECT assigned_worker, COUNT(*)
FROM v9_advanced_chunks
WHERE status='running'
GROUP BY assigned_worker;
"
worker1|1 ✅ Only worker1 active during office hours
```
---
## Expected Behavior
### During Office Hours (06:00 - 19:00)
- ✅ Worker1: Processing chunks continuously
- ⏸️ Worker2: Idle (no processes, no noise)
- 📋 Coordinator: Logs "⏰ worker2 not allowed (office hours, noise restriction)" every 10th iteration
### During Off-Hours (19:00 - 06:00)
- ✅ Worker1: Processing chunks continuously
- ✅ Worker2: Processing chunks at full capacity
- 🚀 Both workers: Maximum throughput (64 cores combined)
### Transition Times
- **19:00 (7 PM):** Worker2 becomes active, starts processing pending chunks
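- **06:00 (6 AM):** Worker2 receives no new chunks; a chunk already in progress runs to completion (the check gates assignment only), after which Worker2 stays idle until 19:00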
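To see how long worker2 will stay idle at a given moment, the time until its window next opens can be computed from the configured hours. This is a sketch; `seconds_until_window_opens` is a hypothetical helper that uses the same hour-granular rule as the coordinator check and works on naive local datetimes, so it ignores daylight-saving shifts.

```python
from datetime import datetime, timedelta


def seconds_until_window_opens(now: datetime, start_hour: int,
                               end_hour: int) -> float:
    """Seconds until the allowed window opens; 0.0 if it is open now.

    Hypothetical helper mirroring the [start_hour, end_hour) rule.
    """
    hour = now.hour
    if start_hour > end_hour:
        open_now = hour >= start_hour or hour < end_hour
    else:
        open_now = start_hour <= hour < end_hour
    if open_now:
        return 0.0
    next_open = now.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    if next_open <= now:
        # Today's opening time has already passed; use tomorrow's.
        next_open += timedelta(days=1)
    return (next_open - now).total_seconds()


# At the report time (15:18), worker2 still has 3 h 42 min to wait.
wait = seconds_until_window_opens(datetime(2025, 12, 4, 15, 18), 19, 6)
print(timedelta(seconds=wait))
```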
---
## Monitoring Commands
### Check Current Worker Status
```bash
# Worker1 processes (the [v] bracket keeps grep from counting itself)
ssh root@10.10.254.106 "ps aux | grep -c '[v]9_advanced_worker'"
# Worker2 processes (should be 0 during office hours)
ssh root@10.10.254.106 "ssh root@10.20.254.100 'ps aux | grep -c \"[v]9_advanced_worker\"'"
```
### Check Sweep Progress
```bash
cd /home/icke/traderv4/cluster
sqlite3 exploration.db "
SELECT
status,
COUNT(*) as chunks,
ROUND(100.0 * COUNT(*) / 1693, 1) as percent
FROM v9_advanced_chunks
GROUP BY status
ORDER BY status;
"
```
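The query above hard-codes the total of 1,693 chunks. Below is a sketch of the same report in Python with the total derived from the table itself. It runs against an in-memory database filled with the counts from this report; the `progress_by_status` helper and the `completed` status string are assumptions (only `pending` and `running` appear in the queries here).

```python
import sqlite3
from typing import Dict, Tuple


def progress_by_status(conn: sqlite3.Connection) -> Dict[str, Tuple[int, float]]:
    """Map each status to (chunk count, percent of all chunks)."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM v9_advanced_chunks GROUP BY status"
    ).fetchall()
    total = sum(count for _, count in rows)
    if total == 0:
        return {}
    return {status: (count, round(100.0 * count / total, 1))
            for status, count in rows}


# Demo data reproducing the counts in this report (64 / 1 / 1,628).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE v9_advanced_chunks (id TEXT, status TEXT)")
statuses = ["completed"] * 64 + ["running"] * 1 + ["pending"] * 1628
conn.executemany(
    "INSERT INTO v9_advanced_chunks VALUES (?, ?)",
    [(f"demo_chunk_{i:04d}", status) for i, status in enumerate(statuses)],
)

progress = progress_by_status(conn)
print(progress)
```

Deriving the total from the table keeps the percentages correct if chunks are ever added to or removed from the sweep.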
### Watch Coordinator Logs
```bash
# Real-time monitoring
tail -f /home/icke/traderv4/cluster/v9_advanced_coordinator.log
# Watch for time restriction messages
tail -f /home/icke/traderv4/cluster/v9_advanced_coordinator.log | grep "⏰"
```
### Check Worker Assignments
```bash
sqlite3 exploration.db "
SELECT
assigned_worker,
status,
COUNT(*) as chunks
FROM v9_advanced_chunks
WHERE assigned_worker IS NOT NULL
GROUP BY assigned_worker, status;
"
```
---
## Manual Overrides (If Needed)
### Temporarily Disable Time Restriction
If an urgent sweep is needed during office hours:
1. Edit `/home/icke/traderv4/cluster/v9_advanced_coordinator.py`
2. Comment out time restriction:
```python
'worker2': {
    # 'time_restricted': True,  # DISABLED FOR URGENT SWEEP
    'allowed_start_hour': 19,
    'allowed_end_hour': 6,
}
```
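3. Restart the coordinator from `/home/icke/traderv4/cluster`: `pkill -f 'v9_advanced_coordinator\.py' && nohup python3 -u v9_advanced_coordinator.py >> v9_advanced_coordinator.log 2>&1 &` (the narrower pattern avoids also killing a running `tail -f v9_advanced_coordinator.log`)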
### Adjust Operating Hours
To change worker2 hours (e.g., 20:00-05:00):
```python
'worker2': {
    'time_restricted': True,
    'allowed_start_hour': 20,  # 8 PM
    'allowed_end_hour': 5,     # 5 AM
}
```
Then restart coordinator.
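A mistyped hour fails silently with the current check. For example, setting both hours to the same value makes the non-wrapping branch `start_hour <= current_hour < end_hour` always false, so the worker never runs. A sketch of a startup validation, assuming a hypothetical `validate_time_window` helper that the coordinator does not currently have:

```python
def validate_time_window(worker_name: str, config: dict) -> None:
    """Raise ValueError if a time-restricted worker has unusable hours.

    Hypothetical helper; the key names match the WORKERS config.
    """
    if not config.get("time_restricted", False):
        return
    for key in ("allowed_start_hour", "allowed_end_hour"):
        hour = config.get(key)
        if not isinstance(hour, int) or not 0 <= hour <= 23:
            raise ValueError(
                f"{worker_name}: {key} must be an int in 0-23, got {hour!r}"
            )
    if config["allowed_start_hour"] == config["allowed_end_hour"]:
        raise ValueError(
            f"{worker_name}: start and end hour are equal, "
            "so the window would never open"
        )


# A valid 20:00-05:00 window passes silently.
validate_time_window("worker2", {"time_restricted": True,
                                 "allowed_start_hour": 20,
                                 "allowed_end_hour": 5})

# An out-of-range hour is rejected before the coordinator starts work.
try:
    validate_time_window("worker2", {"time_restricted": True,
                                     "allowed_start_hour": 24,
                                     "allowed_end_hour": 5})
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

Running such a check once over every entry in `WORKERS` at startup would surface a bad edit immediately instead of at 19:00.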
---
## Git Commits
Changes committed and pushed to repository:
1. **f40fd66** - `feat: Add time-restricted scheduling for worker2`
- Worker configuration with time restrictions
- Coordinator loop integration
2. **f2f2992** - `fix: Add is_worker_allowed_to_run function definition`
- Time validation function implementation
3. **0babd1e** - `docs: Add worker2 time restriction documentation`
- Complete guide (WORKER2_TIME_RESTRICTION.md)
---
## Next Steps
### Tonight (19:00 - After Hours)
- Worker2 will automatically activate at 19:00
- Monitor first few chunks to ensure smooth operation
- Check logs: `tail -f v9_advanced_coordinator.log | grep worker2`
### Tomorrow Morning (06:00)
- Verify worker2 received no new chunks after 06:00 and went idle once its in-flight chunk finished
- Check overnight progress: How many chunks completed?
- Confirm office remains quiet during work hours
### Weekly Monitoring
- Track worker2 contribution rate (chunks/night)
- Compare worker1 24/7 vs worker2 11h/day productivity
- Adjust hours if needed based on office schedule
---
## Support
**Files Modified:**
- `/home/icke/traderv4/cluster/v9_advanced_coordinator.py` - Time restriction logic
- `/home/icke/traderv4/cluster/exploration.db` - Reset stuck chunk 14
**Documentation:**
- `/home/icke/traderv4/cluster/WORKER2_TIME_RESTRICTION.md` - Complete guide
- `/home/icke/traderv4/cluster/STATUS_REPORT_DEC4_2025.md` - This report
**Key Personnel:**
- Implementation: AI Agent (Dec 4, 2025)
- Requirement: User (noise constraint 06:00-19:00)
---
## Conclusion
- ✅ **Worker2 time restriction successfully implemented**
- ✅ **Stuck chunk 14 resolved**
- ✅ **Worker1 continues 24/7 processing**
- ⏸️ **Worker2 waiting for 19:00 to start**
- 📊 **Sweep progress: 3.8% (64/1693 chunks)**
System operating as expected. Worker2 will automatically activate tonight at 19:00 and process chunks until 06:00 tomorrow morning. Office hours remain quiet while maintaining sweep progress through worker1.