docs: EPYC cluster status report Dec 4, 2025
- Worker2 time restriction implementation complete
- Stuck chunk 14 resolved
- Performance impact analysis
- Monitoring commands and verification tests
- Expected behavior documentation
323
cluster/STATUS_REPORT_DEC4_2025.md
Normal file
@@ -0,0 +1,323 @@
# EPYC Cluster Status Report - December 4, 2025

**Report Time:** 15:18 CET (office hours)
**Issue:** Node 2 noise constraint during office hours (06:00-19:00)
**Status:** ✅ RESOLVED - Time-restricted scheduling implemented

---

## Summary

Time-based worker scheduling is now in place to enforce the noise constraint on Worker2 (bd-host01). The coordinator assigns parameter sweep chunks to Worker2 only during off-hours (19:00-06:00), while Worker1 continues to run 24/7.

---

## Current Cluster Status

### Sweep Progress
- **Total Chunks:** 1,693
- **Completed:** 64 (3.8%)
- **Running:** 1 (worker1: chunk 14)
- **Pending:** 1,628

### Worker Status (15:18 - Office Hours)
| Worker | Status | Current Load | Restriction | Notes |
|--------|--------|--------------|-------------|-------|
| Worker1 | ✅ Active | Processing chunk 14 | None | 24/7 operation |
| Worker2 | ⏸️ Idle | 0 processes | **19:00-06:00 only** | Waiting for off-hours |

---

## Changes Implemented

### 1. Time-Restricted Worker Configuration

Added to `v9_advanced_coordinator.py`:

```python
WORKERS = {
    'worker1': {
        'host': 'root@10.10.254.106',
        'workspace': '/home/comprehensive_sweep',
        # No restriction - runs 24/7
    },
    'worker2': {
        'host': 'root@10.20.254.100',
        'workspace': '/home/backtest_dual/backtest',
        'ssh_hop': 'root@10.10.254.106',
        'time_restricted': True,   # Enable time control
        'allowed_start_hour': 19,  # 7 PM start
        'allowed_end_hour': 6,     # 6 AM end
    }
}
```

### 2. Time Validation Function

```python
from datetime import datetime


def is_worker_allowed_to_run(worker_name: str) -> bool:
    """Check if worker is allowed to run based on time restrictions"""
    worker = WORKERS[worker_name]

    if not worker.get('time_restricted', False):
        return True

    # Uses the coordinator host's local clock
    current_hour = datetime.now().hour
    start_hour = worker['allowed_start_hour']
    end_hour = worker['allowed_end_hour']

    # Handle overnight range (19:00-06:00)
    if start_hour > end_hour:
        allowed = current_hour >= start_hour or current_hour < end_hour
    else:
        allowed = start_hour <= current_hour < end_hour

    return allowed
```
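
Because the function above reads the clock directly, the overnight wrap-around is awkward to test. A minimal sketch of the same comparison with the hour passed in as an argument; the name `hour_in_window` is hypothetical and not part of `v9_advanced_coordinator.py`:

```python
def hour_in_window(hour: int, start_hour: int, end_hour: int) -> bool:
    """Return True if `hour` lies in [start_hour, end_hour), wrapping past midnight."""
    if start_hour > end_hour:
        # Overnight window such as 19 -> 6: late evening OR early morning
        return hour >= start_hour or hour < end_hour
    # Same-day window such as 9 -> 17
    return start_hour <= hour < end_hour


# Hours of the day in which a 19:00-06:00 window is open
open_hours = [h for h in range(24) if hour_in_window(h, 19, 6)]
print(open_hours)
```

Passing the hour in keeps the boundary cases (hour 19 is inside the window, hour 6 is outside) easy to assert.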

### 3. Coordinator Integration

The worker assignment loop now checks time restrictions before assigning work:

```python
for worker_name in WORKERS:
    # Check if worker is allowed to run
    if not is_worker_allowed_to_run(worker_name):
        if iteration % 10 == 0:  # Log every 10 iterations
            print(f"⏰ {worker_name} not allowed (office hours, noise restriction)")
        continue

    # ... proceed with work assignment ...
```

---

## Issues Resolved

### Stuck Chunk Problem
- **Chunk ID:** v9_advanced_chunk_0014
- **Issue:** Stuck in "running" state since Dec 2, 15:14 (46+ hours)
- **Cause:** Worker2 process failed but database wasn't updated
- **Resolution:** Reset to pending status
- **Current Status:** Reassigned to worker1, processing now

```sql
UPDATE v9_advanced_chunks
SET status='pending', assigned_worker=NULL
WHERE id='v9_advanced_chunk_0014';
```
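
The manual reset above could in principle be automated. The following is only a sketch under assumptions: the `started_at` column, the 24-hour threshold, the helper names, and the demo rows are all hypothetical, since the report shows only the `id`, `status`, and `assigned_worker` columns.

```python
import sqlite3
from datetime import datetime, timedelta


def find_stale_chunks(conn, now, max_age):
    """Return (id, assigned_worker) for chunks in 'running' longer than max_age."""
    # ASSUMPTION: started_at is stored as 'YYYY-MM-DD HH:MM:SS' text
    cutoff = (now - max_age).isoformat(sep=" ", timespec="seconds")
    return conn.execute(
        "SELECT id, assigned_worker FROM v9_advanced_chunks "
        "WHERE status = 'running' AND started_at < ? ORDER BY id",
        (cutoff,),
    ).fetchall()


def reset_to_pending(conn, chunk_ids):
    """Apply the same UPDATE as the manual fix to each given chunk id."""
    conn.executemany(
        "UPDATE v9_advanced_chunks SET status = 'pending', assigned_worker = NULL "
        "WHERE id = ?",
        [(chunk_id,) for chunk_id in chunk_ids],
    )
    conn.commit()


# Demo on an in-memory database with made-up rows (not the real exploration.db)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE v9_advanced_chunks "
    "(id TEXT PRIMARY KEY, status TEXT, assigned_worker TEXT, started_at TEXT)"
)
conn.executemany(
    "INSERT INTO v9_advanced_chunks VALUES (?, ?, ?, ?)",
    [
        ("v9_advanced_chunk_0014", "running", "worker2", "2025-12-02 15:14:00"),
        ("demo_chunk_recent", "running", "worker1", "2025-12-04 14:00:00"),
    ],
)

now = datetime(2025, 12, 4, 15, 18)
stale = find_stale_chunks(conn, now, max_age=timedelta(hours=24))
reset_to_pending(conn, [chunk_id for chunk_id, _ in stale])
print(stale)
```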

---

## Performance Impact

### Operating Hours Comparison

**Before:**
- Worker1: 32 cores × 24h = 768 core-hours/day
- Worker2: 32 cores × 24h = 768 core-hours/day
- **Total:** 1,536 core-hours/day

**After:**
- Worker1: 32 cores × 24h = 768 core-hours/day
- Worker2: 32 cores × 11h = 352 core-hours/day (19:00-06:00)
- **Total:** 1,120 core-hours/day

**Impact:** -27% daily throughput (acceptable for quiet office hours)
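
The arithmetic behind these figures can be reproduced with a few lines. This is an illustrative sketch: the helper name `daily_core_hours` is made up here, and it assumes throughput scales linearly with available core-hours.

```python
CORES_PER_WORKER = 32  # from the comparison above


def daily_core_hours(hours_by_worker: dict) -> int:
    """Sum cores x active hours per day over all workers."""
    return sum(CORES_PER_WORKER * hours for hours in hours_by_worker.values())


before = daily_core_hours({"worker1": 24, "worker2": 24})
after = daily_core_hours({"worker1": 24, "worker2": 11})
change_pct = (after - before) / before * 100

print(f"{before} -> {after} core-hours/day ({change_pct:.1f}%)")
# prints: 1536 -> 1120 core-hours/day (-27.1%)
```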

### Estimated Completion Time
- **Original Estimate:** ~40 days (both workers 24/7)
- **With Restriction:** ~54 days (worker2 time-limited)
- **Delta:** +14 days
- **Trade-off:** Worth it for quiet work environment

---

## Verification Tests

### Time Restriction Logic Test (14:18 CET)
```bash
$ python3 -c "
from datetime import datetime
current_hour = datetime.now().hour
allowed = current_hour >= 19 or current_hour < 6
print(f'Current hour: {current_hour}')
print(f'Worker2 allowed: {allowed}')
"

Current hour: 14
Worker2 allowed: False  ✅ CORRECT
```

### Worker Assignment Check
```bash
$ sqlite3 exploration.db "
SELECT assigned_worker, COUNT(*)
FROM v9_advanced_chunks
WHERE status='running'
GROUP BY assigned_worker;
"

worker1|1  ✅ Only worker1 active during office hours
```

---

## Expected Behavior

### During Office Hours (06:00 - 19:00)
- ✅ Worker1: Processing chunks continuously
- ⏸️ Worker2: Idle (no processes, no noise) once any chunk started before 06:00 has finished
- 📋 Coordinator: Logs "⏰ worker2 not allowed (office hours, noise restriction)" every 10 iterations

### During Off-Hours (19:00 - 06:00)
- ✅ Worker1: Processing chunks continuously
- ✅ Worker2: Processing chunks at full capacity
- 🚀 Both workers: Maximum throughput (64 cores combined)

### Transition Times
- **19:00 (7 PM):** Worker2 becomes active, starts processing pending chunks
- **06:00 (6 AM):** Worker2 receives no new chunks, finishes its current chunk, then stays idle until 19:00
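
The transition times follow directly from the configured hours. A rough sketch of computing the next one from a given moment; `next_transition` is a hypothetical helper, and it uses naive local datetimes, so daylight-saving shifts are ignored:

```python
from datetime import datetime, timedelta


def next_transition(now: datetime, start_hour: int = 19, end_hour: int = 6):
    """Return (time, label) of the next window boundary strictly after `now`."""
    candidates = []
    for day_offset in (0, 1):
        day = (now + timedelta(days=day_offset)).replace(
            minute=0, second=0, microsecond=0
        )
        candidates.append((day.replace(hour=start_hour), "window opens"))
        candidates.append((day.replace(hour=end_hour), "window closes"))
    # Earliest boundary that is still in the future
    return min(c for c in candidates if c[0] > now)


when, label = next_transition(datetime(2025, 12, 4, 15, 18))
print(when, label)
# prints: 2025-12-04 19:00:00 window opens
```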

---

## Monitoring Commands

### Check Current Worker Status
```bash
# Worker1 processes
# The [v] bracket keeps grep (and the remote shell running this pipeline) from counting itself
ssh root@10.10.254.106 "ps aux | grep -c '[v]9_advanced_worker'"

# Worker2 processes (should be 0 during office hours)
ssh root@10.10.254.106 "ssh root@10.20.254.100 'ps aux | grep -c \"[v]9_advanced_worker\"'"
```

### Check Sweep Progress
```bash
cd /home/icke/traderv4/cluster
sqlite3 exploration.db "
SELECT
  status,
  COUNT(*) as chunks,
  ROUND(100.0 * COUNT(*) / 1693, 1) as percent
FROM v9_advanced_chunks
GROUP BY status
ORDER BY status;
"
```
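
The query above hard-codes the total of 1,693 chunks. As a sketch, the same report can derive the total from the table itself, shown here from Python against an in-memory stand-in for `exploration.db`. The reduced demo schema and the status string `'completed'` are assumptions; only `'running'` and `'pending'` appear verbatim in this report.

```python
import sqlite3

# In-memory stand-in for exploration.db, filled with the counts from this report
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE v9_advanced_chunks (id INTEGER PRIMARY KEY, status TEXT)")
counts = {"completed": 64, "running": 1, "pending": 1628}
for status, n in counts.items():
    conn.executemany(
        "INSERT INTO v9_advanced_chunks (status) VALUES (?)", [(status,)] * n
    )

# Percentages are relative to the actual row count, not a fixed constant
rows = conn.execute(
    """
    SELECT status,
           COUNT(*) AS chunks,
           ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM v9_advanced_chunks), 1) AS percent
    FROM v9_advanced_chunks
    GROUP BY status
    ORDER BY status
    """
).fetchall()

for status, chunks, percent in rows:
    print(f"{status:<10} {chunks:>5} {percent:>5}%")
```

Deriving the denominator with a subquery keeps the percentages correct if chunks are ever added to or removed from the sweep.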

### Watch Coordinator Logs
```bash
# Real-time monitoring
tail -f /home/icke/traderv4/cluster/v9_advanced_coordinator.log

# Watch for time restriction messages
tail -f /home/icke/traderv4/cluster/v9_advanced_coordinator.log | grep "⏰"
```

### Check Worker Assignments
```bash
sqlite3 exploration.db "
SELECT
  assigned_worker,
  status,
  COUNT(*) as chunks
FROM v9_advanced_chunks
WHERE assigned_worker IS NOT NULL
GROUP BY assigned_worker, status;
"
```

---

## Manual Overrides (If Needed)

### Temporarily Disable Time Restriction
If an urgent sweep is needed during office hours:

1. Edit `/home/icke/traderv4/cluster/v9_advanced_coordinator.py`
2. Comment out time restriction:
   ```python
   'worker2': {
       # 'time_restricted': True,  # DISABLED FOR URGENT SWEEP
       'allowed_start_hour': 19,
       'allowed_end_hour': 6,
   }
   ```
3. Restart coordinator: `pkill -f v9_advanced_coordinator.py && nohup python3 -u v9_advanced_coordinator.py >> v9_advanced_coordinator.log 2>&1 &`

### Adjust Operating Hours
To change worker2 hours (e.g., 20:00-05:00):

```python
'worker2': {
    'time_restricted': True,
    'allowed_start_hour': 20,  # 8 PM
    'allowed_end_hour': 5,     # 5 AM
}
```

Then restart coordinator.
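
Before changing the hours, it can help to check how long the new window is and what it costs in Worker2 capacity. A small sketch; `window_hours` is a hypothetical helper, and the 32-core figure is taken from the performance section:

```python
def window_hours(start_hour: int, end_hour: int) -> int:
    """Length of the allowed window in hours, wrapping past midnight."""
    return (end_hour - start_hour) % 24


CORES = 32  # Worker2 core count from the performance section

for start, end in [(19, 6), (20, 5)]:
    hours = window_hours(start, end)
    print(f"{start:02d}:00-{end:02d}:00 -> {hours}h, {CORES * hours} core-hours/night")
# prints:
# 19:00-06:00 -> 11h, 352 core-hours/night
# 20:00-05:00 -> 9h, 288 core-hours/night
```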

---

## Git Commits

Changes committed and pushed to repository:

1. **f40fd66** - `feat: Add time-restricted scheduling for worker2`
   - Worker configuration with time restrictions
   - Coordinator loop integration

2. **f2f2992** - `fix: Add is_worker_allowed_to_run function definition`
   - Time validation function implementation

3. **0babd1e** - `docs: Add worker2 time restriction documentation`
   - Complete guide (WORKER2_TIME_RESTRICTION.md)

---

## Next Steps

### Tonight (19:00 - After Hours)
- Worker2 will automatically activate at 19:00
- Monitor first few chunks to ensure smooth operation
- Check logs: `tail -f v9_advanced_coordinator.log | grep worker2`

### Tomorrow Morning (06:00)
- Verify worker2 received no new chunks after 06:00 and went idle once its last chunk finished
- Check overnight progress: How many chunks completed?
- Confirm office remains quiet during work hours

### Weekly Monitoring
- Track worker2 contribution rate (chunks/night)
- Compare worker1 24/7 vs worker2 11h/day productivity
- Adjust hours if needed based on office schedule

---

## Support

**Files Modified:**
- `/home/icke/traderv4/cluster/v9_advanced_coordinator.py` - Time restriction logic
- `/home/icke/traderv4/cluster/exploration.db` - Reset stuck chunk 14

**Documentation:**
- `/home/icke/traderv4/cluster/WORKER2_TIME_RESTRICTION.md` - Complete guide
- `/home/icke/traderv4/cluster/STATUS_REPORT_DEC4_2025.md` - This report

**Key Personnel:**
- Implementation: AI Agent (Dec 4, 2025)
- Requirement: User (noise constraint 06:00-19:00)

---

## Conclusion

✅ **Worker2 time restriction successfully implemented**
✅ **Stuck chunk 14 resolved**
✅ **Worker1 continues 24/7 processing**
⏸️ **Worker2 waiting for 19:00 to start**
📊 **Sweep progress: 3.8% (64/1693 chunks)**

The system is operating as expected. Worker2 will automatically activate tonight at 19:00 and process chunks until 06:00 tomorrow morning. Office hours stay quiet while worker1 keeps the sweep moving.