docs: Add worker2 time restriction documentation

- Complete guide for noise constraint management
- Time-based scheduling logic explained
- Performance impact analysis (27% reduction)
- Monitoring commands and troubleshooting
- Fixed stuck chunk 14 documentation

cluster/WORKER2_TIME_RESTRICTION.md (new file, 271 lines)

# Worker2 Time Restriction - Noise Constraint Management

**Date:** December 4, 2025
**Issue:** Node 2 (bd-host01) generates excessive noise during office hours
**Solution:** Time-restricted scheduling (19:00 - 06:00 only)

---

## Problem

Worker2 (bd-host01 / 10.20.254.100) is an EPYC 16-core server that generates significant noise when running parameter sweeps at full load. This is disruptive during office hours (06:00 - 19:00).

---

## Solution Implemented

### Time-Based Worker Scheduling

**Configuration in `v9_advanced_coordinator.py`:**

```python
WORKERS = {
    'worker1': {
        'host': 'root@10.10.254.106',
        'workspace': '/home/comprehensive_sweep',
        # No time restriction - runs 24/7
    },
    'worker2': {
        'host': 'root@10.20.254.100',
        'workspace': '/home/backtest_dual/backtest',
        'ssh_hop': 'root@10.10.254.106',
        'time_restricted': True,     # Enable time-based control
        'allowed_start_hour': 19,    # 7 PM
        'allowed_end_hour': 6,       # 6 AM
    }
}
```

### Logic Implementation

```python
from datetime import datetime


def is_worker_allowed_to_run(worker_name: str) -> bool:
    """Check if worker is allowed to run based on time restrictions"""
    worker = WORKERS[worker_name]

    # If no time restriction, always allowed
    if not worker.get('time_restricted', False):
        return True

    # Check current hour (local time)
    current_hour = datetime.now().hour
    start_hour = worker['allowed_start_hour']
    end_hour = worker['allowed_end_hour']

    # Handle time range that crosses midnight (e.g., 19:00 - 06:00)
    if start_hour > end_hour:
        allowed = current_hour >= start_hour or current_hour < end_hour
    else:
        allowed = start_hour <= current_hour < end_hour

    return allowed
```
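
Because the function above reads the clock directly, its boundary hours are awkward to test. The following is a minimal sketch of a testable variant. It assumes a hypothetical helper, `is_hour_allowed`, which is not part of `v9_advanced_coordinator.py`. It applies the same comparison but takes the hour as an argument.

```python
def is_hour_allowed(current_hour: int, start_hour: int, end_hour: int) -> bool:
    """Return True if current_hour lies in the window [start_hour, end_hour).

    Hypothetical helper for illustration only; it mirrors the comparison in
    is_worker_allowed_to_run but takes the hour as a parameter.
    """
    if start_hour > end_hour:
        # Window wraps past midnight, e.g. 19 -> 6
        return current_hour >= start_hour or current_hour < end_hour
    return start_hour <= current_hour < end_hour


# Boundary checks for worker2's 19:00 - 06:00 window
for hour in (18, 19, 23, 0, 5, 6, 14):
    print(f"{hour:02d}:00 -> {is_hour_allowed(hour, 19, 6)}")
```

The start hour is inclusive and the end hour is exclusive, so worker2 becomes eligible at 19:00 and stops receiving new work at 06:00. With this comparison, setting both hours to the same value produces an empty window, not a 24-hour one.
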

### Coordinator Integration

The coordinator now checks time restrictions before assigning work:

```python
# Assign work to idle workers
for worker_name in WORKERS.keys():
    # Check if worker is allowed to run (time restrictions)
    if not is_worker_allowed_to_run(worker_name):
        if iteration % 10 == 0:  # Log every 10 iterations to avoid spam
            print(f"⏰ {worker_name} not allowed (office hours, noise restriction)")
        continue

    # ... continue with worker assignment ...
```

---

## Operating Hours

| Worker | Hours | Status | Reason |
|--------|-------|--------|--------|
| **Worker1** | 24/7 | Always active | No noise constraint |
| **Worker2** | 19:00 - 06:00 | Time-restricted | Noise during office hours |

**Worker2 Schedule:**
- **ACTIVE:** 7:00 PM - 6:00 AM (11 hours/day)
- **IDLE:** 6:00 AM - 7:00 PM (13 hours/day)
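
The 11-hour and 13-hour figures follow from the configured start and end hours. As a small sketch, with `active_hours` being a hypothetical helper name, the length of a window that may wrap past midnight can be computed with modular arithmetic:

```python
def active_hours(start_hour: int, end_hour: int) -> int:
    """Hours per day covered by [start_hour, end_hour); the window may wrap past midnight."""
    return (end_hour - start_hour) % 24


worker2_active = active_hours(19, 6)
print(f"active: {worker2_active} h/day, idle: {24 - worker2_active} h/day")
```
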

---

## Impact on Sweep Performance

### Before Time Restriction
- **Worker1:** 32 cores, 24/7 = 768 core-hours/day
- **Worker2:** 32 cores, 24/7 = 768 core-hours/day
- **Total:** 1,536 core-hours/day

### After Time Restriction
- **Worker1:** 32 cores, 24/7 = 768 core-hours/day
- **Worker2:** 32 cores, 11h/day = 352 core-hours/day
- **Total:** 1,120 core-hours/day

**Performance Impact:** ~27% reduction in daily throughput (worker2 retains 45.8% of its previous capacity, a 54.2% reduction)
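
The core-hour arithmetic can be reproduced with a few lines. This sketch uses only the figures quoted in this section (32 cores per worker, 11 active hours for worker2); the variable names are illustrative.

```python
CORES_PER_WORKER = 32        # per the figures in this section
WORKER2_ACTIVE_HOURS = 11    # 19:00 - 06:00

# Daily capacity in core-hours before and after the restriction
before = 2 * CORES_PER_WORKER * 24
after = CORES_PER_WORKER * 24 + CORES_PER_WORKER * WORKER2_ACTIVE_HOURS

total_reduction = 1 - after / before
worker2_retained = WORKER2_ACTIVE_HOURS / 24

print(f"before={before}, after={after} core-hours/day")
print(f"total reduction: {total_reduction:.1%}")
print(f"worker2 capacity retained: {worker2_retained:.1%}")
```
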

### Sweep Progress Impact
- **Chunks completed:** 63 / 1,693 (3.7%)
- **Chunks pending:** 1,629
- **Estimated completion time:**
  - Old: ~40 days (both workers 24/7)
  - New: ~54 days (worker2 time-restricted)
- **Delta:** +14 days

**Acceptable trade-off:** Quiet office hours are worth a sweep that takes roughly 14 days longer.
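
The new estimate can be sanity-checked by scaling the old one by the ratio of daily capacities. This sketch assumes that completion time is inversely proportional to available core-hours, which is a simplification, and uses the rounded 40-day baseline. `scaled_eta_days` is a hypothetical helper name.

```python
def scaled_eta_days(baseline_days: float, baseline_capacity: float, new_capacity: float) -> float:
    """Scale an ETA, assuming run time is inversely proportional to capacity."""
    return baseline_days * baseline_capacity / new_capacity


# 1,536 core-hours/day before the restriction, 1,120 after
new_eta = scaled_eta_days(40, 1536, 1120)
print(f"{new_eta:.1f} days")
```

Under these assumptions the result is a little under 55 days, in line with the ~54-day estimate above given that the 40-day baseline is itself approximate.
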

---

## Verification

### Test Current Time Restriction (Dec 4, 14:11)
```bash
cd /home/icke/traderv4/cluster
python3 -c "
from datetime import datetime
current_hour = datetime.now().hour
allowed = current_hour >= 19 or current_hour < 6
print(f'Current hour: {current_hour}')
print(f'Worker2 allowed: {allowed}')
"
```

**Output:**
```
Current hour: 14
Worker2 allowed: False ✅ Correct (office hours)
```

### Monitor Coordinator Logs
```bash
cd /home/icke/traderv4/cluster
tail -f v9_advanced_coordinator.log | grep "⏰"
```

**Expected output during office hours:**
```
⏰ worker2 not allowed (office hours, noise restriction)
```

---

## Fixed Issues

### Stuck Chunk Problem (Dec 2 - Dec 4)

**Issue:** Chunk 14 was assigned to worker2 on Dec 2 at 15:14 and never completed.
- Database showed: `status='running'`
- Reality: No processes running on worker2
- Impact: Blocked new work assignment to worker2 for 46+ hours

**Resolution:**
```sql
UPDATE v9_advanced_chunks
SET status='pending', assigned_worker=NULL
WHERE id='v9_advanced_chunk_0014';
```

Chunk 14 is now available for reassignment during worker2's active hours (19:00-06:00).
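
If the same reset is needed again, it could be wrapped in a small helper. This is a sketch only: `reset_chunk` is a hypothetical function, the demo table contains just the three columns that appear in the queries in this document (the real schema may have more), and the extra `status='running'` guard is an added assumption so that completed chunks are not touched.

```python
import sqlite3


def reset_chunk(conn: sqlite3.Connection, chunk_id: str) -> int:
    """Return a 'running' chunk to 'pending'; returns the number of rows changed."""
    cur = conn.execute(
        "UPDATE v9_advanced_chunks "
        "SET status='pending', assigned_worker=NULL "
        "WHERE id=? AND status='running'",  # guard is an assumption, not in the original fix
        (chunk_id,),
    )
    conn.commit()
    return cur.rowcount


# Demo against an in-memory database with an assumed three-column schema
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE v9_advanced_chunks (id TEXT PRIMARY KEY, status TEXT, assigned_worker TEXT)"
)
conn.execute(
    "INSERT INTO v9_advanced_chunks VALUES ('v9_advanced_chunk_0014', 'running', 'worker2')"
)
changed = reset_chunk(conn, "v9_advanced_chunk_0014")
print(f"rows reset: {changed}")
```

Against the real database, the connection would instead be opened on `exploration.db`, after first confirming that no worker process is still running the chunk.
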

---

## Manual Overrides

### Temporarily Disable Time Restriction
If needed for urgent sweeps, modify the coordinator configuration:

```python
# In WORKERS['worker2'], comment out time restriction:
'worker2': {
    # 'time_restricted': True,  # TEMPORARILY DISABLED
    'allowed_start_hour': 19,
    'allowed_end_hour': 6,
}
```

Then restart the coordinator.

### Adjust Operating Hours
To change the allowed hours (e.g., narrow the window to 8 PM - 5 AM):

```python
'worker2': {
    'time_restricted': True,
    'allowed_start_hour': 20,  # 8 PM
    'allowed_end_hour': 5,     # 5 AM
}
```

---

## Monitoring Commands

### Check Worker2 Status
```bash
# Check if worker2 has active processes
ssh root@10.10.254.106 "ssh root@10.20.254.100 'ps aux | grep v9_advanced_worker | grep -v grep | wc -l'"

# Check worker2 assignments in database
cd /home/icke/traderv4/cluster
sqlite3 exploration.db "SELECT COUNT(*) FROM v9_advanced_chunks WHERE assigned_worker='worker2' AND status='running';"
```

### Check Chunk Status per Worker
```bash
cd /home/icke/traderv4/cluster
sqlite3 exploration.db "
SELECT
    assigned_worker,
    COUNT(*) as chunks,
    SUM(CASE WHEN status='completed' THEN 1 ELSE 0 END) as completed,
    SUM(CASE WHEN status='running' THEN 1 ELSE 0 END) as running
FROM v9_advanced_chunks
WHERE assigned_worker IS NOT NULL
GROUP BY assigned_worker;
"
```

---

## Expected Behavior

### During Office Hours (06:00 - 19:00)
- Worker1: ✅ Processing chunks
- Worker2: ⏸️ Idle (time restriction active)
- Coordinator logs: "⏰ worker2 not allowed (office hours, noise restriction)"

### During Off Hours (19:00 - 06:00)
- Worker1: ✅ Processing chunks
- Worker2: ✅ Processing chunks (if available)
- Both workers: Full 32-core utilization

---

## Files Modified

- `cluster/v9_advanced_coordinator.py` - Added time restriction logic
- `cluster/exploration.db` - Reset stuck chunk 14
- `cluster/WORKER2_TIME_RESTRICTION.md` - This documentation

---

## Future Improvements

1. **Dynamic hour adjustment** via environment variables
2. **Holiday/weekend override** (allow 24/7 on non-work days)
3. **Load-based throttling** (reduce cores instead of full stop)
4. **SMS alerts** when worker2 transitions active/idle
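
The first two items could look roughly like the following. This is an exploratory sketch, not existing coordinator code: the environment variable names `WORKER2_START_HOUR` and `WORKER2_END_HOUR`, and the functions `load_window` and `is_allowed`, are all hypothetical. Holidays are not handled; only Saturday and Sunday are treated as non-work days.

```python
import os
from datetime import datetime


def load_window(env=os.environ, default_start: int = 19, default_end: int = 6) -> tuple[int, int]:
    """Read the allowed window from (hypothetical) environment variables."""
    start = int(env.get("WORKER2_START_HOUR", default_start))
    end = int(env.get("WORKER2_END_HOUR", default_end))
    if not (0 <= start <= 23 and 0 <= end <= 23):
        raise ValueError("hours must be in 0..23")
    return start, end


def is_allowed(now: datetime, start: int, end: int, weekends_unrestricted: bool = True) -> bool:
    """Time-window check with an optional weekend override."""
    if weekends_unrestricted and now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    if start > end:
        # Window wraps past midnight
        return now.hour >= start or now.hour < end
    return start <= now.hour < end


start, end = load_window()
print(is_allowed(datetime(2025, 12, 4, 14, 11), start, end))  # a Thursday afternoon
print(is_allowed(datetime(2025, 12, 6, 14, 11), start, end))  # a Saturday afternoon
```

Passing the current time and the environment as arguments keeps both functions easy to test without changing the system clock or the process environment.
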

---

## Contact

For adjustments to worker2 operating hours or noise constraint issues, update the configuration in `v9_advanced_coordinator.py` and restart the coordinator.

**Current Status (Dec 4, 2025):**
- ✅ Time restriction implemented
- ✅ Stuck chunk 14 resolved
- ✅ Worker1 processing continuously
- ⏸️ Worker2 waiting for 19:00 (off-hours start)