docs: Add worker2 time restriction documentation

- Complete guide for noise constraint management
- Time-based scheduling logic explained
- Performance impact analysis (27% reduction)
- Monitoring commands and troubleshooting
- Fixed stuck chunk 14 documentation

271
cluster/WORKER2_TIME_RESTRICTION.md
Normal file
@@ -0,0 +1,271 @@
# Worker2 Time Restriction - Noise Constraint Management

**Date:** December 4, 2025
**Issue:** Node 2 (bd-host01) generates excessive noise during office hours
**Solution:** Time-restricted scheduling (19:00 - 06:00 only)

---

## Problem

Worker2 (bd-host01 / 10.20.254.100) is an EPYC 16-core server that generates significant noise when running parameter sweeps at full load. This is disruptive during office hours (06:00 - 19:00).

---

## Solution Implemented

### Time-Based Worker Scheduling

**Configuration in `v9_advanced_coordinator.py`:**

```python
WORKERS = {
    'worker1': {
        'host': 'root@10.10.254.106',
        'workspace': '/home/comprehensive_sweep',
        # No time restriction - runs 24/7
    },
    'worker2': {
        'host': 'root@10.20.254.100',
        'workspace': '/home/backtest_dual/backtest',
        'ssh_hop': 'root@10.10.254.106',
        'time_restricted': True,   # Enable time-based control
        'allowed_start_hour': 19,  # 7 PM
        'allowed_end_hour': 6,     # 6 AM
    }
}
```

### Logic Implementation

```python
from datetime import datetime

def is_worker_allowed_to_run(worker_name: str) -> bool:
    """Check if worker is allowed to run based on time restrictions"""
    worker = WORKERS[worker_name]

    # If no time restriction, always allowed
    if not worker.get('time_restricted', False):
        return True

    # Check current hour (local time)
    current_hour = datetime.now().hour
    start_hour = worker['allowed_start_hour']
    end_hour = worker['allowed_end_hour']

    # Handle time range that crosses midnight (e.g., 19:00 - 06:00)
    if start_hour > end_hour:
        allowed = current_hour >= start_hour or current_hour < end_hour
    else:
        # Same-day range; the end hour itself is excluded
        allowed = start_hour <= current_hour < end_hour

    return allowed
```
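
Because `is_worker_allowed_to_run` reads the clock directly, its boundary behavior is awkward to check by hand. The sketch below restates the same comparison as a pure function that takes the hour as an argument. `hour_in_window` is a hypothetical helper name used only for illustration, not something the coordinator is known to contain, and it assumes whole-hour boundaries with an exclusive end hour, as in the logic above.

```python
def hour_in_window(hour: int, start_hour: int, end_hour: int) -> bool:
    """Return True if `hour` (0-23) falls in [start_hour, end_hour)."""
    if start_hour > end_hour:
        # Window wraps past midnight, e.g. 19 -> 6
        return hour >= start_hour or hour < end_hour
    # Same-day window, e.g. 9 -> 17
    return start_hour <= hour < end_hour


# Hours of the day in which worker2 may run with the 19 -> 6 window
allowed_hours = [h for h in range(24) if hour_in_window(h, 19, 6)]
print(allowed_hours)  # [0, 1, 2, 3, 4, 5, 19, 20, 21, 22, 23]
```

Note that under this logic equal start and end hours produce an empty window rather than a 24-hour one, so removing the restriction is better done via `time_restricted`.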

### Coordinator Integration

The coordinator now checks time restrictions before assigning work:

```python
# Assign work to idle workers
for worker_name in WORKERS.keys():
    # Check if worker is allowed to run (time restrictions)
    if not is_worker_allowed_to_run(worker_name):
        if iteration % 10 == 0:  # Log every 10 iterations to avoid spam
            print(f"⏰ {worker_name} not allowed (office hours, noise restriction)")
        continue

    # ... continue with worker assignment ...
```
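
When the coordinator skips a worker, it can be useful to know when the window reopens, for example to include it in the log line. The following is a sketch only: `next_window_start` is a hypothetical helper, not part of the coordinator, and it assumes naive local time and whole-hour boundaries (daylight-saving shifts are ignored).

```python
from datetime import datetime, timedelta


def next_window_start(now: datetime, start_hour: int, end_hour: int) -> datetime:
    """Return `now` if inside the window, else the next time the window opens."""
    hour = now.hour
    if start_hour > end_hour:
        inside = hour >= start_hour or hour < end_hour   # wraps past midnight
    else:
        inside = start_hour <= hour < end_hour
    if inside:
        return now
    opening = now.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    if opening <= now:
        # Today's opening has already passed; use tomorrow's
        opening += timedelta(days=1)
    return opening


now = datetime(2025, 12, 4, 14, 11)
wait = next_window_start(now, 19, 6) - now
print(wait)  # 4:49:00
```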

---

## Operating Hours

| Worker | Hours | Status | Reason |
|--------|-------|--------|--------|
| **Worker1** | 24/7 | Always active | No noise constraint |
| **Worker2** | 19:00 - 06:00 | Time-restricted | Noise during office hours |

**Worker2 Schedule:**
- **ACTIVE:** 7:00 PM - 6:00 AM (11 hours/day)
- **IDLE:** 6:00 AM - 7:00 PM (13 hours/day)
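
The 11/13 hour split follows from the configured hours with modular arithmetic, which also handles the wrap past midnight. A minimal sketch, where `active_hours_per_day` is a hypothetical helper name used for illustration:

```python
def active_hours_per_day(start_hour: int, end_hour: int) -> int:
    """Length of the allowed window in whole hours, wrapping past midnight."""
    return (end_hour - start_hour) % 24


active = active_hours_per_day(19, 6)
idle = 24 - active
print(active, idle)  # 11 13
```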

---

## Impact on Sweep Performance

### Before Time Restriction
- **Worker1:** 32 cores, 24/7 = 768 core-hours/day
- **Worker2:** 32 cores, 24/7 = 768 core-hours/day
- **Total:** 1,536 core-hours/day

### After Time Restriction
- **Worker1:** 32 cores, 24/7 = 768 core-hours/day
- **Worker2:** 32 cores, 11h/day = 352 core-hours/day
- **Total:** 1,120 core-hours/day

**Performance Impact:** ~27% reduction in daily throughput (worker2 drops to 45.8% of its previous contribution, a 54.2% cut)
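
The core-hour figures can be reproduced with a few lines of arithmetic. This sketch uses only the numbers listed above (32 cores per worker, 11 active hours for worker2); the variable and function names are illustrative.

```python
CORES_PER_WORKER = 32       # per-worker core count as listed above
WORKER2_ACTIVE_HOURS = 11   # 19:00 - 06:00


def daily_core_hours(cores: int, hours_per_day: int) -> int:
    return cores * hours_per_day


worker1 = daily_core_hours(CORES_PER_WORKER, 24)
worker2_before = daily_core_hours(CORES_PER_WORKER, 24)
worker2_after = daily_core_hours(CORES_PER_WORKER, WORKER2_ACTIVE_HOURS)

before = worker1 + worker2_before
after = worker1 + worker2_after
reduction = 1 - after / before                # overall throughput loss
worker2_share = worker2_after / worker2_before  # what worker2 retains

print(before, after)           # 1536 1120
print(f"{reduction:.1%}")      # 27.1%
print(f"{worker2_share:.1%}")  # 45.8%
```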

### Sweep Progress Impact
- **Chunks completed:** 63 / 1,693 (3.7%)
- **Chunks pending:** 1,629
- **Estimated completion time:**
  - Old: ~40 days (both workers 24/7)
  - New: ~54 days (worker2 time-restricted)
- **Delta:** +14 days

**Acceptable trade-off:** Quiet office hours outweigh the longer sweep time (+14 days)
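
As a rough cross-check of the estimate, the sketch below assumes completion time scales inversely with daily core-hours, which ignores scheduling overhead and uneven chunk sizes. It lands at roughly 55 days, in line with the ~54 days quoted given that the 40-day baseline is itself approximate.

```python
old_days = 40                 # estimate with both workers 24/7
core_hours_before = 1536      # daily core-hours before the restriction
core_hours_after = 1120       # daily core-hours after the restriction

# Same total work, lower daily throughput -> proportionally more days
new_days = old_days * core_hours_before / core_hours_after
delta = new_days - old_days

print(f"{new_days:.1f}")  # 54.9
print(f"{delta:.1f}")     # 14.9
```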

---

## Verification

### Test Current Time Restriction (Dec 4, 14:11)
```bash
cd /home/icke/traderv4/cluster
python3 -c "
from datetime import datetime
current_hour = datetime.now().hour
allowed = current_hour >= 19 or current_hour < 6
print(f'Current hour: {current_hour}')
print(f'Worker2 allowed: {allowed}')
"
```

**Output:**
```
Current hour: 14
Worker2 allowed: False
```

✅ Correct (office hours)

### Monitor Coordinator Logs
```bash
cd /home/icke/traderv4/cluster
tail -f v9_advanced_coordinator.log | grep "⏰"
```

**Expected output during office hours:**
```
⏰ worker2 not allowed (office hours, noise restriction)
```

---

## Fixed Issues

### Stuck Chunk Problem (Dec 2 - Dec 4)

**Issue:** Chunk 14 was assigned to worker2 on Dec 2 at 15:14 and never completed
- Database showed: `status='running'`
- Reality: No processes running on worker2
- Impact: Blocked new work assignment to worker2 for 46+ hours

**Resolution:**
```sql
UPDATE v9_advanced_chunks
SET status='pending', assigned_worker=NULL
WHERE id='v9_advanced_chunk_0014';
```

Chunk 14 is now available for reassignment during worker2's active hours (19:00-06:00).
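
The same reset can be scripted with Python's built-in `sqlite3` module. This is a sketch under assumptions: `reset_chunk` is a hypothetical helper, the demo table contains only the three columns that appear in this document (the real `v9_advanced_chunks` table likely has more), and the extra `status='running'` guard is an addition so that a completed chunk is never reset by mistake.

```python
import sqlite3


def reset_chunk(conn: sqlite3.Connection, chunk_id: str) -> int:
    """Return a stuck chunk to the pending pool; returns rows changed."""
    cur = conn.execute(
        "UPDATE v9_advanced_chunks "
        "SET status='pending', assigned_worker=NULL "
        "WHERE id=? AND status='running'",
        (chunk_id,),
    )
    conn.commit()
    return cur.rowcount


# Demo against an in-memory database (schema reduced for illustration)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE v9_advanced_chunks "
    "(id TEXT PRIMARY KEY, status TEXT, assigned_worker TEXT)"
)
conn.execute(
    "INSERT INTO v9_advanced_chunks VALUES "
    "('v9_advanced_chunk_0014', 'running', 'worker2')"
)

changed = reset_chunk(conn, "v9_advanced_chunk_0014")
row = conn.execute(
    "SELECT status, assigned_worker FROM v9_advanced_chunks"
).fetchone()
print(changed, row)  # 1 ('pending', None)
```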

---

## Manual Overrides

### Temporarily Disable Time Restriction
If needed for urgent sweeps, modify the coordinator configuration:

```python
# In WORKERS['worker2'], comment out time restriction:
'worker2': {
    # 'time_restricted': True,  # TEMPORARILY DISABLED
    'allowed_start_hour': 19,
    'allowed_end_hour': 6,
}
```

Then restart the coordinator.

### Adjust Operating Hours
To change the allowed hours (e.g., narrow the window to 8 PM - 5 AM):

```python
'worker2': {
    'time_restricted': True,
    'allowed_start_hour': 20,  # 8 PM
    'allowed_end_hour': 5,     # 5 AM
}
```

---

## Monitoring Commands

### Check Worker2 Status
```bash
# Check if worker2 has active processes
ssh root@10.10.254.106 "ssh root@10.20.254.100 'ps aux | grep v9_advanced_worker | grep -v grep | wc -l'"

# Check worker2 assignments in database
cd /home/icke/traderv4/cluster
sqlite3 exploration.db "SELECT COUNT(*) FROM v9_advanced_chunks WHERE assigned_worker='worker2' AND status='running';"
```

### Check Time Restriction Status
```bash
cd /home/icke/traderv4/cluster
sqlite3 exploration.db "
SELECT
  assigned_worker,
  COUNT(*) as chunks,
  SUM(CASE WHEN status='completed' THEN 1 ELSE 0 END) as completed,
  SUM(CASE WHEN status='running' THEN 1 ELSE 0 END) as running
FROM v9_advanced_chunks
WHERE assigned_worker IS NOT NULL
GROUP BY assigned_worker;
"
```
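
To see what the per-worker summary query returns, it can be run from Python against a throwaway database. The rows below are made-up sample data for illustration only, the schema is reduced to the columns used in this document, and the `ORDER BY` is an addition to make the output order deterministic.

```python
import sqlite3

SUMMARY_SQL = """
SELECT
  assigned_worker,
  COUNT(*) AS chunks,
  SUM(CASE WHEN status='completed' THEN 1 ELSE 0 END) AS completed,
  SUM(CASE WHEN status='running' THEN 1 ELSE 0 END) AS running
FROM v9_advanced_chunks
WHERE assigned_worker IS NOT NULL
GROUP BY assigned_worker
ORDER BY assigned_worker
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE v9_advanced_chunks "
    "(id TEXT PRIMARY KEY, status TEXT, assigned_worker TEXT)"
)
# Sample rows (not real sweep data)
conn.executemany(
    "INSERT INTO v9_advanced_chunks VALUES (?, ?, ?)",
    [
        ("c1", "completed", "worker1"),
        ("c2", "running", "worker1"),
        ("c3", "completed", "worker2"),
        ("c4", "pending", None),   # unassigned chunks are excluded
    ],
)

summary = conn.execute(SUMMARY_SQL).fetchall()
print(summary)  # [('worker1', 2, 1, 1), ('worker2', 1, 1, 0)]
```

During office hours, a non-zero `running` count for worker2 combined with no worker processes on the host matches the stuck-chunk symptom described under Fixed Issues.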

---

## Expected Behavior

### During Office Hours (06:00 - 19:00)
- Worker1: ✅ Processing chunks
- Worker2: ⏸️ Idle (time restriction active)
- Coordinator logs: "⏰ worker2 not allowed (office hours, noise restriction)"

### During Off Hours (19:00 - 06:00)
- Worker1: ✅ Processing chunks
- Worker2: ✅ Processing chunks (if available)
- Both workers: Full 32-core utilization

---

## Files Modified

- `cluster/v9_advanced_coordinator.py` - Added time restriction logic
- `cluster/exploration.db` - Reset stuck chunk 14
- `cluster/WORKER2_TIME_RESTRICTION.md` - This documentation

---

## Future Improvements

1. **Dynamic hour adjustment** via environment variables
2. **Holiday/weekend override** (allow 24/7 on non-work days)
3. **Load-based throttling** (reduce cores instead of full stop)
4. **SMS alerts** when worker2 transitions active/idle
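
As an illustration of the first item, the allowed hours could be read from the environment with the current values as defaults. This is a sketch only: the variable names `WORKER2_START_HOUR` and `WORKER2_END_HOUR` and the helper `window_from_env` are hypothetical and do not exist in the coordinator today.

```python
import os
from typing import Mapping, Optional, Tuple


def window_from_env(
    env: Optional[Mapping[str, str]] = None,
    default_start: int = 19,
    default_end: int = 6,
) -> Tuple[int, int]:
    """Read the allowed window from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    start = int(env.get("WORKER2_START_HOUR", default_start))
    end = int(env.get("WORKER2_END_HOUR", default_end))
    for name, value in (("start", start), ("end", end)):
        if not 0 <= value <= 23:
            raise ValueError(f"{name} hour must be in 0-23, got {value}")
    return start, end


# Passing a dict instead of os.environ keeps the example deterministic
print(window_from_env({}))  # (19, 6)
print(window_from_env({"WORKER2_START_HOUR": "20", "WORKER2_END_HOUR": "5"}))  # (20, 5)
```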

---

## Contact

For adjustments to worker2 operating hours or noise constraint issues, update the configuration in `v9_advanced_coordinator.py` and restart the coordinator.

**Current Status (Dec 4, 2025):**
- ✅ Time restriction implemented
- ✅ Stuck chunk 14 resolved
- ✅ Worker1 processing continuously
- ⏸️ Worker2 waiting for 19:00 (off-hours start)