docs: Add Common Pitfall #65 - distributed worker quality_filter bug
.github/copilot-instructions.md
@@ -4807,6 +4807,38 @@ trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
* Monitor coordinator logs for timeout patterns, increase if needed
* Consider SSH multiplexing (ControlMaster) to speed up nested hops

65. **Distributed Worker Quality Filter - Dict vs Callable (CRITICAL - Fixed Dec 1, 2025):**
- **Symptom:** ALL 2,096 distributed backtests returned 0 trades (expected 500-600 each)
- **Error Message:** `Error testing config X: 'dict' object is not callable` repeated 2,096 times in worker logs
- **Root Cause:** `cluster/distributed_worker.py` lines 67-70 passed a dict `{'min_adx': 15, 'min_volume_ratio': vol_min}` to `simulate_money_line()` instead of a callable (lambda)
- **Technical Detail:** `backtester/simulator.py:118` calls `quality_filter(signal)`, so it expects a CALLABLE, not a configuration dict
- **Silent Failure Pattern:** The worker's try/except caught the exception and returned default values (0 trades, 0 P&L) instead of crashing
- **Discovery:** Found only when inspecting result contents: the CSVs looked valid, but every result had trades=0
- **Fix (Commit 11a0ea3):**
```python
# BEFORE (BROKEN): a dict is not callable, so quality_filter(signal) raises TypeError
quality_filter = {'min_adx': 15, 'min_volume_ratio': vol_min}

# AFTER (FIXED): pass a callable, or None to disable filtering
if vol_min > 0:
    quality_filter = lambda s: s.adx >= 15 and s.volume_ratio >= vol_min
else:
    quality_filter = None
```
- **Pattern Source:** Matches the working `comprehensive_sweep.py:106` implementation
- **Why It Broke:** The developer treated `quality_filter` as a configuration dict, but it is a callback invoked on every signal
- **Verification:** Workers at 100% CPU, no errors after 2+ minutes, actively computing valid backtests
- **Lessons Learned:**
1. **Type mismatches (dict vs callable) can cause catastrophic silent failures** - Python's dynamic typing defers the error until runtime
2. **Always validate result quality** - All zeros = red flag requiring immediate investigation
3. **Compare to working code when debugging** - `comprehensive_sweep.py` had the correct pattern
4. **Silent failures are more dangerous than crashes** - The exception handler hid the severity by returning zeros instead of crashing
5. **Test a single case before running the full sweep** - Would have caught the error in 30 seconds instead of 22 minutes
6. **Add result assertions** - The system should detect all-zero results and abort early
- **Documentation:** `cluster/CRITICAL_BUG_FIX_DEC1_2025.md`
- **Impact:** Blocked 45 minutes of distributed work, wasted cluster compute, caught before Stage 1 analysis
- **Files changed:** `cluster/distributed_worker.py` lines 67-77
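
The lessons above (validate types up front, assert on result quality) could be sketched roughly as follows. This is a minimal illustration, not the code in `cluster/distributed_worker.py`: the `Signal` dataclass and the helpers `make_quality_filter`, `validate_quality_filter`, and `assert_not_all_zero` are hypothetical names, and the `min_results` threshold is an assumed value.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Signal:
    """Hypothetical stand-in for the simulator's signal object."""
    adx: float
    volume_ratio: float


# A quality filter is either a predicate over a signal or None (no filtering).
QualityFilter = Optional[Callable[[Signal], bool]]


def make_quality_filter(vol_min: float, min_adx: float = 15) -> QualityFilter:
    """Build the filter as a callable (or None), mirroring the fix above."""
    if vol_min > 0:
        return lambda s: s.adx >= min_adx and s.volume_ratio >= vol_min
    return None


def validate_quality_filter(quality_filter: object) -> None:
    """Fail fast if the filter is neither None nor callable (e.g. a dict)."""
    if quality_filter is not None and not callable(quality_filter):
        raise TypeError(
            "quality_filter must be callable or None, "
            f"got {type(quality_filter).__name__}"
        )


def assert_not_all_zero(trade_counts: Iterable[int], min_results: int = 5) -> None:
    """Abort once min_results results are in and every one reports 0 trades."""
    counts = list(trade_counts)
    if len(counts) >= min_results and all(c == 0 for c in counts):
        raise RuntimeError(
            f"All {len(counts)} results have 0 trades; aborting for investigation"
        )
```

Calling `validate_quality_filter(...)` before dispatching configs would turn the dict mistake into one immediate `TypeError` rather than thousands of swallowed exceptions, and calling `assert_not_all_zero(...)` on the first few returned trade counts would stop a sweep whose results are all zeros.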

## File Conventions

- **API routes:** `app/api/[feature]/[action]/route.ts` (Next.js 15 App Router)