critical: Fix distributed worker quality_filter - dict to lambda function

Root cause: Passing dict {'min_adx': 15, 'min_volume_ratio': vol_min} when
simulate_money_line() expects callable function.

Bug caused ALL 2,096 backtests to fail with 'dict' object is not callable.

Fix: Changed to lambda function matching comprehensive_sweep.py pattern:
  quality_filter = lambda s: s.adx >= 15 and s.volume_ratio >= vol_min

Verified fix working: Workers running at 100% CPU, no errors after 2+ minutes.
2025-12-01 14:59:08 +01:00
parent a886555d44
commit 11a0ea324b
2 changed files with 151 additions and 5 deletions
--- a/cluster/CRITICAL_BUG_FIX_DEC1_2025.md
+++ b/cluster/CRITICAL_BUG_FIX_DEC1_2025.md
@@ -0,0 +1,144 @@
 # CRITICAL BUG FIX - Distributed Worker Quality Filter (Dec 1, 2025)
 ## 🔥 Critical Bug Discovered
 **Date:** December 1, 2025, 14:40 (UTC+01:00)
 **Impact:** ALL 2,096 backtests failed with `'dict' object is not callable` error
 **Severity:** CRITICAL - Blocked all distributed work
 ## Symptom
 All parameter combinations tested returned 0 trades:
 - Chunk 0: 2,000 configs, all with `trades=0`
 - Chunk 2: 96 configs, all with `trades=0`
 - Worker logs showed: `Error testing config X: 'dict' object is not callable` (repeated 2,096 times)
 ## Root Cause
 **File:** `cluster/distributed_worker.py`
 **Lines:** 67-70
 **BROKEN CODE:**
 ```python
 # Quality filter (matches comprehensive_sweep.py)
 quality_filter = {
    'min_adx': 15,
    'min_volume_ratio': vol_min,
 }
 ```
 **Problem:** `quality_filter` was a `dict`, but `simulate_money_line()` expects a **callable** that takes a signal and returns a truthy or falsy value.
 ## Investigation Timeline
 1. **14:35** - User reported "something finished"
 2. **14:40** - Discovered all 2,096 results had 0 trades
 3. **14:45** - Found error in worker logs: `'dict' object is not callable`
 4. **14:50** - Compared to `comprehensive_sweep.py` (working version)
 5. **14:52** - **ROOT CAUSE IDENTIFIED**: dict vs lambda function
 6. **14:55** - Fix applied and deployed
 7. **15:00** - Fix verified working (workers at 100% CPU, no errors)
 ## The Fix
 **BEFORE (BROKEN):**
 ```python
 quality_filter = {
    'min_adx': 15,
    'min_volume_ratio': vol_min,
 }
 ```
 **AFTER (FIXED):**
 ```python
 # CRITICAL FIX (Dec 1, 2025): Must be lambda function, not dict!
 # Bug was passing dict which caused "'dict' object is not callable" error
 if vol_min > 0:
    quality_filter = lambda s: s.adx >= 15 and s.volume_ratio >= vol_min
 else:
    quality_filter = None
 ```
 ## Why It Broke
 In `backtester/simulator.py` (line 118):
 ```python
 if not quality_filter(signal):
    continue
 ```
 The simulator calls `quality_filter(signal)` as a **function**. When the worker passed a dict, Python attempted to call the dict object and raised `TypeError: 'dict' object is not callable`.
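The failure can be reproduced in isolation. The sketch below is not the project's code: the `signal` object and the `passes` helper are hypothetical stand-ins, and only the attribute names `adx` and `volume_ratio` and the call style `quality_filter(signal)` are taken from this report.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a signal; the real signal class is not shown here.
signal = SimpleNamespace(adx=22.0, volume_ratio=1.4)
vol_min = 1.2

# The broken value (a dict) and the fixed value (a callable).
broken_filter = {"min_adx": 15, "min_volume_ratio": vol_min}
fixed_filter = lambda s: s.adx >= 15 and s.volume_ratio >= vol_min


def passes(quality_filter, sig):
    """Mirror the simulator's call style: quality_filter(signal)."""
    return bool(quality_filter(sig))


try:
    passes(broken_filter, signal)
    broken_error = None
except TypeError as exc:
    # Calling a dict raises TypeError with the message seen in the worker logs.
    broken_error = str(exc)

fixed_result = passes(fixed_filter, signal)

print(broken_error)
print(fixed_result)
```

The dict carries the same thresholds as the lambda, which is likely why the mistake looked plausible on review; the difference only appears when the simulator tries to call it.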
 ## How It Was Missed
 - Coordinator and worker infrastructure all worked correctly
 - Data loaded successfully (34,273 rows)
 - Multiprocessing started without errors
 - Worker's exception handler caught the error and returned zeros
 - **Silent failure:** No crash, just invalid results
 - Files created looked successful (183KB)
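One way to keep a per-config exception handler without hiding the failure is to record the error next to the zeroed result and check the failure rate afterwards. This is a minimal sketch under assumptions: `run_config`, `failing_simulate`, and the result-row layout are hypothetical and do not reflect the actual worker code.

```python
def run_config(config_id, simulate):
    """Run one config; on failure keep a zeroed row but record the reason."""
    try:
        trades = simulate(config_id)
        return {"config_id": config_id, "trades": trades, "error": None}
    except Exception as exc:
        return {
            "config_id": config_id,
            "trades": 0,
            "error": f"{type(exc).__name__}: {exc}",
        }


def failing_simulate(config_id):
    """Hypothetical simulation that hits the same bug as the worker."""
    quality_filter = {"min_adx": 15}
    return quality_filter(config_id)  # raises TypeError: dict is not callable


results = [run_config(i, failing_simulate) for i in range(5)]
failed = [row for row in results if row["error"] is not None]
failure_rate = len(failed) / len(results)

print(f"{len(failed)}/{len(results)} configs failed")
print(results[0]["error"])
```

With the error stored per row, a result file full of zeros can be told apart from a sweep that genuinely produced no trades.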
 ## Verification Steps
 1. ✅ Deployed fixed code to worker1
 2. ✅ Cleaned up invalid results and database
 3. ✅ Restarted coordinator with fixed worker
 4. ✅ Verified no `'dict' object is not callable` errors in logs
 5. ✅ Confirmed 24 Python processes running at 100% CPU
 6. ✅ Workers actively computing (no immediate errors for 2+ minutes)
 ## Lessons Learned
 1. **Type matters:** Dict vs callable - subtle but critical difference
 2. **Silent failures are dangerous:** Exception handler hid the severity
 3. **Compare to working code:** `comprehensive_sweep.py` had correct pattern
 4. **Verify results quality:** All zeros = red flag, investigate immediately
 5. **Test changes locally first:** A local run would have caught this earlier
 6. **Add validation:** Should detect all-zero results and abort
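Lessons 1 and 6 could be turned into two small guards, sketched below. Both function names, the exception choices, and the `{"trades": ...}` row layout are assumptions made for illustration; neither guard exists in the worker as described here.

```python
def validate_quality_filter(quality_filter):
    """Fail fast if the filter is neither None nor callable."""
    if quality_filter is not None and not callable(quality_filter):
        raise TypeError(
            "quality_filter must be callable or None, "
            f"got {type(quality_filter).__name__}"
        )
    return quality_filter


def check_chunk_results(results):
    """Abort when a non-empty chunk has no config with trades > 0."""
    with_trades = sum(1 for row in results if row.get("trades", 0) > 0)
    if results and with_trades == 0:
        raise RuntimeError(
            f"all {len(results)} configs returned 0 trades; aborting chunk"
        )
    return with_trades


# Usage: the dict from the bug is rejected before any simulation runs.
try:
    validate_quality_filter({"min_adx": 15, "min_volume_ratio": 1.2})
except TypeError as exc:
    print(exc)

# Usage: an all-zero chunk is rejected instead of being written to disk.
try:
    check_chunk_results([{"trades": 0}, {"trades": 0}])
except RuntimeError as exc:
    print(exc)
```

The first guard would have stopped the run at the first config; the second is a backstop for any other cause of uniformly empty results.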
 ## Files Changed
 - `cluster/distributed_worker.py` - Fixed quality_filter (dict → lambda)
 ## Commit
 ```bash
 git add cluster/distributed_worker.py cluster/CRITICAL_BUG_FIX_DEC1_2025.md
 git commit -m "critical: Fix distributed worker quality_filter - dict to lambda function
 Root cause: Passing dict {'min_adx': 15, 'min_volume_ratio': vol_min} when
 simulate_money_line() expects callable function.
 Bug caused ALL 2,096 backtests to fail with 'dict' object is not callable.
 Fix: Changed to lambda function matching comprehensive_sweep.py pattern:
  quality_filter = lambda s: s.adx >= 15 and s.volume_ratio >= vol_min
 Verified fix working: Workers running at 100% CPU, no errors after 2+ minutes.
 "
 git push
 ```
 ## Status
 - ✅ Bug identified and fixed
 - ✅ Code deployed to worker1
 - ✅ Coordinator restarted
 - ✅ Workers actively processing (100% CPU, no errors)
 - ⏳ Awaiting completion of chunk 0 (2,000 configs, ~22 minutes estimated)
 - ⏳ Full sweep restart: 4,096 configs total
 ## Expected Timeline
 - **Chunk 0:** ~22 minutes (2,000 configs)
 - **Chunk 1:** ~22 minutes (2,000 configs) 
 - **Chunk 2:** ~1 minute (96 configs)
 - **Total:** ~45 minutes for complete sweep
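The estimates above are consistent with a constant per-config rate and chunks that run one after another; both are assumptions here, not measurements. A quick check of the arithmetic:

```python
# Chunk sizes and the ~22-minute estimate for a 2,000-config chunk come from
# the timeline above; the constant rate and sequential execution are assumed.
chunk_sizes = {"chunk 0": 2000, "chunk 1": 2000, "chunk 2": 96}
minutes_per_config = 22 / 2000

estimates = {
    name: size * minutes_per_config for name, size in chunk_sizes.items()
}
total_configs = sum(chunk_sizes.values())
total_minutes = sum(estimates.values())

for name, minutes in estimates.items():
    print(f"{name}: ~{minutes:.0f} min")
print(f"total: {total_configs} configs, ~{total_minutes:.0f} min")
```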
 ## Next Steps
 1. Monitor chunk 0 completion (~10 minutes remaining)
 2. Verify results have trades > 0 (not all zeros)
 3. Import successful results to database
 4. Analyze top performers
 5. Deploy to worker2 for parallel processing
--- a/cluster/distributed_worker.py
+++ b/cluster/distributed_worker.py
@@ -63,11 +63,13 @@ def test_config(args):
        max_bars_per_trade=max_bars,
    )
-    # Quality filter (matches comprehensive_sweep.py)
-    quality_filter = {
-        'min_adx': 15,
-        'min_volume_ratio': vol_min,
-    }
+    # Quality filter (matches comprehensive_sweep.py signature)
+    # CRITICAL FIX (Dec 1, 2025): Must be lambda function, not dict!
+    # Bug was passing dict which caused "'dict' object is not callable" error
+    if vol_min > 0:
+        quality_filter = lambda s: s.adx >= 15 and s.volume_ratio >= vol_min
+    else:
+        quality_filter = None
    # Run simulation
    try: