
mindesbunister 11a0ea324b critical: Fix distributed worker quality_filter - dict to lambda function

Root cause: Passing dict {'min_adx': 15, 'min_volume_ratio': vol_min} when
simulate_money_line() expects callable function.

Bug caused ALL 2,096 backtests to fail with 'dict' object is not callable.

Fix: Changed to lambda function matching comprehensive_sweep.py pattern:
  quality_filter = lambda s: s.adx >= 15 and s.volume_ratio >= vol_min

Verified fix working: Workers running at 100% CPU, no errors after 2+ minutes.

2025-12-01 14:59:08 +01:00


CRITICAL BUG FIX - Distributed Worker Quality Filter (Dec 1, 2025)

🔥 Critical Bug Discovered

Date: December 1, 2025, 14:40 UTC
Impact: ALL 2,096 backtests failed with a 'dict' object is not callable error
Severity: CRITICAL - Blocked all distributed work

Symptom

All parameter combinations tested returned 0 trades:

Chunk 0: 2,000 configs, all with trades=0
Chunk 2: 96 configs, all with trades=0
Worker logs showed: Error testing config X: 'dict' object is not callable (repeated 2,096 times)

Root Cause

File: cluster/distributed_worker.py
Lines: 67-70

BROKEN CODE:

# Quality filter (matches comprehensive_sweep.py)
quality_filter = {
    'min_adx': 15,
    'min_volume_ratio': vol_min,
}

Problem: A dict is passed where simulate_money_line() expects a callable.
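
A type check at the boundary would have turned 2,096 repeated per-config failures into one immediate error. The sketch below is a hypothetical guard (check_quality_filter is not part of the existing code). It assumes, as the fixed code does, that None is a valid "no filter" value.

```python
from typing import Any, Callable, Optional

# Hypothetical alias: a filter takes one signal and returns True to keep it.
QualityFilter = Callable[[Any], bool]


def check_quality_filter(quality_filter: Optional[QualityFilter]) -> None:
    """Fail fast if quality_filter is neither None nor callable.

    Called once before any backtest runs, this raises a single clear
    error instead of letting every config fail inside the loop.
    """
    if quality_filter is not None and not callable(quality_filter):
        raise TypeError(
            "quality_filter must be a callable or None, "
            f"got {type(quality_filter).__name__}"
        )
```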

Investigation Timeline

14:35 - User reported "something finished"
14:40 - Discovered all 2,096 results had 0 trades
14:45 - Found error in worker logs: 'dict' object is not callable
14:50 - Compared to comprehensive_sweep.py (working version)
14:52 - ROOT CAUSE IDENTIFIED: dict vs lambda function
14:55 - Fix applied and deployed
15:00 - Fix verified working (workers at 100% CPU, no errors)

The Fix

BEFORE (BROKEN):

quality_filter = {
    'min_adx': 15,
    'min_volume_ratio': vol_min,
}

AFTER (FIXED):

# CRITICAL FIX (Dec 1, 2025): Must be lambda function, not dict!
# Bug was passing dict which caused "'dict' object is not callable" error
if vol_min > 0:
    quality_filter = lambda s: s.adx >= 15 and s.volume_ratio >= vol_min
else:
    quality_filter = None
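
The same logic can also be written as a small named factory instead of an inline lambda. This is a sketch, not the committed fix: make_quality_filter and its min_adx parameter are hypothetical names, and it assumes signals expose adx and volume_ratio attributes as in the lambda above. It mirrors the fixed code exactly, including returning None (so no ADX filter either) when vol_min <= 0.

```python
from typing import Any, Callable, Optional


def make_quality_filter(
    vol_min: float, min_adx: float = 15
) -> Optional[Callable[[Any], bool]]:
    """Build the per-signal quality filter, or None when volume filtering is off."""
    if vol_min <= 0:
        # Same behavior as the fixed worker code: no filter at all.
        return None

    def quality_filter(signal: Any) -> bool:
        # Keep the signal only if both trend strength and volume pass.
        return signal.adx >= min_adx and signal.volume_ratio >= vol_min

    return quality_filter
```

Usage: quality_filter = make_quality_filter(vol_min). A named inner function is easier to unit-test and appears by name in tracebacks instead of as <lambda>. Like a lambda, it cannot be pickled by the standard pickle module, so if filters ever need to cross process boundaries they should be built inside the worker process.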

Why It Broke

In backtester/simulator.py (line 118):

if not quality_filter(signal):
    continue

The code calls quality_filter(signal) as a function. When a dict is passed instead, Python attempts to call the dict object and raises TypeError: 'dict' object is not callable.
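
The failure can be reproduced in isolation. In this minimal sketch, SimpleNamespace stands in for a real signal object and vol_min = 1.0 is an arbitrary example value; neither comes from the backtester.

```python
from types import SimpleNamespace

vol_min = 1.0  # arbitrary example value
signal = SimpleNamespace(adx=20.0, volume_ratio=1.2)  # stand-in for a real signal

broken_filter = {'min_adx': 15, 'min_volume_ratio': vol_min}  # the old dict
fixed_filter = lambda s: s.adx >= 15 and s.volume_ratio >= vol_min  # the fix

error_message = None
try:
    broken_filter(signal)  # same call shape as simulator.py line 118
except TypeError as exc:
    error_message = str(exc)

print(f"broken: {error_message}")
print(f"fixed: {fixed_filter(signal)}")
```

Running it prints broken: 'dict' object is not callable, followed by fixed: True.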

How It Was Missed

Coordinator and worker infrastructure all worked correctly
Data loaded successfully (34,273 rows)
Multiprocessing started without errors
Worker's exception handler caught the error and returned zeros
Silent failure: No crash, just invalid results
Output files were created and looked normal (183KB)
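
The silent failure came from the exception handler converting errors into zero-trade results. A hedged alternative is to record each failure explicitly so it cannot be mistaken for "no trades". run_config_safely, the backtest callable, and the error/trades fields below are hypothetical; the worker's actual result format may differ.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)


def run_config_safely(
    config: Dict[str, Any],
    backtest: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> Dict[str, Any]:
    """Run one backtest; on failure, return an explicit error row, not zeros."""
    try:
        result = backtest(config)
        result["error"] = None
        return result
    except Exception as exc:
        # Log the full traceback so the cause is visible in worker logs.
        logger.exception("Error testing config %s", config)
        # trades=None (not 0) keeps failures distinct from genuine zero-trade runs.
        return {
            "config": config,
            "trades": None,
            "error": f"{type(exc).__name__}: {exc}",
        }
```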

Verification Steps

✅ Deployed fixed code to worker1
✅ Cleaned up invalid results and database
✅ Restarted coordinator with fixed worker
✅ Verified no 'dict' object is not callable errors in logs
✅ Confirmed 24 Python processes running at 100% CPU
✅ Workers actively computing (no immediate errors for 2+ minutes)
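
The log check can be scripted rather than inspected by eye. This sketch assumes worker output is written to a plain-text log file; the helper names and any file path passed to them are hypothetical.

```python
from pathlib import Path
from typing import Iterable

ERROR_MARKER = "'dict' object is not callable"


def count_error_lines(lines: Iterable[str], marker: str = ERROR_MARKER) -> int:
    """Count log lines that contain the marker string."""
    return sum(1 for line in lines if marker in line)


def count_errors_in_file(path: Path, marker: str = ERROR_MARKER) -> int:
    # errors="replace" keeps a stray non-UTF-8 byte from aborting the scan.
    with path.open(encoding="utf-8", errors="replace") as fh:
        return count_error_lines(fh, marker)
```

For example, count_errors_in_file(Path("worker.log")) (hypothetical path) should return 0 once the fixed code is running.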

Lessons Learned

Type matters: Dict vs callable - subtle but critical difference
Silent failures are dangerous: Exception handler hid the severity
Compare to working code: comprehensive_sweep.py had correct pattern
Verify results quality: All zeros = red flag, investigate immediately
Test changes locally before deploying: a local run would have caught this before the full sweep started
Add validation: Should detect all-zero results and abort
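
The last lesson (detect all-zero results and abort) could look like the sketch below. It assumes each result row is a dict with a trades key, matching the trades=0 values in the symptom above; validate_chunk_results and AllZeroResultsError are hypothetical names.

```python
from typing import Any, Dict, Iterable, List


class AllZeroResultsError(RuntimeError):
    """Raised when a chunk's results look like a silent failure."""


def validate_chunk_results(results: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the rows unchanged, or raise if the chunk looks invalid."""
    rows = list(results)
    if not rows:
        raise AllZeroResultsError("chunk produced no results")
    # Treat a missing or None trade count as zero for this check.
    if all((row.get("trades") or 0) == 0 for row in rows):
        raise AllZeroResultsError(
            f"all {len(rows)} configs have trades == 0; aborting import"
        )
    return rows
```

Calling it before importing a chunk into the database would have stopped both the 2,000-config and the 96-config all-zero chunks.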

Files Changed

cluster/distributed_worker.py - Fixed quality_filter (dict → lambda)

Commit

git add cluster/distributed_worker.py cluster/CRITICAL_BUG_FIX_DEC1_2025.md
git commit -m "critical: Fix distributed worker quality_filter - dict to lambda function

Root cause: Passing dict {'min_adx': 15, 'min_volume_ratio': vol_min} when
simulate_money_line() expects callable function.

Bug caused ALL 2,096 backtests to fail with 'dict' object is not callable.

Fix: Changed to lambda function matching comprehensive_sweep.py pattern:
  quality_filter = lambda s: s.adx >= 15 and s.volume_ratio >= vol_min

Verified fix working: Workers running at 100% CPU, no errors after 2+ minutes.
"
git push

Status

✅ Bug identified and fixed
✅ Code deployed to worker1
✅ Coordinator restarted
✅ Workers actively processing (100% CPU, no errors)
⏳ Awaiting completion of chunk 0 (2,000 configs, ~22 minutes estimated)
⏳ Full sweep restart: 4,096 configs total

Expected Timeline

Chunk 0: ~22 minutes (2,000 configs)
Chunk 1: ~22 minutes (2,000 configs)
Chunk 2: ~1 minute (96 configs)
Total: ~45 minutes for complete sweep
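
The timeline is simple proportional arithmetic. The sketch assumes a uniform cost per config derived from the chunk 0 figure (22 minutes for 2,000 configs, about 0.011 minutes or 0.66 seconds per config across all worker processes).

```python
# Rough sweep-time estimate, assuming every config costs the same.
MINUTES_PER_CONFIG = 22 / 2000  # from the chunk 0 estimate

chunks = {"chunk 0": 2000, "chunk 1": 2000, "chunk 2": 96}

estimates = {name: size * MINUTES_PER_CONFIG for name, size in chunks.items()}
total_configs = sum(chunks.values())
total_minutes = sum(estimates.values())

for name, minutes in estimates.items():
    print(f"{name}: ~{minutes:.1f} min")
print(f"total: {total_configs} configs, ~{total_minutes:.1f} min")
```

Scaling gives about 1.1 minutes for the 96-config chunk and about 45 minutes for all 4,096 configs, consistent with the estimates listed.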

Next Steps

Monitor chunk 0 completion (~10 minutes remaining)
Verify results have trades > 0 (not all zeros)
Import successful results to database
Analyze top performers
Deploy to worker2 for parallel processing
