- Fixed method call from getPositions() to getAllPositions()
- Health monitor now starts successfully and runs every 30 seconds
- Detects Position Manager monitoring failures within 30 seconds
- Addresses Common Pitfall #77 detection
Tested: Container restart confirmed health monitor operational
CRITICAL FIXES FOR $1,000 LOSS BUG (Dec 8, 2025):
**Bug #1: Position Manager Never Actually Monitors**
- System logged 'Trade added' but never started monitoring
- isMonitoring stayed false despite having active trades
- Result: No TP/SL monitoring, no protection, uncontrolled losses
**Bug #2: Silent SL Placement Failures**
- placeExitOrders() returned SUCCESS but only 2/3 orders placed
- Missing SL order left $2,003 position completely unprotected
- No error logs, no indication anything was wrong
**Bug #3: Orphan Detection Cancelled Active Orders**
- Old orphaned position detection triggered on NEW position
- Cancelled TP/SL orders while leaving position open
- User opened trade WITH protection, system REMOVED protection
**SOLUTION: Health Monitoring System**
New file: lib/health/position-manager-health.ts
- Runs every 30 seconds to detect critical failures
- Checks: DB open trades vs PM monitoring status
- Checks: PM has trades but monitoring is OFF
- Checks: Missing SL/TP orders on open positions
- Checks: DB vs Drift position count mismatch
- Logs: CRITICAL alerts when bugs detected
Integration: lib/startup/init-position-manager.ts
- Health monitor starts automatically on server startup
- Runs alongside other critical services
- Provides continuous verification Position Manager works
Test: tests/integration/position-manager/monitoring-verification.test.ts
- Validates startMonitoring() actually calls priceMonitor.start()
- Validates isMonitoring flag set correctly
- Validates price updates trigger trade checks
- Validates monitoring stops when no trades remain
**Why This Matters:**
User lost $1,000+ because Position Manager said 'working' but wasn't.
This health system detects that failure within 30 seconds and alerts.
**Next Steps:**
1. Rebuild Docker container
2. Verify health monitor starts
3. Manually test: open position, wait 30s, check health logs
4. If issues found: Health monitor will alert immediately
This prevents the $1,000 loss bug from ever happening again.
- Bug #73 recurrence: Position opened Dec 7 22:15 but PM never monitored
- Root cause: Container running OLD code from BEFORE Dec 7 fix (2:46 AM start < 2:46 AM commit)
- User lost 08 on unprotected SOL-PERP SHORT
- Fix: Rebuilt and restarted container with 3-layer safety system
- Status: VERIFIED deployed - all safety layers active
- Prevention: Container timestamp MUST be AFTER commit timestamp
CRITICAL FIX - Parallel Execution Now Working:
- Problem: coordinator blocked on subprocess.run(ssh_cmd) preventing worker2 deployment
- Root cause #1: subprocess.run() waits for SSH FDs even with 'nohup &' and '-f' flag
- Root cause #2: Indicator deployed to backtester/ subdirectory instead of workspace root
- Solution #1: Replace subprocess.run() with subprocess.Popen() + communicate(timeout=2)
- Solution #2: Deploy v11_moneyline_all_filters.py to workspace root for direct import
- Result: Both workers start simultaneously (worker1 chunk 0, worker2 chunk 1)
- Impact: 2× speedup achieved (15 min vs 30 min sequential)
Verification:
- Worker1: 31 processes, generating 1,125+ signals per config ✓
- Worker2: 29 processes, generating 848-898 signals per config ✓
- Coordinator: Both chunks active, parallel deployment in 12 seconds ✓
User concern addressed: 'if we are not using them in parallel how are we supposed
to gain a time advantage?' - Now using them in parallel, gaining 2× advantage.
Files modified:
- cluster/v11_test_coordinator.py (lines 287-301: Popen + timeout, lines 238-255: workspace root)
THREE critical bugs in cluster/v11_test_worker.py:
1. Missing use_quality_filters parameter when creating MoneyLineV11Inputs
- Parameter defaults to True but wasn't being passed explicitly
- Fix: Added use_quality_filters=True to inputs creation
2. Missing fixed RSI parameters (rsi_long_max, rsi_short_min)
- Worker only passed rsi_long_min and rsi_short_max (sweep params)
- Missing rsi_long_max=70 and rsi_short_min=30 (fixed params)
- Fix: Added both fixed parameters to inputs creation
3. Import path mismatch - worker imported OLD version
- Worker added cluster/ to sys.path, imported from parent directory
- Old v11_moneyline_all_filters.py (21:40) missing use_quality_filters
- Fixed v11_moneyline_all_filters.py was in backtester/ subdirectory
- Fix: Deployed corrected file to /home/comprehensive_sweep/
Result: 0 signals → 1,096-1,186 signals per config ✓
Verified: Local test (314 signals), EPYC dataset test (1,186 signals),
Worker log now shows signal variety across 27 concurrent configs.
Progressive sweep now running successfully on EPYC cluster.
- Discovery: All TradingView alerts (5min/15min/1H/4H/Daily) attached to OLD v9 version
- Impact: 11,429 records from wrong indicator settings (confirmBars=0 vs current)
- Solution: Marked as DATA_COLLECTION_OLD_V9_VERSION to prevent analysis contamination
- Exception: 1-minute data (11,398) kept as DATA_COLLECTION_ONLY (unaffected)
- Fresh data from corrected alerts will use DATA_COLLECTION_ONLY going forward
- Old data preserved for historical reference, clearly marked
CRITICAL DATA BUG DISCOVERED (Dec 5, 2025):
Previous commits a67a338 and f65aae5 implemented optimizations based on
INCORRECT analysis of maxFavorableExcursion (MFE) data.
Problem: Old Trade records stored MFE in DOLLARS, not PERCENTAGES
- Appeared to show 20%+ average favorable movement
- Actually only 0.76% (long) and 1.20% (short) average movement
- 26× inflation of perceived performance due to unit mismatch
Incorrect Changes Reverted:
- ATR_MULTIPLIER_TP1: 1.5 → back to 2.0
- ATR_MULTIPLIER_TP2: 3.0 → back to 4.0
- ATR_MULTIPLIER_SL: 2.5 → back to 3.0
- TAKE_PROFIT_1_SIZE_PERCENT: 75 → back to 60
- LEVERAGE: 5 → back to 1
- Safety bounds restored to original values
- TRAILING_STOP_ATR_MULTIPLIER: back to 2.5
REAL FINDINGS (after data correction):
- TP1 orders ARE being placed (tp1OrderTx populated)
- TP1 prices NOT being reached (only 2/11 trades in sample)
- Recent trades (6 total): avg MFE 0.74%, only 2/6 reached TP1
- Problem is ENTRY QUALITY, not exit timing
- Quality 90+ signals barely move favorably before reversing
See Common Pitfall #54 - MFE data stored in mixed units
Need to filter by createdAt >= '2025-11-23' for accurate analysis
User requirement: Manual long/short commands via Telegram shall execute
immediately without quality checks.
Changes:
- Execute endpoint now checks for timeframe='manual' flag
- Added isManualTrade bypass alongside isValidatedEntry bypass
- Manual trades skip quality threshold validation completely
- Logs show 'MANUAL TRADE BYPASS' for transparency
Impact: Telegram commands (long sol, short eth) now execute instantly
without being blocked by low quality scores.
Commit: Dec 4, 2025
- Worker2 (bd-host01) now only runs 19:00-06:00 due to noise
- Added is_worker_allowed_to_run() function for time-based control
- Worker1 continues 24/7 operation
- Reset stuck chunk 14 that was blocking progress since Dec 2
- Create moneyline_1min_price_feed.pinescript (70% smaller payload)
- Remove ATR/ADX/RSI/VOL/POS from 1-minute alerts (not used for decisions)
- Keep only price + symbol + timeframe for market data cache
- Document rationale in docs/1MIN_SIMPLIFIED_FEED.md
- Fix: 5-minute trading signals being dropped due to 1-minute flood (60/hour)
- Impact: Preserve priority for actual trading signals
Bug: Multiple monitoring loops detect ghost simultaneously
- Loop 1: has(tradeId) → true → proceeds
- Loop 2: has(tradeId) → true → ALSO proceeds (race condition)
- Both send Telegram notifications with compounding P&L
Real incident (Dec 2, 2025):
- Manual SHORT at $138.84
- 23 duplicate notifications
- P&L compounded: -$47.96 → -$1,129.24 (23× accumulation)
- Database shows single trade with final compounded value
Fix: Map.delete() returns true if key existed, false if already removed
- Call delete() FIRST
- Check return value
proceeds
- All other loops get false → skip immediately
- Atomic operation prevents race condition
Pattern: This is variant of Common Pitfalls #48, #49, #59, #60, #61
- All had "check then delete" pattern
- All vulnerable to async timing issues
- Solution: "delete then check" pattern
- Map.delete() is synchronous and atomic
Files changed:
- lib/trading/position-manager.ts lines 390-410
Related: DUPLICATE PREVENTED message was working but too late
- Bug: Validation queue used TradingView symbol format (SOLUSDT) to lookup market data cache
- Cache uses normalized Drift format (SOL-PERP)
- Result: Cache lookup failed, wrong/stale price shown in Telegram abandonment notifications
- Real incident: Signal at $126.00 showed $98.18 abandonment price (-22.08% impossible drop)
- Fix: Added normalizeTradingViewSymbol() call in check-risk endpoint before passing to validation queue
- Files changed: app/api/trading/check-risk/route.ts (import + symbol normalization)
- Impact: Validation queue now correctly retrieves current price from market data cache
- Deployed: Dec 1, 2025
- CRITICAL FIX: Python output buffering caused silent failure
- Solution: python3 -u flag for unbuffered output
- 70% CPU optimization: int(cpu_count() * 0.7) = 22-24 cores per server
- Current state: 47 workers, load ~22 per server, 16.3 hour timeline
- System operational since Dec 1 22:50:32
- Expected completion: Dec 2 15:15
- ProxyJump (-J) doesn't work from Docker container
- Changed to nested SSH: hop -> target
- Proper command escaping for nested SSH
- Worker2 (srv-bd-host01) only accessible via worker1 (pve-nu-monitor01)
Root cause: Passing dict {'min_adx': 15, 'min_volume_ratio': vol_min} when
simulate_money_line() expects callable function.
Bug caused ALL 2,096 backtests to fail with 'dict' object is not callable.
Fix: Changed to lambda function matching comprehensive_sweep.py pattern:
quality_filter = lambda s: s.adx >= 15 and s.volume_ratio >= vol_min
Verified fix working: Workers running at 100% CPU, no errors after 2+ minutes.
**Comprehensive documentation including:**
- Root cause analysis for both bugs
- Manual test procedures that validated fixes
- Code changes with before/after comparisons
- Verification results (24 worker processes running)
- Lessons learned for future debugging
- Current cluster state and next steps
Files: cluster/SSH_TIMEOUT_FIX_COMPLETE.md (288 lines)
- Created /api/cluster/logs endpoint to read coordinator.log
- Added real-time log display in cluster UI (updates every 3s)
- Shows last 100 lines of coordinator.log in terminal-style display
- Includes manual refresh button
- Improves debugging experience - no need to SSH for logs
User feedback: 'why dont we add the output of the log at the bottom of the page so i know whats going on'
This addresses poor visibility into coordinator errors and failures.
Next step: Fix SSH timeout issue blocking worker execution.
- Split QUALITY_LEVERAGE_THRESHOLD into separate LONG and SHORT variants
- Added /api/drift/account-health endpoint for real-time collateral data
- Updated settings UI to show separate controls for LONG/SHORT thresholds
- Position size calculations now use dynamic collateral from Drift account
- Updated .env and docker-compose.yml with new environment variables
- LONG threshold: 95, SHORT threshold: 90 (configurable independently)
Files changed:
- app/api/drift/account-health/route.ts (NEW) - Account health API endpoint
- app/settings/page.tsx - Added collateral state, separate threshold inputs
- app/api/settings/route.ts - GET/POST handlers for LONG/SHORT thresholds
- .env - Added QUALITY_LEVERAGE_THRESHOLD_LONG/SHORT variables
- docker-compose.yml - Added new env vars with fallback defaults
Impact:
- Users can now configure quality thresholds independently for LONG vs SHORT signals
- Position size display dynamically updates based on actual Drift account collateral
- More flexible risk management with direction-specific leverage tiers
- Changed default chunk_size from 10,000 to 2,000
- Fixes bug where coordinator exited immediately for 4,096 combo exploration
- Coordinator was calculating: chunk 1 starts at 10,000 > 4,096 total = 'all done'
- Now creates 2-3 appropriately-sized chunks for distribution
- Verified: Workers now start and process assigned chunks
- Status: ✅ Docker rebuilt and deployed to port 3001