mindesbunister b11da009eb critical: Bug #89 - Detect and handle Drift fractional position remnants (3-part fix)
- Part 1: Position Manager fractional remnant detection after close attempts
  * Check if position < 1.5× minOrderSize after close transaction
  * Log to persistent logger with FRACTIONAL_REMNANT_DETECTED
  * Track closeAttempts, limit to 3 maximum
  * Mark exitReason='FRACTIONAL_REMNANT' in database
  * Remove from monitoring after 3 failed attempts

- Part 2: Pre-close validation in closePosition()
  * Check if position viable before attempting close
  * Reject positions < 1.5× minOrderSize with specific error
  * Prevent wasted transaction attempts on too-small positions
  * Return POSITION_TOO_SMALL_TO_CLOSE error with manual instructions

- Part 3: Health monitor detection for fractional remnants
  * Query Trade table for FRACTIONAL_REMNANT exits in last 24h
  * Alert operators with position details and manual cleanup instructions
  * Provide trade IDs, symbols, and Drift UI link

- Database schema: Added closeAttempts Int? field to Track attempts

Root cause: Drift protocol exchange constraints can leave fractional positions
Evidence: 3 close transactions confirmed but 0.15 SOL remnant persisted
Financial impact: ,000+ risk from unprotected fractional positions
Status: Fix implemented, awaiting deployment verification

See: docs/COMMON_PITFALLS.md Bug #89 for complete incident details
2025-12-16 22:05:12 +01:00

# Common Pitfalls Reference Documentation
> **Last Updated:** December 4, 2025
> **Total Documented:** 72 Pitfalls
> **Primary Source:** `.github/copilot-instructions.md`
## Purpose
This document is the **comprehensive reference** for all documented pitfalls, bugs, and lessons learned from the Trading Bot v4 project. Each entry represents a real incident that caused financial loss, system instability, or operational issues.
**How to Use This Document:**
1. **Before making changes:** Search for related pitfalls to avoid repeating mistakes
2. **When debugging:** Look for symptoms matching your issue
3. **After fixing bugs:** Add new entries to preserve institutional knowledge
4. **Code review:** Verify changes don't reintroduce known issues
**Severity Levels:**
- 🔴 **CRITICAL** - Financial loss, data corruption, or system failure
- ⚠️ **HIGH** - System stability or significant operational impact
- 🟡 **MEDIUM** - Performance degradation or UX issues
- 🔵 **LOW** - Code quality or minor improvements
---
## Quick Reference Table
| # | Severity | Category | Date | Summary |
|---|----------|----------|------|---------|
| 1 | 🔴 CRITICAL | SDK/Memory | Nov 15, 2025 | Drift SDK memory leak - heap OOM after 10+ hours |
| 2 | 🔴 CRITICAL | RPC/Infrastructure | Nov 14, 2025 | Wrong RPC provider (Alchemy) breaks Drift SDK |
| 3 | 🟡 MEDIUM | Build/Docker | - | Prisma not generated in Docker |
| 4 | 🟡 MEDIUM | Configuration | - | Wrong DATABASE_URL for container vs host |
| 5 | 🟡 MEDIUM | Data/Symbols | - | Symbol format mismatch (TradingView → Drift) |
| 6 | ⚠️ HIGH | Orders | - | Missing reduce-only flag on exit orders |
| 7 | 🟡 MEDIUM | Architecture | - | Singleton violations (DriftClient, Position Manager) |
| 8 | 🟡 MEDIUM | Types/Prisma | - | Type errors with Prisma after generate |
| 9 | 🟡 MEDIUM | Code Quality | - | Quality score duplication in check-risk and execute |
| 10 | ⚠️ HIGH | Configuration | - | TP2-as-Runner configuration confusion |
| 11 | 🔴 CRITICAL | P&L Calculation | - | P&L calculation using SDK values incorrectly |
| 12 | 🔴 CRITICAL | Transactions | - | Transaction confirmation missing (phantom trades) |
| 13 | ⚠️ HIGH | Execution Order | - | Execution order matters (Position Manager before DB) |
| 14 | ⚠️ HIGH | Timing | - | New trade grace period (30s for Drift propagation) |
| 15 | 🟡 MEDIUM | SDK/Drift | - | Drift minimum position sizes differ from docs |
| 16 | 🔴 CRITICAL | Exit Logic | - | Exit reason detection bug (using current price) |
| 17 | 🟡 MEDIUM | Cooldown | - | Per-symbol cooldown, not global |
| 18 | ⚠️ HIGH | Quality Scoring | - | Timeframe-aware scoring crucial for 5min |
| 19 | 🔴 CRITICAL | Trading Logic | - | Price position chasing causes flip-flops |
| 20 | 🟡 MEDIUM | TradingView | - | TradingView ADX minimum for 5min charts |
| 21 | 🟡 MEDIUM | Types/Prisma | - | Prisma Decimal type handling in raw SQL |
| 22 | 🔴 CRITICAL | Trailing Stop | Nov 11, 2025 | ATR-based trailing stop implementation bug |
| 23 | 🟡 MEDIUM | Database Schema | - | CreateTradeParams interface sync required |
| 24 | 🔴 CRITICAL | SDK/Units | Nov 12, 2025 | Position.size returns tokens not USD |
| 25 | 🟡 MEDIUM | Display | Nov 12, 2025 | Leverage display showing global instead of symbol-specific |
| 26 | 🟡 MEDIUM | Tracking | Nov 12, 2025 | Indicator version tracking (v5→v6→v7→v8) |
| 27 | 🔴 CRITICAL | Race Condition | Nov 15, 2025 | Runner stop loss gap - no protection between TP1 and TP2 |
| 28 | 🔴 CRITICAL | Race Condition | Nov 12, 2025 | External closure duplicate updates bug |
| 29 | 🔴 CRITICAL | Database | Nov 13, 2025 | Database-First Pattern required |
| 30 | ⚠️ HIGH | Network | Nov 13, 2025 | DNS retry logic needed |
| 31 | 🔴 CRITICAL | Deployment | Nov 13, 2025 | Declaring fixes "working" before deployment |
| 32 | 🔴 CRITICAL | Workflow | Nov 14, 2025 | Phantom trade notification workflow breaks |
| 33 | 🔴 CRITICAL | Data Integrity | Nov 15, 2025 | Wrong entry price after orphaned position restoration |
| 34 | 🔴 CRITICAL | Monitoring | Nov 15, 2025 | Runner stop loss gap (duplicate of #27) |
| 35 | 🔴 CRITICAL | Database | Nov 15, 2025 | Phantom trades need exitReason for cleanup |
| 36 | 🔴 CRITICAL | Rate Limits | Nov 15, 2025 | closePosition() missing retry logic causes rate limit storm |
| 37 | 🔴 CRITICAL | Ghost Positions | Nov 15, 2025 | Ghost position accumulation from failed DB updates |
| 38 | 🟡 MEDIUM | Display | Nov 15, 2025 | Analytics dashboard showing original position size |
| 39 | 🔴 CRITICAL | Permissions | Nov 15, 2025 | Settings UI permission error (.env not writable) |
| 40 | 🔴 CRITICAL | Ghost Positions | Nov 15-16, 2025 | Ghost position death spiral from skipped validation |
| 41 | 🔴 CRITICAL | P&L Calculation | Nov 19, 2025 | Stats API recalculating P&L incorrectly for TP1+runner |
| 42 | 🟡 MEDIUM | Notifications | Nov 16, 2025 | Missing Telegram notifications for position closures |
| 43 | 🔴 CRITICAL | Trailing Stop | Nov 20, 2025 | Runner trailing stop never activates after TP1 |
| 44 | ⚠️ HIGH | DNS | Nov 16, 2025 | Telegram bot DNS resolution failures |
| 45 | 🔴 CRITICAL | SDK/Drift | Nov 16, 2025 | Drift SDK position.entryPrice recalculates after partial closes |
| 46 | 🔴 CRITICAL | Leverage | Nov 16, 2025 | Drift account leverage must be set in UI, not API |
| 47 | 🔴 CRITICAL | Verification | Nov 16, 2025 | Position close verification gap - 6 hours unmonitored |
| 48 | 🔴 CRITICAL | P&L Compounding | Nov 16, 2025 | P&L compounding during close verification |
| 49 | 🔴 CRITICAL | P&L Compounding | Nov 17, 2025 | P&L exponential compounding in external closure detection |
| 50 | 🔴 CRITICAL | Database | Nov 19, 2025 | Database not tracking trades despite successful Drift executions |
| 51 | 🔴 CRITICAL | Detection | Nov 19, 2025 | TP1 detection fails when on-chain orders fill fast |
| 52 | 🔴 CRITICAL | Exit Logic | Nov 19, 2025 | ADX-based runner SL only applied in one code path |
| 53 | 🔴 CRITICAL | Container | Nov 19, 2025 | Container restart kills positions + phantom detection bug |
| 54 | 🔴 CRITICAL | Data Integrity | Nov 23, 2025 | MFE/MAE storing dollars instead of percentages |
| 55 | 🔴 CRITICAL | Configuration | Nov 19-20, 2025 | Settings UI quality score variable name mismatch / BlockedSignalTracker using wrong price source |
| 56 | 🔴 CRITICAL | Ghost Orders | Nov 20-21, 2025 | Ghost orders after external closures + false order count bug |
| 57 | 🔴 CRITICAL | P&L Calculation | Nov 20, 2025 | P&L calculation inaccuracy for external closures |
| 58 | ⚠️ HIGH | Database | Nov 21, 2025 | 5-Layer Database Protection System implemented |
| 59 | 🔴 CRITICAL | Duplicates | Nov 22, 2025 | Layer 2 ghost detection causing duplicate Telegram notifications |
| 60 | 🔴 CRITICAL | Race Condition | Nov 23, 2025 | Stale array snapshot in monitoring loop causes duplicate processing |
| 61 | 🔴 CRITICAL | P&L Compounding | Nov 24, 2025 | P&L compounding STILL happening despite all guards |
| 62 | 🔴 CRITICAL | Quality Check | Nov 24-27, 2025 | Adaptive leverage not working / Execute endpoint bypassing quality threshold |
| 63 | ⚠️ HIGH | Feature | Nov 30, 2025 | Smart Entry Validation System - Block & Watch deployed |
| 64 | 🔴 CRITICAL | Cluster | Dec 1, 2025 | EPYC Cluster SSH Timeout - nested hop requires longer timeouts |
| 65 | 🔴 CRITICAL | Cluster | Dec 1, 2025 | Distributed Worker Quality Filter - dict vs callable |
| 66 | 🔴 CRITICAL | Smart Entry | Dec 1, 2025 | Smart Entry Validation Queue wrong price display |
| 67 | 🔴 CRITICAL | Race Condition | Dec 2, 2025 | Ghost detection race condition causing duplicate notifications with P&L compounding |
| 68 | 🔴 CRITICAL | Smart Entry | Dec 3, 2025 | Smart Entry using webhook percentage as signal price |
| 69 | 🟡 MEDIUM | Configuration | Dec 3, 2025 | Direction-specific leverage thresholds not explicit in code |
| 70 | 🔴 CRITICAL | Smart Entry | Dec 3, 2025 | Smart Validation Queue rejected by execute endpoint |
| 71 | 🔴 CRITICAL | Revenge System | Dec 3, 2025 | Revenge system missing external closure integration |
| 72 | 🔴 CRITICAL | Telegram | Dec 4, 2025 | Telegram webhook conflicts with polling bot |
| 89 | 🔴 CRITICAL | Drift Protocol | Dec 16, 2025 | Drift fractional position remnants after SL execution |
---
## Category Index
### 🔴 P&L Calculation Errors
- [#11](#pitfall-11-pl-calculation-critical) - P&L calculation using SDK values incorrectly
- [#41](#pitfall-41-stats-api-recalculating-pl-incorrectly-critical---fixed-nov-19-2025) - Stats API recalculating P&L incorrectly
- [#48](#pitfall-48-pl-compounding-during-close-verification-critical---fixed-nov-16-2025) - P&L compounding during close verification
- [#49](#pitfall-49-pl-exponential-compounding-in-external-closure-detection-critical---fixed-nov-17-2025) - P&L exponential compounding
- [#54](#pitfall-54-mfemae-storing-dollars-instead-of-percentages-critical---fixed-nov-23-2025) - MFE/MAE storing dollars instead of percentages
- [#57](#pitfall-57-pl-calculation-inaccuracy-for-external-closures-critical---fixed-nov-20-2025) - P&L calculation inaccuracy for external closures
- [#61](#pitfall-61-pl-compounding-still-happening-despite-all-guards-critical---under-investigation-nov-24-2025) - P&L compounding STILL happening
### 🔴 Race Conditions & Duplicates
- [#27](#pitfall-27-runner-stop-loss-gap---no-protection-between-tp1-and-tp2-critical---fixed-nov-15-2025) - Runner stop loss gap - no protection between TP1 and TP2
- [#28](#pitfall-28-external-closure-duplicate-updates-bug-critical---fixed-nov-12-2025) - External closure duplicate updates
- [#59](#pitfall-59-layer-2-ghost-detection-causing-duplicate-telegram-notifications-critical---fixed-nov-22-2025) - Layer 2 ghost detection duplicates
- [#60](#pitfall-60-stale-array-snapshot-in-monitoring-loop-critical---fixed-nov-23-2025) - Stale array snapshot duplicates
- [#67](#pitfall-67-ghost-detection-race-condition-critical---fixed-dec-2-2025) - Ghost detection race condition
### 🔴 SDK/API Integration
- [#1](#pitfall-1-drift-sdk-memory-leak-critical---fixed-nov-15-2025) - Drift SDK memory leak
- [#2](#pitfall-2-wrong-rpc-provider-critical---investigation-complete-nov-14-2025) - Wrong RPC provider (Alchemy)
- [#12](#pitfall-12-transaction-confirmation-critical) - Transaction confirmation missing
- [#24](#pitfall-24-positionsize-tokens-vs-usd-bug-critical---fixed-nov-12-2025) - Position.size tokens vs USD
- [#36](#pitfall-36-closeposition-missing-retry-logic-critical---fixed-nov-15-2025) - closePosition() missing retry logic
- [#45](#pitfall-45-drift-sdk-positionentryprice-recalculates-critical---fixed-nov-16-2025) - position.entryPrice recalculates after partial closes
### 🔴 Database Operations
- [#29](#pitfall-29-database-first-pattern-critical---fixed-nov-13-2025) - Database-First Pattern required
- [#35](#pitfall-35-phantom-trades-need-exitreason-critical---fixed-nov-15-2025) - Phantom trades need exitReason
- [#37](#pitfall-37-ghost-position-accumulation-critical---fixed-nov-15-2025) - Ghost position accumulation
- [#50](#pitfall-50-database-not-tracking-trades-resolved---nov-19-2025) - Database not tracking trades
- [#58](#pitfall-58-5-layer-database-protection-system-implemented---nov-21-2025) - 5-Layer Database Protection System
### 🔴 Configuration & Settings
- [#55](#pitfall-55-configuration-issues-critical---fixed-nov-19-20-2025) - Settings UI quality score variable name mismatch
- [#62](#pitfall-62-adaptive-leverage-and-quality-bypass-critical---fixed-nov-24-27-2025) - Adaptive leverage / Execute endpoint bypassing quality threshold
### 🔴 Deployment & Verification
- [#31](#pitfall-31-declaring-fixes-working-before-deployment-critical---nov-13-2025) - Declaring fixes "working" before deployment
- [#47](#pitfall-47-position-close-verification-gap-critical---fixed-nov-16-2025) - Position close verification gap - 6 hours unmonitored
### 🔴 Smart Entry & Validation
- [#63](#pitfall-63-smart-entry-validation-system-deployed---nov-30-2025) - Smart Entry Validation System
- [#66](#pitfall-66-smart-entry-wrong-price-display-critical---fixed-dec-1-2025) - Smart Entry wrong price display
- [#68](#pitfall-68-smart-entry-using-webhook-percentage-critical---fixed-dec-3-2025) - Smart Entry using webhook percentage
- [#70](#pitfall-70-smart-validation-queue-rejected-critical---fixed-dec-3-2025) - Smart Validation Queue rejected by execute
### ⚠️ Ghost Positions & Orders
- [#40](#pitfall-40-ghost-position-death-spiral-critical---fixed-nov-15-16-2025) - Ghost position death spiral
- [#56](#pitfall-56-ghost-orders-after-external-closures-critical---fixed-nov-20-21-2025) - Ghost orders after external closures
### ⚠️ Network & Infrastructure
- [#30](#pitfall-30-dns-retry-logic-high---nov-13-2025) - DNS retry logic
- [#44](#pitfall-44-telegram-bot-dns-resolution-high---fixed-nov-16-2025) - Telegram bot DNS resolution
- [#64](#pitfall-64-epyc-cluster-ssh-timeout-critical---fixed-dec-1-2025) - EPYC Cluster SSH timeout
- [#65](#pitfall-65-distributed-worker-quality-filter-critical---fixed-dec-1-2025) - Distributed Worker dict vs callable
### ⚠️ Trailing Stop & Exit Logic
- [#22](#pitfall-22-atr-based-trailing-stop-implementation-critical---nov-11-2025) - ATR-based trailing stop implementation
- [#43](#pitfall-43-runner-trailing-stop-never-activates-critical---fixed-nov-20-2025) - Runner trailing stop never activates
- [#51](#pitfall-51-tp1-detection-fails-critical---fixed-nov-19-2025) - TP1 detection fails on-chain
- [#52](#pitfall-52-adx-based-runner-sl-critical---fixed-nov-19-2025) - ADX-based runner SL one code path
---
## Detailed Pitfall Entries
### Pitfall #1: Drift SDK Memory Leak (🔴 CRITICAL - Fixed Nov 15, 2025, Enhanced Nov 24, 2025)
**Symptom:** JavaScript heap out of memory after 10+ hours runtime, Telegram bot timeouts (60s)
**Root Cause:** Drift SDK accumulates WebSocket subscriptions over time without cleanup
**Real Incident:**
- Thousands of `accountUnsubscribe error: readyState was 2 (CLOSING)` in logs
- Heap growth: Normal ~200MB → 4GB+ after 10 hours → OOM crash
**Impact:** System crashes after extended uptime, requires manual container restart
**Fix Applied:**
- **File:** `lib/monitoring/drift-health-monitor.ts`
- **Implementation:** Smart error-based health monitoring replaces the blind restart timer
  - `interceptWebSocketErrors()` patches `console.error` to catch SDK WebSocket errors
  - 30-second sliding window: restarts only when 50+ errors occur within 30 seconds
  - Container restart via flag: writes `/tmp/trading-bot-restart.flag` for `watch-restart.sh`
- **API:** `GET /api/drift/health` - Check error count and health status
- **Commit:** Enhanced Nov 24, 2025
**Code Reference:**
```typescript
// lib/monitoring/drift-health-monitor.ts
interceptWebSocketErrors() // Patches console.error
if (errorsInWindow > 50) {
  writeRestartFlag() // Triggers container restart
}
```
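The sliding-window rule can be illustrated with a small timestamp queue. This is a hedged sketch rather than the monitor's actual code: `SlidingWindowErrorCounter` is a hypothetical name, timestamps are assumed to be epoch milliseconds passed in by the caller (which keeps the logic testable), and the restart condition mirrors the `errorsInWindow > 50` check in the code reference.

```typescript
// Hypothetical sketch of a sliding-window error counter (not the project's actual class).
class SlidingWindowErrorCounter {
  // Timestamps (ms) of recent errors, oldest first.
  private timestamps: number[] = []
  private readonly windowMs: number
  private readonly maxErrors: number

  constructor(windowMs: number = 30_000, maxErrors: number = 50) {
    this.windowMs = windowMs   // 30-second window
    this.maxErrors = maxErrors // restart only when this count is exceeded
  }

  // Record one WebSocket error seen at `now` and report whether a restart is warranted.
  recordError(now: number): boolean {
    this.timestamps.push(now)
    return this.errorsInWindow(now) > this.maxErrors
  }

  // Discard errors at or before `now - windowMs`, then count what remains.
  errorsInWindow(now: number): number {
    const cutoff = now - this.windowMs
    while (this.timestamps.length > 0 && this.timestamps[0] <= cutoff) {
      this.timestamps.shift()
    }
    return this.timestamps.length
  }
}
```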
**Prevention:** Monitor for `🏥 Drift health monitor started` and error threshold logs
**Lesson Learned:** Smart, reactive monitoring is better than blind timers. Only restart when actual problems occur, not on a schedule.

---
### Pitfall #2: Wrong RPC Provider (🔴 CRITICAL - Investigation Complete Nov 14, 2025)
**Symptom:** Trades fail, duplicate closes, Position Manager loses tracking, database save failures
**Root Cause:** Alchemy's rate limiting breaks Drift SDK's burst subscription pattern during initialization
**Real Incident (Nov 14, 21:14 CET):**
- Created diagnostic endpoint `/api/testing/drift-init`
- Alchemy: 17-71 subscription errors EVERY init (49 avg over 5 runs), 1644ms avg init time
- Helius: 0 subscription errors EVERY init, 800ms avg init time
**Impact:** Complete system failure when using wrong RPC provider
**Why Alchemy Fails:**
- Drift SDK subscribes to 30-50+ accounts simultaneously during init (burst pattern)
- Alchemy's CUPS enforcement rate limits these burst requests
- Drift SDK does NOT retry failed subscriptions
- SDK reports "initialized successfully" but with incomplete subscription set
- Error: `"Received JSON-RPC error calling accountSubscribe"`
**Fix Applied:**
- **Use Helius RPC** (https://mainnet.helius-rpc.com/?api-key=...)
- Retry logic: 5s exponential backoff for rate limits
- **Documentation:** `docs/ALCHEMY_RPC_INVESTIGATION_RESULTS.md`
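Exponential backoff of the kind described above can be sketched as a generic wrapper, assuming a 5-second base delay that doubles after each failure. `backoffDelayMs` and `withExponentialBackoff` are hypothetical helpers, not project or SDK APIs; since the SDK does not retry its own failed subscriptions, a wrapper like this can only retry calls the bot issues itself (for example, a whole initialization attempt).

```typescript
// Hypothetical helper: delay before retry number `attempt` (0-based): 5s, 10s, 20s, ...
function backoffDelayMs(attempt: number, baseDelayMs: number = 5_000): number {
  return baseDelayMs * 2 ** attempt
}

// Hypothetical wrapper: retry an async operation with exponential backoff.
async function withExponentialBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 5,
  baseDelayMs: number = 5_000,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation()
    } catch (err) {
      lastError = err
      // Wait before the next attempt, but not after the final failure.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, baseDelayMs))
      }
    }
  }
  throw lastError
}
```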
**Code Reference:**
```bash
# Test yourself
curl 'http://localhost:3001/api/testing/drift-init?rpc=alchemy'
```
**Prevention:** ALWAYS use Helius RPC. Do not use Alchemy for Drift SDK.
**Lesson Learned:** Documentation doesn't always reflect reality. Test with real infrastructure before trusting provider claims.

---
### Pitfall #3: Prisma Not Generated in Docker (🟡 MEDIUM)
**Symptom:** Build fails with Prisma client errors
**Root Cause:** Must run `npx prisma generate` in Dockerfile BEFORE `npm run build`
**Fix Applied:** Add `RUN npx prisma generate` before the build step in the Dockerfile

---
### Pitfall #4: Wrong DATABASE_URL (🟡 MEDIUM)
**Symptom:** Database connection failures
**Root Cause:** Container runtime needs `trading-bot-postgres` (container name), Prisma CLI from host needs `localhost:5432`
**Fix Applied:** Use correct hostname based on context:
- Container: `postgresql://postgres:password@trading-bot-postgres:5432/trading_bot_v4`
- Host CLI: `postgresql://postgres:password@localhost:5432/trading_bot_v4`
---
### Pitfall #5: Symbol Format Mismatch (🟡 MEDIUM)
**Symptom:** Drift API rejects orders, symbol not found errors
**Root Cause:** TradingView sends "SOLUSDT" but Drift requires "SOL-PERP"
**Fix Applied:** Always normalize with `normalizeTradingViewSymbol()` before calling Drift
- **File:** `config/trading.ts`
- Applies to ALL endpoints including `/api/trading/close`
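The mapping can be illustrated with a simplified converter. This is a hypothetical sketch, not the project's `normalizeTradingViewSymbol()`: it assumes tickers end in a USD-style quote suffix (`USDT`, `USDC`, `USD`) and ignores exchange prefixes or other suffixes TradingView may attach.

```typescript
// Hypothetical illustration; the project's real helper is normalizeTradingViewSymbol() in config/trading.ts.
function toDriftPerpSymbol(tradingViewSymbol: string): string {
  const upper = tradingViewSymbol.trim().toUpperCase()
  // Already in Drift's "<BASE>-PERP" form: pass through unchanged.
  if (upper.endsWith('-PERP')) return upper
  // Strip a trailing quote-currency suffix, e.g. "SOLUSDT" -> "SOL".
  const base = upper.replace(/(USDT|USDC|USD)$/, '')
  return `${base}-PERP`
}
```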
---
### Pitfall #6: Missing Reduce-Only Flag (⚠️ HIGH)
**Symptom:** Exit orders accidentally open new positions instead of closing
**Root Cause:** Exit orders without `reduceOnly: true` can open new positions
**Fix Applied:** All TP/SL orders MUST include `reduceOnly: true`
```typescript
const orderParams = {
  reduceOnly: true, // CRITICAL for TP/SL orders
  // ... other params
}
```
---
### Pitfall #7: Singleton Violations (🟡 MEDIUM)
**Symptom:** Connection issues, state inconsistencies, multiple WebSocket connections
**Root Cause:** Creating multiple DriftClient or Position Manager instances
**Fix Applied:** Always use getter functions:
```typescript
const driftService = await initializeDriftService() // NOT: new DriftService()
const positionManager = getPositionManager() // NOT: new PositionManager()
const prisma = getPrismaClient() // NOT: new PrismaClient()
```
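The getter pattern can be sketched as a lazily created, module-level instance. Names ending in `Sketch` are hypothetical stand-ins for the real services. The async variant caches the initialization promise instead of the resolved client, an assumed design choice that keeps two concurrent callers from starting two initializations (and two sets of WebSocket subscriptions).

```typescript
// Hypothetical stand-in for a stateful service such as the Position Manager.
class PositionManagerSketch {
  readonly activeTrades = new Map<string, unknown>()
}

// Module-level cache: every caller shares the instance created on first use.
let positionManagerInstance: PositionManagerSketch | undefined

function getPositionManagerSketch(): PositionManagerSketch {
  if (!positionManagerInstance) {
    positionManagerInstance = new PositionManagerSketch()
  }
  return positionManagerInstance
}

// Hypothetical stand-in for a service that needs async initialization.
class DriftServiceSketch {
  static initCount = 0 // counts initializations, for illustration only

  static async create(): Promise<DriftServiceSketch> {
    DriftServiceSketch.initCount++
    return new DriftServiceSketch()
  }
}

// Cache the promise, not the resolved value, so concurrent callers await one initialization.
let driftServicePromise: Promise<DriftServiceSketch> | undefined

function initializeDriftServiceSketch(): Promise<DriftServiceSketch> {
  if (!driftServicePromise) {
    driftServicePromise = DriftServiceSketch.create()
  }
  return driftServicePromise
}
```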
---
### Pitfall #8: Prisma Type Errors (🟡 MEDIUM)
**Symptom:** TypeScript compilation fails with Prisma types
**Root Cause:** Trade type from Prisma only available AFTER `npx prisma generate`
**Fix Applied:** Run `npx prisma generate` after any schema changes

---
### Pitfall #9: Quality Score Duplication (🟡 MEDIUM)
**Symptom:** Inconsistent quality scoring between endpoints
**Root Cause:** Signal quality calculation exists in BOTH `check-risk` and `execute` endpoints
**Fix Applied:** Keep the logic synchronized between both endpoints whenever either one changes

---
### Pitfall #10: TP2-as-Runner Configuration (⚠️ HIGH)
**Symptom:** Confusion about runner size and TP2 behavior
**Root Cause:** `takeProfit2SizePercent: 0` means "TP2 activates trailing stop, no position close"
**Fix Applied:**
- `TAKE_PROFIT_2_PERCENT=0.7` sets TP2 trigger price
- `TAKE_PROFIT_2_SIZE_PERCENT` should be 0 for runner system
- Runner = 100% - TAKE_PROFIT_1_SIZE_PERCENT (default 40%)
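A worked example, assuming TP2's percentage is measured from the entry price and using `TAKE_PROFIT_1_SIZE_PERCENT=60` purely for illustration: a $1,000 long entered at $100 has its TP2 trigger at $100.70 and keeps a $400 (40%) runner, and with `TAKE_PROFIT_2_SIZE_PERCENT=0` TP2 closes nothing. The interface and function are hypothetical and only restate the relationships listed above.

```typescript
type Direction = 'long' | 'short'

// Hypothetical shape for the settings discussed above (all values are percentages).
interface ExitSettings {
  takeProfit1SizePercent: number // share of the position closed at TP1
  takeProfit2Percent: number     // TP2 trigger distance from entry, e.g. 0.7
  takeProfit2SizePercent: number // 0 = TP2 only activates the trailing stop
}

function describeExitPlan(entryPrice: number, direction: Direction, positionSizeUSD: number, s: ExitSettings) {
  // Longs take profit above entry, shorts below it.
  const sign = direction === 'long' ? 1 : -1
  return {
    tp2TriggerPrice: entryPrice * (1 + (sign * s.takeProfit2Percent) / 100),
    runnerSizeUSD: (positionSizeUSD * (100 - s.takeProfit1SizePercent)) / 100,
    tp2ClosesPosition: s.takeProfit2SizePercent > 0,
  }
}
```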
---
### Pitfall #11: P&L Calculation Critical (🔴 CRITICAL)
**Symptom:** Incorrect P&L values in database and analytics
**Root Cause:** Using SDK values instead of actual entry vs exit price calculation
**Fix Applied:**
```typescript
const profitPercent = this.calculateProfitPercent(trade.entryPrice, exitPrice, trade.direction)
const actualRealizedPnL = (closedSizeUSD * profitPercent) / 100
trade.realizedPnL += actualRealizedPnL // NOT: result.realizedPnL from SDK
```
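For reference, a direction-aware profit percentage can be written as below. This is an assumed re-implementation for illustration, not the Position Manager's own `calculateProfitPercent()`. For example, closing $500 of a long entered at $100 and exited at $101 realizes +1% of $500, i.e. $5.

```typescript
type Direction = 'long' | 'short'

// Hypothetical stand-alone version of a direction-aware profit percentage.
function directionalProfitPercent(entryPrice: number, exitPrice: number, direction: Direction): number {
  const priceChangePercent = ((exitPrice - entryPrice) / entryPrice) * 100
  // Shorts profit when price falls, so the sign flips.
  return direction === 'long' ? priceChangePercent : -priceChangePercent
}

// Realized P&L in USD for the portion of the position that was closed.
function realizedPnLUSD(closedSizeUSD: number, profitPercent: number): number {
  return (closedSizeUSD * profitPercent) / 100
}
```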
---
### Pitfall #12: Transaction Confirmation Critical (🔴 CRITICAL)
**Symptom:** "Phantom trades" - SDK returns signatures for transactions that never execute
**Root Cause:** Both `openPosition()` AND `closePosition()` must call `connection.confirmTransaction()`
**Fix Applied:**
```typescript
const txSig = await driftClient.placePerpOrder(orderParams)
console.log('⏳ Confirming transaction on-chain...')
const connection = driftService.getConnection()
const confirmation = await connection.confirmTransaction(txSig, 'confirmed')
if (confirmation.value.err) {
  throw new Error(`Transaction failed: ${JSON.stringify(confirmation.value.err)}`)
}
console.log('✅ Transaction confirmed on-chain')
```
---
### Pitfall #13: Execution Order Matters (⚠️ HIGH)
**Symptom:** Race conditions where monitoring starts before trade exists in database
**Root Cause:** Position Manager added before database save
**Fix Applied:** Order MUST be:
1. Open position + place exit orders
2. Save to database (`createTrade()`)
3. Add to Position Manager (`positionManager.addTrade()`)
---
### Pitfall #14: New Trade Grace Period (⚠️ HIGH)
**Symptom:** New positions immediately detected as "closed externally" and cancelled
**Root Cause:** Drift positions take 5-10 seconds to propagate after opening
**Fix Applied:** Position Manager skips "external closure" detection for trades <30 seconds old
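The grace period reduces to an age check that runs before any "position missing on Drift" logic. `shouldCheckExternalClosure` is a hypothetical name, and timestamps are assumed to be epoch milliseconds.

```typescript
const NEW_TRADE_GRACE_PERIOD_MS = 30_000 // the 30-second rule described above

// Hypothetical helper: skip external-closure detection while Drift may still be propagating the position.
function shouldCheckExternalClosure(tradeOpenedAtMs: number, nowMs: number): boolean {
  return nowMs - tradeOpenedAtMs >= NEW_TRADE_GRACE_PERIOD_MS
}
```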
---
### Pitfall #15: Drift Minimum Position Sizes (🟡 MEDIUM)
**Symptom:** Orders rejected for being too small
**Root Cause:** Actual minimums differ from documentation:
- SOL-PERP: 0.1 SOL (~$5-15)
- ETH-PERP: 0.01 ETH (~$38-40)
- BTC-PERP: 0.0001 BTC (~$10-12)
**Fix Applied:** Check that `minOrderSize × currentPrice` exceeds Drift's $4 minimum, with a safety buffer on top.
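A minimal viability check, assuming sizes are in base-asset units and using a hypothetical 10% buffer (the text only says to add one). At $140 per SOL, for example, the 0.1 SOL minimum is worth $14, well above the $4 floor.

```typescript
const DRIFT_MIN_NOTIONAL_USD = 4 // $4 minimum from the text above

// Hypothetical check; the 1.1 buffer multiplier is an illustrative assumption, not a project setting.
function meetsMinimumOrder(
  sizeBase: number,     // order size in base-asset units (e.g. SOL)
  minOrderSize: number, // per-market minimum in base-asset units (e.g. 0.1 SOL)
  currentPrice: number, // USD price of the base asset
  bufferMultiplier: number = 1.1,
): boolean {
  const notionalUSD = sizeBase * currentPrice
  return sizeBase >= minOrderSize && notionalUSD >= DRIFT_MIN_NOTIONAL_USD * bufferMultiplier
}
```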
---
### Pitfall #16: Exit Reason Detection Bug (🔴 CRITICAL)
**Symptom:** Profitable trades mislabeled as "SL" exits
**Root Cause:** Position Manager using current price to determine exit reason, but on-chain orders filled at different price
**Fix Applied:** Use the `trade.tp1Hit` / `trade.tp2Hit` flags and realized P&L to identify the exit trigger correctly

---
### Pitfall #17: Per-Symbol Cooldown (🟡 MEDIUM)
**Symptom:** ETH trade incorrectly blocking SOL trade
**Root Cause:** Cooldown was global, not per-symbol
**Fix Applied:** Each coin (SOL/ETH/BTC) has independent cooldown timer via `getLastTradeTimeForSymbol(symbol)`
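The per-symbol rule can be sketched with one timestamp per symbol instead of a single global one. This in-memory `PerSymbolCooldown` class is hypothetical; the project looks up the last trade time through `getLastTradeTimeForSymbol(symbol)`, and the cooldown length in the usage is an arbitrary example value.

```typescript
// Hypothetical in-memory sketch: one timestamp per symbol instead of one global timestamp.
class PerSymbolCooldown {
  private readonly lastTradeAtMs = new Map<string, number>()
  private readonly cooldownMs: number

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs
  }

  recordTrade(symbol: string, atMs: number): void {
    this.lastTradeAtMs.set(symbol, atMs)
  }

  // A trade on one symbol never blocks another symbol.
  isCoolingDown(symbol: string, nowMs: number): boolean {
    const last = this.lastTradeAtMs.get(symbol)
    return last !== undefined && nowMs - last < this.cooldownMs
  }
}
```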
---
### Pitfall #18: Timeframe-Aware Scoring Crucial (⚠️ HIGH)
**Symptom:** Valid 5min breakouts blocked as "low quality"
**Root Cause:** Signal quality thresholds not adjusted for 5min vs higher timeframes
- 5min: ADX 12-22 healthy, ATR 0.2-0.7%
- Daily: ADX 18-30 healthy, ATR 0.4%+
**Fix Applied:** Always pass `timeframe` parameter from TradingView alerts to `scoreSignalQuality()`
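The healthy bands quoted above can be expressed as a lookup keyed by timeframe. This is a simplified, hypothetical sketch: the real `scoreSignalQuality()` produces a score rather than a yes/no answer, and the keys `'5'` and `'D'` assume TradingView-style interval strings.

```typescript
// Healthy bands from the text above; everything else here is hypothetical.
interface HealthyBand {
  adxMin: number
  adxMax: number
  atrMinPercent: number
  atrMaxPercent: number // Infinity = no upper bound ("0.4%+")
}

const HEALTHY_BANDS: Record<string, HealthyBand> = {
  '5': { adxMin: 12, adxMax: 22, atrMinPercent: 0.2, atrMaxPercent: 0.7 },
  D: { adxMin: 18, adxMax: 30, atrMinPercent: 0.4, atrMaxPercent: Infinity },
}

function isWithinHealthyBand(timeframe: string, adx: number, atrPercent: number): boolean {
  const band = HEALTHY_BANDS[timeframe]
  if (!band) throw new Error(`No thresholds defined for timeframe "${timeframe}"`)
  return (
    adx >= band.adxMin &&
    adx <= band.adxMax &&
    atrPercent >= band.atrMinPercent &&
    atrPercent <= band.atrMaxPercent
  )
}
```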
---
### Pitfall #19: Price Position Chasing (🔴 CRITICAL)
**Symptom:** Rapid flip-flop losses
**Root Cause:** Opening longs when price sits at 90%+ of its range, or shorts when it sits below 10%
**Real Incident:** Overnight flip-flop losses all had price position 9-94%
**Fix Applied:** Quality scoring now applies a -15 to -30 point penalty for entries at range extremes
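Price position is assumed here to mean where the current price sits between a recent low and high, expressed as 0-100%. The sketch only flags the extremes named above; it does not model the -15 to -30 point penalty sizing, and both function names are hypothetical.

```typescript
type Direction = 'long' | 'short'

// Where the price sits within a low/high range, as 0-100%.
function pricePositionPercent(price: number, rangeLow: number, rangeHigh: number): number {
  if (rangeHigh <= rangeLow) throw new Error('rangeHigh must be greater than rangeLow')
  const raw = ((price - rangeLow) / (rangeHigh - rangeLow)) * 100
  return Math.min(100, Math.max(0, raw)) // clamp prices outside the range
}

// Hypothetical check for the extremes described above: longs at 90%+ or shorts below 10%.
function isChasingRangeExtreme(direction: Direction, positionPercent: number): boolean {
  return direction === 'long' ? positionPercent >= 90 : positionPercent < 10
}
```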
---
### Pitfall #20: TradingView ADX Minimum (🟡 MEDIUM)
**Symptom:** Too many signals blocked or too many low-quality signals passing
**Root Cause:** TradingView ADX filter should be 15 for 5min (not 20+)
**Fix Applied:** Set ADX ≥15 in TradingView alerts for 5min charts. The bot's quality scoring provides a second layer of filtering.

---
### Pitfall #21: Prisma Decimal Type Handling (🟡 MEDIUM)
**Symptom:** Frontend errors with `.toFixed()` on undefined
**Root Cause:** Raw SQL queries return Prisma `Decimal` objects, not plain numbers
**Fix Applied:**
```typescript
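// $queryRaw resolves to an ARRAY of rows; numeric columns arrive as Prisma Decimal objects,
// so use `any` for those fields rather than `number`
const [stat] = await prisma.$queryRaw<{ total_pnl: any }[]>`...`
// Convert with Number() before returning to the frontend
const response = { totalPnL: Number(stat?.total_pnl) || 0 }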
```
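The conversion can be centralized in one helper. This is a hypothetical sketch that relies on Prisma's `Decimal` (a decimal.js instance) returning a numeric string from `valueOf()`, so `Number()` can convert it; it also covers the `bigint` values raw queries return for 64-bit integer columns such as PostgreSQL `COUNT(*)`.

```typescript
// Hypothetical helper for numeric fields in $queryRaw results.
function toPlainNumber(value: unknown, fallback: number = 0): number {
  if (value === null || value === undefined) return fallback
  if (typeof value === 'number') return Number.isFinite(value) ? value : fallback
  if (typeof value === 'bigint') return Number(value) // e.g. COUNT(*) columns
  // Decimal objects and numeric strings: Number() goes through valueOf()/parsing.
  const converted = Number(value)
  return Number.isFinite(converted) ? converted : fallback
}
```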
---
### Pitfall #22: ATR-Based Trailing Stop Implementation (🔴 CRITICAL - Nov 11, 2025)
**Symptom:** Trades with +7-9% MFE exited for losses
**Root Cause:** Runner system was using FIXED 0.3% trailing instead of ATR-based
**Real Incident:** At $168 SOL, 0.3% = $0.50 wiggle room - too tight
**Fix Applied:**
```typescript
trailingDistancePercent = (atrAtEntry / currentPrice * 100) * trailingStopAtrMultiplier
```
**Configuration:**
- `TRAILING_STOP_ATR_MULTIPLIER=1.5`
- `MIN=0.25%`, `MAX=0.9%`
- `ACTIVATION=0.5%`
**Result:** 0.45% ATR × 1.5 = 0.675% trail, i.e. 2.25x more room than 0.3% ($1.13 vs $0.50 at $168)
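The formula and the MIN/MAX settings can be combined as below, assuming MIN and MAX act as clamps on the ATR-derived distance and that the stop trails the best price seen since activation. The interface and function names are hypothetical, and the 0.5% activation threshold is not modeled.

```typescript
interface TrailingStopSettings {
  atrMultiplier: number // TRAILING_STOP_ATR_MULTIPLIER, e.g. 1.5
  minPercent: number    // e.g. 0.25
  maxPercent: number    // e.g. 0.9
}

// ATR-based trail distance in percent, clamped to [minPercent, maxPercent].
function trailingDistancePercent(atrAtEntry: number, currentPrice: number, s: TrailingStopSettings): number {
  const atrPercent = (atrAtEntry / currentPrice) * 100
  const raw = atrPercent * s.atrMultiplier
  return Math.min(s.maxPercent, Math.max(s.minPercent, raw))
}

// Stop price that trails the best price seen so far by the given distance.
function trailingStopPrice(bestPrice: number, distancePercent: number, direction: 'long' | 'short'): number {
  return direction === 'long'
    ? bestPrice * (1 - distancePercent / 100)
    : bestPrice * (1 + distancePercent / 100)
}
```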
**Documentation:** `ATR_TRAILING_STOP_FIX.md`

---
### Pitfall #23: CreateTradeParams Interface Sync (🟡 MEDIUM)
**Symptom:** TypeScript build fails when endpoint passes field not in interface
**Root Cause:** New database fields added to Trade model but not to `CreateTradeParams` interface
**Fix Applied:** When adding new fields:
1. Add to interface in `lib/database/trades.ts`
2. Add to Prisma create data object in `createTrade()` function
---
### Pitfall #24: Position.size Tokens vs USD Bug (🔴 CRITICAL - Fixed Nov 12, 2025)
**Symptom:** Position Manager detects false TP1 hits, moves SL to breakeven prematurely
**Root Cause:** `lib/drift/client.ts` returns `position.size` as BASE ASSET TOKENS (12.28 SOL), not USD ($1,950)
**Real Incident:** Comparing tokens (12.28) directly to USD ($1,950) → "99.4% reduction" → FALSE TP1!
**Fix Applied:**
```typescript
// In Position Manager (lines 322, 519, 558, 591)
const positionSizeUSD = Math.abs(position.size) * currentPrice
// Now compare USD to USD
if (positionSizeUSD < trade.currentSize * 0.95) {
  // Actual 5%+ reduction detected
}
```
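Using the incident's figures (12.28 SOL against a tracked $1,950, which implies a price near $158.80), the conversion looks like this. The helper names are hypothetical; the 0.95 factor mirrors the 5% threshold in the snippet above, and `Math.abs()` assumes the size may be signed.

```typescript
// Hypothetical helper: convert a base-asset token size to USD.
function positionSizeUSD(sizeTokens: number, currentPrice: number): number {
  return Math.abs(sizeTokens) * currentPrice // abs() in case the size is signed (e.g. shorts)
}

// True when the on-chain position is at least 5% smaller than the tracked USD size.
function hasMeaningfulReduction(sizeTokens: number, currentPrice: number, trackedSizeUSD: number): boolean {
  return positionSizeUSD(sizeTokens, currentPrice) < trackedSizeUSD * 0.95
}
```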
**Impact:** Without this fix, TP1 never triggers correctly, the SL moves at the wrong times, and the runner system fails

---
### Pitfall #25: Leverage Display Bug (🟡 MEDIUM - Fixed Nov 12, 2025)
**Symptom:** Telegram notifications showing "⚡ Leverage: 10x" when actual position uses 15x
**Root Cause:** API response returning `config.leverage` (global default) instead of symbol-specific value
**Fix Applied:**
```typescript
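const { size, leverage, enabled } = getPositionSizeForSymbol(driftSymbol, config)
// Return the symbol-specific leverage in the API response (object name illustrative)
const responseFields = {
  leverage: leverage, // NOT: config.leverage (global default)
  // ... other response fields
}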
```
---
### Pitfall #26: Indicator Version Tracking (🟡 MEDIUM - Nov 12, 2025+)
**Symptom:** Unable to compare performance between TradingView strategies
**Root Cause:** No tracking of which indicator generated the signal
**Fix Applied:** Database field `indicatorVersion` tracks:
- v5: Buy/Sell Signal (pre-Nov 12)
- v6: HalfTrend + BarColor (Nov 12-18)
- v7: v6 with toggles (deprecated)
- v8: Money Line Sticky Trend (Nov 18+)
- v9: Money Line with Momentum Filter (Nov 26+)
---
### Pitfall #27: Runner Stop Loss Gap - No Protection Between TP1 and TP2 (🔴 CRITICAL - Fixed Nov 15, 2025)
**Symptom:** Runner position remained open despite price moving far past stop loss level
**Root Cause:** Position Manager only checked stop loss BEFORE TP1 (line 877), creating a protection gap
**Real Incident:**
1. SHORT opened, TP1 hit at 70% close (runner = 30% remaining)
2. Runner had stop loss at profit-lock level (+0.5%)
3. Price moved past stop loss → NO CHECK RAN (tp1Hit = true, so SL check skipped)
4. Runner exposed to unlimited loss for hours during TP1→TP2 window
**Fix Applied:**
```typescript
// Added explicit runner stop loss check at line ~881:
if (trade.tp1Hit && !trade.tp2Hit && this.shouldStopLoss(currentPrice, trade)) {
  console.log(`🔴 RUNNER STOP LOSS: ${trade.symbol}`)
  await this.executeExit(trade, 100, 'SL', currentPrice)
  return
}
```
**Lesson Learned:** Every conditional branch in risk management MUST have explicit stop loss checks - never assume "it'll get caught somewhere"

---
### Pitfall #28: External Closure Duplicate Updates Bug (🔴 CRITICAL - Fixed Nov 12, 2025)
**Symptom:** Trades showing 7-8x larger losses than actual ($58 loss when Drift shows $7 loss)
**Root Cause:** Position Manager monitoring loop re-processes external closures multiple times before trade removed from activeTrades Map
**Real Incident:**
1. Trade closed externally at -$7.98
2. Position Manager detects closure, calculates P&L → -$7.50 in DB
3. Trade still in Map (removal async), loop runs again
4. Accumulates P&L: -$7.50 + -$7.50 = -$15.00
5. Repeats 8 times → final -$58.43
**Fix Applied:**
```typescript
// BEFORE (BROKEN):
await updateTradeExit({ ... })
await this.removeTrade(trade.id) // Too late!
// AFTER (FIXED):
this.activeTrades.delete(trade.id) // Remove FIRST
await updateTradeExit({ ... }) // Then update DB
```
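Removing first works because JavaScript runs an async function synchronously up to its first `await`: once the trade is deleted from the Map before that point, a second monitoring pass that starts while the database write is pending cannot find it. A minimal, self-contained sketch; `ClosureProcessorSketch` and its in-memory write log are hypothetical stand-ins for the Position Manager and `updateTradeExit()`.

```typescript
interface TrackedTrade {
  id: string
  realizedPnL: number
}

class ClosureProcessorSketch {
  readonly activeTrades = new Map<string, TrackedTrade>()
  readonly dbWrites: string[] = [] // stands in for updateTradeExit() calls

  async handleExternalClosure(tradeId: string, closurePnL: number): Promise<boolean> {
    const trade = this.activeTrades.get(tradeId)
    if (!trade) return false // another pass already claimed this closure

    // Claim the trade synchronously, before the first await, so no other pass can see it.
    this.activeTrades.delete(tradeId)
    trade.realizedPnL += closurePnL
    await this.persistExit(trade)
    return true
  }

  private async persistExit(trade: TrackedTrade): Promise<void> {
    this.dbWrites.push(trade.id)
  }
}
```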
**Commit:** Fixed Nov 12, 2025

---
### Pitfall #29: Database-First Pattern (🔴 CRITICAL - Fixed Nov 13, 2025)
**Symptom:** Positions opened on Drift with NO database record, NO Position Manager tracking, NO TP/SL protection
**Root Cause:** Execute endpoint saved to database AFTER adding to Position Manager, with silent error catch
**Real Incident:** Unprotected position opened, database save failed silently, Position Manager never tracked it
**Fix Applied:**
```typescript
// CRITICAL: Save to database FIRST before adding to Position Manager
try {
  await createTrade({...})
} catch (dbError) {
  console.error('❌ CRITICAL: Failed to save trade to database:', dbError)
  return NextResponse.json({
    success: false,
    error: 'Database save failed - position unprotected',
    message: `CLOSE POSITION MANUALLY IMMEDIATELY. Transaction: ${openResult.transactionSignature}`,
  }, { status: 500 })
}
// ONLY add to Position Manager if database save succeeded
await positionManager.addTrade(activeTrade)
```
**Documentation:** `CRITICAL_INCIDENT_UNPROTECTED_POSITION.md`

---
### Pitfall #30: DNS Retry Logic (⚠️ HIGH - Nov 13, 2025)
**Symptom:** Trading bot fails with "fetch failed" errors when DNS resolution temporarily fails
**Root Cause:** `EAI_AGAIN` errors are transient DNS issues that resolve in seconds
**Fix Applied:** Automatic retry in `lib/drift/client.ts`:
```typescript
// Detects: fetch failed, EAI_AGAIN, ENOTFOUND, ETIMEDOUT
// Retries up to 3 times with 2s delay
await this.retryOperation(async () => {
  // Initialize Drift SDK, subscribe, get user account
}, 3, 2000, 'Drift initialization')
```
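The classification and retry loop can be sketched as follows. `isTransientNetworkError` and `retryOnTransientError` are hypothetical stand-ins for the client's `retryOperation()`, whose internals may differ; it is also assumed here that the count means total attempts and that non-transient errors should fail immediately.

```typescript
// Error signatures listed above as transient DNS/network failures.
const TRANSIENT_ERROR_MARKERS = ['fetch failed', 'EAI_AGAIN', 'ENOTFOUND', 'ETIMEDOUT']

function isTransientNetworkError(err: unknown): boolean {
  const code = (err as { code?: unknown } | null | undefined)?.code
  const message = err instanceof Error ? err.message : String(err)
  const text = typeof code === 'string' ? `${message} ${code}` : message
  return TRANSIENT_ERROR_MARKERS.some((marker) => text.includes(marker))
}

// Hypothetical fixed-delay retry in the spirit of retryOperation(fn, 3, 2000, label).
async function retryOnTransientError<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 3,
  delayMs: number = 2_000,
  label: string = 'operation',
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let attempt = 0
  while (true) {
    attempt++
    try {
      return await operation()
    } catch (err) {
      // Non-transient errors and the final failed attempt propagate immediately.
      if (!isTransientNetworkError(err) || attempt >= maxAttempts) throw err
      console.warn(`${label} failed (attempt ${attempt}/${maxAttempts}), retrying in ${delayMs}ms`)
      await sleep(delayMs)
    }
  }
}
```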
**Documentation:** `docs/DNS_RETRY_LOGIC.md`

---
### Pitfall #31: Declaring Fixes "Working" Before Deployment (🔴 CRITICAL - Nov 13, 2025)
**Symptom:** AI says "position is protected" when container still running old code
**Root Cause:** Conflating "code committed to git" with "code running in production"
**Real Incident:** Fix committed 15:56, declared "working" at 19:42, but container started 15:06 (old code)
**Verification Required:**
```bash
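# ALWAYS check before declaring a fix deployed:
docker logs trading-bot-v4 | grep "Server starting" | head -1
# Or read the container start time (UTC) and the fix's commit time directly:
docker inspect -f '{{.State.StartedAt}}' trading-bot-v4
git log -1 --format=%cI <fix-commit>
# Compare container start time to git commit timestamp
# If the container started BEFORE the commit: FIX NOT DEPLOYED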
```
**Rule:** NEVER say "fixed", "working", "protected", or "deployed" without verifying container restart timestamp
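The comparison is trivial but easy to get backwards. Below is a small hypothetical helper, fed with the times from the incident above. Both times are assumed to be in the same timezone for this example.

```typescript
// True only when the container (re)started after the fix was committed.
// Both arguments are ISO 8601 timestamps with an explicit offset or "Z".
function isFixDeployed(containerStartedAt: string, commitTime: string): boolean {
  const started = Date.parse(containerStartedAt)
  const committed = Date.parse(commitTime)
  if (Number.isNaN(started) || Number.isNaN(committed)) {
    throw new Error('Unparseable timestamp - cannot confirm deployment')
  }
  return started > committed
}

// Nov 13 incident: container started 15:06, fix committed 15:56 (same timezone assumed)
console.log(isFixDeployed('2025-11-13T15:06:00+01:00', '2025-11-13T15:56:00+01:00')) // false
```

A start time later than the commit is necessary but not sufficient. For code baked into the image, the image must also have been rebuilt from that commit before the restart.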
---
### Pitfall #32: Phantom Trade Notification Workflow Breaks (🔴 CRITICAL - Nov 14, 2025)
**Symptom:** Phantom trade detected, position opened, but n8n workflow stops. User NOT notified.
**Root Cause:** Execute endpoint returned HTTP 500 when phantom detected, causing n8n chain to halt
**Fix Applied:** Auto-close phantom trades immediately + return HTTP 200 with warning:
```typescript
return NextResponse.json({
success: true,
warning: 'Phantom trade detected and auto-closed',
isPhantom: true,
message: '[Full notification text]',
phantomDetails: {...}
})
```
**Database tracking:** `status='phantom'`, `exitReason='manual'`
---
### Pitfall #33: Wrong Entry Price After Orphaned Position Restoration (🔴 CRITICAL - Fixed Nov 15, 2025)
**Symptom:** Position Manager tracking wrong entry price after container restart
**Root Cause:** Startup validation restored orphaned position using OLD database entry price instead of querying Drift
**Real Incident:** DB showed $141.51, Drift showed $141.31 actual entry → 0.14% SL placement error
**Fix Applied:** Query Drift SDK for actual entry price during orphaned position restoration:
```typescript
await prisma.trade.update({
  where: { id: trade.id }, // the orphaned trade being restored
  data: {
    entryPrice: position.entryPrice, // CRITICAL: Use Drift's actual entry price
    positionSizeUSD: positionSizeUSD,
  }
})
```
---
### Pitfall #35: Phantom Trades Need exitReason (🔴 CRITICAL - Fixed Nov 15, 2025)
**Symptom:** Position Manager keeps restoring phantom trade on every restart
**Root Cause:** Phantom auto-closure sets `status='phantom'` but leaves `exitReason=NULL`
**Real Incident:** Phantom trade caused 232% size mismatch, hundreds of false alerts
**Fix Applied:** MUST set exitReason when auto-closing phantoms:
```typescript
await updateTradeExit({
tradeId: trade.id,
exitPrice: currentPrice,
exitReason: 'manual', // CRITICAL: Must set exitReason for cleanup
status: 'phantom'
})
```
---
### Pitfall #36: closePosition() Missing Retry Logic (🔴 CRITICAL - Fixed Nov 15, 2025)
**Symptom:** Position Manager tries to close, gets 429 error, retries EVERY 2 SECONDS → 100+ failed attempts
**Root Cause:** `placeExitOrders()` had retry wrapper but `closePosition()` did NOT
**Real Incident:** 100+ "❌ Failed to close position: 429" log lines, plus compounding P&L
**Fix Applied:** Wrapped closePosition() with retryWithBackoff():
```typescript
const txSig = await retryWithBackoff(async () => {
return await driftClient.placePerpOrder(orderParams)
}, 3, 8000) // 8s base delay, 3 max retries (8s → 16s → 32s)
```
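Below is a sketch of exponential backoff that reproduces the 8s → 16s → 32s schedule from the comment above. Two parts of it are assumptions: that only HTTP 429 responses should be retried, and that they surface as an error whose message contains `429`. The `sleep` parameter is a hypothetical addition so the schedule can be checked without waiting a minute.

```typescript
type Sleep = (ms: number) => Promise<void>
const realSleep: Sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Assumption: rate-limit failures carry "429" in the error message
function isRateLimitError(error: unknown): boolean {
  return String(error instanceof Error ? error.message : error).includes('429')
}

// Retries only on 429s, doubling the wait each time: base, 2×base, 4×base, ...
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 8000,
  sleep: Sleep = realSleep,
): Promise<T> {
  for (let retry = 0; ; retry++) {
    try {
      return await operation()
    } catch (error) {
      if (!isRateLimitError(error) || retry >= maxRetries) throw error
      const delayMs = baseDelayMs * Math.pow(2, retry)
      console.warn(`⏳ Rate limited, retry ${retry + 1}/${maxRetries} in ${delayMs / 1000}s`)
      await sleep(delayMs)
    }
  }
}
```

Doubling the delay gives the RPC's rate-limit window time to reset, instead of hitting it again every two seconds as in the incident.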
---
### Pitfall #37: Ghost Position Accumulation (🔴 CRITICAL - Fixed Nov 15, 2025)
**Symptom:** Position Manager tracking 4+ positions when database shows only 1 open trade
**Root Cause:** Database has `exitReason IS NULL` for positions actually closed on Drift
**Real Incident:** 4+ ghosts → massive rate limiting, "vanishing orders"
**Fix Applied:** Periodic Drift position validation:
```typescript
private scheduleValidation(): void {
this.validationInterval = setInterval(async () => {
await this.validatePositions()
}, 5 * 60 * 1000)
}
```
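At its core, `validatePositions()` is a set difference between what the Position Manager tracks and what Drift reports. The sketch below uses hypothetical minimal types and treats sizes under 0.01 as closed, mirroring the `Math.abs(position.size) >= 0.01` check in Pitfall #47. It is not the bot's actual implementation.

```typescript
// Minimal shapes for this sketch (hypothetical, not the bot's real types)
interface TrackedTrade { id: string; marketIndex: number }
interface ChainPosition { marketIndex: number; size: number }

// A tracked trade is a ghost when Drift has no open position in its market.
// Sizes below dustThreshold count as closed.
function findGhostTrades(
  tracked: TrackedTrade[],
  onChain: ChainPosition[],
  dustThreshold = 0.01,
): TrackedTrade[] {
  const openMarkets = new Set(
    onChain.filter(p => Math.abs(p.size) >= dustThreshold).map(p => p.marketIndex),
  )
  return tracked.filter(t => !openMarkets.has(t.marketIndex))
}
```

Drift nets perp exposure into one position per market per subaccount, so this version keys on market index. If several tracked trades share one market, it cannot tell which of them is real, and those need an extra tiebreak (for example against the database).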
---
### Pitfall #38: Analytics Dashboard Wrong Size (🟡 MEDIUM - Fixed Nov 15, 2025)
**Symptom:** Analytics page displays $42.54 when actual runner is $12.59 after TP1
**Root Cause:** API returns `trade.positionSizeUSD` (original) not runner size
**Fix Applied:** Check Position Manager state for open positions:
```typescript
const currentSize = configSnapshot?.positionManagerState?.currentSize
const displaySize = trade.exitReason === null && currentSize
? currentSize
: trade.positionSizeUSD
```
---
### Pitfall #40: Ghost Position Death Spiral (🔴 CRITICAL - Fixed Nov 15-16, 2025)
**Symptom:** Container crashes from cascading ghost detection failures
**Root Cause:** Position validation skipped during death spiral recovery, creating more ghosts
**Fix Applied:** Never skip validation during recovery operations
---
### Pitfall #41: Stats API Recalculating P&L Incorrectly (🔴 CRITICAL - Fixed Nov 19, 2025)
**Symptom:** Analytics showing wrong P&L for trades with TP1+runner
**Root Cause:** Stats API recalculating P&L from partial position data
**Fix Applied:** Use stored `realizedPnL` directly, don't recalculate
---
### Pitfall #43: Runner Trailing Stop Never Activates (🔴 CRITICAL - Fixed Nov 20, 2025)
**Symptom:** Runner position sits without trailing stop after TP1
**Root Cause:** Trailing stop activation logic only ran in one code path
**Fix Applied:** Ensure trailing stop activates in all TP1 detection paths
---
### Pitfall #44: Telegram Bot DNS Resolution (⚠️ HIGH - Fixed Nov 16, 2025)
**Symptom:** Telegram notifications fail intermittently
**Root Cause:** DNS resolution failures for api.telegram.org
**Fix Applied:** Retry logic for Telegram API calls
---
### Pitfall #45: Drift SDK position.entryPrice Recalculates (🔴 CRITICAL - Fixed Nov 16, 2025)
**Symptom:** Entry price changes after partial closes
**Root Cause:** Drift SDK calculates `position.entryPrice` from `quoteAssetAmount / baseAssetAmount`
**Impact:** After TP1 closes 75%, remaining 25% has "new" entry price
**Fix Applied:** Store and use original entry price from trade record, not SDK
---
### Pitfall #46: 100% Position Sizing InsufficientCollateral (🔴 CRITICAL - Fixed Nov 16, 2025)
**Symptom:** Bot gets InsufficientCollateral errors when Drift UI can open same size
**Root Cause:** Drift's margin calculation includes fees and slippage buffers
**Real Incident:** $85.55 collateral, bot tries 100% → rejected, shortage: $0.03
**Fix Applied:**
```typescript
if (configuredSize >= 100) {
percentDecimal = 0.99
console.log(`⚠️ Applying 99% safety buffer for 100% position`)
}
```
**Commit:** 7129cbf
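The same rule can be written as a pure function; the names here are hypothetical. With the incident's numbers, 99% of $85.55 is about $84.69. That leaves roughly $0.86 of headroom against the $0.03 shortfall that caused the rejection.

```typescript
// Fraction of free collateral to commit, given a configured size in percent.
// Anything at or above 100% is capped at 99% to leave room for Drift's fees
// and slippage buffers.
function collateralFraction(configuredSizePercent: number): number {
  return configuredSizePercent >= 100 ? 0.99 : configuredSizePercent / 100
}

function collateralToUseUSD(freeCollateralUSD: number, configuredSizePercent: number): number {
  return freeCollateralUSD * collateralFraction(configuredSizePercent)
}
```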
---
### Pitfall #47: Position Close Verification Gap (🔴 CRITICAL - Fixed Nov 16, 2025)
**Symptom:** Close transaction confirmed, database marked "closed", but position stayed open 6+ hours
**Root Cause:** Transaction confirmation ≠ Drift internal state updated immediately (5-10s delay)
**Real Incident:** Trailing stop triggered 02:51, position stayed open until 08:51 restart
**Fix Applied:** 2-layer verification:
```typescript
if (params.percentToClose === 100) {
await cancelAllOrders(params.symbol)
console.log('⏳ Waiting 5s for Drift state to propagate...')
await new Promise(resolve => setTimeout(resolve, 5000))
const verifyPosition = await driftService.getPosition(marketConfig.driftMarketIndex)
if (verifyPosition && Math.abs(verifyPosition.size) >= 0.01) {
console.error(`🔴 CRITICAL: Close confirmed BUT position still exists!`)
return { ...result, needsVerification: true }
}
}
```
**Commit:** c607a66
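The fixed 5s sleep only covers the low end of the 5-10s propagation window. One alternative is polling: re-read the position at an interval until it is gone or a deadline passes. This is a different technique from what commit c607a66 does. `fetchPosition`, the default timings, and the injectable `sleep` are all hypothetical.

```typescript
interface PositionSnapshot { size: number }
type Sleep = (ms: number) => Promise<void>

const defaultSleep: Sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Polls until the position is gone (or below dust), or until timeoutMs elapses.
// Returns true if the close was verified, false if the position is still open.
async function waitForPositionClosed(
  fetchPosition: () => Promise<PositionSnapshot | null>,
  timeoutMs = 10000,
  intervalMs = 1000,
  dustThreshold = 0.01,
  sleep: Sleep = defaultSleep,
): Promise<boolean> {
  const maxChecks = Math.max(1, Math.ceil(timeoutMs / intervalMs))
  for (let check = 1; check <= maxChecks; check++) {
    const position = await fetchPosition()
    if (!position || Math.abs(position.size) < dustThreshold) return true
    if (check < maxChecks) await sleep(intervalMs)
  }
  return false // still open: caller should flag needsVerification
}
```

A `false` result after the full window is the situation described in Pitfall #89: a remnant that never disappears, which no amount of waiting will fix.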
---
### Pitfall #48: P&L Compounding During Close Verification (🔴 CRITICAL - Fixed Nov 16, 2025)
**Symptom:** P&L accumulates during the 5-10s verification wait
**Root Cause:** Monitoring loop continues during verification, detecting "external closure" multiple times
**Fix Applied:** `closingInProgress` flag:
```typescript
if ((result as any).needsVerification) {
trade.closingInProgress = true
trade.closeConfirmedAt = Date.now()
console.log(`🔒 Marked as closing in progress - external closure detection disabled`)
return
}
// Skip external closure check if closingInProgress
if ((position === null || position.size === 0) && !trade.closingInProgress) {
// ... handle external closure
}
```
**Related:** Pitfalls #27, #49
---
### Pitfall #49: P&L Exponential Compounding in External Closure Detection (🔴 CRITICAL - Fixed Nov 17, 2025)
**Symptom:** Database P&L shows 15-20× actual value ($92.46 when Drift shows $6.00)
**Root Cause:** `trade.realizedPnL` was being mutated during each external closure detection cycle
**Real Incident (Nov 17, 13:54 CET):**
- SOL-PERP SHORT closed by on-chain orders
- Actual P&L: ~$6.00, Database recorded: $92.46 (15.4× too high)
- Rate limiting caused 15+ detection cycles → $6 → $12 → $24 → $48 → $96
**Fix Applied:**
```typescript
// DON'T mutate trade.realizedPnL - causes compounding!
// trade.realizedPnL = totalRealizedPnL ← REMOVED
// Use local variable for DB update
await updateTradeExit({
realizedPnL: totalRealizedPnL, // Use local variable
})
```
**Commit:** 6156c0f
**Lesson Learned:** In monitoring loops, NEVER mutate shared state during calculation phases. Calculate locally, update shared state ONCE at the end.
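The model below is deliberately simplified, and its names are hypothetical. `realizedPnL` stands in for P&L already banked, and each call represents one external-closure detection cycle. Writing the running total back into shared state makes the result grow with every repeated cycle, while the local-variable version stays stable. This model produces linear growth; the exact growth in the real incident depended on the actual formula.

```typescript
interface MonitoredTrade {
  realizedPnL: number // P&L already banked before the external closure
}

// BROKEN: the total is written back, so the next cycle starts from it
function detectClosureMutating(trade: MonitoredTrade, runnerPnL: number): number {
  const totalRealizedPnL = trade.realizedPnL + runnerPnL
  trade.realizedPnL = totalRealizedPnL
  return totalRealizedPnL
}

// FIXED: the total lives in a local; shared state is read, never written
function detectClosureLocal(trade: MonitoredTrade, runnerPnL: number): number {
  const totalRealizedPnL = trade.realizedPnL + runnerPnL
  return totalRealizedPnL
}

// Simulate repeated detection cycles (e.g. caused by rate limiting)
function simulateCycles(detect: (t: MonitoredTrade, pnl: number) => number, cycles: number): number {
  const trade: MonitoredTrade = { realizedPnL: 0 }
  let last = 0
  for (let i = 0; i < cycles; i++) last = detect(trade, 6)
  return last
}

console.log(simulateCycles(detectClosureMutating, 4)) // 24
console.log(simulateCycles(detectClosureLocal, 4)) // 6
```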
---
### Pitfall #50: Database Not Tracking Trades (🔴 CRITICAL - RESOLVED Nov 19, 2025)
**Symptom:** Drift UI shows 6 trades, database shows only 3 trades
**Root Cause:** P&L compounding bug (#49) - in-memory object with stale/accumulated values
**Fix Applied:** Calculate P&L from immutable source values (entry/exit prices), never from in-memory fields
---
### Pitfall #51: TP1 Detection Fails When On-Chain Orders Fill Fast (🔴 CRITICAL - Fixed Nov 19, 2025)
**Symptom:** TP1 order fills, but database records exitReason as "SL" instead of "TP1"
**Root Cause:** Position Manager detects closure AFTER both TP1 and runner already closed on-chain
**Real Incident:** LONG opened, TP1+runner closed within 7 minutes, `trade.tp1Hit = false`
**Fix Applied:** Simple percentage-based exit reason:
```typescript
if (runnerProfitPercent > 0.3) {
if (runnerProfitPercent >= 1.2) {
exitReason = 'TP2' // Large profit (>1.2%)
} else {
exitReason = 'TP1' // Moderate profit (0.3-1.2%)
}
} else {
exitReason = 'SL' // Negative or tiny profit (<0.3%)
}
```
**Commit:** de57c96
---
### Pitfall #52: ADX-Based Runner SL Only Applied in One Code Path (🔴 CRITICAL - Fixed Nov 19, 2025)
**Symptom:** TP1 fills via on-chain order, runner gets breakeven SL instead of ADX-based positioning
**Root Cause:** Two TP1 detection paths, only one had ADX logic
**Fix Applied:** Added ADX-based runner SL to on-chain fill detection path (lines 607-642)
**Commits:** b2cb6a3, 66b2922
---
### Pitfall #53: Container Restart Kills Positions + Phantom Detection Bug (🔴 CRITICAL - Fixed Nov 19, 2025)
**Two bugs from container restart:**
**Bug 1: Startup order restore failure**
- Wrong database field names (`takeProfit1OrderTx` vs correct `tp1OrderTx`)
- Fix: Use correct field names
**Bug 2: Phantom detection killing runners**
- Runners (40% remaining) flagged as phantom
- Fix: Check `!trade.tp1Hit` before phantom detection:
```typescript
const wasPhantom = !trade.tp1Hit && trade.currentSize > 0 && (trade.currentSize / trade.positionSize) < 0.5
```
**Commit:** eccecf7
---
### Pitfall #54: MFE/MAE Storing Dollars Instead of Percentages (🔴 CRITICAL - Fixed Nov 23, 2025)
**Symptom:** Database showing maxFavorableExcursion = 64.08% when TradingView showed 0.48%
**Root Cause:** Position Manager storing DOLLAR amounts instead of PERCENTAGES
**Real Incident:** 133× inflation (64.08% stored vs 0.48% actual)
**Fix Applied:**
```typescript
// BEFORE (BROKEN):
if (currentPnLDollars > trade.maxFavorableExcursion) {
  trade.maxFavorableExcursion = currentPnLDollars // Storing $64.08
}
// AFTER (FIXED):
if (profitPercent > trade.maxFavorableExcursion) {
  trade.maxFavorableExcursion = profitPercent // Storing 0.48%
}
```
**Commit:** 6255662
**Lesson Learned:** Always verify data storage units match schema expectations. Comments don't override schema.
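Below is a sketch of computing the value in the unit the schema expects. It assumes `maxFavorableExcursion` stores percent of entry price and that direction is encoded as `'LONG' | 'SHORT'`, as in the leverage code in Pitfall #69. The sign is chosen so a favorable move is positive for both directions.

```typescript
type Direction = 'LONG' | 'SHORT'

// Price move in PERCENT of entry, signed so that profit is positive for both directions
function profitPercent(entryPrice: number, currentPrice: number, direction: Direction): number {
  const movePercent = ((currentPrice - entryPrice) / entryPrice) * 100
  return direction === 'LONG' ? movePercent : -movePercent
}

// Maximum favorable excursion tracked in percent, never dollars
function updateMaxFavorableExcursion(trade: { maxFavorableExcursion: number }, pct: number): void {
  if (pct > trade.maxFavorableExcursion) {
    trade.maxFavorableExcursion = pct
  }
}

// Example: the SHORT from Pitfall #89 (entry $126.90, price $128.13) is about -0.97% (adverse)
```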
---
### Pitfall #55: Configuration Issues (🔴 CRITICAL - Fixed Nov 19-20, 2025)
**Two configuration bugs:**
**Bug 1: Settings UI quality score variable name mismatch**
- Settings API used `MIN_QUALITY_SCORE` (wrong)
- Code actually reads `MIN_SIGNAL_QUALITY_SCORE` (correct)
- User changes in UI had ZERO effect
**Bug 2: BlockedSignalTracker using Pyth cache instead of Drift oracle**
- `priceAfter1Min/5Min/15Min/30Min` fields staying NULL
- Fix: Use `driftService.getOraclePrice()` instead of `getPythPriceMonitor().getCachedPrice()`
**Commit:** 6b00303
---
### Pitfall #56: Ghost Orders After External Closures (🔴 CRITICAL - Fixed Nov 20-21, 2025)
**Symptom:** Position closed, but TP/SL orders remain active on Drift
**Root Cause:** External closure handler didn't call `cancelAllOrders()` before completing
**Real Incident:** Risk of ghost order filling → unintended positions
**Fix Applied:**
```typescript
// In external closure handler:
console.log(`🗑️ Cancelling remaining orders for ${trade.symbol}...`)
const cancelResult = await cancelAllOrders(trade.symbol)
```
**Additional Bug:** False positive "32 open orders" on restart
- Fix: Skip empty order slots, where `baseAssetAmount.eq(new BN(0))`, so only truly active orders are counted
**Commits:** a3a6222 (Nov 20), 29fce01 (Nov 21)
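The "32 open orders" figure is consistent with Drift user accounts holding a fixed array of order slots. Unused slots are still present in that array, with a zero base amount. The sketch below shows the filter with a hypothetical `OrderSlot` type. It uses a plain `number` where the SDK returns a `BN`; with the SDK, the check is `baseAssetAmount.eq(new BN(0))` as noted above.

```typescript
interface OrderSlot {
  orderId: number
  baseAssetAmount: number // stand-in for the SDK's BN
}

// Keep only slots that hold a real order; empty slots have a zero base amount
function activeOrders(slots: OrderSlot[]): OrderSlot[] {
  return slots.filter(slot => slot.baseAssetAmount !== 0)
}
```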
---
### Pitfall #57: P&L Calculation Inaccuracy for External Closures (🔴 CRITICAL - Fixed Nov 20, 2025)
**Symptom:** Database P&L shows -$101.68 when Drift UI shows -$138.35 (36% error)
**Root Cause:** External closure handler calculates P&L from monitoring loop's `currentPrice`, which lags behind actual fill price
**Fix Applied:** Query Drift's actual settledPnL:
```typescript
const position = userAccount.perpPositions.find((p: any) =>
  p.marketIndex === marketConfig.driftMarketIndex
)
// settledPnl is in quote precision (1e6); guard against a missing position
const settledPnL = Number(position?.settledPnl ?? 0) / 1e6 // Convert to USD
if (Math.abs(settledPnL) > 0.01) {
  totalRealizedPnL = settledPnL
  console.log(`✅ Using Drift's actual P&L: $${totalRealizedPnL.toFixed(2)}`)
}
```
**Commit:** 8e600c8
---
### Pitfall #58: 5-Layer Database Protection System (⚠️ HIGH - Implemented Nov 21, 2025)
**Purpose:** Bulletproof protection against untracked positions from database failures
**5 Layers:**
1. **Persistent File Logger** (`lib/utils/persistent-logger.ts`) - Survives container restarts
2. **Database Save with Retry + Verification** - 3 retries with exponential backoff
3. **Orphan Position Detection** - Runs on EVERY container startup
4. **Critical Logging in Execute Endpoint** - Full trade details for recovery
5. **Infrastructure (Docker volumes)** - `./logs:/app/logs`
**Real-world status:** As of Nov 21, 2025 no database failure had occurred, so the layers are in place but have not yet been exercised in production
---
### Pitfall #59: Layer 2 Ghost Detection Causing Duplicate Telegram Notifications (🔴 CRITICAL - Fixed Nov 22, 2025)
**Symptom:** Trade #8 sent 13 duplicate notifications with compounding P&L ($11.50 → $155.05)
**Root Cause:** Layer 2 ghost detection (failureCount > 20) didn't check `closingInProgress` flag
**Real Incident (Nov 22, 04:05 CET):**
- Actual P&L: +$18.79, Database final: $155.05 (8.2× actual)
- Rate limit storm: 6,581 failed close attempts
**Fix Applied:**
```typescript
// AFTER (FIXED):
if (trade.priceCheckCount > 20 && !trade.closingInProgress) {
if (!position || Math.abs(position.size) < 0.01) {
trade.closingInProgress = true
trade.closeConfirmedAt = Date.now()
await this.handleExternalClosure(trade, 'Layer 2: Ghost detected')
return
}
}
```
**Commit:** b19f156
---
### Pitfall #60: Stale Array Snapshot in Monitoring Loop (🔴 CRITICAL - Fixed Nov 23, 2025)
**Symptom:** Manual closure sends duplicate "POSITION CLOSED" Telegram notifications
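**Root Cause:** Position Manager iterates over an array snapshot of active trades taken before async processing, so a trade removed mid-loop (e.g. by a manual close) is still processed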
**Real Incident:** Two identical notifications for cmibdii4k0004pe07nzfmturo
**Fix Applied:**
```typescript
private async checkTradeConditions(trade: ActiveTrade, currentPrice: number): Promise<void> {
// CRITICAL FIX: Check if trade still in monitoring
if (!this.activeTrades.has(trade.id)) {
console.log(`⏭️ Skipping ${trade.symbol} - already removed from monitoring`)
return
}
// ... rest of function
}
```
**Commit:** a7c5930
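The snapshot problem can be reproduced in a few lines; the names are hypothetical. The loop iterates a copy of the keys taken before any `await`. A trade removed by an earlier handler is therefore still in the copy, and the `has()` re-check is what stops it from being processed.

```typescript
// One monitoring pass over a snapshot of the active trades
async function monitorPass(
  activeTrades: Map<string, unknown>,
  handle: (id: string) => Promise<void>,
): Promise<string[]> {
  const processed: string[] = []
  const snapshot = Array.from(activeTrades.keys()) // taken once, goes stale
  for (const id of snapshot) {
    if (!activeTrades.has(id)) continue // removed while an earlier trade was handled
    await handle(id)
    processed.push(id)
  }
  return processed
}
```

The re-check narrows the window but does not close it: a trade can still be removed after the check, while `handle()` is running. Pitfall #67 closes that gap with an atomic `Map.delete()` claim.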
---
### Pitfall #61: P&L Compounding STILL Happening Despite All Guards (🔴 CRITICAL - Under Investigation Nov 24, 2025)
**Symptom:** Trade showed $974.05 P&L when actual was $72.41 (13.4× inflation)
**Evidence:** 14 duplicate Telegram notifications with compounding P&L
**Status:** All existing guards in place, yet duplicates still occurred
**Interim Fix:** Manual P&L correction, container restart with enhanced closingInProgress flag
**Investigation Needed:**
- Serialization lock around external closure detection
- Unique transaction ID to prevent duplicate DB updates
- Telegram notification deduplication
**Commit:** 0466295
---
### Pitfall #62: Adaptive Leverage and Quality Bypass (🔴 CRITICAL - Fixed Nov 24-27, 2025)
**Two related bugs:**
**Bug 1: Adaptive leverage not working (Nov 24)**
- `USE_ADAPTIVE_LEVERAGE` ENV variable not set in .env
- Quality 90 trade used 15x instead of intended 10x
**Bug 2: Execute endpoint bypassing quality threshold (Nov 27)**
- Bot executed trades at quality 30, 50, 50 when minimum is 90/95
- Execute endpoint calculated quality but never validated it
**Fix Applied (Nov 27):**
```typescript
if (qualityResult.score < minQualityScore) {
console.log(`❌ QUALITY TOO LOW: ${qualityResult.score} < ${minQualityScore} threshold`)
return NextResponse.json({
success: false,
error: 'Quality score too low',
}, { status: 400 })
}
console.log(`✅ Quality check passed: ${qualityResult.score} >= ${minQualityScore}`)
```
**Commit:** cefa3e6
---
### Pitfall #63: Smart Entry Validation System (⚠️ HIGH - Deployed Nov 30, 2025)
**Purpose:** Recover profits from marginal quality signals (50-89)
**Implementation:** `lib/trading/smart-validation-queue.ts` (330+ lines)
**Threshold Results (Dec 1, 2025):**
- **±0.3%:** 28/200 entries (14%), 67.9% WR, +4.73% total ✅
- ±0.2%: 51/200 entries (26%), 43.1% WR, -18.49% total
- ±0.15%: 73/200 entries (36%), 35.6% WR, -38.27% total
**Commit:** 7c9cfba
---
### Pitfall #64: EPYC Cluster SSH Timeout (🔴 CRITICAL - Fixed Dec 1, 2025)
**Symptom:** Coordinator reports "SSH command timed out for v9_chunk_000002 on worker1"
**Root Cause:** 30-second subprocess timeout insufficient for nested SSH hop (master → worker1 → worker2)
**Fix Applied:**
```python
ssh_opts = "-o StrictHostKeyChecking=no -o ConnectTimeout=10 -o ServerAliveInterval=5"
result = subprocess.run(ssh_cmd, timeout=60) # Increased from 30s to 60s
```
**Commit:** ef371a1
**Lesson Learned:** Nested SSH hops need 2× minimum timeout. Latency compounds at each hop.
---
### Pitfall #65: Distributed Worker Quality Filter - Dict vs Callable (🔴 CRITICAL - Fixed Dec 1, 2025)
**Symptom:** ALL 2,096 distributed backtests returned 0 trades
**Root Cause:** Passed dict `{'min_adx': 15, 'min_volume_ratio': vol_min}` instead of lambda function
**Error:** `'dict' object is not callable`
**Fix Applied:**
```python
# BEFORE (BROKEN):
quality_filter = {'min_adx': 15, 'min_volume_ratio': vol_min}
# AFTER (FIXED):
if vol_min > 0:
quality_filter = lambda s: s.adx >= 15 and s.volume_ratio >= vol_min
else:
quality_filter = None
```
**Commit:** 11a0ea3
**Lesson Learned:** Silent failures are more dangerous than crashes. The exception handler hid the severity by returning zeros.
---
### Pitfall #66: Smart Entry Wrong Price Display (🔴 CRITICAL - Fixed Dec 1, 2025)
**Symptom:** Abandonment notifications showing impossible prices ($126 → $98 = -22% in 30 seconds)
**Root Cause:** Symbol format mismatch between validation queue ("SOLUSDT") and market data cache ("SOL-PERP")
**Real Incident:** Cache lookup `marketDataCache.get("SOLUSDT")` returned null
**Fix Applied:**
```typescript
// Normalize symbol before validation queue
const normalizedSymbol = normalizeTradingViewSymbol(body.symbol)
const queued = await validationQueue.addSignal({
symbol: normalizedSymbol, // Use normalized format for cache lookup
// ...
})
```
**Commit:** 6cec2e8
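The hypothetical normalizer below illustrates the idea. It is not the project's `normalizeTradingViewSymbol()`. The accepted inputs (`SOLUSDT`, `SOLUSD`, an optional `EXCHANGE:` prefix, and an already-normalized `SOL-PERP`) are assumptions. Because the function is idempotent, normalizing twice at different boundaries is harmless.

```typescript
// Map a TradingView ticker to the Drift market name used as the cache key
function toDriftSymbol(tvSymbol: string): string {
  const upper = tvSymbol.trim().toUpperCase()
  if (upper.endsWith('-PERP')) return upper // already normalized: idempotent
  const ticker = upper.includes(':') ? upper.slice(upper.indexOf(':') + 1) : upper
  const base = ticker.replace(/(USDT|USDC|USD)$/, '')
  return `${base}-PERP`
}

// The cache is keyed by Drift names, so lookups only hit after normalizing
const marketDataCache = new Map<string, number>([['SOL-PERP', 142.5]])
console.log(marketDataCache.get('SOLUSDT')) // undefined
console.log(marketDataCache.get(toDriftSymbol('SOLUSDT'))) // 142.5
```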
---
### Pitfall #67: Ghost Detection Race Condition (🔴 CRITICAL - Fixed Dec 2, 2025)
**Symptom:** 23 duplicate "POSITION CLOSED" notifications with P&L compounding (-$47.96 to -$1,129.24)
**Root Cause:** Race condition in ghost detection: the `Map.has()` check was not an atomic claim, so multiple concurrent callers could pass it before any of them removed the trade
**Real Incident (Dec 2, 17:20 CET):**
- Expected P&L: ~-$48
- Actual: 23 notifications with compounding P&L
**Fix Applied:** Use Map.delete() atomic return value as deduplication lock:
```typescript
// FIXED CODE:
async handleExternalClosure(trade: ActiveTrade, reason: string) {
const tradeId = trade.id
// ✅ Delete IMMEDIATELY - atomic operation
if (!this.activeTrades.delete(tradeId)) {
console.log('DUPLICATE PREVENTED (atomic lock)')
return
}
// ONLY first caller reaches here
// ... rest of cleanup
}
```
**Commit:** 93dd950
**Lesson Learned:** When async handler can be called by multiple code paths simultaneously, use atomic operations (like Map.delete()) as locks at function entry.
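The race can be reproduced in a few lines. JavaScript runs each synchronous segment to completion, so concurrent callers only interleave at `await` points. In the broken version, every caller passes `has()` before any of them reaches `delete()`. In the fixed version, `delete()` runs before the first `await`, so only one caller ever sees `true`. The function names and the simulated latency are hypothetical.

```typescript
type Trades = Map<string, { symbol: string }>
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))

// BROKEN: check-then-act split by an await
async function closeWithHasCheck(trades: Trades, id: string, notify: (msg: string) => void): Promise<void> {
  if (!trades.has(id)) return
  await sleep(5) // simulated DB write: every concurrent caller is already past has()
  trades.delete(id)
  notify(`POSITION CLOSED ${id}`)
}

// FIXED: claim the trade atomically before the first await
async function closeWithDeleteLock(trades: Trades, id: string, notify: (msg: string) => void): Promise<void> {
  if (!trades.delete(id)) return // only the first caller gets true
  await sleep(5)
  notify(`POSITION CLOSED ${id}`)
}
```

The lock only holds within one Node.js process; it does not deduplicate across multiple containers or workers.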
---
### Pitfall #68: Smart Entry Using Webhook Percentage as Signal Price (🔴 CRITICAL - Fixed Dec 3, 2025)
**Symptom:** $89 position sizes, 97% pullback calculations, impossible entry conditions
**Root Cause:** TradingView webhook `signal.price` contained percentage (70.80) instead of market price ($142.50)
**Real Incident:** Smart Entry log showed "97.4% pullback required" (impossible)
**Fix Applied:**
```typescript
// Use Pyth current price instead of webhook signal price
const pythPrice = await pythClient.getPrice(symbol)
const signalPrice = pythPrice.price // ✅ Use actual market price
```
**Commit:** 7d0d38a
**Lesson Learned:** Never trust webhook data for calculations. Use authoritative price sources (Pyth, Drift).
---
### Pitfall #69: Direction-Specific Leverage Thresholds Not Explicit (🟡 MEDIUM - Fixed Dec 3, 2025)
**Symptom:** Leverage code checked quality score without explicit direction context
**Root Cause:** Code pattern was ambiguous about which direction's threshold applied
**Fix Applied:** Made direction-specific thresholds explicit:
```typescript
if (body.direction === 'LONG') {
if (qualityResult.score >= 90) leverage = 5
// ...
} else { // SHORT
if (qualityResult.score >= 90) leverage = 5 // Same as LONG but explicit
// ...
}
```
**Commit:** 58f812f
---
### Pitfall #70: Smart Validation Queue Rejected by Execute Endpoint (🔴 CRITICAL - Fixed Dec 3, 2025)
**Symptom:** Quality 50-89 signals validated by queue get rejected with "Quality score too low"
**Root Cause:** Execute endpoint applies quality threshold check AFTER validation queue confirmed price action
**Fix Applied:**
```typescript
const isValidatedEntry = body.validatedEntry === true
if (isValidatedEntry) {
console.log(`✅ VALIDATED ENTRY BYPASS: Quality ${qualityResult.score} accepted`)
}
// Only apply quality threshold if NOT a validated entry
if (!isValidatedEntry && qualityResult.score < minQualityScore) {
return NextResponse.json({ error: 'Quality too low' }, { status: 400 })
}
```
**Commit:** 785b09e
---
### Pitfall #71: Revenge System Missing External Closure Integration (🔴 CRITICAL - Fixed Dec 3, 2025)
**Symptom:** High-quality signals (85+) stopped by external closures don't trigger revenge window
**Root Cause:** Revenge eligibility check only existed in executeExit() path, not handleExternalClosure()
**Real Incident (Nov 20):** Quality 90 SHORT at $141.37, stopped at $142.48 (-$138.35), price dropped to $131.32 (+$490 opportunity missed)
**Fix Applied:**
```typescript
// In external closure handler:
if (exitReason === 'SL' && trade.signalQualityScore && trade.signalQualityScore >= 85) {
console.log(`🎯 External SL closure - Quality ${trade.signalQualityScore} >= 85`)
await stopHuntTracker.recordStopHunt({
originalTradeId: trade.id,
symbol: trade.symbol,
direction: trade.direction,
stopHuntPrice: currentPrice,
originalEntryPrice: trade.entryPrice,
originalQualityScore: trade.signalQualityScore,
stopLossAmount: Math.abs(totalRealizedPnL)
})
console.log(`✅ Revenge window activated for external closure (30min monitoring)`)
}
```
**Commit:** 785b09e
---
### Pitfall #72: Telegram Webhook Conflicts with Polling Bot (🔴 CRITICAL - Fixed Dec 4, 2025)
**Symptom:** Python Telegram bot crashes with "Conflict: can't use getUpdates method while webhook is active"
**Root Cause:** n8n had active Telegram webhook that intercepted ALL messages before Python bot
**Real Incident:** `/status` command returned n8n test message with broken template syntax
**Fix Applied:**
```bash
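# Delete Telegram webhook (replace {TOKEN} with the bot token)
curl -s "https://api.telegram.org/bot{TOKEN}/deleteWebhook"
# Confirm no webhook remains ("url" should be empty)
curl -s "https://api.telegram.org/bot{TOKEN}/getWebhookInfo"
# Restart Python bot
docker restart telegram-trade-bot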
```
**Architecture Decision:** Cannot run both n8n webhook AND Python polling bot simultaneously. Choose one.
---
### Pitfall #89: Drift Fractional Position Remnants After SL Execution (🔴 CRITICAL - Dec 16, 2025)
**Symptom:** Stop loss triggered and transaction confirmed, but Drift shows 0.15 SOL fractional position remaining unprotected
**Financial Impact:** $1,000+ risk from unprotected positions - the fractional remnant has NO stop loss orders
**Real Incident (Dec 16, 2025 20:41:25):**
- Main position: SOL-PERP SHORT at $126.90, size $2,128.74
- Stop loss triggered at $128.13 for -$20.55 loss
- Position Manager attempted to close 100% (16.77 SOL)
- Transaction confirmed on-chain successfully
- BUT Drift showed 0.15 SOL ($19.22) still open
- **Three close attempts** all confirmed but residual remained
**Evidence from logs:**
```
🔍 CALC1: positionSizeUSD calculated = $2147.38
🔍 CALC2: trackedSizeUSD = $2128.74
params.percentToClose: 100
position.size: 16.77
Calculated sizeToClose: 16.77
Is below minimum? false
🔴 CRITICAL: Close transaction confirmed BUT position still exists on Drift!
Transaction: 3FTBmiCLkRqtuhHH1EwazTxGCuy63xuWpmUaxMJ2YU7n...
Drift size: 0.15
This indicates Drift state propagation delay or partial fill
```
**Database Evidence:**
```text
-- Main trade (stopped out correctly)
id: cmj8yqixi00e | SOL-PERP SHORT | Entry: $126.90 | Exit: $128.13
Size: $2,128.74 | P&L: -$20.55 | Reason: SL
-- Ghost fractional (wrong entry price, unprotected)
id: cmj91z1nr002 | SOL-PERP SHORT | Entry: $33.13 (WRONG!)
Size: $19.22 | P&L: $0 | Reason: GHOST_CLEANUP
```
**Root Cause:** **Drift Protocol Partial Fill Issue**
NOT a bot calculation error. Evidence shows:
1. Position Manager correctly calculated 100% close (16.77 SOL)
2. Close transaction executed and confirmed on-chain (verified signature)
3. Drift still showed 0.15 SOL after successful transaction
4. **Multiple attempts** (3 transactions) all confirmed but remnant persisted
5. Likely explanation (unconfirmed): the fractional position sits below an exchange liquidity threshold
6. Possible contributors: oracle price slippage or minimum fill constraints
**Why Multiple Close Attempts Failed:**
- First close: 16.77 SOL → 0.15 SOL remains
- Second close: 0.15 SOL → Transaction confirmed but still 0.15 SOL
- Third close: 0.15 SOL → Transaction confirmed but still 0.15 SOL
- All transactions returned SUCCESS but Drift state didn't update
**Transaction Signatures:**
1. `3FTBmiCLkRqtuhHH1EwazTxGCuy63xuWpmUaxMJ2YU7nrmiVAikw8c36TxsS4Dsnjm3Qcz1bMG7o9Brmhmt84g4L`
2. `4fHrkDxtmmyKW2vBsqe5tT1rHNosoHo8azcV6ntFC6KQRiytwdC2LLYM3Vv4J4tEmZetUEfKBR55WD8odnqCczGw`
3. `2BcdpZirfKvzhKoakqG5k3XbHkn9pVfCWGMpmYWTBtxYP1UGjKUyH3XSP8v5vM7xsch1jeCamcrmaBqyAz5ZA9B3`
**THE FIX (Dec 16, 2025):**
**Part 1: Fractional Position Detection (Position Manager)**
```typescript
// lib/trading/position-manager.ts - in handlePriceUpdate()
// After a close attempt, re-read the position and check for fractional remnants.
// Use the absolute size (as closePosition() does) and ignore fully closed positions.
const remnantSize = Math.abs(position?.size ?? 0)
if (closeResult.success && remnantSize > 0 && remnantSize < minOrderSize * 1.5) {
  // Count this attempt once and keep memory and database in sync
  const closeAttempts = (trade.closeAttempts || 0) + 1
  trade.closeAttempts = closeAttempts
  console.log(`⚠️ FRACTIONAL REMNANT: ${trade.symbol} has ${remnantSize} remaining (below ${minOrderSize * 1.5})`)
  console.log(`   This is likely a Drift partial fill issue`)
  console.log(`   Position too small to close normally - marking for force liquidation`)
  // Log to persistent logger
  const { logCriticalError } = await import('../utils/persistent-logger')
  await logCriticalError('FRACTIONAL_REMNANT_DETECTED', {
    symbol: trade.symbol,
    remnantSize,
    minOrderSize,
    tradeId: trade.id,
    closeAttempts,
  })
  // Mark trade for manual intervention
  await this.prisma.trade.update({
    where: { id: trade.id },
    data: {
      exitReason: 'FRACTIONAL_REMNANT',
      closeAttempts,
    },
  })
  // Remove from monitoring after 3 failed attempts
  if (closeAttempts >= 3) {
    console.log(`❌ Giving up after ${closeAttempts} close attempts - removing from monitoring`)
    console.log(`   Manual intervention required via Drift UI`)
    this.activeTrades.delete(trade.id)
  }
}
```
**Part 2: Minimum Size Safeguard (Close Function)**
```typescript
// lib/drift/orders.ts - in closePosition()
// Before attempting close, check if position viable
const minViableSize = marketConfig.minOrderSize * 1.5
if (Math.abs(position.size) < minViableSize) {
console.warn(`⚠️ Position size ${position.size} below minimum viable ${minViableSize}`)
console.warn(` This fractional position cannot be closed normally`)
console.warn(` Drift protocol issue - position likely stuck`)
return {
success: false,
error: 'POSITION_TOO_SMALL_TO_CLOSE',
remnantSize: Math.abs(position.size),
instructions: 'Close manually via Drift UI or wait for auto-liquidation'
}
}
```
**Part 3: Health Monitor Detection**
```typescript
// lib/health/position-manager-health.ts
// Add check for fractional remnants
const fractionalPositions = await prisma.trade.findMany({
where: {
exitReason: 'FRACTIONAL_REMNANT',
exitTime: { gt: new Date(Date.now() - 24 * 60 * 60 * 1000) }
}
})
if (fractionalPositions.length > 0) {
console.log(`🚨 CRITICAL: ${fractionalPositions.length} fractional remnants detected`)
for (const pos of fractionalPositions) {
console.log(` ${pos.symbol}: Trade ${pos.id} (${pos.closeAttempts || 1} close attempts)`)
}
}
```
**Why This Matters:**
- **This is a REAL MONEY system** - fractional remnants = unprotected exposure
- Drift protocol has known issues with small positions
- Cannot be detected by size calculations alone
- Requires transaction verification AFTER close attempts
- Health monitor will alert within 30 seconds
**Prevention Rules:**
1. ALWAYS verify Drift position size after close transactions
2. NEVER assume transaction confirmation = position closed
3. Check for fractional remnants below 1.5× minimum order size
4. Limit close retry attempts to prevent infinite loops
5. Log to persistent logger for manual review
6. Remove from monitoring after 3 failed attempts
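The prevention rules boil down to one decision after every confirmed close. The sketch below uses hypothetical names. It assumes the caller has already re-read the size from Drift (rule 1) and counts the attempt that just finished.

```typescript
type CloseOutcome = 'CLOSED' | 'STILL_OPEN' | 'FRACTIONAL_REMNANT' | 'STOP_MONITORING'

// Classify what is left on Drift after a close transaction confirmed
function classifyAfterClose(
  remainingSize: number, // base units re-read from Drift; sign is ignored
  minOrderSize: number,
  closeAttempts: number, // including the attempt that just finished
  maxAttempts = 3,
): CloseOutcome {
  const size = Math.abs(remainingSize)
  if (size === 0) return 'CLOSED'
  if (size >= minOrderSize * 1.5) return 'STILL_OPEN' // normal verification/retry path
  // Below 1.5x the minimum order size: a normal close is unlikely to work
  return closeAttempts >= maxAttempts ? 'STOP_MONITORING' : 'FRACTIONAL_REMNANT'
}
```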
**Red Flags Indicating This Bug:**
- Transaction confirmed but position still shows on Drift
- Position size below 2× minimum order size
- Multiple close attempts with same size remaining
- "CRITICAL: Close transaction confirmed BUT position still exists" logs
- Health monitor shows "UNTRACKED POSITIONS DETECTED"
- Auto-sync cooldown repeatedly activating
**Manual Resolution:**
1. Check Drift UI for fractional positions
2. Try closing via Drift UI directly (may work when API fails)
3. If stuck: Contact Drift support with transaction signatures
4. Database cleanup: Mark exitReason='FRACTIONAL_REMNANT_MANUAL'
**Files Changed:**
- lib/trading/position-manager.ts (fractional detection + retry limits)
- lib/drift/orders.ts (minimum viable size check)
- lib/health/position-manager-health.ts (fractional remnant alerts)
**Git commit:** [PENDING] "critical: Bug #89 - Detect and handle Drift fractional position remnants"
**Deployment:** [PENDING] Requires Docker rebuild + restart
**Status:** ⏳ FIX IMPLEMENTED - Awaiting deployment verification
**Lesson Learned:** Transaction confirmation ≠ position closed. Drift protocol can confirm transactions but leave fractional remnants due to exchange constraints, oracle pricing, or minimum fill requirements. Always verify actual position size after close operations, not just transaction success status.
---
## Appendix: Pattern Recognition
### Common Root Causes
1. **Race Conditions:** Multiple code paths detecting same event (P&L compounding bugs #48, #49, #59, #60, #67)
2. **Unit Mismatches:** Tokens vs USD, dollars vs percentages (#24, #54)
3. **Symbol Format:** TradingView ("SOLUSDT") vs Drift ("SOL-PERP") (#5, #66)
4. **Deployment Verification:** Declaring "fixed" without checking container timestamp (#31)
5. **SDK Behavior:** Documentation doesn't match reality (#2, #24, #45)
6. **Async Timing:** Operations completing out of expected order (#13, #28, #60)
### Prevention Strategies
1. **Use atomic operations** for state changes (Map.delete() returns boolean)
2. **Always normalize symbols** at integration boundaries
3. **Verify deployment** with container timestamp vs commit time
4. **Never mutate shared state** during calculation phases
5. **Add explicit checks** in ALL code paths, not just happy path
6. **Test with real infrastructure** before trusting provider claims
---
## Cross-Reference Index
- **See Also:** `.github/copilot-instructions.md` - Main AI agent instructions with Top 10 Critical Pitfalls
- **Related:** `docs/bugs/` - Additional bug documentation
- **Related:** `docs/architecture/` - System design context
---
**Last Updated:** December 4, 2025
**Maintainer:** AI Agent team following "NOTHING gets lost" principle