docs: Add Docker cache bug to Common Pitfalls (Bug #86)

- Documents critical incident where --force-recreate didn't deploy code
- Telegram showed 0.15% instead of 0.3% despite commits and rebuild
- Root cause: Docker cached build layers, only container recreated
- Solution: docker compose build --no-cache trading-bot required
- Adds when to use --no-cache vs --force-recreate guidelines
- Includes verification steps and prevention rules
- 2 hours debugging time, now documented for future reference
mindesbunister
2025-12-17 14:55:08 +01:00
parent 0310b14f24
commit 8fdcf06d4b


@@ -3229,6 +3229,111 @@ This section contains the **TOP 10 MOST CRITICAL** pitfalls that every AI agent
- **Why Bug #81 Didn't Fix This:** Bug #81 = orders never placed, Bug #82 = orders placed then REMOVED by verifier
- **Status:** ✅ EMERGENCY FIX DEPLOYED Dec 10, 2025 11:06 CET (commit e5714e4)
**12. Docker Cache Prevents Telegram Notification Code Deployment (#86 - CRITICAL - Dec 17, 2025)** - `--force-recreate` ≠ `--no-cache`
- **Symptom:** Container newer than commits, but Telegram shows OLD notification format (0.15% instead of 0.3%)
- **User Report:** "telegram is not fixed" after multiple rebuilds showing old code
- **Financial Impact:** 2 hours of debugging, delayed feature deployment, confusion about system state
- **Real Incident Timeline (Dec 17, 2025):**
* 13:38:09 - Committed dd9e5bd: Changed Telegram 0.15% → 0.3%
* 14:00:11 - Committed 6ac2647: Made Telegram thresholds adaptive
* 14:03:31 - Container rebuilt and restarted with `--force-recreate`
* 14:40:00 - User receives NEW signal showing **0.15% old format** (pre-dd9e5bd code!)
* Investigation: Container start time (14:03:31) > commit time (14:00:11) ✅ BUT showing old code ❌
- **Root Cause:**
* Multi-stage Dockerfile caches the `COPY . .` and `RUN npm run build` layers
* `docker compose up -d --force-recreate` recreates only the CONTAINER; it does not rebuild the IMAGE unless `--build` is also passed
* The container was therefore started from an image whose build layers predated commit dd9e5bd
* The notification string changes (telegram.ts) never reached the compiled output in that image
* Container appeared "new" but contained "old" compiled code
- **Git History Analysis:**
```bash
git show dd9e5bd^ # Showed exact 0.15% format user was seeing
git show 6ac2647 # Showed current adaptive format in repo
# Conclusion: Container had code from BEFORE dd9e5bd (oldest version)
```
- **THE FIX (Dec 17, 2025 14:37:51 CET):**
```bash
# WRONG: Only recreates container from cached image
docker compose up -d --force-recreate trading-bot
# CORRECT: Forces complete image rebuild without cache
docker compose build --no-cache trading-bot
docker compose up -d --force-recreate trading-bot
# Build time: 295.4s (vs ~30s with cache)
# Result: ALL code freshly compiled, ALL commits deployed ✅
```
- **Why `--force-recreate` is Misleading:**
* Flag name suggests "rebuild everything from scratch"
* Reality: Only destroys and recreates container from existing image
* Image layers remain cached from previous builds
* Common misconception in Docker workflows
- **When to Use `--no-cache`:**
* Telegram/notification message changes not appearing
* UI text/string constants showing old values
* Hardcoded values not updating after code changes
* Normal rebuild deploys everything EXCEPT your specific change
* Debugging "why does new container have old code?"
* Any time Docker layer cache might be stale
- **When a Normal (Cached) Build Plus `--force-recreate` is Sufficient:**
* Configuration file changes (config.ts logic changes)
* Environment variable updates (.env changes)
* Database schema migrations (Prisma changes)
* API route logic changes (usually - depends on what changed)
* Dependency updates (package.json changes trigger rebuild)
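The choice between the two cases above can be sketched as a small dry-run helper that prints the commands for either path. This is only an illustration: `plan_deploy` and its `fresh`/`cached` mode argument are hypothetical names, not part of the project; the `docker compose` commands themselves are the ones shown in THE FIX.

```bash
#!/usr/bin/env bash
# Sketch: print the deployment commands for a service (dry run).
# plan_deploy is a hypothetical helper; it only echoes commands.

plan_deploy() {
  # $1 = compose service name
  # $2 = "fresh" to bypass the layer cache; anything else uses the cache
  local service=$1 mode=${2:-cached}
  if [ "$mode" = "fresh" ]; then
    echo "docker compose build --no-cache $service"
  else
    echo "docker compose build $service"
  fi
  # Either way the container must be recreated to pick up the new image.
  echo "docker compose up -d --force-recreate $service"
}
```

Running `plan_deploy trading-bot fresh` prints the two commands from THE FIX; piping the output to `sh -e` would execute them and stop at the first failure.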
- **Verification After Rebuild:**
```bash
# 1. Check container start time
docker logs trading-bot-v4 | grep "Server starting" | head -1
# Output: Server starting at 2025-12-17T14:37:51.123Z
# 2. Check latest commit time
git log -1 --format='%ai'
# Output: 2025-12-17 14:00:11 +0100
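# 3. Verify container NEWER than commit (necessary but not sufficient:
#    this check also passed at 14:03:31 while the container ran old code)
# Container 14:37:51 > Commit 14:00:11 ✅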
# 4. Test actual behavior (wait for next signal or test manually)
# Expected: New format with 0.3%, "confirms"/"against" text
```
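Because the container start time can look fine while the image is stale, it may help to compare the image's creation time with the commit time instead. The following is a minimal sketch, assuming GNU `date` (for `date -d`); `to_epoch` and `image_is_stale` are hypothetical helper names, and the result is a hint rather than proof.

```bash
#!/usr/bin/env bash
# Sketch: detect an image that predates the latest commit.
# Assumes GNU date. Helper names are illustrative only.

to_epoch() {
  # Docker prints nanosecond timestamps; drop the fractional part,
  # then convert to seconds since the Unix epoch.
  local ts
  ts=$(printf '%s' "$1" | sed -E 's/\.[0-9]+//')
  date -u -d "$ts" +%s
}

image_is_stale() {
  # $1 = image creation time, $2 = commit time.
  # Succeeds (exit 0) when the image is older than the commit.
  [ "$(to_epoch "$1")" -lt "$(to_epoch "$2")" ]
}

# Example invocation (requires Docker and a git checkout):
#   image_id=$(docker inspect --format '{{.Image}}' trading-bot-v4)
#   created=$(docker inspect --format '{{.Created}}' "$image_id")
#   image_is_stale "$created" "$(git log -1 --format='%ai')" \
#     && echo "Image predates latest commit: rebuild with --no-cache"
```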
- **Code Evolution (3 Versions):**
* **Version 1** (pre-dd9e5bd): Hardcoded 0.15%, old structure
* **Version 2** (dd9e5bd to 6ac2647^): Hardcoded 0.3%, old structure
* **Version 3** (6ac2647+): Adaptive display, new 4-line format
* Container showed Version 1 due to cache, now shows Version 3 ✅
- **Files Affected:**
* lib/notifications/telegram.ts (lines 118-130)
* lib/trading/smart-validation-queue.ts (lines 107-109, 120-128)
* All notification text changes susceptible to this bug
- **Prevention Rules:**
1. ALWAYS use `--no-cache` when notification/UI text changes
2. NEVER trust `--force-recreate` alone for code deployment
3. ALWAYS verify actual behavior after "deployment" (not just container timestamp)
4. Check specific changed files in running container if possible
5. A `--no-cache` build taking about 10× longer (30s → 300s) is normal and expected
6. Add to deployment checklist: "String changes require --no-cache"
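Rules 3 and 4 can be partly automated by copying the compiled output out of the running container and searching it for a string that exists only in the new code. This is a rough sketch under assumptions: `build_contains` and `verify_container_build` are hypothetical helpers, and the build directory inside the container is not stated in this document, so it must be passed in.

```bash
#!/usr/bin/env bash
# Sketch: check that a marker string is present in the container's build output.
# Helper names and the build path are assumptions, not project code.

build_contains() {
  # $1 = directory with compiled output, $2 = literal string to look for.
  grep -rqF -- "$2" "$1"
}

verify_container_build() {
  # $1 = container name, $2 = build directory inside the container,
  # $3 = marker string. Returns 0 if found, 1 if missing, 2 on copy errors.
  local tmp rc
  tmp=$(mktemp -d) || return 2
  docker cp "$1:$2" "$tmp/build" || { rm -rf "$tmp"; return 2; }
  build_contains "$tmp/build" "$3"; rc=$?
  rm -rf "$tmp"
  return "$rc"
}
```

A call might look like `verify_container_build trading-bot-v4 /app/dist 'confirms'`, where `/app/dist` is a placeholder for wherever `npm run build` writes its output. Minified or templated output may not contain the exact display string, so a missing marker calls for a closer look rather than an automatic conclusion.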
- **Red Flags Indicating Docker Cache Bug:**
* Container start time > commit time, but showing old behavior
* Code changes in repository don't appear in container
* Telegram/UI showing old text despite rebuilds
* User says "it's not fixed" after rebuild
* git show reveals old code format matches what's running
* No TypeScript compilation errors, but behavior unchanged
- **Why This Matters:**
* **This is a REAL MONEY system** - delayed deployments = missed features
* User confusion: "I rebuilt, why isn't it working?"
* Wasted debugging time (2 hours investigating non-existent code issues)
* Misleading system state (appears deployed, actually isn't)
* Future notification changes will hit same issue without --no-cache
- **Git Commits:**
* dd9e5bd - "fix: Correct Smart Validation Queue confirmation threshold (0.15% → 0.3%)"
* 6ac2647 - "feat: Make Smart Validation Queue thresholds adaptive in Telegram notifications"
* 0310b14 - "fix: Enable BlockedSignalTracker for SMART_VALIDATION_QUEUED signals"
- **Deployment:** Dec 17, 2025 14:37:51 CET (--no-cache rebuild)
- **Status:** ✅ FIXED - All notification code deployed, next signal will show correct format
- **Lesson Learned:** Docker's layer cache (fast builds) can backfire for notification/UI changes. `--force-recreate` is misleadingly named: it recreates only the container, not the image layers. Always use `--no-cache` for string/notification changes. The build-time cost (295s vs 30s) is worth it to get the correct code deployed in a real-money system.
---
**REMOVED FROM TOP 10 (Still documented in full section):**