
# High Availability Setup for Trading Bot v4

## Architecture: Active-Passive Failover

- **Primary Server (Active):** Runs the trading bot 24/7
- **Secondary Server (Passive):** Monitors the primary and takes over on failure

### Why Active-Passive (Not Active-Active)?
- **Prevents duplicate trades** - critical for a financial system
- **Single source of truth** - one Position Manager tracks state
- **Avoids split-brain by design** - only one bot is meant to execute trades at any time
- **Database consistency** - no conflicting writes

---

## Setup Instructions

### 1. Prerequisites

- **Primary Server:** `root@192.168.1.100` (update in scripts)
- **Secondary Server:** `root@72.62.39.24`

Both servers need:
- Docker & Docker Compose installed
- Trading bot project at `/home/icke/traderv4`
- Same `.env` file (especially `DRIFT_WALLET_PRIVATE_KEY`)
- Same n8n workflows configured
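
The checklist above can be verified with a small script before continuing. This is a minimal sketch, not part of `ha-setup/`: the function name `missing_prereqs` is hypothetical, and it only checks what the list names (Docker, the project directory, and a `.env` containing `DRIFT_WALLET_PRIVATE_KEY`).

```bash
#!/usr/bin/env bash
# Hypothetical prerequisite check; not one of the ha-setup scripts.
PROJECT_DIR="${PROJECT_DIR:-/home/icke/traderv4}"

# Print one line per missing prerequisite; print nothing if all are present.
missing_prereqs() {
  local dir="$1"
  command -v docker >/dev/null 2>&1 || echo "docker is not installed"
  [ -d "$dir" ] || echo "project directory $dir not found"
  [ -f "$dir/.env" ] || echo "$dir/.env not found"
  if [ -f "$dir/.env" ] && ! grep -q '^DRIFT_WALLET_PRIVATE_KEY=' "$dir/.env"; then
    echo "DRIFT_WALLET_PRIVATE_KEY is not set in $dir/.env"
  fi
}

missing_prereqs "$PROJECT_DIR"
```

Run it on both servers. To confirm the `.env` files match, compare the output of `sha256sum /home/icke/traderv4/.env` on each server.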

### 2. Initial Sync (Already Done via rsync ✅)

```bash
# From primary server
rsync -avz --exclude 'node_modules' --exclude '.next' \
  /home/icke/traderv4/ root@72.62.39.24:/home/icke/traderv4/
```

### 3. Database Synchronization

**Option A: Manual Sync (Simpler, Recommended for Start)**

On primary:
```bash
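# --clean --if-exists makes the dump drop existing objects first, so repeated restores do not collide with existing tables
docker exec trading-bot-postgres pg_dump -U postgres --clean --if-exists trading_bot_v4 > /tmp/trading_bot_backup.sql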
rsync -avz /tmp/trading_bot_backup.sql root@72.62.39.24:/tmp/
```

On secondary:
```bash
docker exec -i trading-bot-postgres psql -U postgres trading_bot_v4 < /tmp/trading_bot_backup.sql
```

To automate this, run `sync-db-daily.sh` daily via cron on the primary (02:00 in this example):
```bash
0 2 * * * /home/icke/traderv4/ha-setup/sync-db-daily.sh
```
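
For orientation, here is a sketch of what a script like `sync-db-daily.sh` could do, chaining the three manual steps from Option A. The real script in `ha-setup/` may differ. The `DRY_RUN` switch and the `run` helper are assumptions added here so the commands can be previewed safely; the host, container, and database names are the ones used above.

```bash
#!/usr/bin/env bash
# Sketch of a daily DB sync; the real ha-setup/sync-db-daily.sh may differ.
SECONDARY="${SECONDARY:-root@72.62.39.24}"
DB_CONTAINER="${DB_CONTAINER:-trading-bot-postgres}"
DB_NAME="${DB_NAME:-trading_bot_v4}"
DUMP="${DUMP:-/tmp/trading_bot_backup.sql}"
REMOTE_DUMP="/tmp/$(basename "$DUMP")"
DRY_RUN="${DRY_RUN:-1}"   # assumption: print commands by default, set DRY_RUN=0 to execute

# Run a command line, or only print it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$1"
  else
    bash -o pipefail -c "$1"
  fi
}

sync_db() {
  # Dump on the primary; --clean --if-exists lets the restore replace existing tables.
  run "docker exec $DB_CONTAINER pg_dump -U postgres --clean --if-exists $DB_NAME > $DUMP" || return 1
  # Copy the dump to the secondary.
  run "rsync -avz $DUMP $SECONDARY:/tmp/" || return 1
  # Restore into the secondary's Postgres container.
  run "ssh $SECONDARY 'docker exec -i $DB_CONTAINER psql -U postgres $DB_NAME < $REMOTE_DUMP'" || return 1
  echo "DB sync steps completed (DRY_RUN=$DRY_RUN)"
}

sync_db
```

With the default `DRY_RUN=1` the script only prints the three commands. A cron entry would need `DRY_RUN=0` in its environment to perform the sync.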

**Option B: Streaming Replication (Advanced)**
```bash
# On primary
bash ha-setup/setup-db-replication.sh primary

# On secondary
bash ha-setup/setup-db-replication.sh secondary
```

### 4. Setup Health Monitoring

Make scripts executable:
```bash
chmod +x ha-setup/*.sh
```

**Test healthcheck on both servers:**
```bash
bash ha-setup/healthcheck.sh
# Should output: ✅ HEALTHY: All checks passed
```
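
The contents of `healthcheck.sh` are not shown here. As a rough sketch, a health check along these lines would produce the message above. It assumes two checks (container running, API answering), and it reuses the container name `trading-bot-v4` and the positions endpoint that appear elsewhere in this README; the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only; the real ha-setup/healthcheck.sh may check different things.
CONTAINER="${CONTAINER:-trading-bot-v4}"
API_URL="${API_URL:-http://localhost:3001/api/trading/positions}"

# True if a container with exactly this name is running.
container_running() {
  docker ps --format '{{.Names}}' | grep -qx "$CONTAINER"
}

# True if the API answers with a 2xx status within 5 seconds.
api_responding() {
  curl -sf --max-time 5 -H "Authorization: Bearer ${API_SECRET_KEY:-}" "$API_URL" >/dev/null
}

healthcheck() {
  if ! container_running; then
    echo "❌ UNHEALTHY: container $CONTAINER is not running"
    return 1
  fi
  if ! api_responding; then
    echo "❌ UNHEALTHY: API at $API_URL is not responding"
    return 1
  fi
  echo "✅ HEALTHY: All checks passed"
}
```

Calling `healthcheck` and exiting with its status gives the failover controller a simple pass/fail signal.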

### 5. Start Failover Controller (SECONDARY ONLY)

**Edit configuration first:**
```bash
nano ha-setup/failover-controller.sh
# Update PRIMARY_HOST with actual IP
# Update SECONDARY_HOST if needed
```

**Run as systemd service:**
```bash
sudo cp ha-setup/trading-bot-ha.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable trading-bot-ha
sudo systemctl start trading-bot-ha
```

**Check status:**
```bash
sudo systemctl status trading-bot-ha
sudo journalctl -u trading-bot-ha -f
```
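
If `trading-bot-ha.service` needs to be recreated, a unit along these lines would run the controller and restart it if it exits. This is a sketch under assumptions: the shipped unit file may use different settings, and `render_unit` is a hypothetical helper that just prints the unit text.

```bash
#!/usr/bin/env bash
# Prints an example systemd unit; the shipped trading-bot-ha.service may differ.
render_unit() {
  cat <<'EOF'
[Unit]
Description=Trading Bot HA failover controller
After=network-online.target docker.service
Wants=network-online.target

[Service]
Type=simple
WorkingDirectory=/home/icke/traderv4
ExecStart=/bin/bash /home/icke/traderv4/ha-setup/failover-controller.sh
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
}

render_unit
```

Redirect the output to a file to inspect it before copying anything into `/etc/systemd/system/`.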

### 6. SSH Key Setup (Password-less Auth)

The secondary needs SSH access to the primary for health checks. Set this up before starting the failover controller in step 5:

```bash
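# On secondary: generate a dedicated key (empty passphrase so the service can use it unattended)
ssh-keygen -t ed25519 -N "" -f /root/.ssh/trading_bot_ha
ssh-copy-id -i /root/.ssh/trading_bot_ha root@192.168.1.100

# Test connection with the new key
ssh -i /root/.ssh/trading_bot_ha root@192.168.1.100 "docker ps | grep trading-bot"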
```

---

## How It Works

### Normal Operation (Primary Active)

1. **Primary:** Trading bot running, executing trades
2. **Secondary:** Failover controller checks primary every 15s
3. **Secondary:** Bot container STOPPED (passive standby)

### Failover Scenario

1. **Primary fails** (server down, docker crash, API unresponsive)
2. **Secondary detects** 3 consecutive failed health checks (45s)
3. **Telegram alert sent:** "🚨 HA FAILOVER: Primary failed, activating secondary"
4. **Secondary starts** trading bot container
5. **Trading continues** on secondary with same wallet/config
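
The alert in step 3 can be sent with the Telegram Bot API's `sendMessage` method. The sketch below assumes the bot token and chat ID are available as `TELEGRAM_BOT_TOKEN` and `TELEGRAM_CHAT_ID`; those variable names and the function names are assumptions, not taken from the controller script.

```bash
#!/usr/bin/env bash
# Sketch of a Telegram alert helper; variable and function names are assumptions.

# Build the sendMessage URL for a given bot token.
telegram_url() {
  printf 'https://api.telegram.org/bot%s/sendMessage' "$1"
}

# Send one text message; returns non-zero if the request fails.
send_alert() {
  curl -sf --max-time 10 "$(telegram_url "$TELEGRAM_BOT_TOKEN")" \
    --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
    --data-urlencode "text=$1" >/dev/null
}
```

Usage: `send_alert "🚨 HA FAILOVER: Primary failed, activating secondary"`.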

### Recovery Scenario

1. **Primary recovers** (you fix it, restart, etc.)
2. **Secondary detects** primary is healthy again
3. **Secondary stops** its trading bot (returns to standby)
4. **Telegram alert:** "Primary recovered, secondary deactivated"
5. **Primary resumes** as active node
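
The failover and recovery steps above amount to a small state machine: count consecutive failed checks, activate at the threshold, and deactivate on the first healthy check. The following is a minimal sketch of that logic, not the actual `failover-controller.sh`. It assumes the primary is probed by running `healthcheck.sh` over SSH with the key from step 6; all function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch of the controller logic; the real failover-controller.sh may differ.
PRIMARY_HOST="${PRIMARY_HOST:-root@192.168.1.100}"
SSH_KEY="${SSH_KEY:-/root/.ssh/trading_bot_ha}"
PROJECT_DIR="${PROJECT_DIR:-/home/icke/traderv4}"
CHECK_INTERVAL="${CHECK_INTERVAL:-15}"
MAX_FAILURES="${MAX_FAILURES:-3}"

failures=0   # consecutive failed health checks
active=0     # 1 while the secondary's bot is running

# Assumption: the primary is healthy if its healthcheck.sh exits 0 over SSH.
check_primary() {
  ssh -i "$SSH_KEY" -o BatchMode=yes -o ConnectTimeout=5 "$PRIMARY_HOST" \
    "bash $PROJECT_DIR/ha-setup/healthcheck.sh" >/dev/null 2>&1
}

activate_secondary() {
  echo "🚨 HA FAILOVER: Primary failed, activating secondary"
  (cd "$PROJECT_DIR" && docker compose up -d trading-bot)
}

deactivate_secondary() {
  echo "Primary recovered, secondary deactivated"
  (cd "$PROJECT_DIR" && docker compose stop trading-bot)
}

# Update state for one check result: $1 is 0 if the primary was healthy.
handle_check() {
  if [ "$1" -eq 0 ]; then
    failures=0
    if [ "$active" -eq 1 ] && deactivate_secondary; then
      active=0
    fi
  else
    failures=$((failures + 1))
    if [ "$failures" -ge "$MAX_FAILURES" ] && [ "$active" -eq 0 ]; then
      if activate_secondary; then active=1; fi
    fi
  fi
}

main_loop() {
  while true; do
    if check_primary; then handle_check 0; else handle_check 1; fi
    sleep "$CHECK_INTERVAL"
  done
}
```

Calling `main_loop` starts the controller. Note that a check over SSH cannot distinguish a dead primary from a broken network path between the two servers, so this logic alone does not rule out split-brain.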

---

## Monitoring & Maintenance

### Check HA Status

**On secondary:**
```bash
# View failover controller logs
sudo journalctl -u trading-bot-ha -f --lines=50

# Check if secondary is active
docker ps | grep trading-bot-v4
```

**On primary:**
```bash
# Run healthcheck manually
bash ha-setup/healthcheck.sh

# Check container status
docker ps | grep trading-bot-v4
```

### Manual Failover Testing

**Simulate primary failure:**
```bash
# On primary, stop trading bot
docker compose stop trading-bot

# Watch secondary logs - should activate within 45s
# On secondary
sudo journalctl -u trading-bot-ha -f
```

**Restore primary:**
```bash
# On primary, restart trading bot
docker compose up -d trading-bot

# Watch secondary - should deactivate within 15s
```
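
To put a number on the failover time during a test, a small polling helper can be run on the secondary right after stopping the primary. This is a sketch: `wait_until` and `secondary_bot_running` are hypothetical helpers, and the container name `trading-bot-v4` is taken from the status commands above.

```bash
#!/usr/bin/env bash
# Sketch: measure how long it takes for a condition to become true.

# wait_until <timeout_seconds> <command...>
# Polls once per second; prints elapsed seconds on success, returns 1 on timeout.
wait_until() {
  local timeout="$1"
  shift
  local start=$SECONDS
  until "$@"; do
    if [ $((SECONDS - start)) -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
  done
  echo $((SECONDS - start))
}

# True if the bot container is running on this machine.
secondary_bot_running() {
  docker ps --format '{{.Names}}' | grep -qx 'trading-bot-v4'
}
```

Usage on the secondary: `wait_until 90 secondary_bot_running` prints the number of seconds until the bot container appeared, or fails after 90 seconds.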

### Database Sync Schedule

**Daily sync from primary to secondary:**

On primary, add to crontab:
```bash
crontab -e
# Add:
0 2 * * * /home/icke/traderv4/ha-setup/sync-db-daily.sh >> /var/log/trading-bot-db-sync.log 2>&1
```

- **Before failover:** The secondary holds the last synced DB state (trade history up to 24h old)
- **After failover:** The secondary continues with its current state and syncs back to the primary once the primary has recovered
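
Because the secondary's data can be up to 24h old, it helps to know when the last sync actually succeeded. One way, sketched below, is to record a Unix timestamp after each successful sync and compute its age. The marker file path and the function name are assumptions; nothing in `ha-setup/` is known to write such a file.

```bash
#!/usr/bin/env bash
# Sketch: report the age of the last successful sync in whole hours.
MARKER="${MARKER:-/var/log/trading-bot-db-sync.last}"   # hypothetical marker file

# sync_age_hours <last_sync_epoch> [now_epoch]
sync_age_hours() {
  local last="$1"
  local now="${2:-$(date +%s)}"
  echo $(( (now - last) / 3600 ))
}
```

Usage: append `date +%s > "$MARKER"` to the end of a successful sync, then run `sync_age_hours "$(cat "$MARKER")"` and alert if the result exceeds 24.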

---

## Important Notes

### Financial Safety

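- **NEVER run both servers actively** - this would cause duplicate trades and wallet conflicts
- **Failover controller** is designed to keep only one bot active at a time. During recovery both can overlap briefly, because the secondary keeps running until its next health check sees the primary healthy (up to 15s)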
- **Same wallet key** required on both servers
- **Same n8n webhook endpoint** - update TradingView alerts if needed

### Database Consistency

- **Daily sync:** Keeps secondary within 24h of primary
- **Trade history:** May have small gap after failover (acceptable)
- **Position Manager:** Rebuilds state from Drift Protocol on startup
- **No financial loss:** Drift Protocol is source of truth for positions

### Network Requirements

- **Secondary → Primary:** SSH access (port 22) for health checks
- **Both → Internet:** For Drift Protocol, Telegram, n8n webhooks
- **n8n:** Can run on both or centralized (needs webhook routing)

### Testing Recommendations

1. **Week 1:** Run without failover, just monitor health checks
2. **Week 2:** Test manual failover (stop primary, verify secondary takes over)
3. **Week 3:** Test recovery (restart primary, verify secondary stops)
4. **Week 4:** Enable automatic failover for production

---

## Troubleshooting

### Secondary Won't Start After Failover

```bash
# Check logs
docker logs trading-bot-v4

# Check .env file exists
ls -la /home/icke/traderv4/.env

# Check Drift initialization
docker logs trading-bot-v4 | grep "Drift"
```

### Split-Brain (Both Servers Active)

**EMERGENCY - Stop both immediately:**
```bash
# On both servers
docker compose stop trading-bot
```

**Then restart only primary:**
```bash
# On primary only
docker compose up -d trading-bot
```

**Check Drift positions:**
```bash
curl -s http://localhost:3001/api/trading/positions \
  -H "Authorization: Bearer ${API_SECRET_KEY}" | jq .
```
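
Split-brain can also be detected before it is noticed in the trade history. The sketch below, meant to run on the secondary, reports whether the bot container is running on both servers at once. It assumes the SSH key from step 6 and the container name `trading-bot-v4`; the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: detect both bots running at the same time (run on the secondary).
PRIMARY_HOST="${PRIMARY_HOST:-root@192.168.1.100}"
SSH_KEY="${SSH_KEY:-/root/.ssh/trading_bot_ha}"
CONTAINER="${CONTAINER:-trading-bot-v4}"

bot_running_local() {
  docker ps --format '{{.Names}}' | grep -qx "$CONTAINER"
}

bot_running_on_primary() {
  ssh -i "$SSH_KEY" -o BatchMode=yes -o ConnectTimeout=5 "$PRIMARY_HOST" \
    "docker ps --format '{{.Names}}' | grep -qx '$CONTAINER'"
}

# True only if the bot is running on both servers.
split_brain() {
  bot_running_local && bot_running_on_primary
}
```

Usage: `if split_brain; then echo "SPLIT-BRAIN: stop both bots now"; fi`. If the primary is unreachable over SSH, the check reports no split-brain, so it cannot replace the manual procedure above.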

### Health Check False Positives

Adjust thresholds in `failover-controller.sh`:
```bash
CHECK_INTERVAL=30  # Slower checks (reduce network load)
MAX_FAILURES=5     # More tolerant (reduce false failovers)
```
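
Raising these values also lengthens the time the primary must be down before failover starts. The detection time is roughly `CHECK_INTERVAL` multiplied by `MAX_FAILURES`, which a short calculation makes explicit (the helper name `detection_seconds` is hypothetical):

```bash
#!/usr/bin/env bash
# Worked arithmetic: how long the primary must fail before failover triggers.
CHECK_INTERVAL=30
MAX_FAILURES=5

# detection_seconds <check_interval> <max_failures>
detection_seconds() {
  echo $(( $1 * $2 ))
}

echo "Failover triggers after about $(detection_seconds "$CHECK_INTERVAL" "$MAX_FAILURES")s of continuous failure"
```

With the values above that is 150s, compared with 45s for the 15s interval and 3 failures described under How It Works.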

---

## Cost Analysis

- **Primary Server:** Always running (existing cost)
- **Secondary Server:** Always running, but mostly idle

**Benefits:**
- **~99.9% uptime** (estimated) vs ~95% for a single server
- **~8.8 hours/year** of downtime at 99.9%, versus roughly 18 days/year at 95%
- **Financial protection** - fewer missed trades during outages (the failover window is about 45s)
- **Peace of mind** - sleep without worrying about server crashes
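
To convert an uptime percentage into expected downtime per year, multiply the unavailable fraction by 8760 hours. A small helper makes it easy to compare targets (the name `downtime_hours` is hypothetical, and `awk` is used because bash has no floating-point arithmetic):

```bash
#!/usr/bin/env bash
# Worked arithmetic: yearly downtime in hours for a given uptime percentage.

# downtime_hours <uptime_percent>
downtime_hours() {
  LC_ALL=C awk -v u="$1" 'BEGIN { printf "%.2f\n", (100 - u) / 100 * 8760 }'
}

for uptime in 95 99.9 99.95; do
  echo "${uptime}% uptime = $(downtime_hours "$uptime") hours/year of downtime"
done
```

For reference, 95% corresponds to 438 hours (about 18 days) per year and 99.9% to about 8.76 hours per year.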

**Worth it?** YES - For a financial system, redundancy is essential.

---

## Future Enhancements

1. **Geographic redundancy:** Secondary in different datacenter/region
2. **Load balancer:** Route n8n webhooks to active server automatically
3. **Streaming replication by default:** Make Option B the standard setup for near-real-time sync (minimal data loss)
4. **Multi-region:** Three servers (US, EU, Asia) for global coverage
5. **Health dashboard:** Web UI showing HA status and metrics