feat: Add High Availability setup roadmap and scripts
Created comprehensive HA roadmap with 6 phases:
- Phase 1: Warm standby (CURRENT - manual failover)
- Phase 2: Database replication
- Phase 3: Health monitoring
- Phase 4: Reverse proxy + floating IP
- Phase 5: Automated failover
- Phase 6: Geographic redundancy

Includes:
- Decision gates based on capital and stability
- Cost-benefit analysis
- Scripts for healthcheck, failover, DB sync
- Recommendation to defer full HA until capital > $5k

Secondary server ready at 72.62.39.24 for emergency manual failover.

Related: User concern about system uptime, but full HA complexity not justified at current scale (~$600 capital). Revisit in Q1 2026.
298
ha-setup/README.md
Normal file
@@ -0,0 +1,298 @@
# High Availability Setup for Trading Bot v4

## Architecture: Active-Passive Failover

- **Primary Server (Active):** Runs trading bot 24/7
- **Secondary Server (Passive):** Monitors primary, takes over on failure

### Why Active-Passive (Not Active-Active)?

- **Prevents duplicate trades** - CRITICAL for financial system
- **Single source of truth** - One Position Manager tracking state
- **No split-brain scenarios** - Only one bot executes trades
- **Database consistency** - No conflicting writes

---

## Setup Instructions

### 1. Prerequisites

- **Primary Server:** `root@192.168.1.100` (update in scripts)
- **Secondary Server:** `root@72.62.39.24`

Both servers need:
- Docker & Docker Compose installed
- Trading bot project at `/home/icke/traderv4`
- Same `.env` file (especially `DRIFT_WALLET_PRIVATE_KEY`)
- Same n8n workflows configured

### 2. Initial Sync (Already Done via rsync ✅)

```bash
# From primary server
rsync -avz --exclude 'node_modules' --exclude '.next' \
  /home/icke/traderv4/ root@72.62.39.24:/home/icke/traderv4/
```

### 3. Database Synchronization

**Option A: Manual Sync (Simpler, Recommended for Start)**

On primary:
```bash
# --clean --if-exists adds DROP statements so the dump can be restored over an existing database
docker exec trading-bot-postgres pg_dump -U postgres --clean --if-exists trading_bot_v4 > /tmp/trading_bot_backup.sql
rsync -avz /tmp/trading_bot_backup.sql root@72.62.39.24:/tmp/
```

On secondary:
```bash
docker exec -i trading-bot-postgres psql -U postgres trading_bot_v4 < /tmp/trading_bot_backup.sql
```

Run this daily via cron on primary:
```bash
# Every day at 02:00
0 2 * * * /home/icke/traderv4/ha-setup/sync-db-daily.sh
```

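The dump, copy, and restore steps above could be combined into one script for the cron job. The sketch below is only an illustration of that idea: the real `sync-db-daily.sh` is not shown in this README, and the function name `sync_db` and the environment-variable overrides are assumptions. It also assumes the secondary's Postgres container is running and that the SSH key for `root@72.62.39.24` is already set up.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a daily sync script (NOT the actual sync-db-daily.sh).
SECONDARY="${SECONDARY:-root@72.62.39.24}"             # secondary server from Prerequisites
PG_CONTAINER="${PG_CONTAINER:-trading-bot-postgres}"   # Postgres container name used above
DB_NAME="${DB_NAME:-trading_bot_v4}"
DUMP_FILE="${DUMP_FILE:-/tmp/trading_bot_backup.sql}"

# Runs in a subshell so that "set -e" aborts the sync on the first failing step
# without affecting the calling shell.
sync_db() (
  set -euo pipefail
  remote_dump="/tmp/$(basename "$DUMP_FILE")"

  # 1. Dump on the primary; --clean --if-exists lets the dump overwrite existing tables
  docker exec "$PG_CONTAINER" pg_dump -U postgres --clean --if-exists "$DB_NAME" > "$DUMP_FILE"

  # 2. Copy the dump to the secondary
  rsync -avz "$DUMP_FILE" "$SECONDARY:/tmp/"

  # 3. Restore on the secondary (the redirect is evaluated on the remote side)
  ssh "$SECONDARY" "docker exec -i $PG_CONTAINER psql -U postgres $DB_NAME < $remote_dump"

  echo "DB sync complete: $(date -u +%FT%TZ)"
)

# Usage (for example as the last line of the cron script):
#   sync_db
```
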
**Option B: Streaming Replication (Advanced)**
```bash
# On primary
bash ha-setup/setup-db-replication.sh primary

# On secondary
bash ha-setup/setup-db-replication.sh secondary
```

### 4. Set Up Health Monitoring

Make scripts executable:
```bash
chmod +x ha-setup/*.sh
```

**Test healthcheck on both servers:**
```bash
bash ha-setup/healthcheck.sh
# Should output: ✅ HEALTHY: All checks passed
```

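For orientation, a health check of this kind might look roughly like the sketch below. This is not the actual `healthcheck.sh`, whose checks are not listed in this README. The container name `trading-bot-v4` and port 3001 appear elsewhere in this document, but the `/api/health` path and the function name `run_healthcheck` are assumptions for illustration.

```bash
# Hypothetical sketch of a health check (NOT the actual healthcheck.sh).
BOT_CONTAINER="${BOT_CONTAINER:-trading-bot-v4}"               # container name used in this README
HEALTH_URL="${HEALTH_URL:-http://localhost:3001/api/health}"   # assumed endpoint; adjust to the real API

run_healthcheck() {
  # Check 1: the bot container is listed as running
  if ! docker ps --format '{{.Names}}' | grep -qx "$BOT_CONTAINER"; then
    echo "❌ UNHEALTHY: container $BOT_CONTAINER is not running"
    return 1
  fi

  # Check 2: the API answers within 5 seconds (-f makes curl fail on HTTP errors)
  if ! curl -sf --max-time 5 "$HEALTH_URL" > /dev/null; then
    echo "❌ UNHEALTHY: API at $HEALTH_URL is not responding"
    return 1
  fi

  echo "✅ HEALTHY: All checks passed"
}
```

The non-zero return code on failure is what lets a caller, such as a failover controller, count failed checks without parsing the output.
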
### 5. Start Failover Controller (SECONDARY ONLY)

**Edit configuration first:**
```bash
nano ha-setup/failover-controller.sh
# Update PRIMARY_HOST with actual IP
# Update SECONDARY_HOST if needed
```

**Run as systemd service:**
```bash
sudo cp ha-setup/trading-bot-ha.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable trading-bot-ha
sudo systemctl start trading-bot-ha
```

**Check status:**
```bash
sudo systemctl status trading-bot-ha
sudo journalctl -u trading-bot-ha -f
```

### 6. SSH Key Setup (Password-less Auth)

Secondary needs SSH access to primary for health checks:

```bash
# On secondary (leave the passphrase empty so the controller can connect unattended)
ssh-keygen -t ed25519 -f /root/.ssh/trading_bot_ha
ssh-copy-id -i /root/.ssh/trading_bot_ha root@192.168.1.100

# Test connection (pass the key explicitly, since it is not a default identity file)
ssh -i /root/.ssh/trading_bot_ha root@192.168.1.100 "docker ps | grep trading-bot"
```

---

## How It Works

### Normal Operation (Primary Active)

1. **Primary:** Trading bot running, executing trades
2. **Secondary:** Failover controller checks primary every 15s
3. **Secondary:** Bot container STOPPED (passive standby)

### Failover Scenario

1. **Primary fails** (server down, docker crash, API unresponsive)
2. **Secondary detects** 3 consecutive failed health checks (45s)
3. **Telegram alert sent:** "🚨 HA FAILOVER: Primary failed, activating secondary"
4. **Secondary starts** trading bot container
5. **Trading continues** on secondary with same wallet/config

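The detection step (3 consecutive failures at a 15s interval) can be sketched as a small loop. This is an illustration under assumptions, not the contents of `failover-controller.sh`: the function name `wait_for_primary_failure` is hypothetical, the check command is passed in by the caller, and the `echo` stands in for the Telegram alert.

```bash
# Hypothetical sketch of the detection loop (NOT the actual failover-controller.sh).
CHECK_INTERVAL="${CHECK_INTERVAL:-15}"   # seconds between checks
MAX_FAILURES="${MAX_FAILURES:-3}"        # consecutive failures before failover

# wait_for_primary_failure CHECK_CMD [ARGS...]
# Runs CHECK_CMD every CHECK_INTERVAL seconds and returns once it has failed
# MAX_FAILURES times in a row. Any successful check resets the counter.
wait_for_primary_failure() {
  local failures=0
  while [ "$failures" -lt "$MAX_FAILURES" ]; do
    if "$@"; then
      failures=0
    else
      failures=$((failures + 1))
      echo "Primary check failed ($failures/$MAX_FAILURES)"
    fi
    # Sleep only if another check is still needed
    if [ "$failures" -lt "$MAX_FAILURES" ]; then
      sleep "$CHECK_INTERVAL"
    fi
  done
  echo "🚨 HA FAILOVER: Primary failed, activating secondary"
}
```

On the secondary, this could be called with an SSH command that runs the health check on the primary, followed by `docker compose up -d trading-bot` once the function returns. Resetting the counter on every success is what keeps a single dropped check from triggering a failover.
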
### Recovery Scenario

1. **Primary recovers** (you fix it, restart, etc.)
2. **Secondary detects** primary is healthy again
3. **Secondary stops** its trading bot (returns to standby)
4. **Telegram alert:** "Primary recovered, secondary deactivated"
5. **Primary resumes** as active node

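Taken together, the failover and recovery steps amount to a two-input decision: the secondary's current state and the primary's health. A minimal sketch of that decision table follows; the function name `next_action` and the state labels are hypothetical, and "unhealthy" here means the failure threshold has already been reached.

```bash
# next_action SECONDARY_STATE PRIMARY_HEALTH
#   SECONDARY_STATE: standby | active
#   PRIMARY_HEALTH:  healthy | unhealthy
# Prints the action the secondary should take.
next_action() {
  case "$1:$2" in
    standby:unhealthy) echo "start-bot" ;;  # failover: activate the secondary
    active:healthy)    echo "stop-bot" ;;   # recovery: return to standby
    *)                 echo "none" ;;       # standby:healthy or active:unhealthy, nothing to change
  esac
}

# Example:
#   next_action standby unhealthy   # prints "start-bot"
#   next_action active healthy      # prints "stop-bot"
```
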
---

## Monitoring & Maintenance

### Check HA Status

**On secondary:**
```bash
# View failover controller logs
sudo journalctl -u trading-bot-ha -f --lines=50

# Check if secondary is active
docker ps | grep trading-bot-v4
```

**On primary:**
```bash
# Run healthcheck manually
bash ha-setup/healthcheck.sh

# Check container status
docker ps | grep trading-bot-v4
```

### Manual Failover Testing

**Simulate primary failure:**
```bash
# On primary, stop trading bot
docker compose stop trading-bot

# Watch secondary logs - should activate within 45s
# On secondary
sudo journalctl -u trading-bot-ha -f
```

**Restore primary:**
```bash
# On primary, restart trading bot
docker compose up -d trading-bot

# Watch secondary - should deactivate within 15s
```

### Database Sync Schedule

**Daily sync from primary to secondary:**

On primary, add to crontab:
```bash
crontab -e
# Add:
0 2 * * * /home/icke/traderv4/ha-setup/sync-db-daily.sh >> /var/log/trading-bot-db-sync.log 2>&1
```

- **Before failover events:** Secondary uses the last synced DB state (trade history up to 24h old)
- **After failover:** Secondary continues with current state, syncs back to primary when recovered

---

## Important Notes

### Financial Safety

- **NEVER run both servers actively** - would cause duplicate trades and wallet conflicts
- **Failover controller is designed to keep** only one server active at a time; if the secondary cannot reach a primary that is still running, both can end up active (see Split-Brain under Troubleshooting)
- **Same wallet key** required on both servers
- **Same n8n webhook endpoint** - update TradingView alerts if needed

### Database Consistency

- **Daily sync:** Keeps secondary within 24h of primary
- **Trade history:** May have small gap after failover (acceptable)
- **Position Manager:** Rebuilds state from Drift Protocol on startup
- **No financial loss:** Drift Protocol is source of truth for positions

### Network Requirements

- **Secondary → Primary:** SSH access (port 22) for health checks
- **Both → Internet:** For Drift Protocol, Telegram, n8n webhooks
- **n8n:** Can run on both or centralized (needs webhook routing)

### Testing Recommendations

1. **Week 1:** Run without failover, just monitor health checks
2. **Week 2:** Test manual failover (stop primary, verify secondary takes over)
3. **Week 3:** Test recovery (restart primary, verify secondary stops)
4. **Week 4:** Enable automatic failover for production

---

## Troubleshooting

### Secondary Won't Start After Failover

```bash
# Check logs
docker logs trading-bot-v4

# Check .env file exists
ls -la /home/icke/traderv4/.env

# Check Drift initialization (2>&1 so lines the container wrote to stderr are searched too)
docker logs trading-bot-v4 2>&1 | grep "Drift"
```

### Split-Brain (Both Servers Active)

**EMERGENCY - Stop both immediately:**
```bash
# On secondary, stop the failover controller first so it does not restart the bot
sudo systemctl stop trading-bot-ha

# On both servers
docker compose stop trading-bot
```

**Then restart only primary:**
```bash
# On primary only
docker compose up -d trading-bot
```

**Check Drift positions:**
```bash
curl -s http://localhost:3001/api/trading/positions \
  -H "Authorization: Bearer ${API_SECRET_KEY}" | jq .
```

### Health Check False Positives

Adjust thresholds in `failover-controller.sh`:
```bash
CHECK_INTERVAL=30  # Slower checks (reduce network load)
MAX_FAILURES=5     # More tolerant (reduce false failovers)
```

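Raising these thresholds also lengthens the time before failover triggers. A quick way to see the trade-off is to multiply the two values, which gives an upper bound on detection time (ignoring how long each check itself takes). The helper name `detection_window` is hypothetical:

```bash
# detection_window CHECK_INTERVAL MAX_FAILURES
# Prints the upper bound, in seconds, between a primary failure and failover.
detection_window() {
  echo $(( $1 * $2 ))
}

detection_window 15 3   # defaults described in "How It Works": prints 45
detection_window 30 5   # more tolerant settings above: prints 150
```
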
---

## Cost Analysis

- **Primary Server:** Always running (existing cost)
- **Secondary Server:** Always running, but mostly idle

**Benefits:**
- **99.9% uptime** vs 95% single server
- **~8.8 hours/year** max downtime at 99.9% uptime (including failover time), vs ~438 hours/year at 95%
- **Financial protection** - fewer missed trades during outages (failover takes up to ~45s)
- **Peace of mind** - sleep without worrying about server crashes

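The downtime that a given uptime percentage allows can be checked with a one-line calculation, assuming a 365-day year (8760 hours). The helper name `downtime_hours` is hypothetical, and the uptime percentages themselves are this document's estimates, not measurements:

```bash
# downtime_hours UPTIME_PERCENT
# Prints the hours of downtime per 365-day year that the given uptime allows.
downtime_hours() {
  awk -v up="$1" 'BEGIN { printf "%.2f\n", (100 - up) / 100 * 8760 }'
}

downtime_hours 99.9   # prints 8.76
downtime_hours 95     # prints 438.00
```
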
**Worth it?** YES - For a financial system, redundancy is essential.

---

## Future Enhancements

1. **Geographic redundancy:** Secondary in different datacenter/region
2. **Load balancer:** Route n8n webhooks to active server automatically
3. **Database streaming replication:** Real-time sync (0 data loss)
4. **Multi-region:** Three servers (US, EU, Asia) for global coverage
5. **Health dashboard:** Web UI showing HA status and metrics