trading_bot_v4/ha-setup/README.md
mindesbunister 880aae9a77 feat: Add High Availability setup roadmap and scripts
Created comprehensive HA roadmap with 6 phases:
- Phase 1: Warm standby (CURRENT - manual failover)
- Phase 2: Database replication
- Phase 3: Health monitoring
- Phase 4: Reverse proxy + floating IP
- Phase 5: Automated failover
- Phase 6: Geographic redundancy

Includes:
- Decision gates based on capital and stability
- Cost-benefit analysis
- Scripts for healthcheck, failover, DB sync
- Recommendation to defer full HA until capital > $5k

Secondary server ready at 72.62.39.24 for emergency manual failover.

Related: User concern about system uptime, but full HA complexity
not justified at current scale (~$600 capital). Revisit in Q1 2026.
2025-11-19 20:52:12 +01:00

High Availability Setup for Trading Bot v4

Architecture: Active-Passive Failover

  • Primary Server (Active): Runs trading bot 24/7
  • Secondary Server (Passive): Monitors primary, takes over on failure

Why Active-Passive (Not Active-Active)?

  • Prevents duplicate trades - CRITICAL for financial system
  • Single source of truth - One Position Manager tracking state
  • No split-brain scenarios - Only one bot executes trades
  • Database consistency - No conflicting writes

Setup Instructions

1. Prerequisites

  • Primary Server: root@192.168.1.100 (update in scripts)
  • Secondary Server: root@72.62.39.24

Both servers need:

  • Docker & Docker Compose installed
  • Trading bot project at /home/icke/traderv4
  • Same .env file (especially DRIFT_WALLET_PRIVATE_KEY)
  • Same n8n workflows configured

2. Initial Sync (Already Done via rsync)

# From primary server
rsync -avz --exclude 'node_modules' --exclude '.next' \
  /home/icke/traderv4/ root@72.62.39.24:/home/icke/traderv4/

3. Database Synchronization

Option A: Manual Sync (Simpler, Recommended for Start)

On primary:

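# --clean --if-exists adds DROP statements so the dump restores cleanly over the previous sync
docker exec trading-bot-postgres pg_dump -U postgres --clean --if-exists trading_bot_v4 > /tmp/trading_bot_backup.sql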
rsync -avz /tmp/trading_bot_backup.sql root@72.62.39.24:/tmp/

On secondary:

docker exec -i trading-bot-postgres psql -U postgres trading_bot_v4 < /tmp/trading_bot_backup.sql

Run this daily via cron on primary:

0 2 * * * /home/icke/traderv4/ha-setup/sync-db-daily.sh
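
The contents of sync-db-daily.sh are not shown in this README. As a rough sketch only, a script along these lines would chain the Option A steps. The `sync_db` function, the `SECONDARY` and `DUMP_FILE` variables, and the choice to trigger the restore over SSH from the primary are all assumptions, not the repository's actual script.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a daily DB sync (not the repository's sync-db-daily.sh).
# Assumed names: SECONDARY, DUMP_FILE, sync_db.
SECONDARY="${SECONDARY:-root@72.62.39.24}"
DUMP_FILE="${DUMP_FILE:-/tmp/trading_bot_backup.sql}"

sync_db() {
  local tmp="${DUMP_FILE}.partial"
  local remote_file
  remote_file="/tmp/$(basename "$DUMP_FILE")"

  # Dump to a temporary file first so a failed dump never replaces the last good one.
  # --clean --if-exists adds DROP statements, so the restore can overwrite existing tables.
  if ! docker exec trading-bot-postgres \
      pg_dump -U postgres --clean --if-exists trading_bot_v4 > "$tmp"; then
    rm -f "$tmp"
    echo "DB sync: pg_dump failed" >&2
    return 1
  fi
  mv "$tmp" "$DUMP_FILE"

  # Copy the dump to the secondary, then restore it into the secondary's database.
  rsync -avz "$DUMP_FILE" "${SECONDARY}:/tmp/" || return 1
  ssh "$SECONDARY" \
    "docker exec -i trading-bot-postgres psql -U postgres trading_bot_v4 < '$remote_file'"
}
```

A script built this way would end with a call to `sync_db`, and a non-zero exit status would show up in whatever log the cron entry redirects to.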

Option B: Streaming Replication (Advanced)

# On primary
bash ha-setup/setup-db-replication.sh primary

# On secondary
bash ha-setup/setup-db-replication.sh secondary

4. Setup Health Monitoring

Make scripts executable:

chmod +x ha-setup/*.sh

Test healthcheck on both servers:

bash ha-setup/healthcheck.sh
# Should output: ✅ HEALTHY: All checks passed
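
healthcheck.sh itself is not reproduced here. A minimal sketch of a check that could produce the output above, assuming it only verifies that the two containers named in this README are running. The function names and the `BOT_CONTAINER` / `DB_CONTAINER` variables are hypothetical, and the real script may check more (for example the bot's HTTP API).

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a container-level health check (not the real healthcheck.sh).
BOT_CONTAINER="${BOT_CONTAINER:-trading-bot-v4}"
DB_CONTAINER="${DB_CONTAINER:-trading-bot-postgres}"

# Succeeds if a running container has exactly the given name.
container_running() {
  docker ps --format '{{.Names}}' | grep -qxF -- "$1"
}

# Prints a one-line verdict and returns 0 (healthy) or 1 (unhealthy).
run_healthcheck() {
  local name
  for name in "$BOT_CONTAINER" "$DB_CONTAINER"; do
    if ! container_running "$name"; then
      echo "❌ UNHEALTHY: container $name is not running"
      return 1
    fi
  done
  echo "✅ HEALTHY: All checks passed"
}
```

Returning a non-zero status on failure lets a caller such as the failover controller count failed checks without parsing the message.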

5. Start Failover Controller (SECONDARY ONLY)

Edit configuration first:

nano ha-setup/failover-controller.sh
# Update PRIMARY_HOST with actual IP
# Update SECONDARY_HOST if needed

Run as systemd service:

sudo cp ha-setup/trading-bot-ha.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable trading-bot-ha
sudo systemctl start trading-bot-ha

Check status:

sudo systemctl status trading-bot-ha
sudo journalctl -u trading-bot-ha -f

6. SSH Key Setup (Password-less Auth)

Secondary needs SSH access to primary for health checks:

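# On secondary (leave the passphrase empty so the systemd service can use the key unattended)
ssh-keygen -t ed25519 -f /root/.ssh/trading_bot_ha
ssh-copy-id -i /root/.ssh/trading_bot_ha root@192.168.1.100

# Test connection (-i is needed because the key is not at a default path)
ssh -i /root/.ssh/trading_bot_ha root@192.168.1.100 "docker ps | grep trading-bot"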
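
With the key in place, the secondary can run the primary's health check remotely. The following is a sketch of such a probe, not the code in failover-controller.sh. `primary_healthy` and `SSH_KEY` are hypothetical names, and `PRIMARY_HOST` is assumed to hold the same value configured in the controller.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: how the secondary might probe the primary over SSH.
PRIMARY_HOST="${PRIMARY_HOST:-root@192.168.1.100}"
SSH_KEY="${SSH_KEY:-/root/.ssh/trading_bot_ha}"

# Returns 0 if the primary's healthcheck passes, non-zero otherwise
# (including when the primary cannot be reached at all).
primary_healthy() {
  # BatchMode=yes fails instead of prompting for a password;
  # ConnectTimeout bounds the wait when the primary is unreachable.
  ssh -i "$SSH_KEY" -o BatchMode=yes -o ConnectTimeout=5 "$PRIMARY_HOST" \
    "bash /home/icke/traderv4/ha-setup/healthcheck.sh" >/dev/null 2>&1
}
```

The timeout matters here: without it, a hung connection could delay failure detection well beyond the check interval.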

How It Works

Normal Operation (Primary Active)

  1. Primary: Trading bot running, executing trades
  2. Secondary: Failover controller checks primary every 15s
  3. Secondary: Bot container STOPPED (passive standby)

Failover Scenario

  1. Primary fails (server down, docker crash, API unresponsive)
  2. Secondary detects 3 consecutive failed health checks (45s)
  3. Telegram alert sent: "🚨 HA FAILOVER: Primary failed, activating secondary"
  4. Secondary starts trading bot container
  5. Trading continues on secondary with same wallet/config
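
The alert in step 3 can be sent with a single call to the Telegram Bot API `sendMessage` method. A sketch under assumptions: `send_alert`, `TELEGRAM_BOT_TOKEN` and `TELEGRAM_CHAT_ID` are hypothetical names, and the actual scripts may read their credentials differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the alert step. TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID
# are assumed variable names, not taken from the repository.
send_alert() {
  local text="$1"
  if [ -z "${TELEGRAM_BOT_TOKEN:-}" ] || [ -z "${TELEGRAM_CHAT_ID:-}" ]; then
    echo "send_alert: Telegram credentials not set" >&2
    return 1
  fi
  # --data-urlencode makes curl send a POST and safely encodes emoji and spaces.
  # -f turns HTTP errors into a non-zero exit status; --max-time bounds the call.
  curl -fsS --max-time 10 \
    "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
    --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
    --data-urlencode "text=${text}" >/dev/null
}
```

For example, `send_alert "🚨 HA FAILOVER: Primary failed, activating secondary"` would deliver the message quoted above. A failed alert should probably not block the failover itself, so a caller may want to ignore its exit status.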

Recovery Scenario

  1. Primary recovers (you fix it, restart, etc.)
  2. Secondary detects primary is healthy again
  3. Secondary stops its trading bot (returns to standby)
  4. Telegram alert: "Primary recovered, secondary deactivated"
  5. Primary resumes as active node
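
The failover and recovery scenarios above amount to a small state machine. The sketch below shows only that decision logic. It is not the actual failover-controller.sh, and `record_check`, `ACTION`, `failures` and `secondary_active` are hypothetical names. The values 15 and 3 are the interval and failure count described above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the failover decision logic (not the real failover-controller.sh).
CHECK_INTERVAL=15   # seconds between health checks (used by the caller's sleep)
MAX_FAILURES=3      # consecutive failures before failover

failures=0          # consecutive failed checks so far
secondary_active=0  # 1 while the secondary is running the bot
ACTION=none         # set by record_check: none | activate | deactivate

# record_check ok|fail
# Updates the counters and sets ACTION to what the controller should do next.
record_check() {
  ACTION=none
  if [ "$1" = ok ]; then
    failures=0
    if [ "$secondary_active" -eq 1 ]; then
      secondary_active=0
      ACTION=deactivate   # primary is back: return to standby
    fi
  else
    failures=$((failures + 1))
    if [ "$failures" -ge "$MAX_FAILURES" ] && [ "$secondary_active" -eq 0 ]; then
      secondary_active=1
      ACTION=activate     # threshold reached: take over
    fi
  fi
}
```

A controller loop would call `record_check` once per `CHECK_INTERVAL` and act on `ACTION`, starting the bot container on `activate` and stopping it on `deactivate`. Because a single successful check resets the counter, isolated network blips do not trigger a failover.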

Monitoring & Maintenance

Check HA Status

On secondary:

# View failover controller logs
sudo journalctl -u trading-bot-ha -f --lines=50

# Check if secondary is active
docker ps | grep trading-bot-v4

On primary:

# Run healthcheck manually
bash ha-setup/healthcheck.sh

# Check container status
docker ps | grep trading-bot-v4

Manual Failover Testing

Simulate primary failure:

# On primary, stop trading bot
docker compose stop trading-bot

# Watch secondary logs - should activate within 45s
# On secondary
sudo journalctl -u trading-bot-ha -f

Restore primary:

# On primary, restart trading bot
docker compose up -d trading-bot

# Watch secondary - should deactivate within 15s

Database Sync Schedule

Daily sync from primary to secondary:

On primary, add to crontab:

crontab -e
# Add:
0 2 * * * /home/icke/traderv4/ha-setup/sync-db-daily.sh >> /var/log/trading-bot-db-sync.log 2>&1

  • Before failover events: Secondary uses last synced DB state (trade history up to 24h old)
  • After failover: Secondary continues with current state, syncs back to primary when recovered


Important Notes

Financial Safety

  • NEVER run both servers actively - would cause duplicate trades and wallet conflicts
  • Failover controller ensures only one active at a time
  • Same wallet key required on both servers
  • Same n8n webhook endpoint - update TradingView alerts if needed

Database Consistency

  • Daily sync: Keeps secondary within 24h of primary
  • Trade history: May have small gap after failover (acceptable)
  • Position Manager: Rebuilds state from Drift Protocol on startup
  • No financial loss: Drift Protocol is source of truth for positions

Network Requirements

  • Secondary → Primary: SSH access (port 22) for health checks
  • Both → Internet: For Drift Protocol, Telegram, n8n webhooks
  • n8n: Can run on both or centralized (needs webhook routing)

Testing Recommendations

  1. Week 1: Run without failover, just monitor health checks
  2. Week 2: Test manual failover (stop primary, verify secondary takes over)
  3. Week 3: Test recovery (restart primary, verify secondary stops)
  4. Week 4: Enable automatic failover for production

Troubleshooting

Secondary Won't Start After Failover

# Check logs
docker logs trading-bot-v4

# Check .env file exists
ls -la /home/icke/traderv4/.env

# Check Drift initialization
docker logs trading-bot-v4 | grep "Drift"

Split-Brain (Both Servers Active)

EMERGENCY - Stop both immediately:

# On both servers
docker compose stop trading-bot

Then restart only primary:

# On primary only
docker compose up -d trading-bot

Check Drift positions:

curl -s http://localhost:3001/api/trading/positions \
  -H "Authorization: Bearer ${API_SECRET_KEY}" | jq .
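
To confirm whether both nodes are running the bot at the same time, the secondary can compare its own container list with the primary's. A sketch only: `split_brain_detected`, `bot_listed`, `SSH_KEY` and `BOT_CONTAINER` are hypothetical names, and it assumes the SSH key from the setup section.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, intended to run on the secondary.
PRIMARY_HOST="${PRIMARY_HOST:-root@192.168.1.100}"
SSH_KEY="${SSH_KEY:-/root/.ssh/trading_bot_ha}"
BOT_CONTAINER="${BOT_CONTAINER:-trading-bot-v4}"

# Reads container names on stdin; succeeds if the bot container is among them.
bot_listed() {
  grep -qxF -- "$BOT_CONTAINER"
}

# Returns 0 only if the bot container is running locally AND on the primary.
split_brain_detected() {
  local local_up=0 remote_up=0
  if docker ps --format '{{.Names}}' | bot_listed; then
    local_up=1
  fi
  # An unreachable primary produces no output and therefore counts as "not running".
  if ssh -i "$SSH_KEY" -o BatchMode=yes -o ConnectTimeout=5 "$PRIMARY_HOST" \
      "docker ps --format '{{.Names}}'" | bot_listed; then
    remote_up=1
  fi
  [ "$local_up" -eq 1 ] && [ "$remote_up" -eq 1 ]
}
```

Note the limitation: if the primary is unreachable over SSH but still trading, this check cannot see it, so the position query above remains the authoritative check.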

Health Check False Positives

Adjust thresholds in failover-controller.sh:

CHECK_INTERVAL=30  # Slower checks (reduce network load)
MAX_FAILURES=5     # More tolerant (reduce false failovers)
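
Raising these values trades fewer false failovers for slower detection. As a rough guide, and ignoring probe timeouts and container start time, the detection delay is about CHECK_INTERVAL × MAX_FAILURES. The helper name `detection_seconds` is only for illustration.

```shell
#!/usr/bin/env bash
# Approximate time until failover triggers, in seconds.
detection_seconds() {
  local interval="$1" max_failures="$2"
  echo $((interval * max_failures))
}

detection_seconds 15 3   # values described under "How It Works": prints 45
detection_seconds 30 5   # relaxed values shown above: prints 150
```

With the relaxed settings the bot could therefore be down for about two and a half minutes before the secondary takes over.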

Cost Analysis

  • Primary Server: Always running (existing cost)
  • Secondary Server: Always running, but mostly idle

Benefits:

  • 99.9% target uptime vs roughly 95% for a single server
  • ~8.8 hours/year max downtime at 99.9% (vs ~18 days/year at 95%)
  • Financial protection - missed trades limited to the failover window (~45s detection plus container start)
  • Peace of mind - sleep without worrying about server crashes

Worth it? YES - For a financial system, redundancy is essential.


Future Enhancements

  1. Geographic redundancy: Secondary in different datacenter/region
  2. Load balancer: Route n8n webhooks to active server automatically
  3. Database streaming replication: Real-time sync (0 data loss)
  4. Multi-region: Three servers (US, EU, Asia) for global coverage
  5. Health dashboard: Web UI showing HA status and metrics