trading_bot_v4/docs/RATE_LIMIT_MONITORING.md

# Rate Limit Monitoring - SQL Queries

## Quick Access
```bash
# View rate limit analytics via API
curl http://localhost:3001/api/analytics/rate-limits | python3 -m json.tool

# Direct database queries
docker exec trading-bot-postgres psql -U postgres -d trading_bot_v4
```

## Common Queries

### 1. Recent Rate Limit Events (Last 24 Hours)
```sql
SELECT
  "eventType",
  message,
  details,
  TO_CHAR("createdAt", 'MM-DD HH24:MI:SS') as time
FROM "SystemEvent"
WHERE "eventType" IN ('rate_limit_hit', 'rate_limit_recovered', 'rate_limit_exhausted')
  AND "createdAt" > NOW() - INTERVAL '24 hours'
ORDER BY "createdAt" DESC
LIMIT 20;
```

### 2. Rate Limit Statistics (Last 7 Days)
```sql
SELECT
  "eventType",
  COUNT(*) as occurrences,
  MIN("createdAt") as first_seen,
  MAX("createdAt") as last_seen
FROM "SystemEvent"
WHERE "eventType" IN ('rate_limit_hit', 'rate_limit_recovered', 'rate_limit_exhausted')
  AND "createdAt" > NOW() - INTERVAL '7 days'
GROUP BY "eventType"
ORDER BY occurrences DESC;
```

### 3. Rate Limit Pattern by Hour (Find Peak Times)
```sql
SELECT
  EXTRACT(HOUR FROM "createdAt") as hour,
  COUNT(*) as rate_limit_hits,
  COUNT(DISTINCT DATE("createdAt")) as days_affected
FROM "SystemEvent"
WHERE "eventType" = 'rate_limit_hit'
  AND "createdAt" > NOW() - INTERVAL '7 days'
GROUP BY EXTRACT(HOUR FROM "createdAt")
ORDER BY rate_limit_hits DESC;
```

### 4. Recovery Time Analysis
```sql
SELECT
  (details->>'retriesNeeded')::int as retries,
  (details->>'totalTimeMs')::int as recovery_ms,
  TO_CHAR("createdAt", 'MM-DD HH24:MI:SS') as recovered_at
FROM "SystemEvent"
WHERE "eventType" = 'rate_limit_recovered'
  AND "createdAt" > NOW() - INTERVAL '7 days'
ORDER BY recovery_ms DESC;
```

### 5. Failed Recoveries (Exhausted Retries)
```sql
SELECT
  details->>'errorMessage' as error,
  (details->>'totalTimeMs')::int as failed_after_ms,
  TO_CHAR("createdAt", 'MM-DD HH24:MI:SS') as failed_at
FROM "SystemEvent"
WHERE "eventType" = 'rate_limit_exhausted'
  AND "createdAt" > NOW() - INTERVAL '7 days'
ORDER BY "createdAt" DESC;
```

### 6. Rate Limit Health Score (Last 24h)
```sql
SELECT
  COUNT(CASE WHEN "eventType" = 'rate_limit_hit' THEN 1 END) as total_hits,
  COUNT(CASE WHEN "eventType" = 'rate_limit_recovered' THEN 1 END) as recovered,
  COUNT(CASE WHEN "eventType" = 'rate_limit_exhausted' THEN 1 END) as failed,
  CASE
    WHEN COUNT(CASE WHEN "eventType" = 'rate_limit_hit' THEN 1 END) = 0 THEN '✅ HEALTHY'
    WHEN COUNT(CASE WHEN "eventType" = 'rate_limit_exhausted' THEN 1 END) > 0 THEN '🔴 CRITICAL'
    WHEN COUNT(CASE WHEN "eventType" = 'rate_limit_hit' THEN 1 END) > 10 THEN '⚠️ WARNING'
    ELSE '✅ HEALTHY'
  END as health_status,
  ROUND(100.0 * COUNT(CASE WHEN "eventType" = 'rate_limit_recovered' THEN 1 END) /
        NULLIF(COUNT(CASE WHEN "eventType" = 'rate_limit_hit' THEN 1 END), 0), 1) as recovery_rate
FROM "SystemEvent"
WHERE "eventType" IN ('rate_limit_hit', 'rate_limit_recovered', 'rate_limit_exhausted')
  AND "createdAt" > NOW() - INTERVAL '24 hours';
```

## What to Watch For

### 🔴 Critical Alerts
- **rate_limit_exhausted** events: Order placement/cancellation failed completely
- Recovery rate below 80%: System struggling to handle rate limits
- Multiple exhausted events in short time: RPC endpoint may be degraded

### ⚠️ Warnings
- More than 10 rate_limit_hit events per hour: High trading frequency
- Recovery times > 10 seconds: Backoff delays stacking up
- Rate limits during specific hours: Identify peak Solana network times

### ✅ Healthy Patterns
- 100% recovery rate: All rate limits handled successfully
- Recovery times 2-4 seconds: Retries working efficiently
- Zero rate_limit_exhausted events: No failed operations

## Optimization Actions

**If seeing frequent rate limits:**
1. Increase `baseDelay` in `retryWithBackoff()` (currently 2000ms)
2. Add delay between `cancelAllOrders()` and `placeExitOrders()` (currently immediate)
3. Consider using a faster RPC endpoint (Helius Pro, Triton, etc.)
4. Batch order operations if possible

**If seeing exhausted retries:**
1. Increase `maxRetries` from 3 to 5
2. Increase exponential backoff multiplier (currently 2x)
3. Check RPC endpoint health/status page
4. Consider implementing circuit breaker pattern

## Live Monitoring Commands

```bash
# Watch rate limits in real-time
docker logs -f trading-bot-v4 | grep -i "rate limit"

# Count rate limit events today
docker exec trading-bot-postgres psql -U postgres -d trading_bot_v4 -c "
SELECT COUNT(*) FROM \"SystemEvent\"
WHERE \"eventType\" = 'rate_limit_hit'
AND DATE(\"createdAt\") = CURRENT_DATE;"

# Check latest rate limit event
docker exec trading-bot-postgres psql -U postgres -d trading_bot_v4 -c "
SELECT * FROM \"SystemEvent\"
WHERE \"eventType\" IN ('rate_limit_hit', 'rate_limit_recovered', 'rate_limit_exhausted')
ORDER BY \"createdAt\" DESC LIMIT 1;"
```

## Integration with Alerts

When implementing automated alerts, trigger on:
- Any `rate_limit_exhausted` event (critical)
- More than 5 `rate_limit_hit` events in 5 minutes (warning)
- Recovery rate below 90% over 1 hour (warning)

Log format examples:
```
✅ Retry successful after 2341ms (1 retries)
⏳ Rate limited (429), retrying in 2s... (attempt 1/3)
❌ RATE LIMIT EXHAUSTED: Failed after 3 retries and 14523ms
```