# Docker Infrastructure Improvement Roadmap

**Generated:** November 11, 2025  
**Status:** Planning Phase  
**Total Services:** 39 running containers

---

## Overview

This roadmap addresses critical issues, security vulnerabilities, and operational improvements identified in the Docker Compose infrastructure. The plan has four phases, ordered so that quick wins and critical security fixes come first.

---

## Phase 1: Quick Wins (Low Risk, High Impact)

**Estimated Time:** 2-4 hours  
**Risk Level:** Low  
**Downtime:** Minimal

### Tasks

- [ ] **Change N8N password** from "changeme" to a secure password
  - File: `n8n.yml`
  - Impact: Critical security fix
  - Downtime: < 1 minute

- [ ] **Add healthchecks to critical services**
  - [ ] Bitwarden (password manager)
  - [ ] Gitea (code repository)
  - [ ] N8N (automation)
  - [ ] Synapse (Matrix server)
  - [ ] MariaDB instances
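  - Benefit: Health status becomes visible to monitoring and usable as a `depends_on` condition. Note that plain Docker Compose does not restart a container just because it is unhealthy; auto-restart on failed healthchecks needs Swarm or a helper such as autoheal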
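  - Sketch (not taken from the existing files): a healthcheck for one MariaDB instance, assuming an official `mariadb` image that ships the `healthcheck.sh` helper. The service name `mariadb` and all timings are placeholders.
    ```yaml
    services:
      mariadb:
        image: mariadb:10.11
        healthcheck:
          # healthcheck.sh is provided by the official mariadb image
          test: ["CMD", "healthcheck.sh", "--connect", "--innodb_initialized"]
          interval: 30s       # how often the check runs
          timeout: 5s         # a slower check counts as failed
          retries: 3          # consecutive failures before "unhealthy"
          start_period: 60s   # grace period while the database boots
    ```
    Data directories initialized by older images may lack the credentials that `--connect` relies on, so the check should be verified per instance before rollout.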

- [ ] **Enable Loki logging for remaining 15 services**
  - Services missing logging: element-web, telegram-bridge, whatsapp-bridge, piper, whisper, gitea, coturn, trading-bot, postgres, and others
  - Benefit: Centralized log management
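  - Sketch, assuming the 24 services that already log to Loki use the Loki Docker logging driver plugin and that Loki listens on `localhost:3100` (both are assumptions; adjust to the real setup). A YAML anchor keeps the block identical across services.
    ```yaml
    # Reusable logging block; the x- prefix marks it as an extension field
    x-logging: &loki-logging
      driver: loki
      options:
        loki-url: "http://localhost:3100/loki/api/v1/push"   # assumed Loki endpoint
        loki-retries: "5"

    services:
      gitea:
        logging: *loki-logging
      coturn:
        logging: *loki-logging
    ```
    Anchors only work within a single file, so each compose file would need its own `x-logging` block.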

- [ ] **Add `depends_on` to multi-container stacks**
  - [ ] Blog → mysql-blog
  - [ ] Helferlein → mysql-helferlein
  - [ ] Traccar → mysql-traccar
  - [ ] Zabbix components
  - [ ] Matrix bridges → Synapse
  - Benefit: Dependencies are started first (this controls start order only, not readiness, unless a healthcheck condition is used)
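  - Sketch for the Blog stack, assuming Compose v2 and that `mysqladmin` is available in the database image. The `blog` image name, the `mysql:5.7` tag, and the timings are placeholders, not the actual file contents.
    ```yaml
    services:
      blog:
        image: example/blog:1.0          # placeholder image
        depends_on:
          mysql-blog:
            condition: service_healthy   # wait until the healthcheck passes
      mysql-blog:
        image: mysql:5.7
        healthcheck:
          # ping over TCP; exits 0 as soon as the server answers
          test: ["CMD", "mysqladmin", "ping", "-h", "127.0.0.1", "--silent"]
          interval: 10s
          timeout: 5s
          retries: 10
    ```
    The short form `depends_on: [mysql-blog]` would only order container start; the long form with `service_healthy` also waits for the database to accept connections.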

---

## Phase 2: Security Hardening (Medium Risk)

**Estimated Time:** 4-8 hours  
**Risk Level:** Medium  
**Downtime:** 5-10 minutes per service

### Tasks

- [ ] **Move passwords to environment files**
  - [ ] Create `/home/icke/env_files/` directory structure
  - [ ] Move passwords from compose files to `.env` files:
    - [ ] blog.yml → `eccmts42*`
    - [ ] nextcloud.yml → `eccmts42*`
    - [ ] helferlein.yml → `eccmts42*`
    - [ ] traccar.yml → `eccmts42*`
    - [ ] wallabag.yml → `eccmts42*`
    - [ ] zabbix.yml → `eccmts42*`
    - [ ] firefly.yml → `firefly_secure_password_123`
    - [ ] matamo.yml → `matomo`
    - [ ] n8n.yml → new secure password
  - [ ] Update `.gitignore` to exclude `.env` files
  - [ ] Document password locations in separate secure file
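  - Sketch of the target layout for one stack. The file name `blog.env` is hypothetical; `MYSQL_ROOT_PASSWORD` is the variable the official MySQL image reads, and the other stacks may use different variable names.
    ```yaml
    # blog.yml (fragment)
    services:
      mysql-blog:
        env_file:
          - /home/icke/env_files/blog.env   # hypothetical file name
    # /home/icke/env_files/blog.env would then contain lines such as:
    #   MYSQL_ROOT_PASSWORD=<new password>
    ```
    The env file should be readable only by its owner, since it now holds the only copy of the password.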

- [ ] **Move admin tokens and shared secrets out of compose files**
  - [ ] Bitwarden admin token → env file
  - [ ] Firefly cron token → env file
  - [ ] Coturn static auth secret → config file

- [ ] **Create dedicated networks for isolated services**
  - [ ] Element-web (currently no explicit network)
  - [ ] Telegram-bridge (currently no explicit network)
  - [ ] Whatsapp-bridge (currently no explicit network)
  - [ ] Piper (currently no explicit network)
  - [ ] Whisper (currently no explicit network)
  - [ ] Coturn (currently no explicit network)

- [ ] **Remove services from shared default network**
  - Services on `compose_files_default`:
    - [ ] n8n → dedicated network
    - [ ] plex → dedicated network
    - [ ] whisper → dedicated network
    - [ ] unifi → dedicated network
    - [ ] synapse + bridges → shared matrix network
    - [ ] piper → dedicated network
    - [ ] coturn → can stay (needs to be accessible)
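  - Sketch of the intended layout, with one shared network for the Matrix stack and a dedicated one for n8n. The network names `matrix` and `n8n` are assumptions.
    ```yaml
    services:
      synapse:
        networks:
          - matrix
      telegram-bridge:
        networks:
          - matrix
      n8n:
        networks:
          - n8n

    networks:
      matrix:   # shared by Synapse and its bridges
      n8n:      # dedicated to n8n
    ```
    If Synapse and the bridges live in separate compose files, the shared network would have to be created once and referenced with `external: true` in the other files.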

- [ ] **Remove deprecated `links:` directives** (7 instances)
  - [ ] blog.yml
  - [ ] helferlein.yml
  - [ ] traccar.yml
  - [ ] zabbix.yml
  - Replace with network aliases and `depends_on`
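  - Sketch of the replacement for a link such as `mysql-blog:db`. The alias `db` and the network name `blog` are hypothetical; use whatever hostname the application is currently configured with.
    ```yaml
    services:
      blog:
        # links:
        #   - mysql-blog:db      # old form, to be removed
        depends_on:
          - mysql-blog
        networks:
          - blog
      mysql-blog:
        networks:
          blog:
            aliases:
              - db               # keeps the hostname the blog already uses

    networks:
      blog:
    ```
    Containers on the same network already resolve each other by service name, so an alias is only needed where the old link defined a different hostname.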

- [ ] **Review and fix user permissions**
  - [ ] Plex: Change from UID=0 to a dedicated non-root user
  - [ ] Jellyfin: Change from UID=0 to a dedicated non-root user
  - [ ] Verify other services aren't running as root unnecessarily
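  - Sketch using the Compose `user:` key, assuming the image supports running as an arbitrary user and that `1000:1000` is the UID:GID owning the media and config directories on the host (both are assumptions).
    ```yaml
    services:
      jellyfin:
        user: "1000:1000"   # assumed non-root UID:GID; must own the mounted directories
    ```
    LinuxServer.io-style images expect `PUID` and `PGID` environment variables instead of `user:`, so the mechanism depends on which image each service uses.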

---

## Phase 3: Stability & Reliability Improvements (Medium-High Risk)

**Estimated Time:** 8-16 hours  
**Risk Level:** Medium-High  
**Downtime:** 10-30 minutes per service

### Tasks

- [ ] **Remove `container_name` from all services** (54 instances)
  - Use compose project naming with network aliases instead
  - Prevents stale endpoint issues after `docker system prune`
  - Priority services:
    - [ ] bitwarden.yml
    - [ ] blog.yml
    - [ ] gitea.yml
    - [ ] jellyfin.yml
    - [ ] plex.yml
    - [ ] synapse.yml
    - [ ] n8n.yml
    - [ ] unifi.yml
    - [ ] zabbix.yml (multiple containers)
    - [ ] firefly.yml (multiple containers)
    - [ ] Element-web, bridges (all)
    - [ ] Trading bot components
  - Note: Nextcloud already fixed ✅

- [ ] **Remove static IP addresses** (16 instances)
  - [ ] bitwarden.yml → use DNS aliases
  - [ ] blog.yml → use DNS aliases
  - [ ] jellyfin.yml → use DNS aliases
  - [ ] zabbix.yml → use DNS aliases
  - Replace with network aliases for service discovery
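  - Sketch of the change for one service. The network name and the commented-out address are made-up examples, not values from the existing files.
    ```yaml
    services:
      bitwarden:
        # container_name: bitwarden       # removed: Compose v2 names it <project>-bitwarden-1
        networks:
          bitwarden:
            # ipv4_address: 172.20.0.10   # removed (example address)
            aliases:
              - vault                     # optional extra DNS name (hypothetical)

    networks:
      bitwarden:
    ```
    Other containers on the same network can reach the service as `bitwarden` (the service name) without any alias; reverse proxy configs that point at the old IP would need updating to the DNS name.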

- [ ] **Add resource limits to all services**
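  - Template (adjust per service; `docker compose` v2 applies `deploy.resources` limits outside Swarm, while legacy `docker-compose` v1 needs `--compatibility`):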
    ```yaml
    deploy:
      resources:
        limits:
          memory: 1G
          cpus: '0.5'
        reservations:
          memory: 256M
    ```
  - Priority services to limit:
    - [ ] Plex (media server - high memory)
    - [ ] Jellyfin (media server - high memory)
    - [ ] N8N (automation - can grow)
    - [ ] Nextcloud (web app - high memory)
    - [ ] Synapse (Matrix - high memory)
    - [ ] MySQL/MariaDB instances
    - [ ] Zabbix server
  - Less critical services: 512M limits

- [ ] **Standardize compose file format**
  - [ ] Remove `version:` declarations (obsolete in the current Compose Specification; Compose v2 ignores the field and prints a warning)
  - [ ] Use consistent YAML formatting
  - [ ] Add comments for complex configurations

- [ ] **Add volume backup labels/annotations**
  - Label critical data volumes:
    - [ ] Bitwarden data
    - [ ] Gitea data
    - [ ] Nextcloud data
    - [ ] Database volumes
    - [ ] N8N workflows
  - Prepare for automated backup solutions
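  - Sketch of labelling a named volume. The label keys are hypothetical and would need to match whatever backup tool is chosen later; `/data` is the data path of the official Gitea image.
    ```yaml
    services:
      gitea:
        volumes:
          - gitea-data:/data

    volumes:
      gitea-data:
        labels:
          backup.enable: "true"      # hypothetical label key
          backup.schedule: "daily"   # hypothetical label key
    ```
    Labels can only be attached to named volumes; bind mounts under `/home/icke/` would have to be tracked in a separate list instead.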

---

## Phase 4: Software Upgrades (High Risk)

**Estimated Time:** 4-8 hours  
**Risk Level:** High  
**Downtime:** 30-60 minutes per service  
**Recommendation:** Test in development first

### Tasks

- [ ] **Migrate EOL MySQL 5.7 to MariaDB 10.11+**
  - [ ] Blog (mysql-blog)
    - Back up database
    - Export data
    - Switch to MariaDB
    - Import data
    - Test thoroughly
  - [ ] Helferlein (mysql-helferlein)
    - Same process as blog
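  - Sketch of the "Switch to MariaDB" step for the blog, assuming the dump is imported into a fresh, empty data directory. The host path and the `BLOG_DB_ROOT_PASSWORD` variable are hypothetical.
    ```yaml
    services:
      mysql-blog:                 # service name kept so the blog's DB host stays valid
        image: mariadb:10.11
        environment:
          MARIADB_ROOT_PASSWORD: ${BLOG_DB_ROOT_PASSWORD}   # hypothetical variable
        volumes:
          - /home/icke/blog/mariadb:/var/lib/mysql          # hypothetical fresh directory
    ```
    Keeping the old MySQL data directory untouched gives a rollback path: pointing the service back at the old image and path restores the previous state.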

- [ ] **Upgrade Zabbix 6.4 → 7.0+**
  - Current: `zabbix/zabbix-server-mysql:6.4-ubuntu-latest`
  - Target: `zabbix/zabbix-server-mysql:7.0-alpine-latest`
  - Steps:
    - [ ] Read Zabbix 7.0 migration guide
    - [ ] Back up Zabbix database
    - [ ] Update images in zabbix.yml
    - [ ] Test web UI and agents

- [ ] **Pin `:latest` tags to specific versions**
  - Services currently using `:latest`:
    - [ ] Synapse
    - [ ] Element-web
    - [ ] Jellyfin
    - [ ] Gitea
    - [ ] Telegram-bridge
    - [ ] Whatsapp-bridge
    - [ ] And others
  - Benefit: Predictable updates, easier rollback
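  - Sketch of a pinned tag. The version shown is only an example; the tag to pin is whatever release is currently running and verified.
    ```yaml
    services:
      gitea:
        # image: gitea/gitea:latest   # before: moves with every upstream release
        image: gitea/gitea:1.22.3     # after: example version only
    ```
    Rolling back then means changing the tag back to the previous value and recreating the container.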

- [ ] **Consider N8N database backend migration**
  - Current: File-based storage (SQLite, the n8n default)
  - Recommended: PostgreSQL for better performance
  - Would require N8N reconfiguration
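  - Sketch of the reconfiguration using n8n's `DB_POSTGRESDB_*` environment variables. The service name `n8n-db`, the Postgres version, the database and user names, and the `N8N_DB_PASSWORD` variable are assumptions.
    ```yaml
    services:
      n8n:
        image: n8nio/n8n            # pin to the currently deployed version
        environment:
          DB_TYPE: postgresdb
          DB_POSTGRESDB_HOST: n8n-db
          DB_POSTGRESDB_PORT: "5432"
          DB_POSTGRESDB_DATABASE: n8n
          DB_POSTGRESDB_USER: n8n
          DB_POSTGRESDB_PASSWORD: ${N8N_DB_PASSWORD}   # hypothetical variable
        depends_on:
          n8n-db:
            condition: service_healthy
      n8n-db:
        image: postgres:16
        environment:
          POSTGRES_DB: n8n
          POSTGRES_USER: n8n
          POSTGRES_PASSWORD: ${N8N_DB_PASSWORD}
        healthcheck:
          test: ["CMD-SHELL", "pg_isready -U n8n -d n8n"]
          interval: 10s
          timeout: 5s
          retries: 5
    ```
    Switching the backend starts n8n on an empty database, so existing workflows and credentials would have to be exported beforehand and imported afterwards.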

- [ ] **Review Unifi duplicate mount**
  - Currently mounts `/home/icke/unifi` to both `/config` and `/data`
  - Clean up redundant configuration

---

## Critical Services Priority List

Fix these services first due to security/stability concerns:

1. **N8N** (automation) - Weak password, no network isolation
2. **Bitwarden** (passwords) - Exposed admin token
3. **Gitea** (code repo) - No healthcheck, no dedicated network
4. **Blog/Helferlein** - EOL MySQL version
5. **Synapse + Bridges** - Network architecture needs improvement
6. **Services on compose_files_default** - Need network isolation

---

## Statistics

- **Total Services:** 39 running containers
- **Services with `container_name`:** 54 instances
- **Services with hardcoded passwords:** 20+ instances
- **Services using deprecated `links`:** 7 instances
- **Services with static IPs:** 16 instances
- **Services with Loki logging:** 24/39 (62%)
- **Services with healthchecks:** 2/39 (5%)
- **Services with resource limits:** 1/39 (3%)
- **Services using old MySQL 5.7:** 2 instances
- **Shared networks:** 13 custom networks (some overloaded)

---

## Implementation Notes

### Before Starting Any Phase

1. **Full system backup**
   - Back up all `/home/icke/` directories
   - Export all databases
   - Document current working state

2. **Create rollback plan**
   - Keep old compose files as `.yml.backup`
   - Document current container states
   - Test rollback procedure

3. **Schedule maintenance window**
   - Notify users of potential downtime
   - Choose low-traffic time period
   - Have monitoring ready

### Testing Strategy

1. Test changes on one service first
2. Monitor for 24 hours
3. Apply to similar services in batches
4. Keep previous configs for quick rollback

### Success Criteria

- All services start successfully
- No stale endpoint errors after `docker system prune`
- All services accessible via their original URLs/ports
- Logs flowing to Loki
- Healthchecks reporting healthy status

---

## Maintenance Schedule Recommendation

- **Phase 1:** Can be done immediately, low risk
- **Phase 2:** Schedule over 2-3 weekends
- **Phase 3:** One service per weekend, monitor for a week
- **Phase 4:** Full maintenance window, test environment first

---

## Additional Recommendations

### Future Improvements (Not in Roadmap)

- Consider Traefik/Nginx Proxy Manager for unified reverse proxy
- Implement automated backup solution (Duplicati, Restic, etc.)
- Add Prometheus monitoring for metrics collection
- Consider Watchtower for automated updates (carefully configured)
- Create Docker Swarm or K8s cluster for HA (if needed)
- Implement secrets management (Vault, Docker Secrets)
- Add CI/CD pipeline for compose file validation

### Documentation

- Create a network architecture diagram
- Create service dependency map
- Maintain service inventory with versions
- Document backup and restore procedures
- Create runbooks for common issues

---

## Progress Tracking

Use this section to track completion:

```
Phase 1: [ ] 0/4 major tasks
Phase 2: [ ] 0/6 major tasks
Phase 3: [ ] 0/5 major tasks
Phase 4: [ ] 0/5 major tasks

Overall Progress: 0%
```

---

## Notes & Decisions

Document any decisions or deviations from this roadmap here:

- 2025-11-11: Roadmap created based on infrastructure analysis
- 2025-11-11: Nextcloud fixed (removed container_name, added dedicated network)

---

**Last Updated:** 2025-11-11  
**Next Review:** After Phase 1 completion