Docker Infrastructure Improvement Roadmap

Generated: November 11, 2025
Status: Planning Phase
Total Services: 39 running containers


Overview

This roadmap addresses critical issues, security vulnerabilities, and operational improvements identified in the Docker Compose infrastructure. The plan is divided into 4 phases, prioritizing quick wins and critical security issues first.


Phase 1: Quick Wins (Low Risk, High Impact)

Estimated Time: 2-4 hours
Risk Level: Low
Downtime: Minimal

Tasks

  • Change N8N password from "changeme" to a secure password

    • File: n8n.yml
    • Impact: Critical security fix
    • Downtime: < 1 minute
  • Add healthchecks to critical services

    • Bitwarden (password manager)
    • Gitea (code repository)
    • N8N (automation)
    • Synapse (Matrix server)
    • MariaDB instances
    • Benefit: Auto-restart on failure, better monitoring (see the combined example after this task list)
  • Enable Loki logging for the remaining 15 services

    • Services missing logging: element-web, telegram-bridge, whatsapp-bridge, piper, whisper, gitea, coturn, trading-bot, postgres, and others
    • Benefit: Centralized log management
  • Add depends_on to multi-container stacks

    • Blog → mysql-blog
    • Helferlein → mysql-helferlein
    • Traccar → mysql-traccar
    • Zabbix components
    • Matrix bridges → Synapse
    • Benefit: Proper startup order
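
A minimal sketch of what the healthcheck, logging, and depends_on items above could look like in one compose file, using the blog / mysql-blog pair as the example. The image tags, Loki endpoint, and intervals are placeholders to adapt per service, and the loki driver assumes the Grafana Loki Docker logging plugin is installed on the host:

    services:
      blog:
        image: wordpress:latest                # placeholder tag; pinning versions is a Phase 4 task
        depends_on:
          mysql-blog:
            condition: service_healthy         # start only once the DB healthcheck passes
        logging:
          driver: loki                         # assumes the Loki logging plugin is present
          options:
            loki-url: "http://127.0.0.1:3100/loki/api/v1/push"   # assumed Loki push endpoint
      mysql-blog:
        image: mysql:5.7                       # current image; replaced by MariaDB in Phase 4
        healthcheck:
          test: ["CMD", "mysqladmin", "ping", "-h", "127.0.0.1"]
          interval: 30s
          timeout: 5s
          retries: 3

For Bitwarden, Gitea, N8N, and Synapse the healthcheck test has to match whatever CLI or status endpoint their images actually ship; curl or wget against a local health URL is the usual pattern.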

Phase 2: Security Hardening (Medium Risk)

Estimated Time: 4-8 hours
Risk Level: Medium
Downtime: 5-10 minutes per service

Tasks

  • Move passwords to environment files

    • Create /home/icke/env_files/ directory structure
    • Move passwords from compose files to .env files (example sketch after this task list):
      • blog.yml → eccmts42*
      • nextcloud.yml → eccmts42*
      • helferlein.yml → eccmts42*
      • traccar.yml → eccmts42*
      • wallabag.yml → eccmts42*
      • zabbix.yml → eccmts42*
      • firefly.yml → firefly_secure_password_123
      • matamo.yml → matomo
      • n8n.yml → new secure password
    • Update .gitignore to exclude .env files
    • Document password locations in separate secure file
  • Move admin tokens to secrets

    • Bitwarden admin token → env file
    • Firefly cron token → env file
    • Coturn static auth secret → config file
  • Create dedicated networks for isolated services

    • Element-web (currently no network)
    • Telegram-bridge (currently no network)
    • Whatsapp-bridge (currently no network)
    • Piper (currently no network)
    • Whisper (currently no network)
    • Coturn (currently no network)
  • Remove services from shared default network

    • Services on compose_files_default:
      • n8n → dedicated network
      • plex → dedicated network
      • whisper → dedicated network
      • unifi → dedicated network
      • synapse + bridges → shared matrix network
      • piper → dedicated network
      • coturn → can stay (needs to be accessible)
  • Remove deprecated links: directives (7 instances)

    • blog.yml
    • helferlein.yml
    • traccar.yml
    • zabbix.yml
    • Replace with network aliases and depends_on
  • Review and fix user permissions

    • Plex: Change from UID=0 to proper user
    • Jellyfin: Change from UID=0 to proper user
    • Verify other services aren't running as root unnecessarily
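
A minimal sketch of the env-file approach, using blog.yml as the example. The file name, variable names, and values below are illustrative, not the ones currently in use; the point is that the compose file only references the env file, the secrets stay outside version control, and a dedicated network replaces the deprecated links: entries (the same pattern gives the compose_files_default services their own networks):

    # /home/icke/env_files/blog.env  (chmod 600; add env_files/ or *.env to .gitignore)
    MYSQL_ROOT_PASSWORD=<new-strong-password>
    MYSQL_PASSWORD=<new-strong-password>

    # blog.yml (excerpt)
    services:
      blog:
        env_file:
          - /home/icke/env_files/blog.env      # secrets no longer live in the compose file
        networks:
          - blog_net
      mysql-blog:
        env_file:
          - /home/icke/env_files/blog.env
        networks:
          - blog_net

    networks:
      blog_net:                                # dedicated bridge network for this stack
        driver: bridge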

Phase 3: Stability & Reliability Improvements (Medium-High Risk)

Estimated Time: 8-16 hours
Risk Level: Medium-High
Downtime: 10-30 minutes per service

Tasks

  • Remove container_name from all services (54 instances)

    • Use compose project naming with network aliases instead
    • Prevents stale endpoint issues after docker system prune
    • Priority services:
      • bitwarden.yml
      • blog.yml
      • gitea.yml
      • jellyfin.yml
      • plex.yml
      • synapse.yml
      • n8n.yml
      • unifi.yml
      • zabbix.yml (multiple containers)
      • firefly.yml (multiple containers)
      • Element-web, bridges (all)
      • Trading bot components
    • Note: Nextcloud already fixed
  • Remove static IP addresses (16 instances)

    • bitwarden.yml → use DNS aliases
    • blog.yml → use DNS aliases
    • jellyfin.yml → use DNS aliases
    • zabbix.yml → use DNS aliases
    • Replace with network aliases for service discovery (see the sketch after this task list)
  • Add resource limits to all services

    • Template (adjust per service):
      deploy:
        resources:
          limits:
            memory: 1G
            cpus: '0.5'
          reservations:
            memory: 256M
      
    • Priority services to limit:
      • Plex (media server - high memory)
      • Jellyfin (media server - high memory)
      • N8N (automation - can grow)
      • Nextcloud (web app - high memory)
      • Synapse (Matrix - high memory)
      • MySQL/MariaDB instances
      • Zabbix server
    • Less critical services: 512M limits
  • Standardize compose file format

    • Remove version: declarations (deprecated in current compose spec)
    • Use consistent YAML formatting
    • Add comments for complex configurations
  • Add volume backup labels/annotations

    • Label critical data volumes:
      • Bitwarden data
      • Gitea data
      • Nextcloud data
      • Database volumes
      • N8N workflows
    • Prepare for automated backup solutions
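
A rough sketch of the container_name, static-IP, resource-limit, and label changes combined, with gitea as the example. The network name, memory numbers, and the backup label are hypothetical placeholders; the network alias is what keeps the old DNS name resolvable once container_name and the fixed IP are gone:

    services:
      gitea:
        image: gitea/gitea:1.21                # pinned tag (placeholder version, see Phase 4)
        # container_name and ipv4_address removed; the alias below preserves the old name
        networks:
          gitea_net:
            aliases:
              - gitea                          # other containers keep resolving "gitea" via DNS
        deploy:
          resources:
            limits:
              memory: 1G                       # adjust per the template above
              cpus: '0.5'
            reservations:
              memory: 256M
        labels:
          backup.enable: "true"                # hypothetical label for a future backup job

    networks:
      gitea_net:
        driver: bridge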

Phase 4: Software Upgrades (High Risk)

Estimated Time: 4-8 hours
Risk Level: High
Downtime: 30-60 minutes per service
Recommendation: Test in development first

Tasks

  • Upgrade EOL MySQL 5.7 to MariaDB 10.11+

    • Blog (mysql-blog)
      • Backup database
      • Export data
      • Switch to MariaDB
      • Import data
      • Test thoroughly
    • Helferlein (mysql-helferlein)
      • Same process as blog
  • Upgrade Zabbix 6.4 → 7.0+

    • Current: zabbix/zabbix-server-mysql:6.4-ubuntu-latest
    • Target: zabbix/zabbix-server-mysql:7.0-alpine-latest
    • Steps:
      • Read Zabbix 7.0 migration guide
      • Backup Zabbix database
      • Update images in zabbix.yml
      • Test web UI and agents
  • Pin :latest tags to specific versions

    • Services currently using :latest:
      • Synapse
      • Element-web
      • Jellyfin
      • Gitea
      • Telegram-bridge
      • Whatsapp-bridge
      • And others
    • Benefit: Predictable updates, easier rollback (see the sketch after this task list)
  • Consider N8N database backend migration

    • Current: File-based storage
    • Recommended: PostgreSQL for better performance
    • Would require N8N reconfiguration
  • Review Unifi duplicate mount

    • Currently mounts /home/icke/unifi to both /config and /data
    • Clean up redundant configuration
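
A small sketch of what pinned image lines could look like after this phase. Every version number here is a placeholder to verify against the current release pages; the zabbix line is the target image already named above, and the MySQL 5.7 to MariaDB switch still needs the dump/restore steps from the first task before the image line changes:

    services:
      synapse:
        image: matrixdotorg/synapse:v1.98.0                    # placeholder version instead of :latest
      gitea:
        image: gitea/gitea:1.21                                # placeholder version
      zabbix-server:
        image: zabbix/zabbix-server-mysql:7.0-alpine-latest    # target named in this roadmap
      mysql-blog:
        image: mariadb:10.11                                   # replaces EOL mysql:5.7 after data import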

Critical Services Priority List

Fix these services first due to security/stability concerns:

  1. N8N (automation) - Weak password, no network isolation
  2. Bitwarden (passwords) - Exposed admin token
  3. Gitea (code repo) - No healthcheck, no dedicated network
  4. Blog/Helferlein - EOL MySQL version
  5. Synapse + Bridges - Network architecture needs improvement
  6. Services on compose_files_default - Need network isolation

Statistics

  • Total Services: 39 running containers
  • Services with container_name: 54 instances
  • Services with hardcoded passwords: 20+ instances
  • Services using deprecated links: 7 instances
  • Services with static IPs: 16 instances
  • Services with Loki logging: 24/39 (61%)
  • Services with healthchecks: 2/39 (5%)
  • Services with resource limits: 1/39 (3%)
  • Services using old MySQL 5.7: 2 instances
  • Shared networks: 13 custom networks (some overloaded)

Implementation Notes

Before Starting Any Phase

  1. Full system backup

    • Backup all /home/icke/ directories
    • Export all databases
    • Document current working state
  2. Create rollback plan

    • Keep old compose files as .yml.backup
    • Document current container states
    • Test rollback procedure
  3. Schedule maintenance window

    • Notify users of potential downtime
    • Choose low-traffic time period
    • Have monitoring ready

Testing Strategy

  1. Test changes on one service first
  2. Monitor for 24 hours
  3. Apply to similar services in batches
  4. Keep previous configs for quick rollback

Success Criteria

  • All services start successfully
  • No stale endpoint errors after docker system prune
  • All services accessible via their original URLs/ports
  • Logs flowing to Loki
  • Healthchecks reporting healthy status

Maintenance Schedule Recommendation

  • Phase 1: Can be done immediately, low risk
  • Phase 2: Schedule over 2-3 weekends
  • Phase 3: One service per weekend, monitor for a week
  • Phase 4: Full maintenance window, test environment first

Additional Recommendations

Future Improvements (Not in Roadmap)

  • Consider Traefik/Nginx Proxy Manager for unified reverse proxy
  • Implement automated backup solution (Duplicati, Restic, etc.)
  • Add Prometheus monitoring for metrics collection
  • Consider Watchtower for automated updates (carefully configured)
  • Create Docker Swarm or K8s cluster for HA (if needed)
  • Implement secrets management (Vault, Docker Secrets)
  • Add CI/CD pipeline for compose file validation

Documentation

  • Document network architecture diagram
  • Create service dependency map
  • Maintain service inventory with versions
  • Document backup and restore procedures
  • Create runbooks for common issues

Progress Tracking

Use this section to track completion:

Phase 1: [ ] 0/4 major tasks
Phase 2: [ ] 0/6 major tasks
Phase 3: [ ] 0/5 major tasks
Phase 4: [ ] 0/5 major tasks

Overall Progress: 0%

Notes & Decisions

Document any decisions or deviations from this roadmap here:

  • 2025-11-11: Roadmap created based on infrastructure analysis
  • 2025-11-11: Nextcloud fixed (removed container_name, added dedicated network)

Last Updated: 2025-11-11
Next Review: After Phase 1 completion