Docker Infrastructure Improvement Roadmap

Generated: November 11, 2025
Status: Planning Phase
Total Services: 39 running containers


Overview

This roadmap addresses critical issues, security vulnerabilities, and operational improvements identified in the Docker Compose infrastructure. The plan is divided into 4 phases, prioritizing quick wins and critical security issues first.


Phase 1: Quick Wins (Low Risk, High Impact)

Estimated Time: 2-4 hours
Risk Level: Low
Downtime: Minimal

Tasks

  • Change N8N password from "changeme" to a secure password

    • File: n8n.yml
    • Impact: Critical security fix
    • Downtime: < 1 minute
  • Add healthchecks to critical services

    • Bitwarden (password manager)
    • Gitea (code repository)
    • N8N (automation)
    • Synapse (Matrix server)
    • MariaDB instances
    • Benefit: Auto-restart on failure, better monitoring (see the combined example after this task list)
  • Enable Loki logging for the remaining 15 services

    • Services missing logging: element-web, telegram-bridge, whatsapp-bridge, piper, whisper, gitea, coturn, trading-bot, postgres, and others
    • Benefit: Centralized log management
  • Add depends_on to multi-container stacks

    • Blog → mysql-blog
    • Helferlein → mysql-helferlein
    • Traccar → mysql-traccar
    • Zabbix components
    • Matrix bridges → Synapse
    • Benefit: Proper startup order
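
A minimal sketch of what the healthcheck, logging, and depends_on items above could look like in one compose file, using the blog / mysql-blog pair as the example. The image tags, Loki endpoint, and intervals are placeholders to adapt per service, and the loki driver assumes the Grafana Loki Docker logging plugin is installed on the host:

    services:
      blog:
        image: wordpress:latest                # placeholder tag; pinning versions is a Phase 4 task
        depends_on:
          mysql-blog:
            condition: service_healthy         # start only once the DB healthcheck passes
        logging:
          driver: loki                         # assumes the Loki logging plugin is present
          options:
            loki-url: "http://127.0.0.1:3100/loki/api/v1/push"   # assumed Loki push endpoint
      mysql-blog:
        image: mysql:5.7                       # current image; replaced by MariaDB in Phase 4
        healthcheck:
          test: ["CMD", "mysqladmin", "ping", "-h", "127.0.0.1"]
          interval: 30s
          timeout: 5s
          retries: 3

For Bitwarden, Gitea, N8N, and Synapse the healthcheck test has to match whatever CLI or status endpoint their images actually ship; curl or wget against a local health URL is the usual pattern.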

Phase 2: Security Hardening (Medium Risk)

Estimated Time: 4-8 hours
Risk Level: Medium
Downtime: 5-10 minutes per service

Tasks

  • Move passwords to environment files

    • Create /home/icke/env_files/ directory structure
    • Move passwords from compose files to .env files (example sketch after this task list):
      • blog.yml → eccmts42*
      • nextcloud.yml → eccmts42*
      • helferlein.yml → eccmts42*
      • traccar.yml → eccmts42*
      • wallabag.yml → eccmts42*
      • zabbix.yml → eccmts42*
      • firefly.yml → firefly_secure_password_123
      • matamo.yml → matomo
      • n8n.yml → new secure password
    • Update .gitignore to exclude .env files
    • Document password locations in separate secure file
  • Move admin tokens to secrets

    • Bitwarden admin token → env file
    • Firefly cron token → env file
    • Coturn static auth secret → config file
  • Create dedicated networks for isolated services

    • Element-web (currently no network)
    • Telegram-bridge (currently no network)
    • Whatsapp-bridge (currently no network)
    • Piper (currently no network)
    • Whisper (currently no network)
    • Coturn (currently no network)
  • Remove services from shared default network

    • Services on compose_files_default:
      • n8n → dedicated network
      • plex → dedicated network
      • whisper → dedicated network
      • unifi → dedicated network
      • synapse + bridges → shared matrix network
      • piper → dedicated network
      • coturn → can stay (needs to be accessible)
  • Remove deprecated links: directives (7 instances)

    • blog.yml
    • helferlein.yml
    • traccar.yml
    • zabbix.yml
    • Replace with network aliases and depends_on
  • Review and fix user permissions

    • Plex: Change from UID=0 to proper user
    • Jellyfin: Change from UID=0 to proper user
    • Verify other services aren't running as root unnecessarily
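
A minimal sketch of the env-file approach, using blog.yml as the example. The file name, variable names, and values below are illustrative, not the ones currently in use; the point is that the compose file only references the env file, the secrets stay outside version control, and a dedicated network replaces the deprecated links: entries (the same pattern gives the compose_files_default services their own networks):

    # /home/icke/env_files/blog.env  (chmod 600; add env_files/ or *.env to .gitignore)
    MYSQL_ROOT_PASSWORD=<new-strong-password>
    MYSQL_PASSWORD=<new-strong-password>

    # blog.yml (excerpt)
    services:
      blog:
        env_file:
          - /home/icke/env_files/blog.env      # secrets no longer live in the compose file
        networks:
          - blog_net
      mysql-blog:
        env_file:
          - /home/icke/env_files/blog.env
        networks:
          - blog_net

    networks:
      blog_net:                                # dedicated bridge network for this stack
        driver: bridge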

Phase 3: Stability & Reliability Improvements (Medium-High Risk)

Estimated Time: 8-16 hours
Risk Level: Medium-High
Downtime: 10-30 minutes per service

Tasks

  • Remove container_name from all services (54 instances)

    • Use compose project naming with network aliases instead
    • Prevents stale endpoint issues after docker system prune
    • Priority services:
      • bitwarden.yml
      • blog.yml
      • gitea.yml
      • jellyfin.yml
      • plex.yml
      • synapse.yml
      • n8n.yml
      • unifi.yml
      • zabbix.yml (multiple containers)
      • firefly.yml (multiple containers)
      • Element-web, bridges (all)
      • Trading bot components
    • Note: Nextcloud already fixed
  • Remove static IP addresses (16 instances)

    • bitwarden.yml → use DNS aliases
    • blog.yml → use DNS aliases
    • jellyfin.yml → use DNS aliases
    • zabbix.yml → use DNS aliases
    • Replace with network aliases for service discovery (see the sketch after this task list)
  • Add resource limits to all services

    • Template (adjust per service):
      deploy:
        resources:
          limits:
            memory: 1G
            cpus: '0.5'
          reservations:
            memory: 256M
      
    • Priority services to limit:
      • Plex (media server - high memory)
      • Jellyfin (media server - high memory)
      • N8N (automation - can grow)
      • Nextcloud (web app - high memory)
      • Synapse (Matrix - high memory)
      • MySQL/MariaDB instances
      • Zabbix server
    • Less critical services: 512M limits
  • Standardize compose file format

    • Remove version: declarations (deprecated in current compose spec)
    • Use consistent YAML formatting
    • Add comments for complex configurations
  • Add volume backup labels/annotations

    • Label critical data volumes:
      • Bitwarden data
      • Gitea data
      • Nextcloud data
      • Database volumes
      • N8N workflows
    • Prepare for automated backup solutions
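
A rough sketch of the container_name, static-IP, resource-limit, and label changes combined, with gitea as the example. The network name, memory numbers, and the backup label are hypothetical placeholders; the network alias is what keeps the old DNS name resolvable once container_name and the fixed IP are gone:

    services:
      gitea:
        image: gitea/gitea:1.21                # pinned tag (placeholder version, see Phase 4)
        # container_name and ipv4_address removed; the alias below preserves the old name
        networks:
          gitea_net:
            aliases:
              - gitea                          # other containers keep resolving "gitea" via DNS
        deploy:
          resources:
            limits:
              memory: 1G                       # adjust per the template above
              cpus: '0.5'
            reservations:
              memory: 256M
        labels:
          backup.enable: "true"                # hypothetical label for a future backup job

    networks:
      gitea_net:
        driver: bridge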

Phase 4: Software Upgrades (High Risk)

Estimated Time: 4-8 hours
Risk Level: High
Downtime: 30-60 minutes per service
Recommendation: Test in development first

Tasks

  • Upgrade EOL MySQL 5.7 to MariaDB 10.11+

    • Blog (mysql-blog)
      • Backup database
      • Export data
      • Switch to MariaDB
      • Import data
      • Test thoroughly
    • Helferlein (mysql-helferlein)
      • Same process as blog
  • Upgrade Zabbix 6.4 → 7.0+

    • Current: zabbix/zabbix-server-mysql:6.4-ubuntu-latest
    • Target: zabbix/zabbix-server-mysql:7.0-alpine-latest
    • Steps:
      • Read Zabbix 7.0 migration guide
      • Backup Zabbix database
      • Update images in zabbix.yml
      • Test web UI and agents
  • Pin :latest tags to specific versions

    • Services currently using :latest:
      • Synapse
      • Element-web
      • Jellyfin
      • Gitea
      • Telegram-bridge
      • Whatsapp-bridge
      • And others
    • Benefit: Predictable updates, easier rollback (see the sketch after this task list)
  • Consider N8N database backend migration

    • Current: File-based storage
    • Recommended: PostgreSQL for better performance
    • Would require N8N reconfiguration
  • Review Unifi duplicate mount

    • Currently mounts /home/icke/unifi to both /config and /data
    • Clean up redundant configuration
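
A small sketch of what pinned image lines could look like after this phase. Every version number here is a placeholder to verify against the current release pages; the zabbix line is the target image already named above, and the MySQL 5.7 to MariaDB switch still needs the dump/restore steps from the first task before the image line changes:

    services:
      synapse:
        image: matrixdotorg/synapse:v1.98.0                    # placeholder version instead of :latest
      gitea:
        image: gitea/gitea:1.21                                # placeholder version
      zabbix-server:
        image: zabbix/zabbix-server-mysql:7.0-alpine-latest    # target named in this roadmap
      mysql-blog:
        image: mariadb:10.11                                   # replaces EOL mysql:5.7 after data import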

Critical Services Priority List

Fix these services first due to security/stability concerns:

  1. N8N (automation) - Weak password, no network isolation
  2. Bitwarden (passwords) - Exposed admin token
  3. Gitea (code repo) - No healthcheck, no dedicated network
  4. Blog/Helferlein - EOL MySQL version
  5. Synapse + Bridges - Network architecture needs improvement
  6. Services on compose_files_default - Need network isolation

Statistics

  • Total Services: 39 running containers
  • Services with container_name: 54 instances
  • Services with hardcoded passwords: 20+ instances
  • Services using deprecated links: 7 instances
  • Services with static IPs: 16 instances
  • Services with Loki logging: 24/39 (61%)
  • Services with healthchecks: 2/39 (5%)
  • Services with resource limits: 1/39 (3%)
  • Services using old MySQL 5.7: 2 instances
  • Shared networks: 13 custom networks (some overloaded)

Implementation Notes

Before Starting Any Phase

  1. Full system backup

    • Backup all /home/icke/ directories
    • Export all databases
    • Document current working state
  2. Create rollback plan

    • Keep old compose files as .yml.backup
    • Document current container states
    • Test rollback procedure
  3. Schedule maintenance window

    • Notify users of potential downtime
    • Choose low-traffic time period
    • Have monitoring ready

Testing Strategy

  1. Test changes on one service first
  2. Monitor for 24 hours
  3. Apply to similar services in batches
  4. Keep previous configs for quick rollback

Success Criteria

  • All services start successfully
  • No stale endpoint errors after docker system prune
  • All services accessible via their original URLs/ports
  • Logs flowing to Loki
  • Healthchecks reporting healthy status

Maintenance Schedule Recommendation

  • Phase 1: Can be done immediately, low risk
  • Phase 2: Schedule over 2-3 weekends
  • Phase 3: One service per weekend, monitor for a week
  • Phase 4: Full maintenance window, test environment first

Additional Recommendations

Future Improvements (Not in Roadmap)

  • Consider Traefik/Nginx Proxy Manager for unified reverse proxy
  • Implement automated backup solution (Duplicati, Restic, etc.)
  • Add Prometheus monitoring for metrics collection
  • Consider Watchtower for automated updates (carefully configured)
  • Create Docker Swarm or K8s cluster for HA (if needed)
  • Implement secrets management (Vault, Docker Secrets)
  • Add CI/CD pipeline for compose file validation

Documentation

  • Document network architecture diagram
  • Create service dependency map
  • Maintain service inventory with versions
  • Document backup and restore procedures
  • Create runbooks for common issues

Progress Tracking

Use this section to track completion:

Phase 1: [ ] 0/4 major tasks
Phase 2: [ ] 0/6 major tasks
Phase 3: [ ] 0/5 major tasks
Phase 4: [ ] 0/5 major tasks

Overall Progress: 0%

Notes & Decisions

Document any decisions or deviations from this roadmap here:

  • 2025-11-11: Roadmap created based on infrastructure analysis
  • 2025-11-11: Nextcloud fixed (removed container_name, added dedicated network)

Last Updated: 2025-11-11
Next Review: After Phase 1 completion