diff --git a/compose_files/INFRASTRUCTURE_ROADMAP.md b/compose_files/INFRASTRUCTURE_ROADMAP.md
new file mode 100644
index 0000000..debee75
--- /dev/null
+++ b/compose_files/INFRASTRUCTURE_ROADMAP.md
@@ -0,0 +1,350 @@
+# Docker Infrastructure Improvement Roadmap
+
+**Generated:** November 11, 2025
+**Status:** Planning Phase
+**Total Services:** 39 running containers
+
+---
+
+## Overview
+
+This roadmap addresses critical issues, security vulnerabilities, and operational improvements identified in the Docker Compose infrastructure. The plan is divided into 4 phases, prioritizing quick wins and critical security issues first.
+
+---
+
+## Phase 1: Quick Wins (Low Risk, High Impact)
+
+**Estimated Time:** 2-4 hours
+**Risk Level:** Low
+**Downtime:** Minimal
+
+### Tasks
+
+- [ ] **Change N8N password** from "changeme" to a secure password
+  - File: `n8n.yml`
+  - Impact: Critical security fix
+  - Downtime: < 1 minute
+
+- [ ] **Add healthchecks to critical services**
+  - [ ] Bitwarden (password manager)
+  - [ ] Gitea (code repository)
+  - [ ] N8N (automation)
+  - [ ] Synapse (Matrix server)
+  - [ ] MariaDB instances
+  - Benefit: Auto-restart on failure, better monitoring
+
+- [ ] **Enable Loki logging for remaining 15 services**
+  - Services missing logging: element-web, telegram-bridge, whatsapp-bridge, piper, whisper, gitea, coturn, trading-bot, postgres, and others
+  - Benefit: Centralized log management
+
+- [ ] **Add `depends_on` to multi-container stacks**
+  - [ ] Blog → mysql-blog
+  - [ ] Helferlein → mysql-helferlein
+  - [ ] Traccar → mysql-traccar
+  - [ ] Zabbix components
+  - [ ] Matrix bridges → Synapse
+  - Benefit: Proper startup order
+
+---
+
+## Phase 2: Security Hardening (Medium Risk)
+
+**Estimated Time:** 4-8 hours
+**Risk Level:** Medium
+**Downtime:** 5-10 minutes per service
+
+### Tasks
+
+- [ ] **Move passwords to environment files**
+  - [ ] Create `/home/icke/env_files/` directory structure
+  - [ ] Move passwords from compose files to `.env` files:
+    - [ ] blog.yml → `eccmts42*`
+    - [ ] nextcloud.yml → `eccmts42*`
+    - [ ] helferlein.yml → `eccmts42*`
+    - [ ] traccar.yml → `eccmts42*`
+    - [ ] wallabag.yml → `eccmts42*`
+    - [ ] zabbix.yml → `eccmts42*`
+    - [ ] firefly.yml → `firefly_secure_password_123`
+    - [ ] matamo.yml → `matomo`
+    - [ ] n8n.yml → new secure password
+  - [ ] Update `.gitignore` to exclude `.env` files
+  - [ ] Document password locations in separate secure file
+
+- [ ] **Move admin tokens to secrets**
+  - [ ] Bitwarden admin token → env file
+  - [ ] Firefly cron token → env file
+  - [ ] Coturn static auth secret → config file
+
+- [ ] **Create dedicated networks for isolated services**
+  - [ ] Element-web (currently no network)
+  - [ ] Telegram-bridge (currently no network)
+  - [ ] Whatsapp-bridge (currently no network)
+  - [ ] Piper (currently no network)
+  - [ ] Whisper (currently no network)
+  - [ ] Coturn (currently no network)
+
+- [ ] **Remove services from shared default network**
+  - Services on `compose_files_default`:
+    - [ ] n8n → dedicated network
+    - [ ] plex → dedicated network
+    - [ ] whisper → dedicated network
+    - [ ] unifi → dedicated network
+    - [ ] synapse + bridges → shared matrix network
+    - [ ] piper → dedicated network
+    - [ ] coturn → can stay (needs to be accessible)
+
+- [ ] **Remove deprecated `links:` directives** (7 instances)
+  - [ ] blog.yml
+  - [ ] helferlein.yml
+  - [ ] traccar.yml
+  - [ ] zabbix.yml
+  - Replace with network aliases and `depends_on`
+
+- [ ] **Review and fix user permissions**
+  - [ ] Plex: Change from UID=0 to proper user
+  - [ ] Jellyfin: Change from UID=0 to proper user
+  - [ ] Verify other services aren't running as root unnecessarily
+
+---
+
+## Phase 3: Stability & Reliability Improvements (Medium-High Risk)
+
+**Estimated Time:** 8-16 hours
+**Risk Level:** Medium-High
+**Downtime:** 10-30 minutes per service
+
+### Tasks
+
+- [ ] **Remove `container_name` from all services** (54 instances)
+  - Use compose project naming with network aliases instead
+  - Prevents stale endpoint issues after `docker system prune`
+  - Priority services:
+    - [ ] bitwarden.yml
+    - [ ] blog.yml
+    - [ ] gitea.yml
+    - [ ] jellyfin.yml
+    - [ ] plex.yml
+    - [ ] synapse.yml
+    - [ ] n8n.yml
+    - [ ] unifi.yml
+    - [ ] zabbix.yml (multiple containers)
+    - [ ] firefly.yml (multiple containers)
+    - [ ] Element-web, bridges (all)
+    - [ ] Trading bot components
+  - Note: Nextcloud already fixed ✅
+
+- [ ] **Remove static IP addresses** (16 instances)
+  - [ ] bitwarden.yml → use DNS aliases
+  - [ ] blog.yml → use DNS aliases
+  - [ ] jellyfin.yml → use DNS aliases
+  - [ ] zabbix.yml → use DNS aliases
+  - Replace with network aliases for service discovery
+
+- [ ] **Add resource limits to all services**
+  - Template (adjust per service):
+    ```yaml
+    deploy:
+      resources:
+        limits:
+          memory: 1G
+          cpus: '0.5'
+        reservations:
+          memory: 256M
+    ```
+  - Priority services to limit:
+    - [ ] Plex (media server - high memory)
+    - [ ] Jellyfin (media server - high memory)
+    - [ ] N8N (automation - can grow)
+    - [ ] Nextcloud (web app - high memory)
+    - [ ] Synapse (Matrix - high memory)
+    - [ ] MySQL/MariaDB instances
+    - [ ] Zabbix server
+  - Less critical services: 512M limits
+
+- [ ] **Standardize compose file format**
+  - [ ] Remove `version:` declarations (deprecated in current compose spec)
+  - [ ] Use consistent YAML formatting
+  - [ ] Add comments for complex configurations
+
+- [ ] **Add volume backup labels/annotations**
+  - Label critical data volumes:
+    - [ ] Bitwarden data
+    - [ ] Gitea data
+    - [ ] Nextcloud data
+    - [ ] Database volumes
+    - [ ] N8N workflows
+  - Prepare for automated backup solutions
+
+---
+
+## Phase 4: Software Upgrades (High Risk)
+
+**Estimated Time:** 4-8 hours
+**Risk Level:** High
+**Downtime:** 30-60 minutes per service
+**Recommendation:** Test in development first
+
+### Tasks
+
+- [ ] **Upgrade EOL MySQL 5.7 to MariaDB 10.11+**
+  - [ ] Blog (mysql-blog)
+    - Backup database
+    - Export data
+    - Switch to MariaDB
+    - Import data
+    - Test thoroughly
+  - [ ] Helferlein (mysql-helferlein)
+    - Same process as blog
+
+- [ ] **Upgrade Zabbix 6.4 → 7.0+**
+  - Current: `zabbix/zabbix-server-mysql:6.4-ubuntu-latest`
+  - Target: `zabbix/zabbix-server-mysql:7.0-alpine-latest`
+  - Steps:
+    - [ ] Read Zabbix 7.0 migration guide
+    - [ ] Backup Zabbix database
+    - [ ] Update images in zabbix.yml
+    - [ ] Test web UI and agents
+
+- [ ] **Pin `:latest` tags to specific versions**
+  - Services currently using `:latest`:
+    - [ ] Synapse
+    - [ ] Element-web
+    - [ ] Jellyfin
+    - [ ] Gitea
+    - [ ] Telegram-bridge
+    - [ ] Whatsapp-bridge
+    - [ ] And others
+  - Benefit: Predictable updates, easier rollback
+
+- [ ] **Consider N8N database backend migration**
+  - Current: File-based storage
+  - Recommended: PostgreSQL for better performance
+  - Would require N8N reconfiguration
+
+- [ ] **Review Unifi duplicate mount**
+  - Currently mounts `/home/icke/unifi` to both `/config` and `/data`
+  - Clean up redundant configuration
+
+---
+
+## Critical Services Priority List
+
+Fix these services first due to security/stability concerns:
+
+1. **N8N** (automation) - Weak password, no network isolation
+2. **Bitwarden** (passwords) - Exposed admin token
+3. **Gitea** (code repo) - No healthcheck, no dedicated network
+4. **Blog/Helferlein** - EOL MySQL version
+5. **Synapse + Bridges** - Network architecture needs improvement
+6. **Services on compose_files_default** - Need network isolation
+
+---
+
+## Statistics
+
+- **Total Services:** 39 running containers
+- **Services with `container_name`:** 54 instances
+- **Services with hardcoded passwords:** 20+ instances
+- **Services using deprecated `links`:** 7 instances
+- **Services with static IPs:** 16 instances
+- **Services with Loki logging:** 24/39 (62%)
+- **Services with healthchecks:** 2/39 (5%)
+- **Services with resource limits:** 1/39 (3%)
+- **Services using old MySQL 5.7:** 2 instances
+- **Shared networks:** 13 custom networks (some overloaded)
+
+---
+
+## Implementation Notes
+
+### Before Starting Any Phase
+
+1. **Full system backup**
+   - Backup all `/home/icke/` directories
+   - Export all databases
+   - Document current working state
+
+2. **Create rollback plan**
+   - Keep old compose files as `.yml.backup`
+   - Document current container states
+   - Test rollback procedure
+
+3. **Schedule maintenance window**
+   - Notify users of potential downtime
+   - Choose low-traffic time period
+   - Have monitoring ready
+
+### Testing Strategy
+
+1. Test changes on one service first
+2. Monitor for 24 hours
+3. Apply to similar services in batches
+4. Keep previous configs for quick rollback
+
+### Success Criteria
+
+- All services start successfully
+- No stale endpoint errors after `docker system prune`
+- All services accessible via their original URLs/ports
+- Logs flowing to Loki
+- Healthchecks reporting healthy status
+
+---
+
+## Maintenance Schedule Recommendation
+
+- **Phase 1:** Can be done immediately, low risk
+- **Phase 2:** Schedule over 2-3 weekends
+- **Phase 3:** One service per weekend, monitor for a week
+- **Phase 4:** Full maintenance window, test environment first
+
+---
+
+## Additional Recommendations
+
+### Future Improvements (Not in Roadmap)
+
+- Consider Traefik/Nginx Proxy Manager for unified reverse proxy
+- Implement automated backup solution (Duplicati, Restic, etc.)
+- Add Prometheus monitoring for metrics collection
+- Consider Watchtower for automated updates (carefully configured)
+- Create Docker Swarm or K8s cluster for HA (if needed)
+- Implement secrets management (Vault, Docker Secrets)
+- Add CI/CD pipeline for compose file validation
+
+### Documentation
+
+- Document network architecture diagram
+- Create service dependency map
+- Maintain service inventory with versions
+- Document backup and restore procedures
+- Create runbooks for common issues
+
+---
+
+## Progress Tracking
+
+Use this section to track completion:
+
+```
+Phase 1: [ ] 0/4 major tasks
+Phase 2: [ ] 0/6 major tasks
+Phase 3: [ ] 0/5 major tasks
+Phase 4: [ ] 0/5 major tasks
+
+Overall Progress: 0%
+```
+
+---
+
+## Notes & Decisions
+
+Document any decisions or deviations from this roadmap here:
+
+- 2025-11-11: Roadmap created based on infrastructure analysis
+- 2025-11-11: Nextcloud fixed (removed container_name, added dedicated network)
+
+---
+
+**Last Updated:** 2025-11-11
+**Next Review:** After Phase 1 completion
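
As a rough illustration of how several Phase 1 and Phase 2 tasks from the roadmap could combine in one stack, the fragment below sketches the Blog → mysql-blog pair with a healthcheck, a health-gated `depends_on`, an env file, and a dedicated network, with no `links:`, `container_name`, or static IP. Several names are assumptions and do not come from the roadmap: the image `example/blog:1.0`, the file name `blog.env`, and the network name `blog` are hypothetical, and the healthcheck assumes the official `mariadb` image, which ships a `healthcheck.sh` helper.

```yaml
# Sketch only: adapt image names, paths, and timings to the real stack.
services:
  blog:
    image: example/blog:1.0          # hypothetical image, pinned instead of :latest
    env_file:
      - /home/icke/env_files/blog.env   # hypothetical file name; directory from the roadmap
    depends_on:
      mysql-blog:
        condition: service_healthy   # wait until the database healthcheck passes
    networks:
      - blog

  mysql-blog:
    image: mariadb:10.11             # Phase 4 target version
    env_file:
      - /home/icke/env_files/blog.env
    healthcheck:
      # Assumes the official mariadb image and its bundled healthcheck.sh
      test: ["CMD", "healthcheck.sh", "--connect", "--innodb_initialized"]
      interval: 30s
      timeout: 5s
      retries: 5
      start_period: 30s
    networks:
      - blog                         # reachable by service name "mysql-blog"

networks:
  blog: {}                           # dedicated network instead of compose_files_default
```

Because both services share the `blog` network, the application can reach the database by its service name, which is what replaces the deprecated `links:` entries and the static IPs.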