# Docker Infrastructure Improvement Roadmap

**Generated:** November 11, 2025  
**Status:** Planning Phase  
**Total Services:** 39 running containers

---

## Overview

This roadmap addresses critical issues, security vulnerabilities, and operational improvements identified in the Docker Compose infrastructure. The plan is divided into 4 phases, prioritizing quick wins and critical security issues first.

---
## Phase 1: Quick Wins (Low Risk, High Impact)

**Estimated Time:** 2-4 hours  
**Risk Level:** Low  
**Downtime:** Minimal

### Tasks

- [ ] **Upgrade Nextcloud MariaDB 10.5 → 10.6**
  - File: `nextcloud.yml`
  - Current: `mariadb:10.5` (2.2 GB database)
  - Target: `mariadb:10.6` (recommended by Nextcloud 30)
  - Steps:
    1. Backup: `docker exec compose_files_db_1 mariadb-dump -uroot -p'eccmts42*' --all-databases > /home/icke/backups/nextcloud_mariadb_before_10.6_$(date +%Y%m%d).sql`
    2. Stop: `cd /home/icke/compose_files && docker-compose -f nextcloud.yml down`
    3. Edit: Change `image: mariadb:10.5` → `image: mariadb:10.6`
    4. Start: `docker-compose -f nextcloud.yml up -d`
    5. Upgrade: `docker exec compose_files_db_1 mariadb-upgrade -uroot -p'eccmts42*'`
  - Impact: Better performance, Nextcloud 30 compatibility
  - Downtime: ~5 minutes

- [ ] **Change N8N password** from "changeme" to a secure password
  - File: `n8n.yml`
  - Impact: Critical security fix
  - Downtime: < 1 minute

- [ ] **Add healthchecks to critical services**
  - [ ] Bitwarden (password manager)
  - [ ] Gitea (code repository)
  - [ ] N8N (automation)
  - [ ] Synapse (Matrix server)
  - [ ] MariaDB instances
  - Benefit: Health status becomes visible to monitoring and usable as a `depends_on` condition. Note that a restart policy alone does not restart a container that is merely unhealthy; auto-restart on a failed healthcheck needs Swarm or a helper container.

- [ ] **Enable Loki logging for remaining 15 services**
  - Services missing logging: element-web, telegram-bridge, whatsapp-bridge, piper, whisper, gitea, coturn, trading-bot, postgres, and others
  - Benefit: Centralized log management

- [ ] **Add `depends_on` to multi-container stacks**
  - [ ] Blog → mysql-blog
  - [ ] Helferlein → mysql-helferlein
  - [ ] Traccar → mysql-traccar
  - [ ] Zabbix components
  - [ ] Matrix bridges → Synapse
  - Benefit: Proper startup order (combined with `condition: service_healthy`, dependents also wait until the dependency is ready, not merely started)
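
For the MariaDB upgrade task above, the edit in step 3 might look roughly like the following sketch. The service name `db` is inferred from the container name `compose_files_db_1`, and the volume path is a placeholder; neither is taken from the actual `nextcloud.yml`. The `MARIADB_AUTO_UPGRADE` variable is an optional feature of the official `mariadb` image that runs the upgrade at container start and could stand in for step 5; the roadmap itself uses the manual `mariadb-upgrade` call.

```yaml
# nextcloud.yml (excerpt) - sketch only, adapt to the real file
services:
  db:
    image: mariadb:10.6            # was: mariadb:10.5
    environment:
      # Optional assumption: let the image entrypoint run the upgrade on start
      MARIADB_AUTO_UPGRADE: "1"
    volumes:
      - /home/icke/nextcloud/db:/var/lib/mysql   # hypothetical host path
```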
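
The healthcheck and `depends_on` tasks above work well together. A minimal sketch for the blog stack, assuming the service names `blog` and `mysql-blog` from the task list; the blog image and all timing values are placeholders. The long `depends_on` form with `condition:` requires a Compose version that implements the Compose Specification.

```yaml
# blog.yml (excerpt) - illustrative values
services:
  mysql-blog:
    image: mysql:5.7
    healthcheck:
      # mysqladmin ping exits 0 as soon as the server answers,
      # even when the login itself is rejected
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
      interval: 30s
      timeout: 5s
      retries: 5
      start_period: 60s

  blog:
    image: example/blog:1.0        # hypothetical image name
    depends_on:
      mysql-blog:
        condition: service_healthy # wait for a passing healthcheck
```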
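
For the Loki task above, one possible approach is a shared logging block reused through a YAML anchor. This sketch assumes the 24 services that already log to Loki do so through the Loki Docker logging driver plugin (installed under the alias `loki`); if they are scraped by Promtail instead, this does not apply. The Loki URL, the option values, and the `gitea/gitea` image name are assumptions.

```yaml
# Shared logging settings, defined once per compose file
x-loki-logging: &loki-logging
  driver: loki
  options:
    loki-url: "http://localhost:3100/loki/api/v1/push"   # assumed Loki address
    loki-retries: "5"
    loki-batch-size: "400"

services:
  gitea:
    image: gitea/gitea:latest
    logging: *loki-logging         # reuse the block above
```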
---

## Phase 2: Security Hardening (Medium Risk)

**Estimated Time:** 4-8 hours  
**Risk Level:** Medium  
**Downtime:** 5-10 minutes per service

### Tasks

- [ ] **Move passwords to environment files**
  - [ ] Create `/home/icke/env_files/` directory structure
  - [ ] Move passwords from compose files to `.env` files:
    - [ ] blog.yml → `eccmts42*`
    - [ ] nextcloud.yml → `eccmts42*`
    - [ ] helferlein.yml → `eccmts42*`
    - [ ] traccar.yml → `eccmts42*`
    - [ ] wallabag.yml → `eccmts42*`
    - [ ] zabbix.yml → `eccmts42*`
    - [ ] firefly.yml → `firefly_secure_password_123`
    - [ ] matamo.yml → `matomo`
    - [ ] n8n.yml → new secure password
  - [ ] Update `.gitignore` to exclude `.env` files
  - [ ] Document password locations in separate secure file

- [ ] **Move admin tokens to secrets**
  - [ ] Bitwarden admin token → env file
  - [ ] Firefly cron token → env file
  - [ ] Coturn static auth secret → config file

- [ ] **Create dedicated networks for isolated services**
  - [ ] Element-web (currently no network)
  - [ ] Telegram-bridge (currently no network)
  - [ ] Whatsapp-bridge (currently no network)
  - [ ] Piper (currently no network)
  - [ ] Whisper (currently no network)
  - [ ] Coturn (currently no network)

- [ ] **Remove services from shared default network**
  - Services on `compose_files_default`:
    - [ ] n8n → dedicated network
    - [ ] plex → dedicated network
    - [ ] whisper → dedicated network
    - [ ] unifi → dedicated network
    - [ ] synapse + bridges → shared matrix network
    - [ ] piper → dedicated network
    - [ ] coturn → can stay (needs to be accessible)

- [ ] **Remove deprecated `links:` directives** (7 instances)
  - [ ] blog.yml
  - [ ] helferlein.yml
  - [ ] traccar.yml
  - [ ] zabbix.yml
  - Replace with network aliases and `depends_on`

- [ ] **Review and fix user permissions**
  - [ ] Plex: Change from UID=0 to proper user
  - [ ] Jellyfin: Change from UID=0 to proper user
  - [ ] Verify other services aren't running as root unnecessarily
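
Moving a password out of a compose file, as in the first task above, could look like this sketch. The directory `/home/icke/env_files/` comes from the task list; the file name `blog.env` is an assumption, and `MYSQL_ROOT_PASSWORD` is the variable the official MySQL image reads. The env file should be readable only by its owner and excluded from version control.

```yaml
# blog.yml (excerpt) - the password no longer appears in this file
services:
  mysql-blog:
    image: mysql:5.7
    env_file:
      # Plain KEY=value lines, e.g. MYSQL_ROOT_PASSWORD=<secret>
      - /home/icke/env_files/blog.env   # hypothetical file name
```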
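
Replacing `links:` with a dedicated network, as the tasks above propose, might look like the following for the Traccar stack. The service names follow the `depends_on` list in Phase 1; the images, the network name `traccar`, and the alias `db` are assumptions. Containers on the same user-defined network already resolve each other by service name, so an alias is only needed when the application is configured with a different hostname.

```yaml
# traccar.yml (excerpt) - sketch, images are placeholders
services:
  traccar:
    image: traccar/traccar:latest   # assumed image
    # links: removed, name resolution now comes from the network
    depends_on:
      - mysql-traccar
    networks:
      - traccar

  mysql-traccar:
    image: mysql:8.0                # assumed image
    networks:
      traccar:
        aliases:
          - db                      # hypothetical extra hostname

networks:
  traccar: {}
```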
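
For the user-permission task above, a non-root Jellyfin service could be sketched as follows. This assumes the official `jellyfin/jellyfin` image, which accepts a `user:` setting; the UID/GID and host paths are placeholders, and the mounted directories must be owned by that user. Plex images typically take the user through image-specific environment variables instead, so this sketch does not carry over directly.

```yaml
# jellyfin.yml (excerpt) - illustrative values
services:
  jellyfin:
    image: jellyfin/jellyfin:latest
    user: "1000:1000"               # hypothetical UID:GID of a media user
    volumes:
      - /home/icke/jellyfin/config:/config   # hypothetical host paths
      - /home/icke/jellyfin/cache:/cache
```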
---

## Phase 3: Stability & Reliability Improvements (Medium-High Risk)

**Estimated Time:** 8-16 hours  
**Risk Level:** Medium-High  
**Downtime:** 10-30 minutes per service

### Tasks

- [ ] **Remove `container_name` from all services** (54 instances)
  - Use compose project naming with network aliases instead
  - Prevents stale endpoint issues after `docker system prune`
  - Priority services:
    - [ ] bitwarden.yml
    - [ ] blog.yml
    - [ ] gitea.yml
    - [ ] jellyfin.yml
    - [ ] plex.yml
    - [ ] synapse.yml
    - [ ] n8n.yml
    - [ ] unifi.yml
    - [ ] zabbix.yml (multiple containers)
    - [ ] firefly.yml (multiple containers)
    - [ ] Element-web, bridges (all)
    - [ ] Trading bot components
  - Note: Nextcloud already fixed ✅

- [ ] **Remove static IP addresses** (16 instances)
  - [ ] bitwarden.yml → use DNS aliases
  - [ ] blog.yml → use DNS aliases
  - [ ] jellyfin.yml → use DNS aliases
  - [ ] zabbix.yml → use DNS aliases
  - Replace with network aliases for service discovery

- [ ] **Add resource limits to all services**
  - Template (adjust per service):
    ```yaml
    deploy:
      resources:
        limits:
          memory: 1G
          cpus: '0.5'
        reservations:
          memory: 256M
    ```
  - Priority services to limit:
    - [ ] Plex (media server - high memory)
    - [ ] Jellyfin (media server - high memory)
    - [ ] N8N (automation - can grow)
    - [ ] Nextcloud (web app - high memory)
    - [ ] Synapse (Matrix - high memory)
    - [ ] MySQL/MariaDB instances
    - [ ] Zabbix server
  - Less critical services: 512M limits

- [ ] **Standardize compose file format**
  - [ ] Remove `version:` declarations (deprecated in current compose spec)
  - [ ] Use consistent YAML formatting
  - [ ] Add comments for complex configurations

- [ ] **Add volume backup labels/annotations**
  - Label critical data volumes:
    - [ ] Bitwarden data
    - [ ] Gitea data
    - [ ] Nextcloud data
    - [ ] Database volumes
    - [ ] N8N workflows
  - Prepare for automated backup solutions
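
Dropping `container_name` and static IPs in favour of aliases, as the first two tasks above describe, could look like this for Bitwarden. The image, the network name `bitwarden_net`, and the commented-out address are assumptions for illustration. The alias keeps the hostname that other containers, such as a reverse proxy, may already be using.

```yaml
# bitwarden.yml (excerpt) - sketch only
services:
  bitwarden:
    image: vaultwarden/server:latest   # assumed image
    # container_name: bitwarden        # removed
    networks:
      bitwarden_net:
        # ipv4_address: 172.20.0.10    # removed (example address)
        aliases:
          - bitwarden                  # stable DNS name on this network

networks:
  bitwarden_net: {}
```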
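
Volume labels for the backup task above can be declared on named volumes. In this sketch the label keys `backup.enabled` and `backup.schedule` are invented for illustration; a backup tool would have to be configured to look for them. Labels apply to named volumes and are normally set when the volume is created, so bind mounts under `/home/icke/` would need a different marker.

```yaml
# gitea.yml (excerpt) - hypothetical label keys
services:
  gitea:
    image: gitea/gitea:latest
    volumes:
      - gitea-data:/data

volumes:
  gitea-data:
    labels:
      backup.enabled: "true"
      backup.schedule: "daily"
```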
---

## Phase 4: Software Upgrades (High Risk)

**Estimated Time:** 4-8 hours  
**Risk Level:** High  
**Downtime:** 30-60 minutes per service  
**Recommendation:** Test in development first

### Tasks

- [ ] **Upgrade EOL MySQL 5.7 to MariaDB 10.11+**
  - [ ] Blog (mysql-blog)
    - Backup database
    - Export data
    - Switch to MariaDB
    - Import data
    - Test thoroughly
  - [ ] Helferlein (mysql-helferlein)
    - Same process as blog

- [ ] **Upgrade Zabbix 6.4 → 7.0+**
  - Current: `zabbix/zabbix-server-mysql:6.4-ubuntu-latest`
  - Target: `zabbix/zabbix-server-mysql:7.0-alpine-latest`
  - Steps:
    - [ ] Read Zabbix 7.0 migration guide
    - [ ] Backup Zabbix database
    - [ ] Update images in zabbix.yml
    - [ ] Test web UI and agents

- [ ] **Pin `:latest` tags to specific versions**
  - Services currently using `:latest`:
    - [ ] Synapse
    - [ ] Element-web
    - [ ] Jellyfin
    - [ ] Gitea
    - [ ] Telegram-bridge
    - [ ] Whatsapp-bridge
    - [ ] And others
  - Benefit: Predictable updates, easier rollback

- [ ] **Consider N8N database backend migration**
  - Current: File-based storage
  - Recommended: PostgreSQL for better performance
  - Would require N8N reconfiguration

- [ ] **Review Unifi duplicate mount**
  - Currently mounts `/home/icke/unifi` to both `/config` and `/data`
  - Clean up redundant configuration
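
Pinning tags, as in the task above, is a one-line change per service. The version numbers below are illustrative only and may not match what is currently running; the safer route is to look up the version each container runs today and pin exactly that before planning upgrades.

```yaml
# Excerpts from gitea.yml and jellyfin.yml - example versions only
services:
  gitea:
    image: gitea/gitea:1.22.3          # was: gitea/gitea:latest
  jellyfin:
    image: jellyfin/jellyfin:10.10.1   # was: jellyfin/jellyfin:latest
```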
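
If the N8N backend migration above goes ahead, the compose side might look like this sketch. The `DB_*` variables are n8n's documented database settings; the service name `n8n-db`, the env file names, the network, and the Postgres version are assumptions. Existing workflows and credentials are not moved automatically and would have to be exported and re-imported.

```yaml
# n8n.yml (excerpt) - sketch under the assumptions above
services:
  n8n:
    image: n8nio/n8n:latest            # pin a version per the task above
    environment:
      DB_TYPE: postgresdb
      DB_POSTGRESDB_HOST: n8n-db
      DB_POSTGRESDB_PORT: "5432"
      DB_POSTGRESDB_DATABASE: n8n
      DB_POSTGRESDB_USER: n8n
    env_file:
      - /home/icke/env_files/n8n.env     # holds DB_POSTGRESDB_PASSWORD
    depends_on:
      n8n-db:
        condition: service_healthy
    networks:
      - n8n

  n8n-db:
    image: postgres:16
    environment:
      POSTGRES_DB: n8n
      POSTGRES_USER: n8n
    env_file:
      - /home/icke/env_files/n8n-db.env  # holds POSTGRES_PASSWORD
    volumes:
      - n8n-db-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U n8n -d n8n"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - n8n

networks:
  n8n: {}

volumes:
  n8n-db-data: {}
```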
---

## Critical Services Priority List

Fix these services first due to security/stability concerns:

1. **N8N** (automation) - Weak password, no network isolation
2. **Bitwarden** (passwords) - Exposed admin token
3. **Gitea** (code repo) - No healthcheck, no dedicated network
4. **Blog/Helferlein** - EOL MySQL version
5. **Synapse + Bridges** - Network architecture needs improvement
6. **Services on `compose_files_default`** - Need network isolation
---

## Statistics

- **Total Services:** 39 running containers
- **Services with `container_name`:** 54 instances
- **Services with hardcoded passwords:** 20+ instances
- **Services using deprecated `links`:** 7 instances
- **Services with static IPs:** 16 instances
- **Services with Loki logging:** 24/39 (61%)
- **Services with healthchecks:** 2/39 (5%)
- **Services with resource limits:** 1/39 (3%)
- **Services using old MySQL 5.7:** 2 instances
- **Shared networks:** 13 custom networks (some overloaded)
---

## Implementation Notes

### Before Starting Any Phase

1. **Full system backup**
   - Backup all `/home/icke/` directories
   - Export all databases
   - Document current working state

2. **Create rollback plan**
   - Keep old compose files as `.yml.backup`
   - Document current container states
   - Test rollback procedure

3. **Schedule maintenance window**
   - Notify users of potential downtime
   - Choose low-traffic time period
   - Have monitoring ready

### Testing Strategy

1. Test changes on one service first
2. Monitor for 24 hours
3. Apply to similar services in batches
4. Keep previous configs for quick rollback

### Success Criteria

- All services start successfully
- No stale endpoint errors after `docker system prune`
- All services accessible via their original URLs/ports
- Logs flowing to Loki
- Healthchecks reporting healthy status
---

## Maintenance Schedule Recommendation

- **Phase 1:** Can be done immediately, low risk
- **Phase 2:** Schedule over 2-3 weekends
- **Phase 3:** One service per weekend, monitor for a week
- **Phase 4:** Full maintenance window, test environment first
---

## Additional Recommendations

### Future Improvements (Not in Roadmap)

- Consider Traefik/Nginx Proxy Manager for unified reverse proxy
- Implement automated backup solution (Duplicati, Restic, etc.)
- Add Prometheus monitoring for metrics collection
- Consider Watchtower for automated updates (carefully configured)
- Create Docker Swarm or K8s cluster for HA (if needed)
- Implement secrets management (Vault, Docker Secrets)
- Add CI/CD pipeline for compose file validation

### Documentation

- Document network architecture diagram
- Create service dependency map
- Maintain service inventory with versions
- Document backup and restore procedures
- Create runbooks for common issues
---

## Progress Tracking

Use this section to track completion:

```
Phase 1: [ ] 0/5 major tasks
Phase 2: [ ] 0/6 major tasks
Phase 3: [ ] 0/5 major tasks
Phase 4: [ ] 0/5 major tasks

Overall Progress: 0%
```
---

## Notes & Decisions

Document any decisions or deviations from this roadmap here:

- 2025-11-11: Roadmap created based on infrastructure analysis
- 2025-11-11: Nextcloud fixed (removed `container_name`, added dedicated network)
---

**Last Updated:** 2025-11-11  
**Next Review:** After Phase 1 completion