Add infrastructure improvement roadmap

- Comprehensive analysis of all 39 running containers
- Identified critical issues: container names, hardcoded passwords, network conflicts
- 4-phase improvement plan prioritized by risk and impact
- Documents specific tasks for security, stability, and upgrades
- Includes statistics and implementation guidelines
350
compose_files/INFRASTRUCTURE_ROADMAP.md
Normal file
@@ -0,0 +1,350 @@

# Docker Infrastructure Improvement Roadmap

**Generated:** November 11, 2025
**Status:** Planning Phase
**Total Services:** 39 running containers

---

## Overview

This roadmap addresses critical issues, security vulnerabilities, and operational improvements identified in the Docker Compose infrastructure. The plan is divided into 4 phases, prioritizing quick wins and critical security issues first.

---

## Phase 1: Quick Wins (Low Risk, High Impact)

**Estimated Time:** 2-4 hours
**Risk Level:** Low
**Downtime:** Minimal

### Tasks

- [ ] **Change N8N password** from "changeme" to a secure password
  - File: `n8n.yml`
  - Impact: Critical security fix
  - Downtime: < 1 minute

- [ ] **Add healthchecks to critical services**
  - [ ] Bitwarden (password manager)
  - [ ] Gitea (code repository)
  - [ ] N8N (automation)
  - [ ] Synapse (Matrix server)
  - [ ] MariaDB instances
  - Benefit: Health status for monitoring and for `depends_on` conditions. Note that Docker Compose does not restart unhealthy containers on its own; automatic restart needs Swarm or a helper such as an autoheal container.

- [ ] **Enable Loki logging for remaining 15 services**
  - Services missing logging: element-web, telegram-bridge, whatsapp-bridge, piper, whisper, gitea, coturn, trading-bot, postgres, and others
  - Benefit: Centralized log management

- [ ] **Add `depends_on` to multi-container stacks**
  - [ ] Blog → mysql-blog
  - [ ] Helferlein → mysql-helferlein
  - [ ] Traccar → mysql-traccar
  - [ ] Zabbix components
  - [ ] Matrix bridges → Synapse
  - Benefit: Proper startup order (combine with healthchecks so dependents wait until the database is actually ready)
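
As an illustration of the healthcheck, logging, and `depends_on` tasks above, a compose fragment for the blog stack might look roughly like this. It is a sketch under assumptions: the `wordpress` image, the Loki push URL, and the `BLOG_DB_ROOT_PASSWORD` variable are hypothetical, and the `loki` logging driver only works if the Loki Docker plugin is installed on the host.

```yaml
# Sketch only: image names, the Loki URL, and the variable name are assumptions.
x-loki-logging: &loki-logging
  driver: loki                       # requires the Loki Docker logging plugin
  options:
    loki-url: "http://localhost:3100/loki/api/v1/push"   # assumed endpoint

services:
  blog:
    image: wordpress                 # placeholder for the actual blog image
    depends_on:
      mysql-blog:
        condition: service_healthy   # start only after the DB reports healthy
    logging: *loki-logging

  mysql-blog:
    image: mysql:5.7                 # current version; the upgrade is a Phase 4 task
    environment:
      MYSQL_ROOT_PASSWORD: ${BLOG_DB_ROOT_PASSWORD}   # hypothetical variable
    healthcheck:
      # mysqladmin ping exits 0 once the server is reachable, even without credentials
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
      interval: 30s
      timeout: 5s
      retries: 5
      start_period: 30s
    logging: *loki-logging
```

The long form of `depends_on` with `condition: service_healthy` only works when the dependency defines a `healthcheck`; the short list form merely orders container start.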

---

## Phase 2: Security Hardening (Medium Risk)

**Estimated Time:** 4-8 hours
**Risk Level:** Medium
**Downtime:** 5-10 minutes per service

### Tasks

- [ ] **Move passwords to environment files**
  - [ ] Create `/home/icke/env_files/` directory structure
  - [ ] Move passwords from compose files to `.env` files:
    - [ ] blog.yml → `eccmts42*`
    - [ ] nextcloud.yml → `eccmts42*`
    - [ ] helferlein.yml → `eccmts42*`
    - [ ] traccar.yml → `eccmts42*`
    - [ ] wallabag.yml → `eccmts42*`
    - [ ] zabbix.yml → `eccmts42*`
    - [ ] firefly.yml → `firefly_secure_password_123`
    - [ ] matamo.yml → `matomo`
    - [ ] n8n.yml → new secure password
  - [ ] Update `.gitignore` to exclude `.env` files
  - [ ] Document password locations in separate secure file

- [ ] **Move admin tokens to secrets**
  - [ ] Bitwarden admin token → env file
  - [ ] Firefly cron token → env file
  - [ ] Coturn static auth secret → config file

- [ ] **Create dedicated networks for isolated services**
  - [ ] Element-web (currently no network)
  - [ ] Telegram-bridge (currently no network)
  - [ ] Whatsapp-bridge (currently no network)
  - [ ] Piper (currently no network)
  - [ ] Whisper (currently no network)
  - [ ] Coturn (currently no network)

- [ ] **Remove services from shared default network**
  - Services on `compose_files_default`:
    - [ ] n8n → dedicated network
    - [ ] plex → dedicated network
    - [ ] whisper → dedicated network
    - [ ] unifi → dedicated network
    - [ ] synapse + bridges → shared matrix network
    - [ ] piper → dedicated network
    - [ ] coturn → can stay (needs to be accessible)

- [ ] **Remove deprecated `links:` directives** (7 instances)
  - [ ] blog.yml
  - [ ] helferlein.yml
  - [ ] traccar.yml
  - [ ] zabbix.yml
  - Replace with network aliases and `depends_on`

- [ ] **Review and fix user permissions**
  - [ ] Plex: Change from UID=0 to proper user
  - [ ] Jellyfin: Change from UID=0 to proper user
  - [ ] Verify other services aren't running as root unnecessarily
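
To make the env-file, dedicated-network, and `links:` replacement tasks above more concrete, here is a minimal sketch for the Traccar stack. The env file name `traccar.env`, the network name `traccar`, the alias `db`, and the `traccar/traccar` image are assumptions for illustration; the real compose file may differ.

```yaml
# Sketch only: the env file name, network name, alias, and image are assumptions.
services:
  traccar:
    image: traccar/traccar           # assumed image; pin a tag per Phase 4
    env_file:
      - /home/icke/env_files/traccar.env   # kept out of Git via .gitignore
    depends_on:
      - mysql-traccar                # replaces the start ordering that `links:` implied
    networks:
      - traccar

  mysql-traccar:
    image: mysql:5.7
    env_file:
      - /home/icke/env_files/traccar.env   # would hold e.g. MYSQL_ROOT_PASSWORD
    networks:
      traccar:
        aliases:
          - db                       # hypothetical alias formerly provided by `links:`

networks:
  traccar: {}                        # dedicated network instead of compose_files_default
```

Note that `env_file:` injects variables into the container, whereas values written as `${VAR}` inside the compose file itself are resolved from the shell environment or a project-level `.env` file.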

---

## Phase 3: Stability & Reliability Improvements (Medium-High Risk)

**Estimated Time:** 8-16 hours
**Risk Level:** Medium-High
**Downtime:** 10-30 minutes per service

### Tasks

- [ ] **Remove `container_name` from all services** (54 instances)
  - Use compose project naming with network aliases instead
  - Prevents stale endpoint issues after `docker system prune`
  - Priority services:
    - [ ] bitwarden.yml
    - [ ] blog.yml
    - [ ] gitea.yml
    - [ ] jellyfin.yml
    - [ ] plex.yml
    - [ ] synapse.yml
    - [ ] n8n.yml
    - [ ] unifi.yml
    - [ ] zabbix.yml (multiple containers)
    - [ ] firefly.yml (multiple containers)
    - [ ] Element-web, bridges (all)
    - [ ] Trading bot components
  - Note: Nextcloud already fixed ✅

- [ ] **Remove static IP addresses** (16 instances)
  - [ ] bitwarden.yml → use DNS aliases
  - [ ] blog.yml → use DNS aliases
  - [ ] jellyfin.yml → use DNS aliases
  - [ ] zabbix.yml → use DNS aliases
  - Replace with network aliases for service discovery

- [ ] **Add resource limits to all services**
  - Template (adjust per service):
    ```yaml
    deploy:
      resources:
        limits:            # hard ceiling for the container
          memory: 1G
          cpus: '0.5'
        reservations:      # amount reserved for the container
          memory: 256M
    ```
  - Priority services to limit:
    - [ ] Plex (media server - high memory)
    - [ ] Jellyfin (media server - high memory)
    - [ ] N8N (automation - can grow)
    - [ ] Nextcloud (web app - high memory)
    - [ ] Synapse (Matrix - high memory)
    - [ ] MySQL/MariaDB instances
    - [ ] Zabbix server
  - Less critical services: 512M limits

- [ ] **Standardize compose file format**
  - [ ] Remove `version:` declarations (deprecated in current compose spec)
  - [ ] Use consistent YAML formatting
  - [ ] Add comments for complex configurations

- [ ] **Add volume backup labels/annotations**
  - Label critical data volumes:
    - [ ] Bitwarden data
    - [ ] Gitea data
    - [ ] Nextcloud data
    - [ ] Database volumes
    - [ ] N8N workflows
  - Prepare for automated backup solutions
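
The Phase 3 tasks above could combine into a service definition along the following lines, shown here for Gitea. This is only a sketch: the `gitea/gitea` image, the network name, the alias `git`, the volume name, and the `backup.enabled` label key are assumptions rather than values taken from the existing compose files.

```yaml
# Sketch only: image, network name, alias, volume name, and label key are assumptions.
services:
  gitea:
    image: gitea/gitea               # pin a tag per Phase 4
    # no container_name: Compose generates <project>-gitea-1
    networks:
      gitea:
        # no ipv4_address: peers resolve "gitea" or the alias through Docker DNS
        aliases:
          - git
    volumes:
      - gitea-data:/data
    deploy:
      resources:
        limits:
          memory: 1G
          cpus: '0.5'

networks:
  gitea: {}

volumes:
  gitea-data:
    labels:
      backup.enabled: "true"         # hypothetical label for a future backup tool
```

Services on the same network can already reach each other by service name, so an alias is only needed when another name must keep working. Labels can be attached to named volumes only; bind mounts under `/home/icke/` would need a different marker.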

---

## Phase 4: Software Upgrades (High Risk)

**Estimated Time:** 4-8 hours
**Risk Level:** High
**Downtime:** 30-60 minutes per service
**Recommendation:** Test in development first

### Tasks

- [ ] **Upgrade EOL MySQL 5.7 to MariaDB 10.11+**
  - [ ] Blog (mysql-blog)
    - Backup database
    - Export data
    - Switch to MariaDB
    - Import data
    - Test thoroughly
  - [ ] Helferlein (mysql-helferlein)
    - Same process as blog

- [ ] **Upgrade Zabbix 6.4 → 7.0+**
  - Current: `zabbix/zabbix-server-mysql:6.4-ubuntu-latest`
  - Target: `zabbix/zabbix-server-mysql:7.0-alpine-latest`
  - Steps:
    - [ ] Read Zabbix 7.0 migration guide
    - [ ] Backup Zabbix database
    - [ ] Update images in zabbix.yml
    - [ ] Test web UI and agents

- [ ] **Pin `:latest` tags to specific versions**
  - Services currently using `:latest`:
    - [ ] Synapse
    - [ ] Element-web
    - [ ] Jellyfin
    - [ ] Gitea
    - [ ] Telegram-bridge
    - [ ] Whatsapp-bridge
    - [ ] And others
  - Benefit: Predictable updates, easier rollback

- [ ] **Consider N8N database backend migration**
  - Current: File-based storage
  - Recommended: PostgreSQL for better performance
  - Would require N8N reconfiguration

- [ ] **Review Unifi duplicate mount**
  - Currently mounts `/home/icke/unifi` to both `/config` and `/data`
  - Clean up redundant configuration
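
For the N8N backend migration mentioned above, the reconfiguration would mostly consist of pointing N8N at a PostgreSQL service through its `DB_*` environment variables. The fragment below is a rough sketch: the service name `n8n-db`, the `N8N_DB_PASSWORD` variable, the volume name, and the `postgres:16` tag are assumptions, and the N8N image still needs a pinned tag.

```yaml
# Sketch only: service names, the password variable, and image tags are assumptions.
services:
  n8n:
    image: n8nio/n8n                 # replace with a pinned version tag
    environment:
      DB_TYPE: postgresdb
      DB_POSTGRESDB_HOST: n8n-db
      DB_POSTGRESDB_PORT: "5432"
      DB_POSTGRESDB_DATABASE: n8n
      DB_POSTGRESDB_USER: n8n
      DB_POSTGRESDB_PASSWORD: ${N8N_DB_PASSWORD}   # hypothetical variable
    depends_on:
      n8n-db:
        condition: service_healthy   # wait until PostgreSQL accepts connections

  n8n-db:
    image: postgres:16               # assumed major version
    environment:
      POSTGRES_DB: n8n
      POSTGRES_USER: n8n
      POSTGRES_PASSWORD: ${N8N_DB_PASSWORD}
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U n8n -d n8n"]
      interval: 10s
      timeout: 5s
      retries: 5
    volumes:
      - n8n-db-data:/var/lib/postgresql/data

volumes:
  n8n-db-data: {}
```

Switching the backend does not move existing data by itself; workflows and credentials in the current file-based store would need to be exported and re-imported, so this step belongs inside the backup and test procedure described for the other upgrades.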

---

## Critical Services Priority List

Fix these services first due to security/stability concerns:

1. **N8N** (automation) - Weak password, no network isolation
2. **Bitwarden** (passwords) - Exposed admin token
3. **Gitea** (code repo) - No healthcheck, no dedicated network
4. **Blog/Helferlein** - EOL MySQL version
5. **Synapse + Bridges** - Network architecture needs improvement
6. **Services on compose_files_default** - Need network isolation

---

## Statistics

- **Total Services:** 39 running containers
- **Services with `container_name`:** 54 instances
- **Services with hardcoded passwords:** 20+ instances
- **Services using deprecated `links`:** 7 instances
- **Services with static IPs:** 16 instances
- **Services with Loki logging:** 24/39 (62%)
- **Services with healthchecks:** 2/39 (5%)
- **Services with resource limits:** 1/39 (3%)
- **Services using old MySQL 5.7:** 2 instances
- **Shared networks:** 13 custom networks (some overloaded)

---

## Implementation Notes

### Before Starting Any Phase

1. **Full system backup**
   - Backup all `/home/icke/` directories
   - Export all databases
   - Document current working state

2. **Create rollback plan**
   - Keep old compose files as `.yml.backup`
   - Document current container states
   - Test rollback procedure

3. **Schedule maintenance window**
   - Notify users of potential downtime
   - Choose low-traffic time period
   - Have monitoring ready

### Testing Strategy

1. Test changes on one service first
2. Monitor for 24 hours
3. Apply to similar services in batches
4. Keep previous configs for quick rollback

### Success Criteria

- All services start successfully
- No stale endpoint errors after `docker system prune`
- All services accessible via their original URLs/ports
- Logs flowing to Loki
- Healthchecks reporting healthy status

---

## Maintenance Schedule Recommendation

- **Phase 1:** Can be done immediately, low risk
- **Phase 2:** Schedule over 2-3 weekends
- **Phase 3:** One service per weekend, monitor for a week
- **Phase 4:** Full maintenance window, test environment first

---

## Additional Recommendations

### Future Improvements (Not in Roadmap)

- Consider Traefik/Nginx Proxy Manager for unified reverse proxy
- Implement automated backup solution (Duplicati, Restic, etc.)
- Add Prometheus monitoring for metrics collection
- Consider Watchtower for automated updates (carefully configured)
- Create Docker Swarm or K8s cluster for HA (if needed)
- Implement secrets management (Vault, Docker Secrets)
- Add CI/CD pipeline for compose file validation

### Documentation

- Document network architecture diagram
- Create service dependency map
- Maintain service inventory with versions
- Document backup and restore procedures
- Create runbooks for common issues

---

## Progress Tracking

Use this section to track completion:

```
Phase 1: [ ] 0/4 major tasks
Phase 2: [ ] 0/6 major tasks
Phase 3: [ ] 0/5 major tasks
Phase 4: [ ] 0/5 major tasks

Overall Progress: 0%
```

---

## Notes & Decisions

Document any decisions or deviations from this roadmap here:

- 2025-11-11: Roadmap created based on infrastructure analysis
- 2025-11-11: Nextcloud fixed (removed container_name, added dedicated network)

---

**Last Updated:** 2025-11-11
**Next Review:** After Phase 1 completion