# Architecture Decision Records (ADR)

This document tracks all significant architectural decisions made during the project, including rationale and trade-offs.

---

## ADR-001: Cloud Provider - AWS

**Date**: 2026-06-08
**Status**: Accepted

**Decision**: Use Amazon Web Services (AWS)

**Rationale**:
- Industry-standard cloud provider with comprehensive service portfolio
- Access to managed services when beneficial
- Strong ecosystem and community support
- Terraform has excellent AWS provider support

---

## ADR-002: Infrastructure as Code - Terraform

**Date**: 2026-06-08
**Status**: Accepted

**Decision**: Use Terraform for infrastructure provisioning

**Rationale**:
- Declarative approach (aligns with project philosophy)
- Industry standard for cloud infrastructure
- Excellent AWS provider
- State management enables reproducibility

**Scope**: VPC, EC2, Security Groups, S3, Route 53

---

## ADR-003: Configuration Management - Ansible

**Date**: 2026-06-08
**Status**: Accepted

**Decision**: Use Ansible for system configuration (kept minimal)

**Rationale**:
- Avoids user-data scripts, which were hard to debug in past experience
- Idempotent: playbooks can be re-run if setup fails
- Real-time output visibility via SSH
- Professional separation of concerns: Terraform (infra) → Ansible (config) → Docker (apps)

**Scope**: Install Docker, configure system basics, set up firewall
**Philosophy**: Keep Ansible simple - no fancy roles or complexity

**Alternative Considered**: User-data scripts - rejected due to debugging difficulty and one-shot nature

---

## ADR-004: Application Deployment - Docker + Docker Compose

**Date**: 2026-06-08
**Status**: Accepted

**Decision**: Use Docker with Docker Compose for application orchestration

**Rationale**:
- Fully declarative (docker-compose.yml)
- Easy to test locally (dev/prod parity)
- Simple version control and updates
- Gitea has official Docker images
- Portable and reproducible

**Scope**: Gitea, nginx, PostgreSQL, monitoring stack (later)

---

## ADR-005: Database - Self-Hosted PostgreSQL in Docker

**Date**: 2026-06-08
**Status**: Accepted

**Decision**: PostgreSQL container, not RDS

**Rationale**:
- Simpler architecture (everything in docker-compose.yml)
- Shows ability to build and manage backups ourselves
- More control over configuration
- Cost-effective
- PostgreSQL is Gitea's recommended database

**Trade-offs**:
- **Pros**: Greater control, cost-effective, simpler architecture
- **Cons**: Requires custom backup automation and testing

**Backup Strategy**: Custom scripts with pg_dump to S3 (detailed in backup phase)

**Future Consideration**: For higher availability requirements or larger scale, RDS would provide managed backups, point-in-time recovery, and Multi-AZ deployment

---

## ADR-006: Reverse Proxy - Nginx

**Date**: 2026-06-08
**Status**: Accepted

**Decision**: Nginx as reverse proxy

**Rationale**:
- Lightweight and performant
- Simple configuration for basic proxying
- Industry standard
- Works well in Docker

**Scope**: SSL termination, proxy to Gitea, HTTP→HTTPS redirect

---

## ADR-007: SSL Certificates - Let's Encrypt

**Date**: 2026-06-08 (Updated 2026-06-11)
**Status**: Accepted

**Decision**: Let's Encrypt with certbot

**Rationale**:
- Free, automated, trusted certificates
- Widely accepted by all browsers (no certificate warnings)
- Auto-renewal reduces operational burden
- Industry-standard solution for SSL/TLS

**Requirement**: Valid domain name pointing to server

**Domain**: git.poll-streams.com (changed from gitea.poll-streams.com)

**Implementation Note**: Initial attempts hit Let's Encrypt's duplicate-certificate rate limit (5 certificates per week for the same exact set of hostnames). Because that limit is keyed to the hostname set, switching to a new hostname (git.poll-streams.com) allowed immediate issuance, and production certificates were obtained successfully.

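
The rate-limit behaviour described in the implementation note can be illustrated with a simplified sliding-window check. This is only a sketch: it assumes a limit of 5 certificates per 7 days per exact hostname set, and `can_issue` and the `history` log format are hypothetical names invented for illustration, not part of certbot or the Let's Encrypt API.

```python
from datetime import datetime, timedelta

# Assumed limit: 5 certificates per exact set of hostnames in a rolling 7-day window.
DUPLICATE_CERT_LIMIT = 5
WINDOW = timedelta(days=7)


def can_issue(hostnames, history, now):
    """Return True if one more certificate for this exact hostname set fits in the window.

    history is a list of (frozenset_of_hostnames, issued_at) tuples; this log
    format is hypothetical and exists only for the example.
    """
    key = frozenset(hostnames)
    # Count only issuances for the identical hostname set inside the window.
    recent = [t for names, t in history if names == key and now - t < WINDOW]
    return len(recent) < DUPLICATE_CERT_LIMIT


# Five recent certificates for the old hostname exhaust its budget,
# while a different hostname set starts with an empty history.
now = datetime(2026, 6, 11, 12, 0, 0)
old_set = frozenset(["gitea.poll-streams.com"])
history = [(old_set, now - timedelta(days=d)) for d in range(5)]
old_allowed = can_issue(["gitea.poll-streams.com"], history, now)
new_allowed = can_issue(["git.poll-streams.com"], history, now)
```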
---

## ADR-008: Update Automation - Diun + Custom Scripts

**Date**: 2026-06-08
**Status**: Accepted (Updated 2026-06-09)

**Decision**: Diun (Docker Image Update Notifier) for monitoring + custom bash scripts for orchestration

**Rationale**:
- Diun monitors for updates and sends email notifications (built-in)
- Enables differentiated update policies per container
- Custom scripts provide full control over update workflow
- Supports pre-update backups and health checks
- Allows manual approval for critical components (Gitea, PostgreSQL)
- Auto-update for low-risk components (nginx, certbot)
- Demonstrates production-level engineering (not just "update everything")

**Update Strategy**:
- **Schedule**: Weekly checks during off-hours
- **Nginx/Certbot**: Automatic updates after backup
- **Gitea/PostgreSQL**: Email notification, manual approval required
- **Backup**: Pre-update backup to S3 (database + Gitea data)
- **Health Checks**: Post-update validation
- **Rollback**: Automatic rollback on health check failure
- **Notifications**: Email alerts on critical failures, logs for successful updates

**Scope**:
- Diun container monitors all Docker images
- `auto-update.sh` - automated update for nginx/certbot
- `manual-update.sh` - operator-approved update for gitea/postgres
- Health check and rollback logic

**Alternative Considered**: Watchtower - rejected because it lacks per-container policies, pre-update backups, and proper notification support

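
The update strategy in this ADR (per-container policy, pre-update backup, health check, rollback on failure) could be sketched roughly as follows. The real orchestration lives in bash scripts; this Python version only illustrates the control flow. `UPDATE_POLICY`, `plan_update`, `run_update`, and the hook callables are hypothetical names, and defaulting unknown containers to manual approval is an assumption.

```python
# Policy table mirroring the update strategy in this ADR; names are illustrative.
UPDATE_POLICY = {
    "nginx": "auto",
    "certbot": "auto",
    "gitea": "manual",
    "postgres": "manual",
}


def plan_update(container):
    """Return 'auto' or 'manual'; unknown containers fall back to manual (assumption)."""
    return UPDATE_POLICY.get(container, "manual")


def run_update(container, backup, apply, health_check, rollback, notify):
    """Back up, apply the update, validate, and roll back on failure.

    The five callables are hypothetical hooks standing in for the real scripts.
    Returns "updated" or "rolled_back".
    """
    backup(container)              # pre-update backup (to S3 in this ADR)
    apply(container)               # pull the new image and recreate the container
    if health_check(container):    # post-update validation
        return "updated"
    rollback(container)            # restore the previous image and data
    notify(f"update of {container} failed its health check and was rolled back")
    return "rolled_back"


# Simulate a failed nginx update and record the order in which hooks run.
calls = []
result = run_update(
    "nginx",
    backup=lambda c: calls.append("backup"),
    apply=lambda c: calls.append("apply"),
    health_check=lambda c: False,
    rollback=lambda c: calls.append("rollback"),
    notify=lambda message: calls.append("notify"),
)
```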
---

## ADR-012: CI/CD - Gitea Actions with Self-Hosted Runners

**Date**: 2026-06-11
**Status**: Accepted

**Decision**: Use Gitea Actions with self-hosted runners for CI/CD

**Rationale**:
- Native integration with Gitea (no external CI service)
- Self-hosted runners provide full control and security
- GitHub Actions-compatible workflow syntax (familiar, well-documented)
- Enables automated testing before merging changes
- Demonstrates production-grade CI/CD practices

**Implementation**:
- **Runners**: 2x act_runner v0.2.10 instances as systemd services
- **Automation**: Ansible playbook (setup-runner.yml) for reproducible deployment
- **Runner Registration**: Automated via Gitea API with token from AWS Secrets Manager
- **Networking**: Host network mode for job containers to access Gitea
- **Registration URL**: https://git.poll-streams.com (public URL for git clone operations)
- **Workflow**: .gitea/workflows/test.yml runs integration tests on PRs
- **Features**: Docker layer caching, artifact uploads, workflow_dispatch support

**Technical Details**:
- Each runner has dedicated config directory (/etc/act_runner-{1,2})
- Configuration includes host networking to allow job containers to reach services
- Runners registered with public URL to avoid localhost connection issues
- Systemd manages runner lifecycle with automatic restart

**Benefits**:
- Automated quality gates before merging
- Consistent test environment (matches CI exactly)
- Fast feedback on code changes
- Self-contained solution (no external dependencies)

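
As a rough illustration of the two-runner layout described in this ADR, the sketch below derives the per-runner settings that a playbook might template. Only the `/etc/act_runner-{1,2}` directories and the public URL come from this document; the unit names, the `config.yaml` file name, and the `restart` value are assumptions made for the example.

```python
from pathlib import PurePosixPath

GITEA_URL = "https://git.poll-streams.com"  # public registration URL from this ADR
RUNNER_COUNT = 2


def runner_layout(index):
    """Describe one runner instance; unit and file names are assumptions."""
    name = f"act_runner-{index}"
    config_dir = PurePosixPath("/etc") / name
    return {
        "unit": f"{name}.service",                      # hypothetical systemd unit name
        "config_dir": str(config_dir),                  # dedicated directory per runner
        "config_file": str(config_dir / "config.yaml"), # hypothetical file name
        "instance_url": GITEA_URL,
        "restart": "always",                            # assumed systemd restart policy
    }


# One entry per runner, numbered from 1 to match /etc/act_runner-{1,2}.
runners = [runner_layout(i) for i in range(1, RUNNER_COUNT + 1)]
```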
---

## ADR-009: Monitoring - Prometheus + Grafana

**Date**: 2026-06-08
**Status**: Accepted (implementation later)

**Decision**: Prometheus for metrics, Grafana for visualization

**Rationale**:
- Industry standard monitoring stack
- Powerful querying with PromQL
- Rich visualization and alerting capabilities
- Strong community and pre-built dashboards

**Note**: To be implemented in later phase

---

## ADR-010: Logging - Loki + Promtail

**Date**: 2026-06-08
**Status**: Accepted (implementation later)

**Decision**: Loki for log aggregation, Promtail for collection

**Rationale**:
- Lightweight compared to ELK stack
- Integrates with Grafana (single pane of glass)
- Good fit for Docker environments

**Note**: To be implemented in later phase

---

## ADR-011: Backup Strategy - Custom Scripts + S3

**Date**: 2026-06-08
**Status**: Accepted (implementation later)

**Decision**: Bash scripts with pg_dump and AWS S3

**Rationale**:
- Simple and maintainable
- Full control over backup process and scheduling
- S3 is designed for 99.999999999% (11 nines) object durability
- Easy to test and validate restore procedures

**Scope**:
- Database backups (pg_dump)
- Gitea repository data
- Configuration files
- Automated scheduling with cron

**Note**: Details to be designed in backup phase

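
Since the backup details are still to be designed, the following is only a sketch of how a timestamped S3 object key and a `pg_dump` invocation might be assembled. The key layout, the `backup_key` and `pg_dump_command` helpers, and the database and user names are all assumptions. The upload step and any `docker exec` wrapper are deliberately left out.

```python
from datetime import datetime, timezone


def backup_key(prefix, database, when):
    """Build a timestamped S3 object key (hypothetical naming scheme)."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix}/{database}-{stamp}.dump"


def pg_dump_command(database, user, outfile):
    """Return an argv list for pg_dump in custom format, suitable for subprocess.run."""
    return [
        "pg_dump",
        "--format=custom",       # compressed archive readable by pg_restore
        f"--username={user}",
        f"--file={outfile}",
        database,
    ]


# Example values; "gitea" as database and user name is an assumption.
key = backup_key(
    "backups/postgres",
    "gitea",
    datetime(2026, 6, 8, 2, 0, 0, tzinfo=timezone.utc),
)
command = pg_dump_command("gitea", "gitea", "/tmp/gitea.dump")
```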
---

## Technology Stack Summary

| Layer | Technology | Rationale |
|-------|-----------|-----------|
| **Cloud** | AWS | Industry standard |
| **Infrastructure** | Terraform | Declarative IaC |
| **Configuration** | Ansible (minimal) | System setup, avoids user-data |
| **Compute** | EC2 | Flexible VM hosting |
| **Application** | Docker Compose | Declarative orchestration |
| **Database** | PostgreSQL (Docker) | Self-managed, shows control |
| **Reverse Proxy** | Nginx | Lightweight, standard |
| **SSL** | Let's Encrypt | Free, automated, professional |
| **DNS** | Route 53 | AWS-native |
| **Updates** | Diun + Scripts | Per-container policies, backup/rollback |
| **CI/CD** | Gitea Actions | Self-hosted runners, native integration |
| **Backups** | Scripts + S3 | Custom, controlled |
| **Monitoring** | Prometheus + Grafana | Industry standard |
| **Logging** | Loki + Promtail | Lightweight, integrated |

---

## Core Principles

1. **Simplicity First**: Avoid overengineering
2. **Declarative Over Imperative**: Terraform, Docker Compose
3. **Infrastructure as Code**: Everything version-controlled
4. **Show Control**: Build things ourselves where it demonstrates skill
5. **Professional**: Production-grade practices