qvest-task/ADR.md
gitea_admin 685de1816d feat: implement update automation and backup system with CI tests (#1)
- Diun monitors Docker images
- Automated updates for nginx, manual approval for gitea/postgres
- Weekly cert renewal automation via cron
- Health checks with automatic rollback on failure
- AWS SES email notifications on update failures
- Daily S3 backups + pre-update snapshots
- Integration tests with Gitea Actions quality gate
- Change domain from gitea.poll-streams.com to git.poll-streams.com
- Add diagrams
2026-06-11 15:51:48 +00:00


# Architecture Decision Records (ADR)
This document tracks all significant architectural decisions made during the project, including rationale and trade-offs.
---
## ADR-001: Cloud Provider - AWS
**Date**: 2026-06-08
**Status**: Accepted
**Decision**: Use Amazon Web Services (AWS)
**Rationale**:
- Industry-standard cloud provider with comprehensive service portfolio
- Access to managed services when beneficial
- Strong ecosystem and community support
- Terraform has excellent AWS provider support
---
## ADR-002: Infrastructure as Code - Terraform
**Date**: 2026-06-08
**Status**: Accepted
**Decision**: Use Terraform for infrastructure provisioning
**Rationale**:
- Declarative approach (aligns with project philosophy)
- Industry standard for cloud infrastructure
- Excellent AWS provider
- State management enables reproducibility
**Scope**: VPC, EC2, Security Groups, S3, Route 53
---
## ADR-003: Configuration Management - Ansible
**Date**: 2026-06-08
**Status**: Accepted
**Decision**: Use Ansible for system configuration (kept minimal)
**Rationale**:
- Avoids user-data scripts, which proved hard to debug in past experience
- Idempotent - can re-run if setup fails
- Real-time output visibility via SSH
- Professional separation of concerns: Terraform (infra) → Ansible (config) → Docker (apps)
**Scope**: Install Docker, configure system basics, setup firewall
**Philosophy**: Keep Ansible simple - no fancy roles or complexity
**Alternative Considered**: User-data scripts - rejected due to debugging difficulty and one-shot nature
---
## ADR-004: Application Deployment - Docker + Docker Compose
**Date**: 2026-06-08
**Status**: Accepted
**Decision**: Use Docker with Docker Compose for application orchestration
**Rationale**:
- Fully declarative (docker-compose.yml)
- Easy to test locally (dev/prod parity)
- Simple version control and updates
- Gitea has official Docker images
- Portable and reproducible
**Scope**: Gitea, nginx, PostgreSQL, monitoring stack (later)
---
## ADR-005: Database - Self-Hosted PostgreSQL in Docker
**Date**: 2026-06-08
**Status**: Accepted
**Decision**: PostgreSQL container, not RDS
**Rationale**:
- Simpler architecture (everything in docker-compose.yml)
- Shows ability to build and manage backups ourselves
- More control over configuration
- Cost-effective
- PostgreSQL is Gitea's recommended database
**Trade-offs**:
- **Pros**: Greater control, cost-effective, simpler architecture
- **Cons**: Requires custom backup automation and testing
**Backup Strategy**: Custom scripts that run pg_dump and upload to S3 (detailed in the backup phase)
**Future Consideration**: For higher availability requirements or larger scale, RDS would provide managed backups, point-in-time recovery, and Multi-AZ deployment
---
## ADR-006: Reverse Proxy - Nginx
**Date**: 2026-06-08
**Status**: Accepted
**Decision**: Nginx as reverse proxy
**Rationale**:
- Lightweight and performant
- Simple configuration for basic proxying
- Industry standard
- Works well in Docker
**Scope**: SSL termination, proxy to Gitea, HTTP→HTTPS redirect
---
## ADR-007: SSL Certificates - Let's Encrypt
**Date**: 2026-06-08 (Updated 2026-06-11)
**Status**: Accepted
**Decision**: Let's Encrypt with certbot
**Rationale**:
- Free, automated, trusted certificates
- Widely accepted by all browsers (no certificate warnings)
- Auto-renewal reduces operational burden
- Industry-standard solution for SSL/TLS
**Requirement**: Valid domain name pointing to server
**Domain**: git.poll-streams.com (changed from gitea.poll-streams.com)
**Implementation Note**: Initial attempts hit a Let's Encrypt rate limit (5 certificates per exact set of hostnames per week). Moving to a new hostname (git.poll-streams.com), which counts as a different set, allowed a production certificate to be issued immediately. Production certificates obtained successfully.
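
The rate-limit behaviour described in the note can be illustrated with a simplified model. This is only a sketch: it assumes the limit is counted per exact set of hostnames over a rolling 7-day window, and the issuance history below is hypothetical. The real accounting is done by Let's Encrypt and may differ in detail.

```python
from datetime import datetime, timedelta

# Assumption: 5 certificates per exact set of hostnames in a rolling 7-day window.
LIMIT = 5
WINDOW = timedelta(days=7)


def can_issue(hostnames, issuance_log, now):
    """Return True if one more certificate for this exact hostname set fits in the window."""
    key = frozenset(hostnames)
    recent = [t for t in issuance_log.get(key, []) if now - t < WINDOW]
    return len(recent) < LIMIT


# Hypothetical history: five certificates for the old hostname within the last week.
now = datetime(2026, 6, 11, 12, 0)
issuance_log = {
    frozenset({"gitea.poll-streams.com"}): [
        now - timedelta(hours=h) for h in (1, 5, 20, 30, 50)
    ],
}

print(can_issue({"gitea.poll-streams.com"}, issuance_log, now))  # False: limit reached
print(can_issue({"git.poll-streams.com"}, issuance_log, now))    # True: separate counter
```

Because the counter is keyed on the hostname set, the new name starts from zero, which is why the switch unblocked issuance without waiting for the window to expire.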
---
## ADR-008: Update Automation - Diun + Custom Scripts
**Date**: 2026-06-08 (Updated 2026-06-09)
**Status**: Accepted
**Decision**: Diun (Docker Image Update Notifier) for monitoring + custom bash scripts for orchestration
**Rationale**:
- Diun monitors for updates and sends email notifications (built-in)
- Enables differentiated update policies per container
- Custom scripts provide full control over update workflow
- Supports pre-update backups and health checks
- Allows manual approval for critical components (Gitea, PostgreSQL)
- Auto-update for low-risk components (nginx, certbot)
- Demonstrates production-level engineering (not just "update everything")
**Update Strategy**:
- **Schedule**: Weekly checks during off-hours
- **Nginx/Certbot**: Automatic updates after backup
- **Gitea/PostgreSQL**: Email notification, manual approval required
- **Backup**: Pre-update backup to S3 (database + Gitea data)
- **Health Checks**: Post-update validation
- **Rollback**: Automatic rollback on health check failure
- **Notifications**: Email alerts on critical failures, logs for successful updates
**Scope**:
- Diun container monitors all Docker images
- `auto-update.sh` - automated update for nginx/certbot
- `manual-update.sh` - operator-approved update for gitea/postgres
- Health check and rollback logic
**Alternative Considered**: Watchtower - rejected because it is built to apply updates automatically and has no built-in manual approval step, pre-update backup to S3, or health-check-driven rollback, all of which this workflow requires
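
The update strategy above can be sketched as a small control flow. This is an illustration under stated assumptions, not the contents of `auto-update.sh` or `manual-update.sh`: the function name `run_update`, the `POLICIES` table, and the injected callables (`backup`, `apply`, `healthy`, `rollback`, `notify`) are all hypothetical stand-ins for the real scripts.

```python
from dataclasses import dataclass

# Per-container policy, mirroring the Update Strategy list (names are illustrative).
POLICIES = {
    "nginx": "auto",
    "certbot": "auto",
    "gitea": "manual",
    "postgres": "manual",
}


@dataclass
class UpdateResult:
    container: str
    outcome: str  # "updated", "rolled_back", or "awaiting_approval"


def run_update(container, backup, apply, healthy, rollback, notify, approved=False):
    """Back up, update, validate, and roll back on failure, honouring the policy."""
    policy = POLICIES.get(container, "manual")  # unknown containers need approval
    if policy == "manual" and not approved:
        notify(f"{container}: update available, manual approval required")
        return UpdateResult(container, "awaiting_approval")

    backup(container)           # pre-update backup always comes first
    apply(container)            # pull the new image and recreate the container
    if healthy(container):      # post-update validation
        return UpdateResult(container, "updated")

    rollback(container)         # restore the previous image
    notify(f"{container}: health check failed, rolled back")
    return UpdateResult(container, "rolled_back")


if __name__ == "__main__":
    noop = lambda name: None
    result = run_update("nginx", noop, noop, lambda name: True, noop, print)
    print(result.outcome)  # updated
```

Passing the steps in as callables keeps the ordering (backup, apply, check, rollback) testable without touching Docker or S3.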
---
## ADR-012: CI/CD - Gitea Actions with Self-Hosted Runners
**Date**: 2026-06-11
**Status**: Accepted
**Decision**: Use Gitea Actions with self-hosted runners for CI/CD
**Rationale**:
- Native integration with Gitea (no external CI service)
- Self-hosted runners provide full control and security
- GitHub Actions-compatible workflow syntax (familiar, well-documented)
- Enables automated testing before merging changes
- Demonstrates production-grade CI/CD practices
**Implementation**:
- **Runners**: 2x act_runner v0.2.10 instances as systemd services
- **Automation**: Ansible playbook (setup-runner.yml) for reproducible deployment
- **Runner Registration**: Automated via Gitea API with token from AWS Secrets Manager
- **Networking**: Host network mode for job containers to access Gitea
- **Registration URL**: https://git.poll-streams.com (public URL for git clone operations)
- **Workflow**: .gitea/workflows/test.yml runs integration tests on PRs
- **Features**: Docker layer caching, artifact uploads, workflow_dispatch support
**Technical Details**:
- Each runner has dedicated config directory (/etc/act_runner-{1,2})
- Configuration includes host networking to allow job containers to reach services
- Runners registered with public URL to avoid localhost connection issues
- Systemd manages runner lifecycle with automatic restart
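
As a rough sketch of the per-runner layout described above, the following renders one systemd unit per runner. The binary path, the `config.yaml` file name, and the unit names are assumptions made for illustration; the actual values live in the setup-runner.yml playbook, which is not shown here.

```python
# Hypothetical install path; the ADR does not say where the binary lives.
RUNNER_BINARY = "/usr/local/bin/act_runner"

UNIT_TEMPLATE = """\
[Unit]
Description=Gitea act_runner {index}
After=network-online.target docker.service

[Service]
ExecStart={binary} daemon --config {config_dir}/config.yaml
WorkingDirectory={config_dir}
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
"""


def config_dir(index):
    """Dedicated config directory for runner number `index`."""
    return f"/etc/act_runner-{index}"


def render_unit(index):
    """Render the unit file text for one runner instance."""
    return UNIT_TEMPLATE.format(
        index=index, binary=RUNNER_BINARY, config_dir=config_dir(index)
    )


# Two runners, matching the 2x setup in the Implementation list.
units = {f"act_runner-{i}.service": render_unit(i) for i in (1, 2)}

if __name__ == "__main__":
    for name, text in units.items():
        print(f"# {name}\n{text}")
```

`Restart=always` is what gives the automatic restart behaviour; keeping each runner in its own directory lets the two instances hold separate registration state.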
**Benefits**:
- Automated quality gates before merging
- Consistent, reproducible test environment for every run
- Fast feedback on code changes
- Self-contained solution (no external dependencies)
---
## ADR-009: Monitoring - Prometheus + Grafana
**Date**: 2026-06-08
**Status**: Accepted (implementation later)
**Decision**: Prometheus for metrics, Grafana for visualization
**Rationale**:
- Industry standard monitoring stack
- Powerful querying with PromQL
- Rich visualization and alerting capabilities
- Strong community and pre-built dashboards
**Note**: To be implemented in a later phase
---
## ADR-010: Logging - Loki + Promtail
**Date**: 2026-06-08
**Status**: Accepted (implementation later)
**Decision**: Loki for log aggregation, Promtail for collection
**Rationale**:
- Lightweight compared to ELK stack
- Integrates with Grafana (single pane of glass)
- Good fit for Docker environments
**Note**: To be implemented in a later phase
---
## ADR-011: Backup Strategy - Custom Scripts + S3
**Date**: 2026-06-08
**Status**: Accepted (implementation later)
**Decision**: Bash scripts with pg_dump and AWS S3
**Rationale**:
- Simple and maintainable
- Full control over backup process and scheduling
- S3 provides highly durable storage (designed for 99.999999999% object durability)
- Easy to test and validate restore procedures
**Scope**:
- Database backups (pg_dump)
- Gitea repository data
- Configuration files
- Automated scheduling with cron
**Note**: Details to be designed in the backup phase
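
Since the details are still open, the following is only one possible shape for the database part of the backup: a timestamped S3 key plus the two commands a cron job could pipe together. The container name, database user, bucket name, and key layout are all hypothetical; nothing here is taken from the actual scripts.

```python
from datetime import datetime, timezone


def backup_key(kind, when, prefix="backups"):
    """Build a sortable S3 object key such as backups/postgres/20260611T030000Z.dump."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix}/{kind}/{stamp}.dump"


def pg_dump_command(container, user, database):
    """pg_dump inside the database container, custom format (-Fc), written to stdout."""
    return ["docker", "exec", container, "pg_dump", "-U", user, "-d", database, "-Fc"]


def s3_upload_command(bucket, key):
    """Stream stdin to S3; `aws s3 cp -` reads the object body from standard input."""
    return ["aws", "s3", "cp", "-", f"s3://{bucket}/{key}"]


if __name__ == "__main__":
    when = datetime(2026, 6, 11, 3, 0, tzinfo=timezone.utc)
    key = backup_key("postgres", when)
    dump = pg_dump_command("postgres", "gitea", "gitea")        # hypothetical names
    upload = s3_upload_command("example-backup-bucket", key)    # hypothetical bucket
    print(" ".join(dump), "|", " ".join(upload))
```

UTC timestamps in the key keep objects in chronological order when listed, which makes it straightforward to pick the latest dump for a restore test.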
---
## Technology Stack Summary
| Layer | Technology | Rationale |
|-------|-----------|-----------|
| **Cloud** | AWS | Industry standard |
| **Infrastructure** | Terraform | Declarative IaC |
| **Configuration** | Ansible (minimal) | System setup, avoids user-data |
| **Compute** | EC2 | Flexible VM hosting |
| **Application** | Docker Compose | Declarative orchestration |
| **Database** | PostgreSQL (Docker) | Self-managed, shows control |
| **Reverse Proxy** | Nginx | Lightweight, standard |
| **SSL** | Let's Encrypt | Free, automated, professional |
| **DNS** | Route 53 | AWS-native |
| **Updates** | Diun + Scripts | Per-container policies, backup/rollback |
| **CI/CD** | Gitea Actions | Self-hosted runners, native integration |
| **Backups** | Scripts + S3 | Custom, controlled |
| **Monitoring** | Prometheus + Grafana | Industry standard |
| **Logging** | Loki + Promtail | Lightweight, integrated |
---
## Core Principles
1. **Simplicity First**: Avoid overengineering
2. **Declarative Over Imperative**: Terraform, Docker Compose
3. **Infrastructure as Code**: Everything version-controlled
4. **Show Control**: Build things ourselves where it demonstrates skill
5. **Professional**: Production-grade practices