Roadmap

This is the implementation roadmap for the project. It breaks the work into incremental milestones and features so that development and deployment follow a structured sequence.

Phase 1: Conceptualization and Planning

This phase is carried out through discussion and research; no code is implemented. It includes the following steps:

1.1 Requirements Analysis

  • Define the scope and requirements of the project
  • Identify constraints and non-functional requirements
  • Determine host environment (cloud provider, VPS, or local)

1.2 Technology Selection

Decisions documented in ADR.md

  • Cloud: AWS
  • Infrastructure as Code: Terraform
  • Configuration Management: Ansible (kept minimal)
  • Application Deployment: Docker + Docker Compose
  • Database: PostgreSQL (self-hosted in Docker)
  • Reverse Proxy: Nginx
  • SSL: Let's Encrypt with certbot
  • Update Automation: Diun + Custom Scripts
  • Monitoring: Prometheus + Grafana (later phase)
  • Logging: Loki + Promtail (later phase)
  • Backup: Custom scripts + S3 (later phase)

1.3 Architecture Design

  • Overall system architecture designed
  • Network topology planned (VPC, subnets, security groups)
  • Three architecture diagrams created in docs/diagrams/

1.4 Project Structure

  • Directory structure planned (to be created incrementally, phase by phase)
  • Documentation structure in place (docs/diagrams/)
  • Naming conventions: lowercase, hyphens for files, descriptive names

Goals:

  • A clear, complete roadmap for the project, available in this file
  • Technology stack documented with rationale (see ADR.md)
  • Architecture diagrams created (3 diagrams in docs/diagrams/)
  • Project structure planned

Phase 1 Complete! Ready to begin Phase 2 (Infrastructure Setup).


Phase 2: Infrastructure Setup

This phase provisions the AWS infrastructure using Terraform.

2.1 Terraform Backend Setup

  • Configure AWS CLI and credentials locally
  • Set up Terraform backend (S3 bucket for state storage)
  • Initialize Terraform working directory

2.2 Core Infrastructure

  • Create VPC with single public subnet
  • Set up Internet Gateway
  • Configure Security Group for EC2 (ports 22, 80, 443)
  • Provision EC2 instance (t3.medium, Ubuntu 24.04) with IAM role
  • Create S3 bucket for backups (with versioning & encryption)
  • Configure Route 53 DNS records (A record: git.poll-streams.com → EC2)
  • Use official Terraform AWS modules (VPC, Security Group)
  • Terraform code split into separate files: main.tf, vpc.tf, security.tf, compute.tf, storage.tf, iam.tf, dns.tf, outputs.tf

2.3 Security Configuration

  • Configure SSH key-based authentication (Ed25519, generated via Terraform)
  • SSH access open to any source address (0.0.0.0/0); protected by key-based authentication only
  • Apply IAM policies (AmazonS3FullAccess for EC2 backups)
  • Security group follows least privilege (inbound only on ports 22, 80, 443; all outbound allowed)
  • Encrypted EBS root volume (30GB gp3)

Goals:

  • AWS infrastructure fully defined in Terraform code
  • EC2 instance provisioned and accessible via SSH
  • S3 backup bucket created
  • Domain DNS configured and resolving
  • Infrastructure can be destroyed with terraform destroy and recreated with terraform apply

Phase 2 Complete! Ready to begin Phase 3 (Automated Gitea Deployment).


Phase 3: Automated Gitea Deployment

This phase implements the automated, reproducible Gitea installation.

3.1 Database Setup

  • PostgreSQL 18.4 deployed via Docker Compose
  • Database credentials stored in AWS Secrets Manager
  • Random password generation via Terraform
  • Volume mounted at /var/lib/postgresql (PostgreSQL 18+ requirement)
  • Health checks configured with pg_isready

3.2 Gitea Installation

  • Gitea 1.22.6 deployed via Docker Compose
  • Ansible playbooks created: setup-system.yml, deploy-gitea.yml, setup-ssl.yml, site.yml
  • Docker + AWS CLI installation automated
  • Gitea configured with environment variables (database, domain, ROOT_URL)
  • SSH git access on port 2222
  • Volumes for persistent data

3.3 Reverse Proxy Configuration

  • Nginx 1.27-alpine deployed via Docker Compose
  • Let's Encrypt SSL certificate obtained via certbot (production)
  • Domain: git.poll-streams.com (migrated to this domain to avoid Let's Encrypt rate limits)
  • Two-stage nginx config (HTTP-only for ACME, then HTTPS)
  • SSL termination at nginx, proxy to Gitea on port 3000
  • HTTP to HTTPS redirect configured
  • Security headers (HSTS, X-Frame-Options, etc.)
  • WebSocket support for real-time features
  • 512MB upload limit

3.4 Testing

  • HTTPS access verified: https://git.poll-streams.com
  • Valid SSL certificate (Let's Encrypt production)
  • HTTP → HTTPS redirect working
  • Gitea web interface accessible and functional
  • User account created, repository created
  • Git push via HTTPS tested successfully
  • Full deployment reproducible via ansible-playbook site.yml

Goals:

  • Gitea running and accessible via HTTPS through reverse proxy
  • Installation fully automated and reproducible
  • Production-grade deployment with SSL

Phase 3 Complete! Gitea is fully deployed, secured with SSL, and accessible from the internet.


Phase 4: Update Automation

This phase implements automated update mechanisms for Gitea and related components.

4.1 Update Strategy Design

  • Weekly update checks (Sunday 3:00 AM)
  • Per-container update policies (automatic vs manual)
  • Pre-update backup to S3
  • Post-update health checks
  • Automatic rollback on failure
  • Email notifications via AWS SES
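
The per-container policy split can be pictured as a small lookup. This is a minimal sketch, not the project's implementation: UPDATE_POLICIES and plan_updates are hypothetical names, the nginx/gitea/postgres assignments follow the policies stated in this roadmap, and treating any unlisted container as manual is an assumption.

```python
# Hypothetical policy table: automatic for low-risk containers, manual for critical ones.
UPDATE_POLICIES = {
    "nginx": "automatic",
    "gitea": "manual",
    "postgres": "manual",
}


def plan_updates(containers_with_updates):
    """Split containers that have a new image into automatic and manual lists."""
    automatic, manual = [], []
    for name in containers_with_updates:
        # Assumption: a container without an explicit policy is handled manually.
        policy = UPDATE_POLICIES.get(name, "manual")
        if policy == "automatic":
            automatic.append(name)
        else:
            manual.append(name)
    return automatic, manual
```

Defaulting to manual keeps an unexpected container from being recreated without an operator looking at it first.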

4.2 Update Monitoring

  • Diun 4.33 deployed for Docker image update detection
  • Scheduled weekly checks (cron: 0 3 * * 0)
  • Monitors: postgres, gitea, nginx, diun
  • Email notifications configured via AWS SES SMTP
  • IAM user created for SMTP credentials
  • Labels define update policies per container

4.3 Automated Scripts

  • backup.sh: Database + Gitea data backup to S3 bucket
  • health-check.sh: Validates all services running and responsive
  • auto-update.sh: Automatic updates for low-risk containers (nginx)
    • Backup before update
    • Pull new image
    • Recreate container
    • Health check validation
    • Automatic rollback on failure
    • Email notifications
  • manual-update.sh: Manual updates for critical containers (gitea/postgres)
    • Operator confirmation required
    • Same safety flow as auto-update
    • Success/failure notifications
  • test-integration.sh: Comprehensive integration test suite for CI/CD
    • Script syntax validation (bash -n)
    • Docker Compose configuration validation
    • Backup archive creation and validation
    • Health check failure detection
    • Update workflow with rollback simulation
    • Full backup and restore cycle testing (22 assertions total)
    • Isolated test environment (/tmp)
    • No dependencies on live services
  • restore.sh: Disaster recovery from S3 backups
    • Downloads latest backups from S3
    • Restores database, Gitea data, and configuration
    • Service stop/start orchestration
    • Tested successfully on live system (timestamp 20260611_164408)
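
The auto-update.sh safety flow listed above (backup, pull, recreate, health check, rollback, notify) can be sketched as follows. The real scripts are bash; this Python version only illustrates the control flow. Every step is an injected callable with a hypothetical name, and aborting when the backup fails is an assumption about the intended behaviour.

```python
def safe_update(container, *, backup, pull, recreate, is_healthy, rollback, notify):
    """Run one update: backup, pull, recreate, health check, rollback on failure.

    All steps are passed in as callables (hypothetical stand-ins), so the flow
    can be read and tested without Docker or S3. Returns True if the update
    was kept, False if it was aborted or rolled back.
    """
    try:
        backup()  # pre-update backup; if it raises, nothing has been changed yet
    except Exception as exc:
        notify(f"{container}: update aborted, backup failed ({exc})")
        return False

    try:
        pull(container)
        recreate(container)
        healthy = is_healthy()
    except Exception:
        healthy = False  # a failed pull or recreate is handled like a failed health check

    if healthy:
        notify(f"{container}: update succeeded")
        return True

    rollback(container)
    notify(f"{container}: update failed, rolled back")
    return False
```

manual-update.sh would wrap the same flow behind an operator confirmation prompt.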

Script Quality:

  • All scripts follow DRY principles with extracted helper functions
  • Consistent error handling and logging patterns
  • Configurable timeouts; magic numbers replaced with named constants
  • Comprehensive comments and documentation headers
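
As an illustration of the named-constant and configurable-timeout pattern, a health-check retry loop might look like this. The constant names and values are assumptions chosen for the example, not taken from the scripts.

```python
import time

# Named constants instead of magic numbers (illustrative values, not the project's).
HEALTH_CHECK_RETRIES = 5
HEALTH_CHECK_DELAY_SECONDS = 2.0


def wait_until_healthy(probe, retries=HEALTH_CHECK_RETRIES,
                       delay=HEALTH_CHECK_DELAY_SECONDS, sleep=time.sleep):
    """Call probe() up to `retries` times, sleeping `delay` seconds between attempts."""
    for attempt in range(1, retries + 1):
        if probe():
            return True
        if attempt < retries:
            sleep(delay)  # no sleep after the final failed attempt
    return False
```

Passing sleep as a parameter lets a test replace it with a no-op, so the retry logic can be checked without waiting.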

4.4 Cron Jobs

  • Weekly automatic update (nginx only): Sunday 3:15 AM
  • Weekly certificate renewal: Sunday 3:30 AM
  • Daily backups: 2:00 AM
  • All configured via Ansible (setup-cron.yml)
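
For reference, the schedule above maps onto five-field cron expressions as sketched below. cron_expression and the job labels are hypothetical helpers for illustration; the actual entries live in setup-cron.yml.

```python
SUNDAY = 0  # cron day-of-week field: 0 = Sunday


def cron_expression(hour, minute, weekday=None):
    """Build 'minute hour day-of-month month day-of-week' for a daily or weekly job."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    day_of_week = "*" if weekday is None else str(weekday)
    return f"{minute} {hour} * * {day_of_week}"


# Schedule as described in this roadmap (keys are labels, not script paths).
SCHEDULE = {
    "daily backup": cron_expression(2, 0),
    "nginx auto-update": cron_expression(3, 15, SUNDAY),
    "certificate renewal": cron_expression(3, 30, SUNDAY),
}
```

With the same helper, cron_expression(3, 0, SUNDAY) yields 0 3 * * 0, the expression used for the weekly Diun check.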

4.5 Certificate Renewal

  • Automated weekly renewal check via cron
  • Uses certbot container: docker compose run --rm certbot renew
  • Restarts nginx to load new certificates
  • Process is idempotent (safe to run weekly)
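
The renewal job amounts to two commands run in order. A minimal sketch, assuming the Compose service for the proxy is named nginx; the renew command is the one quoted above, and the function names are hypothetical.

```python
import subprocess

# The renew command is quoted from this roadmap; the restart command assumes
# the Compose service is named "nginx".
RENEW_COMMAND = ["docker", "compose", "run", "--rm", "certbot", "renew"]
RESTART_NGINX_COMMAND = ["docker", "compose", "restart", "nginx"]


def run_checked(command):
    """Run a command and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(command, check=True)


def renew_certificates(run=run_checked):
    """Renew certificates, then restart nginx so it loads the certificates on disk."""
    run(RENEW_COMMAND)
    run(RESTART_NGINX_COMMAND)
```

Because run_checked raises on failure, nginx is only restarted when the renew step exits successfully.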

4.6 Testing & Validation

  • Integration tests created (test-integration.sh)
  • All scripts tested on live system
  • Cron jobs verified
  • Email notifications tested
  • Diun monitoring confirmed (4 containers)
  • Update workflow diagram created

4.7 CI/CD Implementation

  • Gitea Actions enabled on instance
  • Self-hosted runners deployed (2x act_runner v0.2.10)
  • Runner automation via Ansible (setup-runner.yml)
  • Systemd services for runner management
  • Host networking configuration for job containers
  • CI workflow created (.gitea/workflows/test.yml)
  • Automated testing on pull requests
  • Docker layer caching for performance
  • Artifact upload on test failure
  • Full CI/CD pipeline tested and operational

Goals:

  • Automated update system operational
  • Update process tested and validated on live system
  • Rollback procedure implemented and tested
  • Integration test suite serves as a quality gate in CI and local environments
  • CI/CD pipeline with self-hosted runners
  • Documentation complete (workflow diagram)

Implementation Summary:

  • 6 bash scripts following best practices (DRY, error handling, logging)
  • Diun monitoring with AWS SES email notifications
  • Per-container update policies (automatic: nginx, manual: gitea/postgres)
  • Pre-update backups with automatic rollback on failure
  • Certificate renewal automation
  • Comprehensive testing framework
  • CI/CD with Gitea Actions and 2 self-hosted runners
  • Visual workflow documentation (including CI/CD flow)

Phase 4 Complete! Update automation and CI/CD fully operational with safety mechanisms.


Phase 5: Backup Strategy Implementation

This phase implements comprehensive backup solutions.

5.1 Backup Concept Document

  • Document backup strategy (3-2-1 rule)
  • Define backup scope (database, repos, config, etc.)
  • Define retention policy
  • Define RTO and RPO targets

5.2 Backup Implementation

  • Automate database backups (pg_dump)
  • Automate Gitea data directory backups (tar.gz)
  • Automate configuration backups (docker-compose.yml, .env, scripts)
  • Set up backup storage (S3 with versioning)
  • Implement backup rotation and cleanup (S3 lifecycle policy)
  • Schedule automated backups (daily 2:00 AM cron)
  • Pre-update backups integrated into update workflow
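
The data-directory step can be sketched as a timestamped tar.gz archive. This is an illustration only: the timestamp format is inferred from values such as 20260611_164408 elsewhere in this roadmap, and the gitea-data prefix and function names are hypothetical.

```python
import tarfile
from datetime import datetime
from pathlib import Path

TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"  # inferred from timestamps such as 20260611_164408


def backup_name(prefix, when):
    """Return an archive name such as 'gitea-data_20260611_164408.tar.gz'."""
    return f"{prefix}_{when.strftime(TIMESTAMP_FORMAT)}.tar.gz"


def archive_directory(source_dir, dest_dir, prefix, when=None):
    """Pack source_dir into a timestamped .tar.gz inside dest_dir and return its path."""
    source = Path(source_dir)
    when = when or datetime.now()
    archive_path = Path(dest_dir) / backup_name(prefix, when)
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname stores paths relative to the directory name, not the absolute path.
        archive.add(source, arcname=source.name)
    return archive_path
```

The resulting file would then be uploaded to the versioned S3 bucket; the upload step is left out of the sketch.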

5.3 Recovery Testing

  • Document restore procedures (docs/backup-strategy.md + restore.sh script)
  • Test database restore on live system (timestamp: 20260611_164408)
  • Test full system restore (database + data + config)
  • Verify services operational post-restore (all containers healthy)
  • Document recovery targets (RTO: ~45 minutes, RPO: 24 hours)
  • Integration test suite includes full backup/restore cycle validation
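
Restoring "the latest backup" comes down to picking the newest timestamp among the stored object keys. A minimal sketch, assuming each key embeds a YYYYMMDD_HHMMSS timestamp like the one recorded above; the key names in the usage note are made up for the example.

```python
import re
from datetime import datetime

TIMESTAMP_PATTERN = re.compile(r"(\d{8}_\d{6})")


def backup_timestamp(key):
    """Extract the YYYYMMDD_HHMMSS timestamp from an object key, or None if absent."""
    match = TIMESTAMP_PATTERN.search(key)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
    except ValueError:
        return None  # digits were present but did not form a valid date


def latest_backup(keys):
    """Return the key with the newest timestamp, or None if no key carries one."""
    dated = [(backup_timestamp(key), key) for key in keys]
    dated = [(stamp, key) for stamp, key in dated if stamp is not None]
    return max(dated)[1] if dated else None
```

For example, given the hypothetical keys db/gitea-db_20260610_020000.sql.gz and db/gitea-db_20260611_164408.sql.gz, latest_backup returns the second one.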

Goals:

  • Automated backup system operational
  • Restore procedures tested and documented
  • Backup strategy document completed (docs/backup-strategy.md - 145 lines, concise)
  • Disaster recovery validated on production system

Phase 5 Complete! Backup and restore fully operational and validated.


Phase 6: Monitoring Concept 🔄

This phase documents a monitoring strategy for future implementation.

6.1 Monitoring Concept Document 🔄

  • 🔄 Define key metrics to monitor (CPU, RAM, disk, network, Gitea-specific)
  • 🔄 Define alerting thresholds and conditions
  • 🔄 Define alert channels (email, Slack, etc.)
  • 🔄 Technology selection (Prometheus + Grafana)
  • 🔄 Architecture design (exporters, retention, dashboards)
  • 🔄 Implementation plan and effort estimation

Goals:

  • 🔄 Monitoring concept document completed (docs/monitoring-concept.md)
  • 🔄 Clear roadmap for future monitoring implementation

Note: Full implementation is deferred; the concept document demonstrates architectural understanding and planning.


Phase 7: Logging Concept 🔄

This phase documents a centralized logging strategy for future implementation.

7.1 Logging Concept Document 🔄

  • 🔄 Define logging architecture (Loki + Promtail)
  • 🔄 Define log sources (Gitea, nginx, PostgreSQL, system)
  • 🔄 Define log retention policy
  • 🔄 Define log analysis requirements and use cases
  • 🔄 Integration with Grafana for visualization
  • 🔄 Implementation plan and resource requirements

Goals:

  • 🔄 Logging concept document completed (docs/logging-concept.md)
  • 🔄 Clear roadmap for future logging implementation

Note: Full implementation is deferred; the concept document demonstrates architectural understanding and planning.


Phase 8: High Availability Concept 🔄

This phase documents a high availability strategy for future implementation.

8.1 HA Concept Document 🔄

  • 🔄 Document SPOF (Single Points of Failure) analysis
  • 🔄 Design HA architecture (Multi-AZ, load balancing)
  • 🔄 Database redundancy strategy (RDS Multi-AZ or PostgreSQL replication)
  • 🔄 Application redundancy (multiple Gitea instances)
  • 🔄 Shared storage considerations (EFS or S3 for Gitea data)
  • 🔄 Load balancer configuration (ALB)
  • 🔄 Define failover strategy and automation
  • 🔄 Define RTO/RPO targets for HA scenario
  • 🔄 Cost analysis and trade-offs

Goals:

  • 🔄 HA concept document completed (docs/ha-concept.md)
  • 🔄 Clear architecture for scaling to high availability

Note: Full implementation is deferred; the concept document demonstrates architectural understanding and planning.


Phase 9: Documentation and Final Testing

This phase consolidates all documentation and performs end-to-end testing.

9.1 Documentation

  • Create comprehensive README.md
    • Project overview and objectives
    • Architecture summary
    • Prerequisites and setup instructions
    • Deployment procedures
    • Operational procedures
    • Troubleshooting guide
  • Document architecture with diagrams (4 diagrams in docs/diagrams/)
  • Document all decisions (ADR.md)
  • Document all procedures (deployment, updates, backup/restore)
  • Backup strategy documentation (docs/backup-strategy.md - 152 lines)
  • Future enhancements (monitoring, logging, HA concept docs created)

9.2 Final Testing

  • Perform end-to-end deployment test (make configure tested)
  • Test all automated processes (updates, backups, CI/CD)
  • Verify all automation is functional
  • System accessible via HTTPS with production SSL

9.3 Repository Organization

  • Well-organized directory structure
  • Clear separation of concerns (terraform, ansible, docker, scripts)
  • 🔄 Comprehensive README.md

Goals:

  • 🔄 Complete documentation package
  • All automation tested and validated
  • 🔄 Ready for interview presentation

Phase 10: Interview Preparation

This phase prepares for the interview discussion.

10.1 Preparation

  • Review all concept documents
  • Prepare to explain technology choices
  • Prepare architecture diagrams for presentation
  • Prepare to demonstrate the system
  • List lessons learned and trade-offs made
  • Prepare improvement suggestions

Goals:

  • Ready to discuss all aspects of the implementation
  • Demo environment functional and accessible
  • Confident in technology choices and concepts

Success Criteria

  • Gitea accessible via HTTPS through reverse proxy (production SSL)
  • Installation fully automated and reproducible (Terraform + Ansible)
  • Automated updates configured and tested (Diun + custom scripts)
  • CI/CD pipeline operational (Gitea Actions with self-hosted runners)
  • Automated backups implemented (daily to S3)
  • 🔄 Comprehensive concept documents for: Backup, Monitoring, Logging, HA
  • All code in version control with proper structure
  • System accessible to interviewer over internet (https://git.poll-streams.com)
  • 🔄 Complete README.md with deployment and operational procedures

Current Status: Production-ready system with comprehensive automation. Completing final documentation phase before interview.


Remaining Work (Phase 9 Completion)

Documentation Tasks

  1. README.md - Comprehensive project documentation

    • Overview and objectives
    • Architecture summary with diagram references
    • Prerequisites and deployment guide
    • Operational procedures (updates, backups, troubleshooting)
  2. docs/backup-strategy.md - Complete backup documentation

    • 3-2-1 backup strategy
    • RTO/RPO targets
    • Backup scope and retention policy
    • Restore procedures with step-by-step instructions
    • S3 lifecycle policy for rotation
    • Configuration backup automation
  3. docs/monitoring-concept.md - Future monitoring architecture

    • Prometheus + Grafana architecture
    • Key metrics and alerting thresholds
    • Implementation plan
  4. docs/logging-concept.md - Future logging architecture

    • Loki + Promtail architecture
    • Log sources and retention
    • Implementation plan
  5. docs/ha-concept.md - High availability design

    • SPOF analysis
    • Multi-AZ architecture with load balancing
    • Database replication strategy
    • Cost/benefit analysis

Estimated Completion: 2-3 hours