qvest-task/ROADMAP.md
gitea_admin 2e368a3a7c feat: implement disaster recovery with automated restore (#2)
- Create restore.sh for automated S3 backup recovery
  - Fetches backups, stops services, restores database/data/config, restarts & validates
- Successfully tested on production system
- Document procedures in backup-strategy.md
- Add Test 6: Full backup/restore cycle with disaster simulation
- Rename test-update.sh → test-integration.sh

Co-authored-by: aviyadeveloper <aviya.developer@gmail.com>
Reviewed-on: #2
2026-06-11 17:29:55 +00:00


# Roadmap
This is the implementation roadmap for the project. It lays out the key milestones and features as incremental phases, so that development and deployment can proceed in a structured order.
## Phase 1: Conceptualization and Planning
This phase consists of discussion and research only; no code is written. It includes the following steps:
### 1.1 Requirements Analysis
- Define the scope and requirements of the project
- Identify constraints and non-functional requirements
- Determine host environment (cloud provider, VPS, or local)
### 1.2 Technology Selection ✅
**Decisions documented in [ADR.md](ADR.md)**
- **Cloud**: AWS
- **Infrastructure as Code**: Terraform
- **Configuration Management**: Ansible (kept minimal)
- **Application Deployment**: Docker + Docker Compose
- **Database**: PostgreSQL (self-hosted in Docker)
- **Reverse Proxy**: Nginx
- **SSL**: Let's Encrypt with certbot
- **Update Automation**: Diun + Custom Scripts
- **Monitoring**: Prometheus + Grafana (later phase)
- **Logging**: Loki + Promtail (later phase)
- **Backup**: Custom scripts + S3 (later phase)
### 1.3 Architecture Design ✅
- ✅ Overall system architecture designed
- ✅ Network topology planned (VPC, subnets, security groups)
- ✅ Three architecture diagrams created in docs/diagrams/
### 1.4 Project Structure ✅
- Directory structure planned (will create incrementally per phase)
- Documentation structure in place (`docs/diagrams/`)
- Naming conventions: lowercase, hyphens for files, descriptive names
### Goals:
- ✅ A clear full Roadmap for the project available in this file
- ✅ Technology stack documented with rationale (see ADR.md)
- ✅ Architecture diagrams created (3 diagrams in docs/diagrams/)
- ✅ Project structure planned
**Phase 1 Complete!** Ready to begin Phase 2 (Infrastructure Setup).
---
## Phase 2: Infrastructure Setup
This phase provisions the AWS infrastructure using Terraform.
### 2.1 Terraform Backend Setup ✅
- Configure AWS CLI and credentials locally
- Set up Terraform backend (S3 bucket for state storage)
- Initialize Terraform working directory
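
The backend steps above could be scripted roughly as follows. This is a minimal sketch, not the project's actual procedure: the bucket name, state key, and region are placeholders, and it assumes the Terraform configuration declares an empty `backend "s3" {}` block so that the settings can be supplied at init time. Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch only: bucket, key, and region values are hypothetical placeholders.
set -eu

STATE_BUCKET="${STATE_BUCKET:-example-terraform-state}"
STATE_KEY="${STATE_KEY:-gitea/terraform.tfstate}"
AWS_REGION="${AWS_REGION:-eu-central-1}"
DRY_RUN="${DRY_RUN:-1}"   # default: print commands instead of running them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "+ $*"
  else
    "$@"
  fi
}

bootstrap_backend() {
  # 1. Confirm the local AWS CLI credentials resolve to an identity.
  run aws sts get-caller-identity
  # 2. Create the bucket that will hold the Terraform state.
  run aws s3 mb "s3://${STATE_BUCKET}" --region "$AWS_REGION"
  # 3. Initialize the working directory against that backend.
  run terraform init \
    -backend-config="bucket=${STATE_BUCKET}" \
    -backend-config="key=${STATE_KEY}" \
    -backend-config="region=${AWS_REGION}"
}
```

Calling `bootstrap_backend` with the defaults prints the three commands in order, which makes it easy to review them before running with `DRY_RUN=0`.
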
### 2.2 Core Infrastructure ✅
- ✅ Create VPC with single public subnet
- ✅ Set up Internet Gateway
- ✅ Configure Security Group for EC2 (ports 22, 80, 443)
- ✅ Provision EC2 instance (t3.medium, Ubuntu 24.04) with IAM role
- ✅ Create S3 bucket for backups (with versioning & encryption)
- ✅ Configure Route 53 DNS records (A record: git.poll-streams.com → EC2)
- ✅ Use official Terraform AWS modules (VPC, Security Group)
- ✅ Refactored into separate files: main.tf, vpc.tf, security.tf, compute.tf, storage.tf, iam.tf, dns.tf, outputs.tf
### 2.3 Security Configuration ✅
- ✅ Configure SSH key-based authentication (Ed25519, generated via Terraform)
- ✅ SSH access from anywhere (0.0.0.0/0) - security via key-based auth
- ✅ Apply IAM policies (AmazonS3FullAccess for EC2 backups)
- ✅ Security group follows least-privilege access (only 22, 80, 443 inbound; all outbound)
- ✅ Encrypted EBS root volume (30GB gp3)
### Goals: ✅
- ✅ AWS infrastructure fully defined in Terraform code
- ✅ EC2 instance provisioned and accessible via SSH
- ✅ S3 backup bucket created
- ✅ Domain DNS configured and resolving
- ✅ Infrastructure can be destroyed with `terraform destroy` and recreated with `terraform apply`
**Phase 2 Complete!** Ready to begin Phase 3 (Automated Gitea Deployment).
---
## Phase 3: Automated Gitea Deployment
This phase implements the automated, reproducible Gitea installation.
### 3.1 Database Setup ✅
- ✅ PostgreSQL 18.4 deployed via Docker Compose
- ✅ Database credentials stored in AWS Secrets Manager
- ✅ Random password generation via Terraform
- ✅ Volume mounted at /var/lib/postgresql (PostgreSQL 18+ requirement)
- ✅ Health checks configured with pg_isready
### 3.2 Gitea Installation ✅
- ✅ Gitea 1.22.6 deployed via Docker Compose
- ✅ Ansible playbooks created: setup-system.yml, deploy-gitea.yml, setup-ssl.yml, site.yml
- ✅ Docker + AWS CLI installation automated
- ✅ Gitea configured with environment variables (database, domain, ROOT_URL)
- ✅ SSH git access on port 2222
- ✅ Volumes for persistent data
### 3.3 Reverse Proxy Configuration ✅
- ✅ Nginx 1.27-alpine deployed via Docker Compose
- ✅ Let's Encrypt SSL certificate obtained via certbot (production)
- ✅ Domain: git.poll-streams.com (migrated to avoid rate limits)
- ✅ Two-stage nginx config (HTTP-only for ACME, then HTTPS)
- ✅ SSL termination at nginx, proxy to Gitea on port 3000
- ✅ HTTP to HTTPS redirect configured
- ✅ Security headers (HSTS, X-Frame-Options, etc.)
- ✅ WebSocket support for real-time features
- ✅ 512MB upload limit
### 3.4 Testing ✅
- ✅ HTTPS access verified: https://git.poll-streams.com
- ✅ Valid SSL certificate (Let's Encrypt production)
- ✅ HTTP → HTTPS redirect working
- ✅ Gitea web interface accessible and functional
- ✅ User account created, repository created
- ✅ Git push via HTTPS tested successfully
- ✅ Full deployment reproducible via `ansible-playbook site.yml`
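
The HTTP → HTTPS redirect check above can be repeated from any machine with `curl`. The sketch below is illustrative rather than part of the project's test suite; the function names are made up for this example. It separates the decision logic from the network call so the logic can be exercised offline.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not taken from the repository.
set -eu

# Succeed only for a redirect status whose target uses https.
is_https_redirect() {
  local status="$1" target="$2"
  case "$status" in
    301|302|307|308) ;;
    *) return 1 ;;
  esac
  case "$target" in
    https://*) return 0 ;;
    *) return 1 ;;
  esac
}

# Ask curl for the status code and redirect target without following it.
check_redirect() {
  local host="$1" result
  result="$(curl -s -o /dev/null -w '%{http_code} %{redirect_url}' "http://${host}/")"
  is_https_redirect "${result%% *}" "${result#* }"
}
```

For example, `check_redirect git.poll-streams.com && echo "redirect ok"` reports success only when the plain-HTTP request is answered with a redirect to an `https://` URL.
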
### Goals: ✅
- ✅ Gitea running and accessible via HTTPS through reverse proxy
- ✅ Installation fully automated and reproducible
- ✅ Production-grade deployment with SSL
**Phase 3 Complete!** Gitea is fully deployed, secured with SSL, and accessible from the internet.
---
## Phase 4: Update Automation ✅
This phase implements automated update mechanisms for Gitea and related components.
### 4.1 Update Strategy Design ✅
- ✅ Weekly update checks (Sunday 3:00 AM)
- ✅ Per-container update policies (automatic vs manual)
- ✅ Pre-update backup to S3
- ✅ Post-update health checks
- ✅ Automatic rollback on failure
- ✅ Email notifications via AWS SES
### 4.2 Update Monitoring ✅
- ✅ Diun 4.33 deployed for Docker image update detection
- ✅ Scheduled weekly checks (cron: `0 3 * * 0`)
- ✅ Monitors: postgres, gitea, nginx, diun
- ✅ Email notifications configured via AWS SES SMTP
- ✅ IAM user created for SMTP credentials
- ✅ Labels define update policies per container
### 4.3 Automated Scripts ✅
- **backup.sh**: Database + Gitea data backup to S3 bucket
- **health-check.sh**: Validates all services running and responsive
- **auto-update.sh**: Automatic updates for low-risk containers (nginx)
  - Backup before update
  - Pull new image
  - Recreate container
  - Health check validation
  - Automatic rollback on failure
  - Email notifications
- **manual-update.sh**: Manual updates for critical containers (gitea/postgres)
  - Operator confirmation required
  - Same safety flow as auto-update
  - Success/failure notifications
- **test-integration.sh**: Comprehensive integration test suite for CI/CD
  - Script syntax validation (bash -n)
  - Docker Compose configuration validation
  - Backup archive creation and validation
  - Health check failure detection
  - Update workflow with rollback simulation
  - Full backup and restore cycle testing (22 assertions total)
  - Isolated test environment (/tmp)
  - No dependencies on live services
- **restore.sh**: Disaster recovery from S3 backups
  - Downloads latest backups from S3
  - Restores database, Gitea data, and configuration
  - Service stop/start orchestration
  - Tested successfully on live system (timestamp 20260611_164408)
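
The safety flow shared by the update scripts (backup, pull, recreate, health check, rollback, notify) can be sketched as below. This is an assumption-laden illustration, not the contents of `auto-update.sh`: the `step_*` functions and `notify` are hypothetical stand-ins, and the backup and rollback steps are reduced to placeholders.

```shell
#!/usr/bin/env bash
# Sketch only: step functions are hypothetical stand-ins for the real scripts.
set -eu

SERVICE="${SERVICE:-nginx}"

# Each step is a small function so it can be replaced or stubbed in tests.
step_backup()       { echo "backup before update"; }
step_pull()         { docker compose pull "$SERVICE"; }
step_recreate()     { docker compose up -d --no-deps "$SERVICE"; }
step_health_check() { docker compose ps --status running --services | grep -qx "$SERVICE"; }
step_rollback()     { echo "rollback $SERVICE to previous image"; }
notify()            { echo "NOTIFY: $*"; }

update_service() {
  # Never touch the container if the pre-update backup did not succeed.
  step_backup || { notify "backup failed, update aborted"; return 1; }

  # Any failing step after the backup triggers the rollback path.
  if step_pull && step_recreate && step_health_check; then
    notify "update of $SERVICE succeeded"
    return 0
  fi

  step_rollback
  notify "update of $SERVICE failed, rolled back"
  return 1
}
```

Keeping every step behind its own function is what lets an integration test simulate a failed health check without touching live services.
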
**Script Quality:**
- All scripts follow DRY principles with extracted helper functions
- Consistent error handling and logging patterns
- Timeouts are configurable, and magic numbers are replaced with named constants
- Comprehensive comments and documentation headers
### 4.4 Cron Jobs ✅
- ✅ Weekly automatic update (nginx only): Sunday 3:15 AM
- ✅ Weekly certificate renewal: Sunday 3:30 AM
- ✅ Daily backups: 2:00 AM
- ✅ All configured via Ansible (setup-cron.yml)
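
For reference, the schedules above correspond to crontab entries like the ones printed by this sketch. The project manages cron through Ansible, so this is only an illustration of the resulting entries: `SCRIPT_DIR` and the `renew-certs.sh` name are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch only: SCRIPT_DIR and renew-certs.sh are hypothetical names;
# the schedules are the ones listed in this section.
set -eu

SCRIPT_DIR="${SCRIPT_DIR:-/opt/gitea/scripts}"

# Cron fields: minute hour day-of-month month day-of-week (0 = Sunday)
render_crontab() {
  cat <<EOF
0 2 * * * ${SCRIPT_DIR}/backup.sh
15 3 * * 0 ${SCRIPT_DIR}/auto-update.sh
30 3 * * 0 ${SCRIPT_DIR}/renew-certs.sh
EOF
}
```

The daily backup at 2:00 AM runs before the Sunday update window, and the update at 3:15 AM follows the 3:00 AM Diun check described in 4.2.
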
### 4.5 Certificate Renewal ✅
- ✅ Automated weekly renewal check via cron
- ✅ Uses certbot container: `docker compose run --rm certbot renew`
- ✅ Restarts nginx to load new certificates
- ✅ Process is idempotent (safe to run weekly)
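
A minimal sketch of the renewal job, assuming it runs from the directory that holds `docker-compose.yml`. The `run` helper and the `DRY_RUN` switch are additions for this example, not part of the project's scripts.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the working directory contains docker-compose.yml.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # default: print commands instead of running them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "+ $*"
  else
    "$@"
  fi
}

renew_certificates() {
  # certbot renews only certificates that are close to expiry, so on most
  # weekly runs nothing changes. That is what makes the job idempotent.
  run docker compose run --rm certbot renew
  # Restart nginx so it loads any renewed certificate files.
  run docker compose restart nginx
}
```
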
### 4.6 Testing & Validation ✅
- ✅ Integration tests created (test-integration.sh)
- ✅ All scripts tested on live system
- ✅ Cron jobs verified
- ✅ Email notifications tested
- ✅ Diun monitoring confirmed (4 containers)
- ✅ Update workflow diagram created
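
The syntax-validation step of the integration tests relies on `bash -n`, which parses a script without executing it. A small sketch of such a check follows; `check_syntax` is a hypothetical helper, not a function taken from `test-integration.sh`.

```shell
#!/usr/bin/env bash
# Sketch only: check_syntax is a hypothetical helper name.
set -eu

# Return 0 only if every given script parses; report each result.
check_syntax() {
  local failed=0 script
  for script in "$@"; do
    # bash -n reads the script and reports parse errors without running it.
    if bash -n "$script" 2>/dev/null; then
      echo "ok   $script"
    else
      echo "FAIL $script"
      failed=1
    fi
  done
  return "$failed"
}
```

A call such as `check_syntax scripts/*.sh` fits well as the first test in a suite, because it needs no containers or network access.
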
### 4.7 CI/CD Implementation ✅
- ✅ Gitea Actions enabled on instance
- ✅ Self-hosted runners deployed (2x act_runner v0.2.10)
- ✅ Runner automation via Ansible (setup-runner.yml)
- ✅ Systemd services for runner management
- ✅ Host networking configuration for job containers
- ✅ CI workflow created (.gitea/workflows/test.yml)
- ✅ Automated testing on pull requests
- ✅ Docker layer caching for performance
- ✅ Artifact upload on test failure
- ✅ Full CI/CD pipeline tested and operational
### Goals:
- ✅ Automated update system operational
- ✅ Update process tested and validated on live system
- ✅ Rollback procedure implemented and tested
- ✅ Quality gate for CI/local environments
- ✅ CI/CD pipeline with self-hosted runners
- ✅ Documentation complete (workflow diagram)
**Implementation Summary:**
- 6 bash scripts following best practices (DRY, error handling, logging)
- Diun monitoring with AWS SES email notifications
- Per-container update policies (automatic: nginx, manual: gitea/postgres)
- Pre-update backups with automatic rollback on failure
- Certificate renewal automation
- Comprehensive testing framework
- CI/CD with Gitea Actions and 2 self-hosted runners
- Visual workflow documentation (including CI/CD flow)
**Phase 4 Complete!** Update automation and CI/CD fully operational with safety mechanisms.
---
## Phase 5: Backup Strategy Implementation ✅
This phase implements comprehensive backup solutions.
### 5.1 Backup Concept Document ✅
- ✅ Document backup strategy (3-2-1 rule)
- ✅ Define backup scope (database, repos, config, etc.)
- ✅ Define retention policy
- ✅ Define RTO and RPO targets
### 5.2 Backup Implementation ✅
- ✅ Automate database backups (pg_dump)
- ✅ Automate Gitea data directory backups (tar.gz)
- ✅ Automate configuration backups (docker-compose.yml, .env, scripts)
- ✅ Set up backup storage (S3 with versioning)
- ✅ Implement backup rotation and cleanup (S3 lifecycle policy)
- ✅ Schedule automated backups (daily 2:00 AM cron)
- ✅ Pre-update backups integrated into update workflow
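
The three backup steps above (database dump, data archive, upload) might look like the following. This is a sketch under stated assumptions, not the project's `backup.sh`: the bucket name, database user and name, and all paths are placeholders, and only the `postgres` service name is taken from this document.

```shell
#!/usr/bin/env bash
# Sketch only: bucket, database name/user, and paths are hypothetical.
set -eu

BACKUP_BUCKET="${BACKUP_BUCKET:-example-gitea-backups}"
DB_USER="${DB_USER:-gitea}"
DB_NAME="${DB_NAME:-gitea}"

# Timestamp in YYYYMMDD_HHMMSS form, e.g. 20260611_164408.
backup_timestamp() {
  date +%Y%m%d_%H%M%S
}

# Dump the database from the postgres service, then compress the dump.
# Dump and compression are separate commands so a pg_dump failure is not masked.
dump_database() {
  local out="$1"
  docker compose exec -T postgres pg_dump -U "$DB_USER" -d "$DB_NAME" > "$out"
  gzip -f "$out"
}

# Pack a directory into a tar.gz, storing paths relative to its parent.
archive_dir() {
  local src="$1" dest="$2"
  tar -czf "$dest" -C "$(dirname "$src")" "$(basename "$src")"
}

# Copy one backup file into the bucket under its own file name.
upload() {
  aws s3 cp "$1" "s3://${BACKUP_BUCKET}/$(basename "$1")"
}

# Example usage (not executed here):
#   ts="$(backup_timestamp)"
#   dump_database "/tmp/db_${ts}.sql"
#   archive_dir /path/to/gitea-data "/tmp/data_${ts}.tar.gz"
#   upload "/tmp/db_${ts}.sql.gz"
```

Fixed-width timestamps sort chronologically as plain text, which keeps it simple to find the newest backup later.
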
### 5.3 Recovery Testing ✅
- ✅ Document restore procedures (docs/backup-strategy.md + restore.sh script)
- ✅ Test database restore on live system (timestamp: 20260611_164408)
- ✅ Test full system restore (database + data + config)
- ✅ Verify services operational post-restore (all containers healthy)
- ✅ Document recovery time (RTO: ~45 minutes, RPO: 24 hours)
- ✅ Integration test suite includes full backup/restore cycle validation
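
The restore order (fetch the latest backup, stop services, restore, restart) can be sketched as below. It is an illustration only: the bucket, work directory, and file naming are assumptions, the actual restore of database, data, and configuration is left as a placeholder, and which services `restore.sh` really stops is not stated in this document.

```shell
#!/usr/bin/env bash
# Sketch only: bucket, work directory, and file naming are hypothetical.
set -eu

BACKUP_BUCKET="${BACKUP_BUCKET:-example-gitea-backups}"
WORK_DIR="${WORK_DIR:-/tmp/restore}"
DRY_RUN="${DRY_RUN:-1}"   # default: print mutating commands instead of running them

run() {
  if [ "$DRY_RUN" = "1" ]; then printf '%s\n' "+ $*"; else "$@"; fi
}

# List backup objects; "aws s3 ls" prints one object per line (read-only).
list_backups() {
  aws s3 ls "s3://${BACKUP_BUCKET}/"
}

# Read names on stdin and print the newest YYYYMMDD_HHMMSS timestamp.
# For this fixed-width format, text order equals chronological order.
latest_timestamp() {
  grep -Eo '[0-9]{8}_[0-9]{6}' | sort | tail -n 1
}

restore_latest() {
  local ts
  ts="$(list_backups | latest_timestamp)"
  if [ -z "$ts" ]; then
    echo "no backups found" >&2
    return 1
  fi
  # Fetch only the objects that belong to the chosen backup run.
  run aws s3 cp "s3://${BACKUP_BUCKET}/" "$WORK_DIR" --recursive --exclude "*" --include "*${ts}*"
  # Stop the application so nothing writes while data is replaced.
  run docker compose stop gitea
  echo "restore database, Gitea data, and configuration for ${ts} here"
  # Bring everything back up; health checks would follow.
  run docker compose up -d
}
```
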
### Goals:
- ✅ Automated backup system operational
- ✅ Restore procedures tested and documented
- ✅ Backup strategy document completed (docs/backup-strategy.md - 145 lines, concise)
- ✅ Disaster recovery validated on production system
**Phase 5 Complete!** Backup and restore fully operational and validated.
---
## Phase 6: Monitoring Concept 🔄
This phase documents a monitoring strategy for future implementation.
### 6.1 Monitoring Concept Document 🔄
- 🔄 Define key metrics to monitor (CPU, RAM, disk, network, Gitea-specific)
- 🔄 Define alerting thresholds and conditions
- 🔄 Define alert channels (email, Slack, etc.)
- 🔄 Technology selection (Prometheus + Grafana)
- 🔄 Architecture design (exporters, retention, dashboards)
- 🔄 Implementation plan and effort estimation
### Goals:
- 🔄 Monitoring concept document completed (docs/monitoring-concept.md)
- 🔄 Clear roadmap for future monitoring implementation
**Note**: Full implementation deferred - concept document shows architectural understanding and planning.
---
## Phase 7: Logging Concept 🔄
This phase documents a centralized logging strategy for future implementation.
### 7.1 Logging Concept Document 🔄
- 🔄 Define logging architecture (Loki + Promtail)
- 🔄 Define log sources (Gitea, nginx, PostgreSQL, system)
- 🔄 Define log retention policy
- 🔄 Define log analysis requirements and use cases
- 🔄 Integration with Grafana for visualization
- 🔄 Implementation plan and resource requirements
### Goals:
- 🔄 Logging concept document completed (docs/logging-concept.md)
- 🔄 Clear roadmap for future logging implementation
**Note**: Full implementation deferred - concept document shows architectural understanding and planning.
---
## Phase 8: High Availability Concept 🔄
This phase documents a high availability strategy for future implementation.
### 8.1 HA Concept Document 🔄
- 🔄 Document SPOF (Single Points of Failure) analysis
- 🔄 Design HA architecture (Multi-AZ, load balancing)
- 🔄 Database redundancy strategy (RDS Multi-AZ or PostgreSQL replication)
- 🔄 Application redundancy (multiple Gitea instances)
- 🔄 Shared storage considerations (EFS or S3 for Gitea data)
- 🔄 Load balancer configuration (ALB)
- 🔄 Define failover strategy and automation
- 🔄 Define RTO/RPO targets for HA scenario
- 🔄 Cost analysis and trade-offs
### Goals:
- 🔄 HA concept document completed (docs/ha-concept.md)
- 🔄 Clear architecture for scaling to high availability
**Note**: Full implementation deferred - concept document shows architectural understanding and planning.
---
## Phase 9: Documentation and Final Testing ✅
This phase consolidates all documentation and performs end-to-end testing.
### 9.1 Documentation ✅
- ✅ Create comprehensive README.md
  - Project overview and objectives
  - Architecture summary
  - Prerequisites and setup instructions
  - Deployment procedures
  - Operational procedures
  - Troubleshooting guide
- ✅ Document architecture with diagrams (4 diagrams in docs/diagrams/)
- ✅ Document all decisions (ADR.md)
- ✅ Document all procedures (deployment, updates, backup/restore)
- ✅ Backup strategy documentation (docs/backup-strategy.md - 152 lines)
- ✅ Future enhancements (monitoring, logging, HA concept docs created)
### 9.2 Final Testing ✅
- ✅ Perform end-to-end deployment test (make configure tested)
- ✅ Test all automated processes (updates, backups, CI/CD)
- ✅ Verify all automation is functional
- ✅ System accessible via HTTPS with production SSL
### 9.3 Repository Organization ✅
- ✅ Well-organized directory structure
- ✅ Clear separation of concerns (terraform, ansible, docker, scripts)
- 🔄 Comprehensive README.md
### Goals:
- 🔄 Complete documentation package
- ✅ All automation tested and validated
- 🔄 Ready for interview presentation
---
## Phase 10: Interview Preparation
This phase prepares for the interview discussion.
### 10.1 Preparation
- Review all concept documents
- Prepare to explain technology choices
- Prepare architecture diagrams for presentation
- Prepare to demonstrate the system
- List lessons learned and trade-offs made
- Prepare improvement suggestions
### Goals:
- Ready to discuss all aspects of the implementation
- Demo environment functional and accessible
- Confident in technology choices and concepts
---
## Success Criteria
- ✅ Gitea accessible via HTTPS through reverse proxy (production SSL)
- ✅ Installation fully automated and reproducible (Terraform + Ansible)
- ✅ Automated updates configured and tested (Diun + custom scripts)
- ✅ CI/CD pipeline operational (Gitea Actions with self-hosted runners)
- ✅ Automated backups implemented (daily to S3)
- 🔄 Comprehensive concept documents for: Backup, Monitoring, Logging, HA
- ✅ All code in version control with proper structure
- ✅ System accessible to interviewer over internet (https://git.poll-streams.com)
- 🔄 Complete README.md with deployment and operational procedures
**Current Status**: Production-ready system with comprehensive automation. Completing final documentation phase before interview.
---
## Remaining Work (Phase 9 Completion)
### Documentation Tasks
1. **README.md** - Comprehensive project documentation
   - Overview and objectives
   - Architecture summary with diagram references
   - Prerequisites and deployment guide
   - Operational procedures (updates, backups, troubleshooting)
2. **docs/backup-strategy.md** - Complete backup documentation
   - 3-2-1 backup strategy
   - RTO/RPO targets
   - Backup scope and retention policy
   - Restore procedures with step-by-step instructions
   - S3 lifecycle policy for rotation
   - Configuration backup automation
3. **docs/monitoring-concept.md** - Future monitoring architecture
   - Prometheus + Grafana architecture
   - Key metrics and alerting thresholds
   - Implementation plan
4. **docs/logging-concept.md** - Future logging architecture
   - Loki + Promtail architecture
   - Log sources and retention
   - Implementation plan
5. **docs/ha-concept.md** - High availability design
   - SPOF analysis
   - Multi-AZ architecture with load balancing
   - Database replication strategy
   - Cost/benefit analysis
**Estimated Completion**: 2-3 hours