# Roadmap

This is the implementation roadmap for the project. It outlines the key milestones and features in incremental steps, allowing for a structured approach to development and deployment.

## Phase 1: Conceptualization and Planning

This phase will be achieved through discussion and research and will include the following steps (no code should be implemented in this phase):

### 1.1 Requirements Analysis
- Define the scope and requirements of the project
- Identify constraints and non-functional requirements
- Determine host environment (cloud provider, VPS, or local)

### 1.2 Technology Selection ✅
**Decisions documented in [ADR.md](ADR.md)**

- **Cloud**: AWS
- **Infrastructure as Code**: Terraform
- **Configuration Management**: Ansible (kept minimal)
- **Application Deployment**: Docker + Docker Compose
- **Database**: PostgreSQL (self-hosted in Docker)
- **Reverse Proxy**: Nginx
- **SSL**: Let's Encrypt with certbot
- **Update Automation**: Diun + Custom Scripts
- **Monitoring**: Prometheus + Grafana (later phase)
- **Logging**: Loki + Promtail (later phase)
- **Backup**: Custom scripts + S3 (later phase)

### 1.3 Architecture Design ✅
- ✅ Overall system architecture designed
- ✅ Network topology planned (VPC, subnets, security groups)
- ✅ Three architecture diagrams created in docs/diagrams/

### 1.4 Project Structure ✅
- Directory structure planned (will create incrementally per phase)
- Documentation structure in place (`docs/diagrams/`)
- Naming conventions: lowercase, hyphens for files, descriptive names

### Goals:
- ✅ A clear full Roadmap for the project available in this file
- ✅ Technology stack documented with rationale (see ADR.md)
- ✅ Architecture diagrams created (3 diagrams in docs/diagrams/)
- ✅ Project structure planned

**Phase 1 Complete!** Ready to begin Phase 2 (Infrastructure Setup).

---

## Phase 2: Infrastructure Setup

This phase provisions the AWS infrastructure using Terraform.

### 2.1 Terraform Backend Setup ✅
- Configure AWS CLI and credentials locally
- Set up Terraform backend (S3 bucket for state storage)
- Initialize Terraform working directory

### 2.2 Core Infrastructure ✅
- ✅ Create VPC with single public subnet
- ✅ Set up Internet Gateway
- ✅ Configure Security Group for EC2 (ports 22, 80, 443)
- ✅ Provision EC2 instance (t3.medium, Ubuntu 24.04) with IAM role
- ✅ Create S3 bucket for backups (with versioning & encryption)
- ✅ Configure Route 53 DNS records (A record: git.poll-streams.com → EC2)
- ✅ Use official Terraform AWS modules (VPC, Security Group)
- ✅ Refactored into separate files: main.tf, vpc.tf, security.tf, compute.tf, storage.tf, iam.tf, dns.tf, outputs.tf

### 2.3 Security Configuration ✅
- ✅ Configure SSH key-based authentication (Ed25519, generated via Terraform)
- ✅ SSH access from anywhere (0.0.0.0/0) - security via key-based auth
- ✅ Apply IAM policies (AmazonS3FullAccess for EC2 backups)
- ✅ Security group follows least-privilege access (only 22, 80, 443 inbound; all outbound)
- ✅ Encrypted EBS root volume (30GB gp3)

### Goals: ✅
- ✅ AWS infrastructure fully defined in Terraform code
- ✅ EC2 instance provisioned and accessible via SSH
- ✅ S3 backup bucket created
- ✅ Domain DNS configured and resolving
- ✅ Infrastructure can be destroyed (`terraform destroy`) and recreated with `terraform apply`

**Phase 2 Complete!** Ready to begin Phase 3 (Automated Gitea Deployment).

---

## Phase 3: Automated Gitea Deployment

This phase implements the automated, reproducible Gitea installation.

### 3.1 Database Setup ✅
- ✅ PostgreSQL 18.4 deployed via Docker Compose
- ✅ Database credentials stored in AWS Secrets Manager
- ✅ Random password generation via Terraform
- ✅ Volume mounted at /var/lib/postgresql (PostgreSQL 18+ requirement)
- ✅ Health checks configured with pg_isready

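
The `pg_isready` health check above can also be polled from a script, for example before a deployment step that needs the database. The following is a minimal sketch, not the project's actual code: `wait_until`, `MAX_ATTEMPTS` and `SLEEP_SECONDS` are illustrative names, and the service, user and database names in the final comment are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: poll a readiness command until it succeeds or attempts run out.
# wait_until, MAX_ATTEMPTS and SLEEP_SECONDS are hypothetical names.

MAX_ATTEMPTS="${MAX_ATTEMPTS:-10}"
SLEEP_SECONDS="${SLEEP_SECONDS:-3}"

# Usage: wait_until <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_until() {
  local attempt=1
  while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after ${attempt} attempt(s)"
      return 0
    fi
    sleep "$SLEEP_SECONDS"
    attempt=$((attempt + 1))
  done
  echo "not ready after ${MAX_ATTEMPTS} attempts" >&2
  return 1
}

# Example against a Compose service (names are assumptions):
# wait_until docker compose exec -T postgres pg_isready -U gitea -d gitea
```

Keeping the attempt count and delay in variables makes the timeout adjustable without editing the loop.
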
### 3.2 Gitea Installation ✅
- ✅ Gitea 1.22.6 deployed via Docker Compose
- ✅ Ansible playbooks created: setup-system.yml, deploy-gitea.yml, setup-ssl.yml, site.yml
- ✅ Docker + AWS CLI installation automated
- ✅ Gitea configured with environment variables (database, domain, ROOT_URL)
- ✅ SSH git access on port 2222
- ✅ Volumes for persistent data

### 3.3 Reverse Proxy Configuration ✅
- ✅ Nginx 1.27-alpine deployed via Docker Compose
- ✅ Let's Encrypt SSL certificate obtained via certbot (production)
- ✅ Domain: git.poll-streams.com (migrated to avoid rate limits)
- ✅ Two-stage nginx config (HTTP-only for ACME, then HTTPS)
- ✅ SSL termination at nginx, proxy to Gitea on port 3000
- ✅ HTTP to HTTPS redirect configured
- ✅ Security headers (HSTS, X-Frame-Options, etc.)
- ✅ WebSocket support for real-time features
- ✅ 512MB upload limit

### 3.4 Testing ✅
- ✅ HTTPS access verified: https://git.poll-streams.com
- ✅ Valid SSL certificate (Let's Encrypt production)
- ✅ HTTP → HTTPS redirect working
- ✅ Gitea web interface accessible and functional
- ✅ User account created, repository created
- ✅ Git push via HTTPS tested successfully
- ✅ Full deployment reproducible via `ansible-playbook site.yml`

### Goals: ✅
- ✅ Gitea running and accessible via HTTPS through reverse proxy
- ✅ Installation fully automated and reproducible
- ✅ Production-grade deployment with SSL

**Phase 3 Complete!** Gitea is fully deployed, secured with SSL, and accessible from the internet.

---

## Phase 4: Update Automation ✅

This phase implements automated update mechanisms for Gitea and related components.

### 4.1 Update Strategy Design ✅
- ✅ Weekly update checks (Sunday 3:00 AM)
- ✅ Per-container update policies (automatic vs manual)
- ✅ Pre-update backup to S3
- ✅ Post-update health checks
- ✅ Automatic rollback on failure
- ✅ Email notifications via AWS SES

### 4.2 Update Monitoring ✅
- ✅ Diun 4.33 deployed for Docker image update detection
- ✅ Scheduled weekly checks (cron: `0 3 * * 0`)
- ✅ Monitors: postgres, gitea, nginx, diun
- ✅ Email notifications configured via AWS SES SMTP
- ✅ IAM user created for SMTP credentials
- ✅ Labels define update policies per container

### 4.3 Automated Scripts ✅
- ✅ **backup.sh**: Database + Gitea data backup to S3 bucket
- ✅ **health-check.sh**: Validates all services running and responsive
- ✅ **auto-update.sh**: Automatic updates for low-risk containers (nginx)
  - Backup before update
  - Pull new image
  - Recreate container
  - Health check validation
  - Automatic rollback on failure
  - Email notifications
- ✅ **manual-update.sh**: Manual updates for critical containers (gitea/postgres)
  - Operator confirmation required
  - Same safety flow as auto-update
  - Success/failure notifications
- ✅ **test-integration.sh**: Comprehensive integration test suite for CI/CD
  - Script syntax validation (bash -n)
  - Docker Compose configuration validation
  - Backup archive creation and validation
  - Health check failure detection
  - Update workflow with rollback simulation
  - Full backup and restore cycle testing (22 assertions total)
  - Isolated test environment (/tmp)
  - No dependencies on live services
- ✅ **restore.sh**: Disaster recovery from S3 backups
  - Downloads latest backups from S3
  - Restores database, Gitea data, and configuration
  - Service stop/start orchestration
  - Tested successfully on live system (timestamp 20260611_164408)

**Script Quality:**
- All scripts follow DRY principles with extracted helper functions
- Consistent error handling and logging patterns
- Timeouts are configurable; magic numbers are replaced with named constants
- Comprehensive comments and documentation headers

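
The safety flow listed for auto-update.sh (backup, pull, recreate, health check, rollback, notify) can be sketched as below. This is an illustration under stated assumptions rather than the actual script: every step function is a hypothetical stub that only prints what a real implementation would do, and `HEALTH_STATUS` exists solely to simulate a failing health check.

```bash
#!/usr/bin/env bash
# Sketch of the update flow: backup, pull, recreate, health check, rollback, notify.
# All step functions are hypothetical stubs; replace them with real commands.

backup()       { echo "backup to S3"; }
pull_image()   { echo "pull new image for $1"; }
recreate()     { echo "recreate container $1"; }
health_check() { return "${HEALTH_STATUS:-0}"; }  # 0 = healthy, non-zero = unhealthy
rollback()     { echo "rollback $1 to previous image"; }
notify()       { echo "notify: $1"; }

update_container() {
  local name="$1"

  # Never touch the container without a fresh backup.
  if ! backup; then
    notify "backup failed; $name left untouched"
    return 1
  fi

  # Any failing step short-circuits the chain and triggers the rollback below.
  if pull_image "$name" && recreate "$name" && health_check "$name"; then
    notify "$name updated"
    return 0
  fi

  rollback "$name"
  notify "$name update failed; rolled back"
  return 1
}

update_container nginx
```

Running the block with `HEALTH_STATUS=1` exercises the rollback path without touching any container.
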
### 4.4 Cron Jobs ✅
- ✅ Weekly automatic update (nginx only): Sunday 3:15 AM
- ✅ Weekly certificate renewal: Sunday 3:30 AM
- ✅ Daily backups: 2:00 AM
- ✅ All configured via Ansible (setup-cron.yml)

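
Expressed as crontab entries, the schedule above would look roughly like the output of the sketch below. The project installs its jobs through Ansible, so this only illustrates the cron expressions; `SCRIPTS_DIR` and the `renew-certs.sh` file name are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: print crontab lines matching the schedule listed above.
# SCRIPTS_DIR and renew-certs.sh are hypothetical names for illustration.

SCRIPTS_DIR="${SCRIPTS_DIR:-/opt/gitea/scripts}"

print_crontab() {
  cat <<EOF
# minute hour day-of-month month day-of-week command
0 2 * * * ${SCRIPTS_DIR}/backup.sh
15 3 * * 0 ${SCRIPTS_DIR}/auto-update.sh
30 3 * * 0 ${SCRIPTS_DIR}/renew-certs.sh
EOF
}

print_crontab
```

Day-of-week `0` is Sunday, so the two weekly jobs run 15 and 30 minutes after the Diun check at `0 3 * * 0`.
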
### 4.5 Certificate Renewal ✅
- ✅ Automated weekly renewal check via cron
- ✅ Uses certbot container: `docker compose run --rm certbot renew`
- ✅ Restarts nginx to load new certificates
- ✅ Process is idempotent (safe to run weekly)

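
A minimal sketch of the renewal job described above follows. The certbot command is the one quoted in the list; the `run` helper, the `DRY_RUN` switch and the `nginx` service name are assumptions added so the block can be executed without Docker.

```bash
#!/usr/bin/env bash
# Sketch of the weekly renewal job. DRY_RUN and run() are illustrative helpers;
# the nginx service name is assumed to match the Compose file.

DRY_RUN="${DRY_RUN:-1}"  # 1 = only print commands, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

renew_certificates() {
  # certbot only renews certificates that are close to expiry,
  # which is why running this every week is safe.
  run docker compose run --rm certbot renew || return 1
  # Restart nginx so it loads any renewed certificate.
  run docker compose restart nginx
}

renew_certificates
```

With the default `DRY_RUN=1` the block prints the two commands instead of running them.
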
### 4.6 Testing & Validation ✅
- ✅ Integration tests created (test-integration.sh)
- ✅ All scripts tested on live system
- ✅ Cron jobs verified
- ✅ Email notifications tested
- ✅ Diun monitoring confirmed (4 containers)
- ✅ Update workflow diagram created

### 4.7 CI/CD Implementation ✅
- ✅ Gitea Actions enabled on instance
- ✅ Self-hosted runners deployed (2x act_runner v0.2.10)
- ✅ Runner automation via Ansible (setup-runner.yml)
- ✅ Systemd services for runner management
- ✅ Host networking configuration for job containers
- ✅ CI workflow created (.gitea/workflows/test.yml)
- ✅ Automated testing on pull requests
- ✅ Docker layer caching for performance
- ✅ Artifact upload on test failure
- ✅ Full CI/CD pipeline tested and operational

### Goals:
- ✅ Automated update system operational
- ✅ Update process tested and validated on live system
- ✅ Rollback procedure implemented and tested
- ✅ Quality gate for CI/local environments
- ✅ CI/CD pipeline with self-hosted runners
- ✅ Documentation complete (workflow diagram)

**Implementation Summary:**
- 6 bash scripts following best practices (DRY, error handling, logging)
- Diun monitoring with AWS SES email notifications
- Per-container update policies (automatic: nginx, manual: gitea/postgres)
- Pre-update backups with automatic rollback on failure
- Certificate renewal automation
- Comprehensive testing framework
- CI/CD with Gitea Actions and 2 self-hosted runners
- Visual workflow documentation (including CI/CD flow)

**Phase 4 Complete!** Update automation and CI/CD fully operational with safety mechanisms.

---

## Phase 5: Backup Strategy Implementation ✅

This phase implements comprehensive backup solutions.

### 5.1 Backup Concept Document ✅
- ✅ Document backup strategy (3-2-1 rule)
- ✅ Define backup scope (database, repos, config, etc.)
- ✅ Define retention policy
- ✅ Define RTO and RPO targets

### 5.2 Backup Implementation ✅
- ✅ Automate database backups (pg_dump)
- ✅ Automate Gitea data directory backups (tar.gz)
- ✅ Automate configuration backups (docker-compose.yml, .env, scripts)
- ✅ Set up backup storage (S3 with versioning)
- ✅ Implement backup rotation and cleanup (S3 lifecycle policy)
- ✅ Schedule automated backups (daily 2:00 AM cron)
- ✅ Pre-update backups integrated into update workflow

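
The three backup artifacts listed above (database dump, data archive, configuration archive) and their upload might be produced along these lines. This is a hedged sketch, not backup.sh itself: the bucket, the `/opt/gitea` and `/tmp` paths, the `postgres` service name, and the `gitea` user and database names are all assumptions, and the `YYYYMMDD_HHMMSS` naming is inferred from the timestamp quoted elsewhere in this roadmap.

```bash
#!/usr/bin/env bash
# Sketch of the backup artifacts and their upload to S3.
# Bucket, paths, service, user and database names are hypothetical.

DRY_RUN="${DRY_RUN:-1}"  # 1 = only print commands, 0 = execute them
BUCKET="${BUCKET:-example-backup-bucket}"

run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Timestamp in YYYYMMDD_HHMMSS form, e.g. 20260611_164408.
backup_timestamp() { date +%Y%m%d_%H%M%S; }

backup_all() {
  local ts="$1" f
  # Database dump, compressed on the fly.
  run sh -c "docker compose exec -T postgres pg_dump -U gitea gitea | gzip > /tmp/db_${ts}.sql.gz"
  # Gitea data directory.
  run tar -czf "/tmp/data_${ts}.tar.gz" -C /opt/gitea data
  # Configuration: Compose file, environment file and scripts.
  run tar -czf "/tmp/config_${ts}.tar.gz" -C /opt/gitea docker-compose.yml .env scripts
  # Upload all three artifacts.
  for f in "db_${ts}.sql.gz" "data_${ts}.tar.gz" "config_${ts}.tar.gz"; do
    run aws s3 cp "/tmp/$f" "s3://${BUCKET}/$f"
  done
}

backup_all "$(backup_timestamp)"
```

Rotation is not handled here because the roadmap delegates it to the S3 lifecycle policy.
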
### 5.3 Recovery Testing ✅
- ✅ Document restore procedures (docs/backup-strategy.md + restore.sh script)
- ✅ Test database restore on live system (timestamp: 20260611_164408)
- ✅ Test full system restore (database + data + config)
- ✅ Verify services operational post-restore (all containers healthy)
- ✅ Document recovery time (RTO: ~45 minutes, RPO: 24 hours)
- ✅ Integration test suite includes full backup/restore cycle validation

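
A restore starts by deciding which backup to fetch. Because a `YYYYMMDD_HHMMSS` timestamp has a fixed width, plain lexicographic sorting puts backups in chronological order. The sketch below shows that selection step only; the `db_<timestamp>.sql.gz` naming and the `latest_timestamp` helper are assumptions, not taken from restore.sh.

```bash
#!/usr/bin/env bash
# Sketch: choose the newest backup timestamp from a list of object names.
# The db_<timestamp>.sql.gz naming is a hypothetical convention.

# Reads object names on stdin and prints the newest YYYYMMDD_HHMMSS timestamp.
# Names that do not match the pattern are ignored.
latest_timestamp() {
  sed -n 's/^db_\([0-9]\{8\}_[0-9]\{6\}\)\.sql\.gz$/\1/p' | sort | tail -n 1
}

printf '%s\n' \
  db_20260610_020000.sql.gz \
  db_20260611_164408.sql.gz \
  db_20260611_020000.sql.gz \
  | latest_timestamp
# prints 20260611_164408
```

In a real restore the input would come from a bucket listing, and the chosen timestamp would then select the matching database, data and configuration archives.
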
### Goals:
- ✅ Automated backup system operational
- ✅ Restore procedures tested and documented
- ✅ Backup strategy document completed (docs/backup-strategy.md - 145 lines, concise)
- ✅ Disaster recovery validated on production system

**Phase 5 Complete!** Backup and restore fully operational and validated.

---

## Phase 6: Monitoring Concept 🔄

This phase documents a monitoring strategy for future implementation.

### 6.1 Monitoring Concept Document 🔄
- 🔄 Define key metrics to monitor (CPU, RAM, disk, network, Gitea-specific)
- 🔄 Define alerting thresholds and conditions
- 🔄 Define alert channels (email, Slack, etc.)
- 🔄 Technology selection (Prometheus + Grafana)
- 🔄 Architecture design (exporters, retention, dashboards)
- 🔄 Implementation plan and effort estimation

### Goals:
- 🔄 Monitoring concept document completed (docs/monitoring-concept.md)
- 🔄 Clear roadmap for future monitoring implementation

**Note**: Full implementation deferred - concept document shows architectural understanding and planning.

---

## Phase 7: Logging Concept 🔄

This phase documents a centralized logging strategy for future implementation.

### 7.1 Logging Concept Document 🔄
- 🔄 Define logging architecture (Loki + Promtail)
- 🔄 Define log sources (Gitea, nginx, PostgreSQL, system)
- 🔄 Define log retention policy
- 🔄 Define log analysis requirements and use cases
- 🔄 Integration with Grafana for visualization
- 🔄 Implementation plan and resource requirements

### Goals:
- 🔄 Logging concept document completed (docs/logging-concept.md)
- 🔄 Clear roadmap for future logging implementation

**Note**: Full implementation deferred - concept document shows architectural understanding and planning.

---

## Phase 8: High Availability Concept 🔄

This phase documents a high availability strategy for future implementation.

### 8.1 HA Concept Document 🔄
- 🔄 Document SPOF (Single Points of Failure) analysis
- 🔄 Design HA architecture (Multi-AZ, load balancing)
- 🔄 Database redundancy strategy (RDS Multi-AZ or PostgreSQL replication)
- 🔄 Application redundancy (multiple Gitea instances)
- 🔄 Shared storage considerations (EFS or S3 for Gitea data)
- 🔄 Load balancer configuration (ALB)
- 🔄 Define failover strategy and automation
- 🔄 Define RTO/RPO targets for HA scenario
- 🔄 Cost analysis and trade-offs

### Goals:
- 🔄 HA concept document completed (docs/ha-concept.md)
- 🔄 Clear architecture for scaling to high availability

**Note**: Full implementation deferred - concept document shows architectural understanding and planning.

---

## Phase 9: Documentation and Final Testing ✅

This phase consolidates all documentation and performs end-to-end testing.

### 9.1 Documentation ✅
- ✅ Create comprehensive README.md
  - Project overview and objectives
  - Architecture summary
  - Prerequisites and setup instructions
  - Deployment procedures
  - Operational procedures
  - Troubleshooting guide
- ✅ Document architecture with diagrams (4 diagrams in docs/diagrams/)
- ✅ Document all decisions (ADR.md)
- ✅ Document all procedures (deployment, updates, backup/restore)
- ✅ Backup strategy documentation (docs/backup-strategy.md - 152 lines)
- ✅ Future enhancements (monitoring, logging, HA concept docs created)

### 9.2 Final Testing ✅
- ✅ Perform end-to-end deployment test (make configure tested)
- ✅ Test all automated processes (updates, backups, CI/CD)
- ✅ Verify all automation is functional
- ✅ System accessible via HTTPS with production SSL

### 9.3 Repository Organization ✅
- ✅ Well-organized directory structure
- ✅ Clear separation of concerns (terraform, ansible, docker, scripts)
- 🔄 Comprehensive README.md

### Goals:
- 🔄 Complete documentation package
- ✅ All automation tested and validated
- 🔄 Ready for interview presentation

---

## Phase 10: Interview Preparation

This phase prepares for the interview discussion.

### 10.1 Preparation
- Review all concept documents
- Prepare to explain technology choices
- Prepare architecture diagrams for presentation
- Prepare to demonstrate the system
- List lessons learned and trade-offs made
- Prepare improvement suggestions

### Goals:
- Ready to discuss all aspects of the implementation
- Demo environment functional and accessible
- Confident in technology choices and concepts

---

## Success Criteria

- ✅ Gitea accessible via HTTPS through reverse proxy (production SSL)
- ✅ Installation fully automated and reproducible (Terraform + Ansible)
- ✅ Automated updates configured and tested (Diun + custom scripts)
- ✅ CI/CD pipeline operational (Gitea Actions with self-hosted runners)
- ✅ Automated backups implemented (daily to S3)
- 🔄 Comprehensive concept documents for: Backup, Monitoring, Logging, HA
- ✅ All code in version control with proper structure
- ✅ System accessible to interviewer over internet (https://git.poll-streams.com)
- 🔄 Complete README.md with deployment and operational procedures

**Current Status**: Production-ready system with comprehensive automation. Completing final documentation phase before interview.

---

## Remaining Work (Phase 9 Completion)

### Documentation Tasks
1. **README.md** - Comprehensive project documentation
   - Overview and objectives
   - Architecture summary with diagram references
   - Prerequisites and deployment guide
   - Operational procedures (updates, backups, troubleshooting)

2. **docs/backup-strategy.md** - Complete backup documentation
   - 3-2-1 backup strategy
   - RTO/RPO targets
   - Backup scope and retention policy
   - Restore procedures with step-by-step instructions
   - S3 lifecycle policy for rotation
   - Configuration backup automation

3. **docs/monitoring-concept.md** - Future monitoring architecture
   - Prometheus + Grafana architecture
   - Key metrics and alerting thresholds
   - Implementation plan

4. **docs/logging-concept.md** - Future logging architecture
   - Loki + Promtail architecture
   - Log sources and retention
   - Implementation plan

5. **docs/ha-concept.md** - High availability design
   - SPOF analysis
   - Multi-AZ architecture with load balancing
   - Database replication strategy
   - Cost/benefit analysis

**Estimated Completion**: 2-3 hours