Cloud Platform Infrastructure
- Designed a three-tier Terraform architecture (67 modules) enabling one engineer to safely manage 6 AWS accounts and 16 environments
- Eliminated all direct outbound internet access from the FedRAMP boundary through VPC endpoints and forward proxy allowlisting
- Implemented full microsegmentation with 60+ security group-managed endpoints using bidirectional rules, making lateral movement effectively impossible
- Achieved zero static credentials across all CI/CD and runtime via GitHub OIDC federation and per-task IAM roles
- Built immutable cross-region backups with vault lock enforcement and 7-year retention for FedRAMP compliance
The Challenge
The platform I manage started on a single-server Docker Swarm PaaS. That worked until we needed FedRAMP authorization, data residency across multiple regions, and the ability to scale beyond what a managed platform could offer. I needed to take this from a constrained PaaS deployment to a multi-account AWS organization supporting 40+ containers, multiple compliance environments, and data residency in both Canada and the US. The catch: a small team still had to be able to operate all of it with confidence.
Approach & Role
I was the sole infrastructure architect. I designed and implemented everything from network topology through application deployment. The result is a three-tier Terraform architecture that separates reusable modules from deployment orchestration and per-account configuration. One engineer can manage 6+ AWS accounts because the structure makes it safe to do so.
The key insight was designing for compliance from the start rather than retrofitting. Every resource gets KMS encryption. Every service gets a least-privilege IAM role. Every change flows through GitHub OIDC federation with zero static credentials. Building it right the first time meant we didn't have to rip things apart when the auditors showed up.
Architecture & Patterns
Three-tier Terraform structure:
- A versioned module library (67+ modules, git tag-based versioning) covering VPC, ECS, RDS, KMS, S3, WAF, GuardDuty, Security Hub, SageMaker, Bedrock, and more
- A deployment orchestration layer using YAML-templated service definitions for 20+ services
- Per-account variable configurations with layered deployment order: compliance, common, stacks, security groups
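The layered deployment order can be illustrated with a small ordering helper. This is a sketch under stated assumptions, not the actual orchestration code: the four layer names come from the list above, while `LAYER_ORDER`, `order_deployments`, and the example deployment names are hypothetical.

```python
# Hypothetical sketch: sort pending deployments by the layer order described above.
LAYER_ORDER = ["compliance", "common", "stacks", "security-groups"]


def order_deployments(deployments):
    """Return (layer, name) pairs sorted so earlier layers are applied first.

    Raises ValueError for an unknown layer instead of guessing its position.
    """
    rank = {layer: index for index, layer in enumerate(LAYER_ORDER)}
    for layer, _name in deployments:
        if layer not in rank:
            raise ValueError(f"unknown layer: {layer}")
    # sorted() is stable, so deployments within a layer keep their input order.
    return sorted(deployments, key=lambda item: rank[item[0]])


# Illustrative names only; they are not taken from the real repository.
pending = [
    ("security-groups", "api-to-db"),
    ("stacks", "api"),
    ("compliance", "guardduty"),
    ("common", "kms"),
    ("stacks", "worker"),
]
ordered = order_deployments(pending)
```

Applying security groups last fits the bidirectional-rule model: both the caller's and the target's groups need to exist before a rule can reference them.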
Multi-account organization:
- Separate AWS accounts for development, staging, production (CA + US), FedRAMP, and web/DNS
- GitHub OIDC federation with per-account IAM roles; no static credentials anywhere in CI/CD
- S3 backend with partial configuration for account portability
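To make the OIDC federation concrete, here is a minimal sketch of the kind of IAM trust policy that lets a GitHub Actions workflow assume a per-account role without stored keys. The provider host, the `sts.amazonaws.com` audience, and the `repo:ORG/REPO:ref:...` subject format are the standard GitHub-to-AWS conventions; the function name, account ID, organization, and repository are placeholders, and the real policies may restrict the subject differently.

```python
import json

GITHUB_OIDC_PROVIDER = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id, org, repo, branch="main", partition="aws"):
    """Build a trust policy allowing one repo/branch to assume the role.

    All arguments are illustrative placeholders, not values from the real setup.
    """
    provider_arn = (
        f"arn:{partition}:iam::{account_id}:oidc-provider/{GITHUB_OIDC_PROVIDER}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Audience requested by the workflow's token exchange.
                        f"{GITHUB_OIDC_PROVIDER}:aud": "sts.amazonaws.com",
                        # Pin the role to a single repository and branch.
                        f"{GITHUB_OIDC_PROVIDER}:sub": (
                            f"repo:{org}/{repo}:ref:refs/heads/{branch}"
                        ),
                    }
                },
            }
        ],
    }


policy = github_oidc_trust_policy("123456789012", "example-org", "infra")
policy_json = json.dumps(policy, indent=2)
```

Pinning the `sub` claim to one repository and branch is what keeps a role in one account from being assumable by workflows elsewhere.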
Developer experience:
- DevContainers for reproducible Terraform environments
- Pre-commit hooks running tflint, checkov, trivy, terraform-compliance, and terraform-docs
- Path-based CI/CD triggers for monorepo efficiency (plan-on-PR, apply-on-merge)
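Path-based triggering boils down to mapping changed files onto the deployments they belong to. The sketch below assumes a hypothetical monorepo layout of `deployments/<account>/<stack>/...`; the real directory structure is not described here, and `affected_stacks` is an illustrative name.

```python
from pathlib import PurePosixPath


def affected_stacks(changed_files, root="deployments"):
    """Return sorted (account, stack) pairs touched by a set of changed files.

    Assumes the hypothetical layout <root>/<account>/<stack>/<file...>.
    Files outside that layout are ignored.
    """
    stacks = set()
    for changed in changed_files:
        parts = PurePosixPath(changed).parts
        if len(parts) >= 4 and parts[0] == root:
            stacks.add((parts[1], parts[2]))
    return sorted(stacks)


# Illustrative change set for a pull request.
changed = [
    "deployments/prod-ca/api/main.tf",
    "deployments/prod-ca/api/variables.tf",
    "deployments/staging/worker/main.tf",
    "README.md",
    "modules/vpc/main.tf",
]
to_plan = affected_stacks(changed)
```

In this sketch a change under `modules/` triggers nothing by itself, on the assumption that deployments pick up module changes only when they bump the pinned git tag.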
Network Security Architecture
I designed the network topology around a single principle: nothing talks to the internet unless there's an explicit, justified reason. Every service inside the FedRAMP boundary operates in a fully private network with no direct outbound internet access.
VPC segmentation by function:
- Core application microservices share a VPC but are segmented across private subnets with security group boundaries between service tiers
- UC (video conferencing) services occupy their own VPC because real-time media doesn't traverse NATs reliably. This is the only VPC with public-facing services beyond the edge load balancers
- Database services (RDS Aurora, Redis/Valkey) sit in an isolated VPC, accessible only through VPC endpoints (PrivateLink) from the application VPCs. No direct VPC peering, no transit gateway
- Observability infrastructure (monitoring, log aggregation, event processing) runs in its own VPC to separate operational tooling from application workloads
Private connectivity for AWS services:
- Every AWS API call (S3, SES, SNS, CloudWatch Logs, Transcribe, SageMaker, ECR, etc.) routes through interface or gateway VPC endpoints
- This is a defense-in-depth choice: since services have no reason to reach the public internet, that path simply doesn't exist. Traffic to AWS stays on the AWS backbone.
Controlled egress via proxy subnets:
- A small number of services need to call external third-party APIs (licensing servers, media retrieval). These calls route through forward proxy instances in dedicated subnets that have internet access
- The proxies allowlist specific external destinations; services cannot reach arbitrary internet hosts even through the proxy path
- This gives me complete visibility into external traffic flows and creates a single chokepoint for egress monitoring
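The allowlist decision itself is simple to illustrate. The proxy software is not named above, so the following is only a sketch of the matching logic, with hypothetical hostnames; it uses exact matching so that a lookalike such as `licensing.example.com.evil.io` is not accepted.

```python
# Hypothetical destinations; the real allowlist entries are not listed in the text.
ALLOWED_HOSTS = frozenset({"licensing.example.com", "media.example.net"})


def is_allowed(host, allowlist=ALLOWED_HOSTS):
    """Return True only if the destination host exactly matches an allowlist entry.

    The host is normalized (case, surrounding whitespace, trailing dot) first.
    Suffix or substring matching is deliberately avoided.
    """
    normalized = host.strip().lower().rstrip(".")
    return normalized in allowlist


decisions = {
    host: is_allowed(host)
    for host in (
        "licensing.example.com",
        "Licensing.Example.com.",
        "licensing.example.com.evil.io",
        "example.org",
    )
}
```

Anything that fails the check is denied at the proxy, which is also where the denial would be logged for egress monitoring.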
Layered load balancing:
- Internet-facing ALBs terminate TLS and route to public services
- Internal ALBs and NLBs handle all inter-service communication; no service-to-service traffic crosses a public load balancer
- ECS Service Connect provides service mesh connectivity within the core application VPC
Microsegmentation via bidirectional security groups:
- Every service-to-service communication path is explicitly defined with paired security group rules: an outbound rule on the caller allowing traffic to the target, and a corresponding inbound rule on the target allowing traffic from the caller
- This applies across the entire topology: application services, VPC endpoints, databases, caches, load balancers, and proxy instances all have explicit bidirectional rules
- No service can reach another unless both sides of the connection are explicitly authorized. This isn't "allow all within the VPC"; it's full microsegmentation at the security group level
- The result is 60+ security group-managed endpoints with individually scoped communication paths, making lateral movement between compromised services effectively impossible without matching rules on both ends
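The pairing rule can be modeled in a few lines. This is an illustrative sketch, not the Terraform that implements it: `Rule`, `paired_rules`, `unpaired`, and the security group names are hypothetical. It shows how one communication path expands into an egress rule on the caller plus an ingress rule on the target, and how a one-sided rule can be detected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    security_group: str  # group the rule is attached to
    direction: str       # "egress" or "ingress"
    peer_group: str      # group on the other end of the connection
    port: int
    protocol: str = "tcp"


def paired_rules(caller_sg, target_sg, port, protocol="tcp"):
    """Expand one communication path into its two matching rules."""
    return (
        Rule(caller_sg, "egress", target_sg, port, protocol),
        Rule(target_sg, "ingress", caller_sg, port, protocol),
    )


def unpaired(rules):
    """Return rules whose mirror-image rule on the peer group is missing."""
    rule_set = set(rules)

    def mirror(rule):
        flipped = "ingress" if rule.direction == "egress" else "egress"
        return Rule(rule.peer_group, flipped, rule.security_group,
                    rule.port, rule.protocol)

    return [rule for rule in rules if mirror(rule) not in rule_set]


# Hypothetical path: the API service calls the database on port 5432.
api_to_db = list(paired_rules("sg-api", "sg-db", 5432))
one_sided = unpaired(api_to_db[:1])  # only the caller's egress rule present
```

A check like `unpaired` could run as a quality gate so that a path is never half-defined; whether the real pipeline does this is an assumption.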
Admin access (SSM-only, no VPN):
- The sole path into production is AWS Systems Manager Session Manager: authenticate via MFA-gated SSO (AWS Identity Center), then establish an SSM port-forward to the bastion
- The bastion itself is network-restricted to specific internal targets (reporting databases, monitoring endpoints), with no lateral movement to arbitrary hosts
- SSM's architecture prevents chaining port forwards, so an operator on the bastion can't pivot to other services. This is by design.
- Development and staging environments are completely disconnected from production: separate accounts, separate identity providers (Google IDP vs agency IDP), no network linkage
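For readers unfamiliar with Session Manager port forwarding, the sketch below builds the AWS CLI invocation an operator might run after signing in through SSO. It assumes the forward targets an internal host through the bastion using the `AWS-StartPortForwardingSessionToRemoteHost` document; the exact document, instance ID, hostname, and ports used in practice are not stated above, so every value here is a placeholder.

```python
import json


def ssm_port_forward_command(bastion_instance_id, remote_host, remote_port, local_port):
    """Return the argv list for an SSM port-forwarding session via the bastion.

    The command is only constructed here, not executed.
    """
    parameters = {
        "host": [remote_host],
        "portNumber": [str(remote_port)],
        "localPortNumber": [str(local_port)],
    }
    return [
        "aws", "ssm", "start-session",
        "--target", bastion_instance_id,
        "--document-name", "AWS-StartPortForwardingSessionToRemoteHost",
        "--parameters", json.dumps(parameters),
    ]


# Placeholder values: a reporting database reached on local port 15432.
command = ssm_port_forward_command(
    "i-0123456789abcdef0", "reporting-db.internal.example", 5432, 15432
)
```

Because the session is brokered by the SSM service, the bastion needs no inbound security group rule and no public IP.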
Data protection & immutable backups:
- AWS Backup with Vault Lock policies covering all stateful resources (S3, RDS, EFS). Backups are immutable and cannot be deleted by operators, even with admin credentials
- Cross-region replication to a secondary vault with extended retention (7-year monthly backups) for compliance and disaster recovery
- All backup vaults are encrypted with FIPS-validated KMS keys, and data is transferred exclusively over AWS internal networks
- Tiered retention in the primary region: 7-day point-in-time recovery, daily (7 days), weekly (21 days), monthly (90 days)
- Secondary region retention: daily (35 days), weekly (90 days), monthly (7 years)
- Backup plans auto-discover new resources by type, so no manual enrollment is needed when services are added
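The retention tiers above can be written down as data and sanity-checked. This sketch encodes the stated day counts and verifies one invariant implied by "extended retention": every tier is kept at least as long in the secondary region as in the primary. The names are hypothetical, and 7 years is approximated as 7 × 365 days, ignoring leap days.

```python
# Retention in days, taken from the tiers listed above.
PRIMARY_RETENTION = {"daily": 7, "weekly": 21, "monthly": 90}
SECONDARY_RETENTION = {"daily": 35, "weekly": 90, "monthly": 7 * 365}  # ~7 years


def retention_violations(primary, secondary):
    """List tiers where the secondary copy would expire before the primary."""
    problems = []
    for tier, primary_days in primary.items():
        if tier not in secondary:
            problems.append(f"{tier}: missing in secondary region")
        elif secondary[tier] < primary_days:
            problems.append(
                f"{tier}: secondary {secondary[tier]}d < primary {primary_days}d"
            )
    return problems


violations = retention_violations(PRIMARY_RETENTION, SECONDARY_RETENTION)
```

With the values listed, the check passes for all three tiers.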
Impact & Scale
- 67 reusable Terraform modules spanning all AWS services in active use
- 16 environments across 6+ AWS accounts (including GovCloud)
- 40+ ECS Fargate tasks with Service Connect mesh, FluentBit sidecars, and Spot capacity providers
- 60+ security group-managed endpoints with explicit bidirectional rules for every communication path
- Multi-region deployment (ca-central-1, ca-west-1, us-east-1, us-east-2, us-west-2, us-gov-east-1, us-gov-west-1)
- Zero static credentials: all authentication via OIDC federation and IAM roles
- Immutable cross-region backups with 7-year retention and vault lock enforcement
- Small team managing the full infrastructure with high confidence through automation and quality gates