Cloud Platform Infrastructure

Designed a three-tier Terraform architecture (67 modules) enabling one engineer to safely manage 6 AWS accounts and 16 environments
Eliminated all outbound internet access from the FedRAMP boundary through VPC endpoints and forward proxy allowlisting
Implemented full microsegmentation with 60+ security group-managed endpoints using bidirectional rules, making lateral movement effectively impossible
Achieved zero static credentials across all CI/CD and runtime via GitHub OIDC federation and per-task IAM roles
Built immutable cross-region backups with vault lock enforcement and 7-year retention for FedRAMP compliance

The Challenge

The platform I manage started on a single-server Docker Swarm PaaS. That worked until we needed FedRAMP authorization, data residency across multiple regions, and the ability to scale beyond what a managed platform could offer. I needed to take this from a constrained PaaS deployment to a multi-account AWS organization supporting 40+ containers, multiple compliance environments, and data residency in both Canada and the US. The catch: a small team still had to be able to operate all of it with confidence.

Approach & Role

I was the sole infrastructure architect. I designed and implemented everything from network topology through application deployment. The result is a three-tier Terraform architecture that separates reusable modules from deployment orchestration and per-account configuration. One engineer can manage 6+ AWS accounts because the structure makes it safe to do so.

The key insight was designing for compliance from the start rather than retrofitting. Every resource gets KMS encryption. Every service gets a least-privilege IAM role. Every change flows through GitHub OIDC federation with zero static credentials. Building it right the first time meant we didn't have to rip things apart when the auditors showed up.

Architecture & Patterns

Three-tier Terraform structure:

A versioned module library (67+ modules, git tag-based versioning) covering VPC, ECS, RDS, KMS, S3, WAF, GuardDuty, Security Hub, SageMaker, Bedrock, and more
A deployment orchestration layer using YAML-templated service definitions for 20+ services
Per-account variable configurations with layered deployment order: compliance, common, stacks, security groups
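
The layered deployment order can be sketched as a small planner. This is a minimal illustration, not the actual orchestration code: the layer names come from the list above, while the function name, the account names, and the hyphenated spelling "security-groups" are assumptions.

```python
# Layer names follow the order stated in the text; everything else here
# (function name, input shape, account names) is hypothetical.
LAYER_ORDER = ("compliance", "common", "stacks", "security-groups")


def apply_plan(account_layers):
    """Return (account, layer) steps with each account's layers in apply order.

    account_layers maps an account name to the layers that need applying.
    """
    steps = []
    for account in sorted(account_layers):
        requested = set(account_layers[account])
        unknown = requested - set(LAYER_ORDER)
        if unknown:
            raise ValueError(f"unknown layers for {account}: {sorted(unknown)}")
        # Walk the canonical order so later layers never run before earlier ones.
        steps.extend((account, layer) for layer in LAYER_ORDER if layer in requested)
    return steps


if __name__ == "__main__":
    for step in apply_plan({"staging": ["stacks", "compliance"]}):
        print(step)
```

Keeping the order in one tuple means a new layer is added in exactly one place, and an unrecognized layer fails loudly instead of being applied out of sequence.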

Multi-account organization:

Separate AWS accounts for development, staging, production (CA + US), FedRAMP, and web/DNS
GitHub OIDC federation with per-account IAM roles; no static credentials anywhere in CI/CD
S3 backend with partial configuration for account portability
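
To make the OIDC federation concrete, here is a hedged sketch of the kind of IAM trust policy a per-account role would carry. The account ID, repository name, and function name are placeholders I introduced. The policy shape follows the standard GitHub Actions OIDC pattern and is not copied from the real configuration.

```python
import json

GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id, repo, ref="refs/heads/main", partition="aws"):
    """Build a trust policy letting one GitHub repo and ref assume a role.

    partition is "aws" for commercial regions; "aws-us-gov" is the GovCloud
    partition. All argument values used below are hypothetical.
    """
    provider_arn = f"arn:{partition}:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # Audience must be STS; subject pins the repo and git ref.
                    "StringEquals": {f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com"},
                    "StringLike": {f"{GITHUB_OIDC_HOST}:sub": f"repo:{repo}:ref:{ref}"},
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(github_oidc_trust_policy("111122223333", "example-org/infra"), indent=2))
```

Pinning the `sub` claim to a specific repository and ref is what keeps a workflow in some other repository from assuming the same role.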

Developer experience:

DevContainers for reproducible Terraform environments
Pre-commit hooks running tflint, checkov, trivy, terraform-compliance, and terraform-docs
Path-based CI/CD triggers for monorepo efficiency (plan-on-PR, apply-on-merge)
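
Path-based triggering can be illustrated with a short helper that maps changed files to the deployments they affect. The `accounts/<account>/<layer>/...` directory layout is an assumption made for this sketch; the real repository layout is not described here.

```python
from pathlib import PurePosixPath


def affected_targets(changed_files):
    """Map changed file paths to the (account, layer) pairs that need a plan.

    Assumes a hypothetical layout of accounts/<account>/<layer>/<file...>.
    Files outside that tree (docs, tooling) trigger nothing.
    """
    targets = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Require at least one file below the layer directory.
        if len(parts) >= 4 and parts[0] == "accounts":
            targets.add((parts[1], parts[2]))
    return sorted(targets)


if __name__ == "__main__":
    changed = [
        "accounts/staging/common/main.tf",
        "accounts/staging/common/variables.tf",
        "README.md",
    ]
    print(affected_targets(changed))
```

On a pull request the resulting pairs would each get a plan; on merge the same pairs would be applied.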

Network Security Architecture

I designed the network topology around a single principle: nothing talks to the internet unless there's an explicit, justified reason. Every service inside the FedRAMP boundary operates in a fully private network with no outbound internet access.

VPC segmentation by function:

Core application microservices share a VPC but are segmented across private subnets with security group boundaries between service tiers
UC (video conferencing) services occupy their own VPC because real-time media doesn't traverse NATs reliably. This is the only VPC with public-facing services beyond the edge load balancers
Database services (RDS Aurora, Redis/Valkey) sit in an isolated VPC, accessible only through VPC endpoints (PrivateLink) from the application VPCs. No direct VPC peering, no transit gateway
Observability infrastructure (monitoring, log aggregation, event processing) runs in its own VPC to separate operational tooling from application workloads

Private connectivity for AWS services:

Every AWS API call (S3, SES, SNS, CloudWatch Logs, Transcribe, SageMaker, ECR, etc.) routes through interface or gateway VPC endpoints
This is a defense-in-depth choice: since services have no reason to reach the public internet, that path simply doesn't exist. Traffic to AWS stays on the AWS backbone.
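
As a rough sketch of how the endpoint set could be derived, the helper below classifies services into gateway and interface endpoints. Only S3 and DynamoDB offer gateway endpoints; the rest use interface endpoints (PrivateLink). The function name and the service list are illustrative assumptions, and while most services follow the `com.amazonaws.<region>.<service>` naming pattern, some do not, so real names should be checked against the AWS service list.

```python
# Gateway endpoints are available only for S3 and DynamoDB.
GATEWAY_SERVICES = {"s3", "dynamodb"}


def endpoint_plan(region, services):
    """Return one endpoint description per service short name (hypothetical helper)."""
    plan = []
    for svc in services:
        plan.append(
            {
                # Common naming pattern; not universal across all services.
                "service_name": f"com.amazonaws.{region}.{svc}",
                "type": "Gateway" if svc in GATEWAY_SERVICES else "Interface",
            }
        )
    return plan


if __name__ == "__main__":
    for endpoint in endpoint_plan("ca-central-1", ["s3", "logs", "ecr.api", "ecr.dkr", "sns"]):
        print(endpoint)
```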

Controlled egress via proxy subnets:

A small number of services need to call external third-party APIs (licensing servers, media retrieval). These calls route through forward proxy instances in dedicated subnets that have internet access
The proxies allowlist specific external destinations, so services cannot reach arbitrary internet hosts even through the proxy path
This gives me complete visibility into external traffic flows and creates a single chokepoint for egress monitoring
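
The allowlisting decision can be sketched as a small matching function. This is an illustration only: the proxy software is not named here, and the hostnames and the leading-dot convention for subdomain entries are assumptions of this sketch, not the proxy's actual syntax.

```python
def is_allowed(host, allowlist):
    """Return True if host matches an allowlist entry.

    An entry without a leading dot must match exactly. An entry with a
    leading dot (".example.com") matches subdomains only. Hypothetical rules.
    """
    host = host.lower().rstrip(".")
    for entry in allowlist:
        entry = entry.lower()
        if entry.startswith("."):
            if host.endswith(entry):
                return True
        elif host == entry:
            return True
    return False


if __name__ == "__main__":
    allow = ["licensing.example.com", ".media.example.org"]
    for candidate in ["licensing.example.com", "cdn.media.example.org", "evil.example.net"]:
        print(candidate, is_allowed(candidate, allow))
```

Matching on the leading dot avoids the classic suffix bug where `evilexample.com` passes a check intended for `example.com`.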

Layered load balancing:

Internet-facing ALBs terminate TLS and route to public services
Internal ALBs and NLBs handle all inter-service communication; no service-to-service traffic crosses a public load balancer
ECS Service Connect provides service mesh connectivity within the core application VPC

Microsegmentation via bidirectional security groups:

Every service-to-service communication path is explicitly defined with paired security group rules: an outbound rule on the caller allowing traffic to the target, and a corresponding inbound rule on the target allowing traffic from the caller
This applies across the entire topology: application services, VPC endpoints, databases, caches, load balancers, and proxy instances all have explicit bidirectional rules
No service can reach another unless both sides of the connection are explicitly authorized. This isn't "allow all within the VPC"; it's full microsegmentation at the security group level
The result is 60+ security group-managed endpoints with individually scoped communication paths, making lateral movement between compromised services effectively impossible without matching rules on both ends
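
The paired-rule idea can be sketched as follows: each declared flow expands into one egress rule on the caller and one ingress rule on the target, and a checker flags any rule without its mirror. The `Flow` type, the rule dictionary shape, and the service names are hypothetical and stand in for whatever the Terraform modules actually generate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One authorized communication path (hypothetical model)."""
    caller: str
    target: str
    port: int
    protocol: str = "tcp"


def paired_rules(flows):
    """Expand each flow into an egress rule on the caller and an ingress rule on the target."""
    rules = []
    for f in flows:
        rules.append({"group": f.caller, "direction": "egress", "peer": f.target,
                      "port": f.port, "protocol": f.protocol})
        rules.append({"group": f.target, "direction": "ingress", "peer": f.caller,
                      "port": f.port, "protocol": f.protocol})
    return rules


def unpaired(rules):
    """Return rules whose mirror rule on the other security group is missing."""
    def key(r):
        return (r["group"], r["direction"], r["peer"], r["port"], r["protocol"])

    def mirror(r):
        other = "ingress" if r["direction"] == "egress" else "egress"
        return (r["peer"], other, r["group"], r["port"], r["protocol"])

    present = {key(r) for r in rules}
    return [r for r in rules if mirror(r) not in present]


if __name__ == "__main__":
    rules = paired_rules([Flow("api", "db", 5432)])
    print(rules)
    print("unpaired:", unpaired(rules))
```

A check like `unpaired` fits naturally in a pre-commit or CI step, since a one-sided rule is either dead configuration or a sign that a path was only half-reviewed.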

Admin access (SSM-only, no VPN):

The sole path into production is AWS Systems Manager Session Manager: authenticate via MFA-gated SSO (AWS Identity Center), then establish an SSM port-forward to the bastion
The bastion itself is network-restricted to specific internal targets (reporting databases, monitoring endpoints), with no lateral movement to arbitrary hosts
SSM's architecture prevents chaining port forwards, so an operator on the bastion can't pivot to other services. This is by design.
Development and staging environments are completely disconnected from production: separate accounts, separate identity providers (Google IDP vs agency IDP), and no network linkage

Data protection & immutable backups:

AWS Backup with Vault Lock policies covers all stateful resources (S3, RDS, EFS). Backups are immutable and cannot be deleted by operators, even with admin credentials
Cross-region replication to a secondary vault with extended retention (7-year monthly backups) for compliance and disaster recovery
All backup vaults are encrypted with FIPS-validated KMS keys, and data is transferred exclusively over AWS internal networks
Tiered retention in the primary region: 7-day point-in-time recovery, plus daily (7 days), weekly (21 days), and monthly (90 days) backups
Secondary region retention: daily (35 days), weekly (90 days), monthly (7 years)
Backup plans auto-discover new resources by type, so no manual enrollment is needed when services are added
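
The retention tiers above can be written down as data and sanity-checked. This is a sketch under stated assumptions: the day counts for daily, weekly, and primary monthly tiers come from the list above, while expressing 7 years as 7 * 365 days is my approximation and the function names are hypothetical.

```python
from datetime import date, timedelta

# Retention in days per region and tier. The 7-year figure is approximated
# as 7 * 365 days for illustration; the exact configured value is not stated.
RETENTION_DAYS = {
    "primary": {"daily": 7, "weekly": 21, "monthly": 90},
    "secondary": {"daily": 35, "weekly": 90, "monthly": 7 * 365},
}


def expires_on(created, region, tier):
    """Return the date a recovery point created on `created` becomes deletable."""
    return created + timedelta(days=RETENTION_DAYS[region][tier])


def secondary_outlives_primary():
    """Check that every secondary tier is kept at least as long as its primary tier."""
    return all(
        RETENTION_DAYS["secondary"][tier] >= days
        for tier, days in RETENTION_DAYS["primary"].items()
    )


if __name__ == "__main__":
    print(expires_on(date(2024, 1, 1), "primary", "weekly"))
    print(secondary_outlives_primary())
```

Encoding the tiers as data makes the invariant that the secondary vault always outlives the primary testable, instead of leaving it as a convention.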

Impact & Scale

67 reusable Terraform modules spanning all AWS services in active use
16 environments across 6+ AWS accounts (including GovCloud)
40+ ECS Fargate tasks with Service Connect mesh, FluentBit sidecars, and Spot capacity providers
60+ security group-managed endpoints with explicit bidirectional rules for every communication path
Multi-region deployment (ca-central-1, ca-west-1, us-east-1, us-east-2, us-west-2, us-gov-east-1, us-gov-west-1)
Zero static credentials: all authentication via OIDC federation and IAM roles
Immutable cross-region backups with 7-year retention and vault lock enforcement
Small team managing the full infrastructure with high confidence through automation and quality gates