Container Orchestration & Platform Migration
- Migrated 40+ containers from Docker Swarm to AWS ECS Fargate with zero data loss and zero application code changes
- Built a secrets compatibility shim that let services run on both platforms simultaneously during the transition, eliminating developer migration work
- Used PostgreSQL logical replication as a hot standby bridge, enabling a zero-downtime cutover with instant rollback capability
- Replaced all static IAM credentials with per-task IAM roles, eliminating long-lived secrets from the platform entirely
- Designed a deliberate migration sequence (provision, replicate, validate in parallel, cutover) that de-risked the move for a regulated healthcare platform
The Challenge
A regulated healthcare platform running 40+ containers on a managed Docker Swarm PaaS (MedStack) needed to migrate to AWS ECS Fargate. MedStack couldn't support a FedRAMP authorization path, and its SOC 2 certification story was unreliable - the platform needed its own compliance boundary to pursue federal contracts. The platform served active therapy sessions across North America, so downtime wasn't an option. The migration had to be incremental, maintain compliance continuity, and impose zero burden on application development teams - developers shouldn't need to touch their code to move between orchestrators.
Approach & Role
I designed and executed the full migration strategy end-to-end. The key constraints: zero downtime for end users, zero code changes required from application developers, and continuous compliance throughout the transition. The platform couldn't go dark for a weekend while we figured things out - this is healthcare, and people are in active therapy sessions.
Architecture & Patterns
Secrets compatibility shim (the enabler):
The first problem was that Docker Swarm and ECS Fargate handle secrets completely differently. Swarm mounts them as files at /run/secrets/{name}, while Fargate injects them as environment variables. Every service in the fleet read secrets from disk.
Rather than ask 4+ development teams to refactor their secret-loading code, I wrote a shell shim that runs at container startup: it detects which platform it's on, and if it's Fargate, creates /run/secrets/ files from environment variables. The application never knows the difference. This meant services could run on both platforms simultaneously during the transition period - critical for incremental migration.
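To illustrate the idea, here is a minimal sketch of the shim logic, written in Python rather than the shell the real shim used. It assumes the task exposes AWS_EXECUTION_ENV=AWS_ECS_FARGATE (worth verifying on your platform version) and that secret variables share a SECRET_ prefix; that prefix and the function names are hypothetical, not the original implementation.

```python
from pathlib import Path
from typing import List, Mapping

# Hypothetical convention: only variables with this prefix are treated as secrets.
SECRET_PREFIX = "SECRET_"


def on_fargate(env: Mapping[str, str]) -> bool:
    """Assumption: Fargate tasks expose AWS_EXECUTION_ENV=AWS_ECS_FARGATE."""
    return env.get("AWS_EXECUTION_ENV") == "AWS_ECS_FARGATE"


def materialize_secrets(env: Mapping[str, str],
                        secrets_dir: Path = Path("/run/secrets")) -> List[str]:
    """Write prefixed environment variables as files, Swarm-style.

    Returns the names of the files written. On Swarm (or anywhere that is
    not Fargate) nothing is written, because the files are already mounted.
    """
    if not on_fargate(env):
        return []
    secrets_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for key, value in env.items():
        if not key.startswith(SECRET_PREFIX):
            continue
        # SECRET_DB_PASSWORD becomes <secrets_dir>/db_password
        name = key[len(SECRET_PREFIX):].lower()
        path = secrets_dir / name
        path.unlink(missing_ok=True)  # tolerate a leftover read-only file
        path.write_text(value)
        path.chmod(0o400)  # owner read-only, similar to a mounted secret
        written.append(name)
    return sorted(written)
```

An entrypoint would call materialize_secrets(os.environ) once and then exec the real application command, so the application still reads /run/secrets/{name} on either platform.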
IAM credential elimination:
On MedStack, services used IAM users with static access keys for AWS API calls. Moving to Fargate meant we could assign IAM roles directly to tasks - no more long-lived credentials sitting in secret stores. Each service got a least-privilege role scoped to exactly what it needed.
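As a rough sketch of what a task role involves: a trust policy that lets ECS tasks assume the role, plus a narrowly scoped permissions policy. The bucket name, prefix, and function names below are hypothetical examples, not the platform's actual policies.

```python
import json


def task_trust_policy() -> dict:
    """Trust policy allowing ECS tasks to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def read_only_bucket_policy(bucket: str, prefix: str = "") -> dict:
    """Least-privilege example: read objects under one prefix, nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
        }],
    }


if __name__ == "__main__":
    # "example-reports-bucket" is a placeholder name for illustration only.
    print(json.dumps(read_only_bucket_policy("example-reports-bucket", "exports/"), indent=2))
```

With a role like this attached to the task, the AWS SDK picks up short-lived credentials from the task itself, so no access keys need to be stored anywhere.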
ECS Fargate deployment architecture:
- YAML template-driven service definitions for consistent deployments across all workloads
- Fluent Bit sidecars on every task for centralized log aggregation (routed through FireLens to CloudWatch and S3)
- Fargate Spot capacity providers for cost optimization on non-critical workloads
- Service Connect mesh for inter-service communication with automatic TLS and service discovery
- Individual Dockerfiles per service (FedRAMP requirement - no shared base image with multiple CMD entrypoints)
- Per-service IAM roles with least-privilege policies
- ECR repository per service with lifecycle policies
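The template-driven approach can be sketched roughly as a function that turns a small per-service spec into an ECS task definition with a Fluent Bit log router attached. The spec keys, log group naming, and default image tag are assumptions for illustration; the S3 output, Service Connect, and capacity provider settings are left out to keep the sketch short.

```python
def render_task_definition(spec: dict) -> dict:
    """Render a Fargate task definition from a small service spec.

    `spec` stands in for one parsed YAML template; its keys are hypothetical.
    """
    app = {
        "name": spec["name"],
        "image": spec["image"],
        "essential": True,
        # Secrets are injected as environment variables from ARNs.
        "secrets": [
            {"name": name, "valueFrom": arn}
            for name, arn in sorted(spec.get("secrets", {}).items())
        ],
        # Route container stdout/stderr through the FireLens log router.
        "logConfiguration": {
            "logDriver": "awsfirelens",
            "options": {
                "Name": "cloudwatch_logs",
                "region": spec["region"],
                "log_group_name": f"/ecs/{spec['name']}",
                "log_stream_prefix": "app-",
                "auto_create_group": "true",
            },
        },
    }
    log_router = {
        "name": "log-router",
        # Assumed default image; pin a specific version in real use.
        "image": spec.get(
            "log_router_image",
            "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
        ),
        "essential": True,
        "firelensConfiguration": {"type": "fluentbit"},
    }
    return {
        "family": spec["name"],
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",
        "cpu": str(spec.get("cpu", 256)),       # task-level sizes are strings
        "memory": str(spec.get("memory", 512)),
        "taskRoleArn": spec["task_role_arn"],
        "executionRoleArn": spec["execution_role_arn"],
        "containerDefinitions": [app, log_router],
    }
```

Generating every task definition from one function like this is what keeps the sidecar, role wiring, and logging identical across services.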
Migration Sequence
The migration followed a deliberate sequence designed to validate each layer before depending on it:
1. Infrastructure provisioning
Spun up isolated AWS accounts for each deployment environment (dev, staging, prod-ca, prod-us). Translated Docker Swarm service definitions into ECS task definitions and Terraform modules. Each account got the full stack independently - no shared-tenancy shortcuts.
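The translation step might look something like the sketch below, which maps one Compose-style Swarm service entry onto the matching ECS fields. It assumes environment is a mapping, resource limits use M or G suffixes, and secrets are exposed under a hypothetical SECRET_ prefix; it does not check that the memory value is a valid pairing for the chosen Fargate CPU size.

```python
import re
from typing import Mapping

# Task-level CPU sizes accepted by Fargate, in CPU units (1024 = 1 vCPU).
# Larger sizes exist but are omitted from this sketch.
FARGATE_CPU_UNITS = (256, 512, 1024, 2048, 4096)


def swarm_cpus_to_fargate(cpus: str) -> int:
    """Round a Swarm CPU limit such as "0.5" up to a Fargate CPU size."""
    wanted = float(cpus) * 1024
    for units in FARGATE_CPU_UNITS:
        if units >= wanted:
            return units
    raise ValueError(f"CPU limit {cpus} exceeds the sizes in this sketch")


def swarm_memory_to_mib(memory: str) -> int:
    """Convert a limit such as "512M" or "1G" to MiB (only M and G handled)."""
    match = re.fullmatch(r"(\d+)\s*([mg])b?", memory.strip().lower())
    if not match:
        raise ValueError(f"unsupported memory limit: {memory!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return amount * (1024 if unit == "g" else 1)


def translate_service(name: str, service: Mapping,
                      secret_arns: Mapping[str, str]) -> dict:
    """Map one Swarm service entry to a partial ECS task definition."""
    limits = service.get("deploy", {}).get("resources", {}).get("limits", {})
    return {
        "family": name,
        "cpu": str(swarm_cpus_to_fargate(limits.get("cpus", "0.25"))),
        "memory": str(swarm_memory_to_mib(limits.get("memory", "512M"))),
        "containerDefinitions": [{
            "name": name,
            "image": service["image"],
            "environment": [
                {"name": key, "value": str(value)}
                for key, value in sorted(service.get("environment", {}).items())
            ],
            # Each Swarm secret name must have a known ARN in the new account.
            "secrets": [
                {"name": f"SECRET_{secret.upper()}", "valueFrom": secret_arns[secret]}
                for secret in service.get("secrets", [])
            ],
        }],
    }
```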
2. Data migration
Imported all live data into the new accounts: PostgreSQL database exports, S3 bucket replication with bulk sync. The goal was a complete mirror of production state in the new environment so QA could test against real data volumes and edge cases, not synthetic datasets.
3. Parallel validation
Deployed the full platform under a separate testing domain. QA hammered it - functional testing, load testing, regression suites. Bugs surfaced and got fixed without any production risk. This phase ran for weeks until QA certified the platform ready.
4. Replication bridge
Once QA signed off, I set up PostgreSQL logical replication from MedStack to AWS. Production continued serving from MedStack while every write replicated into the AWS databases in near-real-time. This gave us a hot standby with current data.
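For readers unfamiliar with logical replication, the bridge comes down to a publication on the source database and a subscription on the target. The sketch below only builds the SQL statements; the object names and connection string are placeholders, and it assumes the source already runs with wal_level = logical. Schema changes are not replicated, so the target schema has to exist before the subscription starts.

```python
import re

_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def _check_identifier(name: str) -> str:
    """Allow only simple lowercase identifiers to avoid quoting issues."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def publication_sql(publication: str) -> str:
    """Statement to run on the source (publisher) database."""
    return f"CREATE PUBLICATION {_check_identifier(publication)} FOR ALL TABLES;"


def subscription_sql(subscription: str, publication: str, conninfo: str) -> str:
    """Statement to run on the target (subscriber) database."""
    escaped = conninfo.replace("'", "''")  # escape quotes inside the literal
    return (
        f"CREATE SUBSCRIPTION {_check_identifier(subscription)} "
        f"CONNECTION '{escaped}' "
        f"PUBLICATION {_check_identifier(publication)};"
    )


if __name__ == "__main__":
    # Placeholder names and host; real values come from the environment.
    print(publication_sql("aws_bridge"))
    print(subscription_sql("aws_bridge_sub", "aws_bridge",
                           "host=source.example.com dbname=app user=replicator"))
```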
5. Cutover (planned maintenance window)
The actual cutover was a coordinated sequence:
- Pointed DNS to a maintenance page
- Stopped all MedStack services (writes cease)
- Waited for replication lag to reach zero (databases fully consistent)
- Terminated replication, promoted AWS databases to primary
- Reconfigured the AWS deployment on the production domain
- QA validated through the blue-stack endpoint (bypasses the maintenance page)
- Pointed production DNS to the AWS ALB
- Verified end-user traffic flowing correctly
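The "wait for replication lag to reach zero" step can be sketched as a polling loop over two log sequence numbers (LSNs). One plausible source for them is pg_current_wal_lsn() on the publisher and confirmed_flush_lsn from pg_replication_slots; the read_lsns callback, timeout values, and function names here are assumptions, not the tooling actually used.

```python
import time
from typing import Callable, Tuple


def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as "16/B374D848" to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def replication_lag_bytes(publisher_lsn: str, subscriber_lsn: str) -> int:
    """Bytes of WAL the subscriber has not yet confirmed."""
    return max(0, lsn_to_int(publisher_lsn) - lsn_to_int(subscriber_lsn))


def wait_for_zero_lag(read_lsns: Callable[[], Tuple[str, str]],
                      timeout_s: float = 600.0,
                      poll_s: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until the subscriber has caught up; False if the timeout passes.

    `read_lsns` returns (publisher_lsn, subscriber_lsn) and is expected to
    query the databases; it is injected so the loop can be tested offline.
    """
    deadline = clock() + timeout_s
    while True:
        publisher_lsn, subscriber_lsn = read_lsns()
        if replication_lag_bytes(publisher_lsn, subscriber_lsn) == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

Because writes have already stopped by this point in the sequence, the publisher position should stop advancing, which is what lets the loop settle at zero.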
The maintenance window was planned and communicated in advance, and the actual service interruption was short: most of it was spent waiting for replication to drain and for QA to sign off.
Rollback strategy:
If anything went wrong during cutover, the plan was simple: point DNS back to MedStack and restart the services. Because MedStack was stopped before any end-user traffic reached AWS, AWS would hold no production writes that MedStack lacked, so rolling back at any point before the final DNS switch meant zero data loss. We'd take the lessons learned and attempt again. The low-risk rollback is what made the team comfortable committing to the cutover window.
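One way to picture the cutover-with-rollback plan is as an ordered list of steps with a single rollback action that fires on the first failure. This is a hypothetical sketch of that control flow, not the runbook that was actually used; the step names and callables are placeholders.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None]]


def run_cutover(steps: List[Step],
                rollback: Callable[[], None]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order; on the first failure, roll back and stop.

    Returns (names of completed steps, name of the failed step or None).
    """
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception:
            # Rollback here means: point DNS back to the old platform and
            # restart its services. That is safe only while no end-user
            # traffic has reached the new platform.
            rollback()
            return completed, name
        completed.append(name)
    return completed, None
```

Keeping the production DNS switch as the last step is what makes a single rollback action sufficient for every step before it.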
Impact & Scale
- 40+ containers migrated from Docker Swarm to ECS Fargate with zero data loss
- Compatibility shim eliminated the need for application-level migration work - developers didn't touch a line of code
- Static IAM credentials eliminated entirely, replaced with task-level IAM roles
- Fluent Bit sidecars provide unified observability across the entire fleet
- Service Connect mesh replaced manual service discovery configuration with automatic TLS and health-aware routing
- Platform operates across 4 isolated AWS accounts with independent deployment lifecycles
For the MedStack deployment automation that managed the platform pre-migration, see DevOps Automation & Tooling.