Container Orchestration & Platform Migration

Migrated 40+ containers from Docker Swarm to AWS ECS Fargate with zero data loss and zero application code changes
Built a secrets compatibility shim that let services run on both platforms simultaneously during the transition, eliminating developer migration work
Used PostgreSQL logical replication as a hot standby bridge, enabling a cutover with only a brief planned maintenance window and a fast rollback path
Replaced all static IAM credentials with per-task IAM roles, eliminating long-lived secrets from the platform entirely
Designed a deliberate migration sequence (provision, replicate, validate in parallel, cutover) that de-risked the move for a regulated healthcare platform

The Challenge

A regulated healthcare platform running 40+ containers on a managed Docker Swarm PaaS (MedStack) needed to migrate to AWS ECS Fargate. MedStack couldn't support a FedRAMP authorization path, and its SOC 2 certification story was unreliable - the platform needed its own compliance boundary to pursue federal contracts. The platform served active therapy sessions across North America, so extended downtime wasn't an option. The migration had to be incremental, maintain compliance continuity, and impose zero burden on application development teams - they shouldn't need to touch their code to move between orchestrators.

Approach & Role

I designed and executed the full migration strategy end-to-end. The key constraints: zero downtime for end users, zero code changes required from application developers, and continuous compliance throughout the transition. The platform couldn't go dark for a weekend while we figured things out - this is healthcare, and people are in active therapy sessions.

Architecture & Patterns

Secrets compatibility shim (the enabler):

The first problem was that Docker Swarm and ECS Fargate handle secrets completely differently. Swarm mounts them as files at /run/secrets/{name}; Fargate injects them as environment variables. Every service in the fleet read its secrets from disk.

Rather than ask 4+ development teams to refactor their secret-loading code, I wrote a shell shim that runs at container startup: it detects which platform it's on, and if it's Fargate, creates /run/secrets/ files from environment variables. The application never knows the difference. This meant services could run on both platforms simultaneously during the transition period - critical for incremental migration.
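A minimal sketch of the shim's logic, written in Python for illustration only (the real shim is a shell script). Two details are assumptions, since the text above does not state them: platform detection via the AWS_EXECUTION_ENV variable that ECS sets on Fargate tasks, and a hypothetical SECRET_ prefix for choosing which variables become files.

```python
import os
from pathlib import Path

# Hypothetical convention: only variables with this prefix become secret files.
SECRET_PREFIX = "SECRET_"


def on_fargate(env):
    """Return True when the environment looks like an ECS Fargate task."""
    # Assumption: detect the platform from AWS_EXECUTION_ENV.
    return env.get("AWS_EXECUTION_ENV") == "AWS_ECS_FARGATE"


def materialize_secrets(env, secrets_dir):
    """Write each prefixed variable to <secrets_dir>/<name>; return the names written."""
    if not on_fargate(env):
        return []  # On Swarm the orchestrator has already mounted the files.
    target = Path(secrets_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for key, value in sorted(env.items()):
        if not key.startswith(SECRET_PREFIX):
            continue
        name = key[len(SECRET_PREFIX):].lower()
        path = target / name
        path.write_text(value)
        path.chmod(0o400)  # Owner read-only, similar to a mounted secret.
        written.append(name)
    return written


if __name__ == "__main__":
    # Run once at container startup, before handing off to the application.
    materialize_secrets(dict(os.environ), "/run/secrets")
```

With a variable such as SECRET_DB_PASSWORD set on Fargate, the application would find /run/secrets/db_password exactly where Swarm would have mounted it.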

IAM credential elimination:

On MedStack, services used IAM users with static access keys for AWS API calls. Moving to Fargate meant we could assign IAM roles directly to tasks - no more long-lived credentials sitting in secret stores. Each service got a least-privilege role scoped to exactly what it needed.
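To illustrate what a per-task role involves, here is a sketch of the two policy documents behind one: a trust policy that lets ECS tasks assume the role, and a narrowly scoped permissions policy. The bucket name, prefix, and the choice of s3:GetObject are hypothetical examples, not the platform's actual policies.

```python
import json


def task_trust_policy():
    """Trust policy that allows ECS tasks to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def read_only_prefix_policy(bucket, prefix):
    """Permissions policy limited to reading objects under one bucket prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
        }],
    }


if __name__ == "__main__":
    # Hypothetical bucket and prefix, for illustration only.
    print(json.dumps(task_trust_policy(), indent=2))
    print(json.dumps(read_only_prefix_policy("example-reports", "exports"), indent=2))
```

The task picks up short-lived credentials for this role at runtime, so nothing needs to be stored or rotated by hand.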

ECS Fargate deployment architecture:

YAML template-driven service definitions for consistent deployments across all workloads
Fluent Bit sidecars on every task for centralized log aggregation (FireLens to CloudWatch + S3)
Fargate Spot capacity providers for cost optimization on non-critical workloads
Service Connect mesh for inter-service communication with automatic TLS and service discovery
Individual Dockerfiles per service (FedRAMP requirement - no shared base image with multiple CMD entrypoints)
Per-service IAM roles with least-privilege policies
ECR repository per service with lifecycle policies
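The list above can be sketched as a small template function that produces an ECS task definition with a FireLens log router sidecar, plus a capacity provider strategy that puts non-critical workloads on Fargate Spot. This is an illustration under stated assumptions, not the platform's actual templates: the real definitions are YAML-driven, the sidecar image tag and log group naming are assumed, and only the CloudWatch output is shown (the S3 output would need additional Fluent Bit configuration).

```python
# Assumption: AWS's published Fluent Bit image; the tag actually used is not stated.
LOG_ROUTER_IMAGE = "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable"


def render_task_definition(name, image, task_role_arn, region, cpu="256", memory="512"):
    """Build a Fargate task definition dict with a FireLens sidecar."""
    return {
        "family": name,
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",
        "cpu": cpu,
        "memory": memory,
        "taskRoleArn": task_role_arn,  # per-service least-privilege role
        "containerDefinitions": [
            {
                "name": "log-router",
                "image": LOG_ROUTER_IMAGE,
                "essential": True,
                "firelensConfiguration": {"type": "fluentbit"},
            },
            {
                "name": name,
                "image": image,
                "essential": True,
                # Route the application's stdout/stderr through the sidecar.
                "logConfiguration": {
                    "logDriver": "awsfirelens",
                    "options": {
                        "Name": "cloudwatch_logs",
                        "region": region,
                        "log_group_name": f"/ecs/{name}",  # hypothetical naming
                        "log_stream_prefix": f"{name}-",
                        "auto_create_group": "true",
                    },
                },
            },
        ],
    }


def capacity_strategy(critical):
    """Use regular Fargate for critical services and Fargate Spot otherwise."""
    provider = "FARGATE" if critical else "FARGATE_SPOT"
    return [{"capacityProvider": provider, "weight": 1}]
```

Generating every service from one function like this is what keeps the sidecar, role, and logging settings consistent across the fleet.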

Migration Sequence

The migration followed a deliberate sequence designed to validate each layer before depending on it:

1. Infrastructure provisioning

Spun up isolated AWS accounts for each deployment environment (dev, staging, prod-ca, prod-us). Translated Docker Swarm service definitions into ECS task definitions and Terraform modules. Each account got the full stack independently - no shared-tenancy shortcuts.
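The translation step can be illustrated with a rough sketch that maps one compose-style Swarm service entry onto an ECS task definition. The input shape and the defaults are assumptions. The sketch handles only a few fields (image, environment, ports, resource limits) and does not check that the resulting CPU and memory values form a combination Fargate accepts.

```python
def _memory_mib(limit):
    """Convert a compose-style memory limit such as '512M' or '1G' to MiB."""
    units = {"M": 1, "G": 1024}
    return int(float(limit[:-1]) * units[limit[-1].upper()])


def swarm_to_ecs(name, service):
    """Translate a compose-style service dict into an ECS task definition dict."""
    limits = service.get("deploy", {}).get("resources", {}).get("limits", {})
    env = service.get("environment", {})
    # "8080:80" publishes host port 8080 to container port 80; keep the container side.
    ports = [int(str(p).split(":")[-1]) for p in service.get("ports", [])]
    return {
        "family": name,
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",
        # ECS expresses CPU in units where 1024 equals one vCPU.
        "cpu": str(int(float(limits.get("cpus", "0.25")) * 1024)),
        "memory": str(_memory_mib(limits.get("memory", "512M"))),
        "containerDefinitions": [{
            "name": name,
            "image": service["image"],
            "essential": True,
            "environment": [
                {"name": k, "value": str(v)} for k, v in sorted(env.items())
            ],
            "portMappings": [
                {"containerPort": p, "protocol": "tcp"} for p in ports
            ],
        }],
    }
```

Swarm secrets are deliberately left out here, because the compatibility shim covers them at container startup.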

2. Data migration

Imported all live data into the new accounts using PostgreSQL database exports and S3 bucket replication with a bulk sync. The goal was a complete mirror of production state in the new environment so QA could test against real data volumes and edge cases, not synthetic datasets.
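One way to check that a mirror is complete is to compare inventories from both sides. The sketch below is an assumption about how such a check could look, not a description of the verification actually used. It takes two maps of object key (or table name) to checksum (or row count) and reports what is missing or different in the target.

```python
def diff_inventories(source, target):
    """Compare {key: checksum} maps and report gaps in the target."""
    missing = sorted(k for k in source if k not in target)
    mismatched = sorted(
        k for k in source if k in target and source[k] != target[k]
    )
    return {"missing": missing, "mismatched": mismatched}


def is_complete_mirror(source, target):
    """True when every source entry exists in the target with the same value."""
    report = diff_inventories(source, target)
    return not report["missing"] and not report["mismatched"]
```

For S3, the values could come from object listings on each side. ETags are not reliable content hashes for multipart uploads, so object size or an explicit checksum may be a safer comparison value.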

3. Parallel validation

Deployed the full platform under a separate testing domain. QA hammered it - functional testing, load testing, regression suites. Bugs surfaced and got fixed without any production risk. This phase ran for weeks until QA certified the platform ready.

4. Replication bridge

Once QA signed off, I set up PostgreSQL logical replication from MedStack to AWS. Production continued serving from MedStack while every write replicated into the AWS databases in near real time. This gave us a hot standby with current data.
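For reference, logical replication is configured with a publication on the source database and a subscription on the target. The sketch below only generates the two SQL statements. The publication and subscription names, the connection string, and the use of FOR ALL TABLES are hypothetical, and the identifiers are assumed to be trusted values rather than user input.

```python
def publication_sql(publication):
    """Statement to run on the source (publisher) database."""
    return f"CREATE PUBLICATION {publication} FOR ALL TABLES;"


def subscription_sql(subscription, publication, conninfo):
    """Statement to run on the target (subscriber) database."""
    escaped = conninfo.replace("'", "''")  # escape quotes inside the SQL literal
    return (
        f"CREATE SUBSCRIPTION {subscription} "
        f"CONNECTION '{escaped}' "
        f"PUBLICATION {publication};"
    )


if __name__ == "__main__":
    # Hypothetical names and connection details, for illustration only.
    print(publication_sql("migration_pub"))
    print(subscription_sql(
        "migration_sub",
        "migration_pub",
        "host=source.example.internal dbname=app user=replicator",
    ))
```

The publisher needs wal_level set to logical, and the target schema must already exist, because logical replication does not carry DDL or sequence values.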

5. Cutover (brief planned maintenance window)

The actual cutover was a coordinated sequence:

Pointed DNS to a maintenance page
Stopped all MedStack services (writes cease)
Waited for replication lag to reach zero (databases fully consistent)
Terminated replication, promoted AWS databases to primary
Reconfigured the AWS deployment on the production domain
QA validated through the blue-stack endpoint (bypasses the maintenance page)
Pointed production DNS to the AWS ALB
Verified end-user traffic flowing correctly
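The "wait for replication lag to reach zero" step in the sequence above can be sketched as a polling loop. The query is one common way to measure lag in bytes on the publisher. The slot name, the psycopg-style %s placeholder, the timeout, and the poll interval are assumptions, and the lag reader is injected so the loop itself does not depend on a database driver.

```python
import time

# Run on the publisher: bytes of WAL the subscriber has not yet confirmed.
LAG_QUERY = """
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_name = %s;
"""


def wait_for_drain(read_lag_bytes, timeout_s=600.0, poll_s=2.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until lag reaches zero; return False if the timeout passes first."""
    deadline = clock() + timeout_s
    while True:
        if read_lag_bytes() == 0:
            return True  # subscriber has confirmed everything the publisher wrote
        if clock() >= deadline:
            return False  # caller decides whether to keep waiting or roll back
        sleep(poll_s)
```

In practice read_lag_bytes would execute LAG_QUERY against the source database. Because application writes have already stopped at this point, the value should fall steadily to zero, though background activity on the publisher can still generate small amounts of WAL.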

The maintenance window was planned and communicated, but the actual service interruption was minimal - most of the time was waiting for replication to drain and QA to give the thumbs up.

Rollback strategy:

If anything went wrong during cutover, the plan was simple: point DNS back to MedStack and restart the services. Since we stopped MedStack before any traffic hit AWS, no production data would exist in AWS that wasn't already in MedStack, so a rollback at any point before production DNS moved to AWS meant zero data loss. We'd take the lessons learned and attempt again. The low-risk rollback is what made the team comfortable committing to the cutover window.
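The rollback rule can be expressed as a small step runner: execute the cutover steps in order, and on the first failure call a single rollback action and stop. This is an illustrative sketch only. The step names and callables are hypothetical stand-ins for the manual, coordinated actions described above.

```python
def run_cutover(steps, rollback):
    """Run (name, action) pairs in order; roll back and stop on the first failure."""
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Rollback here means: point DNS back to MedStack and restart its services.
            rollback()
            return {"ok": False, "failed": name,
                    "completed": completed, "error": str(exc)}
        completed.append(name)
    return {"ok": True, "failed": None, "completed": completed, "error": None}


if __name__ == "__main__":
    def qa_rejects():
        raise RuntimeError("QA found a regression")

    result = run_cutover(
        [("maintenance page", lambda: None),
         ("stop MedStack", lambda: None),
         ("QA validation", qa_rejects)],
        rollback=lambda: print("rolling back to MedStack"),
    )
    print(result["failed"], result["completed"])
```

A single rollback action is enough here only because every step before the final DNS switch leaves MedStack's data untouched and authoritative.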

Impact & Scale

40+ containers migrated from Docker Swarm to ECS Fargate with zero data loss
Compatibility shim eliminated the need for application-level migration work - developers didn't touch a line of code
Static IAM credentials eliminated entirely, replaced with task-level IAM roles
Fluent Bit sidecars provide unified observability across the entire fleet
Service Connect mesh replaced manual service discovery configuration with automatic TLS and health-aware routing
Platform operates across 4 isolated AWS accounts with independent deployment lifecycles

For the MedStack deployment automation that managed the platform pre-migration, see DevOps Automation & Tooling.