Container Orchestration & Platform Migration

The Challenge

A regulated healthcare platform running 40+ containers on MedStack, a managed Docker Swarm PaaS, needed to migrate to AWS ECS Fargate. MedStack couldn't support a FedRAMP authorization path, and its SOC 2 certification story was unreliable - the platform needed its own compliance boundary to pursue federal contracts. The platform served active therapy sessions across North America, so downtime wasn't an option. The migration had to be incremental, maintain compliance continuity, and impose zero burden on application development teams: they shouldn't need to touch their code to move between orchestrators.

Approach & Role

I designed and executed the full migration strategy end-to-end. The key constraints: zero downtime for end users, zero code changes required from application developers, and continuous compliance throughout the transition. The platform couldn't go dark for a weekend while we figured things out - this is healthcare, and people are in active therapy sessions.

Architecture & Patterns

Secrets compatibility shim (the enabler):

The first problem was that Docker Swarm and ECS Fargate handle secrets completely differently. Swarm mounts them as files at /run/secrets/{name}, while Fargate injects them as environment variables. Every service in the fleet read secrets from disk.

Rather than ask 4+ development teams to refactor their secret-loading code, I wrote a shell shim that runs at container startup: it detects which platform it's on, and if it's Fargate, creates /run/secrets/ files from environment variables. The application never knows the difference. This meant services could run on both platforms simultaneously during the transition period - critical for incremental migration.
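The shim logic can be sketched as follows. This is an illustration in Python, not the shell script that actually ran in the containers. It assumes Fargate is detected through the ECS_CONTAINER_METADATA_URI_V4 variable and that secrets arrive under a hypothetical SECRET_ prefix (SECRET_DB_PASSWORD becomes /run/secrets/db_password); neither detail comes from the original shim.

```python
import os
from pathlib import Path

# Assumption: ECS on Fargate sets this variable for the task metadata endpoint;
# a Swarm container would not have it.
FARGATE_MARKER = "ECS_CONTAINER_METADATA_URI_V4"
# Hypothetical naming convention: SECRET_DB_PASSWORD -> <secrets_dir>/db_password
SECRET_PREFIX = "SECRET_"


def on_fargate(env):
    """Return True when the environment looks like an ECS Fargate task."""
    return bool(env.get(FARGATE_MARKER))


def materialize_secrets(env, secrets_dir="/run/secrets"):
    """Write prefixed environment variables as files, Swarm-style.

    Returns the list of file paths written (empty when not on Fargate).
    """
    if not on_fargate(env):
        return []  # On Swarm the files are already mounted; nothing to do.

    target = Path(secrets_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for key, value in sorted(env.items()):
        if not key.startswith(SECRET_PREFIX):
            continue
        name = key[len(SECRET_PREFIX):].lower()
        if not name:
            continue  # A bare "SECRET_" variable has no file name to map to.
        path = target / name
        # Create the file readable and writable by the owner only.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "w") as handle:
            handle.write(value)
        written.append(str(path))
    return written
```

An entrypoint would call materialize_secrets(dict(os.environ)) and then start the real application process, which keeps reading /run/secrets/ exactly as it did on Swarm.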

IAM credential elimination:

On MedStack, services used IAM users with static access keys for AWS API calls. Moving to Fargate meant we could assign IAM roles directly to tasks - no more long-lived credentials sitting in secret stores. Each service got a least-privilege role scoped to exactly what it needed.
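As a rough illustration of what a task role involves, the sketch below builds the two JSON documents such a role needs: a trust policy that lets ECS tasks assume the role, and a narrowly scoped permissions policy. The single-bucket, read-only permission and the helper names are hypothetical examples, not the platform's actual policies.

```python
def task_trust_policy():
    """Trust policy allowing ECS tasks to assume the role (no access keys involved)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ecs-tasks.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def read_only_bucket_policy(bucket):
    """Hypothetical least-privilege policy: read objects from one bucket only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }
```

With a role like this attached to the task, the AWS SDK inside the container picks up short-lived credentials automatically, so no key ever needs to be stored or rotated by hand.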

ECS Fargate deployment architecture:

Migration Sequence

The migration followed a deliberate sequence designed to validate each layer before depending on it:

1. Infrastructure provisioning

Spun up isolated AWS accounts for each deployment environment (dev, staging, prod-ca, prod-us). Translated Docker Swarm service definitions into ECS task definitions and Terraform modules. Each account got the full stack independently - no shared-tenancy shortcuts.
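The translation step might look something like the sketch below, which maps a simplified Swarm/Compose-style service entry to the dictionary shape ECS expects for a Fargate task definition. It assumes the service's environment is a plain dict and its ports are bare container ports; the SECRET_ naming, the default cpu/memory sizes, and the function name are illustrative assumptions rather than the real Terraform modules.

```python
def swarm_service_to_task_definition(name, service, secret_arns, cpu="256", memory="512"):
    """Map a simplified Swarm service dict to an ECS Fargate task definition dict.

    secret_arns maps each Swarm secret name to the ARN of the stored secret.
    """
    container = {
        "name": name,
        "image": service["image"],
        "essential": True,
        "environment": [
            {"name": key, "value": str(value)}
            for key, value in sorted(service.get("environment", {}).items())
        ],
        # Swarm secrets become ECS secrets, which are injected as environment
        # variables. The SECRET_ prefix is a hypothetical convention that a
        # startup shim could use to write them back under /run/secrets/.
        "secrets": [
            {"name": f"SECRET_{secret.upper()}", "valueFrom": secret_arns[secret]}
            for secret in service.get("secrets", [])
        ],
        "portMappings": [
            {"containerPort": int(port), "protocol": "tcp"}
            for port in service.get("ports", [])
        ],
    }
    return {
        "family": name,
        "networkMode": "awsvpc",  # Fargate tasks use awsvpc networking
        "requiresCompatibilities": ["FARGATE"],
        "cpu": cpu,
        "memory": memory,
        "containerDefinitions": [container],
    }
```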

2. Data migration

Imported all live data into the new accounts: PostgreSQL database exports were restored into the new databases, and S3 buckets were replicated with a bulk sync. The goal was a complete mirror of production state in the new environment so QA could test against real data volumes and edge cases, not synthetic datasets.
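One simple way to gain confidence that an import produced a complete mirror is to compare per-table row counts between the two sides. The source does not say how the mirror was verified, so the helper below is only an assumed example of such a check; the counts themselves would come from queries against each database.

```python
def compare_row_counts(source_counts, target_counts):
    """Return {table: (source, target)} for every table whose counts differ.

    A table missing on one side is reported with None for that side.
    """
    mismatches = {}
    for table in sorted(set(source_counts) | set(target_counts)):
        src = source_counts.get(table)
        dst = target_counts.get(table)
        if src != dst:
            mismatches[table] = (src, dst)
    return mismatches
```

An empty result means the counts match; it does not prove the rows are identical, so a check like this complements QA testing rather than replacing it.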

3. Parallel validation

Deployed the full platform under a separate testing domain. QA hammered it - functional testing, load testing, regression suites. Bugs surfaced and got fixed without any production risk. This phase ran for weeks until QA certified the platform ready.

4. Replication bridge

Once QA signed off, I set up PostgreSQL logical replication from MedStack to AWS. Production continued serving from MedStack while every write replicated into the AWS databases in near-real-time. This gave us a hot standby with current data.
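PostgreSQL logical replication is configured with a publication on the source database and a subscription on the target. The sketch below generates the two statements; the publication and subscription names and the connection string are placeholders, and publishing all tables is an assumption about scope. The publisher must run with wal_level set to logical, and the table schemas must already exist on the subscriber, because logical replication does not copy DDL.

```python
import re

_IDENT = re.compile(r"[a-z_][a-z0-9_]*")


def _check_ident(name):
    """Allow only simple lowercase identifiers so they are safe to interpolate."""
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def publication_sql(publication):
    """Statement to run on the source (publisher) database."""
    return f"CREATE PUBLICATION {_check_ident(publication)} FOR ALL TABLES;"


def subscription_sql(subscription, publication, conninfo):
    """Statement to run on the target (subscriber) database."""
    escaped = conninfo.replace("'", "''")  # escape quotes inside the SQL literal
    return (
        f"CREATE SUBSCRIPTION {_check_ident(subscription)} "
        f"CONNECTION '{escaped}' "
        f"PUBLICATION {_check_ident(publication)};"
    )
```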

5. Cutover (planned maintenance window)

The actual cutover was a coordinated sequence:

  1. Pointed DNS to a maintenance page
  2. Stopped all MedStack services (writes cease)
  3. Waited for replication lag to reach zero (databases fully consistent)
  4. Terminated replication, promoted AWS databases to primary
  5. Reconfigured the AWS deployment on the production domain
  6. QA validated through the blue-stack endpoint (bypasses the maintenance page)
  7. Pointed production DNS to the AWS ALB
  8. Verified end-user traffic flowing correctly

The maintenance window was planned and communicated, but the actual service interruption was minimal - most of the time was waiting for replication to drain and QA to give the thumbs up.
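The "wait for replication lag to reach zero" step can be sketched as a polling loop over write-ahead-log positions. PostgreSQL reports these as LSNs in the form 16/B374D848: for example, pg_current_wal_lsn() on the publisher, and confirmed_flush_lsn in pg_replication_slots for the subscriber's slot. The function names, polling interval, and timeout below are assumptions for illustration, and read_lsns stands in for whatever query code fetches the two values.

```python
import time


def parse_lsn(text):
    """Convert a PostgreSQL LSN such as '16/B374D848' to an integer byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(source_lsn, confirmed_lsn):
    """Bytes of WAL the subscriber has not yet confirmed (never negative)."""
    return max(0, parse_lsn(source_lsn) - parse_lsn(confirmed_lsn))


def wait_for_zero_lag(read_lsns, poll_seconds=1.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll until lag is zero; return True on success, False on timeout.

    read_lsns is a caller-supplied function returning (source_lsn, confirmed_lsn).
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        source, confirmed = read_lsns()
        if lag_bytes(source, confirmed) == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_seconds)
```

Because writes have already stopped on the source by this point in the sequence, the source position stays fixed and the loop only has to wait for the subscriber to catch up.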

Rollback strategy:

If anything went wrong during cutover, the plan was simple: point DNS back to MedStack and restart the services. Since we stopped MedStack before any traffic hit AWS, no new data would exist in AWS that wasn't already in MedStack - zero data loss on rollback. We'd take the lessons learned and attempt again. The low-risk rollback is what made the team comfortable committing to the cutover window.
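The cutover-with-rollback idea can be expressed as a small runner that executes ordered steps and, on the first failure, runs the rollback actions instead of continuing. This is a hypothetical sketch of the control flow only: the step labels and callables are placeholders, and nothing here reflects how the cutover was actually automated or whether it was run by hand.

```python
def run_cutover(steps, rollback_actions):
    """Run cutover steps in order; on the first failure, run the rollback actions.

    steps and rollback_actions are lists of (label, callable) pairs.
    Returns (succeeded, log), where log lists what was executed in order.
    """
    log = []
    for label, action in steps:
        try:
            action()
            log.append(label)
        except Exception as error:
            log.append(f"FAILED {label}: {error}")
            # Rollback here mirrors the plan above: DNS back, services restarted.
            for rollback_label, rollback_action in rollback_actions:
                rollback_action()
                log.append(f"rollback: {rollback_label}")
            return False, log
    return True, log
```

Keeping the rollback list short is the point: the fewer actions needed to return to the old platform, the less risk there is in committing to the window.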

Impact & Scale

For the MedStack deployment automation that managed the platform pre-migration, see DevOps Automation & Tooling.