Data Analytics & Observability

Built an ETL pipeline replicating 25+ production tables to analytics, with systematic PII removal (SHA-256 hashing) before data leaves the compliance boundary
Implemented schema drift detection catching upstream column changes immediately rather than silently losing data
Deployed Fluent Bit sidecars on all 40+ ECS tasks with dual routing to CloudWatch (real-time) and S3/Parquet (long-term archival)
Self-hosted privacy-focused web analytics (Matomo) eliminating third-party tracking scripts and external data processors
Connected Parquet + Glue catalog enabling ad-hoc SQL queries via Athena/Superset without touching production databases

The Challenge

The platform generates data across 40+ containers, and the business needed analytics capabilities without compromising patient privacy. Analysts can't access production data directly, and PII has to be systematically removed before any data leaves the production boundary. On top of that, I needed an observability stack covering infrastructure health, application logs, and business metrics, all within compliance constraints.

Approach & Role

I built the data pipeline infrastructure, observability stack, and analytics tooling from scratch. The core design principle: PII never leaves the production boundary unprotected. Every record flowing to analytics environments passes through obfuscation transforms. The pipeline also detects schema drift, so when upstream services add or change columns, the change is caught immediately rather than silently losing data.
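As an illustration of the obfuscation step, here is a minimal sketch of a per-record SHA-256 transform. It is not the production Glue job: the column names in PII_COLUMNS, the helper names, and the trim/lowercase normalization before hashing are all assumptions made for the example.

```python
import hashlib

# Hypothetical PII column names; the real job's column list is not shown here.
PII_COLUMNS = {"name", "email", "phone"}


def hash_value(value):
    """Return the SHA-256 hex digest of a value, or None if it is missing."""
    if value is None:
        return None
    # Assumption: trim and lowercase first so equal values hash identically.
    normalized = str(value).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def obfuscate_record(record, pii_columns=PII_COLUMNS):
    """Return a copy of the record with PII columns replaced by hashes."""
    return {
        key: hash_value(val) if key in pii_columns else val
        for key, val in record.items()
    }
```

Because the same input always produces the same digest, hashed columns can still be used as join or grouping keys on the analytics side without exposing the original values.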

Architecture & Patterns

ETL pipeline (AWS Glue):

792-line Glue ETL job capturing 25+ tables from production
PII obfuscation: SHA-256 hashing of names, email addresses, phone numbers, and other customer information before analytics ingestion
Schema drift detection with incremental sync (only changed records transferred)
CRM data extraction with cursor-based pagination and rate limiting
Output: Parquet format in S3 with Glue catalog for Athena/Superset query access
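The schema drift check in the list above can be sketched as a comparison between the columns the job expects and the columns it actually observes upstream. This is a simplified illustration, not the Glue job's code: the DriftReport class, the detect_schema_drift function, and the representation of a schema as a {column: type} mapping are assumptions for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DriftReport:
    """Differences between an expected and an observed table schema."""
    added: dict = field(default_factory=dict)    # columns new upstream
    removed: dict = field(default_factory=dict)  # columns that disappeared
    changed: dict = field(default_factory=dict)  # column -> (old type, new type)

    @property
    def has_drift(self):
        return bool(self.added or self.removed or self.changed)


def detect_schema_drift(expected, observed):
    """Compare two {column_name: type_name} mappings and report differences."""
    report = DriftReport()
    for col, typ in observed.items():
        if col not in expected:
            report.added[col] = typ
        elif expected[col] != typ:
            report.changed[col] = (expected[col], typ)
    for col, typ in expected.items():
        if col not in observed:
            report.removed[col] = typ
    return report
```

A job built this way could fail the run or raise an alert whenever has_drift is true, instead of continuing to write only the columns it already knew about.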

Observability stack:

Zabbix for infrastructure monitoring (20+ custom templates covering ECS tasks, PostgreSQL, Redis, network)
Fluent Bit sidecars (Chainguard containers) on every ECS task for log forwarding
FireLens routing: CloudWatch for real-time viewing, Kinesis Firehose to S3 (Parquet) for long-term retention
CloudWatch alarms for service-level health indicators

Privacy-focused web analytics:

Self-hosted Matomo deployment with custom RDS SSL configuration
First-party data collection, no third-party tracking scripts
Privacy-compliant visitor analytics without cross-site tracking

Operational dashboards:

Apache Superset connected to the analytics data lake
Session outcomes visualization with PDF export capabilities
NLP analysis dashboards showing clinical summarization metrics

Impact & Scale

25+ production tables replicated to analytics with systematic PII removal
Schema drift detection prevents silent data loss during service evolution
Infrastructure monitoring covers the full stack with 20+ Zabbix templates
Centralized logging from 40+ ECS tasks with both real-time and archival paths
Privacy-compliant web analytics with first-party data collection only and no external data processors
Parquet + Glue catalog enables ad-hoc SQL queries without touching production databases