Data Analytics & Observability
- Built an ETL pipeline replicating 25+ production tables to analytics with systematic PII removal (SHA-256 hashing) before data leaves the compliance boundary
- Implemented schema drift detection that catches upstream column changes immediately instead of silently losing data
- Deployed FluentBit sidecars on all 40+ ECS tasks with dual routing to CloudWatch (real-time) and S3/Parquet (long-term archival)
- Self-hosted privacy-focused web analytics (Matomo) eliminating third-party tracking scripts and external data processors
- Registered the Parquet output in the Glue catalog, enabling ad-hoc SQL queries via Athena/Superset without touching production databases
The Challenge
The platform generates data across 40+ containers, and the business needed analytics capabilities without compromising patient privacy. Analysts can't access production data directly, and PII has to be systematically removed before any data leaves the production boundary. On top of that, I needed an observability stack covering infrastructure health, application logs, and business metrics, all within compliance constraints.
Approach & Role
I built the data pipeline infrastructure, observability stack, and analytics tooling from scratch. The core design principle: PII never leaves the production boundary unprotected. Every record flowing to analytics environments passes through obfuscation transforms. The pipeline also detects schema drift so that when upstream services add or change columns, we catch it immediately rather than silently losing data.
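To illustrate the obfuscation transform described above, here is a minimal sketch in plain Python, not the production Glue job. The column names in `PII_COLUMNS`, the `salt` parameter, and the lowercase/trim normalization are assumptions made for the example.

```python
import hashlib

# Hypothetical PII column names; the real job's column list is not shown here.
PII_COLUMNS = {"name", "email", "phone"}


def hash_value(value: str, salt: str = "") -> str:
    """Return the SHA-256 hex digest of a trimmed, lowercased value."""
    normalized = value.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


def obfuscate_record(record: dict, pii_columns=PII_COLUMNS, salt: str = "") -> dict:
    """Return a copy of the record with non-null PII fields replaced by hashes."""
    out = {}
    for key, value in record.items():
        if key in pii_columns and value is not None:
            out[key] = hash_value(str(value), salt)
        else:
            out[key] = value  # non-PII fields and nulls pass through unchanged
    return out
```

Because the hash is deterministic, the same email produces the same digest in every table, so joins on the hashed column still work in the analytics environment.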
Architecture & Patterns
ETL pipeline (AWS Glue):
- 792-line Glue ETL job capturing 25+ tables from production
- PII obfuscation: SHA-256 hashing of names, emails, phone numbers, and other customer information before analytics ingestion
- Schema drift detection with incremental sync (only changed records transferred)
- CRM data extraction with cursor-based pagination and rate limiting
- Output: Parquet format in S3 with Glue catalog for Athena/Superset query access
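The schema drift check in the list above can be sketched as a comparison between the column set recorded on the previous run and the one observed now. This is an illustrative sketch under assumptions: the `{column: type}` mapping shape and the function names are hypothetical, and the real job may store and compare schemas differently.

```python
def detect_schema_drift(expected: dict, observed: dict) -> dict:
    """Compare two {column_name: type_name} mappings and report differences."""
    added = sorted(set(observed) - set(expected))
    removed = sorted(set(expected) - set(observed))
    changed = sorted(
        col
        for col in set(expected) & set(observed)
        if expected[col] != observed[col]
    )
    return {"added": added, "removed": removed, "changed": changed}


def has_drift(report: dict) -> bool:
    """True if any column was added, removed, or changed type."""
    return any(report.values())
```

A run would typically call `detect_schema_drift` once per table before loading it and raise an alert when `has_drift` returns true, so new upstream columns are noticed rather than dropped silently.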
Observability stack:
- Zabbix for infrastructure monitoring (20+ custom templates covering ECS tasks, PostgreSQL, Redis, network)
- FluentBit sidecars (Chainguard containers) on every ECS task for log forwarding
- FireLens routing: CloudWatch for real-time viewing, Kinesis Firehose to S3 (Parquet) for long-term retention
- CloudWatch alarms for service-level health indicators
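The dual routing above can be sketched as a Fluent Bit config with two outputs plus an ECS container definition pair that enables FireLens. This is a hedged sketch, not the deployed configuration: the container names, image names, config path, and stream prefix are placeholders, and Parquet conversion is assumed to be configured on the Firehose delivery stream itself.

```python
def render_fluent_bit_outputs(region, log_group, delivery_stream, stream_prefix="app-"):
    """Render two [OUTPUT] sections so every log record goes to both sinks."""
    return "\n".join([
        "[OUTPUT]",
        "    Name cloudwatch_logs",      # real-time viewing
        "    Match *",
        f"    region {region}",
        f"    log_group_name {log_group}",
        f"    log_stream_prefix {stream_prefix}",
        "",
        "[OUTPUT]",
        "    Name kinesis_firehose",     # long-term archival via Firehose to S3
        "    Match *",
        f"    region {region}",
        f"    delivery_stream {delivery_stream}",
        "",
    ])


def container_definitions(app_image, router_image, config_path):
    """Build ECS container definitions: a FireLens log router plus the app."""
    return [
        {
            "name": "log_router",  # placeholder name
            "image": router_image,
            "essential": True,
            "firelensConfiguration": {
                "type": "fluentbit",
                "options": {
                    "config-file-type": "file",
                    "config-file-value": config_path,
                },
            },
        },
        {
            "name": "app",  # placeholder name
            "image": app_image,
            "essential": True,
            # Send this container's stdout/stderr to the FireLens router.
            "logConfiguration": {"logDriver": "awsfirelens"},
        },
    ]
```

Keeping both outputs in one sidecar config means each task ships its logs once, and the split into real-time and archival paths happens inside Fluent Bit.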
Privacy-focused web analytics:
- Self-hosted Matomo deployment with custom RDS SSL configuration
- First-party data collection, no third-party tracking scripts
- Privacy-compliant visitor analytics without cross-site tracking
Operational dashboards:
- Apache Superset connected to the analytics data lake
- Session outcomes visualization with PDF export capabilities
- NLP analysis dashboards showing clinical summarization metrics
Impact & Scale
- 25+ production tables replicated to analytics with systematic PII removal
- Schema drift detection prevents silent data loss during service evolution
- Infrastructure monitoring covers the full stack with 20+ Zabbix templates
- Centralized logging from 40+ ECS tasks with both real-time and archival paths
- Privacy-compliant analytics with no external data processors; all collection is first-party
- Parquet + Glue catalog enables ad-hoc SQL queries without touching production databases
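As an illustration of the ad-hoc query path, here is a minimal sketch of running a query against the Glue-cataloged Parquet data through Athena. It assumes a boto3 Athena client is passed in (for example `boto3.client("athena")`); the database name, SQL, and S3 results location are placeholders supplied by the caller.

```python
import time

TERMINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def run_athena_query(client, sql, database, output_location,
                     poll_seconds=1.0, max_polls=60):
    """Start an Athena query and poll until it reaches a terminal state."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    for _ in range(max_polls):
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return query_id, state
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not finish in time")
```

The query reads only the Parquet files in S3, so analysts never open a connection to the production databases.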