





ETL Data Processing Pipeline
Software Architect
Researched & designed Kafka integration
Led initiative to enable horizontal scaling
As Principal Engineer, I led a software engineering team that built an extract, transform, and load (ETL) pipeline for cancer research. The pipeline loads clinical and molecular data from real-time feeds. Researchers and physicians can then use web apps to interact with the live data set.
Over 1.6 million patients receive care from physicians at 106 clinics and labs across the US, and we process electronic health records (EHRs) for each visit. These records include clinical lab results, genomic profiles, and physician notes. Healthcare data can be messy, incomplete, error-prone, and unstructured, especially when it originates from such a variety of sources. So we developed configurable microservices that clean and transform the data into a harmonized dataset. These services use NLP to analyze unstructured fields and ML classification models to fill gaps in patient records.
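To give a feel for what a configurable transform step looks like, here is a minimal sketch. The schema, field mappings, and classifier are hypothetical stand-ins, not the production code:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical harmonized schema; the real one has many more fields.
@dataclass
class PatientRecord:
    patient_id: str
    lab_result: Optional[float]
    diagnosis_code: Optional[str]
    physician_note: str

# Per-source configuration maps raw field names onto the harmonized schema.
FIELD_MAP = {
    "clinic_a": {"id": "patient_id", "result": "lab_result", "note": "physician_note"},
}

def classify_diagnosis(note: str) -> Optional[str]:
    """Stand-in for the ML classifier that fills gaps from unstructured text."""
    return "C50.9" if "breast" in note.lower() else None

def transform(raw: dict, source: str) -> PatientRecord:
    mapping = FIELD_MAP[source]
    record = PatientRecord(
        patient_id=str(raw[mapping["id"]]).strip(),
        lab_result=float(raw[mapping["result"]]) if raw.get(mapping["result"]) else None,
        diagnosis_code=raw.get("diagnosis_code"),
        physician_note=raw.get(mapping["note"], ""),
    )
    # Fill a missing structured field from the unstructured note.
    if record.diagnosis_code is None:
        record.diagnosis_code = classify_diagnosis(record.physician_note)
    return record
```

Because the field mapping lives in configuration rather than code, onboarding a new clinic or lab is a config change rather than a new service.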
We used Kafka to create a high-throughput, low-latency data stream for handling real-time data feeds. The containerized microservices act as producers and consumers, transforming data as it moves through the cluster. Kubernetes orchestrates the containers, scaling horizontally to maintain real-time delivery as demand grows. This lets the system quickly incorporate new data sources to expand its dataset, and the same scaling supports full daily replays of historical data.
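Each service follows a consume-transform-produce pattern. A rough sketch using the kafka-python client, with hypothetical topic names and a placeholder transform:

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# Hypothetical topic names; each microservice reads raw events and emits harmonized ones.
RAW_TOPIC = "ehr.raw.clinic_a"
CLEAN_TOPIC = "ehr.harmonized"

def harmonize(raw: dict) -> dict:
    """Placeholder for the configurable clean/transform step described above."""
    return {"patient_id": str(raw.get("id", "")).strip(), "source": "clinic_a", **raw}

consumer = KafkaConsumer(
    RAW_TOPIC,
    bootstrap_servers="kafka:9092",
    group_id="harmonizer",  # one consumer group per service; adding replicas scales out consumption
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    producer.send(CLEAN_TOPIC, value=harmonize(message.value))
```

Because consumers in the same group share topic partitions, Kubernetes can scale a service horizontally simply by adding replicas, and Kafka rebalances the load across them.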
The ETL pipeline has built-in monitoring and reporting. Data analysts track error dashboards to discover novel data formats, then update live file parser configurations and replay the affected data. Through this process, they continuously improve data capture and quality in the system.
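The reporting path can be as simple as routing parse failures, along with the raw payload, to a dead-letter topic that feeds the dashboards. A sketch under that assumption, with hypothetical topic and field names:

```python
import json
import time
from typing import Callable, Optional
from kafka import KafkaProducer

ERROR_TOPIC = "ehr.parse_errors"  # hypothetical dead-letter topic behind the error dashboards

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def parse_with_report(raw_line: str, parser: Callable[[str], dict], source: str) -> Optional[dict]:
    """Run a configured parser; on failure, publish the raw payload so analysts
    can spot novel formats, update the parser config, and replay the data."""
    try:
        return parser(raw_line)
    except Exception as exc:
        producer.send(ERROR_TOPIC, value={
            "source": source,
            "raw": raw_line,
            "error": str(exc),
            "ts": time.time(),
        })
        return None
```

Keeping the original payload with each error record is what makes the replay step possible once a parser configuration has been fixed.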