Neuroimaging & Healthcare Data Lakehouse Platform

Process daily healthcare data through a governed, scalable medallion architecture supporting clinical research and analytics.

Overview

Built and maintained the data platform as the data engineer responsible for multiple healthcare data sources at scale: OpenNeuro neuroimaging repositories, datasets requiring formal data use agreements and PII handling, and hospitalization data via bulk ingestion. Designed and implemented the multi-source ingestion framework, contributed to Kubernetes infrastructure, and owned data governance across a medallion architecture serving analytics and machine learning workloads.

The platform processes neuroimaging data (NIfTI/BIDS from OpenNeuro, DICOM from research collaborators), federal health records via FHIR R4 resources with ICD-10 coded diagnoses, and discharge/EMR datasets through Bronze/Silver/Gold layers, enabling data scientists and researchers to access clean, governed datasets for precision medicine applications.

Architecture

┌─────────────────────────────────────────────────────────────┐
│               Data Sources (10-20+ TB Daily)                │
│      OpenNeuro | Research Repos | Hospitalization Data      │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│             Ingestion Layer (15+ Airflow DAGs)              │
│ Python | PySpark | Format-Specific Connectors | Validation  │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│          BRONZE: Raw Data (Apache Iceberg + MinIO)          │
│     DICOM, NIfTI/BIDS, FHIR R4 (JSON), CSV → S3 storage     │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│        SILVER: Cleansed & Validated (dbt + PySpark)         │
│       Data quality checks | Schema enforcement | BIDS       │
│               conformance | De-identification               │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│              GOLD: Analytics-Ready (Snowflake)              │
│                Dimensional Models | BI Layer                │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│              Governance & Observability Layer               │
│  OpenMetadata (Catalog) | Prometheus/Grafana (Monitoring)   │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                      Consumption Layer                      │
│      Data Science | Research Analytics | Clinical Apps      │
└─────────────────────────────────────────────────────────────┘

Technology Stack

| Layer | Technologies |
|---|---|
| Ingestion | Python, PySpark, Airflow (15+ DAGs) |
| Processing | PySpark, dbt, SQL |
| Storage | Apache Iceberg, MinIO (S3-compatible), PostgreSQL, Snowflake |
| Orchestration | Apache Airflow (PostgreSQL backend), Kubernetes CronJobs |
| Infrastructure | Kubernetes (on-prem), Helm, Docker |
| Governance | OpenMetadata (data catalog & lineage) |
| Observability | Prometheus, Grafana (99.5% uptime monitoring) |
| Data Formats | DICOM, NIfTI/BIDS, FHIR R4 (JSON), CSV, Parquet |

Implementation Details

Multi-Source Ingestion Framework: Built a reusable Python framework with format-specific connectors for four source categories:

- OpenNeuro neuroimaging datasets (NIfTI/BIDS)
- DICOM imaging from research collaborators
- Federal health records delivered as FHIR R4 (JSON) resources
- Hospitalization/discharge and EMR datasets (CSV bulk extracts)

Airflow DAGs run on a schedule, scanning each repository for newly available files, ingesting them into the Bronze layer, and triggering downstream processing through Silver and Gold to produce analysis-ready output. Integrated the MinIO S3 provider and Snowflake connectors for end-to-end data flow across 15+ source repositories.
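
A minimal sketch of the new-file detection at the heart of these DAGs (function and manifest names are illustrative, not the production API; the real pipeline reads the ingested-file manifest from Iceberg table metadata):

```python
import tempfile
from pathlib import Path

def find_new_files(source_dir: Path, ingested: set) -> list:
    """Return files present in the source repository but not yet ingested.

    `ingested` holds relative paths already recorded in a Bronze-layer
    manifest (hypothetical stand-in for Iceberg metadata).
    """
    return sorted(
        p for p in source_dir.rglob("*")
        if p.is_file() and str(p.relative_to(source_dir)) not in ingested
    )

# Demo: two files in a fake repository, one already ingested
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ds001").mkdir()
    (root / "ds001" / "participants.tsv").write_text("participant_id\n")
    (root / "ds001" / "dataset_description.json").write_text("{}\n")
    new = find_new_files(root, ingested={"ds001/participants.tsv"})
    new_names = [p.name for p in new]
```

Each DAG run diffs the repository listing against the manifest, so re-runs and retries only pick up files that have not yet landed in Bronze.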

Kubernetes Infrastructure: Architected the data lakehouse on on-prem Kubernetes from scratch, deploying MinIO via Helm for S3-compatible object storage with raw/refined/curated bucket architecture. Configured NodePort and ClusterIP services, persistent storage volumes, and RBAC policies for multi-tenant access. Airflow runs as KubernetesExecutor with a PostgreSQL metadata backend, dynamically spawning pods for each task. Deployed Prometheus/Grafana monitoring stack achieving 99.5% uptime. Used Helm charts for standardized deployments across dev/staging/production environments.

Medallion Architecture with Apache Iceberg: Bronze layer captures raw data (DICOM, NIfTI, FHIR resources, CSV) with immutable snapshots and time-travel capabilities. Silver layer applies cleansing, conforming, schema evolution, de-identification, and data quality checks via dbt tests. Gold layer delivers Kimball star schema dimensional models in Snowflake as the consumption layer for BI tools — PySpark jobs transform Silver-layer Iceberg tables and write optimized Parquet files that Snowflake consumes via external tables backed by MinIO.

MinIO Object Storage & Replication: Configured bucket-level replication to a secondary on-prem node for disaster recovery. Bronze layer objects replicate automatically, ensuring raw data availability even during hardware failures.

Data Governance: OpenMetadata catalogs all datasets with automatic lineage tracking from Airflow DAGs. Implemented role-based access controls and data classification (PII, PHI) for HIPAA compliance.
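
Column-level classification can be sketched as a simple keyword matcher (keyword lists and function names here are illustrative; production tagging is applied through the OpenMetadata API):

```python
# Hypothetical keyword-based classifier assigning PII/PHI tags to column
# names before they are pushed to the OpenMetadata catalog.
PII_KEYWORDS = {"name", "address", "ssn", "email", "phone"}
PHI_KEYWORDS = {"diagnosis", "icd10", "mrn", "admission_date", "discharge_date"}

def classify_column(column):
    """Return 'PII', 'PHI', or None for an unclassified column name."""
    col = column.lower()
    if any(k in col for k in PII_KEYWORDS):
        return "PII"
    if any(k in col for k in PHI_KEYWORDS):
        return "PHI"
    return None

tags = {c: classify_column(c)
        for c in ["patient_name", "icd10_code", "length_of_stay"]}
```

Tagged columns then inherit masking and access policies from their classification, so a new dataset only needs its columns cataloged to fall under the same HIPAA controls.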

OpenNeuro & BIDS Data Ingestion

A significant portion of the platform’s neuroimaging data originates from OpenNeuro, a free and open platform for sharing BIDS-compliant neuroimaging datasets. BIDS (Brain Imaging Data Structure) is a standardized format for organizing and describing neuroimaging and behavioral data — it defines folder hierarchies, file naming conventions, and metadata schemas that make datasets machine-readable and reproducible across research teams.

Understanding and implementing BIDS compliance was critical to the ingestion pipeline. Each OpenNeuro dataset follows the BIDS directory structure:

dataset/
├── dataset_description.json        # Dataset-level metadata
├── participants.tsv                # Subject demographics
├── sub-001/
│   └── ses-01/
│       ├── anat/                   # Structural MRI (T1w, T2w)
│       │   ├── sub-001_ses-01_T1w.nii.gz
│       │   └── sub-001_ses-01_T1w.json
│       ├── func/                   # Functional MRI (BOLD)
│       │   ├── sub-001_ses-01_task-rest_bold.nii.gz
│       │   └── sub-001_ses-01_task-rest_events.tsv
│       └── eeg/                    # EEG recordings
│           ├── sub-001_ses-01_task-MotorImagery_eeg.edf
│           └── sub-001_ses-01_task-MotorImagery_events.tsv
├── sub-002/
│   └── ...
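
Downstream jobs depend on these naming conventions: BIDS filenames encode key-value "entities" (subject, session, task, run). A simplified sketch of the entity parsing used when cataloging files (the regex approach here is illustrative; mne_bids offers similar functionality via get_entities_from_fname):

```python
import re

def parse_bids_entities(filename):
    """Extract BIDS entities (sub, ses, task, run) and suffix from a filename."""
    # Strip extensions such as .nii.gz, .json, or .tsv
    stem = re.sub(r"\.[a-zA-Z]+(\.gz)?$", "", filename)
    parts = stem.split("_")
    entities = {}
    for part in parts[:-1]:          # all segments except the trailing suffix
        key, _, value = part.partition("-")
        entities[key] = value
    entities["suffix"] = parts[-1]   # e.g. 'bold', 'T1w', 'eeg'
    return entities

info = parse_bids_entities("sub-001_ses-01_task-rest_bold.nii.gz")
```

Because every compliant dataset uses the same entity grammar, one parser serves all OpenNeuro ingests without per-dataset configuration.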

BIDS Conversion Pipeline

For datasets not yet in BIDS format, built conversion pipelines using MNE-BIDS to standardize raw EEG/neuroimaging data before Bronze layer ingestion. The following shows the group study conversion pattern used for multi-subject EEG datasets:

from pathlib import Path

import mne
from mne.datasets import eegbci
from mne_bids import (
    BIDSPath,
    get_anonymization_daysback,
    write_raw_bids,
    print_dir_tree,
    make_report,
)
from mne_bids.stats import count_events

subject_ids = [1, 2]

# Map EEG Motor Movement/Imagery Dataset runs to sequential run numbers
runs = [4, 8, 12]  # Runs #1, #2, #3 of the motor imagery task
run_map = dict(zip(runs, range(1, 4)))

# Fetch raw data for each subject
for subject_id in subject_ids:
    eegbci.load_data(subjects=subject_id, runs=runs, update_path=True)

bids_root = Path("/data/bronze/eegmmidb_bids")

# Collect raw objects for anonymization alignment across subjects
raw_list, bids_list = [], []
for subject_id in subject_ids:
    for run in runs:
        raw_fname = eegbci.load_data(subjects=subject_id, runs=run)[0]
        raw = mne.io.read_raw_edf(raw_fname)
        raw.info["line_freq"] = 60  # power line frequency (60 Hz)
        raw_list.append(raw)
        bids_list.append(
            BIDSPath(
                subject=f"{subject_id:03}",
                session="01",
                task="MotorImagery",
                run=f"{run_map[run]:02}",
                root=bids_root,
            )
        )

# Compute a consistent anonymization offset across all subjects
daysback_min, _ = get_anonymization_daysback(raw_list)

# Write each recording to BIDS format with anonymization
for raw, bids_path in zip(raw_list, bids_list):
    write_raw_bids(
        raw,
        bids_path,
        anonymize=dict(daysback=daysback_min + 2117),
        overwrite=True,
    )

Key aspects of this conversion pattern for production use:

- A single daysback offset, computed via get_anonymization_daysback() across every recording, keeps shifted dates consistent dataset-wide.
- Zero-padded subject (sub-001) and run (run-01) labels follow BIDS naming conventions.
- Source run numbers (4, 8, 12) are remapped to sequential BIDS run numbers (01-03).
- raw.info["line_freq"] is set explicitly because BIDS requires the power line frequency in EEG sidecar metadata.
- overwrite=True makes the conversion idempotent when an Airflow task retries.

Post-Conversion Validation & Cataloging

After BIDS conversion, the pipeline validates and catalogs the output:

# Verify BIDS directory structure
print_dir_tree(bids_root)
# Aggregate event statistics across the full dataset
event_counts = count_events(bids_root)
# Generate dataset report for OpenMetadata cataloging
dataset_report = make_report(root=bids_root)

The count_events() output feeds directly into OpenMetadata as dataset-level metadata, giving researchers visibility into what tasks, subjects, and event types are available without manually inspecting files.
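
As a sketch, the aggregation behind that metadata push might look like the following (the payload shape and function are illustrative; count_events actually returns a pandas DataFrame, simplified here to a plain dict):

```python
# Hypothetical transformation of per-(subject, task) event counts into a
# flat metadata payload for OpenMetadata custom properties.
def summarize_events(event_counts):
    """Aggregate event counts into dataset-level catalog metadata.

    `event_counts` maps (subject, task) -> {event_type: count}, a
    simplified stand-in for the count_events() output.
    """
    subjects, tasks, totals = set(), set(), {}
    for (subject, task), counts in event_counts.items():
        subjects.add(subject)
        tasks.add(task)
        for event, n in counts.items():
            totals[event] = totals.get(event, 0) + n
    return {
        "n_subjects": len(subjects),
        "tasks": sorted(tasks),
        "event_totals": totals,
    }

summary = summarize_events({
    ("001", "MotorImagery"): {"rest": 15, "imagine/feet": 15},
    ("002", "MotorImagery"): {"rest": 15, "imagine/feet": 14},
})
```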

Dataset Anonymization

Healthcare neuroimaging data carries significant re-identification risk — recording dates, subject metadata in file headers, and session timestamps can all leak PII. The pipeline implements a dedicated anonymization step that strips identifying information from BIDS datasets before they are promoted beyond the Bronze layer.

The anonymization process operates at two levels:

1. Header-level anonymization — Recording dates and subject identifiers embedded in EEG/MRI file headers are shifted or removed during write_raw_bids() using the same daysback offset pattern shown in the conversion pipeline above. The get_anonymization_daysback() function computes a safe range ensuring all shifted dates remain valid, and a single offset is applied consistently across every file in the dataset.

2. Sidecar metadata scrubbing — JSON sidecar files (_eeg.json, _T1w.json) are scanned for fields like InstitutionName, InstitutionAddress, and DeviceSerialNumber. These are either removed or replaced with generic values before promotion to Silver.
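
A minimal sketch of that scrubbing step (the field list comes from the policy above; the REDACTED placeholder and function names are illustrative):

```python
import json
from pathlib import Path

# Fields named in the scrubbing policy; replacement values are placeholders.
SCRUB_FIELDS = {
    "InstitutionName": "REDACTED",
    "InstitutionAddress": "REDACTED",
    "DeviceSerialNumber": "REDACTED",
}

def scrub_sidecar(sidecar):
    """Return a copy of a BIDS JSON sidecar with identifying fields replaced."""
    return {k: SCRUB_FIELDS.get(k, v) for k, v in sidecar.items()}

def scrub_sidecar_file(path: Path):
    """Scrub a sidecar file in place (e.g. sub-001_ses-01_eeg.json)."""
    data = json.loads(path.read_text())
    path.write_text(json.dumps(scrub_sidecar(data), indent=2))

clean = scrub_sidecar({
    "TaskName": "MotorImagery",
    "InstitutionName": "Example Hospital",
    "PowerLineFrequency": 60,
})
```

Non-identifying fields pass through untouched, so clinically relevant metadata such as PowerLineFrequency survives the scrub.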

Why this matters for production:

- De-identification happens before data is promoted beyond Bronze, so Silver and Gold consumers never see raw recording dates or institutional identifiers.
- Applying one dataset-wide offset preserves intervals between sessions, keeping longitudinal and age-at-scan analyses valid.
- Because anonymization is baked in at conversion time, every downstream copy (Silver, Gold, Snowflake) inherits it automatically.

Edge cases handled:

- Datasets arriving without required metadata files are quarantined rather than partially anonymized and promoted.
- Offsets are drawn from the safe range returned by get_anonymization_daysback(), so no shifted date falls outside the representable range.
- Dates are shifted uniformly across all files rather than independently, preserving age-at-scan and inter-session relationships.

Use Cases Enabled

Organizing neuroimaging data into BIDS format at the Bronze layer unlocked several downstream use cases:

- Reproducible multi-dataset analysis: standardized directory and file naming lets Silver-layer jobs process any OpenNeuro dataset without per-dataset code.
- Catalog-driven discovery: subject, task, and event metadata surface in OpenMetadata, so researchers can identify cohorts without manually inspecting files.
- Governed machine learning: data science teams consume consistent, de-identified inputs for precision medicine applications.

Data Characteristics

| Metric | Value |
|---|---|
| Volume | 20+ TB processed |
| Frequency | Batch (daily) |
| Format | DICOM, NIfTI/BIDS, FHIR R4 (JSON), CSV → Parquet/Iceberg |
| DAGs | 15+ Airflow orchestration pipelines |
| Growth | New datasets and updates to existing ones as clinical trials launch and OpenNeuro releases land |

Lessons Learned

Kubernetes learning curve: Initial deployment complexity required investing in Helm charts and GitOps patterns, but paid dividends in standardization and reproducibility across dev/staging/production environments.

Open-source storage layer: Chose Apache Iceberg over proprietary alternatives for better schema evolution support, time-travel capabilities critical for healthcare data auditing, and vendor independence. Keeping the stack open-source (Iceberg + MinIO + Airflow) avoided lock-in and gave full control over the storage layer.

OpenMetadata integration: Automated lineage tracking from Airflow required custom operators but eliminated manual documentation overhead and improved data discoverability across the team.

Snowflake as consumption layer: Loading transformed data from Iceberg/MinIO into Snowflake via external tables required careful coordination — schema definitions in Snowflake must match Iceberg table evolution, and stage refresh schedules need to align with Gold layer write cadence. The payoff was familiar SQL access for analysts and researchers who didn’t need to interact with the lakehouse directly.

BIDS anonymization trade-offs: Balancing de-identification with data utility required careful design. Shifting dates too aggressively can break age-at-scan calculations critical for pediatric and aging studies; not shifting enough risks re-identification. Using get_anonymization_daysback() across entire datasets struck the right balance for HIPAA compliance without losing clinically relevant temporal relationships.
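
The preserved-relationships claim is easy to verify: a uniform daysback shift leaves inter-session intervals untouched (the dates and offset below are invented for the demo):

```python
from datetime import date, timedelta

daysback = 30000  # a single offset applied dataset-wide, as in the pipeline

session_dates = [date(2021, 3, 1), date(2021, 3, 15), date(2021, 6, 1)]
shifted = [d - timedelta(days=daysback) for d in session_dates]

# Intervals between consecutive sessions are unchanged by a uniform shift
original_intervals = [(b - a).days for a, b in zip(session_dates, session_dates[1:])]
shifted_intervals = [(b - a).days for a, b in zip(shifted, shifted[1:])]
assert original_intervals == shifted_intervals == [14, 78]
```

Shifting each file by an independent offset would break exactly this property, which is why one offset is computed once per dataset.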

Medical data complexity: Each data source has unique validation and compliance requirements: DICOM imaging files need header validation and tag anonymization across dozens of patient-identifying fields, OpenNeuro NIfTI files need BIDS conformance checking, federal health records need structure validation against FHIR R4 resource profiles, and HCUP records need ICD-10-CM/PCS code verification against annual CMS code set releases. Building format-specific connectors with validation upfront significantly reduced downstream data quality issues.

Incident example: Early in development, a DAG ingesting OpenNeuro datasets silently accepted files with missing participants.tsv metadata — downstream joins in the Silver layer produced null demographic columns that propagated into Gold dimensional models before being caught. Added a Bronze-layer validation gate that checks for required BIDS metadata files and rejects incomplete datasets to a quarantine bucket with Slack alerting.
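
The validation gate can be sketched as follows (bucket routing and alerting are simplified to a return value; the production version runs as an Airflow task with an S3 copy to the quarantine bucket and a Slack notification):

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = ["dataset_description.json", "participants.tsv"]

def validate_bids_dataset(dataset_root: Path) -> list:
    """Return the required BIDS metadata files missing from a dataset."""
    return [f for f in REQUIRED_FILES if not (dataset_root / f).exists()]

def gate_dataset(dataset_root: Path) -> str:
    """Route a dataset to Bronze or quarantine based on required metadata."""
    missing = validate_bids_dataset(dataset_root)
    if missing:
        # Production: copy to the quarantine bucket and fire a Slack alert
        return f"quarantine ({', '.join(missing)} missing)"
    return "bronze"

# Demo: a dataset that ships dataset_description.json but no participants.tsv
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dataset_description.json").write_text("{}\n")
    decision = gate_dataset(root)
```

Rejecting at Bronze keeps incomplete demographics from ever reaching Silver joins, which is exactly where the original incident surfaced.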

Future Improvements