Real-Time Stock Market Data Pipeline

Value statement: Stream live stock market data through a Kafka-based medallion architecture for real-time financial analytics and Power BI dashboards.

Overview

A distributed streaming data pipeline that captures real-time stock market data from the Finnhub API. The system ingests live market events via Kafka producers, processes them through a medallion architecture (Bronze/Silver/Gold) using dbt transformations in Snowflake, and serves analytics dashboards through Power BI. The pipeline is orchestrated via Airflow DAGs with automated ingestion and monitoring.

The pipeline handles high-throughput financial data streams, implements schema enforcement and data quality checks, and provides sub-minute latency for trading analytics and market research use cases.

Goals

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Finnhub API                          │
│              Live Stock Market WebSocket/REST               │
└──────────────────────────┬──────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                  Kafka Producers (Python)                   │
│        Real-time event capture with schema validation       │
└──────────────────────────┬──────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                    Apache Kafka Cluster                     │
│           Distributed event streaming with partitioning     │
└──────────────────────────┬──────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                BRONZE: Raw Events (MinIO S3)                │
│           Immutable event store with time partitioning      │
└──────────────────────────┬──────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────┐
│            SILVER: Cleansed Data (dbt + Snowflake)          │
│        Deduplication | Schema enforcement | Validation      │
└──────────────────────────┬──────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────┐
│              GOLD: Analytics Tables (Snowflake)             │
│         Aggregations | Time-series | Market indicators      │
└──────────────────────────┬──────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                     Power BI Dashboards                     │
│         Real-time market analytics and trading insights     │
└─────────────────────────────────────────────────────────────┘

Technology Stack

| Layer          | Technologies                            |
|----------------|-----------------------------------------|
| Data Source    | Finnhub API (WebSocket + REST)          |
| Ingestion      | Python Kafka Producers, Confluent Kafka |
| Streaming      | Apache Kafka (multi-broker cluster)     |
| Storage        | MinIO (S3-compatible object storage)    |
| Processing     | dbt (data transformations), Snowflake   |
| Orchestration  | Apache Airflow (scheduled DAGs)         |
| Visualization  | Power BI (real-time dashboards)         |
| Infrastructure | Docker, Docker Compose                  |

Implementation Details

Kafka Producer Architecture: Built Python-based Kafka producers using the confluent-kafka library to ingest live stock data from the Finnhub WebSocket API. Implemented error handling, retry logic, and schema validation before publishing to Kafka topics. Each producer maintains connection pooling and graceful shutdown handling.
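As a sketch of the validate-then-publish path (function names, the `trades.raw` topic, and the field set are illustrative, not the actual module; the frame layout follows Finnhub's documented trade message, with `s`/`p`/`t`/`v` for symbol, price, timestamp, and volume):

```python
import json

# Fields a trade event must carry before it is allowed into Kafka
# (assumed minimal schema: symbol, price, timestamp in ms, volume).
REQUIRED_FIELDS = {"s", "p", "t", "v"}


def validate_trade(event: dict) -> bool:
    """Schema check applied before an event is published."""
    return REQUIRED_FIELDS.issubset(event)


def publish_trades(producer, message: str, topic: str = "trades.raw") -> int:
    """Parse one Finnhub WebSocket frame and publish each valid trade.

    `producer` is any object exposing produce()/poll(), such as a
    confluent_kafka.Producer in the real pipeline.
    """
    frame = json.loads(message)
    if frame.get("type") != "trade":
        return 0  # ignore pings and subscription acks
    published = 0
    for event in frame.get("data", []):
        if not validate_trade(event):
            continue  # drop malformed events instead of poisoning the topic
        # Keying by symbol keeps all of a stock's trades on one partition
        producer.produce(topic, key=event["s"], value=json.dumps(event))
        published += 1
    producer.poll(0)  # serve delivery callbacks without blocking
    return published
```

Keeping validation in a pure function like `validate_trade` also makes the schema rules unit-testable without a running broker.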

Event Streaming with Kafka: Configured a multi-broker Kafka cluster with topics partitioned by stock symbol for parallel processing. Implemented exactly-once semantics using idempotent producers and transactional consumers. Kafka Connect continuously sinks raw events into the MinIO Bronze layer.
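The exactly-once setup reduces to a handful of librdkafka settings on each side; a minimal sketch, assuming a local broker address and illustrative ids:

```python
# Producer side: idempotent writes plus a transactional id, so a batch of
# sends and its offset commits succeed or fail atomically.
producer_conf = {
    "bootstrap.servers": "localhost:9092",   # assumed local cluster
    "enable.idempotence": True,              # broker-side dedupe of retries
    "transactional.id": "finnhub-ingest-1",  # illustrative id
    "acks": "all",                           # wait for all in-sync replicas
}

# Consumer side: only read events from committed transactions.
consumer_conf = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "bronze-sink",               # illustrative group
    "isolation.level": "read_committed",
    "enable.auto.commit": False,             # commit offsets inside the txn
}
```

With confluent-kafka, the producer config is paired with `init_transactions()` / `begin_transaction()` / `commit_transaction()` calls around each batch.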

Medallion Architecture:

- Bronze: raw JSON events landed in MinIO as an immutable, time-partitioned event store.
- Silver: cleansed data in Snowflake, where dbt models apply deduplication, schema enforcement, and validation.
- Gold: analytics tables in Snowflake with aggregations, time-series rollups, and market indicators.

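The Bronze layer's time partitioning can be sketched as a deterministic object-key scheme; the exact prefix layout below is illustrative, not the pipeline's actual paths:

```python
from datetime import datetime, timezone


def bronze_key(symbol: str, event_ms: int) -> str:
    """Build a time-partitioned MinIO object key for a raw trade event.

    Partitioning by date and hour keeps Bronze immutable and makes
    replays and backfills a matter of listing one prefix.
    """
    ts = datetime.fromtimestamp(event_ms / 1000, tz=timezone.utc)
    return (
        f"bronze/trades/symbol={symbol}/"
        f"dt={ts:%Y-%m-%d}/hour={ts:%H}/{event_ms}.json"
    )
```

Hive-style `key=value` prefixes like these are also understood by most query engines if the Bronze layer ever needs to be read directly.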
Airflow Orchestration: Built Airflow DAGs for:

- Scheduled ingestion of market data snapshots from the Finnhub REST endpoints
- Triggering dbt runs to promote data from Bronze through Silver to Gold
- Pipeline health monitoring and data quality checks

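One possible shape for such a DAG (the dag id, schedule, selectors, and script paths are all illustrative; the `schedule` argument assumes Airflow 2.4+):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="market_data_pipeline",      # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="*/5 * * * *",             # every 5 minutes during market hours
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=1)},
) as dag:
    # Pull a REST snapshot to complement the always-on WebSocket stream
    ingest = BashOperator(
        task_id="ingest_rest_snapshot",
        bash_command="python /opt/pipeline/ingest_rest.py",  # assumed path
    )
    # Promote data through the medallion layers
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --select silver gold",  # assumed selectors
    )
    # Fail the run loudly if data quality tests break
    quality = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test",
    )

    ingest >> transform >> quality
```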
Power BI Integration: Connected Power BI to Snowflake Gold layer using DirectQuery for near-real-time dashboard updates. Dashboards display market trends, portfolio analytics, and trading signals.

Data Characteristics

| Metric          | Value                                   |
|-----------------|-----------------------------------------|
| Event Volume    | ~10K+ events/minute during market hours |
| Latency         | Sub-minute end-to-end (API → Dashboard) |
| Symbols Tracked | 100+ stocks and indices                 |
| Data Format     | JSON events → Parquet/Snowflake tables  |
| Retention       | Bronze: 2 years                         |

Reliability & Edge Cases

Lessons Learned

Kafka partitioning strategy: Initial round-robin partitioning caused uneven load distribution. Switching to symbol-based partitioning improved parallel processing and reduced consumer lag.
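The property that matters can be shown with any stable hash (Kafka's default partitioner actually uses murmur2, so the exact partition numbers below are illustrative): a fixed key always maps to the same partition, which preserves per-symbol ordering and lets consumers scale by partition.

```python
import zlib


def partition_for(symbol: str, num_partitions: int = 6) -> int:
    """Stand-in for Kafka's keyed partitioner using a stable CRC32 hash.

    All events for one symbol land on one partition, so per-symbol
    ordering is preserved while symbols spread across consumers.
    """
    return zlib.crc32(symbol.encode("utf-8")) % num_partitions
```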

MinIO vs S3: Chose MinIO for local development and cost reduction while maintaining S3 API compatibility. Production deployment can seamlessly migrate to AWS S3.

dbt incremental models: Implemented incremental models with merge strategies to handle late-arriving events and out-of-order timestamps common in financial data streams.
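A sketch of what such a model can look like (table, column, and key names are illustrative; on Snowflake, dbt defaults incremental models to the `merge` strategy, so late or replayed events update in place on the unique key instead of duplicating):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='trade_id'
) }}

select
    trade_id,
    symbol,
    price,
    volume,
    event_ts
from {{ ref('bronze_trades') }}

{% if is_incremental() %}
  -- Scan a lookback window rather than a hard cutoff, so late-arriving
  -- and out-of-order events are still picked up and merged on trade_id.
  where event_ts > (select dateadd('hour', -1, max(event_ts)) from {{ this }})
{% endif %}
```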

Power BI DirectQuery limitations: DirectQuery mode has query timeout constraints. Added Snowflake materialized views to pre-aggregate heavy computations for dashboard performance.
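A hedged sketch of such a pre-aggregation (schema and column names are illustrative); Snowflake materialized views support simple grouped aggregates over a single table, which fits a per-minute rollup like this:

```sql
-- Pre-aggregate per-symbol, per-minute stats so DirectQuery dashboards
-- hit a small rollup instead of scanning raw trade rows.
create or replace materialized view gold.trades_1min as
select
    symbol,
    date_trunc('minute', event_ts) as minute_ts,
    count(*)    as trade_count,
    min(price)  as low_price,
    max(price)  as high_price,
    sum(volume) as total_volume
from gold.trades
group by symbol, date_trunc('minute', event_ts);
```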

Future Improvements