Kubernetes for Data Engineers: Why and How

· 7 min read · By Jeff Williams

Six months ago, I didn’t know anything about Kubernetes. Today, I run a production data platform on Kubernetes processing 20TB+ of data with 99.5% uptime. Here’s what I learned and why data engineers should care about K8s.

Why Kubernetes for Data Pipelines?

The Problem: Traditional Data Infrastructure

Traditional data infrastructure has several pain points:

  • Manual Scaling: Add more servers when workload increases
  • Resource Waste: Servers idle during off-peak hours
  • Fragile Deployments: “It works on my machine” syndrome
  • Poor Isolation: One failing job crashes the entire server

The Solution: Container Orchestration

Kubernetes solves these problems by:

  1. Auto-scaling: Spin up resources when needed
  2. Resource Efficiency: Share compute across workloads
  3. Reproducibility: Container images ensure consistency
  4. Isolation: Jobs run in isolated pods

Key Kubernetes Concepts for Data Engineers

1. Pods

A pod is the smallest deployable unit in Kubernetes. Think of it as a wrapper around one or more containers that share networking and storage.

apiVersion: v1
kind: Pod
metadata:
  name: data-processing-job
spec:
  containers:
  - name: etl-worker
    image: my-data-pipeline:v1
    resources:
      requests:
        memory: "4Gi"
        cpu: "2"
      limits:
        memory: "8Gi"
        cpu: "4"

Why it matters: Each pod is isolated. If one crashes, others keep running.

2. Deployments

A deployment manages multiple replicas of your application.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-worker
spec:
  replicas: 3  # Run 3 copies
  selector:
    matchLabels:
      app: airflow
  template:
    metadata:
      labels:
        app: airflow
    spec:
      containers:
      - name: airflow-worker
        image: apache/airflow:2.7.0

Why it matters: Horizontal scaling for parallel data processing.
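
Deployments also give you the hook for auto-scaling. Below is a minimal sketch of a HorizontalPodAutoscaler that scales the worker Deployment between the 2 and 10 replicas mentioned later in this post; the 70% CPU target is an illustrative value, not a recommendation.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: airflow-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: airflow-worker    # the Deployment defined above
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # illustrative target; tune for your workload

Scaling on queue depth, as we describe later, requires custom or external metrics rather than the built-in CPU and memory ones.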

3. Persistent Volumes

Data needs to persist even when pods restart.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-storage
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Ti

Why it matters: Store your data lakehouse files, databases, and intermediate results. Note that ReadWriteMany requires a storage backend that supports it (NFS, EFS, Filestore, and the like); many default storage classes only offer ReadWriteOnce.
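
A claim by itself does nothing until a pod mounts it. A minimal sketch reusing the claim above; the pod name and mount path are arbitrary.

apiVersion: v1
kind: Pod
metadata:
  name: etl-with-storage
spec:
  containers:
  - name: etl-worker
    image: my-data-pipeline:v1
    volumeMounts:
    - name: data
      mountPath: /data             # where the pipeline reads and writes files
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-storage      # the PVC defined above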

4. ConfigMaps and Secrets

Manage configuration and credentials separately from code.

apiVersion: v1
kind: ConfigMap
metadata:
  name: pipeline-config
data:
  database_host: "postgres.default.svc.cluster.local"
  batch_size: "1000"
---
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
data:
  password: cGFzc3dvcmQxMjM=  # base64 encoded

Why it matters: Different configs for dev/staging/prod without changing code.
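
Containers pick these values up through their environment (or as mounted files). A sketch of the env section of a container spec, referencing the ConfigMap and Secret defined above:

# Fragment of a container spec
env:
- name: DATABASE_HOST
  valueFrom:
    configMapKeyRef:
      name: pipeline-config
      key: database_host
- name: DB_PASSWORD
  valueFrom:
    secretKeyRef:
      name: db-credentials
      key: password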

Real-World Example: Apache Airflow on Kubernetes

In production, we deployed Airflow on Kubernetes. Here’s the architecture:

┌─────────────────────────────────────────────┐
│             Kubernetes Cluster              │
│                                             │
│  ┌──────────────┐      ┌──────────────┐     │
│  │   Airflow    │      │   Airflow    │     │
│  │  Scheduler   │      │  Webserver   │     │
│  └──────────────┘      └──────────────┘     │
│                                             │
│  ┌──────────────┐      ┌──────────────┐     │
│  │   Worker 1   │      │   Worker 2   │     │
│  │ (Auto-scale) │      │ (Auto-scale) │     │
│  └──────────────┘      └──────────────┘     │
│                                             │
│  ┌─────────────────────────────────────┐    │
│  │      PostgreSQL (Metadata DB)       │    │
│  └─────────────────────────────────────┘    │
└─────────────────────────────────────────────┘

Benefits We Saw:

Before Kubernetes:

  • Fixed number of workers (5)
  • Peak usage: 60% capacity wasted
  • Deployment: 2 hours of downtime
  • Scaling: Manual, took days

After Kubernetes:

  • Auto-scaling workers (2-10)
  • Resource utilization: 85%
  • Deployment: Rolling updates, zero downtime
  • Scaling: Automatic based on queue depth

Practical Tips for Data Engineers

1. Start with Docker First

Before Kubernetes, master Docker:

# Dockerfile for data pipeline
FROM python:3.10-slim
 
WORKDIR /app
 
COPY requirements.txt .
RUN pip install -r requirements.txt
 
COPY src/ ./src/
 
CMD ["python", "src/pipeline.py"]

Build and test locally:

docker build -t my-pipeline:v1 .
docker run my-pipeline:v1

2. Use Helm Charts

Helm is like a package manager for Kubernetes. Don’t write YAML from scratch.

# Install Airflow using Helm
helm repo add apache-airflow https://airflow.apache.org
helm install airflow apache-airflow/airflow
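
Most charts are customized through a values file rather than by editing templates. A hedged sketch of an override file for the Airflow chart; the key names follow recent versions of the official chart, so verify them against the chart's own values.yaml before relying on them.

# values-override.yaml (illustrative; check keys against your chart version)
executor: CeleryExecutor
workers:
  replicas: 3

Pass it at install time with helm install airflow apache-airflow/airflow -f values-override.yaml.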

3. Resource Requests and Limits

Always set resource constraints:

resources:
  requests:
    memory: "2Gi"   # Guaranteed
    cpu: "1"
  limits:
    memory: "4Gi"   # Maximum allowed
    cpu: "2"

Why: Prevents one job from consuming all cluster resources.

4. Use Namespaces for Environments

Separate dev, staging, and production:

kubectl create namespace dev
kubectl create namespace prod
 
# Deploy to dev
kubectl apply -f pipeline.yaml -n dev
 
# Deploy to prod
kubectl apply -f pipeline.yaml -n prod
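
Namespaces also give you a place to cap what each environment can consume, so a runaway dev job can't starve prod. A sketch of a ResourceQuota for the dev namespace; the numbers are illustrative.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "20"        # total CPU all dev pods may request
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi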

Common Patterns for Data Workloads

Pattern 1: Batch Processing Jobs

Use Kubernetes Jobs for one-time tasks:

apiVersion: batch/v1
kind: Job
metadata:
  name: daily-etl
spec:
  template:
    spec:
      containers:
      - name: etl
        image: my-etl-pipeline:v1
        env:
        - name: EXECUTION_DATE
          value: "2024-11-22"
      restartPolicy: OnFailure

Pattern 2: Scheduled CronJobs

Run pipelines on a schedule:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: hourly-ingestion
spec:
  schedule: "0 * * * *"  # Every hour
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: ingestion
            image: data-ingestion:v1
          restartPolicy: OnFailure

Pattern 3: Streaming Workloads

Deploy long-running streaming jobs:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-consumer
spec:
  replicas: 3
  selector:
    matchLabels:
      app: kafka-consumer
  template:
    metadata:
      labels:
        app: kafka-consumer
    spec:
      containers:
      - name: consumer
        image: kafka-stream-processor:v1

Monitoring Your Data Pipelines

Prometheus for Metrics

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    scrape_configs:
      - job_name: 'data-pipelines'
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_label_app]
          regex: data-pipeline
          action: keep

Grafana for Visualization

Key metrics to track:

  • Pod CPU/Memory usage
  • Pipeline execution times
  • Failed job counts
  • Data processing throughput

In production, we built 25+ custom dashboards tracking:

  • Pipeline health
  • Data quality scores
  • Resource utilization
  • SLA compliance

Cost Considerations

Kubernetes can save money if done right:

Our Cost Savings:

  • Auto-scaling: 40% reduction in idle resources
  • Spot instances: 60% cheaper compute (see the scheduling sketch after this list)
  • Efficient packing: Running 3x more workloads on the same hardware
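
For the spot savings, batch pods have to be steered onto spot capacity. A sketch of the relevant pod spec fields: the node label shown assumes EKS managed spot node groups, and the taint key is whatever your cluster uses, so treat both as placeholders for your own setup.

# Fragment of a pod spec (label and taint names depend on your cloud and cluster)
spec:
  nodeSelector:
    eks.amazonaws.com/capacityType: SPOT   # EKS-specific label; other clouds differ
  tolerations:
  - key: "spot"                            # placeholder taint key on the spot node group
    operator: "Exists"
    effect: "NoSchedule"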

Cost Gotchas:

  • Over-provisioning: Requesting too many resources
  • Always-on dev environments: Shut them down at night
  • Large persistent volumes: Clean up old data

When NOT to Use Kubernetes

Kubernetes adds complexity. Skip it if:

  • You have < 5 data pipelines
  • Your workloads are simple batch jobs
  • You’re a solo developer without ops support
  • Your data fits on one machine

Use simpler alternatives:

  • Docker Compose for local development
  • AWS Batch for simple job scheduling
  • Managed services (Cloud Composer, AWS MWAA)

Getting Started Checklist

  1. ✅ Learn Docker basics
  2. ✅ Run minikube locally
  3. ✅ Deploy a simple application
  4. ✅ Add persistent storage
  5. ✅ Implement monitoring
  6. ✅ Set up CI/CD pipeline
  7. ✅ Configure auto-scaling
  8. ✅ Practice disaster recovery

Key Takeaways

  • Kubernetes provides scalability and reliability for data workloads
  • Start with Docker, then add Kubernetes when complexity justifies it
  • Use Helm charts to avoid writing YAML from scratch
  • Monitor everything: resource usage, job success rates, data quality
  • Auto-scaling saves money and improves performance

Kubernetes has a learning curve, but for production data platforms, the benefits are worth it. Our 99.5% uptime proves it works.


Questions about Kubernetes for data engineering? I learned by doing—happy to share more details about our setup. Find me on LinkedIn.