CI/CD for Data Pipelines — From Git Push to Production
Automate data pipeline deployments with GitHub Actions. Testing strategies, dbt CI, Terraform integration, and rollback patterns.
· projects · 3 minutes
Data pipelines deserve the same CI/CD rigor as application code. A merge to main should trigger tests, build artifacts, and deploy to production without manual intervention. Here’s a practical setup.
What “CI/CD for Data” Looks Like
Continuous Integration: On every pull request, automatically run unit tests for transformation logic, lint SQL and Python code, validate DAG structure (no cycles, no missing dependencies), build Docker images, and run dbt compile or dry-run BigQuery queries to catch syntax errors.
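Unit tests for transformation logic can run warehouse-free when the logic lives in pure functions. A minimal sketch of what such a test might look like (the `normalize_emails` helper is hypothetical):

```python
# Hypothetical transformation: normalize and dedupe customer emails.
def normalize_emails(raw: list[str]) -> list[str]:
    """Lowercase, strip whitespace, and drop duplicates while keeping order."""
    seen: set[str] = set()
    out: list[str] = []
    for email in raw:
        e = email.strip().lower()
        if e and e not in seen:
            seen.add(e)
            out.append(e)
    return out

# The kind of pytest-style check that runs on every pull request:
def test_normalize_emails():
    raw = ["  Alice@Example.com", "alice@example.com", "BOB@example.com", ""]
    assert normalize_emails(raw) == ["alice@example.com", "bob@example.com"]

test_normalize_emails()
```

Because the function takes plain Python values and touches no database, the test runs in milliseconds on any CI runner.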
Continuous Deployment: On merge to main, build and push Docker images to Artifact Registry, deploy updated DAG files to Cloud Composer, run dbt migrations against the target warehouse, and apply infrastructure changes via Terraform.
A GitHub Actions Pipeline
```yaml
name: Data Pipeline CI/CD

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

env:
  GCP_PROJECT: my-project
  REGION: us-central1
  REGISTRY: us-central1-docker.pkg.dev/my-project/pipelines

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements-dev.txt
      - name: Lint
        run: |
          ruff check src/
          ruff format --check src/
      - name: Unit tests
        run: pytest tests/unit/ -v --tb=short
      - name: Validate DAGs
        run: python scripts/validate_dags.py
      - name: dbt compile (dry run)
        run: |
          cd dbt/
          dbt compile --target ci
        env:
          DBT_PROFILES_DIR: ./profiles

  build-and-deploy:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write
    steps:
      - uses: actions/checkout@v4
      - name: Authenticate to GCP
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.SA_EMAIL }}
      - name: Configure Docker
        run: gcloud auth configure-docker ${{ env.REGION }}-docker.pkg.dev
      - name: Build and push image
        run: |
          docker build -t ${{ env.REGISTRY }}/etl:${{ github.sha }} .
          docker push ${{ env.REGISTRY }}/etl:${{ github.sha }}
      - name: Deploy DAGs to Composer
        run: |
          gsutil -m rsync -r -d dags/ \
            gs://${{ secrets.COMPOSER_BUCKET }}/dags/
      - name: Run dbt
        run: |
          cd dbt/
          dbt run --target prod
          dbt test --target prod
        env:
          DBT_PROFILES_DIR: ./profiles
```

Key Practices
DAG validation in CI. Import every DAG file and check for parse errors before they reach Composer. A broken DAG file can take down the entire Airflow scheduler.
```python
import importlib.util
import pathlib
import sys

dag_dir = pathlib.Path("dags")
errors = []
for f in dag_dir.glob("**/*.py"):
    try:
        # Import each DAG file the same way the scheduler would.
        spec = importlib.util.spec_from_file_location(f.stem, f)
        mod = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(mod)
    except Exception as e:
        errors.append(f"{f}: {e}")

if errors:
    print("\n".join(errors))
    sys.exit(1)
print("All DAGs valid.")
```

Workload Identity Federation for auth. Don’t store service account key files in GitHub Secrets. Use Workload Identity Federation so your GitHub Actions runner authenticates to GCP without long-lived credentials.
Image tagging strategy. Tag images with the Git SHA (etl:abc123def) so any running container can be traced back to the exact commit that built it. Optionally add a latest or stable tag for convenience, but the SHA tag is the source of truth.
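Locally, the same tag can be reconstructed from the commit. A sketch (registry path and SHA are placeholders; in the workflow above, `${{ github.sha }}` supplies the value):

```shell
# Placeholder for the commit SHA; locally you would use: git rev-parse --short HEAD
SHA="abc123def"
IMAGE="us-central1-docker.pkg.dev/my-project/pipelines/etl:${SHA}"
echo "$IMAGE"
```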
Database migrations as code. dbt handles SQL model changes. For schema migrations on transactional databases, use tools like Alembic (SQLAlchemy) or Flyway. Never apply schema changes manually.
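With Alembic, a schema change becomes a versioned Python file that CI can apply with `alembic upgrade head`. A sketch of what one migration might look like (table, column, and revision id are illustrative; real ids come from `alembic revision`):

```python
# Illustrative Alembic migration, e.g. versions/a1b2c3d4e5f6_add_loaded_at.py
from alembic import op
import sqlalchemy as sa

revision = "a1b2c3d4e5f6"   # placeholder revision id
down_revision = None        # first migration in the chain

def upgrade():
    # Add an ingestion timestamp to a hypothetical `orders` table.
    op.add_column(
        "orders",
        sa.Column("loaded_at", sa.DateTime(timezone=True), nullable=True),
    )

def downgrade():
    # Every migration should be reversible for rollbacks.
    op.drop_column("orders", "loaded_at")
```

Keeping a `downgrade()` for every change is what makes automated rollbacks possible when a deploy goes wrong.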
Takeaway: Automate everything between git push and production. Validate DAGs, lint code, run tests, build images, and deploy — all triggered by Git events. Manual deployments are a liability.