Grafana Dashboards for Data Platform Health — What to Build First
Build actionable Grafana dashboards for data platforms. Pipeline latency, data freshness, error rates, and cost tracking visualizations.
When you first set up Grafana for a data platform, it’s tempting to build dozens of dashboards. Don’t. Start with three, make them excellent, and expand from there.
Dashboard 1: Pipeline Health Overview
This is the dashboard your on-call engineer opens first. It answers one question: is anything broken right now?
Panels:
- DAG/Pipeline status grid. A table or stat panel showing each pipeline’s last run status (success/failure/running), last completion time, and next scheduled run. Color-code by status: green, red, yellow.
- Data freshness gauges. For each critical serving table, show the age of the most recent record. If your SLA is “data no older than 1 hour,” the gauge should go red at 60 minutes.
- Failure timeline. A time-series panel showing pipeline failures over the past 7 days. Clusters of failures point to systemic issues rather than one-off flakes.
- Active alerts. An embedded alert list showing any currently firing alerts.
```sql
-- Data freshness query (BigQuery → Grafana via BigQuery plugin)
SELECT
  table_name,
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), max_event_ts, MINUTE) AS freshness_minutes
FROM `my_project.monitoring.table_freshness`
ORDER BY freshness_minutes DESC
```

Dashboard 2: Resource Utilization
This dashboard answers: are we right-sized, and are we spending efficiently?
Panels:
- BigQuery slot utilization over time. If you’re on flat-rate pricing, this shows whether you’re under- or over-provisioned. If on-demand, track bytes scanned per day.
- Top queries by cost. A table showing the top 10 most expensive queries in the past 24 hours, with user and bytes billed. This is your cost optimization hit list.
- Dataproc/Spark job resource usage. CPU and memory utilization per job. If your executors consistently use 20% of allocated memory, you’re over-provisioned.
- GCS storage growth. Track bucket sizes over time to catch unbounded growth in raw/staging zones.
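The top-queries-by-cost panel can be fed directly from BigQuery's INFORMATION_SCHEMA jobs views, so no custom logging is needed. A sketch, assuming the default US region (adjust the region qualifier to match your project's location):

```sql
-- Top 10 most expensive queries in the past 24 hours, by bytes billed.
-- Assumes the default US multi-region; change `region-us` to your location.
SELECT
  user_email,
  job_id,
  total_bytes_billed,
  ROUND(total_bytes_billed / POW(2, 40), 2) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND state = 'DONE'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY total_bytes_billed DESC
LIMIT 10
```

Multiply bytes billed by your on-demand rate to get an approximate dollar figure, or join against your billing export for exact costs.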
Dashboard 3: Data Quality Scorecard
This is the dashboard your data team and stakeholders care about: can we trust the data?
Panels:
- dbt test results. Pass/fail/warn counts from the latest dbt run. Break down by model or severity.
- Null rate tracking. Time-series of null percentages for critical columns. A sudden spike in nulls in customer_id means something upstream changed.
- Row count trends. Daily record counts for key tables, with a 7-day moving average and anomaly bands. This catches both drops (missing data) and spikes (duplicates).
- Schema change log. A table showing recent schema modifications detected by your monitoring. Unexpected column additions or type changes are early warnings of upstream contract violations.
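The row-count trend panel with its moving average can be computed in the panel query itself with a window function. A sketch, assuming a hypothetical monitoring.row_counts snapshot table with one row per table per day:

```sql
-- Daily row counts with a 7-day moving average for anomaly bands.
-- Assumes monitoring.row_counts(snapshot_date, table_name, row_count).
SELECT
  snapshot_date,
  table_name,
  row_count,
  AVG(row_count) OVER (
    PARTITION BY table_name
    ORDER BY snapshot_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS row_count_7d_avg
FROM `my_project.monitoring.row_counts`
ORDER BY table_name, snapshot_date
```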
Grafana Best Practices for Data Teams
Use variables for environment switching. Define a dashboard variable for project or environment so the same dashboard works for dev, staging, and production.
Template your queries. If you have 20 pipelines, don’t create 20 hardcoded panels. Use a variable that dynamically populates from a query (SELECT DISTINCT pipeline_name FROM monitoring.runs) and template your panels against it.
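As a sketch of what a templated panel looks like: once a dashboard variable named pipeline is populated from that SELECT DISTINCT query, a single panel query can serve every pipeline (the monitoring.runs column names here are illustrative assumptions):

```sql
-- Run durations for whichever pipeline is selected in the $pipeline variable.
-- Grafana substitutes the variable's value before the query is sent to BigQuery.
SELECT
  run_start AS time,
  run_duration_seconds
FROM `my_project.monitoring.runs`
WHERE pipeline_name = '$pipeline'
ORDER BY run_start
```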
Set meaningful alert thresholds. An alert that fires every day gets ignored. Start with generous thresholds, then tighten them as you learn your baseline. A freshness alert at 2x your normal latency is a good starting point.
Link dashboards to runbooks. Every alert panel should include a link to a runbook or troubleshooting guide. When that 3am page fires, the on-call engineer shouldn’t need to reverse-engineer what to do.
Takeaway: Three dashboards — pipeline health, resource utilization, and data quality — cover 90% of your observability needs. Build these well before expanding. Make them actionable, not decorative.