Designing a Data Lakehouse on GCP with BigLake
Unify your data lake and warehouse with BigLake. Query Parquet and ORC files in Cloud Storage directly from BigQuery with fine-grained access control.
The data lakehouse pattern aims to combine the flexibility of a data lake (store anything in open formats on cheap storage) with the performance and governance of a data warehouse. On GCP, BigLake is the key enabler.
What BigLake Does
BigLake creates a unified access layer over data stored in GCS (and, via BigQuery Omni, other clouds). You define BigLake tables in BigQuery, pointing them at Parquet, ORC, Avro, or Delta Lake files in GCS. These tables are then queryable through BigQuery with the same SQL interface, access controls, and performance optimizations as native BigQuery tables.
The key insight: your data stays in open formats on GCS, but you get BigQuery’s governance and query engine on top.
Why This Matters
Open formats prevent vendor lock-in. Parquet files in GCS can be read by Spark, Presto, DuckDB, or any tool that understands the format. You’re not locked into BigQuery’s internal storage.
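As a concrete sketch: the same Parquet files can be scanned directly by an outside engine. Here is what that might look like with DuckDB's httpfs extension, assuming GCS HMAC (S3-interoperability) credentials are configured; the key values and the gs://my-lake path are placeholders:

```sql
-- DuckDB: read the same Parquet files directly from GCS, no BigQuery involved
INSTALL httpfs;
LOAD httpfs;

-- GCS access in DuckDB uses S3-interoperability HMAC keys (placeholder values)
CREATE SECRET gcs_secret (
    TYPE GCS,
    KEY_ID 'my_hmac_key_id',
    SECRET 'my_hmac_secret'
);

SELECT action, COUNT(*) AS cnt
FROM read_parquet('gs://my-lake/events/*')
GROUP BY action;
```

One caveat: reads that bypass BigQuery go straight to GCS, so they are governed by bucket IAM, not by BigLake's fine-grained policies.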
Unified governance. BigLake applies BigQuery’s column-level security and row-level access policies to data stored in GCS. Previously, securing a data lake meant managing IAM at the bucket/object level — far too coarse for most access control requirements.
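For illustration, a row-level policy on a BigLake table uses the same DDL as on a native table. A sketch, assuming a table `my_project.lakehouse.events` with a region column and a hypothetical Google group:

```sql
-- Only members of the EU analysts group can see EU rows
CREATE ROW ACCESS POLICY eu_only
ON `my_project.lakehouse.events`
GRANT TO ('group:eu-analysts@example.com')
FILTER USING (region = 'EU');
```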
Cost flexibility. You can keep cold or infrequently accessed data in GCS (at GCS pricing) while promoting hot datasets to native BigQuery storage for maximum query performance. BigLake gives you a single SQL interface across both.
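Promoting a hot slice into native storage is a single statement. A sketch with assumed table and column names:

```sql
-- Copy the last 90 days from the BigLake table into native BigQuery storage
CREATE OR REPLACE TABLE `my_project.curated.events_recent`
PARTITION BY event_date
AS
SELECT *
FROM `my_project.lakehouse.events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY);
```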
A Practical Setup
# Create a Cloud Resource connection for BigLake
# (connections are managed with the bq CLI or API, not SQL DDL)
bq mk --connection --location=us --project_id=my_project \
    --connection_type=CLOUD_RESOURCE my_connection
# Then grant the connection's service account read access
# (roles/storage.objectViewer) on the bucket.
-- Create a BigLake table over Parquet files
CREATE EXTERNAL TABLE `my_project.lakehouse.events`
WITH CONNECTION `my_project.us.my_connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-lake/events/*']
);
-- Query it like any BigQuery table
SELECT action, COUNT(*) AS cnt
FROM `my_project.lakehouse.events`
WHERE event_date = '2025-03-17'
GROUP BY action;

When to Go Lakehouse vs. Pure Warehouse
Use a lakehouse approach when you have large volumes of data in open formats that multiple engines need to access, or when you want to minimize storage costs for historical data. Use native BigQuery tables for your hottest, most-queried datasets where maximum performance matters.
Most production platforms end up with a hybrid: BigLake for the broad lake layer, native BigQuery for curated/serving datasets.
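Because both layers share one SQL surface, a single query can join across them. A sketch, with the curated users table name assumed:

```sql
-- Join a native curated dimension with the BigLake fact table in one query
SELECT u.plan, COUNT(*) AS events
FROM `my_project.lakehouse.events` AS e   -- BigLake, files in GCS
JOIN `my_project.curated.users` AS u      -- native BigQuery table
  ON e.user_id = u.user_id
GROUP BY u.plan;
```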
Takeaway: BigLake bridges the gap between data lakes and warehouses on GCP. Store data in open formats, govern it centrally, and query it with BigQuery — without forcing everything into proprietary storage.