Designing a Data Lakehouse on GCP with BigLake
Unify your data lake and warehouse with BigLake. Query Parquet and ORC files in Cloud Storage directly from BigQuery with fine-grained access control.
The data lakehouse pattern aims to combine the flexibility of a data lake (store anything in open formats on cheap storage) with the performance and governance of a data warehouse. On GCP, BigLake is the key enabler.
What BigLake Does
BigLake creates a unified access layer over data stored in GCS (and, via BigQuery Omni, other clouds). You define BigLake tables in BigQuery, pointing them at Parquet, ORC, Avro, or Delta Lake files in GCS. These tables are then queryable through BigQuery with the same SQL interface, access controls, and performance optimizations as native BigQuery tables.
The key insight: your data stays in open formats on GCS, but you get BigQuery’s governance and query engine on top.
Why This Matters
Open formats prevent vendor lock-in. Parquet files in GCS can be read by Spark, Presto, DuckDB, or any tool that understands the format. You’re not locked into BigQuery’s internal storage.
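As a concrete sketch: the same Parquet files can be scanned directly by an outside engine. Here is what that might look like with DuckDB's httpfs extension, assuming GCS HMAC (S3-interoperability) credentials are configured; the key values and the gs://my-lake path are placeholders:

```sql
-- DuckDB: read the same Parquet files directly from GCS, no BigQuery involved
INSTALL httpfs;
LOAD httpfs;

-- GCS access in DuckDB uses S3-interoperability HMAC keys (placeholder values)
CREATE SECRET gcs_secret (
    TYPE GCS,
    KEY_ID 'my_hmac_key_id',
    SECRET 'my_hmac_secret'
);

SELECT action, COUNT(*) AS cnt
FROM read_parquet('gs://my-lake/events/*')
GROUP BY action;
```

One caveat: reads that bypass BigQuery go straight to GCS, so they are governed by bucket IAM, not by BigLake's fine-grained policies.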
Unified governance. BigLake applies BigQuery’s column-level security and row-level access policies to data stored in GCS. Previously, securing a data lake meant managing IAM at the bucket/object level — far too coarse for most access control requirements.
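For illustration, a row-level policy on a BigLake table uses the same DDL as on a native table. A sketch, assuming a table `my_project.lakehouse.events` with a region column and a hypothetical Google group:

```sql
-- Only members of the EU analysts group can see EU rows
CREATE ROW ACCESS POLICY eu_only
ON `my_project.lakehouse.events`
GRANT TO ('group:eu-analysts@example.com')
FILTER USING (region = 'EU');
```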
Cost flexibility. You can keep cold or infrequently accessed data in GCS (at GCS pricing) while promoting hot datasets to native BigQuery storage for maximum query performance. BigLake gives you a single SQL interface across both.
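Promoting a hot slice into native storage is a single statement. A sketch with assumed table and column names:

```sql
-- Copy the last 90 days from the BigLake table into native BigQuery storage
CREATE OR REPLACE TABLE `my_project.curated.events_recent`
PARTITION BY event_date
AS
SELECT *
FROM `my_project.lakehouse.events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY);
```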
A Practical Setup
# Create a Cloud Resource connection for BigLake
# (connections are managed with the bq CLI or API, not SQL DDL)
bq mk --connection --location=us --project_id=my_project \
    --connection_type=CLOUD_RESOURCE my_connection
# Then grant the connection's service account read access
# (roles/storage.objectViewer) on the bucket.
-- Create a BigLake table over Parquet files
CREATE EXTERNAL TABLE `my_project.lakehouse.events`
WITH CONNECTION `my_project.us.my_connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-lake/events/*']
);
-- Query it like any BigQuery table
SELECT action, COUNT(*) AS cnt
FROM `my_project.lakehouse.events`
WHERE event_date = '2025-03-17'
GROUP BY action;

When to Go Lakehouse vs. Pure Warehouse
Use a lakehouse approach when you have large volumes of data in open formats that multiple engines need to access, or when you want to minimize storage costs for historical data. Use native BigQuery tables for your hottest, most-queried datasets where maximum performance matters.
Most production platforms end up with a hybrid: BigLake for the broad lake layer, native BigQuery for curated/serving datasets.
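Because both layers share one SQL surface, a single query can join across them. A sketch, with the curated users table name assumed:

```sql
-- Join a native curated dimension with the BigLake fact table in one query
SELECT u.plan, COUNT(*) AS events
FROM `my_project.lakehouse.events` AS e   -- BigLake, files in GCS
JOIN `my_project.curated.users` AS u      -- native BigQuery table
  ON e.user_id = u.user_id
GROUP BY u.plan;
```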
Takeaway: BigLake bridges the gap between data lakes and warehouses on GCP. Store data in open formats, govern it centrally, and query it with BigQuery — without forcing everything into proprietary storage.