Skip to content

BigQuery

BigQuery is GCP’s fully managed, serverless data warehouse. It’s designed for analysing large datasets — petabyte-scale — using standard SQL, without managing any infrastructure.

TermDescription
ProjectTop-level billing and access boundary
DatasetContainer for tables (like a database schema)
TableStores data in a columnar format
ViewVirtual table defined by a SQL query
JobAn action: query, load, export, or copy
Terminal window
# Create a dataset
bq mk --dataset my_project:my_dataset
# Create a table from a schema file
bq mk --table my_project:my_dataset.my_table schema.json
# Create a table from a CSV (auto-detect schema)
bq load --autodetect my_dataset.my_table gs://my-bucket/data.csv

schema.json:

[
{ "name": "user_id", "type": "STRING", "mode": "REQUIRED" },
{ "name": "event_name", "type": "STRING", "mode": "NULLABLE" },
{ "name": "created_at", "type": "TIMESTAMP", "mode": "NULLABLE" },
{ "name": "value", "type": "FLOAT64", "mode": "NULLABLE" }
]
Terminal window
# Run a query from CLI
bq query --nouse_legacy_sql \
'SELECT COUNT(*) FROM my_dataset.my_table'
# Save results to a table
bq query --nouse_legacy_sql \
--destination_table=my_dataset.results \
'SELECT user_id, COUNT(*) as events FROM my_dataset.events GROUP BY user_id'

BigQuery uses standard SQL:

-- Count events by day
SELECT
DATE(created_at) AS day,
event_name,
COUNT(*) AS total
FROM `my_project.my_dataset.events`
WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY day, event_name
ORDER BY day DESC;
-- Find top users by revenue
SELECT
user_id,
SUM(amount) AS total_revenue
FROM `my_project.my_dataset.orders`
WHERE status = 'completed'
GROUP BY user_id
ORDER BY total_revenue DESC
LIMIT 10;

Partition tables to reduce query cost (you only scan relevant partitions):

CREATE TABLE my_dataset.events_partitioned
PARTITION BY DATE(created_at)
AS SELECT * FROM my_dataset.events;

Query a specific partition:

SELECT * FROM my_dataset.events_partitioned
WHERE DATE(created_at) = '2025-01-01';

Clustering sorts data within partitions to improve filter performance:

CREATE TABLE my_dataset.events_clustered
PARTITION BY DATE(created_at)
CLUSTER BY user_id, event_name
AS SELECT * FROM my_dataset.events;

BigQuery pricing has two components:

  • Storage: ~$0.02/GB/month for active storage
  • Queries: $5 per TB of data scanned (first 1 TB/month free)

Use SELECT with LIMIT and partitioning to reduce bytes scanned.

Terminal window
# Estimate bytes scanned before running
bq query --dry_run --nouse_legacy_sql \
'SELECT * FROM my_dataset.large_table LIMIT 100'
Terminal window
# Export table to CSV in Cloud Storage
bq extract my_dataset.my_table gs://my-bucket/export-*.csv
# Export as JSON
bq extract --destination_format=NEWLINE_DELIMITED_JSON \
my_dataset.my_table gs://my-bucket/export-*.json
from google.cloud import bigquery
client = bigquery.Client()
query = """
SELECT user_id, COUNT(*) AS events
FROM `my_project.my_dataset.events`
GROUP BY user_id
ORDER BY events DESC
LIMIT 10
"""
results = client.query(query)
for row in results:
print(row.user_id, row.events)