BigQuery

BigQuery is GCP’s fully managed, serverless data warehouse. It’s designed for analysing large datasets — petabyte-scale — using standard SQL, without managing any infrastructure.

Key Concepts

Term	Description
Project	Top-level billing and access boundary
Dataset	Container for tables (like a database schema)
Table	Stores data in a columnar format
View	Virtual table defined by a SQL query
Job	An action: query, load, export, or copy

Creating Datasets and Tables

# Create a dataset
bq mk --dataset my_project:my_dataset

# Create a table from a schema file
bq mk --table my_project:my_dataset.my_table schema.json

# Create a table from a CSV (auto-detect schema)
bq load --autodetect my_dataset.my_table gs://my-bucket/data.csv

schema.json:

[
  { "name": "user_id", "type": "STRING", "mode": "REQUIRED" },
  { "name": "event_name", "type": "STRING", "mode": "NULLABLE" },
  { "name": "created_at", "type": "TIMESTAMP", "mode": "NULLABLE" },
  { "name": "value", "type": "FLOAT64", "mode": "NULLABLE" }
]

Running Queries

# Run a query from CLI
bq query --nouse_legacy_sql \
  'SELECT COUNT(*) FROM my_dataset.my_table'

# Save results to a table
bq query --nouse_legacy_sql \
  --destination_table=my_dataset.results \
  'SELECT user_id, COUNT(*) as events FROM my_dataset.events GROUP BY user_id'

SQL Examples

BigQuery uses standard SQL:

-- Count events by day
SELECT
  DATE(created_at) AS day,
  event_name,
  COUNT(*) AS total
FROM `my_project.my_dataset.events`
WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY day, event_name
ORDER BY day DESC;

-- Find top users by revenue
SELECT
  user_id,
  SUM(amount) AS total_revenue
FROM `my_project.my_dataset.orders`
WHERE status = 'completed'
GROUP BY user_id
ORDER BY total_revenue DESC
LIMIT 10;

Partitioned Tables

Partition tables to reduce query cost (you only scan relevant partitions):

CREATE TABLE my_dataset.events_partitioned
PARTITION BY DATE(created_at)
AS SELECT * FROM my_dataset.events;

Query a specific partition:

SELECT * FROM my_dataset.events_partitioned
WHERE DATE(created_at) = '2025-01-01';

Clustered Tables

Clustering sorts data within partitions to improve filter performance:

CREATE TABLE my_dataset.events_clustered
PARTITION BY DATE(created_at)
CLUSTER BY user_id, event_name
AS SELECT * FROM my_dataset.events;

Pricing

BigQuery pricing has two components:

Storage: ~$0.02/GB/month for active storage
Queries: $5 per TB of data scanned (first 1 TB/month free)

Use SELECT with LIMIT and partitioning to reduce bytes scanned.

# Estimate bytes scanned before running
bq query --dry_run --nouse_legacy_sql \
  'SELECT * FROM my_dataset.large_table LIMIT 100'

Exporting Data

# Export table to CSV in Cloud Storage
bq extract my_dataset.my_table gs://my-bucket/export-*.csv

# Export as JSON
bq extract --destination_format=NEWLINE_DELIMITED_JSON \
  my_dataset.my_table gs://my-bucket/export-*.json

BigQuery with Python

from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT user_id, COUNT(*) AS events
FROM `my_project.my_dataset.events`
GROUP BY user_id
ORDER BY events DESC
LIMIT 10
"""

results = client.query(query)
for row in results:
    print(row.user_id, row.events)