BigQuery
BigQuery
Section titled “BigQuery”BigQuery is GCP’s fully managed, serverless data warehouse. It’s designed for analysing large datasets — petabyte-scale — using standard SQL, without managing any infrastructure.
Key Concepts
Section titled “Key Concepts”| Term | Description |
|---|---|
| Project | Top-level billing and access boundary |
| Dataset | Container for tables (like a database schema) |
| Table | Stores data in a columnar format |
| View | Virtual table defined by a SQL query |
| Job | An action: query, load, export, or copy |
Creating Datasets and Tables
Section titled “Creating Datasets and Tables”# Create a datasetbq mk --dataset my_project:my_dataset
# Create a table from a schema filebq mk --table my_project:my_dataset.my_table schema.json
# Create a table from a CSV (auto-detect schema)bq load --autodetect my_dataset.my_table gs://my-bucket/data.csvschema.json:
[ { "name": "user_id", "type": "STRING", "mode": "REQUIRED" }, { "name": "event_name", "type": "STRING", "mode": "NULLABLE" }, { "name": "created_at", "type": "TIMESTAMP", "mode": "NULLABLE" }, { "name": "value", "type": "FLOAT64", "mode": "NULLABLE" }]Running Queries
Section titled “Running Queries”# Run a query from CLIbq query --nouse_legacy_sql \ 'SELECT COUNT(*) FROM my_dataset.my_table'
# Save results to a tablebq query --nouse_legacy_sql \ --destination_table=my_dataset.results \ 'SELECT user_id, COUNT(*) as events FROM my_dataset.events GROUP BY user_id'SQL Examples
Section titled “SQL Examples”BigQuery uses standard SQL:
-- Count events by daySELECT DATE(created_at) AS day, event_name, COUNT(*) AS totalFROM `my_project.my_dataset.events`WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)GROUP BY day, event_nameORDER BY day DESC;
-- Find top users by revenueSELECT user_id, SUM(amount) AS total_revenueFROM `my_project.my_dataset.orders`WHERE status = 'completed'GROUP BY user_idORDER BY total_revenue DESCLIMIT 10;Partitioned Tables
Section titled “Partitioned Tables”Partition tables to reduce query cost (you only scan relevant partitions):
CREATE TABLE my_dataset.events_partitionedPARTITION BY DATE(created_at)AS SELECT * FROM my_dataset.events;Query a specific partition:
SELECT * FROM my_dataset.events_partitionedWHERE DATE(created_at) = '2025-01-01';Clustered Tables
Section titled “Clustered Tables”Clustering sorts data within partitions to improve filter performance:
CREATE TABLE my_dataset.events_clusteredPARTITION BY DATE(created_at)CLUSTER BY user_id, event_nameAS SELECT * FROM my_dataset.events;Pricing
Section titled “Pricing”BigQuery pricing has two components:
- Storage: ~$0.02/GB/month for active storage
- Queries: $5 per TB of data scanned (first 1 TB/month free)
Use SELECT with LIMIT and partitioning to reduce bytes scanned.
# Estimate bytes scanned before runningbq query --dry_run --nouse_legacy_sql \ 'SELECT * FROM my_dataset.large_table LIMIT 100'Exporting Data
Section titled “Exporting Data”# Export table to CSV in Cloud Storagebq extract my_dataset.my_table gs://my-bucket/export-*.csv
# Export as JSONbq extract --destination_format=NEWLINE_DELIMITED_JSON \ my_dataset.my_table gs://my-bucket/export-*.jsonBigQuery with Python
Section titled “BigQuery with Python”from google.cloud import bigquery
client = bigquery.Client()
query = """SELECT user_id, COUNT(*) AS eventsFROM `my_project.my_dataset.events`GROUP BY user_idORDER BY events DESCLIMIT 10"""
results = client.query(query)for row in results: print(row.user_id, row.events)