Most SQL tutorials stop at SELECT, WHERE, and JOIN. Those are necessary but not the parts of SQL that make data scientists powerful. Window functions, CTEs, time-series queries, and query optimization — these are where SQL becomes a genuine analytical language rather than a way to fetch rows.
This guide is for people who already know basic SQL and want the patterns that actually come up in data science work.
CTEs: Write Readable, Maintainable Queries
Common Table Expressions (CTEs) let you name subqueries and reference them like temporary tables. They transform nested-query spaghetti into readable pipelines.
-- Without CTEs: nested and hard to follow
SELECT user_id, total_revenue
FROM (
  SELECT user_id, SUM(amount) as total_revenue
  FROM (
    SELECT user_id, amount
    FROM orders
    WHERE status = 'completed'
  ) completed_orders  -- most engines require derived-table aliases
  GROUP BY user_id
) user_revenue
WHERE total_revenue > 1000;
-- With CTEs: reads like a pipeline
WITH completed_orders AS (
  SELECT user_id, amount
  FROM orders
  WHERE status = 'completed'
),
user_revenue AS (
  SELECT user_id, SUM(amount) as total_revenue
  FROM completed_orders
  GROUP BY user_id
)
SELECT user_id, total_revenue
FROM user_revenue
WHERE total_revenue > 1000;
CTEs don’t change what the query does — they change how it reads. This matters when a query is 200 lines long and you need to debug or extend it six months later.
Recursive CTEs let you traverse hierarchical data such as org charts and category trees:
-- Organizational hierarchy: find all reports under a manager
WITH RECURSIVE org_tree AS (
  -- Base case: the manager
  SELECT employee_id, manager_id, name, 0 as depth
  FROM employees
  WHERE employee_id = 'MGR_001'

  UNION ALL

  -- Recursive case: find direct reports of everyone in the tree
  SELECT e.employee_id, e.manager_id, e.name, ot.depth + 1
  FROM employees e
  INNER JOIN org_tree ot ON e.manager_id = ot.employee_id
)
SELECT * FROM org_tree ORDER BY depth, name;
Window Functions: The Core Pattern
Window functions apply a function across a set of rows related to the current row, without collapsing the result like GROUP BY does. Every data scientist should be fluent with these.
The syntax:
function_name(column) OVER (
  PARTITION BY partition_column
  ORDER BY sort_column
  ROWS/RANGE BETWEEN frame_start AND frame_end
)
Ranking Within Groups
-- Rank each product by revenue within each category
SELECT
  product_id,
  category,
  revenue,
  RANK() OVER (PARTITION BY category ORDER BY revenue DESC) as rank_in_category,
  DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) as dense_rank,
  ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) as row_num
FROM product_sales;
- RANK(): leaves gaps after ties (1, 1, 3, 4…)
- DENSE_RANK(): no gaps (1, 1, 2, 3…)
- ROW_NUMBER(): a unique sequential number, even for ties
To get the top-N per group (a very common task):
WITH ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) as rn
  FROM product_sales
)
SELECT * FROM ranked WHERE rn <= 3;
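BigQuery, Snowflake, and DuckDB also support QUALIFY, which filters on a window function's result and removes the wrapper CTE (BigQuery expects a WHERE, GROUP BY, or HAVING clause alongside it, hence the WHERE true):
SELECT *,
  ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) as rn
FROM product_sales
WHERE true  -- satisfies BigQuery's requirement that QUALIFY accompany another clause
QUALIFY rn <= 3;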
Running Totals and Cumulative Statistics
-- Running sum of revenue by date
SELECT
  date,
  daily_revenue,
  SUM(daily_revenue) OVER (ORDER BY date) as cumulative_revenue,
  AVG(daily_revenue) OVER (
    ORDER BY date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) as rolling_7d_avg
FROM daily_revenue;
The frame clause (ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) defines the sliding window. Common patterns:
- ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: cumulative from the start
- ROWS BETWEEN 29 PRECEDING AND CURRENT ROW: 30-day rolling window (assuming one row per day)
- RANGE BETWEEN INTERVAL 7 DAY PRECEDING AND CURRENT ROW: 7-day window by date distance rather than row count (PostgreSQL syntax; BigQuery restricts RANGE frames to numeric ORDER BY keys, as in the sketch below)
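That last distinction matters when dates are missing: a ROWS frame counts rows, not days. A minimal sketch of a calendar-accurate 7-day window in BigQuery, using UNIX_DATE to turn the date into the numeric key that RANGE requires:
SELECT
  date,
  SUM(daily_revenue) OVER (
    ORDER BY UNIX_DATE(date)  -- days since epoch, a numeric ordering key
    RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
  ) as rolling_7d_sum  -- spans 7 calendar days even when some dates are absent
FROM daily_revenue;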
Lag and Lead: Time-Series Features
The most useful window functions for feature engineering:
-- Create lag features for time-series modeling
SELECT
  date,
  store_id,
  sales,
  LAG(sales, 1) OVER (PARTITION BY store_id ORDER BY date) as sales_lag_1d,
  LAG(sales, 7) OVER (PARTITION BY store_id ORDER BY date) as sales_lag_7d,
  LAG(sales, 28) OVER (PARTITION BY store_id ORDER BY date) as sales_lag_28d,
  -- Week-over-week change
  sales - LAG(sales, 7) OVER (PARTITION BY store_id ORDER BY date) as wow_change,
  -- Percentage change (SAFE_DIVIDE returns NULL instead of erroring on zero)
  SAFE_DIVIDE(
    sales - LAG(sales, 7) OVER (PARTITION BY store_id ORDER BY date),
    LAG(sales, 7) OVER (PARTITION BY store_id ORDER BY date)
  ) as wow_pct_change
FROM daily_sales;
A query like this would otherwise require multiple self-joins or a round-trip through Pandas; with window functions it is one clean, efficient pass.
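LEAD is the mirror image, looking forward instead of back, which is how you'd build the prediction target itself. A sketch on the same daily_sales table (the column name sales_next_1d is illustrative):
-- Next-day sales as a supervised-learning label
SELECT
  date,
  store_id,
  sales,
  LEAD(sales, 1) OVER (PARTITION BY store_id ORDER BY date) as sales_next_1d
FROM daily_sales;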
Percentiles and Statistical Features
SELECT
  product_id,
  price,
  -- BigQuery's window form takes the value first; PostgreSQL instead uses
  -- PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price) as an aggregate
  PERCENTILE_CONT(price, 0.5) OVER () as median_price,
  PERCENTILE_CONT(price, 0.25) OVER () as p25_price,
  PERCENTILE_CONT(price, 0.75) OVER () as p75_price,
  -- Z-score (SAFE_DIVIDE avoids division by zero when prices are constant)
  SAFE_DIVIDE(price - AVG(price) OVER (), STDDEV(price) OVER ()) as price_zscore
FROM products;
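A related pattern for bucketing rather than scoring: NTILE splits rows into equal-sized groups, handy for decile analysis. A sketch on the same products table:
SELECT
  product_id,
  price,
  NTILE(10) OVER (ORDER BY price) as price_decile  -- 1 = cheapest 10% of products
FROM products;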
Time-Series Queries
Handling Date Arithmetic
-- BigQuery date functions
SELECT
  date,
  DATE_ADD(date, INTERVAL 1 DAY) as tomorrow,
  DATE_SUB(date, INTERVAL 7 DAY) as last_week,
  DATE_DIFF(date, DATE '2024-01-01', DAY) as days_since_start,
  FORMAT_DATE('%Y-%m', date) as year_month,
  EXTRACT(DAYOFWEEK FROM date) as day_of_week,  -- 1 = Sunday
  EXTRACT(WEEK FROM date) as week_number
FROM my_table;
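When you need the month or week as a real DATE rather than a string (so it sorts and joins correctly), DATE_TRUNC is usually cleaner than FORMAT_DATE:
SELECT
  DATE_TRUNC(date, MONTH) as month_start,       -- first day of the month, still a DATE
  DATE_TRUNC(date, WEEK(MONDAY)) as week_start  -- weeks starting Monday
FROM my_table;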
Generating a Date Spine
When you need to ensure every date appears in a result (including days with no events):
-- Generate a date range
WITH date_spine AS (
  SELECT date
  FROM UNNEST(GENERATE_DATE_ARRAY('2024-01-01', '2024-12-31', INTERVAL 1 DAY)) as date
),
daily_sales AS (
  SELECT date, SUM(amount) as revenue
  FROM orders
  GROUP BY date
)
SELECT
  ds.date,
  COALESCE(dl.revenue, 0) as revenue  -- 0 for missing days
FROM date_spine ds
LEFT JOIN daily_sales dl USING (date);
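The same trick extends to per-entity spines, one row per store per date, by cross joining the spine with the entity list. A sketch, assuming a daily_sales table with a store_id column:
WITH date_spine AS (
  SELECT date
  FROM UNNEST(GENERATE_DATE_ARRAY('2024-01-01', '2024-12-31')) as date
)
SELECT s.store_id, d.date
FROM (SELECT DISTINCT store_id FROM daily_sales) s
CROSS JOIN date_spine d;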
Session Analysis
Finding user sessions from clickstream data (a common product analytics task):
WITH session_boundaries AS (
  SELECT
    user_id,
    event_time,
    -- Mark start of new session: gap > 30 minutes from previous event
    CASE WHEN
      TIMESTAMP_DIFF(event_time,
        LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time),
        MINUTE) > 30
      OR LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) IS NULL
    THEN 1 ELSE 0 END as is_session_start
  FROM events
),
sessions AS (
  SELECT
    user_id,
    event_time,
    SUM(is_session_start) OVER (PARTITION BY user_id ORDER BY event_time) as session_id
  FROM session_boundaries
)
SELECT
  user_id,
  session_id,
  MIN(event_time) as session_start,
  MAX(event_time) as session_end,
  COUNT(*) as events_in_session
FROM sessions
GROUP BY user_id, session_id;
Self-Joins for Cohort Analysis
Cohort analysis, which tracks how groups of users behave over time, boils down to joining a table to an aggregated copy of itself:
-- Monthly retention: how many users from each monthly cohort are active N months later?
WITH monthly_active AS (
  SELECT DISTINCT
    user_id,
    FORMAT_DATE('%Y-%m', activity_date) as activity_month
  FROM user_events
),
cohort_month AS (
  SELECT user_id, MIN(activity_month) as cohort
  FROM monthly_active
  GROUP BY user_id
)
SELECT
  c.cohort,
  ma.activity_month,
  COUNT(DISTINCT ma.user_id) as active_users,
  -- Months since cohort entry
  DATE_DIFF(
    PARSE_DATE('%Y-%m', ma.activity_month),
    PARSE_DATE('%Y-%m', c.cohort),
    MONTH
  ) as months_since_join
FROM cohort_month c
JOIN monthly_active ma USING (user_id)
GROUP BY c.cohort, ma.activity_month
ORDER BY c.cohort, months_since_join;
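To get retention rates rather than raw counts, divide each month's count by the cohort's month-0 count. A sketch, assuming the result above is saved as a table or CTE named cohort_counts:
-- Retention rate = active users / size of the cohort in month 0
SELECT
  cohort,
  months_since_join,
  active_users,
  SAFE_DIVIDE(
    active_users,
    FIRST_VALUE(active_users) OVER (PARTITION BY cohort ORDER BY months_since_join)
  ) as retention_rate
FROM cohort_counts;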
Query Optimization
Partition and Cluster Pruning
In BigQuery's on-demand pricing model, queries are billed by bytes scanned, and most columnar stores get similar speedups from reading less data. Always filter on partition columns:
-- Expensive: scans entire table
SELECT * FROM orders WHERE user_id = '12345';
-- Cheap: scans only the partitions in that date range (assuming the table is partitioned by date)
SELECT * FROM orders
WHERE date BETWEEN '2024-01-01' AND '2024-03-31'
AND user_id = '12345';
Know which columns your tables are partitioned and clustered on. This is the single biggest lever for query cost and speed.
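If you own the table, you choose those columns at creation time. A BigQuery DDL sketch, with hypothetical dataset and table names:
CREATE TABLE my_dataset.orders
PARTITION BY date            -- date filters prune whole partitions
CLUSTER BY user_id, status   -- co-locates rows for common equality filters
AS SELECT * FROM my_dataset.orders_raw;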
Avoid SELECT *
Always name the columns you need. SELECT * reads every column from storage, including the ones you never use; on wide tables this can increase query cost tenfold.
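When you really do need almost everything, BigQuery's SELECT * EXCEPT drops just the expensive columns (the column names here are hypothetical):
SELECT * EXCEPT (raw_payload, debug_blob)  -- skip the wide columns
FROM events;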
Push Filters Early
Filter before joining, not after:
-- Slow: joins full tables, then filters
SELECT o.*, u.country
FROM orders o
JOIN users u ON o.user_id = u.user_id
WHERE u.country = 'US' AND o.status = 'completed';
-- Faster: filter inside CTEs before the join
WITH us_users AS (SELECT user_id, country FROM users WHERE country = 'US'),
completed_orders AS (SELECT * FROM orders WHERE status = 'completed')
SELECT o.*, u.country
FROM completed_orders o
JOIN us_users u ON o.user_id = u.user_id;
Modern query optimizers often push filters down automatically, but being explicit helps in complex queries where the optimizer can't prove the rewrite is safe.
Use Approximate Functions for Exploration
-- Exact: expensive for high-cardinality columns
SELECT COUNT(DISTINCT user_id) FROM events;
-- Approximate (error ~1%): much cheaper
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;
During exploration, approximate aggregations are often sufficient and dramatically cheaper. Switch to exact functions for final reporting.
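The same trade-off applies to quantiles. A sketch using BigQuery's APPROX_QUANTILES, which returns an array of boundary values you index into:
-- Approximate median and 95th percentile of order value
SELECT
  APPROX_QUANTILES(amount, 100)[OFFSET(50)] as approx_median,
  APPROX_QUANTILES(amount, 100)[OFFSET(95)] as approx_p95
FROM orders;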
The SQL-Python Decision
SQL is the right tool when:
- The data lives in a database and you want to avoid moving it
- You need a transformation that’s expressible cleanly in SQL
- You want the result to be reproducible by other analysts without Python dependencies
- The operation benefits from the database’s parallel execution (joins over billions of rows)
Python/Pandas is the right tool when:
- You need ML algorithms, statistical tests, or libraries that don’t exist in SQL
- Your transformation requires complex state or control flow
- You’re doing exploratory analysis that changes direction frequently
- You need visualization as part of the workflow
For data scientists: be fluent in both. SQL for extraction, aggregation, and feature generation at scale. Python for modeling, statistical analysis, and anything that requires code that SQL can’t express cleanly.