data-engineering 10 min read 18 February 2024

SQL for Data Scientists: The Patterns That Actually Matter

Window functions, CTEs, time-series queries, and optimization techniques — SQL patterns that data scientists use daily but often learn inefficiently from tutorial sites that stop at basic SELECT.

Most SQL tutorials stop at SELECT, WHERE, and JOIN. Those are necessary but not the parts of SQL that make data scientists powerful. Window functions, CTEs, time-series queries, and query optimization — these are where SQL becomes a genuine analytical language rather than a way to fetch rows.

This guide is for people who already know basic SQL and want the patterns that actually come up in data science work.

CTEs: Write Readable, Maintainable Queries

Common Table Expressions (CTEs) let you name subqueries and reference them like temporary tables. They transform nested-query spaghetti into readable pipelines.

-- Without CTEs: nested and hard to follow
SELECT user_id, total_revenue
FROM (
    SELECT user_id, SUM(amount) as total_revenue
    FROM (
        SELECT user_id, amount
        FROM orders
        WHERE status = 'completed'
    ) AS completed_orders
    GROUP BY user_id
) AS user_revenue
WHERE total_revenue > 1000;

-- With CTEs: reads like a pipeline
WITH completed_orders AS (
    SELECT user_id, amount
    FROM orders
    WHERE status = 'completed'
),
user_revenue AS (
    SELECT user_id, SUM(amount) as total_revenue
    FROM completed_orders
    GROUP BY user_id
)
SELECT user_id, total_revenue
FROM user_revenue
WHERE total_revenue > 1000;

CTEs don’t change what the query does — they change how it reads. This matters when a query is 200 lines long and you need to debug or extend it six months later.

Recursive CTEs let you traverse tree structures and hierarchical data:

-- Organizational hierarchy: find all reports under a manager
WITH RECURSIVE org_tree AS (
    -- Base case: the manager
    SELECT employee_id, manager_id, name, 0 as depth
    FROM employees
    WHERE employee_id = 'MGR_001'
    
    UNION ALL
    
    -- Recursive case: find direct reports of everyone in the tree
    SELECT e.employee_id, e.manager_id, e.name, ot.depth + 1
    FROM employees e
    INNER JOIN org_tree ot ON e.manager_id = ot.employee_id
)
SELECT * FROM org_tree ORDER BY depth, name;
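The same pattern runs anywhere recursive CTEs are supported. Here is a minimal, runnable check using Python's built-in sqlite3 and a tiny hypothetical employees table (the IDs and names are made up for illustration):

```python
import sqlite3

# Hypothetical three-level org: MGR_001 manages two people, one of whom manages a third.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id TEXT, manager_id TEXT, name TEXT);
    INSERT INTO employees VALUES
        ('MGR_001', NULL,      'Ada'),
        ('EMP_002', 'MGR_001', 'Grace'),
        ('EMP_003', 'MGR_001', 'Edsger'),
        ('EMP_004', 'EMP_002', 'Barbara');
""")

rows = conn.execute("""
    WITH RECURSIVE org_tree AS (
        -- Base case: the manager
        SELECT employee_id, manager_id, name, 0 AS depth
        FROM employees
        WHERE employee_id = 'MGR_001'
        UNION ALL
        -- Recursive case: direct reports of everyone already in the tree
        SELECT e.employee_id, e.manager_id, e.name, ot.depth + 1
        FROM employees e
        INNER JOIN org_tree ot ON e.manager_id = ot.employee_id
    )
    SELECT name, depth FROM org_tree ORDER BY depth, name
""").fetchall()

print(rows)  # [('Ada', 0), ('Edsger', 1), ('Grace', 1), ('Barbara', 2)]
```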

Window Functions: The Core Pattern

Window functions apply a function across a set of rows related to the current row, without collapsing the result like GROUP BY does. Every data scientist should be fluent with these.

The syntax:

function_name(column) OVER (
    PARTITION BY partition_column
    ORDER BY sort_column
    ROWS/RANGE BETWEEN frame_start AND frame_end
)

Ranking Within Groups

-- Rank each product by revenue within each category
SELECT 
    product_id,
    category,
    revenue,
    RANK() OVER (PARTITION BY category ORDER BY revenue DESC) as rank_in_category,
    DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) as dense_rank,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) as row_num
FROM product_sales;

To get the top-N per group (a very common task):

WITH ranked AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) as rn
    FROM product_sales
)
SELECT * FROM ranked WHERE rn <= 3;
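To see the top-N pattern end to end, here is a runnable sketch using Python's sqlite3 (SQLite 3.25+ supports window functions) against a small hypothetical product_sales table, taking the top 2 per category:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # needs SQLite >= 3.25 for window functions
conn.executescript("""
    CREATE TABLE product_sales (product_id TEXT, category TEXT, revenue REAL);
    INSERT INTO product_sales VALUES
        ('p1', 'books', 100), ('p2', 'books', 300), ('p3', 'books', 200),
        ('p4', 'toys',  50),  ('p5', 'toys',  75);
""")

top2 = conn.execute("""
    WITH ranked AS (
        SELECT product_id, category, revenue,
               ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS rn
        FROM product_sales
    )
    SELECT category, product_id, revenue FROM ranked
    WHERE rn <= 2
    ORDER BY category, revenue DESC
""").fetchall()

print(top2)
# [('books', 'p2', 300.0), ('books', 'p3', 200.0), ('toys', 'p5', 75.0), ('toys', 'p4', 50.0)]
```

Note that ROW_NUMBER breaks ties arbitrarily; use RANK or DENSE_RANK when ties should be kept or ranked together.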

Running Totals and Cumulative Statistics

-- Running sum of revenue by date
SELECT 
    date,
    daily_revenue,
    SUM(daily_revenue) OVER (ORDER BY date) as cumulative_revenue,
    AVG(daily_revenue) OVER (ORDER BY date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as rolling_7d_avg
FROM daily_revenue;

The frame clause (ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) defines the sliding window. Common patterns:

- ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: running total
- ROWS BETWEEN 6 PRECEDING AND CURRENT ROW: trailing 7-row window (a 7-day rolling average on daily data)
- ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING: centered 7-row window
- ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: the entire partition
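A runnable sketch of the running-total and rolling-average query via Python's sqlite3, on three hypothetical days of revenue (a 2-day rolling window, to keep the example tiny):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_revenue (date TEXT, daily_revenue REAL);
    INSERT INTO daily_revenue VALUES
        ('2024-01-01', 10), ('2024-01-02', 20), ('2024-01-03', 30);
""")

rows = conn.execute("""
    SELECT date,
           SUM(daily_revenue) OVER (ORDER BY date) AS cumulative,
           AVG(daily_revenue) OVER (ORDER BY date
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS rolling_2d
    FROM daily_revenue
    ORDER BY date
""").fetchall()

print(rows)
# [('2024-01-01', 10.0, 10.0), ('2024-01-02', 30.0, 15.0), ('2024-01-03', 60.0, 25.0)]
```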

Lag and Lead: Time-Series Features

The most useful window functions for feature engineering:

-- Create lag features for time-series modeling
SELECT
    date,
    sales,
    LAG(sales, 1) OVER (PARTITION BY store_id ORDER BY date) as sales_lag_1d,
    LAG(sales, 7) OVER (PARTITION BY store_id ORDER BY date) as sales_lag_7d,
    LAG(sales, 28) OVER (PARTITION BY store_id ORDER BY date) as sales_lag_28d,
    
    -- Week-over-week change
    sales - LAG(sales, 7) OVER (PARTITION BY store_id ORDER BY date) as wow_change,
    
    -- Percentage change
    SAFE_DIVIDE(
        sales - LAG(sales, 7) OVER (PARTITION BY store_id ORDER BY date),
        LAG(sales, 7) OVER (PARTITION BY store_id ORDER BY date)
    ) as wow_pct_change
FROM daily_sales;

This kind of query, which would require multiple self-joins or complex Pandas operations, is clean and efficient with window functions.
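To make the LAG behavior concrete, a minimal runnable version via Python's sqlite3, with a hypothetical single-store table and a 1-day lag (the first row's lag is NULL, so its change is NULL too):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_sales (store_id TEXT, date TEXT, sales REAL);
    INSERT INTO daily_sales VALUES
        ('s1', '2024-01-01', 100),
        ('s1', '2024-01-02', 120),
        ('s1', '2024-01-03', 90);
""")

rows = conn.execute("""
    SELECT date, sales,
           LAG(sales, 1) OVER (PARTITION BY store_id ORDER BY date) AS sales_lag_1d,
           sales - LAG(sales, 1) OVER (PARTITION BY store_id ORDER BY date) AS dod_change
    FROM daily_sales
    ORDER BY date
""").fetchall()

print(rows)
# [('2024-01-01', 100.0, None, None), ('2024-01-02', 120.0, 100.0, 20.0), ('2024-01-03', 90.0, 120.0, -30.0)]
```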

Percentiles and Statistical Features

SELECT
    product_id,
    price,
    PERCENTILE_CONT(price, 0.5) OVER () as median_price,
    PERCENTILE_CONT(price, 0.25) OVER () as p25_price,
    PERCENTILE_CONT(price, 0.75) OVER () as p75_price,
    
    -- Z-score (SAFE_DIVIDE guards against a zero standard deviation)
    SAFE_DIVIDE(price - AVG(price) OVER (), STDDEV(price) OVER ()) as price_zscore
FROM products;

Time-Series Queries

Handling Date Arithmetic

-- BigQuery date functions
SELECT
    date,
    DATE_ADD(date, INTERVAL 1 DAY) as tomorrow,
    DATE_SUB(date, INTERVAL 7 DAY) as last_week,
    DATE_DIFF(date, DATE '2024-01-01', DAY) as days_since_start,
    FORMAT_DATE('%Y-%m', date) as year_month,
    EXTRACT(DAYOFWEEK FROM date) as day_of_week, -- 1=Sunday
    EXTRACT(WEEK FROM date) as week_number
FROM my_table;

Generating a Date Spine

When you need to ensure every date appears in a result (including days with no events):

-- Generate a date range
WITH date_spine AS (
    SELECT date
    FROM UNNEST(GENERATE_DATE_ARRAY('2024-01-01', '2024-12-31', INTERVAL 1 DAY)) as date
),
daily_sales AS (
    SELECT date, SUM(amount) as revenue
    FROM orders
    GROUP BY date
)
SELECT 
    ds.date,
    COALESCE(dl.revenue, 0) as revenue  -- 0 for missing days
FROM date_spine ds
LEFT JOIN daily_sales dl USING (date);
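GENERATE_DATE_ARRAY is BigQuery-specific. In engines without it, a recursive CTE builds the same spine. A runnable sketch via Python's sqlite3, with a hypothetical orders table that has no sales on two of four days:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (date TEXT, amount REAL);
    INSERT INTO orders VALUES ('2024-01-01', 5), ('2024-01-03', 7);
""")

rows = conn.execute("""
    WITH RECURSIVE date_spine(date) AS (
        SELECT '2024-01-01'
        UNION ALL
        SELECT date(date, '+1 day') FROM date_spine
        WHERE date < '2024-01-04'
    ),
    daily_sales AS (
        SELECT date, SUM(amount) AS revenue FROM orders GROUP BY date
    )
    SELECT ds.date, COALESCE(s.revenue, 0) AS revenue  -- 0 for missing days
    FROM date_spine ds
    LEFT JOIN daily_sales s ON s.date = ds.date
    ORDER BY ds.date
""").fetchall()

print(rows)
# [('2024-01-01', 5.0), ('2024-01-02', 0), ('2024-01-03', 7.0), ('2024-01-04', 0)]
```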

Session Analysis

Finding user sessions from clickstream data (a common product analytics task):

WITH session_boundaries AS (
    SELECT
        user_id,
        event_time,
        -- Mark start of new session: gap > 30 minutes from previous event
        CASE WHEN 
            TIMESTAMP_DIFF(event_time, 
                LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time),
                MINUTE) > 30 
            OR LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) IS NULL
        THEN 1 ELSE 0 END as is_session_start
    FROM events
),
sessions AS (
    SELECT
        user_id,
        event_time,
        SUM(is_session_start) OVER (PARTITION BY user_id ORDER BY event_time) as session_id
    FROM session_boundaries
)
SELECT 
    user_id,
    session_id,
    MIN(event_time) as session_start,
    MAX(event_time) as session_end,
    COUNT(*) as events_in_session
FROM sessions
GROUP BY user_id, session_id;
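The two-step trick (flag session starts, then take a running sum of the flags) ports to any engine with window functions. A runnable sketch via Python's sqlite3, with three hypothetical events where the third arrives after a gap of well over 30 minutes (julianday differences stand in for TIMESTAMP_DIFF):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id TEXT, event_time TEXT);
    INSERT INTO events VALUES
        ('u1', '2024-01-01 09:00:00'),
        ('u1', '2024-01-01 09:10:00'),
        ('u1', '2024-01-01 11:00:00');  -- 110-minute gap: starts session 2
""")

rows = conn.execute("""
    WITH session_boundaries AS (
        SELECT user_id, event_time,
            CASE WHEN
                (julianday(event_time) -
                 julianday(LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time))
                ) * 24 * 60 > 30
                OR LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) IS NULL
            THEN 1 ELSE 0 END AS is_session_start
        FROM events
    ),
    sessions AS (
        SELECT user_id, event_time,
            SUM(is_session_start) OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
        FROM session_boundaries
    )
    SELECT user_id, session_id, COUNT(*) AS events_in_session
    FROM sessions
    GROUP BY user_id, session_id
    ORDER BY session_id
""").fetchall()

print(rows)  # [('u1', 1, 2), ('u1', 2, 1)]
```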

Self-Joins for Cohort Analysis

Cohort analysis — tracking how groups of users behave over time — uses self-joins:

-- Monthly retention: what % of users active in month M are still active in month M+N?
WITH monthly_active AS (
    SELECT DISTINCT
        user_id,
        FORMAT_DATE('%Y-%m', activity_date) as activity_month
    FROM user_events
),
cohort_month AS (
    SELECT user_id, MIN(activity_month) as cohort
    FROM monthly_active
    GROUP BY user_id
)
SELECT
    c.cohort,
    ma.activity_month,
    COUNT(DISTINCT ma.user_id) as active_users,
    -- Months since cohort entry
    DATE_DIFF(
        PARSE_DATE('%Y-%m', ma.activity_month),
        PARSE_DATE('%Y-%m', c.cohort),
        MONTH
    ) as months_since_join
FROM cohort_month c
JOIN monthly_active ma USING (user_id)
GROUP BY c.cohort, ma.activity_month
ORDER BY c.cohort, months_since_join;

Query Optimization

Partition and Cluster Pruning

In BigQuery's on-demand pricing model, queries are billed by bytes scanned, and most columnar warehouses get faster the less data they read. Always filter on partition columns:

-- Expensive: scans entire table
SELECT * FROM orders WHERE user_id = '12345';

-- Cheap: only scans the partition for that date range
SELECT * FROM orders 
WHERE date BETWEEN '2024-01-01' AND '2024-03-31'
  AND user_id = '12345';

Know which columns your tables are partitioned and clustered on. This is the single biggest lever for query cost and speed.

Avoid SELECT *

Always name the columns you need. SELECT * reads every column from storage, even the ones you don't use. On wide tables, this can make a query ten times more expensive.

Push Filters Early

Filter before joining, not after:

-- Slow: joins full tables, then filters
SELECT o.*, u.country
FROM orders o
JOIN users u ON o.user_id = u.user_id
WHERE u.country = 'US' AND o.status = 'completed';

-- Faster: filter inside CTEs before the join
WITH us_users AS (SELECT user_id, country FROM users WHERE country = 'US'),
completed_orders AS (SELECT * FROM orders WHERE status = 'completed')
SELECT o.*, u.country
FROM completed_orders o
JOIN us_users u ON o.user_id = u.user_id;

Modern query optimizers often do this automatically (predicate pushdown), but being explicit helps in complex queries where the optimizer may not see the optimization.

Use Approximate Functions for Exploration

-- Exact: expensive for high-cardinality columns
SELECT COUNT(DISTINCT user_id) FROM events;

-- Approximate (error ~1%): much cheaper
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;

During exploration, approximate aggregations are often sufficient and dramatically cheaper. Switch to exact functions for final reporting.

The SQL-Python Decision

SQL is the right tool when:

- The data lives in a database or warehouse: filter and aggregate where the data sits instead of downloading raw rows
- The work is joins, aggregations, window functions, or feature generation over many rows
- The result feeds a dashboard or a scheduled pipeline that other people rely on

Python/Pandas is the right tool when:

- You're fitting models, running statistical tests, or iterating on an analysis
- The logic needs loops, complex branching, or libraries that SQL can't express cleanly
- The data already fits comfortably in memory and you're exploring interactively

For data scientists: be fluent in both. SQL for extraction, aggregation, and feature generation at scale. Python for modeling, statistical analysis, and anything that requires code that SQL can’t express cleanly.
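The division of labor can be sketched in a few lines. Here sqlite3 stands in for the warehouse (table and figures are hypothetical): SQL filters and aggregates at the source, and Python takes the small result set from there.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (user_id TEXT, amount REAL, status TEXT);
    INSERT INTO orders VALUES
        ('u1', 50, 'completed'), ('u1', 30, 'completed'),
        ('u2', 200, 'completed'), ('u3', 10, 'cancelled');
""")

# SQL does the heavy lifting: filter and aggregate where the data lives.
revenues = [r[0] for r in conn.execute("""
    SELECT SUM(amount) FROM orders
    WHERE status = 'completed'
    GROUP BY user_id
""")]

# Python takes over for the statistical work on the (now small) result.
print(sorted(revenues))           # [80.0, 200.0]
print(statistics.mean(revenues))  # 140.0
```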

sql bigquery window-functions cte data-science analytics
