data-engineering · 11 min read · 12 February 2024

Python Performance: Writing Code That Scales

Practical techniques for making Python code fast — NumPy vectorization, Numba JIT compilation, multiprocessing, profiling tools, and common patterns that silently kill performance.

Python is slow. That’s the honest answer. A Python for-loop is roughly 10–100x slower than equivalent C code. For most applications, this doesn’t matter — network I/O, database queries, and human interaction dominate the runtime, not Python execution speed.

For data science and ML workloads — matrix operations, feature engineering over millions of rows, iterative algorithms — it matters a lot. The good news: Python’s performance ceiling is high if you use the right tools.

Understand Where Your Time Goes Before Optimizing

The first rule of performance optimization is to measure before you touch anything. Guesses about where the bottleneck lies are almost always wrong.

cProfile: Find the Hotspot

import cProfile
cProfile.run('your_function()', sort='cumulative')

Shows cumulative time per function call. Sort by cumulative to find the functions that consume the most total time. Sort by tottime to find the functions that are slow independent of what they call.
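
A minimal sketch of both views using the standard pstats module (profile_me is a hypothetical function standing in for your workload):

import cProfile
import pstats

cProfile.run('profile_me()', 'out.prof')        # write raw stats to a file
stats = pstats.Stats('out.prof')
stats.sort_stats('cumulative').print_stats(10)  # top 10 including time spent in callees
stats.sort_stats('tottime').print_stats(10)     # top 10 by time in the function body itself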

line_profiler: Drill Into a Specific Function

# pip install line_profiler

@profile  # bare decorator; kernprof injects it at runtime
def slow_function(data):
    result = []
    for x in data:          # is this line slow?
        result.append(x**2) # or this one?
    return result

# run with: kernprof -l -v script.py

Gives per-line timing. Reveals which specific line is the bottleneck inside a function.

memory_profiler: Find Memory Leaks

# pip install memory_profiler
from memory_profiler import profile

@profile
def memory_hungry_function():
    big_list = list(range(10_000_000))  # allocates millions of int objects
    return big_list
# run with: python -m memory_profiler script.py

Shows memory usage per line. Essential when you’re hitting OOM errors and need to understand the memory lifecycle.

py-spy: Profile Without Code Changes

py-spy top --pid 12345
py-spy record -o profile.svg --pid 12345

Attaches to a running process. No code modification needed. Generates flame graphs. The right tool for profiling production workloads.

NumPy: Vectorization as the First Fix

The single highest-leverage optimization in Python data science: replace for-loops with NumPy operations.

NumPy operations are implemented in C and operate on contiguous memory. A vectorized operation on 1 million elements is not 1 million Python operations — it’s one call to a C loop. The overhead per element drops by orders of magnitude.

The Pattern: Replace Loops with Array Operations

Slow:

# x and y are equal-length NumPy arrays
result = []
for i in range(len(x)):
    result.append(x[i] ** 2 + y[i] * 3)

Fast:

result = x**2 + y * 3  # operates on entire arrays at once

Both produce the same values (the loop builds a Python list, the vectorized version a NumPy array). The second version is 50–100x faster on large arrays.
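
A quick way to measure the gap yourself (array size and repeat count are illustrative):

import numpy as np
import timeit

x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)

loop = lambda: [x[i] ** 2 + y[i] * 3 for i in range(len(x))]
vectorized = lambda: x**2 + y * 3

print(timeit.timeit(loop, number=10))        # pure-Python loop over 1M elements
print(timeit.timeit(vectorized, number=10))  # one call into a C loop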

Broadcasting

NumPy broadcasting lets you operate on arrays of different shapes without explicit loops:

import numpy as np

# Subtract column means from each column (centering)
X_centered = X - X.mean(axis=0)  # X is (n_samples, n_features)

# Outer product without nested loops
a = np.array([1, 2, 3])
b = np.array([10, 20])
result = a[:, np.newaxis] * b  # shape (3, 2)

Understanding broadcasting rules eliminates many loops.
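
The rules in brief: shapes are compared right to left, and two dimensions are compatible when they are equal or one of them is 1. A sketch that uses this to compute pairwise squared distances without loops (the shapes are illustrative):

import numpy as np

A = np.random.rand(100, 3)  # 100 points in 3D
B = np.random.rand(50, 3)   # 50 points in 3D

# (100, 1, 3) - (1, 50, 3) broadcasts to (100, 50, 3)
diff = A[:, np.newaxis, :] - B[np.newaxis, :, :]
dist_sq = (diff ** 2).sum(axis=2)  # shape (100, 50)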

Memory Layout Matters

NumPy arrays are row-major (C order) by default, so the elements of each row sit next to each other in memory. Traversing along the last axis (within a row) reads memory sequentially and is cache-friendly; striding down a column jumps a full row's width per element and causes cache misses.

X = np.random.rand(10_000, 10_000)  # C order: each row is contiguous
%timeit X[0, :].sum()  # fast: sequential reads
%timeit X[:, 0].sum()  # slower: one ~80 KB stride per element
For repeated column operations, store data in Fortran order (order='F').

Numba: JIT Compilation for Custom Loops

When you genuinely need a loop that can’t be vectorized — custom algorithms, recursive computations, data-dependent control flow — Numba compiles Python functions to machine code using LLVM.

from numba import jit
import numpy as np

@jit(nopython=True)
def compute_rolling_zscore(x, window):
    n = len(x)
    result = np.empty(n)
    result[:window] = np.nan  # no full window yet for the first entries
    for i in range(window, n):
        window_data = x[i-window:i]
        mean = window_data.mean()
        std = window_data.std()
        result[i] = (x[i] - mean) / std if std > 0 else 0.0
    return result

The @jit(nopython=True) decorator compiles the function the first time it’s called. Subsequent calls are near-C speed. The function must use NumPy and basic Python — no Pandas, no Python objects inside the compiled code.

When Numba helps: tight loops with numerical operations, custom statistical algorithms, simulations. When Numba doesn’t help: operations that can already be expressed in NumPy, I/O-bound code, anything using Python objects.
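
One practical caveat worth seeing once: the first call pays the compilation cost, so benchmark warm calls. A quick sketch using the function above:

import time
import numpy as np

x = np.random.rand(1_000_000)

t0 = time.perf_counter()
compute_rolling_zscore(x, 50)  # first call triggers LLVM compilation
print(time.perf_counter() - t0)

t0 = time.perf_counter()
compute_rolling_zscore(x, 50)  # warm call runs the compiled machine code
print(time.perf_counter() - t0)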

Cython: Statically Typed Python

Cython compiles Python (with optional type annotations) to C. More powerful than Numba but requires a separate compilation step.

# rolling_stats.pyx
import numpy as np
cimport numpy as np

def fast_sum(np.ndarray[double, ndim=1] arr):
    cdef double total = 0.0
    cdef int i
    for i in range(len(arr)):
        total += arr[i]
    return total

Cython is appropriate when you need the full power of C (custom memory management, interoperating with C libraries) or when Numba can’t handle your use case. For most data science performance needs, Numba or vectorized NumPy is sufficient.
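
A minimal build script for the snippet above, assuming it is saved as rolling_stats.pyx (a sketch; newer projects often configure the build in pyproject.toml instead):

# setup.py
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("rolling_stats.pyx"),
    include_dirs=[np.get_include()],  # required because the module cimports numpy
)

Build in place with python setup.py build_ext --inplace, then import rolling_stats like any other module.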

Parallelism: Multiprocessing vs Threading

Python’s Global Interpreter Lock (GIL) prevents multiple threads from executing Python bytecode simultaneously. This makes threading nearly useless for CPU-bound Python code.

Threading: works for I/O-bound tasks (network requests, disk reads). Multiple threads can issue I/O operations and wait concurrently, even with the GIL.
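
A sketch of the I/O-bound case using concurrent.futures (the URLs are hypothetical placeholders):

from concurrent.futures import ThreadPoolExecutor
import urllib.request

urls = ["https://example.com/a", "https://example.com/b"]  # hypothetical endpoints

def fetch(url):
    with urllib.request.urlopen(url) as resp:
        return resp.read()

# The GIL is released while each thread waits on the network,
# so the requests overlap even though this is "just" threading.
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fetch, urls))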

Multiprocessing: bypasses the GIL by spawning separate processes. Each process has its own Python interpreter and memory space. Suitable for CPU-bound parallel work.

from multiprocessing import Pool
import numpy as np
import pandas as pd

def process_chunk(chunk_data):
    # heavy computation on a chunk of data
    return chunk_data.apply(some_expensive_function)

# Split a large DataFrame into chunks, process in parallel
chunks = np.array_split(df, 8)  # 8 cores
with Pool(8) as pool:
    results = pool.map(process_chunk, chunks)

result = pd.concat(results)

Overhead caveat: multiprocessing has significant overhead from spawning processes and serializing (pickling) data between them. Only use it when the work is CPU-bound and each task is large enough that computation dwarfs the serialization cost. For small tasks, the overhead dominates and multiprocessing is slower than sequential.

Python-Specific Patterns That Silently Kill Performance

Avoid Attribute Lookup in Tight Loops

import math

# Slow: looks up math.__dict__['sqrt'] on every iteration
for x in data:
    result = math.sqrt(x)

# Fast: bind the function once
sqrt = math.sqrt
for x in data:
    result = sqrt(x)

The dot operator triggers __getattribute__, which is a dictionary lookup. In a loop running millions of times, this adds up.

Use Generators for Large Sequences

# Builds entire list in memory
total = sum([x**2 for x in range(10_000_000)])

# Generator computes one element at a time — constant memory
total = sum(x**2 for x in range(10_000_000))

List comprehensions build the full list before processing. Generator expressions yield one element at a time. For large sequences that feed into sum, max, min, or other aggregation, use generators.

Choose Data Structures for Their Access Patterns

Operation         List    Set    Dict
Membership test   O(n)    O(1)   O(1)
Append            O(1)    -      -
Index access      O(1)    -      O(1)

If you’re checking if x in collection in a loop, use a set — not a list. The difference can easily reach 100x once the collection holds thousands of elements.
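
A quick sanity check (sizes and repeat counts are illustrative):

import timeit

items = list(range(100_000))
as_set = set(items)

print(timeit.timeit(lambda: 99_999 in items, number=1_000))   # O(n): scans the list
print(timeit.timeit(lambda: 99_999 in as_set, number=1_000))  # O(1): hash lookup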

Use join for String Concatenation

# Slow: creates a new string object on every +
result = ""
for s in strings:
    result += s

# Fast: joins in one pass
result = "".join(strings)

String + in a loop recopies everything accumulated so far on each iteration — O(n²) character copies in total. join does it in a single O(n) pass.

Itertools for Memory-Efficient Iteration

import itertools

# Cartesian product without building the full list
for a, b in itertools.product(range(1000), range(1000)):
    process(a, b)

# Chaining multiple iterables
for item in itertools.chain(list1, list2, list3):
    process(item)

# Sliding window
from itertools import islice
def sliding_window(iterable, n):
    it = iter(iterable)
    window = tuple(islice(it, n))
    if len(window) == n:
        yield window
    for x in it:
        window = window[1:] + (x,)
        yield window
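
For example:

>>> list(sliding_window([1, 2, 3, 4, 5], 3))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]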

The Performance Decision Tree

When code is too slow, ask in order:

  1. Can I vectorize this with NumPy? If yes, do that first. Biggest return.
  2. Is the bottleneck I/O? If yes, use async or threading — no point optimizing the Python code.
  3. Do I need a loop that can’t be vectorized? Try Numba @jit.
  4. Is this embarrassingly parallel? Use multiprocessing if tasks are large enough to justify overhead.
  5. Is it a data structure problem? Switch list to set, or optimize access patterns.
  6. Is it calling external code? Profile the external dependency; sometimes there’s a faster equivalent.

Only move to Cython or C extensions if all of the above are exhausted. Most data science performance problems are solved at step 1 or 2.

Measure first. Optimize the right thing. Measure again to verify you actually made it faster.

python performance numpy numba profiling optimization vectorization
