Why DSA Matters for Data Scientists
Data scientists don’t spend most of their time writing merge sort from scratch. But the underlying intuition, how the cost of an operation scales with input size, comes up constantly. A model training loop that’s O(n²) in the number of samples will not scale to production data. A nearest-neighbor search that works on 10K points will be unusable at 10M. Understanding complexity is a prerequisite for reasoning about whether an approach will work at scale.
Space-Time Complexity
The aim of complexity analysis is to measure how much time and space a program takes as a function of input size.
The core question: given an input of size n, how does the number of operations (time) or memory usage (space) grow as n increases?
Example: checking whether an element x exists in a list of n elements.
```python
# Linear search: scan until a match is found (worst case: all n elements)
l = [1, 2, 3, 4]  # n = 4
x = 5

for i in l:
    if i == x:
        print('element exists')
        break
else:  # loop finished without a break: x is absent
    print('element does not exist')
```
In the worst case (x not in the list), this touches every element — n comparisons. Time complexity is O(n): linear in the size of the list.
Common complexity classes (from fastest to slowest):
| Notation | Name | Example |
|---|---|---|
| O(1) | Constant | Dictionary lookup by key |
| O(log n) | Logarithmic | Binary search in sorted array |
| O(n) | Linear | Linear scan, single pass |
| O(n log n) | Linearithmic | Merge sort, heap sort |
| O(n²) | Quadratic | Nested loops, bubble sort |
| O(2ⁿ) | Exponential | Brute-force subset enumeration |
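To make the gap between O(n) and O(log n) concrete, here is a minimal sketch using Python’s standard-library bisect module: on a sorted list, a binary search answers the same membership question as a linear scan, but in logarithmic time.

```python
import bisect

data = list(range(1_000_000))  # sorted by construction
target = 999_999

# O(n): the `in` operator on a list checks elements one by one
found_scan = target in data

# O(log n): binary search halves the search interval each step
i = bisect.bisect_left(data, target)
found_bisect = i < len(data) and data[i] == target

print(found_scan, found_bisect)  # True True
```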
Space complexity follows the same notation but measures memory. A function that creates a copy of the input array is O(n) space. A function that operates in-place is O(1) space.
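As one illustration (the function names here are invented for the example), reversing a list can be done either way:

```python
def reversed_copy(arr):
    # O(n) extra space: builds a full copy of the input
    return arr[::-1]

def reverse_in_place(arr):
    # O(1) extra space: swaps pairs using two moving indices
    lo, hi = 0, len(arr) - 1
    while lo < hi:
        arr[lo], arr[hi] = arr[hi], arr[lo]
        lo, hi = lo + 1, hi - 1
    return arr

print(reversed_copy([1, 2, 3]))     # [3, 2, 1], original untouched
print(reverse_in_place([1, 2, 3]))  # [3, 2, 1], modified in place
```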
Amortized Complexity
Some operations are occasionally expensive but cheap on average. Python’s list.append() is O(1) amortized: most appends are O(1), but occasionally the list needs to resize (O(n) for that one operation). Over many appends, the average cost per operation is still O(1).
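One way to see this, relying on a CPython implementation detail (exact sizes vary by version), is to watch the list’s allocated size: it jumps only at the occasional resize.

```python
import sys

lst = []
last = sys.getsizeof(lst)
for i in range(64):
    lst.append(i)             # O(1) amortized
    size = sys.getsizeof(lst)
    if size != last:          # a resize: this one append paid O(n)
        print(f'resize at length {len(lst)}: {last} -> {size} bytes')
        last = size
```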
This matters when analyzing data pipelines: an occasional expensive operation (index rebuild, cache invalidation) doesn’t change the overall throughput if it’s rare enough.
Common Data Structure Complexities
| Structure | Access | Search | Insert | Delete |
|---|---|---|---|---|
| Array / List | O(1) | O(n) | O(n) | O(n) |
| Hash Table | O(1) avg | O(1) avg | O(1) avg | O(1) avg |
| Binary Search Tree | O(log n) avg | O(log n) avg | O(log n) avg | O(log n) avg |
| Heap | O(1) (min) | O(n) | O(log n) | O(log n) |
| Linked List | O(n) | O(n) | O(1) | O(1) |
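The BST figures are averages: an unbalanced tree degrades to O(n). For heaps, Python’s heapq module keeps the smallest element at index 0, which is why peeking at the minimum is O(1) while insertion and removal are O(log n). A minimal sketch:

```python
import heapq

# A binary min-heap: the smallest element is always at index 0
h = []
for value in [7, 2, 9, 4]:
    heapq.heappush(h, value)  # O(log n) per push

print(h[0])              # O(1) peek at the minimum -> 2
print(heapq.heappop(h))  # O(log n) removal of the minimum -> 2
```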
Hash tables (Python dict, set) are the most useful structure for data science work — fast lookups, fast membership checks, and the backbone of groupby/aggregation operations.
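As a sketch of why that matters (the sales rows below are invented for illustration), a dict-backed aggregation performs one O(1)-average update per row, so grouping n rows costs O(n) overall:

```python
from collections import defaultdict

rows = [('east', 10), ('west', 5), ('east', 7), ('north', 3)]

totals = defaultdict(int)
for region, amount in rows:
    totals[region] += amount  # O(1) average lookup + insert

print(dict(totals))  # {'east': 17, 'west': 5, 'north': 3}

# Set membership checks are also O(1) on average
regions = set(totals)
print('east' in regions)  # True, constant time regardless of size
```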
Learning Resources
For building fundamentals:
- Data Structures and Algorithms in Python — Jovian — structured course with Python implementations
- Abdul Bari — Algorithms — clear lecture-style explanations of major algorithm families
- GeeksforGeeks — Data Structures — reference for individual data structure implementations
- Space & Time Complexity Reference
- Distill.pub — not DSA specifically, but the best source for visual, rigorous explanations of ML concepts that require algorithmic thinking