
The Data Engineering Stack: A Practitioner's Map

A structured map of the data engineering landscape — from OS fundamentals and SQL through distributed compute, streaming, cloud services, and orchestration. Built from real project experience across Blue Yonder, Mastertrust, and independent data systems work.

Data engineering is a broad discipline. People enter it from databases, software engineering, data science, or DevOps — and the required surface area is genuinely large. This is my working map of the stack, organized by layer and built from what I actually needed to know across production systems handling 5TB+ of data.

Foundations: OS and Scripting

Everything in data engineering runs on Linux. If you can’t navigate a filesystem, write a shell script, or debug a cron job at the terminal, you’re dependent on tooling abstractions that will fail you at the worst moments.

Databases: Core DBMS Concepts

Before you touch any distributed system, you need relational fundamentals.

DBMS theory worth knowing:

SQL fluency required:
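
Whatever the exact checklist, window functions are a reasonable litmus test. Here is a minimal sketch using Python's built-in sqlite3 module (table, columns, and data are illustrative; window functions need SQLite 3.25+):

```python
import sqlite3

# In-memory toy table (all names and values are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, order_date TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, '2024-01-01', 120.0),
        (1, '2024-01-05', 80.0),
        (2, '2024-01-02', 200.0);
""")

# Running total per customer -- the window-function idiom that separates
# "can write SELECT" from genuine SQL fluency.
rows = conn.execute("""
    SELECT customer_id, order_date, amount,
           SUM(amount) OVER (
               PARTITION BY customer_id
               ORDER BY order_date
           ) AS running_total
    FROM orders
    ORDER BY customer_id, order_date
""").fetchall()

for row in rows:
    print(row)
```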

Big Data Fundamentals

Understanding distributed systems conceptually before touching frameworks saves enormous confusion later.

Core concepts:

File formats matter:

Data types:
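
To make the file-format and data-type points concrete, here is a small sketch (assuming pandas with pyarrow installed; column names are illustrative). Columnar formats like Parquet carry an explicit schema, so types survive a round trip that CSV would flatten to strings:

```python
import pandas as pd

# Build a frame with explicit dtypes (names and values are illustrative).
df = pd.DataFrame({
    "event_id": pd.array([1, 2, 3], dtype="int64"),
    "ts": pd.to_datetime(["2024-01-10", "2024-01-11", "2024-01-12"]),
    "amount": pd.array([9.99, 24.50, 3.10], dtype="float64"),
})

# Parquet is columnar and stores the schema alongside the data.
df.to_parquet("events.parquet")

back = pd.read_parquet("events.parquet")
print(back.dtypes)  # int64 / datetime64 / float64 survive the round trip
```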

Data Warehousing

Data warehouses follow different design patterns than transactional systems: they optimize for large analytical reads over historical data rather than small, frequent writes.
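
One canonical warehouse pattern is dimensional modeling: a central fact table joined to denormalized dimension tables (a star schema). A toy sketch with Python's sqlite3, where every table and column name is illustrative:

```python
import sqlite3

# Star schema in miniature: facts reference dimensions; analytical
# queries slice the facts by dimension attributes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, sold_on TEXT, amount REAL);
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_sales VALUES
        (1, '2024-01-01', 10.0),
        (2, '2024-01-01', 60.0),
        (1, '2024-01-02', 15.0);
""")

# Revenue by category: the typical warehouse query shape.
for row in conn.execute("""
    SELECT p.category, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_product p USING (product_id)
    GROUP BY p.category
"""):
    print(row)
```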

Big Data Frameworks

Apache Hadoop

Hadoop is the foundational layer — most modern frameworks are built on or react against its architecture.

Apache Hive

Hive adds SQL semantics on top of HDFS. Useful for batch ETL on Hadoop infrastructure.

Apache Spark (the core skill)

Spark replaced MapReduce as the default distributed compute engine. If you only learn one framework, make it Spark.
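
Most batch pipelines reduce to the same read-transform-aggregate-write shape. A minimal PySpark sketch (assumes pyspark is installed; the S3 paths and column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-aggregation").getOrCreate()

# Read columnar input (path is illustrative; S3 access needs the hadoop-aws jars).
df = spark.read.parquet("s3://my-bucket/events/")

# Transform and aggregate: event count and revenue per day.
daily = (
    df.withColumn("day", F.to_date("ts"))
      .groupBy("day")
      .agg(
          F.count(F.lit(1)).alias("events"),
          F.sum("amount").alias("revenue"),
      )
)

# Write results back as Parquet (path is illustrative).
daily.write.mode("overwrite").parquet("s3://my-bucket/daily_agg/")
```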

Data Movement

Orchestration

Pipelines need scheduling, dependency management, and failure handling.
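
A minimal Airflow sketch showing all three concerns in one place (assumes Airflow 2.4+; the DAG, task names, and commands are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_etl",                    # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                     # scheduling (schedule_interval before 2.4)
    catchup=False,
    default_args={"retries": 2},           # failure handling: retry before failing
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    extract >> transform >> load           # dependency management
```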

NoSQL Databases

Different access patterns require different database architectures.

| Database | Use case |
| --- | --- |
| HBase | Wide-column, Hadoop-native, low-latency reads on large datasets |
| Cassandra (DataStax) | High write throughput, multi-region, no single point of failure |
| Elasticsearch | Full-text search, log analytics (part of the ELK stack) |
| MongoDB | Document store, flexible schema, good for semi-structured data |
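
As one example of how access patterns drive the choice, the flexible-schema property in the last row looks like this in practice (a sketch assuming pymongo and a local MongoDB; all names are illustrative):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["demo"]["events"]  # database and collection are illustrative

# Two documents in the same collection can carry different fields --
# no schema migration required.
events.insert_one({"user": "a", "action": "login"})
events.insert_one({"user": "b", "action": "purchase", "amount": 19.99})

print(events.find_one({"action": "purchase"}))
```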

Messaging and Streaming

Apache Kafka is the backbone of modern streaming architectures. It provides durable, ordered message logs that decouple producers from consumers. Used for real-time event pipelines, change data capture, and stream processing with Flink or Spark Streaming.
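
A minimal sketch of that producer/consumer decoupling using the kafka-python client (broker address and topic name are illustrative):

```python
import json

from kafka import KafkaConsumer, KafkaProducer

# Producer side: append JSON events to a durable, ordered topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("user-events", {"user_id": 42, "action": "login"})
producer.flush()

# Consumer side: a separate process replays the same log at its own pace.
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for msg in consumer:
    print(msg.value)
    break  # demo only: stop after the first message
```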

Dashboarding and Visualization

Cloud: AWS Data Services

Most production data infrastructure runs on cloud-managed services. AWS is the largest ecosystem.

| Category | Services |
| --- | --- |
| Compute | EC2 (on-demand VMs), EMR (managed Hadoop/Spark) |
| Storage | S3 (object store), EFS (filesystem) |
| Access management | IAM, Secrets Manager |
| Relational databases | RDS (managed PostgreSQL/MySQL), Redshift (data warehouse), Athena (serverless SQL on S3) |
| NoSQL | DynamoDB |
| Serverless compute | Lambda |
| ETL | AWS Glue (managed Spark) |
| Scheduling | CloudWatch Events |
| Messaging | SNS (pub/sub), SQS (queue), Kinesis (real-time streaming) |

The common pattern: raw data lands in S3; Glue or EMR processes it; results land in Redshift or RDS; Athena enables ad-hoc SQL on raw S3; CloudWatch triggers the schedule; Kinesis handles the real-time path.
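
One leg of that pattern, ad-hoc SQL over raw S3 with Athena, looks roughly like this with boto3 (assumes configured AWS credentials; the region, database, table, and bucket names are illustrative):

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Kick off a query against data sitting in S3; Athena writes results to S3 too.
resp = athena.start_query_execution(
    QueryString="SELECT day, COUNT(*) AS events FROM raw_events GROUP BY day",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)

# Execution is asynchronous: poll get_query_execution() with this ID until
# the state reaches SUCCEEDED, then fetch rows via get_query_results().
print(resp["QueryExecutionId"])
```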

What I Actually Use

In production work, the stack I reach for most:

The full stack above matters for system design, architecture decisions, and debugging — even when you’re not writing Hadoop jobs directly.
