Data engineering is a broad discipline. People come to it from database, software engineering, data science, or DevOps backgrounds — and the required surface area is genuinely large. This is my working map of the stack, organized by layer, built from what I actually needed to know across production systems handling 5TB+ of data.
Foundations: OS and Scripting
Everything in data engineering runs on Linux. If you can’t navigate a filesystem, write a shell script, or debug a cron job at the terminal, you’re dependent on tooling abstractions that will fail you at the worst moments.
- Linux / Unix — file permissions, process management, networking basics, systemd
- Shell scripting — automation, ETL glue code, deployment scripts
- Data Structures & Algorithms — arrays, linked lists, stacks, queues, trees, graphs, dynamic programming, sorting/searching. Not for interviews — for understanding why your Spark join is slow.
Databases: Core DBMS Concepts
Before you touch any distributed system, you need relational fundamentals.
DBMS theory worth knowing:
- DDL / DML / DCL
- ACID properties and what they mean for transactions (a minimal transaction sketch follows this list)
- Concurrency control and deadlock
- Indexing strategies and hashing
- Normalization forms (1NF through 3NF and when to denormalize)
- Views, stored procedures, integrity constraints
- ER diagram design
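As a concrete anchor for the ACID point above, here is a minimal transaction sketch in Python. It assumes psycopg2 as the Postgres driver and a hypothetical `accounts` table; it illustrates commit/rollback semantics, nothing more.

```python
# Minimal sketch of ACID transaction semantics with psycopg2 (assumed driver);
# the connection string and the accounts table are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=shop user=etl")
try:
    with conn.cursor() as cur:
        # Atomicity: both updates succeed together or not at all.
        cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = %s", (1,))
        cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = %s", (2,))
    conn.commit()      # durability: the transfer survives a crash once committed
except Exception:
    conn.rollback()    # consistency: partial work is discarded
    raise
finally:
    conn.close()
```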
SQL fluency required:
- Transactional databases: PostgreSQL, MySQL
- All join types (inner, left, right, full, cross, self)
- Nested queries and CTEs
- `GROUP BY` with aggregates
- `CASE WHEN` for conditional logic
- Window functions (`ROW_NUMBER`, `RANK`, `LAG`, `LEAD`, `SUM OVER`) — these separate intermediate SQL from advanced SQL (see the sketch after this list)
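Here is a sketch of those window functions through Spark SQL, chosen so all the examples in this post stay in Python; plain PostgreSQL takes the same query. The `events` table and its columns are hypothetical, and the table is assumed to be registered as a view.

```python
# Window functions over a hypothetical `events` table registered as a view.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("window-demo").getOrCreate()

ranked = spark.sql("""
    SELECT user_id,
           event_time,
           amount,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) AS rn,
           LAG(amount)  OVER (PARTITION BY user_id ORDER BY event_time)      AS prev_amount,
           SUM(amount)  OVER (PARTITION BY user_id ORDER BY event_time)      AS running_total
    FROM events
""")

latest_per_user = ranked.filter("rn = 1")   # most recent event per user
```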
Big Data Fundamentals
Understanding distributed systems conceptually before touching frameworks saves enormous confusion later.
Core concepts:
- What is Big Data? The 5 V’s: Volume, Velocity, Variety, Veracity, Value
- Distributed computation vs. distributed storage
- Vertical vs. horizontal scaling and why horizontal won
- Commodity hardware clusters and their failure assumptions
File formats matter:
- CSV — universal but inefficient for analytics
- JSON — flexible, verbose
- Avro — schema evolution, row-based
- Parquet — columnar, compression-friendly, the default for analytics (see the read/write sketch after this list)
- ORC — columnar, Hive-optimized
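A quick way to feel the difference is to write the same frame both ways with pandas; pyarrow is assumed to be installed for Parquet support, and the file names and data are illustrative.

```python
# CSV vs. Parquet side by side with pandas; pyarrow is assumed for Parquet I/O.
import pandas as pd

df = pd.DataFrame({"user_id": range(1_000_000), "amount": 1.0})

df.to_csv("events.csv", index=False)   # row-oriented text: large, untyped
df.to_parquet("events.parquet")        # columnar, compressed, carries the schema

# Columnar wins when you only need a subset of columns:
amounts = pd.read_parquet("events.parquet", columns=["amount"])
```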
Data types:
- Structured (RDBMS), Semi-structured (JSON, XML), Unstructured (logs, images, text)
Data Warehousing
Data warehouses have different design patterns than transactional systems.
- OLAP vs OLTP — analytical queries vs. transactional writes require fundamentally different schema designs
- Dimension and fact tables — the building blocks of warehouse design
- Star schema — one central fact table, denormalized dimensions, fast for BI queries (see the query sketch after this list)
- Snowflake schema — normalized dimensions, storage-efficient, slower joins
- Warehouse design involves tradeoffs between query speed, storage cost, and update complexity
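The star-schema query sketch referenced above, again through Spark SQL so the examples stay in one language; the fact and dimension tables are hypothetical.

```python
# Star-schema query: one fact table joined to denormalized dimensions.
# fact_sales, dim_store, and dim_calendar are hypothetical tables.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

revenue_by_region = spark.sql("""
    SELECT d.region,
           c.year,
           SUM(f.revenue) AS total_revenue
    FROM   fact_sales   f
    JOIN   dim_store    d ON f.store_key = d.store_key
    JOIN   dim_calendar c ON f.date_key  = c.date_key
    GROUP BY d.region, c.year
""")
```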
Big Data Frameworks
Apache Hadoop
Hadoop is the foundational layer — most modern frameworks are built on or react against its architecture.
- HDFS — distributed filesystem with block replication
- MapReduce — the original distributed compute model (worth understanding conceptually; most people now use Spark instead)
- YARN — resource management layer, still relevant in managed Hadoop clusters
Apache Hive
Hive adds SQL semantics on top of HDFS. Useful for batch ETL on Hadoop infrastructure.
- Data loading in various file formats
- Internal vs. external tables (external tables leave data in place when dropped; see the sketch after this list)
- Querying HDFS-backed data via HiveQL
- Partitioning and bucketing for query performance
- Map-side joins for small-table optimization
- UDFs for custom transformation logic
- SerDe for custom serialization/deserialization
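A sketch of the external-table and partitioning points, issued through Spark's Hive support to keep the examples in Python (Hive's own CLI accepts the same DDL). The HDFS path and schema are hypothetical.

```python
# External, partitioned Hive-style table defined via Spark's Hive support
# (enableHiveSupport assumed configured); path and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS clicks (
        user_id BIGINT,
        url     STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/raw/clicks'
""")

# Dropping an external table removes only the metadata; files under LOCATION stay.
spark.sql("MSCK REPAIR TABLE clicks")   # register partitions already on disk
```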
Apache Spark (the core skill)
Spark replaced MapReduce as the default distributed compute engine. If you only learn one framework, make it Spark.
- Spark Core — RDDs, transformations vs. actions, lazy evaluation, DAG execution (a minimal PySpark sketch follows this list)
- Spark SQL — DataFrames, Dataset API, SQL interface, Catalyst optimizer
- Spark Streaming — micro-batch streaming, DStream and Structured Streaming APIs
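The PySpark sketch referenced above: transformations only build a plan, and nothing runs until an action. The S3 paths and column names are hypothetical.

```python
# Lazy evaluation in PySpark: transformations build a DAG, actions execute it.
# The S3 paths and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.read.parquet("s3://bucket/events/")           # no job runs yet
daily = (events
         .filter(F.col("status") == "ok")                    # transformation
         .groupBy("event_date")
         .agg(F.count("*").alias("n_events")))               # still lazy

daily.explain()                                              # the plan Catalyst built
daily.write.parquet("s3://bucket/daily_counts/")             # action: triggers execution
```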
Data Movement
- Apache Sqoop — batch transfer between HDFS and relational databases
- Apache NiFi — visual dataflow tool, useful for routing and transformation without code
- Apache Flume — log and event data collection into HDFS
Orchestration
Pipelines need scheduling, dependency management, and failure handling.
- Apache Airflow — DAG-based workflow scheduler, the industry standard. Python-native DAG definitions, rich UI, large ecosystem of operators. I used it extensively at Blue Yonder for ML pipeline orchestration. A minimal DAG sketch follows this list.
- Azkaban — LinkedIn’s workflow scheduler, simpler than Airflow, used in Hadoop-heavy shops
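The minimal DAG sketch referenced above, assuming Airflow 2.x; the dag_id, schedule, and bash commands are illustrative.

```python
# Minimal Airflow 2.x DAG: two tasks with an explicit dependency.
# The dag_id, schedule, and bash commands are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="python extract.py")
    load = BashOperator(task_id="load", bash_command="python load.py")

    extract >> load   # load runs only after extract succeeds
```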
NoSQL Databases
Different access patterns require different database architectures.
| Database | Use case |
|---|---|
| HBase | Wide-column, Hadoop-native, low-latency reads on large datasets |
| Cassandra (DataStax) | High write throughput, multi-region, no single point of failure |
| Elasticsearch | Full-text search, log analytics (part of the ELK stack) |
| MongoDB | Document store, flexible schema, good for semi-structured data |
Messaging and Streaming
Apache Kafka is the backbone of modern streaming architectures. It provides durable, ordered message logs that decouple producers from consumers. Used for real-time event pipelines, change data capture, and stream processing with Flink or Spark Streaming.
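A producer/consumer sketch using the kafka-python client (one of several Python clients; the choice is an assumption here). The broker address, topic name, and payload are illustrative.

```python
# Kafka producer and consumer with the kafka-python client (assumed library);
# broker address, topic, and payload are illustrative.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page_views", {"user_id": 42, "url": "/pricing"})
producer.flush()

consumer = KafkaConsumer(
    "page_views",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)   # each consumer group tracks its own offset in the log
    break
```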
Dashboarding and Visualization
- Tableau — business user-facing dashboards, drag-and-drop
- Power BI — Microsoft ecosystem, strong enterprise adoption
- Grafana — time-series metrics, infrastructure monitoring, alert dashboards
- Kibana — Elasticsearch-native visualization, part of the ELK stack (Elasticsearch + Logstash + Kibana)
Cloud: AWS Data Services
Most production data infrastructure runs on cloud-managed services. AWS is the largest ecosystem.
| Category | Services |
|---|---|
| Compute | EC2 (on-demand VMs), EMR (managed Hadoop/Spark) |
| Storage | S3 (object store), EFS (filesystem) |
| Access management | IAM, Secrets Manager |
| Relational databases | RDS (managed PostgreSQL/MySQL), Redshift (data warehouse), Athena (serverless SQL on S3) |
| NoSQL | DynamoDB |
| Serverless compute | Lambda |
| ETL | AWS Glue (managed Spark) |
| Scheduling | CloudWatch Events |
| Messaging | SNS (pub/sub), SQS (queue), Kinesis (real-time streaming) |
The common pattern: raw data lands in S3, Glue or EMR processes it, results land in Redshift or RDS, Athena enables ad-hoc SQL on raw S3, CloudWatch triggers the schedule, Kinesis handles the real-time path.
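One slice of that pattern with boto3: drop a file into S3, then fire an ad-hoc Athena query against it. The bucket, database, and paths are hypothetical.

```python
# Landing data in S3 and querying it with Athena via boto3; the bucket,
# Glue/Athena database, and paths are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "daily_events.parquet",
    "my-data-lake",
    "raw/events/dt=2024-01-01/daily_events.parquet",
)

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT count(*) FROM raw_events WHERE dt = '2024-01-01'",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)
```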
What I Actually Use
In production work, the stack I reach for most:
- Python + Pandas/PySpark for data transformation
- PostgreSQL for transactional data, Redshift/BigQuery for analytics
- Airflow for orchestration
- Kafka for event streaming
- S3 as the data lake layer
- Parquet as the default file format
The full stack above matters for system design, architecture decisions, and debugging — even when you’re not writing Hadoop jobs directly.