Data engineering is a broad discipline. People come to it from database, software engineering, data science, or DevOps backgrounds — and the required surface area is genuinely large. This is my working map of the stack, organized by layer, built from what I actually needed to know across production systems handling 5TB+ of data.
Foundations: OS and Scripting
Everything in data engineering runs on Linux. If you can’t navigate a filesystem, write a shell script, or debug a cron job at the terminal, you’re dependent on tooling abstractions that will fail you at the worst moments.
- Linux / Unix — file permissions, process management, networking basics, systemd
- Shell scripting — automation, ETL glue code, deployment scripts
- Data Structures & Algorithms — arrays, linked lists, stacks, queues, trees, graphs, dynamic programming, sorting/searching. Not for interviews — for understanding why your Spark join is slow.
Databases: Core DBMS Concepts
Before you touch any distributed system, you need relational fundamentals.
DBMS theory worth knowing:
- DDL / DML / DCL
- ACID properties and what they mean for transactions (a minimal transaction sketch follows this list)
- Concurrency control and deadlock
- Indexing strategies and hashing
- Normalization forms (1NF through 3NF and when to denormalize)
- Views, stored procedures, integrity constraints
- ER diagram design
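As a concrete anchor for the ACID point above, here is a minimal transaction sketch in Python. It assumes psycopg2 as the Postgres driver and a hypothetical `accounts` table; it illustrates commit/rollback semantics, nothing more.

```python
# Minimal sketch of ACID transaction semantics with psycopg2 (assumed driver);
# the connection string and the accounts table are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=shop user=etl")
try:
    with conn.cursor() as cur:
        # Atomicity: both updates succeed together or not at all.
        cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = %s", (1,))
        cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = %s", (2,))
    conn.commit()      # durability: the transfer survives a crash once committed
except Exception:
    conn.rollback()    # consistency: partial work is discarded
    raise
finally:
    conn.close()
```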
SQL fluency required:
- Transactional databases: PostgreSQL, MySQL
- All join types (inner, left, right, full, cross, self)
- Nested queries and CTEs
- `GROUP BY` with aggregates
- `CASE WHEN` for conditional logic
- Window functions (`ROW_NUMBER`, `RANK`, `LAG`, `LEAD`, `SUM OVER`) — these separate intermediate SQL from advanced SQL (see the sketch after this list)
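Here is a sketch of those window functions through Spark SQL, chosen so all the examples in this post stay in Python; plain PostgreSQL takes the same query. The `events` table and its columns are hypothetical, and the table is assumed to be registered as a view.

```python
# Window functions over a hypothetical `events` table registered as a view.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("window-demo").getOrCreate()

ranked = spark.sql("""
    SELECT user_id,
           event_time,
           amount,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) AS rn,
           LAG(amount)  OVER (PARTITION BY user_id ORDER BY event_time)      AS prev_amount,
           SUM(amount)  OVER (PARTITION BY user_id ORDER BY event_time)      AS running_total
    FROM events
""")

latest_per_user = ranked.filter("rn = 1")   # most recent event per user
```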
Big Data Fundamentals
Understanding distributed systems conceptually before touching frameworks saves enormous confusion later.
Core concepts:
- What is Big Data? The 5 V’s: Volume, Velocity, Variety, Veracity, Value
- Distributed computation vs. distributed storage
- Vertical vs. horizontal scaling and why horizontal won
- Commodity hardware clusters and their failure assumptions
File formats matter:
- CSV — universal but inefficient for analytics
- JSON — flexible, verbose
- Avro — schema evolution, row-based
- Parquet — columnar, compression-friendly, the default for analytics (see the read/write sketch after this list)
- ORC — columnar, Hive-optimized
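A quick way to feel the difference is to write the same frame both ways with pandas; pyarrow is assumed to be installed for Parquet support, and the file names and data are illustrative.

```python
# CSV vs. Parquet side by side with pandas; pyarrow is assumed for Parquet I/O.
import pandas as pd

df = pd.DataFrame({"user_id": range(1_000_000), "amount": 1.0})

df.to_csv("events.csv", index=False)   # row-oriented text: large, untyped
df.to_parquet("events.parquet")        # columnar, compressed, carries the schema

# Columnar wins when you only need a subset of columns:
amounts = pd.read_parquet("events.parquet", columns=["amount"])
```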
Data types:
- Structured (RDBMS), Semi-structured (JSON, XML), Unstructured (logs, images, text)
Data Warehousing
Data warehouses have different design patterns than transactional systems.
- OLAP vs OLTP — analytical queries vs. transactional writes require fundamentally different schema designs
- Dimension and fact tables — the building blocks of warehouse design
- Star schema — one central fact table, denormalized dimensions, fast for BI queries (see the query sketch after this list)
- Snowflake schema — normalized dimensions, storage-efficient, slower joins
- Warehouse design involves tradeoffs between query speed, storage cost, and update complexity
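The star-schema query sketch referenced above, again through Spark SQL so the examples stay in one language; the fact and dimension tables are hypothetical.

```python
# Star-schema query: one fact table joined to denormalized dimensions.
# fact_sales, dim_store, and dim_calendar are hypothetical tables.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

revenue_by_region = spark.sql("""
    SELECT d.region,
           c.year,
           SUM(f.revenue) AS total_revenue
    FROM   fact_sales   f
    JOIN   dim_store    d ON f.store_key = d.store_key
    JOIN   dim_calendar c ON f.date_key  = c.date_key
    GROUP BY d.region, c.year
""")
```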
Big Data Frameworks
Apache Hadoop
Hadoop is the foundational layer — most modern frameworks are built on or react against its architecture.
- HDFS — distributed filesystem with block replication
- MapReduce — the original distributed compute model (worth understanding conceptually; most people now use Spark instead)
- YARN — resource management layer, still relevant in managed Hadoop clusters
Apache Hive
Hive adds SQL semantics on top of HDFS. Useful for batch ETL on Hadoop infrastructure.
- Data loading in various file formats
- Internal vs. external tables (external tables leave data in place when dropped; see the sketch after this list)
- Querying HDFS-backed data via HiveQL
- Partitioning and bucketing for query performance
- Map-side joins for small-table optimization
- UDFs for custom transformation logic
- SerDe for custom serialization/deserialization
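A sketch of the external-table and partitioning points, issued through Spark's Hive support to keep the examples in Python (Hive's own CLI accepts the same DDL). The HDFS path and schema are hypothetical.

```python
# External, partitioned Hive-style table defined via Spark's Hive support
# (enableHiveSupport assumed configured); path and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS clicks (
        user_id BIGINT,
        url     STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/raw/clicks'
""")

# Dropping an external table removes only the metadata; files under LOCATION stay.
spark.sql("MSCK REPAIR TABLE clicks")   # register partitions already on disk
```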
Apache Spark (the core skill)
Spark replaced MapReduce as the default distributed compute engine. If you only learn one framework, make it Spark.
- Spark Core — RDDs, transformations vs. actions, lazy evaluation, DAG execution (a minimal PySpark sketch follows this list)
- Spark SQL — DataFrames, Dataset API, SQL interface, Catalyst optimizer
- Spark Streaming — micro-batch streaming, DStream and Structured Streaming APIs
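The PySpark sketch referenced above: transformations only build a plan, and nothing runs until an action. The S3 paths and column names are hypothetical.

```python
# Lazy evaluation in PySpark: transformations build a DAG, actions execute it.
# The S3 paths and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.read.parquet("s3://bucket/events/")           # no job runs yet
daily = (events
         .filter(F.col("status") == "ok")                    # transformation
         .groupBy("event_date")
         .agg(F.count("*").alias("n_events")))               # still lazy

daily.explain()                                              # the plan Catalyst built
daily.write.parquet("s3://bucket/daily_counts/")             # action: triggers execution
```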
Data Movement
- Apache Sqoop — batch transfer between HDFS and relational databases
- Apache NiFi — visual dataflow tool, useful for routing and transformation without code
- Apache Flume — log and event data collection into HDFS
Orchestration
Pipelines need scheduling, dependency management, and failure handling.
- Apache Airflow — DAG-based workflow scheduler, the industry standard. Python-native DAG definitions, rich UI, large ecosystem of operators. I used it extensively at Blue Yonder for ML pipeline orchestration. A minimal DAG sketch follows this list.
- Azkaban — LinkedIn’s workflow scheduler, simpler than Airflow, used in Hadoop-heavy shops
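The minimal DAG sketch referenced above, assuming Airflow 2.x; the dag_id, schedule, and bash commands are illustrative.

```python
# Minimal Airflow 2.x DAG: two tasks with an explicit dependency.
# The dag_id, schedule, and bash commands are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="python extract.py")
    load = BashOperator(task_id="load", bash_command="python load.py")

    extract >> load   # load runs only after extract succeeds
```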
NoSQL Databases
Different access patterns require different database architectures.
| Database | Use case |
|---|---|
| HBase | Wide-column, Hadoop-native, low-latency reads on large datasets |
| Cassandra (DataStax) | High write throughput, multi-region, no single point of failure |
| Elasticsearch | Full-text search, log analytics (part of the ELK stack) |
| MongoDB | Document store, flexible schema, good for semi-structured data |
Messaging and Streaming
Apache Kafka is the backbone of modern streaming architectures. It provides durable, ordered message logs that decouple producers from consumers. Used for real-time event pipelines, change data capture, and stream processing with Flink or Spark Streaming.
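A producer/consumer sketch using the kafka-python client (one of several Python clients; the choice is an assumption here). The broker address, topic name, and payload are illustrative.

```python
# Kafka producer and consumer with the kafka-python client (assumed library);
# broker address, topic, and payload are illustrative.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page_views", {"user_id": 42, "url": "/pricing"})
producer.flush()

consumer = KafkaConsumer(
    "page_views",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)   # each consumer group tracks its own offset in the log
    break
```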
Dashboarding and Visualization
- Tableau — business user-facing dashboards, drag-and-drop
- Power BI — Microsoft ecosystem, strong enterprise adoption
- Grafana — time-series metrics, infrastructure monitoring, alert dashboards
- Kibana — Elasticsearch-native visualization, part of the ELK stack (Elasticsearch + Logstash + Kibana)
Cloud: AWS Data Services
Most production data infrastructure runs on cloud-managed services. AWS is the largest ecosystem.
| Category | Services |
|---|---|
| Compute | EC2 (on-demand VMs), EMR (managed Hadoop/Spark) |
| Storage | S3 (object store), EFS (filesystem) |
| Access management | IAM, Secrets Manager |
| Relational databases | RDS (managed PostgreSQL/MySQL), Redshift (data warehouse), Athena (serverless SQL on S3) |
| NoSQL | DynamoDB |
| Serverless compute | Lambda |
| ETL | AWS Glue (managed Spark) |
| Scheduling | CloudWatch Events |
| Messaging | SNS (pub/sub), SQS (queue), Kinesis (real-time streaming) |
The common pattern: raw data lands in S3, Glue or EMR processes it, results land in Redshift or RDS, Athena enables ad-hoc SQL on raw S3, CloudWatch triggers the schedule, Kinesis handles the real-time path.
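One slice of that pattern with boto3: drop a file into S3, then fire an ad-hoc Athena query against it. The bucket, database, and paths are hypothetical.

```python
# Landing data in S3 and querying it with Athena via boto3; the bucket,
# Glue/Athena database, and paths are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "daily_events.parquet",
    "my-data-lake",
    "raw/events/dt=2024-01-01/daily_events.parquet",
)

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT count(*) FROM raw_events WHERE dt = '2024-01-01'",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)
```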
What I Actually Use
In production work, the stack I reach for most:
- Python + Pandas/PySpark for data transformation
- PostgreSQL for transactional data, Redshift/BigQuery for analytics
- Airflow for orchestration
- Kafka for event streaming
- S3 as the data lake layer
- Parquet as the default file format
The full stack above matters for system design, architecture decisions, and debugging — even when you’re not writing Hadoop jobs directly.