Overview
My first role out of IIT Bombay was at ANZ Bank’s enterprise data automation team in India. The scope was APRA (Australian Prudential Regulation Authority) regulatory data — the pipeline that ensures the bank’s regulatory reporting is accurate and timely. I joined as a fresh graduate and ended up owning significant parts of the automation infrastructure that reduced development and testing time by 50%.
The Problem
Banking regulatory data pipelines are a specific class of engineering problem: the correctness requirements are absolute, the data sources are heterogeneous and legacy, and the cost of errors is regulatory — not just operational. ANZ’s data was flowing from 70+ source systems with different encodings, schemas, and update frequencies into a centralized regulatory reporting layer.
The development and testing process was largely manual: engineers would write DataStage jobs, test them by hand against sample data, and chase down discrepancies through multiple system layers. This was slow, error-prone, and didn’t scale as the number of source systems grew.
Why It Mattered
APRA regulatory reporting is not optional. Late or incorrect regulatory data is a compliance risk — and in banking, compliance failures have material consequences. The engineering team was spending more time on manual testing and debugging than on building new capabilities. The bottleneck was structural.
Data & Inputs
- 70+ source systems with diverse formats:
  - ASCII flat files
  - Semi-ASCII files with mixed encoding
  - EBCDIC files (IBM mainframe encoding, a specific challenge for engineers from modern data backgrounds)
  - Database extracts from Teradata and other systems
- Control-M job scheduling metadata and dependency chains
- Historical regulatory submission data for validation baselines
Understanding EBCDIC encoding — and why a single wrong character can silently corrupt a downstream calculation — was one of the first things I learned on the job.
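To make the encoding hazard concrete, here is a minimal Python illustration using the standard `cp037` EBCDIC codec. The field value is invented; the point is that a mis-decoded EBCDIC file produces plausible-looking garbage rather than an error:

```python
# EBCDIC (cp037) and ASCII-family encodings assign different byte values
# to the same characters, so reading an EBCDIC file with the wrong codec
# silently corrupts data.

# A hypothetical amount field, encoded as EBCDIC bytes:
amount_ebcdic = "1234.56".encode("cp037")

# Decoded correctly as EBCDIC, the value round-trips:
print(amount_ebcdic.decode("cp037"))

# Decoded as Latin-1, the same bytes become garbage characters,
# and crucially no exception is raised:
print(amount_ebcdic.decode("latin-1"))
```

No decode error means the corruption can travel all the way into a downstream regulatory calculation before anyone notices, which is why format-aware test data mattered so much.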
Approach
The automation strategy had two components:
Component 1: ETL development automation
I analyzed the patterns in existing IBM DataStage jobs and identified the repetitive structural elements: source connection setup, field mapping, data type handling, and error logging. I then wrote Python tooling to generate DataStage job templates from configuration files, turning job creation into specify-then-generate rather than build-from-scratch.
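As a rough sketch of the specify-then-generate idea, the snippet below renders a DataStage-style parameter block from a structured config. The config keys, template fields, and job names are all illustrative, not the actual ANZ tooling:

```python
from string import Template

# Hypothetical parameter-file template; real DataStage parameter files
# have a richer format, but the generation pattern is the same.
JOB_TEMPLATE = Template(
    "JobName=$job_name\n"
    "SourceConnection=$source\n"
    "TargetTable=$target\n"
    "FieldMap=$field_map\n"
)

def render_job(config: dict) -> str:
    """Render one job's parameter block from a config entry."""
    field_map = ";".join(f"{src}:{dst}" for src, dst in config["fields"].items())
    return JOB_TEMPLATE.substitute(
        job_name=config["name"],
        source=config["source"],
        target=config["target"],
        field_map=field_map,
    )

# Example config entry (invented names):
config = {
    "name": "LOAD_CUSTOMER",
    "source": "TERADATA_PROD",
    "target": "RPT_CUSTOMER",
    "fields": {"cust_id": "CUSTOMER_ID", "bal": "BALANCE"},
}
print(render_job(config))
```

The value of this approach is that the repetitive 80% of a job becomes a config entry, while the genuinely custom logic is still written by hand.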
Component 2: Automated test framework
I built an end-to-end test automation framework using Robot Framework and Python. The framework:
- Generated test data covering all file format variants (ASCII, EBCDIC, semi-ASCII)
- Ran DataStage jobs in a test environment
- Compared outputs against expected baselines using column-level validation
- Generated test reports with pass/fail by field, by record, and by job
This meant engineers could test a new DataStage job in minutes rather than hours — and the tests could be run automatically before any production deployment.
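The framework's keywords were exposed to Robot Framework as a Python library. The sketch below shows the shape of such a library; the class, keyword names, and stubbed behavior are hypothetical, and a real implementation would shell out to the DataStage CLI and query Teradata:

```python
class DataStageLibrary:
    """Robot Framework discovers keywords from public method names,
    so `run_datastage_job` becomes the keyword `Run Datastage Job`."""

    def run_datastage_job(self, project: str, job: str) -> int:
        # Stubbed for the sketch; a real keyword would invoke the
        # DataStage job-control CLI and poll until completion.
        print(f"running {project}/{job}")
        return 0

    def job_should_have_succeeded(self, exit_code: int) -> None:
        # Robot Framework treats a raised AssertionError as a test failure.
        if int(exit_code) != 0:
            raise AssertionError(f"job failed with exit code {exit_code}")
```

A test case in a `.robot` file would then read as plain steps: run the job, capture the exit code, assert success, then compare outputs against the baseline.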
Engineering & Implementation
- DataStage job generation: Python-based template engine producing DataStage parameter files from structured config — reducing new job creation from days to hours
- Robot Framework test library: Custom Robot Framework keywords for DataStage job execution, Teradata data validation, and file format handling (including EBCDIC)
- Control-M integration: Scripts to validate job dependency chains and automatically verify execution order
- Validation layer: Field-level comparison with configurable tolerance for numeric fields, exact match for identifiers, and custom rules for date and encoding validation
- Test reporting: HTML test reports with drill-down to field level — giving engineers a precise view of what passed, what failed, and why
The framework was designed to be usable by other engineers in the team — not just by me. Documentation and onboarding were built in from the start.
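The validation layer's core idea, numeric fields compared within a configurable tolerance and identifiers matched exactly, can be sketched in a few lines. Rule names, field names, and the default tolerance here are invented for illustration:

```python
def compare_field(expected, actual, rule: str, tolerance: float = 0.01) -> bool:
    """Return True when a single field passes under the given rule."""
    if rule == "numeric":
        # Numeric fields pass within an absolute tolerance.
        return abs(float(expected) - float(actual)) <= tolerance
    if rule == "exact":
        # Identifiers must match exactly as strings.
        return str(expected) == str(actual)
    raise ValueError(f"unknown rule: {rule}")

def compare_record(expected: dict, actual: dict, rules: dict) -> dict:
    """Per-field pass/fail map for one record."""
    return {f: compare_field(expected[f], actual[f], rules[f]) for f in rules}

# Example (invented fields): a tiny rounding difference in BALANCE
# passes under the numeric rule, while CUSTOMER_ID must match exactly.
rules = {"CUSTOMER_ID": "exact", "BALANCE": "numeric"}
result = compare_record(
    {"CUSTOMER_ID": "C001", "BALANCE": "100.00"},
    {"CUSTOMER_ID": "C001", "BALANCE": "100.004"},
    rules,
)
print(result)  # both fields pass under the default tolerance
```

The per-field pass/fail map is what fed the drill-down HTML reports: engineers saw exactly which field of which record failed, not just that a job diffed.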
Results & Impact
- 50% reduction in development and testing time for new DataStage jobs
- 70+ source systems covered by automated test validation
- Framework adopted by the wider team — became the standard for new pipeline development
- Zero production incidents from pipeline bugs during my tenure — defects were caught by the automated tests before deployment
- Led the Grads4Tech initiative to improve technology adoption culture within the bank
Limitations & What I’d Do Differently
The test framework was strong on happy-path coverage but required manual work to add new edge cases. A property-based testing approach (generating test inputs programmatically from specs) would have improved coverage depth.
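A library like Hypothesis would be the natural tool, but the core idea can be shown with the standard library alone: derive test records programmatically from a field spec instead of hand-writing each case. The spec format and field kinds below are invented for the sketch:

```python
import random

def generate_record(spec: dict, rng: random.Random) -> dict:
    """Generate one test record whose fields satisfy the spec."""
    record = {}
    for field, kind in spec.items():
        if kind == "amount":
            # Two-decimal monetary amounts, including negatives.
            record[field] = round(rng.uniform(-1e6, 1e6), 2)
        elif kind == "id":
            record[field] = "C{:06d}".format(rng.randrange(1_000_000))
        else:
            raise ValueError(f"unknown field kind: {kind}")
    return record

spec = {"CUSTOMER_ID": "id", "BALANCE": "amount"}
rng = random.Random(42)  # seeded for reproducible test runs

# Property: every generated record has exactly the fields the spec demands.
for _ in range(100):
    rec = generate_record(spec, rng)
    assert set(rec) == set(spec)
```

Generating hundreds of structurally valid but unusual records per run would have surfaced the edge cases we instead had to add by hand.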
The DataStage job generation was template-based, not model-based — it couldn’t handle genuinely novel job types without manual template extension. A more principled DSL-to-DataStage compiler would have been more powerful, though the complexity tradeoff might not have been worth it for the team’s needs.
Stack
IBM DataStage, Teradata, Control-M, Robot Framework, Python, SQL, ASCII/EBCDIC file handling