automation Tier-1 Australian Bank · 2021–2022

Regulatory ETL Across 70+ Mainframe Systems for ANZ's APRA Reporting

Built and automated ETL pipelines across 70+ heterogeneous source systems - including EBCDIC mainframe feeds - feeding ANZ Bank's APRA regulatory reports, with a Robot Framework test layer that caught data errors before submission and cut development and testing time by 50%.

Problem

ANZ's APRA regulatory reports were assembled from data flowing through 70+ heterogeneous source systems - ASCII, semi-ASCII, and EBCDIC mainframe encodings - with development and testing done almost entirely by hand. A single wrong byte could silently corrupt a regulatory submission, and there was no automated way to catch it before it reached APRA.

Outcome

Delivered Python-driven DataStage job generation plus an end-to-end Robot Framework test harness covering all 70+ source systems, cutting development and testing time by 50% and catching data errors at the field level before deployment.

Impact - who used it & what changed

ANZ's compliance team submitted APRA regulatory reports generated by the pipelines I built. The domain has zero tolerance for error, and the automation cut development and testing time by 50%.

Who Used This and What Changed

ANZ’s compliance team submits regulatory reports to APRA, the Australian Prudential Regulation Authority. Those reports were generated by the data pipelines I built and tested. In this domain there is no tolerance for error - a wrong number in a regulatory submission is a legal event, not a bug ticket. The work I owned halved the time it took to develop and test those pipelines, and it added the automated checks that caught data errors before they reached the regulator.

This was my first production role, straight out of IIT Bombay. I joined ANZ’s enterprise data automation team and ended up owning significant parts of the infrastructure feeding APRA reporting. The job sounds dry. It was not. Banking compliance at this level exposes every assumption you make about data quality, system reliability, and what “good enough” actually means. Spoiler: in this domain, good enough does not exist.

The Problem

Banking regulatory data pipelines belong to a specific class of engineering problem. The correctness requirements are absolute. The data sources are heterogeneous and legacy. The cost of errors is regulatory, not just operational.

ANZ’s data was flowing from 70+ source systems - different encodings, schemas, and update frequencies - into a centralized regulatory reporting layer that fed APRA submissions. The development and testing process was almost entirely manual. Engineers would write DataStage jobs, test them by hand against sample data, and chase discrepancies through multiple system layers.

Slow. Error-prone. Not scalable as the number of source systems grew. And every undetected data quality error was a potential compliance failure downstream.

Why It Mattered

APRA regulatory reporting is not optional. Late or incorrect data is a compliance risk with material consequences - it is not a ticket in the backlog. The engineering team was spending more time on manual testing and debugging than on building new capabilities, and the manual process meant errors could slip through to a report that compliance officers would sign and submit to the regulator. The bottleneck was structural, and the stakes were legal.

Data & Inputs

Understanding EBCDIC encoding - and why a single wrong character can silently corrupt a downstream calculation - was one of the first things I learned on the job. One wrong byte in a regulatory report is not a technical debt item. It is a legal event.
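To make the "silent corruption" point concrete, here is a minimal sketch using Python's built-in codecs. The field value, the choice of `cp037` as the EBCDIC code page, and the `parse_amount` helper are all assumptions for illustration, not the actual feed layout or tooling.

```python
# Hypothetical fixed-width amount field; cp037 is one common EBCDIC code page
# and is an assumption here, not necessarily the one the feeds used.
EBCDIC = "cp037"

amount_bytes = "0001250".encode(EBCDIC)  # bytes F0 F0 F0 F1 F2 F5 F0

# Decoding with the right code page recovers the text.
decoded_ok = amount_bytes.decode(EBCDIC)

# Latin-1 maps every byte to some character, so the wrong code page
# raises no exception; it just yields garbage.
decoded_wrong = amount_bytes.decode("latin-1")

# One altered byte (0xF1 -> 0xF7) still decodes to a valid-looking number.
corrupted = bytearray(amount_bytes)
corrupted[3] = 0xF7
decoded_corrupted = bytes(corrupted).decode(EBCDIC)


def parse_amount(raw: bytes, encoding: str = EBCDIC) -> int:
    """Decode a numeric field and reject anything that is not plain ASCII digits."""
    text = raw.decode(encoding)
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"non-numeric field after decoding: {text!r}")
    return int(text)
```

A format check like `parse_amount` catches the wrong-code-page case, but not the corrupted byte: that record still parses cleanly, just to the wrong value. Catching it requires comparing against the source, which is why field-level reconciliation matters.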

Approach

Two-component automation strategy.

Component 1: ETL development automation

Analyzed the patterns in existing IBM DataStage jobs and identified the repetitive structural elements - source connection setup, field mapping, data type handling, error logging. Built Python tooling to generate DataStage job templates from configuration files. Specify-then-generate rather than build-from-scratch. New job creation dropped from days to hours.
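The specify-then-generate idea can be sketched roughly as follows. This is an illustration only: the config schema, the class and function names, and the text format of the rendered spec are all hypothetical, and the output is not a real DataStage export format.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class FieldMapping:
    source: str   # column name in the source feed
    target: str   # column name in the reporting layer
    dtype: str    # target data type


@dataclass
class JobConfig:
    job_name: str
    source_system: str
    encoding: str
    mappings: List[FieldMapping] = field(default_factory=list)


def load_config(raw: str) -> JobConfig:
    """Parse a JSON job description into typed config objects."""
    data = json.loads(raw)
    mappings = [FieldMapping(**m) for m in data.pop("mappings")]
    return JobConfig(mappings=mappings, **data)


def render_job_spec(cfg: JobConfig) -> str:
    """Render the repetitive parts of a job (hypothetical text format)."""
    lines = [
        f"JOB {cfg.job_name}",
        f"  SOURCE {cfg.source_system} ENCODING {cfg.encoding}",
    ]
    for m in cfg.mappings:
        lines.append(f"  MAP {m.source} -> {m.target} AS {m.dtype}")
    lines.append("  ON_ERROR LOG_AND_REJECT")
    return "\n".join(lines)


# Hypothetical config for one source system.
EXAMPLE_CONFIG = """
{
  "job_name": "load_customer_feed",
  "source_system": "CORE_BANKING",
  "encoding": "cp037",
  "mappings": [
    {"source": "CUST_ID", "target": "customer_id", "dtype": "integer"},
    {"source": "CUST_NM", "target": "customer_name", "dtype": "varchar(60)"}
  ]
}
"""

job_spec = render_job_spec(load_config(EXAMPLE_CONFIG))
```

The point of the design is that the connection setup, mappings, and error handling live in one reviewed template, so a new source system needs only a new config file.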

Component 2: Automated test framework

Built an end-to-end test automation framework using Robot Framework and Python, covering all 70+ source systems and checking data at the field level.

Engineers could test a new DataStage job in minutes rather than hours. Tests ran automatically before any production deployment. That shift eliminated an entire class of production incidents.
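A field-level check of this kind might look like the sketch below. Robot Framework can import a plain Python module as a keyword library, with each function exposed as a keyword that fails when it raises an exception. The function names, row shape, and `key` parameter here are assumptions for illustration, not the original framework's API.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
Mismatch = Tuple[str, str, str, str]  # (key value, field name, expected, actual)


def find_field_mismatches(
    source_rows: List[Row], target_rows: List[Row], key: str
) -> List[Mismatch]:
    """Compare every source field against the matching target row."""
    target_by_key = {row[key]: row for row in target_rows}
    mismatches: List[Mismatch] = []
    for src in source_rows:
        tgt = target_by_key.get(src[key])
        if tgt is None:
            # The whole row is missing from the target.
            mismatches.append((src[key], "<row>", "present", "missing"))
            continue
        for name, expected in src.items():
            actual = tgt.get(name)
            if actual != expected:
                mismatches.append((src[key], name, expected, str(actual)))
    return mismatches


def rows_should_match(
    source_rows: List[Row], target_rows: List[Row], key: str = "id"
) -> None:
    """Raise AssertionError listing every mismatch; usable as a test keyword."""
    mismatches = find_field_mismatches(source_rows, target_rows, key)
    if mismatches:
        details = "; ".join(
            f"{k}.{name}: expected {exp!r}, got {act!r}"
            for k, name, exp, act in mismatches
        )
        raise AssertionError(f"{len(mismatches)} field mismatch(es): {details}")
```

Reporting every mismatch in one failure message, rather than stopping at the first, keeps the debugging loop short when a mapping error affects many fields at once.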

Engineering & Implementation

The framework was built to be usable by other engineers, not just by me. Documentation and onboarding were part of the design from the start. That decision is why it got adopted.

Results & Impact

Limitations & What I’d Do Differently

The test framework was strong on happy-path coverage but required manual work to add new edge cases. A property-based testing approach - generating test inputs programmatically from specs - would have improved coverage depth without the manual overhead.
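As a rough idea of what spec-driven input generation could look like, here is a standard-library-only sketch; in practice a dedicated property-based testing library would likely be a better fit. The field spec, the fixed-width layout, and the `cp037` code page are all hypothetical.

```python
import random
import string
from typing import Dict, List, Tuple

# Hypothetical spec: (field name, width, allowed characters).
FieldSpec = List[Tuple[str, int, str]]

FIELD_SPEC: FieldSpec = [
    ("account_id", 8, string.digits),
    ("branch", 4, string.ascii_uppercase),
    ("amount", 10, string.digits),
]


def generate_record(rng: random.Random, spec: FieldSpec) -> Dict[str, str]:
    """Build one random record that conforms to the spec."""
    return {
        name: "".join(rng.choice(alphabet) for _ in range(width))
        for name, width, alphabet in spec
    }


def to_fixed_width(record: Dict[str, str], spec: FieldSpec, encoding: str = "cp037") -> bytes:
    """Serialise a record as a fixed-width EBCDIC line."""
    return "".join(record[name] for name, _, _ in spec).encode(encoding)


def parse_fixed_width(raw: bytes, spec: FieldSpec, encoding: str = "cp037") -> Dict[str, str]:
    """Decode a fixed-width line and slice it back into named fields."""
    text = raw.decode(encoding)
    record, offset = {}, 0
    for name, width, _ in spec:
        record[name] = text[offset:offset + width]
        offset += width
    return record


def check_round_trip(spec: FieldSpec, cases: int = 200, seed: int = 0) -> int:
    """Property: parsing a serialised record returns the original record."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(cases):
        record = generate_record(rng, spec)
        assert parse_fixed_width(to_fixed_width(record, spec), spec) == record
    return cases
```

The property checked here is deliberately simple (a round trip); the same generator could feed any transformation whose expected output can be derived from the spec.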

The DataStage job generation was template-based, not model-based. It could not handle genuinely novel job types without manual template extension. A more principled DSL-to-DataStage compiler would have been more powerful, though the complexity tradeoff might not have been worth it for the team’s needs at the time.

Stack

IBM DataStage, Teradata, Control-M, Robot Framework, Python, SQL, ASCII/EBCDIC file handling
