Who Used This and What Changed
ANZ’s compliance team submits regulatory reports to APRA, the Australian Prudential Regulation Authority. Those reports were generated by the data pipelines I built and tested. In this domain there is no tolerance for error - a wrong number in a regulatory submission is a legal event, not a bug ticket. The work I owned cut the time it took to deliver those pipelines in half, and it added the automated checks that caught data errors before they ever reached the regulator.
This was my first production role, straight out of IIT Bombay. I joined ANZ’s enterprise data automation team and ended up owning significant parts of the infrastructure feeding APRA reporting. The job sounds dry. It was not. Banking compliance at this level exposes every assumption you make about data quality, system reliability, and what “good enough” actually means. Spoiler: in this domain, good enough does not exist.
The Problem
Banking regulatory data pipelines belong to a specific class of engineering problem. The correctness requirements are absolute. The data sources are heterogeneous and legacy. The cost of errors is regulatory, not just operational.
ANZ’s data was flowing from 70+ source systems - different encodings, schemas, and update frequencies - into a centralized regulatory reporting layer that fed APRA submissions. The development and testing process was almost entirely manual. Engineers would write DataStage jobs, test them by hand against sample data, and chase discrepancies through multiple system layers.
Slow. Error-prone. Not scalable as the number of source systems grew. And every undetected data quality error was a potential compliance failure downstream.
Why It Mattered
APRA regulatory reporting is not optional. Late or incorrect data is a compliance risk with material consequences - it is not a ticket in the backlog. The engineering team was spending more time on manual testing and debugging than on building new capabilities, and the manual process meant errors could slip through to a report that compliance officers would sign and submit to the regulator. The bottleneck was structural, and the stakes were legal.
Data & Inputs
- 70+ source systems with diverse formats:
  - ASCII flat files
  - Semi-ASCII files with mixed encoding
  - EBCDIC files (IBM mainframe encoding - a specific challenge for engineers from modern data backgrounds)
  - Database extracts from Teradata and other systems
- Control-M job scheduling metadata and dependency chains
- Historical regulatory submission data for validation baselines
Understanding EBCDIC encoding - and why a single wrong character can silently corrupt a downstream calculation - was one of the first things I learned on the job. One wrong byte in a regulatory report is not a technical debt item. It is a legal event.
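To make that failure mode concrete, here is a minimal Python sketch. It assumes code page cp037, one common EBCDIC variant; the code pages actually used by the source systems are not stated here, and the helper name `decode_ebcdic_field` is hypothetical.

```python
# Assumption: the feed uses EBCDIC code page 037. Real mainframe feeds may use
# a different code page, which has to be confirmed per source system.
EBCDIC_CODEC = "cp037"


def decode_ebcdic_field(raw: bytes, codec: str = EBCDIC_CODEC) -> str:
    """Decode a fixed-width EBCDIC text field and strip the space padding."""
    return raw.decode(codec).strip()


# "123" followed by two EBCDIC spaces (0x40), as it might sit in a fixed-width record.
raw_amount = b"\xf1\xf2\xf3\x40\x40"

correct = decode_ebcdic_field(raw_amount)
# Decoding the same bytes as Latin-1 raises no error. It just yields the wrong text.
wrong = raw_amount.decode("latin-1").strip()

print(ascii(correct))  # '123'
print(ascii(wrong))    # '\xf1\xf2\xf3@@'
print(correct.isdigit(), wrong.isdigit())  # True False
```

The point of the sketch is that the wrong decode does not fail. Nothing is raised, so only an explicit check on the decoded value would notice.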
Approach
Two-component automation strategy.
Component 1: ETL development automation. I analyzed the patterns in existing IBM DataStage jobs and identified the repetitive structural elements: source connection setup, field mapping, data type handling, and error logging. I then built Python tooling to generate DataStage job templates from configuration files. Specify-then-generate rather than build-from-scratch. New job creation dropped from days to hours.
Component 2: Automated test framework. I built an end-to-end test automation framework using Robot Framework and Python. The framework:
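The specify-then-generate idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the original tooling: the parameter keys, the config shape, and the function name `render_job_params` are all hypothetical and are not real DataStage parameter names.

```python
from string import Template

# Hypothetical parameter-file layout. The keys are illustrative only.
JOB_TEMPLATE = Template(
    "JOB_NAME=$job_name\n"
    "SOURCE_PATH=$source_path\n"
    "SOURCE_ENCODING=$encoding\n"
    "TARGET_TABLE=$target_table\n"
    "FIELD_MAP=$field_map\n"
    "REJECT_LOG=$reject_log\n"
)


def render_job_params(config: dict) -> str:
    """Render one parameter file from a structured job config.

    A missing config key raises KeyError, so an incomplete spec fails at
    generation time rather than at run time.
    """
    field_map = ",".join(f"{src}:{dst}" for src, dst in config["fields"].items())
    return JOB_TEMPLATE.substitute(
        job_name=config["job_name"],
        source_path=config["source_path"],
        encoding=config["encoding"],
        target_table=config["target_table"],
        field_map=field_map,
        reject_log=f"/logs/{config['job_name']}.rej",  # assumed log location
    )


# Hypothetical job spec, as it might be loaded from a config file.
example_config = {
    "job_name": "load_customer",
    "source_path": "/landing/customer.dat",
    "encoding": "EBCDIC",
    "target_table": "STG.CUSTOMER",
    "fields": {"CUST_ID": "customer_id", "BAL": "balance"},
}

rendered = render_job_params(example_config)
print(rendered)
```

The design choice is that the repetitive structure lives once in the template, so a new job is a new config entry rather than a new hand-built job.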
- Generated test data covering all file format variants (ASCII, EBCDIC, semi-ASCII)
- Ran DataStage jobs in a test environment
- Compared outputs against expected baselines using column-level validation
- Generated test reports with pass/fail by field, by record, and by job
Engineers could test a new DataStage job in minutes rather than hours. Tests ran automatically before any production deployment. That shift eliminated an entire class of production incidents.
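A minimal sketch of the column-level comparison step, assuming job output and baseline have both been loaded as lists of dicts with a shared key column. The function names and the sample rows are hypothetical; the real framework exposed this through Robot Framework keywords.

```python
from collections import Counter


def compare_to_baseline(actual, expected, key):
    """Compare output rows to baseline rows field by field.

    Returns a list of (record_key, field, expected_value, actual_value)
    tuples, one per mismatching field. A record missing from the output
    shows up as a mismatch on every one of its fields.
    """
    actual_by_key = {row[key]: row for row in actual}
    mismatches = []
    for exp_row in expected:
        act_row = actual_by_key.get(exp_row[key])
        for field, exp_value in exp_row.items():
            act_value = None if act_row is None else act_row.get(field)
            if act_value != exp_value:
                mismatches.append((exp_row[key], field, exp_value, act_value))
    # Records present in the output but absent from the baseline also fail.
    expected_keys = {row[key] for row in expected}
    for extra in sorted(actual_by_key.keys() - expected_keys):
        mismatches.append((extra, key, None, extra))
    return mismatches


def summarize(mismatches):
    """Roll mismatches up to pass/fail by job, by field, and by record."""
    return {
        "job_passed": not mismatches,
        "by_field": dict(Counter(field for _, field, _, _ in mismatches)),
        "by_record": dict(Counter(rec for rec, _, _, _ in mismatches)),
    }


# Hypothetical sample data: one transposed digit in one balance.
expected_rows = [
    {"id": "A1", "balance": "100.00", "ccy": "AUD"},
    {"id": "A2", "balance": "250.50", "ccy": "AUD"},
]
actual_rows = [
    {"id": "A1", "balance": "100.00", "ccy": "AUD"},
    {"id": "A2", "balance": "250.05", "ccy": "AUD"},
]

found = compare_to_baseline(actual_rows, expected_rows, key="id")
summary = summarize(found)
print(found)    # [('A2', 'balance', '250.50', '250.05')]
print(summary)
```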
Engineering & Implementation
- DataStage job generation: Python-based template engine producing DataStage parameter files from structured config
- Robot Framework test library: Custom keywords for DataStage job execution, Teradata data validation, and file format handling including EBCDIC
- Control-M integration: Scripts to validate job dependency chains and automatically verify execution order
- Validation layer: Field-level comparison with configurable tolerance for numeric fields, exact match for identifiers, and custom rules for date and encoding validation
- Test reporting: HTML test reports with drill-down to field level - engineers could see exactly what passed, what failed, and why
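The validation layer's per-field rules can be sketched as a mapping from field name to a comparison function. This is an assumption-laden illustration: the field names, the 0.01 tolerance, the ISO date format, and every function name below are hypothetical.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def exact(expected: str, actual: str) -> bool:
    """Identifiers must match character for character."""
    return expected == actual


def numeric_within(tolerance: str):
    """Build a rule that accepts numeric values within an absolute tolerance."""
    tol = Decimal(tolerance)

    def check(expected: str, actual: str) -> bool:
        try:
            return abs(Decimal(expected) - Decimal(actual)) <= tol
        except InvalidOperation:
            return False  # a non-numeric value never passes a numeric rule

    return check


def iso_date(expected: str, actual: str) -> bool:
    """Both values must parse as YYYY-MM-DD and be the same date."""
    try:
        fmt = "%Y-%m-%d"
        return datetime.strptime(expected, fmt) == datetime.strptime(actual, fmt)
    except ValueError:
        return False


# Hypothetical rule configuration: one rule per field.
RULES = {
    "account_id": exact,
    "balance": numeric_within("0.01"),
    "as_of_date": iso_date,
}


def validate_record(expected: dict, actual: dict, rules=RULES) -> dict:
    """Return pass/fail per field for one record."""
    return {field: rule(expected[field], actual[field]) for field, rule in rules.items()}


baseline = {"account_id": "000123", "balance": "1000.00", "as_of_date": "2020-06-30"}
output = {"account_id": "123", "balance": "1000.004", "as_of_date": "2020-06-30"}

result = validate_record(baseline, output)
print(result)  # {'account_id': False, 'balance': True, 'as_of_date': True}
```

Using `Decimal` rather than `float` keeps the tolerance check exact for amounts, and the identifier rule treats a dropped leading zero as a failure rather than as an equal number.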
The framework was built to be usable by other engineers, not just by me. Documentation and onboarding were part of the design from the start. That decision is why it got adopted.
Results & Impact
- ANZ’s compliance team submitted APRA regulatory reports generated by these pipelines - the automated test layer caught data errors at the field level before any submission left the bank
- 50% reduction in development and testing time for new DataStage jobs - engineers went from days to hours building jobs, and from hours to minutes testing them
- 70+ source systems covered by automated test validation, including EBCDIC mainframe feeds
- Framework adopted by the wider team - became the standard for new pipeline development, which is why it kept paying off after I moved on
- Zero production incidents during my tenure from the classes of defect the automated tests were built to catch
- Led the Grads4Tech initiative to strengthen the culture of technology adoption within the bank
Limitations & What I’d Do Differently
The test framework was strong on happy-path coverage but required manual work to add new edge cases. A property-based testing approach - generating test inputs programmatically from specs - would have improved coverage depth without the manual overhead.
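As a rough sketch of what that could look like, the Python below generates random records from a field spec and checks a round-trip property: encoding a record to fixed-width EBCDIC and decoding it back must return the original. It is hand-rolled with the standard library's `random` rather than a dedicated library such as Hypothesis. The record layout, the cp037 code page, and all names are assumptions, and `decode_record` stands in for whatever pipeline step is under test.

```python
import random
import string

# Hypothetical record spec: (field name, width, kind). Not a real layout.
SPEC = [("account_id", 8, "digits"), ("name", 12, "alpha"), ("balance", 10, "digits")]
ALPHABETS = {"digits": string.digits, "alpha": string.ascii_uppercase}
CODEC = "cp037"  # assumed EBCDIC code page


def generate_record(rng: random.Random) -> dict:
    """Draw one random record that satisfies the spec."""
    record = {}
    for name, width, kind in SPEC:
        length = rng.randint(1, width)
        record[name] = "".join(rng.choice(ALPHABETS[kind]) for _ in range(length))
    return record


def encode_record(record: dict) -> bytes:
    """Lay the fields out as a fixed-width, space-padded EBCDIC record."""
    return b"".join(record[name].ljust(width).encode(CODEC) for name, width, _ in SPEC)


def decode_record(raw: bytes) -> dict:
    """Slice the record by field width and decode each field."""
    record, offset = {}, 0
    for name, width, _ in SPEC:
        record[name] = raw[offset:offset + width].decode(CODEC).rstrip()
        offset += width
    return record


def check_round_trip(cases: int = 500, seed: int = 0) -> int:
    """Property: decode(encode(record)) == record for every generated record."""
    rng = random.Random(seed)  # seeded so a failing case can be reproduced
    record_length = sum(width for _, width, _ in SPEC)
    for _ in range(cases):
        record = generate_record(rng)
        raw = encode_record(record)
        assert len(raw) == record_length
        assert decode_record(raw) == record, record
    return cases


print(check_round_trip())  # 500
```

Extending coverage then means extending the spec or the alphabets, not writing another hand-built test file.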
The DataStage job generation was template-based, not model-based. It could not handle genuinely novel job types without manual template extension. A more principled DSL-to-DataStage compiler would have been more powerful, though the complexity tradeoff might not have been worth it for the team’s needs at the time.
Stack
IBM DataStage, Teradata, Control-M, Robot Framework, Python, SQL, ASCII/EBCDIC file handling