Overview
My first role out of IIT Bombay was at ANZ Bank’s enterprise data automation team in India. The scope was APRA (Australian Prudential Regulation Authority) regulatory data — the pipeline that ensures the bank’s regulatory reporting is accurate and timely. I joined as a fresh graduate and ended up owning significant parts of the automation infrastructure that reduced development and testing time by 50%.
The Problem
Banking regulatory data pipelines are a specific class of engineering problem: the correctness requirements are absolute, the data sources are heterogeneous and legacy, and the cost of errors is regulatory — not just operational. ANZ’s data was flowing from 70+ source systems with different encodings, schemas, and update frequencies into a centralized regulatory reporting layer.
The development and testing process was largely manual: engineers would write DataStage jobs, test them by hand against sample data, and chase down discrepancies through multiple system layers. This was slow, error-prone, and didn’t scale as the number of source systems grew.
Why It Mattered
APRA regulatory reporting is not optional. Late or incorrect regulatory data is a compliance risk — and in banking, compliance failures have material consequences. The engineering team was spending more time on manual testing and debugging than on building new capabilities. The bottleneck was structural.
Data & Inputs
- 70+ source systems with diverse formats:
  - ASCII flat files
  - Semi-ASCII files with mixed encoding
  - EBCDIC files (IBM mainframe encoding, a specific challenge for engineers from modern data backgrounds)
  - Database extracts from Teradata and other systems
- Control-M job scheduling metadata and dependency chains
- Historical regulatory submission data for validation baselines
Understanding EBCDIC encoding — and why a single wrong character can silently corrupt a downstream calculation — was one of the first things I learned on the job.
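To make the encoding hazard concrete, here is a minimal Python illustration using the standard `cp037` EBCDIC codec. The field value is invented; the point is that a mis-decoded EBCDIC file produces plausible-looking garbage rather than an error:

```python
# EBCDIC (cp037) and ASCII-family encodings assign different byte values
# to the same characters, so reading an EBCDIC file with the wrong codec
# silently corrupts data.

# A hypothetical amount field, encoded as EBCDIC bytes:
amount_ebcdic = "1234.56".encode("cp037")

# Decoded correctly as EBCDIC, the value round-trips:
print(amount_ebcdic.decode("cp037"))

# Decoded as Latin-1, the same bytes become garbage characters,
# and crucially no exception is raised:
print(amount_ebcdic.decode("latin-1"))
```

No decode error means the corruption can travel all the way into a downstream regulatory calculation before anyone notices, which is why format-aware test data mattered so much.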
Approach
The automation strategy had two components:
Component 1: ETL development automation
I analyzed the patterns in existing IBM DataStage jobs and identified the repetitive structural elements: source connection setup, field mapping, data type handling, and error logging. I then wrote Python tooling to generate DataStage job templates from configuration files, turning job creation into specify-then-generate rather than build-from-scratch.
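As a rough sketch of the specify-then-generate idea, the snippet below renders a DataStage-style parameter block from a structured config. The config keys, template fields, and job names are all illustrative, not the actual ANZ tooling:

```python
from string import Template

# Hypothetical parameter-file template; real DataStage parameter files
# have a richer format, but the generation pattern is the same.
JOB_TEMPLATE = Template(
    "JobName=$job_name\n"
    "SourceConnection=$source\n"
    "TargetTable=$target\n"
    "FieldMap=$field_map\n"
)

def render_job(config: dict) -> str:
    """Render one job's parameter block from a config entry."""
    field_map = ";".join(f"{src}:{dst}" for src, dst in config["fields"].items())
    return JOB_TEMPLATE.substitute(
        job_name=config["name"],
        source=config["source"],
        target=config["target"],
        field_map=field_map,
    )

# Example config entry (invented names):
config = {
    "name": "LOAD_CUSTOMER",
    "source": "TERADATA_PROD",
    "target": "RPT_CUSTOMER",
    "fields": {"cust_id": "CUSTOMER_ID", "bal": "BALANCE"},
}
print(render_job(config))
```

The value of this approach is that the repetitive 80% of a job becomes a config entry, while the genuinely custom logic is still written by hand.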
Component 2: Automated test framework
I built an end-to-end test automation framework using Robot Framework and Python. The framework:
- Generated test data covering all file format variants (ASCII, EBCDIC, semi-ASCII)
- Ran DataStage jobs in a test environment
- Compared outputs against expected baselines using column-level validation
- Generated test reports with pass/fail by field, by record, and by job
This meant engineers could test a new DataStage job in minutes rather than hours — and the tests could be run automatically before any production deployment.
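The framework's keywords were exposed to Robot Framework as a Python library. The sketch below shows the shape of such a library; the class, keyword names, and stubbed behavior are hypothetical, and a real implementation would shell out to the DataStage CLI and query Teradata:

```python
class DataStageLibrary:
    """Robot Framework discovers keywords from public method names,
    so `run_datastage_job` becomes the keyword `Run Datastage Job`."""

    def run_datastage_job(self, project: str, job: str) -> int:
        # Stubbed for the sketch; a real keyword would invoke the
        # DataStage job-control CLI and poll until completion.
        print(f"running {project}/{job}")
        return 0

    def job_should_have_succeeded(self, exit_code: int) -> None:
        # Robot Framework treats a raised AssertionError as a test failure.
        if int(exit_code) != 0:
            raise AssertionError(f"job failed with exit code {exit_code}")
```

A test case in a `.robot` file would then read as plain steps: run the job, capture the exit code, assert success, then compare outputs against the baseline.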
Engineering & Implementation
- DataStage job generation: Python-based template engine producing DataStage parameter files from structured config — reducing new job creation from days to hours
- Robot Framework test library: Custom Robot Framework keywords for DataStage job execution, Teradata data validation, and file format handling (including EBCDIC)
- Control-M integration: Scripts to validate job dependency chains and automatically verify execution order
- Validation layer: Field-level comparison with configurable tolerance for numeric fields, exact match for identifiers, and custom rules for date and encoding validation
- Test reporting: HTML test reports with drill-down to field level — giving engineers a precise view of what passed, what failed, and why
The framework was designed to be usable by other engineers in the team — not just by me. Documentation and onboarding were built in from the start.
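The validation layer's core idea, numeric fields compared within a configurable tolerance and identifiers matched exactly, can be sketched in a few lines. Rule names, field names, and the default tolerance here are invented for illustration:

```python
def compare_field(expected, actual, rule: str, tolerance: float = 0.01) -> bool:
    """Return True when a single field passes under the given rule."""
    if rule == "numeric":
        # Numeric fields pass within an absolute tolerance.
        return abs(float(expected) - float(actual)) <= tolerance
    if rule == "exact":
        # Identifiers must match exactly as strings.
        return str(expected) == str(actual)
    raise ValueError(f"unknown rule: {rule}")

def compare_record(expected: dict, actual: dict, rules: dict) -> dict:
    """Per-field pass/fail map for one record."""
    return {f: compare_field(expected[f], actual[f], rules[f]) for f in rules}

# Example (invented fields): a tiny rounding difference in BALANCE
# passes under the numeric rule, while CUSTOMER_ID must match exactly.
rules = {"CUSTOMER_ID": "exact", "BALANCE": "numeric"}
result = compare_record(
    {"CUSTOMER_ID": "C001", "BALANCE": "100.00"},
    {"CUSTOMER_ID": "C001", "BALANCE": "100.004"},
    rules,
)
print(result)  # both fields pass under the default tolerance
```

The per-field pass/fail map is what fed the drill-down HTML reports: engineers saw exactly which field of which record failed, not just that a job diffed.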
Results & Impact
- 50% reduction in development and testing time for new DataStage jobs
- 70+ source systems covered by automated test validation
- Framework adopted by the wider team — became the standard for new pipeline development
- Zero production incidents from pipeline bugs during my tenure — defects were caught by the automated tests before deployment
- Led the Grads4Tech initiative to improve technology adoption culture within the bank
Limitations & What I’d Do Differently
The test framework was strong on happy-path coverage but required manual work to add new edge cases. A property-based testing approach (generating test inputs programmatically from specs) would have improved coverage depth.
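A library like Hypothesis would be the natural tool, but the core idea can be shown with the standard library alone: derive test records programmatically from a field spec instead of hand-writing each case. The spec format and field kinds below are invented for the sketch:

```python
import random

def generate_record(spec: dict, rng: random.Random) -> dict:
    """Generate one test record whose fields satisfy the spec."""
    record = {}
    for field, kind in spec.items():
        if kind == "amount":
            # Two-decimal monetary amounts, including negatives.
            record[field] = round(rng.uniform(-1e6, 1e6), 2)
        elif kind == "id":
            record[field] = "C{:06d}".format(rng.randrange(1_000_000))
        else:
            raise ValueError(f"unknown field kind: {kind}")
    return record

spec = {"CUSTOMER_ID": "id", "BALANCE": "amount"}
rng = random.Random(42)  # seeded for reproducible test runs

# Property: every generated record has exactly the fields the spec demands.
for _ in range(100):
    rec = generate_record(spec, rng)
    assert set(rec) == set(spec)
```

Generating hundreds of structurally valid but unusual records per run would have surfaced the edge cases we instead had to add by hand.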
The DataStage job generation was template-based, not model-based — it couldn’t handle genuinely novel job types without manual template extension. A more principled DSL-to-DataStage compiler would have been more powerful, though the complexity tradeoff might not have been worth it for the team’s needs.
Stack
IBM DataStage, Teradata, Control-M, Robot Framework, Python, SQL, ASCII/EBCDIC file handling