When I joined ANZ Bank as a fresh data engineer, I knew almost nothing about how banks actually work. Six months of building regulatory data pipelines taught me more than any finance course.
What follows is the map I wish I’d had — the business model, the data architecture, and the regulatory constraints that shape everything a data scientist does in banking.
The Business Model
At its core, a bank does two things:
- Borrows money at a low interest rate (from depositors and capital markets)
- Lends money at a higher interest rate (to individuals and businesses)
The spread between borrowing and lending rates drives the net interest margin (net interest income as a share of interest-earning assets), the primary source of profitability for traditional retail banks.
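To make the spread concrete, here is a toy calculation. All figures are invented; a real NIM is measured against average interest-earning assets over the period.

```python
# Toy net interest margin calculation. All figures are invented for illustration.
loans = 800_000_000          # interest-earning assets ($)
lending_rate = 0.055         # average yield on the loan book
deposits = 750_000_000       # interest-bearing funding ($)
deposit_rate = 0.025         # average rate paid on deposits

interest_income = loans * lending_rate
interest_expense = deposits * deposit_rate
net_interest_income = interest_income - interest_expense

# NIM is net interest income relative to interest-earning assets.
nim = net_interest_income / loans
print(f"Net interest income: ${net_interest_income:,.0f}")   # $25,250,000
print(f"Net interest margin: {nim:.2%}")                      # 3.16%
```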
Beyond the core spread, banks generate revenue from:
- Fees: transaction fees, account maintenance fees, advisory fees, underwriting fees
- Trading: proprietary trading desks and client execution (for investment banks)
- Asset management: managing money on behalf of clients for a management fee
- Insurance: bancassurance products, often sold through the retail banking channel
The tradeoffs are fundamental: more lending → more interest income, but also more credit risk. Lower lending standards → more customers, but worse loan book quality.
The Balance Sheet in 2 Minutes
Assets:
- Loans: mortgages, business loans, personal credit — what the bank is owed
- Securities: government bonds, corporate bonds, structured products — investment portfolio
- Cash and equivalents: reserves, interbank deposits
Liabilities:
- Deposits: money the bank owes to its customers
- Borrowings: money the bank has raised from capital markets
Equity:
- Shareholders' capital and retained earnings: the cushion that absorbs losses before depositors and other creditors do
A bank with $1B in assets funded by $900M in deposits and $100M in equity has a leverage ratio of 10:1. This is normal in banking — and it’s why bank failures cascade. When loan losses exceed the equity cushion, the bank is insolvent.
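The arithmetic is simple enough to sketch, using the same invented figures:

```python
# Toy solvency check: $1B of assets funded by deposits plus a thin equity cushion.
assets = 1_000_000_000
deposits = 900_000_000
equity = assets - deposits                    # $100M cushion

print(f"Leverage: {assets / equity:.0f}:1")   # 10:1

# If loan losses exceed the equity cushion, liabilities exceed remaining assets.
loan_losses = 120_000_000
remaining_equity = equity - loan_losses
status = "insolvent" if remaining_equity < 0 else "solvent"
print(f"After ${loan_losses:,.0f} of losses the bank is {status} "
      f"(equity: ${remaining_equity:,.0f})")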
How Regulatory Data Fits In
Prudential regulators (APRA in Australia, the Fed and OCC in the US, the PRA in the UK) require banks to maintain data systems that let regulators verify compliance with capital, liquidity, and reporting requirements.
In Australia, APRA requires:
- Regulatory capital reporting (ARS): How much capital does the bank hold? Is it above minimum requirements?
- Liquidity reporting (LRS): Does the bank have enough liquid assets to survive a 30-day stress scenario? (A toy version of this ratio is sketched after this list.)
- Balance sheet reporting: Detailed breakdowns of assets, liabilities, and off-balance-sheet exposures
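The liquidity requirement boils down to a coverage ratio: high-quality liquid assets divided by modelled net cash outflows over the 30-day stress window. A back-of-envelope sketch with invented numbers follows; the real APRA templates prescribe haircuts and run-off rates per category.

```python
# Back-of-envelope 30-day liquidity coverage check. All figures are invented;
# real reporting applies prescribed haircuts and run-off rates per category.
liquid_assets = {"cash": 50e6, "government_bonds": 200e6}
stressed_outflows = {"retail_deposit_runoff": 120e6, "wholesale_rollover": 90e6}
stressed_inflows = {"contractual_loan_repayments": 30e6}

hqla = sum(liquid_assets.values())
net_outflows = sum(stressed_outflows.values()) - sum(stressed_inflows.values())

coverage = hqla / net_outflows
print(f"30-day coverage ratio: {coverage:.0%}")   # >= 100% means the scenario is survivable
```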
The data for these reports comes from every corner of the bank: retail banking systems, wholesale funding desks, treasury, credit risk models, operations. At ANZ, that was 70+ source systems — each with its own data model, encoding, and update schedule.
The Data Complexity
Banking data is heterogeneous by necessity. A large bank has accumulated systems over decades: legacy mainframe systems from the 1980s alongside modern cloud-native applications. The mainframe systems often produce data in EBCDIC encoding (an IBM character encoding standard from the 1960s) rather than ASCII. They have fixed-width field formats rather than delimited files. They update on batch schedules rather than in real time.
A regulatory ETL pipeline at a large bank is not just SQL — it’s format conversion, encoding translation, data quality validation, lineage tracking, and reconciliation across systems that were never designed to talk to each other.
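To make the format-conversion piece concrete, here is a minimal sketch of decoding one fixed-width EBCDIC record into named fields. The field layout and code page (cp037) are invented for illustration; real extracts are defined by COBOL copybooks, carry packed-decimal fields, and need far more validation.

```python
# Minimal sketch: decode one fixed-width EBCDIC record into named fields.
# The layout and code page are invented; real mainframe extracts are defined
# by COBOL copybooks and often include packed-decimal (COMP-3) fields.
FIELDS = [("account_id", 0, 10), ("balance_cents", 10, 22), ("currency", 22, 25)]

def parse_record(raw: bytes, codepage: str = "cp037") -> dict:
    text = raw.decode(codepage)                      # EBCDIC -> str
    record = {name: text[start:end].strip() for name, start, end in FIELDS}
    record["balance_cents"] = int(record["balance_cents"])   # basic type validation
    return record

# "0012345678" + "000000123456" + "AUD", encoded in EBCDIC for the example
sample = "0012345678000000123456AUD".encode("cp037")
print(parse_record(sample))
```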
This is why the “50% development time reduction” at ANZ was meaningful. The test automation we built wasn’t testing a simple data pipeline — it was testing integrations across dozens of incompatible systems, each with its own quirks.
The Incentive Structure
Understanding why banking data systems are the way they are requires understanding the incentive structure:
Regulatory compliance > operational efficiency. A bank that misses regulatory reporting deadlines faces fines and regulatory scrutiny. This means the data systems that feed regulatory reports are treated as mission-critical infrastructure, even when they’re technically clunky.
Stability > innovation. The systems that process billions of dollars in daily transactions cannot be disrupted by a botched technology upgrade. This creates strong resistance to changing systems that are “working,” even when they’re antiquated.
Risk awareness is cultural. In a well-run bank, questions about “what happens if this fails?” are taken seriously at every level. This is the right instinct — the downside of data errors in banking is regulatory exposure, customer harm, and reputational damage.
Where Data Science Fits In Banking
The most interesting applications of data science in banking:
Credit risk modeling: Predicting probability of default, loss given default, and exposure at default for the loan book. These models directly affect capital requirements under Basel III.
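The three quantities multiply into an expected loss figure. A one-liner, with invented parameters:

```python
# Expected loss for a single exposure: EL = PD * LGD * EAD.
# Parameter values are invented; real models estimate them per segment and facility.
pd_12m = 0.02          # probability of default over 12 months
lgd = 0.45             # loss given default (fraction of exposure lost)
ead = 250_000          # exposure at default ($)

expected_loss = pd_12m * lgd * ead
print(f"Expected loss: ${expected_loss:,.0f}")   # $2,250
```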
Fraud detection: Real-time classification of transactions as legitimate or fraudulent. The class imbalance problem (fraud is rare) and the adversarial nature (fraudsters adapt to detection) make this a perpetual modeling challenge.
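One common way to handle the imbalance at the modelling layer is class weighting. A minimal scikit-learn sketch on synthetic data, nothing like a production fraud model:

```python
# Minimal sketch: class weighting for a rare-positive (fraud) problem.
# Data is synthetic; production systems add real features, streaming scoring,
# and constant retraining against adapting fraud patterns.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, n_features=20,
                           weights=[0.995], random_state=0)   # ~0.5% positives
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" upweights the rare fraud class during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]
print(f"Average precision: {average_precision_score(y_te, scores):.3f}")
```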
Customer analytics: Predicting churn, cross-sell propensity, and customer lifetime value. The data is rich (transaction history, product holdings, interaction history) but privacy constraints limit what can be done.
AML / transaction monitoring: Identifying patterns consistent with money laundering in transaction data. A pattern-matching and anomaly detection problem at very large scale.
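A toy version of one classic pattern, structuring (repeated cash deposits kept just under the $10,000 reporting threshold). Everything beyond that threshold figure is invented:

```python
# Toy structuring flag: customers making repeated cash deposits just under a
# reporting threshold. Data and the 3-deposit cutoff are invented; a real rule
# would also constrain the time window and sit alongside many other scenarios.
import pandas as pd

txns = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount":      [9500, 9800, 9700, 12000, 300],
    "date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-05",
                            "2024-01-02", "2024-01-04"]),
})

near_threshold = txns[(txns.amount >= 9000) & (txns.amount < 10000)]
counts = near_threshold.groupby("customer_id").size()
flagged = counts[counts >= 3].index.tolist()
print(f"Customers flagged for review: {flagged}")   # [1]
```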
Interest rate modeling: For treasury and ALM (asset-liability management), forecasting interest rate movements and modeling the bank’s sensitivity to rate changes.
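A highly simplified repricing-gap view of that sensitivity, with invented bucket balances: the balances that reprice within a year determine, roughly, how a parallel rate move hits net interest income.

```python
# Toy repricing gap: approximate NII impact of a parallel rate shift, looking
# only at balances that reprice within one year. All figures are invented.
assets_repricing_1y = 400e6       # loans and securities repricing within 12 months
liabilities_repricing_1y = 550e6  # deposits and funding repricing within 12 months

gap = assets_repricing_1y - liabilities_repricing_1y      # negative: liability-sensitive
rate_shock = 0.01                                         # +100 basis points

delta_nii = gap * rate_shock
print(f"1-year repricing gap: ${gap:,.0f}")
print(f"Approx. NII impact of +100bp: ${delta_nii:,.0f}")  # negative for this bank
```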
What the Work Actually Looks Like
Data science in a large bank is slower than in a startup. The approval processes, the data access controls, the change management procedures — all of these slow down the pace of experimentation. But the data is richer, the stakes are higher, and the problems are genuinely hard.
The most important skill for a data scientist in banking: understand the business well enough to know which data quality issues are errors and which are features. When loan origination volumes spike in one month, is it a data quality problem or a legitimate surge in mortgage applications? The domain knowledge to answer that question is what turns data into decisions.