The Context
The engagement began in Europe, working with the global operations of a major banking institution on complex financial systems. When the bank expanded its wealth management platform to Hong Kong — one of the world’s largest wealth management markets — I moved with it, tasked with leading the Performance and Capacity Planning Centre of Excellence as the platform approached its Asia go-live.
The platform served the bank’s fixed income and wealth management operations: portfolio management, trade execution, client reporting, and the settlement workflows that connected them. In this context, performance is not an engineering concern sitting downstream of feature delivery. It is a business-critical requirement. A platform that cannot process trades during a rate decision or a market dislocation event is not a platform a wealth management operation can rely on. The tolerance for latency and downtime is, in practical terms, close to zero.
The Crisis: What Java GC Failures Look Like at Scale
Weeks before the scheduled go-live, load testing revealed what unit tests and functional testing had not: the platform could not sustain the transaction volumes the bank’s Asia operations required.
The root cause was Java memory management — specifically, the way the application’s object lifecycle had been designed and how the JVM’s garbage collector was responding to it under sustained load. This is a class of problem that is genuinely easy to miss during development and genuinely dangerous in production, and it is worth explaining why.
Java applications manage memory through a garbage collector that periodically identifies and reclaims objects that are no longer referenced. Under light load, collections are short and cheap, the heap stays clean, and the process is largely invisible. Under sustained high load, the pattern changes. Objects accumulate faster than the collector can reclaim them, and objects that survive long enough are promoted into the older generation, where reclaiming them costs more. The heap fills. The JVM escalates to progressively more expensive collection strategies, culminating in a full stop-the-world collection that halts all application threads while the entire heap is examined.
In a wealth management platform processing continuous transaction flows, a stop-the-world GC pause is not a minor inconvenience. It is a service interruption. At sufficient load, these pauses were occurring frequently enough and lasting long enough to make the platform operationally unviable. The system worked. It just didn’t work at the scale the business required.
The development team had done nothing wrong in a narrow sense — the application logic was correct and the features were complete. What had not been designed carefully enough was the object lifecycle: how objects were allocated, how long they were retained, how they moved through the JVM’s generational memory model, and what the cumulative effect of those patterns looked like under sustained transaction volumes. This is a discipline that requires specific expertise and deliberate attention. Under the delivery pressure of a major platform launch, it had not received either.
Resolving the Immediate Problem
The path to go-live required working through three layers simultaneously.
The first was diagnostic: getting accurate data on what the GC was actually doing under load. GC logs, heap profiling, thread analysis, and careful load test instrumentation to understand not just that the platform was failing but precisely where and why. The failure modes were not identical across the application — some components had significantly worse memory profiles than others — and fixing the right things first required knowing which things those were.
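As an illustration of the diagnostic layer, the sketch below reads the JVM's own collector counters through the standard `java.lang.management` API and turns two readings into a GC overhead figure. It is a minimal sketch, not the tooling used on the engagement: the class name `GcSnapshot` and the method `overheadPercent` are hypothetical, and the reported time is cumulative collection time per collector, which for concurrent collectors is not the same as pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: samples the JVM's built-in GC counters.
final class GcSnapshot {

    /** One line per collector: cumulative collection count and elapsed time. */
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            lines.add(String.format("%s: count=%d, timeMs=%d",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return lines;
    }

    /** Total cumulative GC time in milliseconds, skipping undefined (-1) values. */
    static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Share of a wall-clock interval spent in GC, from two totalGcTimeMs() readings. */
    static double overheadPercent(long gcMsBefore, long gcMsAfter, long wallMs) {
        if (wallMs <= 0) {
            throw new IllegalArgumentException("wallMs must be positive");
        }
        return (gcMsAfter - gcMsBefore) * 100.0 / wallMs;
    }

    public static void main(String[] args) {
        describeCollectors().forEach(System.out::println);
    }
}
```

Sampling `totalGcTimeMs()` at the start and end of each load test stage, per component, gives a rough ranking of which components have the worst memory profiles. GC logs and heap profiles are still needed to explain why.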
The second was remediation: working with the development team on the specific code patterns driving the worst memory behaviour. Object pooling where appropriate, reducing unnecessary object creation in high-throughput code paths, reviewing collection usage, and in a few cases rearchitecting components where the fundamental design was creating unavoidable pressure on the heap. This was not a configuration exercise — GC tuning can absorb a problem up to a point, but if the application is generating too much garbage, tuning the collector only delays the failure mode.
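To make the pooling idea concrete, here is a minimal sketch of a bounded pool for fixed-size byte buffers. It is an assumption-laden illustration rather than the code from the platform: `BufferPool` and its methods are hypothetical names, it uses coarse `synchronized` locking for simplicity, and it does not clear buffers on release, so callers must overwrite contents and must not keep a reference after returning a buffer.

```java
import java.util.ArrayDeque;

// Hypothetical bounded pool: reuses fixed-size buffers instead of allocating per request.
final class BufferPool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    private final int bufferSize;
    private final int maxPooled;
    private int created = 0;

    BufferPool(int bufferSize, int maxPooled) {
        if (bufferSize <= 0 || maxPooled <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        this.bufferSize = bufferSize;
        this.maxPooled = maxPooled;
    }

    /** Returns a pooled buffer if one is free, otherwise allocates a new one. */
    synchronized byte[] acquire() {
        byte[] buffer = free.pollFirst();
        if (buffer == null) {
            created++;
            buffer = new byte[bufferSize];
        }
        return buffer;
    }

    /** Returns a buffer to the pool; surplus or wrong-sized buffers are left to the GC. */
    synchronized void release(byte[] buffer) {
        if (buffer != null && buffer.length == bufferSize && free.size() < maxPooled) {
            free.addFirst(buffer);
        }
    }

    /** Number of buffers actually allocated, useful for confirming reuse under load. */
    synchronized int createdCount() {
        return created;
    }
}
```

Pooling earns its place only for objects that are large or expensive to build and sit on a hot path. For small short-lived objects, a generational collector usually handles allocation cheaply, and a pool can make things worse by keeping objects alive into the older generation.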
The third was tuning: once the application behaviour was improved, configuring the JVM and GC settings to match the actual workload profile. Heap sizing, generation ratios, collector choice, and pause time targets — these decisions interact with each other and with the application’s specific allocation patterns in ways that require empirical validation rather than rule-of-thumb settings.
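Empirical validation can be as simple as checking measured pauses against the pause target after each tuning change. The sketch below is one way to do that, assuming pause durations in milliseconds have already been extracted from GC logs. `PauseBudget` and its methods are hypothetical names, and the nearest-rank percentile is one common definition among several.

```java
import java.util.Arrays;

// Hypothetical check: does a percentile of observed GC pauses meet the target?
final class PauseBudget {

    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    /** True when the chosen percentile of pauses is at or under the target. */
    static boolean withinTarget(long[] pausesMs, double p, long targetMs) {
        return percentile(pausesMs, p) <= targetMs;
    }
}
```

With the illustrative samples `{12, 8, 15, 240, 9, 11, 10, 14, 13, 9}` and a 50 ms target, the 90th percentile (15 ms) passes while the 99th (240 ms) fails. That is why the percentile you tune against matters as much as the target itself.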
The platform reached go-live. The more important work was ensuring that what happened in the weeks before go-live would not be allowed to happen again.
Building the Centre of Excellence
A Performance and Capacity Planning Centre of Excellence is not a team that fixes performance problems after they occur. A team like that is a reactive function with a good name. The goal was a proactive function: one that established standards, owned the testing regime, and embedded performance discipline into the delivery process early enough to prevent production failures rather than remediate them.
Building it started with the right people. The interviews I conducted across the Hong Kong talent market gave me a clear picture of where the skills gap actually sat. Technical capability was not the constraint: the pool of strong Java engineers in Hong Kong was deep. Less common were engineers who combined technical depth with a specific kind of systems thinking: the ability to reason about application behaviour under conditions that didn’t yet exist, to anticipate how load patterns would interact with architectural decisions, and to make the case for performance work in terms the business could evaluate. This last skill, translating technical performance risk into business impact, turned out to be as important as the technical skills themselves.
The CoE’s core output was a testing regime that caught performance problems before they reached production. This meant load testing profiles calibrated to real-world transaction patterns, not synthetic benchmarks. It meant stress testing to failure — understanding where the system broke and why, not just confirming that it worked within expected parameters. It meant endurance testing over sustained periods to surface memory problems that only manifested after hours of operation. And it meant regression testing to ensure that code changes didn’t degrade performance that had already been established.
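One way to turn an endurance run into a pass or fail signal is to fit a trend line to heap occupancy measured just after each major collection: a steadily rising floor suggests objects are being retained. The sketch below uses an ordinary least-squares slope. `HeapTrend`, `looksLikeLeak`, and the threshold are hypothetical, and the sample values in the note that follows are illustrative, not measurements from the platform.

```java
// Hypothetical endurance-test check: is post-GC heap usage trending upward?
final class HeapTrend {

    /** Ordinary least-squares slope of y against x (units of y per unit of x). */
    static double slope(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two or more paired samples");
        }
        int n = x.length;
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double numerator = 0;
        double denominator = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            numerator += dx * (y[i] - meanY);
            denominator += dx * dx;
        }
        if (denominator == 0) {
            throw new IllegalArgumentException("all x values are identical");
        }
        return numerator / denominator;
    }

    /** Flags a run whose post-GC heap grows faster than the allowed rate. */
    static boolean looksLikeLeak(double[] hours, double[] postGcHeapMb, double maxMbPerHour) {
        return slope(hours, postGcHeapMb) > maxMbPerHour;
    }
}
```

For example, post-GC readings of 500, 520, 540, 560, and 580 MB at hours 0 to 4 give a slope of 20 MB per hour, which a 5 MB per hour threshold would flag. A short run is not enough: the trend only becomes meaningful after caches and pools have warmed up.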
Predictive Capacity Planning in Financial Services
The capacity planning work added a dimension that standard performance engineering often omits: the forward-looking model.
A wealth management platform’s load profile is not uniform. Daily patterns follow market hours, with peaks at open and close. Weekly patterns reflect settlement cycles. But the most significant load events are market-driven: a central bank rate decision, a major credit event, a significant IPO in the region. These events are partially predictable — the rate decision calendar is known in advance — and partially not, but their effect on transaction volumes is well-documented in historical data.
The predictive model we built combined historical transaction data with market event calendars and, where available, forward indicators of expected market activity. The output was a capacity forecast that the infrastructure team could plan against — not “here is how the system performs today” but “here is what we expect the system to need in three months, and here are the market scenarios that would push beyond that.” For a fixed income operation, where the heaviest load often coincides with the market conditions that make any service disruption most costly, this forward visibility was operationally significant.
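The shape of such a forecast can be sketched in a few lines. What follows is a deliberately simplified illustration, not the model described above: it assumes compound monthly growth on a baseline peak and a single multiplier per market event scenario, and every name and number in it (`CapacityForecast`, `projectedPeak`, `exceedsCapacity`, the headroom fraction) is hypothetical.

```java
// Hypothetical, simplified forecast: baseline peak x compound growth x event multiplier.
final class CapacityForecast {

    /** Projected peak load (transactions per second) after the given number of months. */
    static double projectedPeak(double baselinePeakTps, double monthlyGrowthRate,
                                int months, double eventMultiplier) {
        if (baselinePeakTps < 0 || months < 0 || eventMultiplier <= 0) {
            throw new IllegalArgumentException("inputs out of range");
        }
        return baselinePeakTps * Math.pow(1.0 + monthlyGrowthRate, months) * eventMultiplier;
    }

    /**
     * True when the projection eats into the reserved headroom.
     * requiredHeadroom is a fraction, e.g. 0.25 keeps 25% of provisioned capacity spare.
     */
    static boolean exceedsCapacity(double projectedTps, double provisionedTps,
                                   double requiredHeadroom) {
        if (requiredHeadroom < 0 || requiredHeadroom >= 1) {
            throw new IllegalArgumentException("requiredHeadroom must be in [0, 1)");
        }
        return projectedTps > provisionedTps * (1.0 - requiredHeadroom);
    }
}
```

In this form, the event multiplier would be estimated from historical volumes around past rate decisions or credit events, and the scenarios that return `true` from `exceedsCapacity` are the ones worth raising with the infrastructure team ahead of time.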
The prior European foundation mattered here. Having worked with the bank’s global operations gave me context that the Hong Kong team didn’t yet have — an understanding of how the global leadership thought about performance risk, what reporting they expected, and what the consequences of a performance failure during a significant market event would look like from the perspective of the business rather than the technology team. That context shaped the CoE’s standards and reporting from the start, in ways that made its output more legible and more actionable to the people making investment decisions.
What Performance Engineering in Financial Services Actually Requires
This engagement made three things durable in my own practice.
GC problems are design problems, not configuration problems. The instinct when facing a garbage collection failure is to tune the collector. Sometimes that is the right response. More often, if the problem is serious, it means the application’s memory design needs attention. Tuning can manage the symptoms of poor object lifecycle design; it cannot fix them. The distinction matters because configuration changes are fast and design changes are slow, and choosing the wrong response wastes time you often don’t have.
Performance testing must be calibrated to real failure scenarios. Load testing against average expected traffic validates the average case. In financial services, the average case is rarely the one that causes problems. The test suite needs to include the market event scenarios — the conditions under which the system will be most heavily used at exactly the moment when any degradation is most consequential. Building those scenarios requires understanding the business, not just the system.
A CoE’s value is measured in problems that don’t occur. This makes it difficult to defend in budget discussions, because prevented failures are invisible and don’t generate incident reports. Making the CoE’s value visible requires two things: clear documentation of the risk the function is managing, and a track record of finding problems in testing that would otherwise have reached production. Both require discipline in how the function reports its work, not just how it does it.
The platform stabilised. The CoE established its standards. The go-live that had been at risk became a go-live that worked. The measure that mattered was not the crisis resolved but the number of similar crises that, in the years that followed, never happened.