Every AI programme review I have sat in measures the same things. Accuracy. Precision. Recall. AUC. F1 score. These metrics are real, they are important within their scope, and they describe something meaningful about the relationship between a model and the data it was trained and evaluated on. They do not describe whether the model is producing value in production, because that question requires a different metric that almost nobody is tracking.
Decision latency is the time between when a decision could be made and when it actually is made. It is not the same thing as model latency, which is how fast the model runs inference. A model can have inference latency measured in milliseconds and decision latency measured in hours. The model is fast. The decision is slow. And the consequences of that gap — in prevented loss, in competitive position, in the difference between AI that changes outcomes and AI that observes them — are often larger than the consequences of any model quality improvement a data science team could make.
Every decision has an actionable window
The reason decision latency matters is that every decision has a window within which intervention is possible and outside of which it is not. The window is determined by the context the decision operates in, not by the technology used to support it.
For a payment fraud decision, the actionable window is defined by the payment network, not by the bank’s technology choices. Card authorisation networks typically allow 150 to 300 milliseconds from the point the network request is received to the point an authorisation or decline must be returned. Subtract network transit time and processing overhead and the fraud scoring model has somewhere between 50 and 100 milliseconds to complete inference and return a result. If the model cannot score within that window, it cannot influence the authorisation decision. It is off the critical path. On real-time payment rails — Pix in Brazil, FedNow in the United States, the NPP in Australia, SIC5 in Switzerland — the settlement is irrevocable and final. A fraud score produced after the payment settles is not a prevention mechanism. It is a detection mechanism. The funds have moved. The intervention options are limited to recovery, which depends on the fraudster not having dispersed the proceeds, which in organised fraud they reliably have. The difference between a model that scores before settlement and a model that scores after is not a question of accuracy. It is a categorical difference in what the model can achieve.
For a credit decision, the actionable window is typically defined by competitive dynamics. A customer applying for credit through a digital channel who does not receive a decision within a defined time period will apply elsewhere. A model that achieves 92% accuracy in 48 hours cannot compete with one that achieves 88% accuracy in four seconds, because the customer has left long before the slower decision arrives. Speed is the decision variable. The accuracy difference is secondary until the latency problem is solved.
For an AML alert, the actionable window is the period during which the suspicious transaction pattern is still exploitable. An alert generated in real time and worked within 20 minutes reaches a situation where intervention is possible. The same alert worked 12 hours later reaches a situation where the funds have moved through multiple accounts and the trail has gone cold. The model quality is identical. The outcome is structurally different.
Model latency and decision latency are not the same thing
The confusion between these two concepts is pervasive and expensive. Data science teams optimise for model latency because it is measurable, it is within their control, and it appears in the metrics they are evaluated against. Decision latency requires understanding the business process, the intervention point, and the gap between them, which is a business analysis problem rather than a technical one.
A batch-scoring architecture can have 50-millisecond inference latency and six-hour decision latency at the same time. The model scores quickly within its scheduled batch run, but the batch run happens only every six hours. An event that arrives just after a run waits almost the full six hours to be scored, irrespective of how fast the model itself operates. This architecture produces detection rather than prevention on any decision where the actionable window is shorter than six hours, regardless of model quality.
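The arithmetic behind the batch example can be sketched in a few lines. This is a minimal illustration, not a description of any particular scheduler: the function name and the assumption that batches run at fixed multiples of the interval are mine.

```python
import math

def batch_decision_latency(event_time_s: float,
                           batch_interval_s: float,
                           inference_s: float) -> float:
    """Seconds from an event occurring to its score being available.

    Assumes (hypothetically) that batch runs start at 0, interval,
    2 * interval, ... and that inference time is the same for every event.
    """
    # First batch boundary at or after the event.
    next_batch_s = math.ceil(event_time_s / batch_interval_s) * batch_interval_s
    waiting_s = next_batch_s - event_time_s
    return waiting_s + inference_s

SIX_HOURS_S = 6 * 60 * 60
INFERENCE_S = 0.050  # the 50-millisecond model from the example

# An event one second after a run waits for almost the whole interval.
unlucky = batch_decision_latency(1.0, SIX_HOURS_S, INFERENCE_S)
# An event one second before a run is scored almost immediately.
lucky = batch_decision_latency(SIX_HOURS_S - 1.0, SIX_HOURS_S, INFERENCE_S)
print(f"unlucky: {unlucky:.2f}s, lucky: {lucky:.2f}s")
```

The inference time is a rounding error in both cases. The schedule, not the model, sets the latency.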
The organisations that discover this gap typically do so in one of two ways. Either they measure it deliberately — which requires someone to define the actionable window for each decision type and compare it to actual scoring timing — or they discover it through a failure. A fraud pattern that should have been caught was not caught because the alert arrived after the funds moved. A customer was lost because the credit decision arrived after they had already been approved elsewhere. A compliance finding identified a transaction pattern that had been flagged but not worked within the required window. Both discovery mechanisms work. One of them is significantly cheaper than the other.
Three sources of decision latency and what to do about each
Architecture is the first and most common source. The model is off the critical path — scoring in batch when the decision requires real-time intervention, or producing output that reaches a queue rather than the decision point. The fix requires moving the model to the point of decision, which is an architecture change, not a model change. It often requires investment in real-time inference infrastructure that was not in the original programme budget.
Process is the second source. The model is on the critical path but the output sits in a queue that is not worked within the actionable window. An alert model that generates output in real time but feeds a team working through a backlog at hourly intervals produces decision latency equal to the time the output waits in the queue, not the time the model took to produce it. The fix requires either increasing capacity to reduce queue depth or redesigning the process so that high-priority output is triaged separately from the general queue. Neither is a model improvement.
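One way to picture the triage option is a queue ordered by how soon each item's actionable window closes, rather than by arrival order. The sketch below is one possible rule, chosen for illustration; the class name, the alert identifiers and the deadlines are all hypothetical.

```python
import heapq
from itertools import count

class TriageQueue:
    """Work queue that always hands out the alert whose window closes soonest."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int, str]] = []
        self._seq = count()  # tie-breaker: equal deadlines keep arrival order

    def push(self, alert_id: str, window_closes_in_s: float) -> None:
        heapq.heappush(self._heap, (window_closes_in_s, next(self._seq), alert_id))

    def pop(self) -> str:
        # Smallest deadline first; sequence number resolves ties.
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)

queue = TriageQueue()
queue.push("routine-review", window_closes_in_s=3600)   # arrived first
queue.push("live-mule-pattern", window_closes_in_s=600)  # arrived second
queue.push("account-takeover", window_closes_in_s=600)   # arrived third

work_order = [queue.pop() for _ in range(len(queue))]
print(work_order)
```

The two urgent alerts are worked before the routine one even though they arrived later. The model is unchanged; only the order of work is different.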
Governance is the third source. The model produces output that triggers a decision that requires approval, committee review, or escalation before action can be taken. In regulated industries this is often unavoidable, but the governance process can be designed to fit the actionable window or it can be designed without reference to it. Governance designed without reference to the decision’s actionable window will routinely produce decision latency that exceeds it.
None of these sources is visible in model accuracy metrics. All of them can be identified by measuring decision latency. The measurement requires three inputs for each decision type: the actionable window, the timestamp at which the decision could theoretically have been made, and the timestamp at which the decision was actually made. The gap between the second and third is decision latency. The comparison between decision latency and the actionable window tells you whether you are preventing or detecting.
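The measurement can be sketched as follows. The record structure, the function name and the window values are assumptions made for illustration; in particular, the windows are placeholders that each organisation would have to define for its own decision types.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class DecisionRecord:
    decision_type: str
    could_decide_at: datetime  # earliest moment the decision could have been made
    decided_at: datetime       # moment the decision was actually made

    @property
    def decision_latency(self) -> timedelta:
        return self.decided_at - self.could_decide_at

# Placeholder windows, not industry constants.
ACTIONABLE_WINDOWS = {
    "card_authorisation": timedelta(milliseconds=100),
    "aml_alert": timedelta(minutes=20),
}

def classify(record: DecisionRecord, windows: dict[str, timedelta]) -> str:
    """'prevention' if the decision landed inside its window, else 'detection'."""
    window = windows[record.decision_type]
    return "prevention" if record.decision_latency <= window else "detection"

raised = datetime(2024, 1, 1, 9, 0)
worked_quickly = DecisionRecord("aml_alert", raised, raised + timedelta(minutes=10))
worked_late = DecisionRecord("aml_alert", raised, raised + timedelta(hours=12))

for record in (worked_quickly, worked_late):
    print(record.decision_latency, classify(record, ACTIONABLE_WINDOWS))
```

Both records come from the same model with the same score. Only the second timestamp differs, and that is what moves the outcome from prevention to detection.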
The cascade architecture resolves the latency and sophistication tradeoff
The practical objection to millisecond-level inference requirements is that sophisticated models — deep learning architectures, large ensemble methods, models that consume many features — take longer to score than simple ones. The tension between model sophistication and decision latency is real. The resolution is a cascade architecture, sometimes called a step or waterfall model, and it is one of the most underdeployed patterns in enterprise AI.
The logic is straightforward. Not every transaction requires the same level of scrutiny. A first-stage model, fast and lightweight, scores every transaction in single-digit milliseconds. It is not trying to catch all fraud. It is trying to identify the subset of transactions where elevated risk justifies additional compute. Transactions that score below its risk threshold pass through quickly. Transactions that score above it are passed to a second stage, a heavier and more sophisticated model that takes more time but operates on a fraction of the total volume. A third stage, the most computationally expensive, may be called only on the transactions the second stage cannot confidently resolve. The overall system operates within the card authorisation window because the vast majority of transactions never reach the later stages.
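The control flow of a cascade can be sketched as below. It shows the routing logic only: the stage names, thresholds and toy scoring functions are invented for illustration and stand in for real models, and the fallback to manual review when no stage is confident is my assumption rather than part of the pattern.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class Stage:
    name: str
    score: Callable[[dict], float]  # risk score in [0, 1]
    approve_below: float            # scores under this are resolved as "approve"
    decline_at: float               # scores at or above this are resolved as "decline"

def run_cascade(txn: dict, stages: Sequence[Stage]) -> tuple[str, str]:
    """Return (decision, name of the stage that produced it)."""
    for stage in stages:
        risk = stage.score(txn)
        if risk < stage.approve_below:
            return "approve", stage.name
        if risk >= stage.decline_at:
            return "decline", stage.name
        # Otherwise this stage is not confident: refer to the next, heavier stage.
    return "review", stages[-1].name  # assumed fallback when no stage resolves it

def fast_score(txn: dict) -> float:
    # Toy stand-in for a lightweight first-stage model.
    return min(txn["amount"] / 1000, 1.0)

def heavy_score(txn: dict) -> float:
    # Toy stand-in for a slower, more sophisticated model.
    if not txn["new_payee"]:
        return 0.1
    return 0.9 if txn["amount"] > 500 else 0.5

STAGES = [
    # The first stage never declines on its own; it only approves or refers.
    Stage("fast", fast_score, approve_below=0.2, decline_at=float("inf")),
    Stage("heavy", heavy_score, approve_below=0.3, decline_at=0.8),
]

for txn in (
    {"amount": 50, "new_payee": False},
    {"amount": 800, "new_payee": False},
    {"amount": 800, "new_payee": True},
    {"amount": 300, "new_payee": True},
):
    print(txn, run_cascade(txn, STAGES))
```

Only the first transaction is resolved by the fast stage here, but in a realistic volume mix that is where most traffic would stop. The heavier stage spends its time only on what the first stage referred.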
The commercial consequence is significant. You do not need to run your most expensive inference on every transaction. You run it on the transactions where it will change the outcome. The first stage provides coverage. The later stages provide precision where coverage alone is insufficient. The architecture lets you have both speed at scale and sophistication where it matters, which a single-model approach optimised for either latency or accuracy cannot deliver.
The design of the first-stage model is the critical decision. It needs to be fast enough to score every transaction within the available window, sensitive enough to refer the right cases upward, and specific enough not to overwhelm the later stages with false positives. Getting this right is a different design problem from building the most accurate model possible — and it requires understanding the decision latency requirements before the architecture is chosen.
Why it is not being measured
Decision latency sits in the gap between two functions that rarely communicate about it. The data science team owns the model and measures model performance. The business operations team owns the process and measures operational throughput. Neither team owns the decision latency metric because it lives at the interface between the model output and the business process, which is territory neither function has a clear mandate to assess.
Vendors have no interest in surfacing it. A vendor whose model achieves strong accuracy metrics in offline evaluation has met the contractual performance criteria regardless of what decision latency looks like in production. The gap between model quality and production value is the client’s problem to identify and the client’s problem to solve.
The result is that most organisations running AI programmes believe, without having measured it, that their AI is operating within the actionable windows of the decisions it targets. Some of them are right. A meaningful proportion of them are not. They are funding accurate detection systems under the assumption that they are funding prevention systems, and the difference in outcome is being absorbed as operational loss rather than attributed to the architecture choice that produced it.
The fix is to measure it. The metric is not complicated. The conversation it starts is.
That conversation leads directly to a question that most AI strategy discussions never reach: if millisecond-level inference is required at the point of a transaction, where should that inference actually happen? For most large banks, insurers, and government institutions, the transactions themselves happen on infrastructure that has been processing them reliably at scale for decades. The answer to where inference should happen has significant implications for how AI deployment is designed — and it is a question the next piece in this series addresses directly.