The Detection Trap

Ask a tax agency where it is investing in AI and the answer, in most cases, is detection. Better fraud scoring on refund claims. Smarter audit case selection models. Improved risk signals on VAT transaction chains. These investments share a common assumption: that finding non-compliance is the binding constraint, and that surfacing more of it will translate into more revenue collected.

That assumption was correct when detection was the hard problem. In most modern tax administrations, it no longer is.

The typical tax authority already has a referral backlog that its investigation teams cannot clear. Compliance risk models are generating more flagged returns than examiners can open. Fraud detection systems are producing more alerts than investigation units can action. Collections queues contain more delinquent accounts than collectors can contact. In each of these domains, the scarce resource is not suspicious cases; it is the investigator hours needed to work them. AI that produces more cases without improving how those cases are evaluated, prioritised, and resolved is not closing the tax gap. It is lengthening the queue.

The agencies building durable advantage from AI have understood something the detection-focused majority has not: the real operational bottleneck is not surfacing non-compliance. It is converting non-compliance into collected revenue within the constraints of finite capacity. That requires answering a fundamentally different question.

The Right Question

The question that drives most tax AI investment is: is this suspicious?

The question that drives revenue yield is: which intervention creates the highest expected revenue outcome?

These are not the same question. A case that is highly suspicious may have a low expected revenue outcome: the taxpayer is insolvent, the underpayment is small, the legal position is uncertain, or the investigation will consume examiner hours disproportionate to the likely adjustment. A case with a moderate suspicion score may have a very high expected revenue outcome: a solvent business, a clear and well-documented underreporting issue, and a cooperative compliance history that suggests early resolution is likely.

Current detection models produce the first answer. Almost no AI investment in tax administration is aimed at producing the second.

Expected revenue outcome depends on four variables, each of which requires its own modelling capability, and each of which is currently either unmodelled or modelled separately from the others.

Probability and magnitude of underpayment. Not just whether a return is risky, but how much tax is likely to be at stake if the case is opened. A return that is moderately anomalous but belongs to a large, complex business may yield a much larger adjustment than a clearly suspicious return from a small trader. Detection models optimised for anomaly identification are not optimised for adjustment magnitude. They are different objectives, and conflating them produces case selections that fill examiner diaries with low-yield work.

Probability of successful recovery. Identified underpayment is not collected revenue. The probability of collection depends on the taxpayer’s financial position, the legal strength of the agency’s position, the likely appeal behaviour, and how close the case is to its statute of limitations. A case that will yield a large assessment but spend three years in litigation before generating any revenue may represent a worse expected outcome than a smaller, cleaner case that resolves within six months. Recovery probability modelling, which incorporates financial health signals, legal position strength, and historical resolution patterns for similar cases, is almost entirely absent from current tax AI portfolios.

Intervention type and cost. Interventions differ in both cost and yield. An automated third-party data matching notice costs a fraction of a full field examination and resolves a large proportion of underreporting cases at the lower-risk end of the spectrum. A targeted compliance letter, a pre-filing inquiry, a desk examination, a field audit, and a criminal investigation each carry different resource costs, different taxpayer responses, and different revenue yields. Matching intervention type to case characteristics is a distinct optimisation problem from case selection, and one that current AI addresses poorly, defaulting to referral rather than recommendation.

Operational throughput. Even well-selected, well-matched cases produce no revenue if they sit in an examiner’s queue past their statute of limitations, or if case management bottlenecks prevent timely progression to assessment and collection. Throughput modelling — identifying which open cases are at risk of stalling, which are approaching statutory deadlines, and where examiner capacity is being consumed by complexity that could be resolved with specialist support — is the operational layer that determines whether good case selection translates into actual revenue.
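One way the four variables might be combined into a single ranking figure is sketched below. It is an illustration under stated assumptions, not a description of any agency's model: the field names are hypothetical, the probabilities are treated as independent so that they can simply be multiplied, and intervention cost is subtracted before dividing by examiner hours.

```python
from dataclasses import dataclass


@dataclass
class Case:
    # All fields are hypothetical model outputs, not a real agency schema.
    case_id: str
    p_underpayment: float       # probability an adjustment exists (0..1)
    expected_adjustment: float  # tax at stake if an adjustment exists
    p_recovery: float           # probability the assessment is collected (0..1)
    p_on_time: float            # probability the case completes before its deadline (0..1)
    intervention_cost: float    # non-labour cost of the chosen intervention
    examiner_hours: float       # hours the chosen intervention is expected to consume


def expected_yield_per_hour(case: Case) -> float:
    """Expected collected revenue, net of intervention cost, per examiner hour.

    Assumes the three probabilities are independent, which is a simplification.
    """
    expected_revenue = (
        case.p_underpayment
        * case.expected_adjustment
        * case.p_recovery
        * case.p_on_time
    )
    return (expected_revenue - case.intervention_cost) / case.examiner_hours


# Two illustrative cases with made-up numbers.
cases = [
    Case("highly-suspicious-insolvent", 0.90, 40_000, 0.10, 0.90, 500, 60),
    Case("moderate-risk-solvent", 0.50, 200_000, 0.80, 0.90, 500, 80),
]
ranked = sorted(cases, key=expected_yield_per_hour, reverse=True)
for c in ranked:
    print(c.case_id, round(expected_yield_per_hour(c), 2))
```

With these made-up inputs the moderately suspicious but solvent case ranks first, which is the divergence between suspicion and expected outcome described earlier.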

Where This Changes Investment Decisions

The expected outcome frame does not eliminate the need for detection capability. Fraud models, risk scoring, and anomaly detection remain necessary inputs. The shift is in how they are used — as the first stage of a pipeline that ends in expected outcome optimisation, rather than as the output itself.

In audit and examination, the change is from selecting the most suspicious returns to selecting the returns with the highest expected yield per examiner hour. This requires combining the risk score with an adjustment magnitude estimate, a recovery probability score, and an intervention cost model. Agencies that have made this shift consistently report material improvement in revenue per examination — not because they are auditing more, but because the hours they are spending are concentrated where the return is highest.

In collections, the equivalent shift is from identifying all delinquent taxpayers to determining, for each delinquent account, what intervention will produce the highest recoverable yield at what cost and in what timeframe. A delinquent account belonging to a financially stressed individual who will self-cure within 90 days requires a different response than an account belonging to a solvent business that has the means to pay and is testing whether the agency will act. Treating both with the same collections strategy wastes resource on the first and loses revenue on the second. Propensity models that distinguish self-curers, ability-to-pay cases, and deliberate non-payers — and match each to the intervention most likely to convert — produce systematically higher collections yield from the same investigator capacity.
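The three-way distinction above can be pictured as a routing rule. The sketch below is deliberately crude: the 0.7 threshold, the input names, and the intervention labels are all hypothetical, and "ability-to-pay cases" is read here as accounts where limited means is the obstacle, which is an interpretation rather than something the text specifies. A real system would use calibrated propensity models instead of fixed rules.

```python
def collections_action(p_self_cure: float, can_pay: bool) -> str:
    """Route a delinquent account to an intervention (illustrative rules only)."""
    if p_self_cure >= 0.7:            # hypothetical threshold
        return "low_cost_reminder"    # likely to pay within the window anyway
    if can_pay:
        return "early_enforcement"    # has the means to pay and is not paying
    return "payment_arrangement"      # limited ability to pay


print(collections_action(p_self_cure=0.85, can_pay=False))
print(collections_action(p_self_cure=0.10, can_pay=True))
print(collections_action(p_self_cure=0.20, can_pay=False))
```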

In VAT and transaction tax fraud — where carousel schemes can generate losses of €40–60 billion annually across the EU — the expected outcome question takes on a network dimension. Carousel fraud is organised: there are promoters, facilitators, and participating entities arranged in transaction chains specifically designed to frustrate individual case investigation. A detection model that flags individual entities for investigation may surface real fraud while being strategically useless — because investigating one node in a carousel ring without the others produces one small recovery and alerts the ring to restructure. Expected outcome modelling that scores not just individual cases but investigation sequences — identifying the cases that, worked in the right order, disrupt the whole scheme rather than one participant — is qualitatively different from detection, and qualitatively more valuable.

The Metric That Exposes the Problem

There is one metric that makes the detection trap visible: revenue per examiner hour.

Most tax agencies track this metric. Few use it to evaluate their AI investments. The typical AI business case in tax administration is built on cases flagged, anomalies detected, or alerts generated — outputs that measure the detection layer’s productivity, not the collection layer’s. A system that doubles the number of cases flagged while holding examiner hours constant has not improved revenue per examiner hour. It has halved the proportion of flagged cases that get worked.
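The arithmetic behind that last point can be made concrete. The figures below are hypothetical and the sketch assumes every case takes the same number of hours to work.

```python
def worked_share(flagged: int, examiner_hours: float, hours_per_case: float) -> float:
    """Share of flagged cases that examiners have the capacity to work."""
    capacity = examiner_hours / hours_per_case  # cases that can be worked
    return min(1.0, capacity / flagged)


# Hypothetical agency: 100,000 examiner hours, 40 hours per case.
before = worked_share(flagged=10_000, examiner_hours=100_000, hours_per_case=40)
after = worked_share(flagged=20_000, examiner_hours=100_000, hours_per_case=40)
print(before, after)
```

Capacity is 2,500 cases in both scenarios, so doubling the flags moves the worked share from 25% to 12.5% while the number of cases actually worked stays the same.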

Revenue per examiner hour improves only when the cases being worked are better selected, more efficiently progressed, and more reliably converted to collected revenue. That requires the four modelling capabilities described above — adjustment magnitude, recovery probability, intervention matching, and throughput management — working together in a pipeline that ends in a ranked action recommendation rather than a risk flag.

Agencies that have restructured their AI investment around this metric find that the constraint it exposes is almost never detection quality. It is almost always some combination of case prioritisation, intervention matching, and operational throughput — the problems that detection AI, however sophisticated, does not address.

The Recommendation

Build expected outcome models alongside detection models — and evaluate AI investment performance against revenue per examiner hour, not cases flagged.

In practice this means four things.

First, instrument the existing detection pipeline to measure what happens after a case is flagged. What proportion of flagged cases are opened? Of those opened, what proportion result in an adjustment? What is the average adjustment per case worked? What proportion of assessments are collected within twelve months? Without this instrumentation, AI investment decisions are made without feedback on the metric that matters — and detection models are optimised in isolation from the revenue outcomes they are supposed to drive.
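A minimal sketch of that instrumentation follows. The record layout and metric names are hypothetical, and the twelve-month collection figure is computed by value (amount collected over amount assessed) rather than by case count, which is one of several reasonable readings.

```python
from dataclasses import dataclass


@dataclass
class FlaggedCase:
    # Hypothetical record layout; real case-management fields will differ.
    opened: bool
    adjustment: float = 0.0     # assessed additional tax (0 if none)
    collected_12m: float = 0.0  # amount collected within twelve months


def ratio(numerator: float, denominator: float) -> float:
    """Safe division: returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def funnel_metrics(cases: list[FlaggedCase]) -> dict[str, float]:
    """Measure what happens to cases after the detection layer flags them."""
    opened = [c for c in cases if c.opened]
    adjusted = [c for c in opened if c.adjustment > 0]
    assessed = sum(c.adjustment for c in adjusted)
    collected = sum(c.collected_12m for c in adjusted)
    return {
        "open_rate": ratio(len(opened), len(cases)),
        "adjustment_rate": ratio(len(adjusted), len(opened)),
        "avg_adjustment_per_case_worked": ratio(assessed, len(opened)),
        "collected_share_12m": ratio(collected, assessed),
    }


# Made-up sample: five flagged cases, three opened, two with adjustments.
flagged = [
    FlaggedCase(opened=True, adjustment=10_000, collected_12m=10_000),
    FlaggedCase(opened=True, adjustment=5_000, collected_12m=2_000),
    FlaggedCase(opened=True),
    FlaggedCase(opened=False),
    FlaggedCase(opened=False),
]
metrics = funnel_metrics(flagged)
print(metrics)
```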

Second, build adjustment magnitude models alongside risk scores. The question is not only whether a return is suspicious but how much tax is likely to be at stake. This requires combining risk signals with information about the taxpayer’s industry, size, transaction complexity, and historical compliance pattern. Agencies that have built this capability consistently report that the rank order of cases by expected adjustment is materially different from the rank order by risk score — and that the revenue difference between working the two lists is significant.
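The gap between the two rank orders can be checked with a simple comparison. The sketch below uses five invented cases, treats expected adjustment as risk score multiplied by a magnitude estimate (a simplification), and compares the top two cases under each ordering.

```python
# Hypothetical cases: id -> (risk score, estimated adjustment if confirmed).
cases = {
    "A": (0.95, 8_000),
    "B": (0.90, 12_000),
    "C": (0.60, 150_000),
    "D": (0.55, 90_000),
    "E": (0.30, 20_000),
}


def risk(case_id: str) -> float:
    return cases[case_id][0]


def expected_adjustment(case_id: str) -> float:
    score, magnitude = cases[case_id]
    return score * magnitude


def top_k(key, k: int) -> list[str]:
    """Return the k case ids that rank highest under the given key."""
    return sorted(cases, key=key, reverse=True)[:k]


by_risk = top_k(risk, 2)
by_expected = top_k(expected_adjustment, 2)
overlap = set(by_risk) & set(by_expected)

print("by risk:", by_risk, sum(expected_adjustment(c) for c in by_risk))
print("by expected adjustment:", by_expected, sum(expected_adjustment(c) for c in by_expected))
print("overlap:", overlap)
```

In this invented data the two lists share no cases, and the expected-adjustment list carries several times the expected value of the risk-score list. Real portfolios will show less extreme differences; the point is that the comparison is cheap to run once a magnitude estimate exists.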

Third, add recovery probability to the case prioritisation model. A large expected adjustment on an insolvent taxpayer is not a large expected recovery. Financial health signals — payment history, asset indicators, business trajectory — are available to most agencies and are routinely used in collections but rarely integrated into audit case selection. Bringing recovery probability into the prioritisation model before a case is assigned to an examiner eliminates a class of high-effort, low-yield investigations that consume capacity without contributing to collected revenue.

Fourth, invest in intervention matching. Not every compliance issue requires a field examination. The agencies generating the highest revenue yield per investigator hour are not the ones running the most audits. They are the ones that have mapped the intervention spectrum — automated notice, targeted letter, compliance visit, desk examination, field audit, criminal referral — to the case characteristics that predict which intervention will convert most efficiently, and have built the recommendation infrastructure to route cases accordingly at scale.
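Intervention matching can be framed as choosing, for each case, the option with the highest expected net value. The sketch below is one possible formulation, not an established method: the hours per intervention, the value placed on an examiner hour, and the per-case conversion probabilities are all hypothetical inputs that would in practice come from an agency's own models and cost data.

```python
# Hypothetical examiner hours consumed by each intervention type.
HOURS = {
    "automated_notice": 0.1,
    "targeted_letter": 0.5,
    "desk_examination": 15.0,
    "field_audit": 80.0,
}
HOUR_VALUE = 300.0  # assumed opportunity cost of one examiner hour


def recommend(amount_at_stake: float, p_convert: dict[str, float]) -> str:
    """Pick the intervention with the highest expected net value.

    p_convert maps each intervention to the estimated probability that it
    converts the amount at stake into collected revenue for this case.
    """
    def net_value(name: str) -> float:
        return amount_at_stake * p_convert[name] - HOURS[name] * HOUR_VALUE

    return max(HOURS, key=net_value)


small_case = recommend(
    3_000,
    {"automated_notice": 0.60, "targeted_letter": 0.62,
     "desk_examination": 0.80, "field_audit": 0.90},
)
large_case = recommend(
    500_000,
    {"automated_notice": 0.05, "targeted_letter": 0.08,
     "desk_examination": 0.35, "field_audit": 0.70},
)
print(small_case, large_case)
```

For the small case the extra conversion probability of heavier interventions does not cover their cost in hours, so the automated notice is recommended. For the large case the field audit is worth its 80 hours.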

The detection pile will keep growing. The tax gap is large, the data sources feeding risk models are expanding, and anomaly detection capability will continue to improve. What will not automatically improve is the rate at which detected non-compliance becomes collected revenue. That conversion rate is a function of how the pile is worked — and it is where the real AI opportunity in tax administration sits.