Why ML in AML is not optional — and why training on legacy dispositions reproduces the inconsistency

There is a conversation that happens in every evaluation of ML-based AML systems by a conservative bank or a conservative investor. It goes approximately like this: "The technology sounds impressive, but our examiner will never accept a black-box model for suspicious-activity detection. We need something interpretable. We'll stick with rules."

This objection has been recited for long enough that it has the character of a settled fact. It is not. It was largely true in 2018, it was partially true in 2022, and in 2026 it has become a misunderstanding — of what "interpretable" means, of what examiners actually evaluate, and of how modern ML systems in regulated environments are built. The institutions that continue to hold it as settled are making a decision that has real downstream cost, and deserve to have the reasoning examined.

But there is a second misunderstanding waiting on the other side of the first one. Once an institution accepts that ML in AML is necessary — and the math on rule maintenance, false-positive volume, and examiner expectation has made that conclusion fairly unavoidable — the much more important question is what the model gets trained on. Most of the ML-augmented AML systems on the market today train, at least in part, on the historical dispositions their customer banks generate. That is a structural mistake. It produces models that reproduce the inconsistency of the underlying judgment at higher throughput, and it is the problem this piece is most interested in.

We want to take both arguments seriously. What follows is not a marketing argument. It is a technical one.

The category confusion: interpretable ≠ explainable

The central word that gets conflated in these conversations is "explainability," and the conflation obscures what is actually happening. Two properties are getting blurred together.

Interpretability is the property that a human can read the system's code and understand what it will do. A rule-based engine with 23 rules is interpretable: an engineer can read rule 14, see that it fires when cash deposits exceed $9,000 in rolling 24 hours, and predict its behavior.

Explainability is the property that, when the system reaches a specific decision on a specific case, it can tell you why — in a way that is auditable, that cites the evidence, and that a human can validate.

These are not the same property. You can have interpretable systems that are not explainable in any useful way, and you can have machine-learned systems that are explainable at the case level with greater specificity than any rule engine can deliver.

Consider what a rule engine actually tells an examiner when rule 14 fires on Customer A's transaction. It tells them: "this transaction matched a threshold." It does not tell them why that threshold is the right one, why that customer's behavior is unusual relative to their own baseline, what other signals corroborate the flag, or why rule 14 was preferred over rule 22. The rule is interpretable; the decision it generates is not particularly explainable beyond "the threshold was crossed."

Now consider what a properly-built ensemble ML system can tell an examiner about the same transaction. It can produce a case-level feature attribution — specifying that this transaction's risk score was driven 34% by the velocity anomaly relative to Customer A's eighteen-month behavioral baseline, 22% by the counterparty's adverse media correlation, 17% by the structuring-signature pattern detection, and so on across the contributing signals. It can show which features a hypothetical counterfactual transaction would have needed to suppress in order to stay below threshold. It can report the confidence bounds on the decision and the training lineage of the model that produced it.

The ML system is doing something the rule engine is structurally incapable of doing: explaining the specific decision on the specific case, with attribution to the specific signals that drove it. This is a kind of explainability that is stronger, not weaker, than what rules produce.

What SR 11-7 actually requires

The regulatory framework under which U.S. banks operate ML models in supervised activities is SR 11-7, "Guidance on Model Risk Management," jointly issued by the Federal Reserve and the OCC. It is worth reading if you have not, because the lived perception of what it requires differs meaningfully from what it actually says.

SR 11-7 does not prohibit ML models. It does not require that models be interpretable in the sense of "readable code." It does require four things: sound model development (with appropriate data, methodology, and testing), effective model validation (by a function independent of model development), governance with clear roles and accountability, and documentation sufficient for a third party to understand how the model works and assess whether it is fit for its intended purpose.

None of those requirements rules out machine learning. What they rule out is ML that is shipped without the surrounding discipline — without documented validation, without feature lineage, without monitoring for drift, without clear governance around who is accountable when the model misbehaves. Properly-built ML systems in AML satisfy SR 11-7. Carelessly-built ML systems do not — and neither do carelessly-built rule systems.

The "regulators won't accept ML" objection, more honestly stated, is: "we don't want to build the surrounding discipline that SR 11-7 requires." That is a coherent position for an institution to hold. But it should not be misrepresented as a regulatory constraint. It is an organizational choice.

Feature attribution, in practice

Let us be concrete about what case-level explainability looks like when the system is built well. In modern ML systems, two families of techniques produce feature-level attribution for individual decisions: local linear approximations (LIME) and Shapley-value methods (SHAP). Both are well-established, both have extensive peer-reviewed literature, and both are routinely used in production ML systems in regulated industries — credit underwriting, fraud detection, healthcare claims adjudication — where the regulatory bar for decision auditability is at least as high as AML.

Applied to a monitoring system, these techniques produce, for each flagged transaction, a decomposition of the risk score into contributions from each underlying behavioral feature. The investigator reviewing the case sees not just that the model flagged the transaction but which specific features of that specific customer's activity drove the flag. An examiner reviewing the SAR sees, in the supporting documentation, the same attribution — traceable to named features with documented definitions, traceable to specific data inputs.

This is more, not less, than what a rule-based system gives them. A rule engine, when an examiner asks "why did this fire and that one didn't?", tells them "rule 14 matched and rule 22 didn't." A properly-instrumented ML system tells them "this transaction had these seven features with these seven attributions; shifting any of the top three would have moved the risk score below flagging threshold."

The deeper point, worth stating plainly: a rule engine's interpretability is the interpretability of its rules. It is not the interpretability of its decisions. The decisions are only as justified as the rules are, and the rules themselves are, in practice, opaque in origin. Why is the velocity threshold ten transactions per day and not eight? Because the rule-writer picked a number. Is that number well-calibrated to the base rates of structuring in the customer population? Generally not examined. A well-built ML system, by contrast, has to answer that calibration question explicitly during validation. The discipline that SR 11-7 requires of ML systems is discipline that rule systems have historically avoided — by being informal enough that the question did not arise.

Why the rule-system status quo fails on its own terms

Beyond the explainability argument, there is a harder failure mode of rule-based monitoring that the industry collectively underweights. Rules can only detect patterns that a human rule-writer has anticipated. By construction, they cannot detect patterns that have not yet been written as rules.

This is a problem because modern financial crime is adversarial. The typologies used by sophisticated bad actors change faster than institutions can write, test, deploy, and tune new rules. Rule-based systems are structurally backward-looking: they encode the set of known typologies as of the most recent rule-update cycle, and they are blind to everything outside that set until an analyst manually notices a new pattern and proposes a new rule.

Anomaly-detection models — a category of ML system — do not share this constraint. They surface activity that is unusual relative to learned baselines without requiring that the unusualness have been pre-specified. This is not a theoretical advantage. It is routinely the mechanism by which novel typologies are discovered and then formalized into investigation playbooks and, eventually, rule form. An institution running only rule-based monitoring is, by definition, a lagging detector of typologies. An institution running ML-based monitoring with anomaly components has a leading indicator.

From a regulatory perspective, this matters in an increasingly adversarial direction. Enforcement focus is shifting — slowly, but visibly — toward whether institutions are detecting the typologies that matter, not whether they are filing SARs against the typologies that have historically mattered. An institution that can demonstrate it has active detection capacity for emerging patterns is in a better posture, not worse, than one that can only point at its rule set.

What "ML-native" means, and what it doesn't

We use the phrase "ML-native" deliberately. It is not a claim that every detection component is a neural network. It is a claim about architectural philosophy.

An ML-native system treats machine learning as the primary detection substrate and rules as narrow, well-understood, surgical complements where they make sense (for example: sanctions screening, where rules are the right tool because sanctions lists are exhaustive and the detection problem is exact-match). An ML-bolt-on system, by contrast, treats rules as the primary substrate and uses machine learning to re-score or deprioritize rule outputs. The distinction matters because ML-bolt-on systems inherit the blind spots of the rule substrate — they can suppress noise, but they cannot surface activity that the rule substrate missed entirely.

Being ML-native also imposes a discipline on the team building the system. It requires feature engineering as a first-class activity (not an afterthought), ensemble architecture that reconciles specialized models rather than voting among redundant rules, drift monitoring across multiple time horizons, and case-level explainability that is built into the system's output rather than reconstructed after the fact. These are engineering investments. They are also the investments that produce the case-level explainability that makes the system defensible in examination.

The label-noise problem most ML-AML systems quietly carry

This is where the second misunderstanding lives, and it is at least as consequential as the first. When a financial institution evaluates an ML-augmented AML system, the implicit assumption is that the model has been trained on something that looks like ground truth — alerts that were correctly closed as false positive, cases that correctly resulted in SARs, escalations that correctly identified suspicious activity. The further assumption is that the institution's own historical dispositions, fed back into the model, will make it sharper over time.

The problem is that legacy analyst dispositions are not ground truth. They are the output of human judgment applied under workload pressure, inconsistent training, and evolving institutional culture. Inter-analyst variance on the same alert at the same institution is well-documented in operational AML practice — the same pattern reviewed by different analysts produces different dispositions, sometimes closed-as-false-positive by one analyst and escalated by another. This is not a training problem in the sense of "we need better analyst training." It is a structural property of the work itself.

An ML model trained on those dispositions as ground-truth labels learns the inconsistency. This is known in the supervised-learning literature as label noise, and the consequence is well-understood: the model reproduces the labelers' inconsistency at the model's throughput rather than the labelers' throughput. The institution gets a faster pattern-match of inconsistent reasoning, not a fix for the underlying problem. Examiners increasingly recognize this. An ML system that is just a high-speed mirror of a noisy analyst queue does not get more deferential treatment at exam time than the noisy queue itself did — and the institution that deployed it has a harder time defending the model risk decision under SR 11-7 review.

The architectural answer is to separate the two things that the legacy data trail bundles together. Regulatory rules — BSA thresholds, FATF jurisdiction lists, OFAC sanctions guidance, FinCEN typology directives — are deterministic, citable, and universal. They are the part of the institutional memory that should be absorbed into the model. They become features with explicit regulatory lineage that an examiner can trace. Analyst dispositions, by contrast, are the part that should be handled with explicit noise controls: weak features only, not ground-truth labels; SAR filings as weak positive signals with appropriate caveats; multi-analyst consensus cases as the high-confidence training set; legacy free-text disposition narratives excluded entirely as training inputs in favor of model-generated narratives that produce the same explanation from the same inputs regardless of which analyst is reviewing the case. The fuller version of this argument is the disposition variance article →

The point — for an institution deciding whether and how to adopt ML in AML — is that the framework choice matters as much as the framework adoption. ML is not optional. How it is trained determines whether the resulting system is defensible.

The argument, compressed

Let us compress the argument to its core. The reflex against ML in AML was grounded, a decade ago, in three concerns: examiners would not accept non-rule-based detection; ML systems could not produce case-level explanations; and the regulatory framework for model governance was not suited to ML. None of the three holds in 2026. Examiners routinely accept ML models where the model risk discipline is sound; case-level explainability techniques are mature and widely deployed; and SR 11-7 explicitly accommodates ML models under the same governance framework that applies to any other model type.

What remains is the work of building the surrounding discipline. That work is real, and it is why not every vendor claiming "ML-powered AML" actually ships defensible systems — and why the question of what the model is trained on matters as much as whether ML is used at all. The institutions that understand both halves of the argument — that ML is necessary, and that training data is the architectural decision — will be the ones writing better SARs, catching newer typologies, and facing examiners from a stronger posture in the cycles ahead.

Absorb the rules. Replace the inconsistency.