The Disposition Variance Problem: Why ML Trained on Legacy AML Data Reproduces the Inconsistency

In twenty years of frontline AML operations across HSBC, Morgan Stanley, and Capital One, the most consistent observation about disposition logic is also the one most rarely audited: the same alert, reviewed by different analysts at the same institution, gets disposed differently. Closed as false positive by one. Escalated for further review by another. Filed as a SAR by a third. The variance is not occasional, and it is not a failure of training or of policy adherence. It is a structural property of the work.

This piece is about why that variance matters in 2026, when machine learning has become standard architectural practice in AML technology, and why the most consequential decision a bank or credit union makes when adopting an ML-augmented AML system is not whether to use ML, but what to train it on. Systems that answer the second question differently look identical in a vendor brochure. They do not look identical under examination.

The structural observation

Consider the conditions under which AML analyst dispositions are produced. An analyst is reviewing a queue of alerts. The queue is sized to be cleared within a workday. Each alert carries some metadata — transaction details, customer profile, prior history — and the analyst applies institutional procedure, training, judgment, and time-pressure-driven heuristics to produce a disposition. The disposition is written into a free-text narrative field. The analyst moves to the next alert.

Across a queue of hundreds of alerts per week, across a roster of analysts with different tenures and different training cohorts, across institutional cultures that shift over the months and years, the dispositions on similar activity will not be identical. They will not even be approximately identical. They will vary in ways that correlate with which analyst was working that shift, what their queue depth looked like that day, what the most recent training memo emphasized, and what the institutional tolerance for SAR over-filing happens to be at that moment.

This is not a training problem in the sense that "the analysts need more training." Many of the analysts in question have decades of experience and certifications that exceed what is required. It is not a staffing problem in the sense that "more analysts would fix it." More analysts produce more dispositions with the same variance properties. It is a structural property of human judgment applied under workload pressure with inconsistent guidance — and it is reproducible across every institution where the practice has been observed, regardless of size, sophistication, or regulatory posture.

Why ML trained on this data inherits the problem

In supervised machine learning, the model learns from labeled examples. For an AML model trained on historical dispositions, the labels are the dispositions themselves: this alert was closed as false positive, that alert was escalated, this case resulted in a SAR. The model's job is to learn the patterns in the features that predict each label.

When the labels are inconsistent — when the same pattern of features produced "closed" in one historical case and "escalated" in another — the model learns the inconsistency. This phenomenon is well-documented in the supervised-learning literature under the name label noise. The mathematical consequence is straightforward: the model's predictions converge toward the distribution of the labels, not toward whatever ground truth the labels were imperfectly approximating. If the labels reflect 80% reasoned judgment and 20% workload-pressure noise, the model learns to reproduce that mix.
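A minimal sketch of that convergence, under stated assumptions: the counting estimator below stands in for any flexible classifier trained with log loss, which for a fixed feature pattern converges to the conditional frequency of the labels it was shown. The pattern name, the disposition strings, and the 8-to-2 split are hypothetical illustration values, not data from any institution.

```python
from collections import Counter, defaultdict


def fit_label_frequencies(records):
    """Estimate P(label | pattern) by counting historical dispositions."""
    counts = defaultdict(Counter)
    for pattern, label in records:
        counts[pattern][label] += 1
    return {
        pattern: {label: n / sum(tally.values()) for label, n in tally.items()}
        for pattern, tally in counts.items()
    }


# Hypothetical history: one feature pattern, ten past reviews,
# eight escalated and two closed as false positive.
PATTERN = "repeated_cash_deposits_below_threshold"
history = [(PATTERN, "escalated")] * 8 + [(PATTERN, "closed_fp")] * 2

learned = fit_label_frequencies(history)
print(learned[PATTERN])  # {'escalated': 0.8, 'closed_fp': 0.2}
```

The estimator has no way to tell which of the two dispositions was the reasoned one. It returns the mix it was given.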

What an institution gets, in practice, is an ML system that mirrors its analyst pool's inconsistency at the model's throughput. The system is faster than the analysts, which can look like efficiency. But the disposition logic is no more consistent than the human reasoning that fed the training set. The institution has built a high-speed mirror of a noisy queue, not a fix for the underlying problem.

Some readers will object that modern ML practice includes techniques for handling label noise — sample weighting, ensemble methods, robust loss functions, semi-supervised approaches. This is true, and these techniques do produce real improvements at the margin. But they cannot recover signal that was not present in the labels to begin with. If two analysts genuinely produce different dispositions on the same pattern, no statistical technique can determine which one was correct. The model can only learn the central tendency of the disagreement, which is not what the institution thought it was buying.

Why this is a regulatory problem, not just a quality problem

The model risk management framework most relevant to AML systems in the United States — SR 11-7 from the Federal Reserve, plus the parallel OCC and FDIC guidance — requires institutions to document their model's training data, validation methodology, and decision basis. An examiner reviewing an ML-augmented AML system is increasingly likely to ask the questions that follow from this framework. Where did the training labels come from? What is the variance among labels for similar input patterns? How did the institution validate that the model's outputs reflect defensible reasoning rather than the variance of the labelers?

An institution that trained its ML system on its own historical dispositions does not have good answers to those questions. The label variance is documented in the system's own data and reproducible by anyone with database access. The model's outputs are, by construction, a function of that variance. The defensibility posture under SR 11-7 review is materially weakened, not strengthened, by the ML adoption — exactly the opposite of what the institution intended.

This is the regulatory problem that the disposition variance creates. It is not abstract. An examiner who pulls a sample of the institution's recent SAR filings and runs them against a sample of closed-as-false-positive cases with similar feature patterns, and finds disposition variance correlated with which analyst reviewed the case, has identified a model risk issue. An institution running ML-augmented monitoring trained on that variance is responsible for explaining what the model is learning and why its decisions can be defended. That explanation is significantly harder when the training labels are themselves inconsistent.
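The examiner-style comparison described above can be sketched roughly as follows. This is an illustration, not an examination procedure: it assumes cases have already been bucketed into a hypothetical `pattern_key` that groups similar feature patterns, and the analyst identifiers and disposition strings are made up.

```python
from collections import defaultdict


def escalation_rate_by_analyst(cases):
    """cases: iterable of (pattern_key, analyst_id, disposition).

    Returns {pattern_key: {analyst_id: share of cases not closed as FP}}.
    """
    # Per pattern and analyst: [count not closed as FP, total count].
    tallies = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for pattern, analyst, disposition in cases:
        tally = tallies[pattern][analyst]
        tally[0] += disposition != "closed_fp"
        tally[1] += 1
    return {
        pattern: {analyst: hit / total for analyst, (hit, total) in by_analyst.items()}
        for pattern, by_analyst in tallies.items()
    }


def rate_spread(rates):
    """Max minus min analyst rate within one pattern; 0.0 means full agreement."""
    values = list(rates.values())
    return max(values) - min(values)


# Hypothetical sample: two analysts reviewing the same kind of activity.
cases = (
    [("P1", "analyst_a", "closed_fp")] * 3
    + [("P1", "analyst_a", "escalated")]
    + [("P1", "analyst_b", "closed_fp")]
    + [("P1", "analyst_b", "sar")] * 3
)
rates = escalation_rate_by_analyst(cases)
spread = rate_spread(rates["P1"])
```

In this made-up sample the two analysts escalate or file on 25% and 75% of comparable cases. A real review would also need to account for sample size and for legitimate differences between the cases each analyst received.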

Why incumbents can't fix this without rebuilding

The natural follow-up question is why the established AML technology vendors have not solved this problem. The answer has to do with what those vendors are economically and architecturally positioned to do.

An incumbent AML monitoring vendor's value proposition to its existing customer base is continuity. The customer institution has integrated the vendor's rule engine into operations over years or decades. The institution's analysts have produced dispositions inside that vendor's workflow. The training data, if the vendor adds ML capability, comes from that disposition history. The vendor cannot tell the customer, "your historical disposition data is not suitable as ground truth for training," because that statement is functionally an admission that the entire prior product was operating on noisy inputs and producing variance-amplified outputs all along. The commercial position does not permit the architectural honesty.

What the incumbent vendors offer instead is ML capability layered on top of the existing rule and disposition substrate. The marketing language varies — "AI-augmented monitoring," "smart triage," "ML-prioritized alerts" — but the architecture is structurally the same: rules produce alerts, ML re-scores or re-prioritizes the alerts, the disposition history feeds back into the ML model as training signal. The variance problem is preserved end-to-end. The incumbents cannot escape it without a ground-up rebuild that their commercial relationships do not allow.

This is what we mean when we describe the disposition variance problem as a structural advantage for a new architecture rather than a feature that can be retrofitted. A platform built in 2026 for AML monitoring can make the architectural decision differently from the start — and the institutions evaluating their monitoring infrastructure now are in a position to demand that decision.

The architectural answer

The fix begins with a distinction that legacy AML systems quietly bundled together. The historical data trail from a legacy monitoring system contains two materially different things.

One is the set of regulatory rules the institution is required to apply. CTR thresholds defined in the BSA regulations at 31 CFR 1010.311. FATF Recommendation 19 jurisdiction guidance. The OFAC sanctions framework. FinCEN's structuring and funnel-account guidance. These are deterministic, citable, and universal. They encode what regulation requires the institution to detect, and they do not vary by which analyst happens to be on shift. This is the part of the institutional memory that should be absorbed into the model, and absorbed with full lineage to the source regulatory authority. Every rule-derived feature should carry its citation and threshold documentation, so that an examiner reviewing the model's decision basis can trace the reasoning back to the underlying regulatory expectation. This is the half of the framing we describe as "absorbing the rules."
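One way a rule-derived feature might carry its citation and threshold is sketched below. The `RuleFeature` class, its field names, and the `featurize` helper are hypothetical and are not taken from any vendor's catalog. The example uses the currency transaction reporting threshold of more than $10,000 under 31 CFR 1010.311 and deliberately ignores same-day aggregation.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class RuleFeature:
    """A deterministic feature that records where its threshold comes from."""
    name: str
    citation: str
    threshold: Decimal
    description: str

    def evaluate(self, amount: Decimal) -> bool:
        # Strictly greater than: the rule covers amounts of more than the threshold.
        return amount > self.threshold


CTR_CASH = RuleFeature(
    name="ctr_cash_over_threshold",
    citation="31 CFR 1010.311",
    threshold=Decimal("10000"),
    description="Transaction in currency of more than $10,000",
)


def featurize(rule: RuleFeature, amount: Decimal) -> dict:
    """Return the feature value together with its regulatory lineage."""
    return {
        "feature": rule.name,
        "fired": rule.evaluate(amount),
        "citation": rule.citation,
        "threshold": str(rule.threshold),
    }


row = featurize(CTR_CASH, Decimal("10000.01"))
```

Because the lineage travels with the feature value, the same record can feed both the model and the examiner-facing documentation.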

The other thing in the historical data trail is the layer of analyst dispositions on top: the closed-as-false-positive determinations, the SAR filings, the free-text narratives. This is the variance-prone layer. It should not be absorbed as ground truth. The treatment policy that handles it is operationally specific. Closed-as-false-positive dispositions are weak feature inputs only, never used as ground-truth labels. SAR filings can be used as weak positive labels, with appropriate weighting that reflects single-analyst variance. Cases that received independent review by three or more analysts who reached the same disposition can be treated as high-confidence labels, because the consensus reduces the label-noise floor. Bank rule fires are features, not labels: whether a rule fired is deterministic, while whether the resulting alert was a true positive belongs to the disposition layer, which is subjective and noisy. Legacy free-text disposition narratives are excluded as training inputs entirely; the model generates its own disposition narratives from feature attributions, so that the same inputs produce the same explanation regardless of which analyst is reviewing the case.
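The treatment policy above can be sketched as a small function. Everything named here is an assumption for illustration: the tier names, the disposition strings, and especially the numeric weights, since the text says only that SAR labels are down-weighted. Letting the consensus rule take precedence over the closed-as-false-positive rule is one reading of the policy, not something the text settles.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Tier(Enum):
    HIGH_CONFIDENCE = "high_confidence"
    WEAK_POSITIVE = "weak_positive"
    FEATURE_ONLY = "feature_only"


@dataclass(frozen=True)
class LabelDecision:
    tier: Tier
    label: Optional[str]  # None means "do not use as a training label"
    weight: float


# Hypothetical values, chosen only to make the sketch concrete.
HIGH_CONFIDENCE_WEIGHT = 1.0
WEAK_POSITIVE_WEIGHT = 0.3
MIN_CONSENSUS_REVIEWERS = 3


def treat_label(
    final_disposition: str,
    independent_reviews: Tuple[str, ...] = (),
) -> LabelDecision:
    """Map one legacy case to a label tier and training weight."""
    # Three or more independent reviewers who all reached the same disposition.
    if (
        len(independent_reviews) >= MIN_CONSENSUS_REVIEWERS
        and len(set(independent_reviews)) == 1
    ):
        return LabelDecision(
            Tier.HIGH_CONFIDENCE, independent_reviews[0], HIGH_CONFIDENCE_WEIGHT
        )
    # A SAR filed without consensus review: weak positive label.
    if final_disposition == "sar":
        return LabelDecision(Tier.WEAK_POSITIVE, "sar", WEAK_POSITIVE_WEIGHT)
    # Everything else, including closed-as-false-positive: feature input only.
    return LabelDecision(Tier.FEATURE_ONLY, None, 0.0)
```

For example, `treat_label("closed_fp")` yields a feature-only decision with no label, while `treat_label("sar", ("sar", "sar", "sar"))` yields a high-confidence label.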

The combined effect is a model that has absorbed the regulatory rules with documented lineage, while explicitly rejecting the legacy analyst dispositions that carry the variance. The detection decisions are traceable to regulatory authority. The disposition narratives are reproducible by construction. The SR 11-7 documentation has a clear story about what the labels are, why some are excluded, and why the model's outputs can be defended at examination.
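Reproducibility by construction can be illustrated with a deterministic template over feature contributions. This is a sketch under assumptions: the contributions are taken as already computed by some attribution method the text does not specify, and the `render_narrative` function, the feature names, and the wording of the template are hypothetical.

```python
def render_narrative(contributions, top_k=3):
    """Render a narrative from feature contributions, deterministically.

    contributions: {feature_name: signed contribution to the score}.
    Ties in magnitude are broken by feature name so that input ordering
    never changes the output.
    """
    ranked = sorted(contributions.items(), key=lambda kv: (-abs(kv[1]), kv[0]))
    parts = [f"{name} ({value:+.2f})" for name, value in ranked[:top_k]]
    return "Top contributing features: " + "; ".join(parts) + "."


# Hypothetical contributions for one case, supplied in two different orders.
first = render_narrative({
    "ctr_cash_over_threshold": 0.42,
    "new_account": 0.10,
    "high_risk_jurisdiction": -0.05,
})
second = render_narrative({
    "high_risk_jurisdiction": -0.05,
    "new_account": 0.10,
    "ctr_cash_over_threshold": 0.42,
})
```

Both calls return the same string, because the narrative depends only on the contribution values and not on who or what supplied them.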

What this means for an institution evaluating AML technology in 2026

The practical implication of the disposition variance argument is that the technology evaluation question is no longer "do we adopt ML in AML." That question has been answered by the enforcement environment, the false-positive volumes coming out of rule-based monitoring, and the examiner expectation that institutions will modernize. The remaining question is the one most evaluations do not ask: what is the candidate system trained on, and what does the vendor say about the variance in that training data?

The shorter version of the right question to ask a vendor is: do you train on our historical analyst dispositions, and if so, how do you handle the inter-analyst variance in that data? The answers separate the architectures cleanly. Vendors that train on historical dispositions without addressing the variance are shipping high-speed mirrors of the institution's existing noise. Vendors that absorb regulatory rules as features with documented lineage, and treat analyst dispositions with explicit noise controls (closed-as-false-positive dispositions excluded as training labels; SAR filings used as weak positive labels; high-confidence labels reserved for multi-analyst consensus cases), are building an architecture that is easier to defend under examination.

Vigilic is in the second category. The Label Treatment Policy that governs our ML training pipeline is a public commitment to that distinction, and the documentation we produce for examiner review reflects it. The model's training labels carry confidence tiers. The detection features carry regulatory lineage to Category 15 of the feature catalog. The disposition narratives are generated per case from feature contributions, so they reproduce the same explanation from the same inputs every time. The architectural choice is made early, not retrofitted, and that is what makes it defensible.

The compressed argument

To compress the argument to its essentials: legacy AML disposition data carries inter-analyst variance that is structural and irreducible. ML trained on that data inherits the variance and reproduces it at higher throughput. The regulatory consequence is a harder SR 11-7 defense, not an easier one. The architectural fix is to absorb the regulatory rules (which are deterministic and citable) and reject the legacy analyst dispositions as ground-truth labels (which are not). The disposition narratives are generated from the model's own feature attributions, not inherited from the inconsistent reasoning that produced the legacy data. Same inputs, same answer.

That is the difference between an ML-augmented version of the old system and an ML-native architecture that takes the disposition variance problem seriously.

Absorb the rules. Replace the inconsistency.

Swarna Revuru is co-founder and CRCO at Vigilic. She brings 20+ years of frontline AML operations experience from HSBC, Morgan Stanley, and Capital One to the design of Vigilic's compliance architecture, label treatment policy, and examiner alignment. Shaan Revuru is co-founder and CEO of Vigilic.