Essential Reading
All Posts
-
Why a 94.7%-accurate smish classifier is a false-positive machine in production
Same model. Two prevalence assumptions. ~75 percentage points of precision difference. This is why public-benchmark accuracy does not transfer to production fraud detection.
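To make the prevalence effect concrete, here is a minimal sketch using Bayes' rule. The numbers are assumptions for illustration only, not the figures from the post: the true-positive rate and true-negative rate are both set to 0.947 (accuracy alone does not determine either), and the two prevalence values, 50% for a balanced benchmark and 1% for production traffic, are hypothetical.

```python
def precision(tpr: float, fpr: float, prevalence: float) -> float:
    """Share of flagged messages that are actually fraudulent (Bayes' rule)."""
    true_pos = tpr * prevalence            # fraudulent and flagged
    false_pos = fpr * (1.0 - prevalence)   # legitimate but flagged
    return true_pos / (true_pos + false_pos)


# Assumed operating point: TPR = TNR = 0.947, so FPR = 1 - 0.947.
TPR = 0.947
FPR = 1.0 - 0.947

# Hypothetical prevalence values, chosen only to show the effect.
balanced = precision(TPR, FPR, prevalence=0.50)  # benchmark-style class balance
rare = precision(TPR, FPR, prevalence=0.01)      # fraud is rare in production

print(f"precision at 50% prevalence: {balanced:.3f}")
print(f"precision at  1% prevalence: {rare:.3f}")
print(f"gap in percentage points:    {(balanced - rare) * 100:.1f}")
```

The model's error rates are identical in both calls. Only the base rate changes, and that alone moves precision from the mid-90s to the mid-teens under these assumed inputs.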
-
Rules Are the Bones, Models Are the Muscle
This is the third post in a short series on evaluating fraud detection. The first covered the lightweight evaluation you run during an active campaign. The second covered the heavy evaluation that gates a model release. This one steps back to a question that comes up every time someone reviews a detection system for the first time.
-
The Eval That Decides Whether You Ship
The companion to this post described the lightweight evaluation: the one you run under a clock, when a campaign is active and a response ships in hours. This post describes the other instrument. The heavy evaluation is what gates a model release. It is slow, deliberate, and expensive, and those properties are features rather than defects.
-
The Eval You Run When the Clock Is Running
Most writing about evaluation assumes you have time. It assumes a held-out dataset, a labeling panel, a power analysis, weeks of iteration. That assumption is fine for shipping decisions. It is useless during an active attack.