The Accuracy Number That Doesn’t Tell You What You Think
Every risk adjustment vendor reports an AI accuracy rate. “95% coding accuracy.” “92% NLP precision.” “98% diagnosis identification rate.” These numbers appear in sales presentations, RFP responses, and marketing materials. They sound impressive. They’re also misleading in a way that matters for audit outcomes.
Most reported accuracy rates measure whether the AI correctly identified a diagnosis that exists in the clinical note. That’s pattern recognition: the system found the word “diabetes” or “CKD Stage 3” in the text. It’s a necessary capability. It’s also the easy part. The question that determines audit defensibility isn’t whether the AI found the diagnosis. It’s whether the documentation behind that diagnosis meets the evidentiary standard CMS applies during RADV review.
A system that finds diabetes in a note with 95% accuracy but doesn’t evaluate whether the note contains evidence of active monitoring, treatment adjustments, or clinical assessment is 95% accurate at the wrong task. The codes rejected in the OIG’s March 2026 audits didn’t fail because the diagnoses were missing from the charts. They failed because the documentation didn’t prove active management. Every one of those failed codes was “accurately” identified in the record. None of them was adequately validated.
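To make the distinction concrete, here is a deliberately toy sketch in Python. It assumes a keyword matcher stands in for "identification" and a short list of cue words stands in for "evidence of active management"; both lists, the function names, and the sample notes are hypothetical, and a real system would rely on far richer clinical logic.

```python
import re

# Hypothetical pattern and cue words, chosen only for illustration.
DIAGNOSIS_PATTERN = re.compile(r"\bdiabetes\b", re.IGNORECASE)
MANAGEMENT_CUES = ("a1c", "metformin", "insulin", "adjusted", "increased", "follow-up")


def identifies_diagnosis(note: str) -> bool:
    """Identification: is the diagnosis mentioned anywhere in the note?"""
    return bool(DIAGNOSIS_PATTERN.search(note))


def shows_active_management(note: str) -> bool:
    """Validation (toy version): does the note mention any management cue?"""
    lowered = note.lower()
    return any(cue in lowered for cue in MANAGEMENT_CUES)


history_only = "Past medical history: diabetes, hypertension."
managed = "Type 2 diabetes: A1c 8.2 today, metformin increased to 1000 mg."

for note in (history_only, managed):
    print(identifies_diagnosis(note), shows_active_management(note))
# prints "True False" for the history-only note and "True True" for the managed one
```

Both notes count as correct identifications. Only the second contains anything an auditor could read as active management.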
The Metric That Actually Predicts Audit Outcomes
The metric that matters is defensibility rate: what percentage of the system’s recommended codes survive MEAT (monitoring, evaluation, assessment, treatment) validation scrutiny? This measures not just whether the AI found a diagnosis but whether the documentation supporting that diagnosis meets the standard auditors apply.
A system with 95% identification accuracy but a 70% defensibility rate produces a large volume of codes that look correct but can’t be defended. Roughly 30% of its recommendations will fail under audit scrutiny. At scale, that 30% represents thousands of codes that enter the plan’s submissions as liabilities rather than assets.
A system with 90% identification accuracy and a 92% defensibility rate produces fewer total recommendations but a higher percentage of defensible ones. The plan submits fewer codes, but more of them hold up. The net financial position (revenue retained after audits) favors the second system even though its “accuracy” headline number is lower.
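The comparison can be worked through with a few lines of Python. This is a sketch under stated assumptions: the 10,000-diagnosis population is hypothetical, and "identification accuracy" is treated as the share of true diagnoses that end up as recommended codes, which simplifies how vendors actually define the figure.

```python
def audit_outcome(true_diagnoses: int, identification_rate: float,
                  defensibility_rate: float) -> dict:
    """Estimate recommended, defensible, and indefensible code counts."""
    # Assumption: every identified diagnosis becomes one recommended code.
    recommended = round(true_diagnoses * identification_rate)
    defensible = round(recommended * defensibility_rate)
    return {
        "recommended": recommended,
        "defensible": defensible,
        "indefensible": recommended - defensible,
    }


# Hypothetical population of 10,000 true diagnoses across the reviewed charts.
system_a = audit_outcome(10_000, 0.95, 0.70)
system_b = audit_outcome(10_000, 0.90, 0.92)

print(system_a)  # {'recommended': 9500, 'defensible': 6650, 'indefensible': 2850}
print(system_b)  # {'recommended': 9000, 'defensible': 8280, 'indefensible': 720}
```

Under these assumptions the second system submits 500 fewer codes, yet 1,630 more of them hold up, and it carries roughly a quarter of the indefensible volume.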
Plans should ask vendors for defensibility rates, not just accuracy rates. If the vendor can’t produce this metric, the system doesn’t measure what matters. If the vendor reports accuracy without defensibility, the number describes the AI’s ability to read charts, not its ability to produce audit-ready output.
What Defensibility-Centered Metrics Look Like
A system designed for defensibility reports four metrics rather than one: identification accuracy (did the AI find the diagnosis in the note); MEAT completeness rate (of the diagnoses found, what percentage have all required MEAT elements present in the documentation); defensibility score distribution (across all recommended codes, the spread from strong to weak evidence); and false confidence rate (codes the AI recommended with high confidence that fail MEAT validation), which reveals where the system overestimates documentation quality.
The false confidence rate is particularly revealing. A system that frequently recommends codes with high confidence scores that subsequently fail validation has a calibration problem. It’s telling coders that documentation is strong when it isn’t, which accelerates the submission of indefensible codes.
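A minimal Python sketch of how these four metrics could be computed from a reviewed sample follows. Everything in it is an assumption made for illustration: the record layout, the rule that all four MEAT elements are required, the 0.9 high-confidence threshold, the strong/moderate/weak bands, and the sample codes themselves. A vendor's actual definitions may differ.

```python
from collections import Counter
from dataclasses import dataclass

# Assumptions: all four elements are required; 0.9 counts as "high confidence".
REQUIRED_MEAT = frozenset({"monitor", "evaluate", "assess", "treat"})
HIGH_CONFIDENCE = 0.9


@dataclass
class Recommendation:
    code: str
    diagnosis_in_note: bool   # reviewer confirmed the diagnosis is in the note
    meat_present: frozenset   # MEAT elements the reviewer found documented
    confidence: float         # confidence the system reported for this code

    @property
    def meat_complete(self) -> bool:
        return REQUIRED_MEAT <= self.meat_present


def evidence_band(rec: Recommendation) -> str:
    """Hypothetical banding: 4 elements strong, 2-3 moderate, 0-1 weak."""
    found = len(REQUIRED_MEAT & rec.meat_present)
    if found == len(REQUIRED_MEAT):
        return "strong"
    return "moderate" if found >= 2 else "weak"


def ratio(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else 0.0


def defensibility_report(recs: list) -> dict:
    identified = [r for r in recs if r.diagnosis_in_note]
    high_conf = [r for r in recs if r.confidence >= HIGH_CONFIDENCE]
    return {
        "identification_accuracy": ratio(len(identified), len(recs)),
        "meat_completeness_rate": ratio(
            sum(r.meat_complete for r in identified), len(identified)),
        "defensibility_distribution": dict(Counter(evidence_band(r) for r in recs)),
        # Share of high-confidence recommendations that fail MEAT validation.
        "false_confidence_rate": ratio(
            sum(not r.meat_complete for r in high_conf), len(high_conf)),
    }


# Illustrative sample, not real audit data.
sample = [
    Recommendation("E11.9", True, REQUIRED_MEAT, 0.95),
    Recommendation("N18.30", True, frozenset({"assess"}), 0.93),
    Recommendation("I50.9", True, frozenset({"monitor", "treat"}), 0.60),
    Recommendation("J44.9", False, frozenset(), 0.55),
]
report = defensibility_report(sample)
print(report)
```

In this sample, identification accuracy is 0.75 while the false confidence rate is 0.5: one of the two high-confidence codes lacks complete MEAT support, which is the calibration problem described above.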
Buying on the Right Number
Plans evaluating risk adjustment software should deprioritize headline accuracy rates and focus on defensibility metrics. Ask for MEAT completeness rates. Ask for false confidence rates. Ask the vendor to run their system against a sample of your charts and report not just how many codes it found but how many it would recommend for submission after full evidence validation. The gap between “found” and “defensible” is the gap that determines your audit exposure, and it’s the gap most accuracy metrics are designed to hide.


