We validated our dental AI with real dentists, in the clinic — and they signed off on 97.8%

Chapter of the Sprint DGX series. As Chapter 1 put it, a clinical model isn’t proven with benchmarks: it’s proven when a professional, with the drug’s package insert in front of her, says “I’ll put my name behind this.”

There’s a moment in any clinical AI project when the charts stop mattering. It isn’t when the model wins a public exam, or when it answers quickly: it’s when a dentist reads what it has produced, weighs it against her professional judgment, and decides whether she’d put her name behind it. That moment arrived in this sprint, and the answer was nearly unanimous: 97.8% acceptance. This chapter is about how you prepare a clinical corpus with the seriousness it deserves, and how you validate it with real professionals, in the clinic, while they see patients.

97.8% clinical acceptance in validation with a collaborating dentist: a dental AI that a professional signs off on.

The corpus that gives the product its substance

Before a model can reason well, someone has to prepare what it learns from, and prepare it well. Our product has accumulated years of real conversations between the team at a partner dental clinic and its patients: administrative messages, reminders, communications, front-desk notes. It’s a goldmine of how a clinic is actually run, and, like any ore, it has to be refined before you use it.

We ran a thorough audit of the full corpus, with criteria tailored to each of the product’s agents (reception, scheduling, billing, drug reference, and the rest). The result was better than we expected: more than half of the material has an identifiable clinical-care reasoning structure, anchored directly in nine of the system’s ten agents, with hundreds of cases carrying safety-critical clinical relevance (allergies, pregnancy, anticoagulants). Valuable, real material, generated in the day-to-day of a clinic that works.

Refine before training

A large corpus is not the same as a useful corpus. A good share of those messages are automated templates from the practice-management software (appointment confirmations, reminders) that repeat without variation. We consolidated them: we kept a single copy of each identical piece of content, normalized the text, and stripped out the fields that add nothing to learning. The corpus went from “too large to review carefully” to a manageable, high-quality set, where every row contributes something distinct.
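A minimal sketch of what that consolidation can look like, assuming each row is a dictionary with a `text` field. The field names (`text`, `channel`, `export_id`) and the `consolidate` helper are illustrative assumptions, not the production code.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Normalize Unicode forms and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def consolidate(rows, keep_fields=("text", "channel")):
    """Keep one copy of each distinct message and drop unneeded fields.

    Two rows count as identical when their normalized text matches,
    ignoring case. `keep_fields` is a hypothetical whitelist of the
    columns worth keeping for training.
    """
    seen = set()
    unique = []
    for row in rows:
        clean = normalize(row["text"])
        key = hashlib.sha256(clean.casefold().encode("utf-8")).hexdigest()
        if key in seen:
            continue  # an identical message was already kept
        seen.add(key)
        kept = {k: row[k] for k in keep_fields if k in row}
        kept["text"] = clean
        unique.append(kept)
    return unique


rows = [
    {"text": "Your appointment is  confirmed.", "channel": "sms", "export_id": 1},
    {"text": "your appointment is confirmed. ", "channel": "sms", "export_id": 2},
    {"text": "Patient reports penicillin allergy", "channel": "note", "export_id": 3},
]
result = consolidate(rows)
```

Hashing the normalized text keeps memory use flat on a large corpus, and the first occurrence of each message is the one that survives.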

It’s unglamorous work and enormously important. The quality of a clinical model is decided here, at the prep table, long before training.

Privacy by design, not as an add-on

Working with patient data forces you to do things right from minute one. We built a layered anonymization pipeline, with an audit at every step:

  • Removal of direct identifying fields (internal identifiers, phone, email).
  • Detection of ID documents, IBANs, and postal addresses via strict patterns.
  • Recognition of proper names, including their Spanish variants and diminutives.
  • Patterns specific to the practice-management software and to the clinic staff’s names, confirmed with the team.

Each sensitive item is replaced by a neutral placeholder, consistent within each record. The pipeline made more than 400,000 substitutions across the corpus, leaving the clinically relevant cases intact, because the goal is to protect people without losing the medical value of the information.
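To make "consistent within each record" concrete, here is a small sketch of pattern-based substitution. The regular expressions are deliberately simplified assumptions (Spanish-shaped IBAN, ID document, and phone numbers), and `anonymize_record` and `known_names` are hypothetical names; a real pipeline needs stricter, audited rules.

```python
import re

# Illustrative patterns only, applied in this order so that longer
# numeric identifiers are replaced before the phone pattern runs.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IBAN": re.compile(r"\bES\d{2}(?: ?\d{4}){5}\b"),
    "ID_DOC": re.compile(r"\b\d{8}[A-Z]\b"),
    "PHONE": re.compile(r"(?<!\d)(?:\+34 ?)?[6789]\d{2} ?\d{3} ?\d{3}(?!\d)"),
}


def anonymize_record(text, known_names=()):
    """Replace sensitive items with placeholders; return (text, count).

    The same value always maps to the same placeholder inside one
    record, so the conversation stays readable after anonymization.
    """
    placeholders = {}
    counts = {}
    substitutions = 0

    def make_repl(label):
        def repl(match):
            nonlocal substitutions
            # Ignore spacing and case when deciding whether two values match.
            key = (label, re.sub(r"\s+", "", match.group(0)).casefold())
            if key not in placeholders:
                counts[label] = counts.get(label, 0) + 1
                placeholders[key] = f"[{label}_{counts[label]}]"
            substitutions += 1
            return placeholders[key]
        return repl

    for label, pattern in PATTERNS.items():
        text = pattern.sub(make_repl(label), text)

    # Names come last, so a name inside an email address is already gone.
    if known_names:
        longest_first = sorted(known_names, key=len, reverse=True)
        names = re.compile(
            r"\b(?:" + "|".join(re.escape(n) for n in longest_first) + r")\b",
            re.IGNORECASE,
        )
        text = names.sub(make_repl("NAME"), text)

    return text, substitutions


sample = "Marta called from 612 345 678; Marta's email is marta@example.com, call 612345678."
anonymized, n = anonymize_record(sample, known_names=("Marta",))
```

Returning the substitution count alongside the text is one simple way to feed the per-step audit described above.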

We tuned it carefully, with several review passes and independent auditing, until the personal-data footprint was reduced to a minimal, verified fraction. That’s how we understand privacy in healthcare: not as a box you tick at the end, but as a property of the system from the design stage.

The acid test: a real professional

With the corpus cleaned and protected, we generated a set of clinical-reasoning cases and asked ourselves the only question that matters: would a dentist sign off on this?

To answer it, we worked with a collaborating dentist from the partner clinic. And here a very real challenge appeared: an active clinic doesn’t stop. There’s no time to onboard onto software tools or to review huge documents asynchronously between patients. So we designed a validation method built for her reality: chat blocks with a human intermediary. The dentist stays on the phone; we read her each block of cases, she responds with her clinical judgment by voice, and we record her verdicts in structured form. The method is efficient, respectful of her time, and focused on the only thing that adds value: her professional judgment.

That method, which ended up being the sprint’s standard for external validation, makes the clinic part of the team without asking it to become a user of technical tools.
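The record-keeping side of this method can stay very light. The sketch below shows one possible shape for it: cases split into short blocks to read aloud, and one structured entry per verdict. The block size, the `Review` fields, and the `REJECT` value are assumptions for illustration; only the two accept verdicts appear in our results.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"
    ACCEPT_WITH_NOTE = "accept_with_presentation_note"
    REJECT = "reject"  # hypothetical: included so the schema is complete


@dataclass
class Review:
    """One spoken verdict, written down by the intermediary."""
    case_id: str
    category: str
    verdict: Verdict
    note: str = ""


def blocks(case_ids, size):
    """Yield consecutive blocks of at most `size` cases."""
    for start in range(0, len(case_ids), size):
        yield case_ids[start:start + size]


case_ids = ["c1", "c2", "c3", "c4", "c5"]
session = list(blocks(case_ids, size=2))
first = Review(case_id="c1", category="drug reference", verdict=Verdict.ACCEPT)
```

Keeping the verdict as an enum rather than free text makes the later tally unambiguous, while the `note` field preserves the dentist's own wording.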

Iterate to “I’ll sign off on this”

Clinical validation isn’t a stamp you apply once: it’s a process. A first round already gave us a high acceptance rate, with specific notes from the dentist on how to present certain recommendations better and how to accommodate the clinic’s own criteria. We took every comment seriously, adjusted the model accordingly, and regenerated.

In the second round, on an equivalent sample with the same per-category proportions, the result was clear:

Verdict                            Count
Accept                             44
Accept with a presentation note    1
Acceptance                         97.8%
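For readers who want to check the arithmetic: the figure is consistent with counting the 44 plain accepts over the 45 reviewed cases. That reading is an assumption inferred from the table, and the dictionary keys below are illustrative labels.

```python
from collections import Counter

# Second-round verdict counts, copied from the table above.
verdicts = Counter({"accept": 44, "accept_with_presentation_note": 1})

total = sum(verdicts.values())
# Assumed reading: only unqualified accepts count toward the rate.
acceptance = verdicts["accept"] / total

print(f"{verdicts['accept']} of {total} cases: {acceptance:.1%}")
```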

And there was one detail that left us especially satisfied. One of the second-round cases demonstrated, on its own, that the model hadn’t simply learned to “sound more cautious”: it had genuinely internalized the clinical criterion we’d asked of it, to the point of applying it correctly in a new situation. The dentist herself flagged it as evidence that the improvement had truly taken hold. A model that appears to comply is not the same as one that reasons well: the difference shows when an expert eye looks at it.

What goes into training

We closed this phase with a deliberate decision: not to chase big numbers for their own sake. What feeds the product’s reasoning layer is the refined, anonymized corpus plus the clinical cases validated by a professional, with 97.8% acceptance and no clinical reservations. Quality signed off by someone who understands, not volume without judgment.

That, for us, is the line that separates a demo from a clinical product. Anyone can show a model that answers. Few can show a model whose answers a dentist reviews, checks against the drug’s official package insert, and signs off on. Building AI for healthcare is, above all, earning that yes, and you earn it with well-prepared data, privacy as standard, and the judgment of real professionals at the center of the process.

Next chapter: from the trained model to one that answers in milliseconds, and how the VLM is served so it can accompany the appointment in real time.

FAQ

How do you know the model is reliable and not just “sounds good”? Because a collaborating dentist validates it case by case, with her clinical judgment and against official sources like the drug’s package insert. Acceptance in the latest round was 97.8%, with no clinical reservations. A benchmark measures knowledge; professional validation measures trust.

What do you do with patient data? Privacy is a property of the system from the design stage. Before any training, the corpus goes through a layered anonymization pipeline that replaces personal information with neutral placeholders and is audited at every step, reducing the personal-data footprint to a minimal, verified fraction. And, once in production, the model is designed to run on-premise: the data never leaves the clinic.

Why not validate thousands of cases instead of a few dozen? Because what adds value is the depth of clinical judgment, not volume. A stratified sample, carefully reviewed by a professional, says more about the model’s real reliability than a large number without expert scrutiny. We prefer signed-off quality over quantity without judgment.
