enInteligencia ArtificialLarge Language ModelsNVIDIAOdontologíaStartups

Teaching dentistry to an AI model, one stage at a time

9 de junio de 20265 min read861 words

Chapter 4 of the Sprint DGX series (from Chapter 1). How dental knowledge gets injected into a general VLM in stages, and why we still don’t have “the model” when it’s done.

You don’t teach an AI model dentistry in one shot. Just as you don’t train a dentist by cramming an entire textbook into their head on day one, specializing a general vision model for the dental clinic is a staged process: first the vocabulary, then the eye, then the real tasks. In this chapter we open the hood on training. Four sequential stages on top of Gemma 4 31B-IT, run on a node of eight H100s in a little over half a day of compute, taking the model’s accuracy on the public MMOral-Bench exam from 43.58% to 54.18%: +10.6 points. And why, even so, what comes out of the third stage still isn’t “the sprint model” but the foundation on which everything else is designed.

The four stages of dental VLM training: DKI (vocabulary), DCA (concept-image), SFT (clinical tasks) and RLT (reinforcement), raising accuracy +10.6 points on MMOral-Bench.

Why four stages and not a single fine-tuning pass

Running direct SFT on a generalist VLM with a dental corpus works, but it leaves capability on the table. The staged architecture injects the domain progressively, each stage building on the previous one:

Stage	What it learns	Modality	Vision tower
DKI, knowledge injection	Clinical vocabulary, dental terminology, SNOMED/ICD concepts	Text only	Frozen
DCA, concept-image alignment	Associate concept and region (caries in this patch, restoration in that one)	Multimodal	Frozen (aligner active)
SFT, supervised fine-tuning	End-to-end clinical tasks (Q&A, diagnosis, plan)	Multimodal	Frozen
RLT, reinforcement (GRPO)	Rule-based reward against the knowledge graph	Multimodal	Frozen

The decision to never unfreeze the vision tower in any stage is deliberate: dental panoramics and periapicals aren’t distributionally distinct enough from natural images to justify retraining the encoder, and a poorly tuned encoder destroys more capability than it adds. Domain adaptation flows through the aligner and the language backbone, not through vision.

Stage 1: DKI

Goal: inject dental clinical vocabulary on top of Gemma 4 31B-IT. Text only; vision and the aligner are left untouched.

The corpus for this stage is a clean clinical text dataset, the same curated material that had already trained our previous production model. We deliberately discarded a substantially larger corpus that included raw clinical histories: it introduced variability that degraded the downstream SFT. More data isn’t better when noise dominates: the call on what goes into the corpus matters as much as the volume.

Training converges cleanly in under six hours; the resulting adapter is merged to full weights to serve as the base for the next stage.

Stage 2: DCA

Goal: align visual concepts (lesion, restoration, prosthesis, anatomy) with their referents in text. Multimodal, but vision stays frozen: what gets updated via LoRA is the aligner, the layer that projects visual features into the language space.

The data for this stage is around two thousand hand-verified image-caption pairs, filtered from a substantially larger initial set of automatically generated pairs over thousands of images from our own corpus. The filtering discarded captions with low correlation to the visual content: to align concept and image, the quality of the pair matters far more than the quantity.

It’s a deliberately short stage, a matter of minutes, given how small the data is. Its job isn’t to teach the model to talk about images, but to anchor Stage 1’s clinical vocabulary to visual representations, so the following SFT starts from a coherent multimodal base.

Stage 3: SFT

Goal: fine-tuning on end-to-end clinical tasks. Fully multimodal: the model receives an image + clinical prompt and returns an answer.

The configuration is structurally identical to the previous two, changing only the data: hundreds of thousands of examples mixing text and image, all clinical.

The result, after nearly eight hours of training:

MMOral-Bench (closed, n=491): 54.18%.
Delta over Gemma 4 base: +10.6 points (43.58% → 54.18%).

The +10.6 points are measured on the same 491 closed questions from the benchmark, with the same answer extractor. It’s a genuinely comparable delta, not a moved metric.

What’s still unresolved

Stage 3 produces a functional checkpoint that’s better than the base, but it isn’t “the sprint model”. Three things remain, which the following chapters address:

MMOral-Bench is dominated by panoramics. Periapicals and intraorals also show up in the clinic, in a different proportion. We need to evaluate with modality-specific probes (Chapter 5).
This stage’s corpus doesn’t include specialized external datasets (an external panoramic dataset and an external intraoral dataset, both non-commercially licensed). Bringing in external data isn’t automatic: it requires checking case by case that they add rather than subtract.
Open-ended answer evaluation hasn’t been run yet. Before claiming that +10.6 points on closed questions translates into real clinical improvement, you have to measure it.

Status at the close of this phase

Four-stage pipeline operational; the first three completed.
Total compute for the three stages: a little over half a day on 8×H100 SXM.
MMOral-Bench (closed): 54.18% (+10.6 points over the base).
Modality-specific probes and ablation, queued for the following days.

Next chapter: our own panoramic (50 questions) and intraoral (75 questions) probes, bilingual Spanish/English, to measure whether those +10.6 points hold up when you change the image type and the language of the question.

Tags:

#dental-brain#dgx-h100#fine-tuning#innovation-lab#nvidia#nvidia inception#on-premise#sprint-dgx

Back to Blog