Chapter 4 of the Sprint DGX series (from Chapter 1). How dental knowledge gets injected into a general VLM in stages, and why we still don’t have “the model” when it’s done.
You don’t teach an AI model dentistry in one shot. Just as you don’t train a dentist by cramming an entire textbook into their head on day one, specializing a general vision model for the dental clinic is a staged process: first the vocabulary, then the eye, then the real tasks. In this chapter we open the hood on training. Four sequential stages on top of Gemma 4 31B-IT, run on a node of eight H100s in a little over half a day of compute, taking the model’s accuracy on the public MMOral-Bench exam from 43.58% to 54.18%: +10.6 points. And why, even so, what comes out of the third stage still isn’t “the sprint model” but the foundation on which everything else is designed.

Why four stages and not a single fine-tuning pass
Running direct SFT on a generalist VLM with a dental corpus works, but it leaves capability on the table. The staged architecture injects the domain progressively, each stage building on the previous one:
| Stage | What it learns | Modality | Vision tower |
|---|---|---|---|
| DKI, knowledge injection | Clinical vocabulary, dental terminology, SNOMED/ICD concepts | Text only | Frozen |
| DCA, concept-image alignment | Associate concept and region (caries in this patch, restoration in that one) | Multimodal | Frozen (aligner active) |
| SFT, supervised fine-tuning | End-to-end clinical tasks (Q&A, diagnosis, plan) | Multimodal | Frozen |
| RLT, reinforcement (GRPO) | Rule-based reward against the knowledge graph | Multimodal | Frozen |
The decision to never unfreeze the vision tower in any stage is deliberate: dental panoramics and periapicals aren’t distributionally distinct enough from natural images to justify retraining the encoder, and a poorly tuned encoder destroys more capability than it adds. Domain adaptation flows through the aligner and the language backbone (via LoRA), not through vision.
The stack is shared across all three stages: ms-swift 4.1.0, transformers 5.5.3, peft 0.18.1, torch 2.11.0+cu130, on the nvcr.io/nvidia/pytorch:24.09-py3 container (bumped up from the initial 24.01 for H100 compatibility).
Stage 1: DKI
Goal: inject dental clinical vocabulary on top of Gemma 4 31B-IT. Text only; vision and the aligner are left untouched.
The hyperparameters that matter:
| Parameter | Value |
|---|---|
| Precision | bfloat16, SDPA attention |
| Adaptation | LoRA rank 8, alpha 16, dropout 0.05, over all linear layers |
| Frozen | vision tower + aligner |
| Sequence length | 4,096 tokens |
| Epochs | 1 |
| Batch | 1 per GPU × 4 gradient accumulation → effective 32 (8 GPUs) |
| Learning rate | 1·10⁻⁵, cosine scheduler, warmup 0.1 |
| Parallelism | DeepSpeed ZeRO-2 + gradient checkpointing |
We don’t use packing because it requires flash_attn, which wasn’t available in this container; the padding overhead is acceptable at this batch size.
The corpus for this stage is ~239,000 clean clinical text records, the same curated dataset that had already trained our previous production model (a LLaMA 3.1 8B with a dental LoRA). We deliberately discarded a larger corpus (~588,000 records) that included clinical histories: it introduced variability that degraded the downstream SFT. More data isn’t better when noise dominates: the call on what goes into the corpus matters as much as the volume.
Training converges cleanly in under six hours; the resulting adapter is merged to full weights to serve as the base for the next stage.
Stage 2: DCA
Goal: align visual concepts (lesion, restoration, prosthesis, anatomy) with their referents in text. Multimodal, but vision stays frozen: what gets updated via LoRA is the aligner, the layer that projects visual features into the language space.
The data for this stage is 1,908 hand-verified image-caption pairs, filtered from an initial set of ~11,000 pairs generated with Claude Sonnet over ~5,700 images from our own corpus. The filtering discarded captions with low correlation to the visual content: to align concept and image, the quality of the pair matters far more than the quantity.
It’s a deliberately short stage, a matter of minutes, given how small the data is. Its job isn’t to teach the model to talk about images, but to anchor Stage 1’s clinical vocabulary to visual representations, so the following SFT starts from a coherent multimodal base.
Stage 3: SFT
Goal: fine-tuning on end-to-end clinical tasks. Fully multimodal: the model receives an image + clinical prompt and returns an answer.
The configuration is structurally identical to the previous two (same LoRA rank 8 / alpha 16, same ZeRO-2, same cosine learning rate, same 4,096 length), changing only the data: ~265,000 examples mixing text and image, all clinical.
The result, after nearly eight hours of training:
- MMOral-Bench (closed, n=491): 54.18%.
- Delta over Gemma 4 base: +10.6 points (43.58% → 54.18%).
The +10.6 points are measured on the same 491 closed questions from the benchmark, with the same answer extractor. It’s a genuinely comparable delta, not a moved metric.
What’s still unresolved
Stage 3 produces a functional checkpoint that’s better than the base, but it isn’t “the sprint model”. Three things remain, which the following chapters address:
- MMOral-Bench is dominated by panoramics. Periapicals and intraorals also show up in the clinic, in a different proportion. We need to evaluate with modality-specific probes (Chapter 5).
- This stage’s corpus doesn’t include specialized external datasets (DENTEX for panoramics, Vident-real for intraorals). Bringing in external data isn’t automatic: it requires checking case by case that they add rather than subtract.
- Open-ended answer evaluation hasn’t been run yet. Before claiming that +10.6 points on closed questions translates into real clinical improvement, you have to measure it.
Status at the close of this phase
- Four-stage pipeline operational; the first three completed.
- Total compute for the three stages: a little over half a day on 8×H100 SXM.
- MMOral-Bench (closed): 54.18% (+10.6 points over the base).
- Modality-specific probes and ablation, queued for the following days.
Next chapter: our own panoramic (DENTEX, 50 questions) and intraoral (Vident-real, 75 questions) probes, bilingual Spanish/English, to measure whether those +10.6 points hold up when you change the image type and the language of the question.





