Chapter 12 of the Sprint DGX series (from Chapter 1). The bench that validates the trained VLM as a service that’s actually deployable.
Training a model well and being able to serve it in the clinic are two different problems. A checkpoint that learns is just that: weights on disk. For a dentist to use it during a visit, it has to respond instantly, keep latency stable under load, and serve several clinics at once without stumbling. This chapter is about the leap from the lab to the service: a sweep of fifteen vLLM 0.21.0 configurations that brings time-to-first-token from the initial ~140 ms down to 24.72 ms, between 5.5 and 6.2 times faster, with a latency tail so narrow (under 1 ms between the median and the 99th percentile) that the wait is practically identical on every call. And, as an acid test, the model’s first look at a real radiograph already served in production.

The gap between training and serving
The inference utility that ships with the training framework is optimized for evaluating checkpoints during training. It is not a service. It has three operational limitations for production:
- No KV cache shared across requests, every call starts from scratch.
- No dynamic batching, concurrent requests are serialized.
- No OpenAI-compatible API, it requires a custom Python client and doesn’t integrate with the SDK the rest of the product uses.
The initial numbers, measured with that utility, are honest but not useful for deploy:
| Baseline metric (training utility, H100 TP=2 BF16) | Value |
|---|---|
| TTFT (time-to-first-token) | 136-153 ms |
| Throughput batch 1 | 12-14 tok/s |
| Throughput batch 4 | 38-40 tok/s |
| VRAM | 29 GiB/GPU TP=2 (58 GiB total) |
To serve the VLM to a clinic you need a real serving runtime: dynamic batching, a reusable KV cache, a concurrent-request scheduler, and a stable API. We evaluated several options, and the one that fit Gemma 4 best, without sacrificing accuracy or adding complexity, was vLLM, which had just shipped support for the model.
vLLM 0.21.0 with Day-0 support for Gemma 4
vLLM released version 0.21.0 with official Day-0 support for the gemma4 model_type (PR #38826). That gave us exactly what we needed: a serving runtime that understood Gemma 4 31B from day one, without waiting for the rest of the ecosystem to catch up.
Installing vLLM 0.21.0 in the sprint environment was clean: out-of-the-box, no build-from-source or downgrades, on the same torch and transformers we already had. Verified with a dry-run before running anything.
The serving architecture is a one-liner:
vllm serve --model <model> --dtype bfloat16 --tensor-parallel-size 2 --port 8000
OpenAI-compatible endpoint at POST /v1/chat/completions. Any client that speaks the OpenAI API connects directly. The QuantumHowl product’s client already speaks it by default, zero adaptation.
The bench sweep
The sweep covers four dimensions of the configuration space for production-representative load testing:
| Config type | Variants | n |
|---|---|---|
| Text-only | 3 prompt lengths {128, 512, 2048} × 4 concurrencies {1, 4, 16, 64} | 12 |
| Real conversation (ShareGPT) | batch 1 (100 conversations) + batch 4 (100 conversations) | 2 |
| Multimodal | vision input + text generation | 1 |
The full sweep ran in just over half an hour, with 14 of 15 configurations succeeding. The only one that failed was the multimodal one, due to an engine initialization conflict when running in parallel with the vllm serve already up, not a blocker: the multimodal smoke test was validated separately (next section).
The headline numbers
| Metric | vLLM 0.21.0 BF16 TP=2 H100 | Baseline (training utility) | Improvement |
|---|---|---|---|
| TTFT p50 | 24.72 ms | 136-153 ms | 5.5-6.2× |
| TTFT p99 | 25.4 ms | ||
| TTFT variance p50→p99 | < 1 ms | narrow tail | |
| Throughput batch 1, prompt 128 | 71.12 tok/s | 12-14 tok/s | 5.1-5.9× |
| Throughput batch 4, prompt 128 | 219 tok/s | 38-40 tok/s | 5.5-5.8× |
| Aggregate throughput batch 16, prompt 2048 | 549 tok/s | ||
| ShareGPT batch 1 (real conversation) | 100/100 success | ||
| ShareGPT batch 4 (real conversation) | 100/100 success | ||
| VRAM | 29 GiB/GPU TP=2 | 29 GiB/GPU TP=2 |
The TTFT variance between p50 and p99 under 1 ms is the operational signature that really matters. A 5.5× improvement in median TTFT is lost if the latency tail is high, a clinical user who sometimes waits 25 ms and other times 250 ms has an unstable experience. A narrow tail indicates that vLLM’s scheduler keeps latency consistent under load, not just in the average case.
The 549 aggregate tok/s in batch 16 with long prompts scale throughput about 7.7× relative to batch 1 when dynamic batching is well tuned, and that’s what makes it viable to serve several clinics concurrently without needing a dedicated GPU per customer.
Multimodal smoke test: the model’s first look at a real radiograph
Qualitative validation on a real panoramic with annotated restorations:
| Metric | Value |
|---|---|
| Throughput | 42.4 tok/s batch 1 |
| Prompt tokens | 301 (256 vision + 45 text) |
| Generated tokens | 256 |
| End-to-end latency | 6.04 s |
| Response language | Spanish |
| Clinical coherence | Verified by hand |
The response correctly describes the restorations visible in the panoramic, identifies the relevant FDI regions, and maintains an appropriate clinical tone in Spanish. It’s not a deploy-grade evaluation (n=1, qualitative), but it confirms that the end-to-end pipeline, training (Chapter 4) → serving → multimodal inference, works on real data, not just on the benchmark.
The production path is chosen
vLLM 0.21.0 with its OpenAI-compatible API is the deliverable’s serving runtime: fast, stable under load, and validated end-to-end on the trained model. The key piece is that the QuantumHowl product’s client already speaks that API natively, so serving the new VLM requires no architectural change, just pointing at a new endpoint.
Status at the close of this phase
- vLLM 0.21.0 BF16 TP=2 validated as primary serving.
- Bench sweep documented: 14 of 15 configurations succeeding.
- Headline numbers locked: TTFT p50 24.72 ms (5.5-6.2× better), 71.12 tok/s at batch 1, 549 tok/s aggregate, p50-p99 variance < 1 ms.
- Multimodal smoke test validated on a real panoramic.
- Production path defined: OpenAI-compatible REST, no architectural changes on the client.
Next chapter: why the DGX Spark matters as an on-prem, per-clinic deploy target, and how the trained VLM fits in its 128 GB of unified memory to run at full precision, inside the building, without any patient image leaving for the cloud.





