From the trained model to one that answers in milliseconds

From the trained model to one that answers in milliseconds

Chapter 12 of the Sprint DGX series (from Chapter 1). The bench that validates the trained VLM as a service that’s actually deployable.

Training a model well and being able to serve it in the clinic are two different problems. A checkpoint that learns is just that: weights on disk. For a dentist to use it during a visit, it has to respond instantly, keep latency stable under load, and serve several clinics at once without stumbling. This chapter is about the leap from the lab to the service: a sweep of fifteen vLLM 0.21.0 configurations that brings time-to-first-token from the initial ~140 ms down to 24.72 ms, between 5.5 and 6.2 times faster, with a latency tail so narrow (under 1 ms between the median and the 99th percentile) that the wait is practically identical on every call. And, as an acid test, the model’s first look at a real radiograph already served in production.

Time to first word: from about 145 ms to 24.72 ms with the production serving engine, between 5.5 and 6.2 times faster.

The gap between training and serving

The inference utility that ships with the training framework is optimized for evaluating checkpoints during training. It is not a service. It has three operational limitations for production:

  1. No KV cache shared across requests, every call starts from scratch.
  2. No dynamic batching, concurrent requests are serialized.
  3. No OpenAI-compatible API, it requires a custom Python client and doesn’t integrate with the SDK the rest of the product uses.

The initial numbers, measured with that utility, are honest but not useful for deploy:

Baseline metric (training utility, H100 TP=2 BF16) Value
TTFT (time-to-first-token) 136-153 ms
Throughput batch 1 12-14 tok/s
Throughput batch 4 38-40 tok/s
VRAM 29 GiB/GPU TP=2 (58 GiB total)

To serve the VLM to a clinic you need a real serving runtime: dynamic batching, a reusable KV cache, a concurrent-request scheduler, and a stable API. We evaluated several options, and the one that fit Gemma 4 best, without sacrificing accuracy or adding complexity, was vLLM, which had just shipped support for the model.

vLLM 0.21.0 with Day-0 support for Gemma 4

vLLM released version 0.21.0 with official Day-0 support for the gemma4 model_type (PR #38826). That gave us exactly what we needed: a serving runtime that understood Gemma 4 31B from day one, without waiting for the rest of the ecosystem to catch up.

Installing vLLM 0.21.0 in the sprint environment was clean: out-of-the-box, no build-from-source or downgrades, on the same torch and transformers we already had. Verified with a dry-run before running anything.

The serving architecture is a one-liner:

vllm serve --model <model> --dtype bfloat16 --tensor-parallel-size 2 --port 8000

OpenAI-compatible endpoint at POST /v1/chat/completions. Any client that speaks the OpenAI API connects directly. The QuantumHowl product’s client already speaks it by default, zero adaptation.

The bench sweep

The sweep covers four dimensions of the configuration space for production-representative load testing:

Config type Variants n
Text-only 3 prompt lengths {128, 512, 2048} × 4 concurrencies {1, 4, 16, 64} 12
Real conversation (ShareGPT) batch 1 (100 conversations) + batch 4 (100 conversations) 2
Multimodal vision input + text generation 1

The full sweep ran in just over half an hour, with 14 of 15 configurations succeeding. The only one that failed was the multimodal one, due to an engine initialization conflict when running in parallel with the vllm serve already up, not a blocker: the multimodal smoke test was validated separately (next section).

The headline numbers

Metric vLLM 0.21.0 BF16 TP=2 H100 Baseline (training utility) Improvement
TTFT p50 24.72 ms 136-153 ms 5.5-6.2×
TTFT p99 25.4 ms
TTFT variance p50→p99 < 1 ms narrow tail
Throughput batch 1, prompt 128 71.12 tok/s 12-14 tok/s 5.1-5.9×
Throughput batch 4, prompt 128 219 tok/s 38-40 tok/s 5.5-5.8×
Aggregate throughput batch 16, prompt 2048 549 tok/s
ShareGPT batch 1 (real conversation) 100/100 success
ShareGPT batch 4 (real conversation) 100/100 success
VRAM 29 GiB/GPU TP=2 29 GiB/GPU TP=2

The TTFT variance between p50 and p99 under 1 ms is the operational signature that really matters. A 5.5× improvement in median TTFT is lost if the latency tail is high, a clinical user who sometimes waits 25 ms and other times 250 ms has an unstable experience. A narrow tail indicates that vLLM’s scheduler keeps latency consistent under load, not just in the average case.

The 549 aggregate tok/s in batch 16 with long prompts scale throughput about 7.7× relative to batch 1 when dynamic batching is well tuned, and that’s what makes it viable to serve several clinics concurrently without needing a dedicated GPU per customer.

Multimodal smoke test: the model’s first look at a real radiograph

Qualitative validation on a real panoramic with annotated restorations:

Metric Value
Throughput 42.4 tok/s batch 1
Prompt tokens 301 (256 vision + 45 text)
Generated tokens 256
End-to-end latency 6.04 s
Response language Spanish
Clinical coherence Verified by hand

The response correctly describes the restorations visible in the panoramic, identifies the relevant FDI regions, and maintains an appropriate clinical tone in Spanish. It’s not a deploy-grade evaluation (n=1, qualitative), but it confirms that the end-to-end pipeline, training (Chapter 4) → serving → multimodal inference, works on real data, not just on the benchmark.

The production path is chosen

vLLM 0.21.0 with its OpenAI-compatible API is the deliverable’s serving runtime: fast, stable under load, and validated end-to-end on the trained model. The key piece is that the QuantumHowl product’s client already speaks that API natively, so serving the new VLM requires no architectural change, just pointing at a new endpoint.

Status at the close of this phase

  • vLLM 0.21.0 BF16 TP=2 validated as primary serving.
  • Bench sweep documented: 14 of 15 configurations succeeding.
  • Headline numbers locked: TTFT p50 24.72 ms (5.5-6.2× better), 71.12 tok/s at batch 1, 549 tok/s aggregate, p50-p99 variance < 1 ms.
  • Multimodal smoke test validated on a real panoramic.
  • Production path defined: OpenAI-compatible REST, no architectural changes on the client.

Next chapter: why the DGX Spark matters as an on-prem, per-clinic deploy target, and how the trained VLM fits in its 128 GB of unified memory to run at full precision, inside the building, without any patient image leaving for the cloud.

Share:
AI applied to real problemsExplore our solutions
From the trained model to one that answers in milliseconds | Blog | Quantum Howl