
GDPR Art.9 and 142 GB of clinical images on cloud GPU: how we moved the DICOMs without breaking the law

June 2, 2026 · 12 min read · 2,255 words

Chapter 3 of the Sprint DGX series (start from Chapter 1). The 26-day arc in which legal friction proved harder than the machine learning.

The conversation the technical pitch never includes

When you pitch a clinical AI project to a sponsor the size of NVIDIA, one part never makes it into the application email: how your health data physically travels from your local disk to a multi-tenant server that you will hand back in 60 days.

GDPR has a great deal to say about that. And it isn’t making suggestions.

This chapter tells how we handled it over 26 days of the sprint, what we ran into along the way, and what the most uncomfortable moment of the whole project was. Spoiler: it wasn’t a vendor reply that came late, nor a wipe that didn’t happen. It was something we discovered ourselves, with no one forcing us to go looking for it.

Anonymizing a DICOM is not the same as “now it’s public”

The most common misconception in clinical AI projects is to assume that emptying the patient fields in a DICOM image turns it into unrestricted data. It does not.

Under Article 9 of GDPR (special categories of personal data, which include health data), what many people call “anonymization” is actually pseudonymization (Art.4(5)). The legal difference is enormous: pseudonymization is reversible by anyone who gains access to the mapping table that links an opaque identifier back to the real patient. And as long as the data can be re-identified, it remains personal data, and therefore health data under Art.9.
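To make the reversibility concrete, here is a minimal sketch, not the sprint's actual pipeline. It assumes opaque identifiers are derived with a keyed hash (HMAC-SHA256); the function name `pseudonymize`, the sample patient IDs, and the key are all hypothetical. The point is the last line: whoever holds the mapping table can walk back from the opaque identifier to the patient.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Return a deterministic opaque identifier for a patient ID (keyed HMAC-SHA256)."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; an illustrative choice


# Hypothetical values, for illustration only.
key = b"kept-off-the-cloud-instance"
mapping_table = {}  # opaque ID -> real patient ID; this table is what makes it reversible

for real_id in ["PAT-0001", "PAT-0002"]:
    mapping_table[pseudonymize(real_id, key)] = real_id

opaque = pseudonymize("PAT-0001", key)

# Anyone holding the mapping table (or the key plus a list of candidate IDs)
# can re-identify the patient, so the data is pseudonymized, not anonymized.
assert mapping_table[opaque] == "PAT-0001"
```

If the table and the key were destroyed and no other route to re-identification remained, the legal analysis could change; the code alone does not settle that question.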

That means that even if you have emptied every PII field in the DICOM headers and replaced the identifiers with hashes, the 142 GB you want to upload to the cloud still fall under the reinforced regime of Art.9. For a cloud upload, that translates into documenting four specific things:

Who is who: controller, processor, sub-processor. The roles have to be clear.
What happens to the physical disk blocks once processing ends.
Whether there is a Data Processing Agreement (DPA) that explicitly covers Art.9.
Whether you can obtain a Certificate of Destruction documenting the deletion method and the date, ideally under a recognized standard such as NIST SP 800-88 Rev.1.

Without those four answers in writing, uploading the data to the instance means taking on undocumented legal risk. And the sprint could not start the upload without this.

Day 10: two emails at the same time, not one

We sent two simultaneous emails. Not one. The difference matters.

One went to the grant orchestrator, which manages provisioning and teardown of the node. The other went to the underlying physical-infrastructure provider, which has actual custody of the disks where the images would land.

The reason for sending both in parallel is operational. If you only write to the orchestrator, the most likely outcome is that after a few days they reply “that’s handled by the hardware provider, contact them”. Two emails in parallel cover the whole chain from the very first moment, and each party answers for what it actually manages.

The four questions went roughly like this:

When an instance is deleted, what happens to the physical blocks of the underlying disk? Are they overwritten, zero-filled, crypto-erased before being returned to the pool, or is the deletion only logical?
Is there any documented mechanism to trigger a secure wipe of the working volume from inside the instance, before deleting it?
Which entity is the formal contact for obtaining a Certificate of Destruction once the deletion is complete?
Is there a Data Processing Agreement that explicitly covers the processing of health data under Art.9 of GDPR?

The closing of both emails was almost more important than the questions: “This does not block provisioning of the instance. We are going to start with benchmarks on public datasets. The upload of clinical data is conditional on receiving these answers in writing.”

That last sentence is what keeps the sprint moving while the legal conversation advances. We did not halt everything: we halted only the upload of our own data. The node gets provisioned, the benchmarks on public datasets start, the first methodological phases run.

The plan B we documented the same day

Sending the emails without a plan B would mean depending blindly on the vendor. If the answers are slow or come back negative, what do we do?

Three options, filed the same day:

AES encryption at rest of the clinical-data directory inside the instance, with a key that never touches the server’s own persistent disk. The compute cost is trivial on a node with 8 H100 GPUs.
Manual secure wipe before delete: run a secure multi-pass deletion over the data path before deleting the instance. It isn’t perfect (block-level erasure on the host is not guaranteed this way), but it is real defense in depth.
Upload only already-encrypted blobs: the key lives on a local machine, it is decrypted in memory when loading into the model, and the cleartext data never touches the instance’s persistent disk.
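The second option, the manual wipe, can be sketched with nothing but the standard library. This is an illustration under assumptions, not the procedure the sprint used: the function names and the default of three passes are hypothetical. As noted in the list, overwriting at file level does not guarantee that the underlying physical blocks are erased, especially on SSD or NVMe storage behind a virtualization layer.

```python
import os
from pathlib import Path


def overwrite_and_delete(path: Path, passes: int = 3, chunk_size: int = 1024 * 1024) -> None:
    """Overwrite a regular file in place with random bytes, then unlink it (best effort)."""
    size = path.stat().st_size
    with open(path, "r+b") as fh:
        for _ in range(passes):
            fh.seek(0)
            remaining = size
            while remaining > 0:
                n = min(chunk_size, remaining)
                fh.write(os.urandom(n))
                remaining -= n
            fh.flush()
            os.fsync(fh.fileno())  # ask the OS to push this pass to the device
    path.unlink()


def wipe_tree(root: Path, passes: int = 3) -> int:
    """Overwrite and delete every file under root, then remove the directories.

    Returns the number of regular files that were overwritten.
    """
    wiped = 0
    # Bottom-up walk, so child directories are already empty when we remove them.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            p = Path(dirpath) / name
            if p.is_symlink():
                p.unlink()  # remove the link itself, never follow it
            else:
                overwrite_and_delete(p, passes)
                wiped += 1
        for name in dirnames:
            d = Path(dirpath) / name
            if d.is_symlink():
                d.unlink()
            else:
                os.rmdir(d)
    os.rmdir(root)
    return wiped
```

A call such as `wipe_tree(Path("/data/clinical"))` (a hypothetical path) would run just before the instance is deleted. It complements, and does not replace, the provider's own documented sanitization.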

With plan B on the table we had room to maneuver. If everything went wrong, the upload would be done encrypted and the sprint would go on. The legal conversation could take as long as it needed.

Plot twist number one: the provider chain wasn’t exactly what we assumed

When the instance was provisioned we discovered that the real backend architecture had one more layer than we had assumed when drafting the emails. We won’t name commercial parties, because the exact sub-processor chain is the partner’s contractual information, not ours to make public. Here is what matters:

The applicable GDPR regime turned out to be more favorable than expected, not less. The actual sub-processor publishes a standard DPA, holds recognized international certifications (including HIPAA equivalences in specific regions), and the terms are auditable.
The “technically correct” recipient of the second email was slightly different from the one we had written to. Zero harm: the questions that mattered had also gone to the orchestrator, which is indeed the project’s correct counterpart.

There was also an operational detail that simplified the chain of custody more than we expected: the instance type was non-stoppable by design. It cannot be paused. The node starts on Day 1 and only shuts down at the end of the sprint, with a single deletion event. That sounds like a limitation, but from a data-protection standpoint it is an advantage: there is a single point of deletion, instead of multiple stop/start cycles with persistence between sessions.

Plot twist number two: the serious problem was something else

Here comes the uncomfortable moment.

Well into the sprint, we ran an internal forensic audit over the instance’s filesystem. We didn’t run it because something felt off: we ran it because our internal workflows include a periodic PII review of the file tree as a preventive practice. No one forced us to do it. Had it not been on the calendar, we would not have found what we found.

Inside the DICOM export directory there was a subdirectory left over from the first pass of the pipeline, before the header strip. A sizeable volume of intermediate files that should never have persisted.

The final DICOM, the one in production during the sprint, was correctly anonymized: empty headers, opaque identifiers. But that intermediate directory, forgotten from the first pass, had patient names in the filesystem paths themselves. Pseudonymization broken at the file-tree level.

The operational lesson is brutally simple: the DICOM header strip covers the content of the file, not the path where the file lives. If your pipeline generates folder structures with real identifiers before a later rename, those paths are PII at filesystem scope, exactly as regulated as PII in headers.
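A path-level check of this kind can be sketched as follows. It is a hypothetical example, not the audit tooling used in the sprint. It assumes two things that the article does not specify: that you can supply a list of known identifiers (surnames, record numbers) to search for, and that legitimate names in the export tree are lowercase hex strings of at least 16 characters, optionally ending in `.dcm`.

```python
import os
import re
from pathlib import Path

# Assumption: opaque names are 16+ lowercase hex characters, optionally with a .dcm suffix.
OPAQUE_NAME = re.compile(r"[0-9a-f]{16,}(\.dcm)?")


def find_suspicious_paths(root, known_identifiers, allowed_names=frozenset()):
    """Return (relative_path, reason) pairs for directory and file names that may carry PII."""
    root = Path(root)
    needles = [s.lower() for s in known_identifiers if s]
    findings = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            rel = (Path(dirpath) / name).relative_to(root).as_posix()
            lowered = name.lower()
            if any(needle in lowered for needle in needles):
                findings.append((rel, "contains a known identifier"))
            elif name not in allowed_names and not OPAQUE_NAME.fullmatch(name):
                findings.append((rel, "name is not an opaque identifier"))
    return findings
```

The second rule is deliberately strict: anything that does not look opaque gets flagged for a human to review, which is how a forgotten intermediate directory would surface even if its names were not on the identifier list. The report itself contains paths, so it should be handled as personal data too.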

The underlying fix was made in a new version of the anonymization pipeline, which generates opaque identifiers from the first pass and does no renames afterward. But before deleting anything, it was time to build a chain of custody.

Chain of custody before haste

The temptation to immediately delete what shouldn’t be there is enormous. It is precisely the temptation you have to resist.

For a week we built the complete documentary trail: SHA-256 hashes over the entire content, a snapshot of the filesystem state, a timestamped record of each step, and a verified full copy of the already-anonymized DICOMs on our local NAS.
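The hashing part of such a trail can be sketched like this. It is a minimal illustration: the manifest layout, the function names, and the JSON output are assumptions, not the format the sprint used.

```python
import hashlib
import json
import os
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 so large DICOMs never have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Hash every file under root and return a timestamped manifest dictionary."""
    root = Path(root)
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            p = Path(dirpath) / name
            rel = p.relative_to(root).as_posix()
            files[rel] = {"sha256": sha256_file(p), "size": p.stat().st_size}
    return {
        "root": str(root),
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "files": files,
    }


def write_manifest(manifest, out_path):
    """Persist the manifest as JSON with sorted keys, so two runs are easy to diff."""
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8")
```

One caveat: the manifest keys are relative paths. If the tree being documented has identifiers in its paths, the manifest contains them as well and needs the same protection as the data it describes.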

The chain of custody is what makes it possible that, if someone asks three years from now what happened to that data, there is a documented answer with timestamps and hashes that add up. It isn’t ceremony: it’s what lets you answer “yes, we handled it, here’s the proof” instead of “we think we did…”.

The day of the deletion: seven seconds

Once the chain of custody was complete and verified, we executed the deletion. On NVMe, the operating system took a little over seven seconds to remove the entire directory.

Subsequent check:

The correctly anonymized DICOM was not affected.
Random sample of files from the local NAS: all hashes verified against the pre-computed ones.
A full final verification of the canonical copy.
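The random-sample check in that list could look roughly like the sketch below. It assumes the pre-computed hashes are available as a dictionary that maps relative paths to SHA-256 hex digests; the name `verify_sample`, the sample size, and the seed are all hypothetical.

```python
import hashlib
import random
from pathlib import Path


def sha256_file(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_sample(root, expected, sample_size, seed=None):
    """Re-hash a random sample of files and return the relative paths that fail.

    expected maps relative path -> SHA-256 hex digest computed before the deletion.
    A file fails if it is missing or if its current hash differs from the expected one.
    """
    rng = random.Random(seed)  # a recorded seed makes the sample reproducible for an auditor
    names = sorted(expected)
    chosen = rng.sample(names, min(sample_size, len(names)))
    failures = []
    for rel in chosen:
        p = Path(root) / rel
        if not p.is_file() or sha256_file(p) != expected[rel]:
            failures.append(rel)
    return failures
```

An empty list means every sampled file matched. Recording the seed and the sample size alongside the result lets someone else reproduce exactly the same check later.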

Seven seconds to delete the problem. And several days of prior work to make sure those seven seconds could be defended before any auditor who came knocking.

The contractual data-destruction event —the deletion of the full instance at the end of the sprint— is still scheduled for its day. But the specific risk from the forensic audit was already closed.

State at the close of the GDPR arc

DICOM verified clean: zero residual PII, neither in content nor in paths.
Full documentary chain of custody: initial emails, forensic audit, destructive cleanup, and a full backup on the local NAS.
Legal regime covered: the sub-processor’s DPA, the orchestrator as formal processor, the grant sponsor with no involvement in the data path.
Plan B documented and available for future scenarios, even though it didn’t need to be executed.

The lesson, no varnish

There are two ways to handle Art.9 compliance in a clinical AI project on cloud GPU. The first is to do the bare minimum, hope no one asks, and pray that no patient decides to exercise their rights. The second is to assume from Day 1 that legal friction is part of the project, plan it in parallel with the ML, and give it the time it demands.

We go with the second, and we recommend it. Not because the first one doesn’t work much of the time (it does), but because when it stops working, it’s already too late to improvise.

And about the phantom directory: we found it ourselves, with no one watching us. That’s the difference between complying and doing things right.

Next chapter: with the data clean and the legal side covered, we began to train for real. Three consecutive phases over the base model, a first significant jump on the dental benchmark, and the feeling, for the first time in the sprint, that the model was starting to “speak dental”. The day it stopped being a generic VLM.

Frequently asked questions

Does anonymizing a DICOM take it out of the GDPR Art.9 regime?

No, unless the anonymization is irreversible in the strict sense. What clinical practice calls “anonymization” is usually pseudonymization: a mapping table exists that allows the patient to be re-identified. As long as that reversibility is possible, the data remains health data and the reinforced Art.9 regime applies.

What is a Certificate of Destruction and why does it matter?

It is a formal document attesting that a dataset has been destroyed following a recognized standard (typically NIST SP 800-88 Rev.1 for media sanitization), with method, date, and responsible party. It matters because it lets you answer with evidence before an audit, a patient rights request, or a regulatory inspection. Without a certificate, your word is the only proof.

Why send two simultaneous emails instead of one and wait?

Because when a question crosses two vendors, the risk isn’t that they tell you no: it’s that each one tells you “the other one handles that” and you lose weeks. Sending to both at once forces each one to answer for what it does manage, and makes it clear from the first email that there is no ambiguity about who the right contact is for each part of the chain.

Why was the “phantom directory” a problem if the content was already anonymized?

Because the DICOM header strip removes PII from the content of the file, but does not touch the path where the file is stored. If your pipeline generates folders named with real surnames or identifiers and then cleans only the file content, the folder names are still personal data. At filesystem scope, paths are regulated PII exactly like the headers.

Why wasn’t the directory deleted as soon as it was discovered?

Because deleting it without first building the chain of custody is equivalent to destroying evidence. If someone asks two years from now what exactly was there, what steps were taken and when, the only defensible answer is one documented with hashes and timestamps. The week of work before the deletion is what separates “we handled it” from “we think we handled it”.

Is it viable to process clinical data under GDPR Art.9 on cloud GPU?

Yes, as long as the provider publishes a DPA with standard contractual clauses for health data, operates in European regions, and a formal documented deletion mechanism exists at the end of processing. The choice of sub-processor matters: there are clouds where this conversation is trivial and others where it’s impossible. And for continuous use, not a one-off sprint, our opinion is clear: on-premise remains the cleanest path.

Tags:

#dental-brain, #dgx-h100, #fine-tuning, #innovation-lab, #nvidia, #nvidia inception, #on-premise, #sprint-dgx
