Format and compliance

Deletion vectors and GDPR: Iceberg's right-to-erasure story, honestly.

Until Iceberg V3, "Iceberg has a right-to-erasure problem" was a defensible architectural objection from compliance teams. V3 introduces deletion vectors stored in Puffin files (the same basic mechanism Delta Lake has used for years), which materially changes the technical conversation. It does not on its own satisfy GDPR Article 17, because the technical mechanism is necessary without being sufficient.

Reading time: about 16 minutes. Evidence tier: B for the V3 deletion-vector spec mechanics (Apache Iceberg V3 specification, Puffin file format documentation). The regulatory claims below are my reading of GDPR, PCI-DSS, and HIPAA text and publicly available regulator guidance, and they are not legal advice. For a binding interpretation in your jurisdiction, consult named counsel and your data protection officer.

The honest framing

"Supports deletion vectors" is not "satisfies Article 17."

I want to set the frame for this post before walking through the mechanics, because the marketing shape of this story is wrong in a way that matters. Iceberg V3 shipped the deletion-vector specification in 2025, and engine support is rolling out across Spark, Trino, DuckDB, and Snowflake through 2026. That is genuinely new and genuinely useful. But "Iceberg V3 supports Puffin deletion vectors" and "Iceberg V3 satisfies the GDPR right-to-erasure" remain two different sentences that vendor marketing tends to collapse into one.

GDPR Article 17 is a legal obligation that runs across every accessible copy of the personal data: primary table, snapshots, time-travel windows, cold-tier archives, disaster-recovery replicas, third-party backups, downstream warehouses, machine-learning training sets, and any other place the row has been written. A deletion vector affects the primary table, while the rest of that list does not care about Iceberg metadata, so satisfying the regulation is an orchestration problem, a retention-policy problem, and a backup-architecture problem before it's a table-format problem.

What V3 may change is the part of the problem that lives inside the lakehouse. For that piece, the V2-era objection "Iceberg can't honor erasure requests in a reasonable timeframe" stops being a defensible architectural argument. The remaining objections (the ones about replicas, archives, and time-travel windows) apply equally to Iceberg, Delta, and every other versioned table format.

V2 mechanics

What a V2 delete actually did.

It's worth being precise about the V2 mechanics, because the compliance objection is about the specific gap between V2 and V3 rather than about Iceberg in general.

In Iceberg V2, a query like DELETE FROM users WHERE user_id = 'alice@example.com' did the following:

1. Identified affected files. The engine scanned the table metadata to find data files that might contain rows matching the predicate.
2. Wrote one or more delete files. Either a position-delete file (storing the row positions to delete within each affected data file) or an equality-delete file (storing the equality predicate for matching at read time).
3. Recorded the delete files in a new snapshot. Subsequent reads of the table would consult the delete files and skip the matched rows.

After the operation, three things were true at once: queries against the table no longer returned the deleted rows, the original data files in object storage still contained those rows, and a future compaction operation would eventually rewrite the data files without the deleted rows, at which point the bytes were physically gone, but only when compaction ran.
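The gap between "no longer returned by queries" and "physically gone" is easier to see in a toy model. The sketch below is an illustration only: `DataFile`, `PositionDeleteFile`, `delete_where`, `scan`, and `compact` are hypothetical names, and plain Python lists stand in for Parquet files. It is not how any engine implements V2 deletes.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Stand-in for an immutable data file in object storage."""
    path: str
    rows: list[str]  # one user_id per row; list index = row position


@dataclass
class PositionDeleteFile:
    """Stand-in for a V2 position-delete file: (data file path, row position) pairs."""
    entries: set[tuple[str, int]] = field(default_factory=set)


def delete_where(files, deletes, user_id):
    """Find matching rows and record their positions. Data files are not modified."""
    for f in files:
        for pos, value in enumerate(f.rows):
            if value == user_id:
                deletes.entries.add((f.path, pos))


def scan(files, deletes):
    """Read path: consult the delete entries and skip matched rows."""
    return [
        value
        for f in files
        for pos, value in enumerate(f.rows)
        if (f.path, pos) not in deletes.entries
    ]


def compact(files, deletes):
    """Rewrite every data file without its deleted rows, then drop the delete entries."""
    rewritten = [
        DataFile(
            f.path + ".rewritten",
            [v for pos, v in enumerate(f.rows) if (f.path, pos) not in deletes.entries],
        )
        for f in files
    ]
    deletes.entries.clear()
    return rewritten


files = [DataFile("data-0.parquet", ["alice@example.com", "bob@example.com"])]
deletes = PositionDeleteFile()
delete_where(files, deletes, "alice@example.com")

visible_after_delete = scan(files, deletes)                    # row is hidden from queries
bytes_still_present = "alice@example.com" in files[0].rows     # but still stored
files = compact(files, deletes)                                # only now is it removed
```

Between `delete_where` and `compact`, the row is invisible to `scan` but still present in the stored file, which is the window the compliance objection was about.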

The compliance objection lived in the gap between the first and third items. GDPR Article 17 requires that personal data be erased "without undue delay." Different EU data protection authorities have interpreted "without undue delay" in different ways. Some guidance points toward 30 days, some toward 72 hours, and the European Data Protection Board has not pinned this down with a single binding number in primary guidance I can cite. What is uncontroversial: a weekly compaction cadence on a high-volume table is almost certainly outside any reasonable reading of "without undue delay," and weekly was the typical V2 cadence for big tables because of the operational cost.

The V2-era workarounds were three:

1. Run compaction on demand after every DELETE. Operationally expensive, especially for tables with frequent erasure requests, and the per-erasure cost was hard to amortize.
2. Use crypto-shredding. Encrypt personal data with a per-subject key, then destroy the key when the subject requests erasure. The bytes remain but are computationally unrecoverable. Some regulator guidance accepts this as erasure; others have been more skeptical, particularly in EU contexts where "anonymization" and "pseudonymization" are sharply distinguished. Treat this as a legal-counsel question in your jurisdiction, not a software guarantee.
3. Choose Delta Lake instead. Delta has had deletion vectors longer, with a faster compaction story. This was the V2-era recommendation for compliance-first deployments and the main reason in my earlier Iceberg-vs-Delta framework to prefer Delta when erasure obligations were a primary requirement.
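For readers who haven't seen crypto-shredding in practice, here is a minimal sketch of the idea. Everything in it is an assumption made for illustration: `SubjectKeyStore` is a hypothetical in-memory stand-in for a KMS or HSM, and the SHAKE-256 keystream is a toy cipher chosen only to keep the example inside the Python standard library. A real deployment would use a vetted authenticated cipher, and whether key destruction counts as erasure remains the legal question described above.

```python
import hashlib
import secrets


class SubjectKeyStore:
    """Hypothetical per-subject key store (in practice: a KMS or HSM)."""

    def __init__(self):
        self._keys: dict[str, bytes] = {}

    def key_for(self, subject_id: str) -> bytes:
        # Create the subject's key on first use, reuse it afterwards.
        if subject_id not in self._keys:
            self._keys[subject_id] = secrets.token_bytes(32)
        return self._keys[subject_id]

    def get(self, subject_id: str):
        return self._keys.get(subject_id)

    def shred(self, subject_id: str) -> None:
        # "Erasure" here means destroying the key, not the ciphertext.
        self._keys.pop(subject_id, None)


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # TOY keystream for illustration only. Not a vetted cipher.
    return hashlib.shake_256(key + nonce).digest(length)


def encrypt(store: SubjectKeyStore, subject_id: str, plaintext: bytes) -> bytes:
    key = store.key_for(subject_id)
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))


def decrypt(store: SubjectKeyStore, subject_id: str, blob: bytes):
    key = store.get(subject_id)
    if key is None:
        return None  # key was shredded; the stored bytes can no longer be read
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


store = SubjectKeyStore()
stored = encrypt(store, "alice", b"alice@example.com")
before = decrypt(store, "alice", stored)   # readable while the key exists
store.shred("alice")
after = decrypt(store, "alice", stored)    # None once the key is destroyed
```

The stored bytes never change; only the key store does. That is why this approach sidesteps compaction cost, and also why it moves the erasure question onto every copy of the key.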

V3 makes the third workaround unnecessary by closing the capability gap inside the primary table. The first two workarounds (on-demand compaction, crypto-shredding) remain available and may still make sense for specific patterns, but they're no longer architectural escape hatches for an Iceberg-specific weakness.

V3 mechanics

What V3 deletion vectors actually change.

A V3 deletion vector is a bitmap, stored as a blob in a Puffin file, that marks which rows in a given data file are logically deleted. It supersedes V2 position-delete files for new writes; equality-delete files remain part of the V3 spec for writers that cannot cheaply look up row positions.

The architectural shift is that each data file has at most one deletion vector, referenced directly from table metadata, rather than an open-ended set of position-delete files the engine has to find and merge at read time. From the query engine's perspective, reading a V3 data file with a deletion vector is just "skip the rows marked in the bitmap." The engine still has to fetch the bitmap from its Puffin file, but there are no delete files to sort and join against the data file, and the read-path overhead drops accordingly.
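A toy model of the "skip the rows marked in the bitmap" read path may help. This is a sketch under stated assumptions: `DeletionVector` and `read_live_rows` are hypothetical names, and a Python integer is used as the bitset. The real format uses a compressed (roaring-style) bitmap serialized into a Puffin blob, which this does not attempt to reproduce.

```python
class DeletionVector:
    """Bitmap of deleted row positions for ONE data file (a Python int used as a bitset)."""

    def __init__(self):
        self._bits = 0

    def mark_deleted(self, position: int) -> None:
        self._bits |= 1 << position

    def is_deleted(self, position: int) -> bool:
        return bool((self._bits >> position) & 1)

    def cardinality(self) -> int:
        # Number of rows currently marked deleted.
        return bin(self._bits).count("1")


def read_live_rows(rows, dv=None):
    """Return the rows of a data file, skipping positions marked in its deletion vector."""
    if dv is None:
        return list(rows)
    return [row for pos, row in enumerate(rows) if not dv.is_deleted(pos)]


rows = ["alice@example.com", "bob@example.com", "carol@example.com"]
dv = DeletionVector()
dv.mark_deleted(0)
live = read_live_rows(rows, dv)
```

The data file itself is untouched here, exactly as with a V2 delete. The difference is in how cheaply the reader can apply the delete, not in whether the bytes are gone.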

The operational shift is that deletion vectors are small (typically under one percent of the data file size) and cheap to write. A delete against a single row writes one small bitmap per affected data file, and the cost is dominated by predicate evaluation rather than I/O. This is the same shape Delta Lake has used for several years, and the mechanism is convergent because the underlying problem is the same.

The compliance-relevant shift is that deletion-vector-aware compaction is cheap enough to run aggressively, because V2 compaction was a "rewrite the entire data file to remove the deleted rows" operation, whereas V3 compaction can selectively rewrite only data files whose deletion-vector ratio crosses a threshold (for example, "if more than five percent of rows in this file are marked deleted, rewrite it without those rows"). Selective compaction is fast enough to run hourly or even continuously on a maintenance schedule. The lag between "user requests deletion" and "bytes are physically gone in the primary table" may drop from days-to-weeks (V2 cadence) to minutes-to-hours (V3 cadence).
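The threshold rule in that example can be sketched as a file-selection function. The names below (`FileStats`, `files_to_rewrite`, `oldest_delete_age_hours`) are hypothetical and not taken from any engine's compaction API. The age trigger is my own addition rather than something the spec prescribes: a file holding a single erased row may never cross a five percent ratio, so a ratio-only policy could leave that row's bytes in place indefinitely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    row_count: int
    deleted_count: int               # rows marked in the file's deletion vector
    oldest_delete_age_hours: float   # hypothetical: age of the oldest unapplied delete

    @property
    def deleted_ratio(self) -> float:
        return self.deleted_count / self.row_count if self.row_count else 0.0


def files_to_rewrite(stats, ratio_threshold=0.05, max_delete_age_hours=None):
    """Select data files to rewrite without their deleted rows.

    A file qualifies if its deleted-row ratio exceeds the threshold, or
    (optionally) if any of its deletes has been pending longer than the
    erasure-driven age limit.
    """
    selected = []
    for s in stats:
        if s.deleted_count == 0:
            continue  # nothing to remove
        over_ratio = s.deleted_ratio > ratio_threshold
        too_old = (
            max_delete_age_hours is not None
            and s.oldest_delete_age_hours >= max_delete_age_hours
        )
        if over_ratio or too_old:
            selected.append(s.path)
    return selected


stats = [
    FileStats("a.parquet", 1_000, 10, 1.0),        # 1% deleted, recent
    FileStats("b.parquet", 1_000, 80, 1.0),        # 8% deleted
    FileStats("c.parquet", 1_000_000, 1, 30.0),    # one erased row, pending 30 hours
    FileStats("d.parquet", 1_000, 0, 0.0),         # no deletes
]
ratio_only = files_to_rewrite(stats)
with_age_limit = files_to_rewrite(stats, max_delete_age_hours=24)
```

With the ratio alone, only `b.parquet` is rewritten. Adding the age limit also picks up `c.parquet`, which is the case an erasure SLA actually cares about.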

A second compliance-relevant property is auditability. A V3 deletion vector is added to the table in a specific snapshot, and table metadata records that snapshot together with its commit timestamp. An auditor's question ("show me when this row was marked deleted") is answerable from table metadata with a definitive timestamp, without requiring a separate audit log, for as long as that snapshot's metadata is retained. The bytes-rewritten timestamp is similarly available from the compaction-snapshot metadata.
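As a rough illustration of where that timestamp comes from, the sketch below reads the `snapshots` list of a table-metadata JSON document and returns the commit time for a given snapshot. The `snapshot-id` and `timestamp-ms` field names follow the Iceberg table-metadata layout as I understand it. The sample document, the snapshot IDs, and the helper name `snapshot_commit_time` are made up for the example.

```python
import json
from datetime import datetime, timezone

# Made-up, heavily trimmed metadata document for illustration only.
SAMPLE_METADATA = json.dumps({
    "format-version": 3,
    "snapshots": [
        {"snapshot-id": 101, "timestamp-ms": 1767225600000},  # delete marked
        {"snapshot-id": 102, "timestamp-ms": 1767229200000},  # file rewritten
    ],
})


def snapshot_commit_time(metadata_json: str, snapshot_id: int) -> datetime:
    """Return the UTC commit time recorded for a snapshot in table metadata."""
    metadata = json.loads(metadata_json)
    for snap in metadata.get("snapshots", []):
        if snap["snapshot-id"] == snapshot_id:
            return datetime.fromtimestamp(snap["timestamp-ms"] / 1000, tz=timezone.utc)
    raise KeyError(f"snapshot {snapshot_id} not found in metadata")


marked_at = snapshot_commit_time(SAMPLE_METADATA, 101)
rewritten_at = snapshot_commit_time(SAMPLE_METADATA, 102)
lag = rewritten_at - marked_at
```

One thing worth checking in your own deployment: expiring a snapshot removes its entry from current table metadata, so an audit trail that must outlive snapshot expiry probably needs these timestamps copied somewhere durable.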

For most reasonable readings of "without undue delay" under Article 17, hours is comfortable while days is questionable and weeks is not, so V3 puts the primary table in the comfortable bucket without any of the V2-era workarounds, and that's as far as the technical claim goes. The legal claim (that the deployment as a whole satisfies Article 17) depends on a lot more than the primary table, and the rest of this post is about the "lot more."

Caveats that don't go away

What V3 deletion vectors don't fix.

There are three categories of erasure problem that V3 deletion vectors do not solve, and I've ordered them by how often they catch teams off guard.

1. Time travel and snapshot retention

Iceberg's time-travel feature (querying the table as it existed at a previous snapshot) is structurally incompatible with hard erasure. If you can query "the table as it existed yesterday," then yesterday's snapshot still contains the personal data of a subject who requested erasure today, and V3 doesn't change that.

To honor an erasure request fully, you have to also expire snapshots that contain the deleted data, which means your maximum time-travel window is bounded by your erasure-policy SLA. For a 30-day treatment of "without undue delay," your maximum time-travel retention is 30 days. For a 72-hour treatment, three days. There is no architectural escape from this tradeoff; it's a policy choice between operational recovery ("we can roll the table back if an ETL job corrupts it") and erasure fidelity ("we honor deletion requests across all readable state").

Most production deployments split the difference: a bounded time-travel window aligned to the erasure SLA, and a documented guarantee that snapshots older than that window no longer contain rows subject to deletion requests. This is a tradeoff Iceberg shares with Delta Lake and every other versioned table format, so it isn't an Iceberg-specific weakness.
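The bound can be made slightly more precise with a little arithmetic. The sketch below assumes, as my own simplification, that snapshots are expired by age on a periodic job. Under that assumption the usable time-travel window is the erasure SLA minus the time it takes to commit the delete and minus the expiry job's interval. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def max_time_travel_window(erasure_sla, delete_commit_lag, expiry_job_interval):
    """Largest snapshot-retention window that still meets the erasure SLA.

    Worst case: the last snapshot containing the row was committed just before
    the delete, and the expiry job has only just run.
    """
    window = erasure_sla - delete_commit_lag - expiry_job_interval
    return max(window, timedelta(0))


def snapshots_to_expire(snapshots, now, window):
    """snapshots: list of (snapshot_id, committed_at), oldest first.

    The most recent snapshot is never selected, since it is the current table state.
    """
    cutoff = now - window
    return [sid for sid, committed_at in snapshots[:-1] if committed_at < cutoff]


window = max_time_travel_window(
    erasure_sla=timedelta(days=30),
    delete_commit_lag=timedelta(hours=6),
    expiry_job_interval=timedelta(hours=1),
)

now = datetime(2026, 3, 1, tzinfo=timezone.utc)
snapshots = [
    (1, now - timedelta(days=40)),
    (2, now - timedelta(days=35)),
    (3, now - timedelta(days=10)),
]
expired_ids = snapshots_to_expire(snapshots, now, window)
```

In this example the window comes out a few hours short of the 30-day SLA. Expiring a snapshot only removes the reference, so the files it pointed to still have to be rewritten or cleaned up before the bytes are gone.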

2. Cold-tier archives

If you've tiered old Iceberg data to S3 Glacier Deep Archive, Azure Archive, or physical tape backups (a common pattern for long-retention security logs), deletion vectors don't reach those archives. The cold-tier copy of the data file still contains the deleted rows, and the Puffin metadata that records the deletion lives only in the warm tier.

The V3 fix for this is "rewrite the cold-tier file too," but cold-tier rewrites are expensive (retrieval fees, restore latency, tape rewrites) and operationally awkward. The pattern most teams use in practice is to not tier rows to cold storage until they are past the erasure window, which means cold-tier data is, by policy, no longer subject to right-to-erasure requests because the retention period has expired.

That policy is legally defensible if and only if the cold-tier retention horizon is itself within whatever you've told regulators and data subjects. If you're keeping cold-tier copies indefinitely and treating them as exempt from erasure, that's a conversation to have with counsel rather than a configuration choice you can settle on your own.

3. Replicas, backups, and downstream copies

Disaster-recovery replicas, cross-region copies, third-party backup tools (Veeam, Druva, NetBackup, Cohesity), and downstream warehouses fed by CDC pipelines: none of these read Iceberg metadata or deletion vectors. They see Parquet files as opaque blobs. An erasure request that propagates through V3 deletion vectors to the primary table does not propagate to the replicas, backups, or downstream copies on its own.

This is, in my experience, the most underestimated piece of the right-to-erasure architecture problem. The V2-era teams that chose Delta Lake "for GDPR compliance" had this exact issue and usually solved it the same way: maintain a cross-system erasure orchestrator that propagates deletion requests to every copy of the data, with an auditable log of which systems acknowledged the request. V3 deletion vectors don't change the orchestration problem; they just shorten the erasure lag inside the primary table.

The legal posture I'd want before defending an Article 17 response is roughly: "the row was marked deleted in the primary table at time T1, physically rewritten by T2, propagated to replicas by T3, purged from backup catalogs by T4, and confirmed removed from the downstream warehouse by T5, all within the SLA we committed to." V3 helps with T1 and T2. The rest is orchestration, contract terms with backup vendors, and downstream-pipeline design. None of it is implied by the bullet "Iceberg supports deletion vectors."
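The T1 through T5 posture maps naturally onto a per-request ledger. Below is a minimal sketch of one, assuming five stage names that mirror the sentence above. `ErasureRequest`, `REQUIRED_STAGES`, and the method names are hypothetical, and a real orchestrator would also need retries, durable storage, and authenticated acknowledgments.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# Hypothetical stage names corresponding to T1..T5.
REQUIRED_STAGES = (
    "primary_marked",      # T1: deletion vector committed
    "primary_rewritten",   # T2: data file rewritten without the row
    "replicas",            # T3
    "backups",             # T4
    "downstream",          # T5
)


@dataclass
class ErasureRequest:
    subject_id: str
    received_at: datetime
    sla: timedelta
    acks: dict = field(default_factory=dict)  # stage -> acknowledgment time

    def acknowledge(self, stage: str, at: datetime) -> None:
        if stage not in REQUIRED_STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.acks[stage] = at

    def missing(self):
        """Stages that have not acknowledged the erasure yet."""
        return [s for s in REQUIRED_STAGES if s not in self.acks]

    def late(self):
        """Stages that acknowledged after the committed deadline."""
        deadline = self.received_at + self.sla
        return [s for s in REQUIRED_STAGES if s in self.acks and self.acks[s] > deadline]

    def defensible(self) -> bool:
        """True only if every stage acknowledged, and did so within the SLA."""
        return not self.missing() and not self.late()


t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
request = ErasureRequest("alice", received_at=t0, sla=timedelta(days=30))
request.acknowledge("primary_marked", t0 + timedelta(minutes=5))
request.acknowledge("primary_rewritten", t0 + timedelta(hours=2))
request.acknowledge("replicas", t0 + timedelta(days=1))
request.acknowledge("backups", t0 + timedelta(days=20))
still_missing = request.missing()
```

In this run the primary table is done within hours, yet the request is not defensible because the downstream warehouse has not acknowledged. That is the shape of the problem the table format cannot solve.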

PCI-DSS and HIPAA

Two adjacent regimes where the picture is similar.

GDPR is not the only compliance regime that cares about erasure or deletion, and there are two adjacent ones where V3 deletion vectors may change the technical conversation in the same way and where the legal hedge is the same shape.

PCI-DSS

PCI-DSS v4.0 Requirement 3 covers cardholder data retention and disposal. Cardholder data should be retained only as long as required for legal, regulatory, or business reasons, and the standard requires that systems "support secure deletion of cardholder data." The "secure deletion" language has historically been read by QSAs (qualified security assessors) in two ways: cryptographic erasure of data at rest via key destruction, or demonstrable physical removal from accessible storage. The PCI Council's more recent guidance has leaned toward the second reading, but the interpretive flexibility varies by assessor.

V3 deletion vectors plus aggressive compaction may give you a defensible answer for a PCI audit of the primary table: "the row was marked deleted at snapshot N (timestamp T1), the underlying data file was rewritten without the row at snapshot M (timestamp T2), the table's snapshot metadata records both events, and T2 - T1 is within our committed disposal SLA." Whether your QSA accepts that as "secure deletion" depends on the QSA, the rest of your control environment, and the same archive and replica considerations described above. I would not represent V3 deletion vectors as "PCI-compliant erasure" to an assessor without a named QSA opinion in writing.

HIPAA

HIPAA's right-to-amendment (45 CFR 164.526) is similar in spirit to GDPR's right-to-erasure but with different mechanics. A patient may request that their protected health information be amended, and the covered entity has 60 days to act. The "remove the record entirely" path is rare in practice (most amendments are corrections rather than deletions), but for the cases where deletion applies, V3 deletion vectors give the same audit-trail capability described for PCI: a metadata-recorded marking timestamp and a metadata-recorded rewrite timestamp, both within the 60-day window. Again, this addresses the primary table. The covered entity's broader obligations around backups, business-associate copies, and downstream sharing remain orchestration problems that V3 doesn't touch.

Architecture implications

What this changes for V2 deployments today.

If you're operating a security data lake on Iceberg V2 with an active right-to-erasure obligation under GDPR, CCPA, PCI-DSS, or HIPAA, V3 should be on your near-term roadmap. Five specific moves I'd prioritize, in order:

1. Audit your current erasure SLA end-to-end. What's the lag between an erasure request and the bytes being physically gone in every copy of the data: primary, replicas, backups, downstream warehouses, archives? For most V2 deployments running weekly compaction on the primary, the primary-table lag is days-to-weeks; the full-stack lag is usually worse. Compare to what you've committed to data subjects and regulators. If there's a gap, that's the size of the fix you need to scope.
2. Plan the V3 engine upgrade. Verify that your query engines (Spark, Trino, DuckDB, Snowflake, ClickHouse) support V3 deletion vectors and the Puffin file format at the versions you run. As of mid-2026, Spark 4.1 has full support, Trino's V3 support is rolling out, DuckDB has partial support depending on the build, and Snowflake's adoption is catalog-side and ongoing. Pin this to your engine roadmap, not to the abstract Iceberg version.
3. Update your compaction schedule. Once on V3 with engine support, switch to deletion-vector-aware selective compaction running hourly or continuously. The cost increase over weekly full-table compaction is modest because the metadata overhead is small; the compliance benefit on the primary-table side may be substantial.
4. Don't extend your time-travel retention. The time-travel-versus-erasure tradeoff is unchanged. Continue to bound time-travel retention by your erasure SLA, and document the tradeoff explicitly so internal stakeholders understand why the operational-recovery window is capped.
5. Audit the rest of the data estate. Replicas, third-party backups, downstream warehouses, machine-learning training sets, analytics caches, BI extracts. Build or buy a cross-system erasure orchestrator. This is the part most easily neglected when the primary-table story looks clean.
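The end-to-end audit in the first move reduces to one comparison: the slowest copy sets the full-stack lag, and the gap is whatever exceeds the committed SLA. A small sketch follows, with made-up lag figures and a hypothetical `erasure_gap` helper. Your own numbers would come from measuring each system.

```python
from datetime import timedelta


def erasure_gap(measured_lags, committed_sla):
    """Return (slowest copy, full-stack lag, gap beyond the committed SLA).

    measured_lags maps each copy of the data to the observed time between an
    erasure request and the bytes being physically gone from that copy.
    """
    worst_copy = max(measured_lags, key=measured_lags.get)
    full_stack_lag = measured_lags[worst_copy]
    gap = max(full_stack_lag - committed_sla, timedelta(0))
    return worst_copy, full_stack_lag, gap


# Illustrative figures only, not measurements from any real deployment.
measured = {
    "primary_table": timedelta(days=7),
    "dr_replica": timedelta(days=8),
    "third_party_backups": timedelta(days=35),
    "downstream_warehouse": timedelta(days=9),
}
worst, lag, gap = erasure_gap(measured, committed_sla=timedelta(days=30))
```

With these figures the primary table is comfortably inside the SLA and the backups are not, so a V3 upgrade alone would leave the five-day gap unchanged.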

If you're choosing between Iceberg and Delta Lake for a new deployment with compliance as a primary requirement, the V2-era recommendation "choose Delta for GDPR" is no longer the deciding table-format argument. The two formats use convergent mechanisms for the primary table. The remaining decision factors (vendor neutrality, multi-engine support, ecosystem maturity, Databricks commitment, the longer Delta production track record on these specific erasure features) are the ones that should drive the choice. Compliance still matters; it just no longer narrows the choice on its own.

What I won't claim

Three sentences I keep out of customer conversations.

There are three claims I see in vendor decks and blog posts around V3 deletion vectors that I will not make to a security architect or compliance team, because they cross a line between technical capability and legal interpretation that the speaker is usually not qualified to cross.

"Iceberg V3 is GDPR-compliant." Table formats are not GDPR-compliant. Deployments are, when reviewed by counsel, against a specific set of facts including retention policy, backup strategy, replica topology, and downstream propagation. A table format is one ingredient.
"V3 deletion vectors satisfy Article 17." Article 17 imposes obligations on the controller across all copies of the data. A mechanism that affects the primary table satisfies part of the implementation, not the obligation in full.
"V3 closes the compliance gap with Delta." The capability gap on the primary table has closed in the spec. The track-record gap (Delta has shipped this mechanism in production for longer) has not, and risk-averse compliance teams may reasonably weight that. I think the gap is small enough that vendor-neutrality and multi-engine arguments outweigh it, but reasonable people disagree.

The shorthand I use when an architect asks the bottom-line question: Iceberg V3 may remove the last technical reason to prefer Delta on right-to-erasure grounds, but it does not on its own make any deployment compliant with any regulation, because the legal posture is downstream of architecture, policy, and counsel review rather than downstream of a table-format upgrade.

Practical guidance

What I'd actually do in 2026.

Five moves I'd recommend to a security architect operating a lakehouse under any active erasure regime, roughly ordered by cost to execute:

1. Write down the SLA you've already committed to. Pull the privacy notice, the data processing agreement template, the BAA template if you're in healthcare, and whatever the contract says about response time to deletion requests. The committed number is the design constraint. A lot of teams design to "best effort" because they've never written the number down.
2. Map the actual copies of the data. Primary table, all snapshots, all replicas (DR, cross-region, read replicas), all backups (in-platform and third-party tooling), all downstream consumers (warehouses, BI tools, ML training pipelines, caches), all archives. This is usually the longest list in the room and the one that drives the rest of the architecture.
3. Decide where deletion vectors actually help. V3 deletion vectors solve the primary table cleanly. They do nothing for the rest of the copy list. Cost the V3 upgrade against the piece of the problem it actually addresses, not against the regulatory obligation as a whole.
4. Invest in the orchestration layer. Whatever you call it (privacy-rights service, deletion orchestrator, subject-access fulfillment), this is the load-bearing piece. A well-instrumented orchestrator with named retry, named acknowledgment from every system, and an auditable per-request log is more important to your regulatory posture than any specific table format choice.
5. Get named counsel on the interpretation questions. Cryptographic erasure versus physical erasure. "Without undue delay" in your jurisdictions. Whether retention exceptions apply to cold-tier archives. Whether QSA opinion accepts deletion-vector compaction as "secure deletion." These are not technical questions, and resolving them with a vendor blog post is how teams end up with surprises at audit time.

None of this is meant to dim the V3 story. Iceberg V3 deletion vectors are a real, useful, and overdue improvement, and the right-to-erasure conversation around Iceberg-based security lakehouses is materially better than it was in 2024, because the architectural objection that drove the format choice through the V2 era no longer holds. That's a shift worth marking, but it's not the same shift as "Iceberg is now GDPR-compliant," and I'd rather draw the line clearly than leave architects holding a regulatory exposure they didn't realize was still there.