V4 relative paths vs DuckLake. Different facets of the same complaint, not the same fight.
The take I keep seeing on LinkedIn is that Iceberg V4's just-finalized relative-path support is the
Iceberg community "taking aim" at DuckLake. I read the actual V4 proposal text alongside the DuckLake
primary-source materials and the framing doesn't survive contact with either document. V4 relative paths
is a portability and disaster-recovery fix that keeps Iceberg metadata in *-metadata.json and
*.avro files. DuckLake replaces the file-based metadata layer entirely with a SQL database.
Both reduce the same general complaint (Iceberg metadata is awkward at scale) but they target genuinely
different facets of it. Conflating the two obscures real architectural choices a regulated-environment or
on-prem-constrained operator needs to make.
Attribution note
The "Iceberg table as a contract" framing I use in the middle of this piece is not mine. It originates with Roman Kolesnev (Principal Software Engineer, Streambased) in his post "Your Iceberg table doesn't need to exist". Tom Scott (Streambased CEO) wrote a related Kafka-to-Iceberg virtual-view piece in April 2026. My contribution is applying that framing to the cybersecurity on-prem context; the framing itself is Streambased's.
The framing I keep seeing
"Iceberg is going after DuckLake." It isn't, quite.
Dipankar Mazumdar's announcement that "Relative Path Support is finalized for Apache Iceberg v4" hit LinkedIn around 2026-05-24 and the responses landed in two clusters. The first cluster was the reasonable one: practitioners noting that the change makes table relocation, replication, and backup cleaner. The second cluster was the one that got me to write this piece. A handful of posts framed the finalization as the Iceberg community responding to competitive pressure from DuckLake, the database-backed lakehouse format that DuckDB Labs and friends launched in May 2025.
I understand the reflex. Both projects are in the same broad neighborhood. Both are pitched, at least in part, as "make Iceberg metadata less painful at scale." The DuckLake creators have been pointedly explicit that file-based metadata is the wrong architecture for high-frequency workloads. V4 is the first Iceberg release cycle to land at the same time DuckLake has visible momentum. If you squint, it looks like a response.
But when I sat down with the actual V4 relative-paths proposal text and re-read the DuckLake primary sources side by side, it became clear that the two efforts are not solving the same problem. They are solving genuinely different facets of the same complaint. The complaint ("Iceberg metadata is awkward at scale") has at least three sub-facets I can name: it's hard to move (portability), it's slow at high commit rates (operational latency), and it's awkward to coordinate across concurrent writers (transactional ACID over a shared file store). V4 relative paths takes a clean shot at the first sub-facet and explicitly disclaims the others. DuckLake takes a clean shot at the second and third and barely engages with the first.
This piece is the read-through. I'll quote the actual proposal text and the actual DuckLake materials, then map out where each architecture lands and where an operator's pain has to be diagnosed before either is the answer.
Section 1
What V4 relative paths actually scopes.
The V4 relative-paths proposal (the working document Iceberg committers were iterating on into May 2026) is unusually clear about its goals. The headline goal is in the first paragraph:
Goal #1: Zero Rewrite Relocation. Eliminate the need for any level of metadata rewrite to change table location.
And the use cases it cites are the ones that have been painful for Iceberg operators for years:
Key use-cases that require metadata and data relocation include: Replication, copying table state and history to another data center or availability zone for high availability or read scaling. Backup, archiving table state and history for disaster recovery and compliance purposes. Data Migration, moving tables between different storage systems (e.g., HDFS to GCS, on-prem to cloud).
Read those use cases as a regulated-environment operator. Replication for HA across availability zones. Backup for disaster recovery and compliance. Data migration from HDFS to GCS, or on-prem to cloud. These are exactly the operations a financial services or healthcare lakehouse has to be able to do, and they are exactly the operations that V2 and V3 Iceberg made unnecessarily painful, because every data-file path stored inside the metadata was an absolute URI bound to the bucket the table was first written to. Moving the table required either rewriting every manifest or relying on bucket-level redirection tricks that don't survive across cloud providers.
What makes the proposal worth reading verbatim is the non-goals section. The proposal is disciplined about what it is not trying to do:
Non-Goals: Defining the specific mechanisms or services used to physically move or copy table data and metadata files between locations. This proposal focuses on the metadata representation to enable such operations.
Read that one carefully. The proposal is not specifying how you move the bytes. It is not building a replication service, a backup tool, or a migration utility. It is changing the on-disk representation of paths inside the existing metadata files so that the metadata files themselves are no longer location-bound. The tools that move the bytes (DistCp, rclone, S3 batch replication, vendor-specific replication APIs) remain whatever you were already using.
The actual mechanism is a small change to how path strings are interpreted:
A path string stored in any Iceberg metadata file is defined as: Absolute, a path string that includes a URI scheme (e.g., gs://, s3://, hdfs://, file:///). Relative, a path string that does not include a URI scheme.
That's the entire rule: a path with a scheme is absolute, and a path without one is relative, resolved
at read time against the table's current location, which the catalog or the engine knows. The
format of the metadata files themselves (the v3.metadata.json, the manifest list, the
Avro manifest files) is unchanged, the encoding is unchanged, the lookup path is unchanged, and the number
of files an engine opens to plan a query is unchanged. The only thing that changes is that a snapshot's
worth of file paths can be a bag of strings relative to the table location instead of a bag of absolute URIs.
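A minimal sketch of that resolution rule in Python. The function name, the regular expression, and the join behavior (trim slashes, then concatenate) are illustrative assumptions; the proposal text quoted above only defines the absolute-versus-relative distinction, not an implementation.

```python
import re

# Assumption: "includes a URI scheme" is detected as <scheme>:// at the start,
# which covers gs://, s3://, hdfs://, and file:///.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def resolve_path(table_location: str, stored_path: str) -> str:
    """Resolve a path string read from metadata against the table's current location."""
    if _SCHEME.match(stored_path):
        return stored_path  # absolute: used exactly as stored
    # relative: joined to wherever the table lives right now (hypothetical join rule)
    return table_location.rstrip("/") + "/" + stored_path.lstrip("/")


stored = "data/part-00000.parquet"
# The same stored string resolves under the primary and under a DR replica.
print(resolve_path("s3://prod-bucket/warehouse/events", stored))
print(resolve_path("gs://dr-bucket/replica/events", stored))
```

The point of the sketch is that relocating the table changes only the `table_location` argument; nothing stored in the metadata has to be rewritten.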
That is a real win, and I don't want to undersell it. For an operator who has to maintain DR replicas of a multi-petabyte Iceberg table across regions, or who anticipates an HDFS-to-cloud migration in the next two years, or who runs a regulated-environment lakehouse where storage-system mobility is a compliance primitive, V4 relative paths is a genuine quality-of-life improvement. But it isn't a re-architecture of how Iceberg stores metadata. The metadata still lives in files, the manifest files still get read at query-planning time, and the commit protocol still goes through the catalog, so almost none of the operational pain that DuckLake's creators describe gets touched by this proposal.
Section 2
What DuckLake actually does.
DuckLake launched in May 2025 from the DuckDB Foundation, with the public framing established in two
primary sources. The first is the official launch post on ducklake.select (2025-05-27). The
second is the YouTube conversation between Hannes Mühleisen (DuckDB creator, CWI) and Jordan Tigani
(MotherDuck, formerly engineering lead on BigQuery) titled "DuckLake & The Future of Open Table
Formats". If you only have time to read or watch one DuckLake source, the Mühleisen / Tigani video
is the one. Mühleisen built it and Tigani has the BigQuery operational scar tissue that informs why he
thinks DuckLake's architecture choice is the right one.
The architectural shift in DuckLake is not subtle. From Sanchit Vijay's November 2025 LinkedIn walkthrough, which I'm quoting because he names the contrast cleanly:
While formats like Iceberg and Delta Lake manage metadata through complex file hierarchies (JSON manifests, Avro files, manifest lists), Ducklake takes a structurally different approach by storing all metadata directly in a SQL database.
Vijay summarizes the design as three principles, which match the official launch framing closely enough that I'll quote them straight:
- Simplicity: "Just a SQL database for metadata and Parquet files for data."
- Scalability: "the same proven architecture that powers BigQuery and Snowflake, but with open formats."
- Speed: "Single SQL query replaces multiple HTTP requests for metadata operation, reducing latency and enabling sub-millisecond writes through data inlining."
The "BigQuery and Snowflake" comparison is the one Tigani earned the right to make. BigQuery's storage engine has always held catalog metadata in a transactional database, which is a big part of why it can plan queries against ten-thousand-partition tables in tens of milliseconds, while file-based table formats spend most of their planning budget opening manifest files. Snowflake's FoundationDB-based metadata service is the same idea applied at warehouse scale. DuckLake's pitch is that you can have that operational pattern without locking yourself into either vendor's storage engine. The data is still Parquet on object storage; the metadata is in a Postgres, DuckDB, or SQLite catalog you control.
I ran a version of that planning claim in my own lab rather than taking the vendor's word for it, and
the direction held cleanly while the magnitude came down to earth. Holding the engine constant (DuckDB
reading both formats) and letting files accumulate from ten to two hundred at five million rows,
Iceberg's end-to-end planning grew 6.5× and its plan_files step grew 17.6× as
the manifest chain lengthened, while DuckLake's SQL-catalog resolution stayed flat at about three
milliseconds, which is the BigQuery and Snowflake pattern reproduced on a single box. The streaming
advantage reproduced too, though at a modest 2–4× on tiny commits rather than the
100–900× the vendor's best-case numbers suggested, and at a billion rows reading
byte-identical Parquet the two formats came back at read parity on three of four queries, so the
catalog-in-a-database pattern is a write-path and planning story rather than a read-speed one. All of
this is Tier B, single machine, so the ratios transfer and the absolute times don't; I work through the
write-contract framing it implies in "the write pattern is the architectural decision" and the
read-neutrality in "the encoder is the read lever."
I went back and re-ran the streaming piece on a real object store rather than local disk, because the
thing I most distrust about a single-box benchmark is that local disk hides the per-commit round-trip
cost that an S3-style backend makes you pay, and that turned out to be the part that sharpens. Writing
100,000 rows as one batch commit versus 100 streaming commits against the MOAR reference stack on a MinIO
object store, Iceberg's per-commit overhead widened well past the local-disk picture: ingest went from
0.44 s to 16.3 s (about 37×), query planning from 8.7 ms to 181 ms (about 21×), and the
metadata footprint blew out from four files at 8.9 KB to 301 files at 4,579 KB, because every one of
those 100 commits writes its own metadata.json, manifest list, and Avro manifest, and on an
object store each of those is a separate round-trip rather than a cheap local write. DuckLake ran the
same 100-commit stream with its metadata in the catalog database, so there was no per-commit
metadata.json or manifest proliferation and planning stayed flat at about seven
milliseconds; with inlining off it still wrote 100 small Parquet files but ingested in 2.9 s, roughly
5.6× faster than Iceberg's stream, and with inlining on it wrote zero Parquet files for the small
commits and kept them in the catalog until compaction. That's consistent with the local-disk run and
sharper than it, because the object store is where Iceberg's metadata tax actually gets charged, and it
reinforces the same read of where the DuckLake advantage lives: the win is the flat catalog metadata and
the inlined small commits on the write path, not a read-speed multiplier, since on byte-identical data the
two formats read at parity. The honest caveat is unchanged, one host and Tier B, so the shape is the
finding and the absolute seconds aren't.
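The ratios and file counts in that run can be checked with a few lines of arithmetic. The sketch below uses only the figures reported above. The "three files per commit" term comes from the text (metadata.json, manifest list, Avro manifest); the one-file fixed baseline is an assumption chosen because it reproduces both reported counts, not something measured separately.

```python
def metadata_files(commits: int, per_commit: int = 3, baseline: int = 1) -> int:
    """Metadata objects after `commits` commits.

    per_commit=3 is from the text; baseline=1 is an assumed fixed
    starting file that makes the model match the reported counts.
    """
    return baseline + per_commit * commits


batch_files = metadata_files(1)      # one batch commit
stream_files = metadata_files(100)   # 100 streaming commits

ingest_ratio = 16.3 / 0.44       # Iceberg stream vs. batch ingest, seconds
planning_ratio = 181 / 8.7       # Iceberg stream vs. batch planning, milliseconds
ducklake_speedup = 16.3 / 2.9    # Iceberg stream vs. DuckLake stream (inlining off)

print(batch_files, stream_files)
print(round(ingest_ratio, 1), round(planning_ratio, 1), round(ducklake_speedup, 1))
```

The linear file-count model is the useful part: the footprint grows with the number of commits, not with the number of rows written.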
Tomasz Tunguz's June 2025 Pulse post (he's at Theory VC, which is not a neutral observer but is a well-informed one) put numbers on the resulting performance envelope:
Instead of building a custom catalog server, DuckLake uses a simple, elegant idea: a standard database to manage metadata. It uses a database for what it's good at.
DuckLake achieved sub-second query planning on a petabyte of data with 100 million snapshots, a scale that other systems can't handle.
full ACID compliance, so concurrent reads and writes are handled seamlessly, allowing entire teams (and their AI agents) to work on the data lake simultaneously.
The Michael Ritchie post on definite.app (also referenced widely as "Sub-Second Latency on a Petabyte") is the original framing source for the petabyte-scale numbers. I take all the performance claims with the caveat that they come from people who are publicly enthusiastic about the architecture. The benchmarks aren't independently replicated yet, and DuckLake's adoption is still early enough that production track-record data is thin. The architectural argument is clearer than the empirical validation at this stage.
Sean Knapp (Ascend.io founder) is a useful Tier-B data point on adoption: Ascend uses DuckDB extensively, Knapp was excited about DuckLake from the moment it launched, and his interest signals that the enterprise data-engineering crowd outside the DuckDB-native bubble is paying attention. Whether DuckLake's adoption breaks out of the DuckDB-native cohort and reaches the kind of enterprise scale that Iceberg has captured is the open question I'm tracking.
What I want the reader to take from this section: DuckLake is not a tweak to where path strings get
stored. It is the removal of file-based metadata as a layer. Manifest files don't exist in DuckLake.
Manifest lists don't exist. The *-metadata.json handshake doesn't exist. There is a SQL
database that knows the table's current snapshot, the list of Parquet files that snapshot points at, and
the change history, so a reader queries the database, gets the relevant Parquet file paths, and reads the
Parquet files from object storage, and that's the entire protocol. Commits are SQL transactions, which means
multi-writer ACID falls out of the database's existing concurrency control instead of needing to be
simulated through optimistic concurrency over a catalog API.
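To make the shape of that protocol concrete, here is a minimal sketch using Python's built-in sqlite3 module. The two tables and their columns are a deliberately simplified, hypothetical schema invented for illustration; they are not DuckLake's actual catalog schema, and the helper names are assumptions too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical two-table catalog, NOT the real DuckLake schema.
conn.executescript("""
    CREATE TABLE snapshot (snapshot_id INTEGER PRIMARY KEY);
    CREATE TABLE data_file (
        path TEXT NOT NULL,
        begin_snapshot INTEGER NOT NULL,  -- first snapshot that sees the file
        end_snapshot INTEGER              -- NULL while the file is still live
    );
""")


def commit_append(conn, paths):
    """Register a new snapshot and its Parquet files in one SQL transaction."""
    with conn:  # commits on success, rolls back if anything raises
        snap = conn.execute("INSERT INTO snapshot DEFAULT VALUES").lastrowid
        conn.executemany(
            "INSERT INTO data_file (path, begin_snapshot) VALUES (?, ?)",
            [(p, snap) for p in paths],
        )
    return snap


def plan_scan(conn, snapshot_id):
    """One query returns the Parquet files visible at a given snapshot."""
    rows = conn.execute(
        "SELECT path FROM data_file WHERE begin_snapshot <= ? "
        "AND (end_snapshot IS NULL OR end_snapshot > ?) ORDER BY path",
        (snapshot_id, snapshot_id),
    )
    return [r[0] for r in rows]


s1 = commit_append(conn, ["data/a.parquet"])
s2 = commit_append(conn, ["data/b.parquet"])
print(plan_scan(conn, s1))
print(plan_scan(conn, s2))
```

The commit is a handful of row inserts and the plan is a single SELECT, which is why there is no manifest chain to walk in this design.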
Section 3
What the SQL catalog costs back, on the correctness side.
The flat ~3 ms planning in the previous section is what the SQL catalog buys, and I think it's a real
win for the write-path and planning workloads it targets. But moving the metadata into a transactional
database doesn't make catalog-layer pain disappear. It relocates that pain from file-coordination problems
into database-coordination problems, and a security team standing DuckLake up on a Postgres catalog
meets that surface directly. So I ran the same kind of bench against DuckLake that the lab runs against
every engine, working the issue tracker's own open correctness and operability bugs rather than the
marketing, and pinning each verdict to the exact versions I tested (DuckDB 1.5.3 with the DuckLake
extension at commit e6a3bd0a), because catalog-layer bugs close release-to-release and a
bare "DuckLake has these bugs" would be a less honest verdict than a version-bound one. Three issues,
three different shapes of answer (Tier B, single host, the issues' own minimal repros).
The one I'd lead with for a security operator is the cross-store delete-conflict gap
(duckdb/ducklake #1215),
which persists on the version I tested. Two concurrent deletes of the same row both
commit without raising a conflict when one delete is inlined (the row count is at or under
DATA_INLINING_ROW_LIMIT) and the other is written as a Parquet delete file (over the
limit), because the commit-time conflict check never compares the inlined store against the Parquet
store. The deleted rows then come back: in both commit orders my repro left 29 rows
that a correct system would have deleted to zero, with no error raised anywhere. That
is the worst class of bug for a security lakehouse, because the operations that hit it are exactly the
ones you cannot afford to be wrong about: a GDPR erasure request, a retention-expiry purge, a
tombstoned false positive. Any one of them can silently fail to take effect under ordinary concurrent SQL
and leave resurrected data sitting in a table you've certified as deleted. A planning-speed advantage
doesn't help you if the delete you thought committed didn't.
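A toy model of the shape of that gap, in Python. This is not DuckLake's code: the dictionaries, the store labels, and both function names are hypothetical, and the only thing taken from the issue as described above is the idea that a conflict check which never compares the inlined store against the Parquet store will miss an overlap between them.

```python
def conflicts_same_store_only(a, b):
    """Gap shape: overlapping deletes are only detected within one store."""
    return a["store"] == b["store"] and bool(a["rows"] & b["rows"])


def conflicts_across_stores(a, b):
    """Corrected shape: compare deleted row ids regardless of where they landed."""
    return bool(a["rows"] & b["rows"])


# One small delete (inlined) and one large delete (written as a Parquet delete
# file) that both touch row 7. Row ids and sizes are made up for illustration.
inlined_delete = {"store": "inlined", "rows": {7}}
parquet_delete = {"store": "parquet", "rows": set(range(50))}

print(conflicts_same_store_only(inlined_delete, parquet_delete))
print(conflicts_across_stores(inlined_delete, parquet_delete))
```

In the first check both transactions would be allowed to commit; in the second, the overlap on row 7 is caught and one of them would have to retry.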
The second, and the most security-specific, is a wide-schema wall on a Postgres catalog
(duckdb/ducklake #1184),
which also persists. DuckLake unconditionally creates a backing
ducklake_inlined_data_<id>_0 table that carries every user column as
BYTEA, which runs straight into Postgres's hard 1600-column-per-table limit, so a
sufficiently wide table can't be created at all. I measured the boundary cleanly: 1500 columns creates
fine, while 1600 and 1700 both fail with ERROR: tables can have at most 1600 columns, and
setting ducklake_default_data_inlining_row_limit = 0 doesn't rescue it because the backing
schema is still emitted with every column. This is the one that should give a security architect pause,
because security schemas go wide on purpose: a fully flattened OCSF event with all profiles and
observables, or a normalized EDR or firewall dataset, routinely runs past 1600 columns. That is the
same width pressure I work through in the flattening-away-your-detection-logic bench, and it means a
Postgres-backed DuckLake catalog can't represent the schema-on-write shape that a lot of security
normalization actually produces.
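A pre-flight check for that ceiling can be sketched in a few lines of Python. The 1600-column limit is Postgres's documented maximum; the `bookkeeping_columns` parameter is an assumption. The measurements above only imply that the backing table carries at least one column beyond the user columns (since 1600 user columns already fails), so the default of 1 is a lower bound, not a known value.

```python
POSTGRES_MAX_COLUMNS = 1600  # hard per-table limit in PostgreSQL


def inlined_table_fits(user_columns: int, bookkeeping_columns: int = 1) -> bool:
    """Would a backing table with this many columns fit under the Postgres limit?

    bookkeeping_columns is a hypothetical stand-in for whatever extra
    columns the backing table adds; its true value is not stated above.
    """
    return user_columns + bookkeeping_columns <= POSTGRES_MAX_COLUMNS


for width in (1500, 1600, 1700):
    print(width, inlined_table_fits(width))
```

With the lower-bound default, the check agrees with the three measured widths; a real deployment would want the actual overhead from the version in use before trusting widths close to the limit.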
The third is the one that cuts the other way, and it's the reason I version-bind all of this rather
than asserting it. A Postgres connection-pool timeout
(duckdb/ducklake #1031)
was fixed between 1.5.2 and 1.5.3. On DuckDB 1.5.2 (where the extension auto-resolves
to commit 415a9ebd, the exact commit named in the issue) creating 60 tables and then
running a single information_schema.tables query hangs for 60 s and dies with
Connection pool timeout: all 14 connections in use, the pool exhausting at one connection
per worker thread on a 14-thread host, which is the thread-local-caching mechanism the issue's
root-cause writeup described. On 1.5.3 the identical workload returns in 0.036 s. The issue is still
labeled open upstream, but running both versions is what separates a real fix from a repro that just
under-triggers on a given host, and the lesson there is the method more than the bug: "probably fixed"
is worth nothing without reproducing the old version as a control, which is the same discipline I apply
to the
answer-equality CI work
where a silently-wrong Parquet reader got fixed across a chDB point release the same way.
None of this is vendor-bashing, and I want to be explicit that it isn't, because the SQL-catalog design is genuinely interesting and these are the same probes the lab runs against Iceberg and every other engine. The point is narrower and fairer than "DuckLake is buggy." The flat planning is real and it's bought with a database, and a database is a different correctness and operability surface than a tree of manifest files, so a security team weighing DuckLake as its catalog has to price in delete-conflict correctness under concurrency and the Postgres column ceiling alongside the planning win, and re-check all three against the next DuckLake release before repeating any of it, because #1215 and #1184 are open today and may well close the way #1031 already did.
Section 4
The structural difference, in one paragraph.
V4 relative paths changes the paths inside the manifest files, while DuckLake removes the manifest files, and that's most of the argument right there, because one change is to the encoding of strings inside an existing layer while the other change deletes the layer.
If you want a slightly longer version: an Iceberg V4 table that uses relative paths still has a metadata
directory, still has a v4.metadata.json file per commit, still has a manifest list per
snapshot, still has one or more Avro manifest files per snapshot. An engine planning a query against
that table opens the metadata.json, follows it to the manifest list, opens the manifest list, follows
it to the manifest files, opens the manifest files, and gets the data file paths to scan. That's at
minimum three HTTP round trips before the engine even knows which Parquet files to read. Relative
paths doesn't change the trip count. It changes whether the paths the engine assembles from those
files include a URI scheme or get a URI scheme prepended at resolution time.
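The walk described above can be sketched with an in-memory stand-in for an object store that counts fetches. Everything here is illustrative: the class and function names are hypothetical, the file names are made up, and JSON is used for the manifest list and manifest purely for brevity, whereas real Iceberg stores those as Avro.

```python
import json


class CountingStore:
    """In-memory stand-in for an object store that counts GET requests."""

    def __init__(self, objects):
        self.objects = objects
        self.gets = 0

    def get(self, key):
        self.gets += 1
        return json.loads(self.objects[key])


def resolve(location, path):
    # Simplified rule: a path containing "://" is absolute, anything else is relative.
    return path if "://" in path else location.rstrip("/") + "/" + path


def plan(store, location, metadata_path):
    """Walk metadata.json -> manifest list -> manifests -> data file paths."""
    meta = store.get(resolve(location, metadata_path))
    manifest_list = store.get(resolve(location, meta["manifest_list"]))
    data_files = []
    for m in manifest_list["manifests"]:
        manifest = store.get(resolve(location, m))
        data_files += [resolve(location, f) for f in manifest["data_files"]]
    return data_files


def build_table(location, relative):
    """Build the same toy table with either relative or absolute stored paths."""
    p = (lambda s: s) if relative else (lambda s: location + "/" + s)
    return {
        location + "/metadata/v1.metadata.json": json.dumps({"manifest_list": p("metadata/snap-1.json")}),
        location + "/metadata/snap-1.json": json.dumps({"manifests": [p("metadata/m0.json")]}),
        location + "/metadata/m0.json": json.dumps({"data_files": [p("data/a.parquet"), p("data/b.parquet")]}),
    }


loc = "s3://bucket/events"
abs_store = CountingStore(build_table(loc, relative=False))
rel_store = CountingStore(build_table(loc, relative=True))
abs_files = plan(abs_store, loc, "metadata/v1.metadata.json")
rel_files = plan(rel_store, loc, "metadata/v1.metadata.json")
print(abs_store.gets, rel_store.gets, abs_files == rel_files)
```

Both variants make the same number of fetches and produce the same scan list; the only difference is where the scheme and prefix come from.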
A DuckLake table doesn't have any of those files. An engine planning a query against a DuckLake table issues a SQL query against the metadata database (typically a single statement, often satisfied from the database's in-memory cache) and gets back the list of Parquet files. One round trip, sometimes zero if the catalog is colocated, then the Parquet reads themselves. The per-commit metadata footprint is a few SQL rows instead of a metadata.json plus manifest list plus manifests. Multi-writer commits use the database's transactional isolation rather than the optimistic concurrency over a catalog API that Iceberg uses.
These are not the same change, and they aren't even adjacent changes, because they sit at different layers of the stack and address different operational pains, with very different implications for how you run the system.
Section 5
Both move in the same direction. They target different facets.
I want to be careful not to overstate the "different fight" point. There is a shared direction between V4 relative paths and DuckLake: both reduce the friction of operating Iceberg-style tables at scale. The friction has multiple facets, though, and each solution targets a different facet.
Here's a useful frame I borrow from Roman Kolesnev at Streambased, in his post titled "Your Iceberg table doesn't need to exist":
An Apache Iceberg table is just a contract: a schema, a set of snapshots, manifests pointing to data files. It's all just bytes in a particular layout. The query engine doesn't care how those bytes came into being; it just needs them to be there when it asks.
Kolesnev's point (credit where it's due, this is his framing, not mine) is that "Iceberg table" is a contract over bytes, not a particular implementation of those bytes. Once you accept that, you can see V4 relative paths and DuckLake as two different ways of editing the contract. V4 keeps the contract shape and changes one of its terms (path encoding). DuckLake changes the contract shape entirely (database transactions instead of files) and accepts a different set of byte layouts on the other side. The "Iceberg table is just a contract" framing is also what makes Tom Scott's related Streambased Kafka-to-Iceberg piece work: "Instead of physically moving data from Kafka into Iceberg, create a virtual view that spans both systems." If the table is a contract, the bytes can be wherever (in Iceberg files, in DuckLake's database, in Kafka) as long as the contract is satisfiable.
Two things are worth pulling out of that framing, because the rest of this piece, and the format war generally, turns on them. The first is that "contract" is really two contracts. One is a read contract. Give the engine a table identifier and it gets back a schema and a set of scannable bytes, and it does not care whether those bytes were written by a Spark job, inlined into a Postgres row, or synthesized on demand from a Kafka topic. Every backend on the spectrum preserves that read contract; preserving it is the whole point of the interface. The other is a write contract, or commit contract: how data actually enters the table. Classic Iceberg writes files and registers them through the catalog. DuckLake commits a SQL transaction. A Streambased-style virtual table never writes at all, because producing to Kafka is the write. The read side is where "any engine can read it" lives. The write side is where the streaming-ingest economics live, because committing every few seconds is where Iceberg's per-commit file footprint becomes a $/GB problem. When I say the format choice is workload-dependent, the workload axis that matters most is usually the write contract, not the read one.
The second thing is a question the framing raises without answering. Does the contract cohere? "Iceberg is just a contract over bytes" reassures only if "Iceberg-compatible" keeps meaning one interoperable thing as the backends multiply. So far the signal is convergence, not fragmentation. The Iceberg REST Catalog spec has become the de-facto interface engines code against; Apache Polaris graduated to an Apache top-level project on 2026-02-19; and as of spring 2026 Polaris, the open-source Unity Catalog, Snowflake's Open Catalog, AWS Glue, and BigQuery's managed interface all speak REST with no incompatible extensions I can find documented. The divergence that does exist is about format scope, Unity spanning both Delta and Iceberg, rather than REST-protocol interop. That's the optimistic read. The pessimistic one, which I'm still watching for, is backend-specific REST extensions that quietly make "Iceberg-compatible" stop guaranteeing interoperability. The contract holding as backends multiply is a bet, not yet a settled fact.
Applied to the facets of the "Iceberg metadata is awkward at scale" complaint:
- Portability and DR. Iceberg metadata files contain absolute paths, which makes moving a table across buckets, regions, or storage systems painful. V4 relative paths is the targeted fix. DuckLake doesn't engage with this facet directly. The catalog database has its own portability story (database dumps, replication) that is honestly less mature than Iceberg's file-based portability story was, even before V4 made it easier.
- Operational latency at high commit rates. Iceberg's per-commit file footprint becomes expensive when you're committing every few seconds (streaming pipelines, agent telemetry sinks, EDR ingestion). DuckLake's database-backed metadata is the targeted fix; relative paths doesn't change commit latency at all. V4 has separate proposals (single-file commits, Parquet for metadata) that do engage with this facet, but those proposals are not the relative-paths work.
- Multi-writer concurrency. Iceberg's optimistic-concurrency-over-catalog protocol works well up to a point and gets messy when you have many concurrent writers contending on the same table. DuckLake leans on the metadata database's transaction isolation, which is the mature fix for the general case. V4 doesn't directly address this; the catalog-managed-metadata work in V4 narrows the race-condition surface but doesn't move concurrency control into a database.
- Cross-engine portability. Iceberg's win has always been "any engine can read it." V4 keeps that promise: the file format is unchanged, and every engine that reads V3 can read V4 with a spec update. DuckLake is currently DuckDB-native with growing support; cross-engine read coverage is the open adoption question. This facet isn't really part of the "metadata is awkward" complaint, but it's the facet that ends up deciding which architecture wins where.
So yes, both V4 and DuckLake are pushing on the same general complaint. But they're pushing on different sub-facets of it. An operator who diagnoses their pain as portability and DR is going to be happy with V4 relative paths and not gain much from DuckLake. An operator who diagnoses their pain as commit latency and multi-writer contention is going to be happy with DuckLake and not gain much from V4 relative paths. An operator who diagnoses both pains may want both, and "both" is a coherent stack, because Iceberg V4 and DuckLake are not mutually exclusive.
Section 6
When V4 helps, when DuckLake helps, when neither helps.
When V4 relative paths is the right answer.
Multi-region or multi-cloud lakehouses where DR involves copying the metadata tree alongside the data. On-prem-to-cloud migrations where the source is HDFS and the destination is GCS or S3 or Azure Blob. Regulated environments where backup-and-restore drills are part of audit posture: DORA in financial services, SEC Rule 17a-4 for broker-dealers, HIPAA backup requirements in healthcare. The common thread is that the operator is going to physically move bytes between buckets, and they want the metadata to come along without requiring a metadata-rewrite job. V4 relative paths is the targeted, low-risk fix for these cases. It's also additive; adopting relative paths doesn't break any existing tooling.
When DuckLake is the right answer.
High-frequency commit workloads where Iceberg's per-commit file footprint is the latency floor. Multi-writer workloads where you have many concurrent writers and the catalog API is the contention point. Operationally constrained teams who don't want to run a separate catalog service and would rather use the Postgres or DuckDB they already operate. Single-tenant or small-team data lakes where the simplicity of "one database knows everything" is a real win against the complexity of "every engine reads through a shared catalog protocol."
I want to be honest about where I am still uncertain on DuckLake: cross-engine read coverage is the adoption question that decides whether DuckLake is "the right answer" for a multi-tool shop or "the right answer if you're committed to the DuckDB ecosystem." As of this writing, DuckDB-native reads work cleanly, Trino and DataFusion connectors exist, Spark and Snowflake support is partial, and Iceberg-side catalog implementations that read DuckLake metadata are early. The architecture is elegant; the ecosystem coverage is still catching up.
When you might want both.
A tiered architecture isn't crazy. Iceberg V4 with relative paths as the long-term retention and audit-tier table format, DuckLake as the high-frequency ingest and operational-tier metadata layer, with periodic compaction or hand-off between the two. This is more architecture than most teams should take on, but for security-data workloads (where you have streaming detection-tier latency requirements and multi-year audit-tier retention requirements) the dual-format pattern is at least worth thinking through.
Two cautions on that tiered pattern, both of which I've sharpened since I first wrote this. The first is that "one read contract across both tiers" is more aspiration than current reality. DuckLake reaches outside engines through DuckLake-native clients, not through an Iceberg REST endpoint, and its published roadmap through v2.0 doesn't add one. Today the bridge between a DuckLake operational tier and an Iceberg retention tier is a metadata copy, not a transparent shared catalog. The virtual end of the spectrum is even less settled: Streambased's ISK is the most interesting articulation of never-write Iceberg I've seen, but it's a seed-stage, vendor-published design with no named production deployment and no independent benchmark, so I file it as a thesis to watch, not a tier to build on yet. The second caution is the honest null. V4's commit-footprint work (single-file commits, Parquet for metadata) might make materialized Iceberg good-enough across all three tiers, in which case the polyglot substrate is complexity I talked myself into rather than complexity the workload demanded. I lean toward tiering being real for security data, where streaming detection and multi-year retention pull genuinely opposite directions, but the null deserves a fair test before anyone builds the three-backend version.
When neither helps the actual pain.
Both V4 and DuckLake assume a working object store underneath. If your pain is at the storage layer
(you're an on-prem operator who needs S3-compatible object storage and you've been relying on MinIO),
neither V4 nor DuckLake is the answer to that pain. MinIO's main repository was archived on 2026-02-12
(verified via GitHub; the repo's archived field is true). That doesn't mean MinIO stops
working; it means new feature development from the community is paused and the on-prem object-store
question is more open than it was a year ago. Operators are looking at SeaweedFS, Ceph with the RGW
S3 gateway, and, where cloud-eligible, Cloudflare R2 or AWS S3 directly. Picking the right format on
top of an unstable storage layer is the wrong order of operations, so diagnose the storage question first.
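For anyone who wants to repeat the archive check rather than take my word for it, here is a minimal sketch. It assumes the repository in question is `minio/minio` (my assumption about what "main repository" means) and uses the public GitHub REST endpoint for repository metadata, whose JSON response carries a boolean `archived` field. The helper names `is_archived` and `fetch_repo` are mine, not part of any library.

```python
import json
import urllib.request

# Public GitHub REST endpoint for repository metadata. The JSON response
# includes a boolean "archived" field.
GITHUB_REPO_API = "https://api.github.com/repos/{owner}/{repo}"


def is_archived(repo_payload: dict) -> bool:
    """Return True if a GitHub repository payload is marked archived."""
    return bool(repo_payload.get("archived", False))


def fetch_repo(owner: str, repo: str, timeout: float = 10.0) -> dict:
    """Fetch repository metadata from the GitHub REST API (unauthenticated).

    Unauthenticated calls are rate-limited, so this is for a spot check,
    not for polling.
    """
    url = GITHUB_REPO_API.format(owner=owner, repo=repo)
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `is_archived(fetch_repo("minio", "minio"))` performs the live check. Keeping the parsing step separate from the network call means the same function works on a saved API response.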
Likewise, if your pain is detection-tier latency (sub-second analyst queries against streaming telemetry), neither V4 nor DuckLake is the right primary architecture. That pain belongs to the streaming layer (Flink, RisingWave, pipeline-based detection) and to the query-engine layer (ClickHouse, Druid, StarRocks), not to the lakehouse table format. The format is the durable persistence and hunting/analysis tier, and a slow detection pipeline is a different problem from an awkward metadata layer.
Section 7
What in V4 actually is closer to a DuckLake fight.
If you want to track the Iceberg-versus-DuckLake competitive vector, relative paths isn't where to look. The places to watch are:
Single-file commits. V4 proposes consolidating per-commit metadata into a single file to reduce I/O overhead under high-write workloads. That directly addresses the same operational latency facet DuckLake's database-backed metadata is built around. If single-file commits land cleanly, the gap between Iceberg's per-commit footprint and DuckLake's per-commit footprint narrows, though it doesn't close, because DuckLake's commit is still a single database row insert and Iceberg's is still a file write plus a catalog round-trip.
Parquet for metadata. V4 proposes replacing the Avro encoding for manifests with Parquet. The win is columnar pushdown on the metadata itself, so engines only load the metadata fields they need per query, instead of reading whole manifest blobs. This is closer to the kind of metadata access pattern a SQL-database-backed catalog gets natively. It's not equivalent (there's still a file read instead of a SQL query), but it shrinks the per-query metadata cost.
Catalog-managed metadata mode. The V4 work on catalog-managed metadata.json optionality moves the authoritative version of the table state out of the metadata.json file and into the catalog service (Polaris, Unity, Nessie, Glue). That's structurally closer to DuckLake's "the database is the authority" model. Iceberg's catalog-managed mode still has files on disk; DuckLake's doesn't. But the trust boundary moves in the same direction.
Row lineage and concurrency. The V3 row-lineage work and the V4 concurrency proposals together touch the multi-writer ACID facet DuckLake's transactional metadata addresses head-on. Row lineage gives Iceberg a stable per-row ID and a per-row last-updated sequence number, which is something close to a per-row last-updated timestamp, without external metadata; the V4 concurrency work tightens the optimistic-concurrency protocol enough that high-contention multi-writer workloads have a better story.
None of these are the relative-paths work, which is a portability and DR fix. The competitive vector with DuckLake, to the extent there is one, runs through commit-footprint optimization, catalog-managed authority, and concurrency. Those are the V4 threads to track if you're watching the format war.
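To make the "file write plus a catalog round-trip" and optimistic-concurrency points concrete, here is a toy sketch of the general pattern: a writer reads the current table state, prepares a new state, and asks the catalog to swap it in only if nobody else committed first, retrying on conflict. `ToyCatalog` and `commit_append` are hypothetical names for illustration. This is the generic compare-and-swap shape, not Iceberg's or DuckLake's actual API, and it ignores the real validation a table format performs on retry.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    """Hypothetical in-memory catalog holding one table's state pointer."""

    version: int = 0
    snapshots: list = field(default_factory=list)
    _lock: threading.Lock = field(
        default_factory=threading.Lock, init=False, repr=False
    )

    def load(self):
        """Return the (version, snapshots) pair a writer bases its commit on."""
        with self._lock:
            return self.version, list(self.snapshots)

    def compare_and_swap(self, expected_version, new_snapshots):
        """Install the new state only if no other commit landed in between."""
        with self._lock:
            if self.version != expected_version:
                return False  # conflict: the writer's base state is stale
            self.version += 1
            self.snapshots = new_snapshots
            return True


def commit_append(catalog, data_file, max_retries=5):
    """Append one data file optimistically; return the number of attempts used."""
    for attempt in range(1, max_retries + 1):
        base_version, snapshots = catalog.load()   # read current state
        candidate = snapshots + [data_file]        # prepare new state off to the side
        if catalog.compare_and_swap(base_version, candidate):
            return attempt                         # swap succeeded
        # Otherwise another writer won the race: reload and try again.
    raise RuntimeError("commit failed after %d attempts" % max_retries)
```

Under low contention every commit succeeds on the first attempt. Under high contention the retry loop is where the cost shows up, which is the part the concurrency proposals and a database-backed transaction each try to make cheaper.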
Section 8
Why I think the matrix-shaped evaluation is the right shape.
One useful cross-validation for the way I score formats in the capability matrix: Dipankar Mazumdar (the same author whose announcement kicked off this piece) has been advocating for a six-dimension framework for evaluating open table formats: Read/Write Optimization, Ecosystem Compatibility, Table Maintenance Services, Table Evolution and Versioning, Platform Tools, and Use Case Alignment. He's writing as a Cloudera director and Iceberg contributor, so he has the lens of someone who actually has to recommend formats to enterprise customers.
I take his framework as independent validation that this space genuinely is multi-dimensional and that the "Iceberg vs DuckLake" question doesn't have a one-dimensional answer. A format that wins on Ecosystem Compatibility may lose on Read/Write Optimization. A format that wins on Table Maintenance Services may lose on Use Case Alignment for a specific workload. The Mazumdar six-dimension framework and the matrix-shaped evaluation arrive at the same shape from different starting points.
The reason this matters for the V4-vs-DuckLake read is that the two systems score very differently across his six dimensions. V4 wins on Ecosystem Compatibility and on Table Evolution and Versioning; DuckLake wins on Read/Write Optimization and on Use Case Alignment for high-frequency workloads; Table Maintenance Services and Platform Tools are too early to call on DuckLake. A scoring framework that respects the multi-dimensional reality doesn't reduce to "V4 wins" or "DuckLake wins." It reduces to "here are the workloads each is the right fit for."
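A small sketch of why the answer depends on the workload. The per-dimension leaders below are the ones stated in the paragraph above; the two weight profiles (`AUDIT_TIER` and `INGEST_TIER`) are made-up illustrative numbers, not measurements, and `tally` is a hypothetical helper. One simplification to flag: the Use Case Alignment call is specific to high-frequency workloads, so treating it as a fixed DuckLake lead in the audit profile is generous to DuckLake.

```python
DIMENSIONS = [
    "Read/Write Optimization",
    "Ecosystem Compatibility",
    "Table Maintenance Services",
    "Table Evolution and Versioning",
    "Platform Tools",
    "Use Case Alignment",
]

# Leader per dimension as stated in the essay; None means "too early to call".
LEADER = {
    "Read/Write Optimization": "DuckLake",
    "Ecosystem Compatibility": "Iceberg V4",
    "Table Maintenance Services": None,
    "Table Evolution and Versioning": "Iceberg V4",
    "Platform Tools": None,
    "Use Case Alignment": "DuckLake",  # for high-frequency workloads
}

# Hypothetical workload weights (illustrative only): how much each
# dimension matters to a given shop.
AUDIT_TIER = {
    "Read/Write Optimization": 1, "Ecosystem Compatibility": 3,
    "Table Maintenance Services": 1, "Table Evolution and Versioning": 3,
    "Platform Tools": 1, "Use Case Alignment": 1,
}
INGEST_TIER = {
    "Read/Write Optimization": 3, "Ecosystem Compatibility": 1,
    "Table Maintenance Services": 1, "Table Evolution and Versioning": 1,
    "Platform Tools": 1, "Use Case Alignment": 3,
}


def tally(weights):
    """Sum each format's weighted leads; undecided dimensions are kept separate."""
    totals = {"Iceberg V4": 0, "DuckLake": 0, "undecided": 0}
    for dim in DIMENSIONS:
        leader = LEADER[dim]
        totals[leader if leader is not None else "undecided"] += weights.get(dim, 0)
    return totals


print(tally(AUDIT_TIER))   # {'Iceberg V4': 6, 'DuckLake': 2, 'undecided': 2}
print(tally(INGEST_TIER))  # {'Iceberg V4': 2, 'DuckLake': 6, 'undecided': 2}
```

Same leaders, different weights, opposite outcome. The undecided bucket is kept visible on purpose, because two of the six dimensions can't be scored for DuckLake yet.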
Closing
What's still open. What would change the read.
I want to close honestly. There's plenty I don't know yet, and there are specific signals I'm tracking that would move my read on this.
V4 milestone progress. Relative paths is finalized as of the May 2026 LinkedIn announcement, but the broader V4 milestone (Iceberg GitHub issue #58) tracks a longer list of proposals: single-file commits, Parquet for metadata, catalog-managed metadata mode, enhanced column statistics, efficient column updates. The cadence at which these merge through 2026 is the leading indicator of whether V4 narrows the gap with DuckLake on the operational-latency facet or doesn't. I check the milestone monthly.
DuckLake adoption beyond DuckDB-native users. The question that decides DuckLake's market shape isn't whether it works (it works) but whether Trino, Spark, Snowflake, and the rest of the cross-engine ecosystem land first-class read support. Sean Knapp at Ascend is one signal in this direction. The Apache Spark and Snowflake DuckLake connector status through the rest of 2026 is the signal I'm tracking.
Cloudflare R2 Data Catalog and managed-Iceberg-catalog semantics. R2's Data Catalog is the most interesting recent move in the managed-Iceberg-catalog space. It's an R2-native catalog with its own metadata semantics, and for on-prem operators eyeing a managed, egress-free object store, R2 is structurally appealing. Whether R2 Data Catalog adopts V4 relative paths quickly, and whether it engages with the catalog-managed-metadata mode, tells me something about how the managed-catalog vendors are reading the V4 work.
On-prem object store stabilization. MinIO's repository archive in February 2026 reopens a question that had been closed: what does an on-prem operator stand up under an Iceberg or DuckLake table today? SeaweedFS and Ceph RGW are the two answers I see practitioners gravitating toward. If either consolidates as the default, the on-prem path for both V4 and DuckLake becomes cleaner. If neither does, the gravitational pull toward S3 or R2 strengthens for everyone except the most regulated shops.
Whether the REST catalog contract stays coherent. The reassuring read on all of this is that "Iceberg" is converging on the REST Catalog spec as a shared interface, with Apache Polaris now an Apache top-level project and the major catalogs all speaking REST. The signal I'm watching is whether the REST spec, or V4, formally blesses non-file metadata backends, which would make the database-backed and virtual realizations first-class instead of copy-bridged, or whether backend-specific extensions start eroding what "Iceberg-compatible" guarantees. The first path makes the tiered architecture in section 5 buildable on a single real contract; the second collapses it back into three formats sharing a name.
Vortex as the next file-format facet. The format conversation isn't only the
Iceberg-vs-DuckLake metadata fight; there's a live question one layer down, at the file format itself.
Vortex, the columnar format now under the Linux Foundation (installable as vortex-data, renamed
from the yanked vortex-array, which is why it looked abandoned for a while), claims large
random-access and scan speedups over Parquet. I ran it against zstd-Parquet on an OCSF corpus rather than
take the launch numbers, and the measured story is more modest than the pitch: Vortex read faster, but by
single-digit multiples (roughly 1.7–2.6× on a full decode, 3.3–4× on a low-selectivity needle), not the headline
10–100×, and its on-disk footprint was scale-dependent, a touch smaller than Parquet at 100K rows and about
26% larger at a million, with identical answers across both formats. The catch that keeps it out of this
essay's main argument is that Vortex isn't an Iceberg data file format yet: Iceberg 1.11.0 shipped the
pluggable File Format API but the Vortex plugin is still an open issue, so for now it's a standalone
datapoint, not a swap-in for the file layer under an Iceberg table. If that plugin lands, the read-speed
facet of the complaint gets a third answer alongside V4 and DuckLake, and I'm watching the File Format API
for it.
What would change my read. If DuckLake ships first-class Spark, Trino, and Snowflake connector support in 2026 and that ecosystem coverage rivals Iceberg's, the "DuckLake is currently DuckDB-native" caveat I leaned on in section 5 falls away and DuckLake becomes a serious candidate for the operational-tier metadata role at multi-tool shops. If V4 single-file commits and Parquet for metadata land cleanly and benchmarks show commit-latency parity with database-backed metadata, the operational-latency facet of the original complaint largely closes inside Iceberg's existing architecture and DuckLake's pitch shrinks. Both are plausible. Either would change where I land.
What I'm pretty confident about: the framing "Iceberg V4 is going after DuckLake" remains a misread of what V4 relative paths actually is. The two efforts target different facets of a shared complaint, and the architectural choice between them belongs to the operator's diagnosis of their own pain. That's a more useful read than a category war.
Two formats. Different facets. Diagnose before you choose.
The capability matrix scores Iceberg V3, Iceberg V4, Delta Lake, Hudi, and DuckLake against the workloads they're actually run on, not against each other in the abstract. If you're trying to diagnose which facet of your pain is the dominant one, that's the framework I use to talk it through.