
The encoder is the read lever, not the table format.

I went into this expecting to find a read-speed difference between Iceberg and DuckLake, because that's the comparison everyone reaches for when they pick a table format, and I'd seen Iceberg come back slower in my own earlier runs. What I found instead is that the table format was never the thing I was measuring. The lever on read speed is the Parquet writer that produced the files, and once I took the writer out of the comparison by registering byte-identical files into both catalogs, the two formats read the same on all but the heaviest aggregation.

What I thought I was measuring

Is Iceberg slower to read than DuckLake?

The two table formats sit at different points on the spectrum from heavyweight to lightweight, and the way you usually decide between them is by reputation: Iceberg is the heavier specification with the broad catalog ecosystem and the manifest-based metadata layer, while DuckLake keeps its table metadata in a SQL database and aims to be the lighter, lower-ceremony option. When someone asked me which one reads security telemetry faster, the honest starting position was that I didn't know, so I set up a scale ladder to find out, generating synthetic network-connection events and querying them through DuckDB against an Iceberg table and against a DuckLake table at one million, ten million, a hundred million, and a billion rows.

On the first pass Iceberg looked slower, and not by a trivial amount. Across the query set it came back somewhere between about 1.1 times and 1.55 times slower than DuckLake on the same logical data, which is the kind of gap that would feel decisive if you stopped there and wrote it up. If I'd published that table it would have read as a clean format-versus-format result, with DuckLake winning the read path, and a practitioner deciding between the two on read latency would have taken Iceberg off the shortlist on the strength of it. The number was real in the sense that the clock genuinely measured it. It just wasn't measuring what the headline would have said it was measuring.

What stopped me from writing it up that way is that the two setups weren't reading the same files. Each format had written its own Parquet through its own path, so when I compared read latencies I was comparing two different sets of bytes that happened to encode the same rows, and any difference between them could just as easily live in the bytes as in the format's read machinery. Before I could say anything about the formats I had to make the files identical, which turned out to be the whole story.

Where the gap actually lived

Iceberg defaulted to ZSTD, DuckLake to Snappy.

The first confound was the easy one to name once I looked. The two paths weren't even compressing the data the same way, because pyiceberg's write path defaulted the files to ZSTD while the DuckLake path wrote Snappy, and ZSTD trades decompression work for smaller files in a way that shows up directly on a scan-heavy read. So a chunk of the apparent Iceberg slowness was a codec default sitting underneath the format, not a property of the format itself, and the moment you say "I measured Iceberg against DuckLake" while one is on ZSTD and the other is on Snappy, you've already mislabeled the result. The format names were on the axis, but the codec was doing the work.

The fix that seems obvious is to set both writers to the same codec and call it controlled, and that's where the second and more stubborn confound surfaced. Matching the codec does not match the bytes, because the two Parquet writers disagree about everything below the codec. On the identical input data, with the codec held equal, PyArrow's writer produced a file of about 193 MB where DuckDB's writer produced about 114 MB for the same rows, which is a roughly 1.7-times difference in size that has nothing to do with compression algorithm and everything to do with encoding decisions. The largest single contributor I could see is dictionary encoding on the high-cardinality columns, where PyArrow makes different choices about when to dictionary-encode and when to fall back, and pyiceberg gives you no per-column control to override it, so you can't simply tell it to encode the way DuckDB does.

That second confound is the one I'd want a benchmark reader to internalize, because "same codec" feels like it should mean "same bytes" and it doesn't. Two correct Parquet writers, handed the identical rows and the identical compression codec, will still emit substantially different files, and a file that's 1.7 times larger has more to read off disk, more to decompress, and a different layout for the scanner to walk. If that difference is allowed to ride along inside a comparison labeled Iceberg-versus-DuckLake, then the comparison is measuring the writers and reporting on the formats, which is exactly the kind of mislabeled result that makes published read benchmarks untrustworthy.

The only honest comparison

Write the files once, register the same bytes into both.

If the writer is the confound, then the way to measure the formats is to remove the writer from the comparison entirely, and both of these catalogs let you do exactly that because both can adopt existing Parquet files rather than insisting on writing their own. So I wrote one canonical set of Parquet files a single time, then registered those same files into an Iceberg catalog with pyiceberg's add_files and into a DuckLake catalog with ducklake_add_data_files, and read both through DuckDB. The point of the design is that there is now exactly one set of bytes on disk and the two catalogs are pointing at it, so anything that differs between the two reads has to come from the format's metadata and read path rather than from the data, because the data is literally the same data.
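The registration step can be sketched roughly as below. This is a hedged outline rather than the lab's actual script: it assumes you already hold a pyiceberg Table loaded from a catalog and a DuckDB connection with the DuckLake catalog attached, and the function names, catalog name, table name, and paths are hypothetical. The two calls it relies on are pyiceberg's Table.add_files and DuckLake's ducklake_add_data_files; the sketch assumes the Parquet schema already matches the target tables.

```python
def _sql_literal(value):
    """Quote a string as a SQL literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def register_in_iceberg(iceberg_table, parquet_paths):
    """Adopt existing Parquet files into an Iceberg table without rewriting them.

    `iceberg_table` is assumed to be a pyiceberg Table loaded from a catalog.
    """
    iceberg_table.add_files(file_paths=list(parquet_paths))


def register_in_ducklake(con, catalog_name, table_name, parquet_paths):
    """Adopt the same files into a DuckLake table, one CALL per file.

    `con` is assumed to be a DuckDB connection with the DuckLake catalog attached.
    """
    for path in parquet_paths:
        con.execute(
            "CALL ducklake_add_data_files("
            f"{_sql_literal(catalog_name)}, "
            f"{_sql_literal(table_name)}, "
            f"{_sql_literal(path)})"
        )


def register_same_bytes(iceberg_table, con, catalog_name, table_name, parquet_paths):
    """Point both catalogs at one list of files so only metadata differs."""
    paths = list(parquet_paths)
    register_in_iceberg(iceberg_table, paths)
    register_in_ducklake(con, catalog_name, table_name, paths)
    return paths


# Hypothetical wiring, not executed here (all names are assumptions):
#   iceberg_table = catalog.load_table("lab.events")
#   con = duckdb.connect()
#   con.execute("ATTACH 'ducklake:metadata.ducklake' AS lake")
#   register_same_bytes(iceberg_table, con, "lake", "events", ["/data/part-0.parquet"])
```

Passing one list of paths to both sides is the whole design: neither catalog gets a chance to write, so neither writer's defaults can leak into the comparison.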

At a billion rows, with the writer held out of the comparison this way, three of the four queries came back at parity. The filtered lookup landed at 1.00 times, the byte rollup at 1.01 times, and the subnet rollup at 1.01 times, all with a coefficient of variation at or under about 2.5 percent, which is well inside the run-to-run noise at that scale and is as close to "no difference" as a timing measurement can get. When the bytes are identical, the two table formats put the same data in front of DuckDB at the same speed, and the 1.1-to-1.55-times gap from the first pass simply evaporated, because that gap was the codec default and the writer divergence the whole time and never the format.
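For readers who want to apply the same reading to their own timings, here is a minimal sketch of the arithmetic using only the standard library. The timing samples are invented to exercise the functions and are not the lab's measurements. The parity rule (a gap counts as parity when it is no larger than the noisier side's coefficient of variation) is my assumption for illustration, not a rule stated in the methodology.

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean, as a fraction."""
    return statistics.stdev(samples) / statistics.mean(samples)


def slowdown_ratio(candidate, baseline):
    """Ratio of median latencies; 1.0 means the two sides took the same time."""
    return statistics.median(candidate) / statistics.median(baseline)


def at_parity(candidate, baseline):
    """Assumed rule: the gap is parity if it is within the noisier side's CV."""
    noise = max(
        coefficient_of_variation(candidate),
        coefficient_of_variation(baseline),
    )
    return abs(slowdown_ratio(candidate, baseline) - 1.0) <= noise


# Made-up repeated-run timings in seconds for one query on each catalog.
iceberg_runs = [2.02, 2.00, 2.05, 1.98, 2.01]
ducklake_runs = [2.00, 1.99, 2.03, 1.97, 2.01]

ratio = slowdown_ratio(iceberg_runs, ducklake_runs)
print(f"ratio {ratio:.3f}, parity: {at_parity(iceberg_runs, ducklake_runs)}")
```

Comparing medians rather than single runs, and reporting the spread next to the ratio, is what lets a figure like 1.01 times be read as noise rather than as a small real difference.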

I'll be straight about the one query that didn't come back at parity, because reporting only the three clean ones would be the dishonest version of this. A heavy GROUP BY with about 16.7 million distinct groups diverged by roughly 1.3 times between the two on the identical bytes, so the two formats' read paths through DuckDB still differ on the hardest aggregation, the kind that builds an enormous hash table and is sensitive to how the scan feeds it. That's a real residual difference and I'm not going to wave it away, but it sits on the heaviest scan in the set rather than across the board, and the shape of the result is that on identical bytes the formats read neutrally for the ordinary queries and diverge only at the extreme, which is a much narrower claim than "Iceberg is slower" and a much more accurate one.

Why the writer is the real variable

Read speed is set on the write path.

The reason this generalizes past my particular setup is that a columnar read is mostly a function of what's sitting on disk, and what's sitting on disk is decided when the file is written. The codec, the row-group size, the page size, whether a column is dictionary-encoded, how the statistics that drive predicate pushdown are laid out, all of that is fixed at write time and then read back over and over for the life of the file. The table format, by contrast, is the layer that tracks which files belong to the table and what their snapshots are, and once a query has resolved which files it needs, the format steps out of the way and the engine reads Parquet. So when you change the writer you change the thing the reader actually touches, and when you change only the format while keeping the same bytes, you've changed the bookkeeping and left the read alone, which is why the parity result is what I'd now expect rather than a surprise.

That reframes what you're choosing when you choose a table format, because you're not choosing read speed, and a benchmark that says you are has almost certainly let the two writers produce different bytes. The decision that actually moves read latency happens earlier, on the write path, in the encoder you use and the codec and row-group settings you give it, and that decision is largely independent of whether the resulting files end up catalogued by Iceberg or by DuckLake. If you care about read speed, the place to spend your attention is the writer and its configuration, and the place to be skeptical is any chart that attributes a read difference to the format while quietly using a different writer for each side.

None of which makes the format choice unimportant; it just moves the decision onto the axes where the formats genuinely differ. What you're really picking between Iceberg and DuckLake is the catalog and metadata model: how each handles the small-file problem and compaction, what its commit and concurrency story looks like, how its metadata scales as the table accumulates snapshots, and what tooling and engines can read it in your environment. Those are real and consequential differences, and they're where the decision belongs, so the practical move is to pick the format on the write path and the metadata properties you need and stop treating a read benchmark as the tiebreaker, because on identical bytes the read benchmark has very little to say.

What this is and isn't evidence of

One machine, one engine, ratios over absolutes.

I want to bound this carefully, because it's the kind of result that's easy to over-read. The run is a single machine with one reader engine, DuckDB, against synthetic network-connection data, so the absolute times are a property of my hardware and mean nothing transplanted to yours, and I'd treat the ratios as the transferable part rather than the milliseconds. The right way to read "three of four queries at parity, the fourth at about 1.3 times" is as a claim about relative behavior on identical bytes, not as a benchmark you could quote a latency figure from, and I'd be the first to push back on anyone lifting the numbers out of that frame.

There are also honest open questions I haven't closed. I read with DuckDB, and a different reader, Trino or DataFusion or ClickHouse, might lean on the format's metadata differently enough to break the parity I saw, so the neutrality result is specific to this reader until I run the others. The heavy high-cardinality GROUP BY that diverged deserves its own look, because I don't yet know whether that 1.3-times gap is a stable property of how each format feeds a large aggregation or an artifact of one run's scheduling, and I'd rather say that plainly than fold it into the clean story. What I'm confident about is the narrower thing the design actually establishes, which is that the Iceberg-versus-DuckLake read gap I started with was a writer and codec artifact, and that on byte-identical files the formats read neutrally for ordinary queries on this setup.

The reason I trust that narrower claim is the design rather than the size of the effect. By writing the files once and registering the same bytes into both catalogs I removed the one variable that was big enough to swamp everything else, and a parity result that survives that removal is more informative than a large difference that didn't control for it. The earlier 1.1-to-1.55-times gap was the louder number, but the quieter parity number is the one that's actually about the formats, and a quieter result you can defend beats a louder one you can't.

How not to get fooled by a read chart

Never trust a read benchmark that let the writers differ.

The practical takeaway I'd hand to anyone choosing a table format is that the read benchmark you've been shown is probably answering a different question than the one on its label. If the two sides were written by different encoders, or even by the same encoder at different codec defaults, then the chart is a comparison of writers wearing the formats' names, and the only read benchmark worth believing is one that registers byte-identical files into both catalogs the way this run did. So the first question to ask of any format-versus-format read result is whether the bytes were the same, and if the answer is no, or if the methodology doesn't say, the number tells you about the writers and not the formats and you should treat it that way.
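Asking "were the bytes the same?" can be turned into a mechanical check. Here is a minimal standard-library sketch that compares content digests of the data files behind each side of a comparison. How you obtain each side's file list (from the catalogs' metadata, for instance) is left out, and the helper names and demo files are hypothetical.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks to keep memory flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(paths_a, paths_b):
    """True when both sides reference files with identical content digests."""
    digests_a = sorted(file_digest(p) for p in paths_a)
    digests_b = sorted(file_digest(p) for p in paths_b)
    return digests_a == digests_b


# Demo with throwaway files standing in for Parquet data files.
workdir = tempfile.mkdtemp()
shared = os.path.join(workdir, "shared.parquet")
rewritten = os.path.join(workdir, "rewritten.parquet")
with open(shared, "wb") as handle:
    handle.write(b"same rows, writer A")
with open(rewritten, "wb") as handle:
    handle.write(b"same rows, writer B")

print(same_bytes([shared], [shared]))     # both catalogs point at one file
print(same_bytes([shared], [rewritten]))  # each side wrote its own file
```

If the check fails, the read numbers can still be reported, but they belong under a writer-versus-writer label.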

This sits next to a companion finding I pulled out of the same work, that two Parquet writers handed the identical data and the identical codec still disagree on file size by that 1.7-times margin, which I wrote up separately in "Same codec, different sizes" because it deserves its own treatment. The two essays are the read side and the write side of one observation, which is that the encoder is where the read cost is decided, and the table format is the catalog around it. If you want the general version of how I keep a comparison like this from lying to me, the method is in "How to run a benchmark that doesn't lie", and the short version is that you isolate the one variable you mean to study and you make everything else identical, which here meant making the bytes identical before you let the clock run.

For security data specifically the stakes are ordinary engineering ones rather than anything exotic, but they're real, because the format decision tends to get made early and lived with for years across a lot of telemetry, and making it on a read benchmark that was secretly measuring the encoder is how you end up ruling out a perfectly good option for a reason that was never true. Pick the format on the write path and the catalog and metadata behavior you actually need, tune read speed where it's actually set, in the writer, and keep the two questions apart so neither one gets answered with the other one's evidence.

Evidence: Tier B (first-party, single machine; ratios transfer, absolute times don't). SDW Lab Iceberg-vs-DuckLake read comparison: one canonical Parquet set written once and registered into an Iceberg catalog via pyiceberg add_files and a DuckLake catalog via ducklake_add_data_files, read through DuckDB. At one billion rows three of four queries at parity (filtered 1.00×, byte_rollup 1.01×, subnet_rollup 1.01×, CV ≤ 2.5%); one 16.7M-distinct GROUP BY diverged ≈1.3×. The default-config gap (≈1.1–1.55× apparent Iceberg slowness) traced to Iceberg defaulting to ZSTD vs DuckLake's Snappy; at a matched codec PyArrow wrote ≈193 MB where DuckDB wrote ≈114 MB on identical data, with pyiceberg exposing no per-column encoding control. Methodology and the runnable comparison are published in the SDW Lab ocsf-read-scan benchmark; generality across other reader engines is an open follow-up.

Pick the format on the write path, not the read chart.

Over an open table format, read speed is set by the Parquet encoder, not by Iceberg-versus-DuckLake. When the bytes are identical, the formats read the same. The decision that's actually yours to make is the catalog and metadata model, and the read benchmark worth believing is the one that made the bytes identical first.
