Writing · Lakehouse internals

Same codec, different file sizes.

I set out to compare two table formats on storage efficiency and ran into a more basic problem first, which is that two Parquet writers told to use the same codec at the same level produced different file sizes on byte-for-byte identical data. PyArrow wrote 193 MB where DuckDB wrote 114 MB, same ZSTD, same level, same rows, and once I understood why I stopped trusting the phrase "compressed with ZSTD" to mean anything specific about how big a file would be.

What I was actually testing

Which format stores security data smaller?

The question seemed simple enough to settle in an afternoon. I had a corpus of synthetic network-connection events and I wanted to know whether Iceberg or DuckLake stored it more efficiently, because storage is a real cost line when you're keeping security telemetry for a year or more and the difference between two formats compounds across every partition. The plan was the obvious one: write the same data into an Iceberg table and into a DuckLake table, measure the bytes on disk, and report the ratio. Both formats sit on Parquet underneath, both support the usual codecs, so I expected the comparison to come down to whatever each format does with metadata and manifests, which is small.

The early numbers told a clean story, and it was the story everyone tells. The Iceberg table came out meaningfully smaller than the DuckLake one, which lined up with the conventional read that Iceberg is the more storage-efficient format, the one the mature ecosystem has tuned hardest. I almost wrote that down. The reason I didn't is that I'd written the Iceberg table through pyiceberg, which defaults its Parquet files to ZSTD, and I'd written the DuckLake table through DuckDB, which was defaulting to Snappy, so I was comparing a ZSTD file against a Snappy file and calling the difference a property of the format. That's not a format comparison, it's a codec comparison wearing a format comparison's clothes, and the moment I noticed it the whole result became suspect.

So I did the thing that felt like it should fix it: I matched the codec. I told both writers to use ZSTD at the same level and re-ran, expecting the format difference to shrink to the small metadata-and-manifest residue I'd assumed it would be all along. It did not shrink the way I expected. It moved, and it moved in a direction that told me my mental model of what "matching the codec" buys you was wrong.

Where the model broke

193 MB and 114 MB, same codec.

With both writers set to the same ZSTD at the same level on the identical data, PyArrow produced a file of 193 MB and DuckDB produced one of 114 MB. That's not a rounding difference or a metadata footnote; the PyArrow output is roughly 70% larger for the same rows under the same nominal compression setting, which is the kind of gap people normally attribute to a codec change, except there was no codec change. The codec was held constant on purpose, and the files still came out almost a factor apart. The first time you see that you assume you've misconfigured something, so I checked the levels, checked the row counts, checked that the two tables held the same data, and they did. The setting that was supposed to determine the size was identical, and the size was not.

What's happening is that the codec is the last and smallest step. A Parquet writer makes a string of decisions before the compressor ever sees a byte, and those decisions determine how much there is to compress in the first place. The big one here is dictionary encoding: PyArrow was dictionary-encoding columns that DuckDB chose to leave as plain values, and on a high-cardinality column (one with many distinct values, like an ephemeral source port or a connection identifier) a dictionary is close to a pessimization, because you pay for the dictionary itself and get little dedup back since few values repeat. DuckDB's writer looks at the cardinality, decides the dictionary won't earn its keep, and skips it, so it hands the compressor a smaller, more regular stream. Run-length encoding, the page layout, how values are chunked into pages and column chunks, all of it is set before ZSTD runs, and all of it is the writer's call.

The piece that turned this from an annoyance into a finding is that pyiceberg gave me no per-column lever to fix it. I went looking for the setting that would tell PyArrow not to dictionary-encode the high-cardinality columns, the way I could equalize the codec, and through the pyiceberg write path it wasn't exposed in a way that let me match DuckDB's choices column by column. So I couldn't make the two writers agree by configuration. The encoding strategy is baked into each writer, and matching the one knob that's easy to match (the codec) leaves the knobs that actually move the number untouched.

The result that flipped

"Iceberg is more storage-efficient" was a default.

Once I understood the writer was doing the work, the original result inverted. The reason the Iceberg table had looked smaller was that pyiceberg defaulted to ZSTD while DuckLake defaulted to Snappy, and ZSTD compresses harder than Snappy, so the Iceberg files were smaller for a reason that had nothing to do with Iceberg. When I held the codec constant and let each format use its own writer, the DuckLake-written files came out the smaller of the two: 1.14 GB against 1.93 GB for the same dataset at the larger scale I re-ran to confirm it. The format people call more storage-efficient was, on a matched codec, writing the larger files, because its default writer made encoding choices that left more bytes on disk.

I want to be careful with how far I push that, because the honest reading is not "DuckLake is the more efficient format" any more than the original reading of "Iceberg is" was honest. Both statements attribute to the format something the writer decided. The Iceberg-versus-DuckLake size gap I started with was a comparison of two defaults, ZSTD against Snappy, and the inverted gap I ended with was a comparison of two writers' encoding strategies, PyArrow's dictionary-everything against DuckDB's cardinality-aware choice. Neither gap is a fact about the table format's specification. Iceberg and DuckLake both store Parquet, and Parquet of a given logical content can be a wide range of sizes depending on who wrote it and how, so the format name on the table tells you almost nothing about the bytes underneath.

That left me unable to answer my original question the way I'd framed it, which is itself the answer. I could not make a fair Iceberg-versus-DuckLake storage comparison by matching the codec, because the codec is not where the size difference lived. The only way to take the writer out of the comparison is to stop letting each format write its own files and instead register the same physical bytes into both catalogs, so that the Parquet on disk is identical and the only thing varying is the format's own metadata layer. That's a different experiment from the one I set out to run, and it's the only one that isolates the format from the writer.

What the codec name hides

Compression is a strategy, not a knob.

The mental model worth replacing is the one where "compression" is a single setting you turn up or down, like a quality slider, with the codec name standing in for the whole thing. In Parquet the codec is one stage of a pipeline, and it operates on a byte stream that the writer has already shaped through encoding. Dictionary encoding replaces values with small integer references into a table of distinct values, which is a large win on a low-cardinality column like a protocol name or an event type and close to dead weight on a high-cardinality one. Run-length encoding collapses repeated runs, which depends entirely on whether the writer sorted or clustered the data so that runs exist. Page and column-chunk sizing changes how much context the compressor has to work with at once. Every one of those is a choice, the writer makes all of them, and only after they're made does ZSTD or Snappy get its turn on whatever's left.

So two writers that both honestly report "ZSTD level N" can hand that compressor very different inputs, and a smaller, more regular input compresses to fewer bytes regardless of the codec setting. DuckDB's writer, on this data, made choices that produced less to compress, which is why its file was smaller at the same nominal level. None of this means PyArrow's writer is wrong; dictionary-encoding aggressively is a reasonable default that pays off on many real schemas, and PyArrow exposes the controls to tune it if you write through its own API rather than through pyiceberg's wrapper. The point is narrower, which is that the codec setting is a poor predictor of file size, and the writer's encoding strategy is a strong one, so a size number means little until you know which writer produced it.

This connects to a thread I've followed on the read side, where the same writer choices that move the bytes on disk also move query latency, because an engine reads what the writer laid down and an encoding that's cheap to scan is faster to filter. I've written about that separately in the encoder is the read lever, and the two findings are the same observation from opposite ends: the writer is doing most of the work that benchmarks credit to the format or the engine, on both storage and speed.

What this is and isn't evidence of

One dataset, two writers, one machine.

I'd rather state the limits than have a careful reader find them. This is one dataset of synthetic network-connection events, two specific writers (PyArrow through pyiceberg, and DuckDB), on a single machine, and the headline gap depends on the cardinality profile of these particular columns, because the whole mechanism is about whether dictionary encoding helps or hurts on the data you actually have. On a corpus that's mostly low-cardinality the two writers would land much closer together, since dictionary encoding would earn its keep for both. So I'm not claiming PyArrow always writes larger files, or that DuckDB is the better Parquet writer in general; I'm claiming that on this data, at a matched codec, they differed by enough to dominate the format comparison I was trying to make.

The durable part, the part I'd defend past this one dataset, is the methodological one: the codec name is not the compression, matching it does not make a format comparison fair, and a file-size number is not interpretable until you know which writer produced it and how it was configured. That holds regardless of the specific megabytes, because it follows from how Parquet writers work rather than from my particular numbers. The numbers are the demonstration; the writer-is-the-variable lesson is the finding.

What this also means, and I'll say it plainly because it cost me a clean result, is that the storage comparison I wanted is harder than it looks and most published versions of it are probably comparing writers without saying so. If a benchmark writes each format with its own default writer and reports the size difference as a format property, it's measuring the defaults, and the defaults can flip the ranking, as mine did when I moved from the ZSTD-versus-Snappy framing to the matched-codec one.

How to read a size claim

Ask which writer produced the file.

When you read that format A is some percentage smaller or faster than format B, the question that collapses most of these claims is which writer produced each file and at what settings, because the writer is doing most of the work the benchmark attributes to the format. A ZSTD-versus-Snappy difference reported as an Iceberg-versus-DuckLake difference is the cleanest example, but the subtler one is two files at the same codec and the same level that still differ because the writers encode differently, and that one doesn't even leave a settings difference for you to notice. If the methodology doesn't name the writer and its encoding configuration, the size number is describing a default, and a default is a choice somebody made for you rather than a fact about the format.

For a security-data estate this matters in dollars, because storage is the recurring cost of keeping telemetry around long enough to investigate it, and a 70% difference in on-disk size from the writer alone is the difference between a retention budget that holds a year and one that holds seven months. The good news is that the lever is real and yours to pull: if your writer is bloating high-cardinality columns with dictionaries that don't earn their keep, the fix is in the writer's configuration, not in switching table formats, and you'll get a better return tuning the encoding than migrating Iceberg to DuckLake or back. The format choice is a real decision for other reasons, but storage efficiency on a matched writer is mostly not one of them.

The broader habit this run reinforced is the one I keep coming back to, which is to isolate the variable you think you're measuring before you trust the number. I'd matched the codec and assumed that made the comparison fair, and it didn't, because the real variable was a level up from where I was looking. The discipline that catches this is the same discipline that catches a benchmark timing the wrong thing, and I've written down the version of it I use in how to run a benchmark that doesn't lie, because the storage case and the speed case fail in the same way: a variable you didn't control quietly authoring the result you publish.

Evidence: Tier B (first-party, reproduced; single machine). The size divergence was found in the SDW Lab ocsf-read-scan format-comparison work: at a matched ZSTD codec and level on identical synthetic network-connection data, PyArrow-via-pyiceberg wrote 193 MB where DuckDB wrote 114 MB, with the gap traced to PyArrow dictionary-encoding high-cardinality columns that DuckDB left as plain values and pyiceberg exposing no per-column control to equalize it; at the larger scale the matched-codec DuckLake-written files were 1.14 GB against 1.93 GB for Iceberg, inverting the ZSTD-default-versus-Snappy-default size gap that produced the "Iceberg is more storage-efficient" reading. The only writer-neutral comparison registers identical physical bytes into both catalogs. Generality across other data shapes and writers is an open follow-up.

The codec name is not the compression.

Two writers, same codec, same level, same data, files almost a factor apart. Before you believe a storage or speed number that pits one format against another, find out which writer produced each file and how it was configured, because that's where the difference usually lives.

Read the lab page → The encoder is the read lever How to run a benchmark that doesn't lie