
How to run a benchmark that doesn't lie.

A benchmark is a measurement, and a measurement you can't trust is worse than no measurement at all, because it carries the authority of a number while pointing in the wrong direction. I learned the rules below the way you'd expect, which is by watching my own lab nearly publish a result that was confidently, precisely false, and the rules are not clever. They're the boring discipline that separates a number you can stand behind from a number that just happens to be the one that came out, and the first of them comes before any clock is started at all.

The rule that comes first

Verify the answer before you trust the clock.

I want to open with the case that taught me to put this rule above all the others, because it's the one that nearly got past me. I was running a cross-engine scale ladder: the same simple analytical queries executed by ClickHouse's embedded engine, chDB 4.1.8, and by DuckDB, over byte-identical Parquet, the whole exercise built to ask which engine was faster on security-shaped data. Almost as an afterthought I'd put an answer-equality check at the front of the harness, on the principle that there's no point reporting that one engine is 8% faster than another if the two aren't even computing the same thing. Before timing anything, the harness runs each query on each engine once and compares the results, and I expected that check to be a formality. At 100 million rows it failed: a selective count(*) with an equality filter came back from chDB tens of rows short of the count DuckDB computed over the same files, with no exception, no warning, and no log line flagging a skipped row group. The query compiled, the engine ran it fast, and it returned a number that was simply too small.

The part worth sitting with is the counterfactual. If I'd trusted the timings, and fast is exactly what a performance benchmark is built to reward, the run would have produced a clean, plausible, publishable table showing chDB holding its own against DuckDB at scale, and the missing rows would have ridden along inside a result that looked completely normal. The faster the wrong engine, the more convincing the bad number, which inverts the usual intuition that a quick answer is a trustworthy one. A timing-only benchmark would have laundered a silent wrong answer into a performance win and published it as one. The only reason it didn't is that the gate I almost didn't bother building compared the two answers before it compared the two clocks. The full case, what the bug was and how I isolated it to one engine's equality-pushdown path on the tail row groups of the file, is its own essay (the query engine returned the wrong answer); what matters for methodology is the order of operations. Correctness is the gate, and timing runs only after the answers agree. Everything else in this piece is downstream of getting that order right.

So rule zero is to verify the answer before you trust the clock. On generated data that is nearly free, because you know the ground truth by construction; on real data it means computing a trusted reference answer once and pinning it. Run the same query on every engine and on the reference, and fail loudly the moment any two disagree, before a single timing is recorded, so that a divergence stops the run rather than getting averaged into a result. One refinement the lab handed me later is that "disagree" needs a type-aware definition: the same column of floating-point values summed by five engines lands on three subtly different totals from ordinary IEEE-754 rounding, all of them correct, while the integer counts and sums agree to the last bit. So the gate compares integer-typed answers exactly and floating-point ones within a tolerance, rather than crying wolf over benign rounding. The rest of these rules make the speed number honest once you've established that the answers it describes are real.
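To make the type-aware comparison concrete, here is a minimal sketch of such a gate for scalar answers. It is an illustration and not the lab's harness: the names answers_agree, check_answers and AnswerMismatch are hypothetical, and the relative tolerance of 1e-9 is an assumed default you would tune to your own data.

```python
import math
from numbers import Integral, Real


class AnswerMismatch(Exception):
    """Raised when an engine's answer disagrees with the reference."""


def answers_agree(expected, actual, rel_tol=1e-9):
    """Type-aware equality for scalar query results."""
    # Integer-typed answers (counts, integer sums) must match exactly.
    if isinstance(expected, Integral) and isinstance(actual, Integral):
        return expected == actual
    # Anything involving a float is compared within a relative tolerance
    # (rel_tol is an assumed default, not a value from the lab).
    if isinstance(expected, Real) and isinstance(actual, Real):
        return math.isclose(expected, actual, rel_tol=rel_tol)
    # Strings, None, and other types fall back to plain equality.
    return expected == actual


def check_answers(reference, answers_by_engine, rel_tol=1e-9):
    """Fail loudly on the first engine that disagrees with the reference."""
    for engine, answer in answers_by_engine.items():
        if not answers_agree(reference, answer, rel_tol):
            raise AnswerMismatch(
                f"{engine} returned {answer!r}, reference is {reference!r}"
            )


# Call this before any timing loop, so a divergence stops the run.
check_answers(1_000, {"engine_a": 1_000, "engine_b": 1_000})
```

Raising an exception rather than returning a flag is deliberate in this sketch: a harness that can ignore the return value will eventually time a query whose answers never agreed.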

Rule one

Report the noise, and scale until the signal clears it.

Every comparative result should carry its coefficient of variation, the run-to-run standard deviation as a percentage of the mean, because a difference between two engines means nothing until you know how much the same engine varies against itself. This is the rule that decides whether a benchmark is measuring an engine or measuring the weather, and the lab finding that earned it is uncomfortable. At 1 million rows the chDB queries finished in 5 to 50 milliseconds, and at that scale the coefficient of variation ran as high as 55%, which means the spread between a fast run and a slow run of the identical query on the identical engine was wider than most of the differences I'd have been tempted to report between engines. Any "engine A beat engine B" call in that regime is a coin flip wearing the costume of a result.
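The calculation itself is small enough to sketch. The version below assumes the sample standard deviation (n - 1 in the denominator), which is a choice the text above doesn't specify, and the timings are illustrative rather than lab data.

```python
import statistics


def coefficient_of_variation(timings_ms):
    """Run-to-run standard deviation as a percentage of the mean."""
    if len(timings_ms) < 2:
        raise ValueError("need at least two runs to estimate variation")
    mean = statistics.mean(timings_ms)
    # statistics.stdev is the sample standard deviation (an assumption here).
    return 100.0 * statistics.stdev(timings_ms) / mean


runs = [8.0, 10.0, 12.0]  # illustrative timings, not lab data
print(f"CV = {coefficient_of_variation(runs):.1f}%")
```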

The variation settled as the work grew. By 10 million rows the coefficient of variation came down to around 5%, and by 100 million rows to around 4%, which is the actual reason you run large. It isn't that big numbers are more impressive; it's that scale is how you get the per-query work to dominate the fixed-cost jitter, the process startup and the cache effects and the scheduler noise that swamp a query lasting only a few milliseconds, until the signal you care about finally clears the noise you can't avoid. A surprising amount of published benchmarking gets run at a scale where the headline difference sits inside the error bars, and the author either doesn't compute the variation or doesn't report it, so the reader can't tell whether the 8% gap is a real property of the engine or this morning's luck. The discipline is to scale up until the coefficient of variation is small relative to the effect you're claiming, and if it won't come down, to say so and stop claiming the effect.
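One way to turn "small relative to the effect" into a check is sketched below. The function name smallest_trustworthy_scale is hypothetical, and so is the rule that the effect must be at least twice the coefficient of variation (margin=2.0); that factor is an assumed threshold for illustration, not a figure from the lab. The ladder uses the approximate percentages quoted above.

```python
def smallest_trustworthy_scale(cv_by_rows, effect_pct, margin=2.0):
    """Return the smallest row count whose CV is below effect_pct / margin.

    Returns None when no measured scale is quiet enough, which is the
    signal to stop claiming the effect.
    """
    threshold = effect_pct / margin  # margin is an assumed safety factor
    for rows in sorted(cv_by_rows):
        if cv_by_rows[rows] < threshold:
            return rows
    return None


# Approximate CV figures (percent) quoted above for the chDB ladder.
ladder = {1_000_000: 55.0, 10_000_000: 5.0, 100_000_000: 4.0}

print(smallest_trustworthy_scale(ladder, effect_pct=20.0))  # a 20% claim
print(smallest_trustworthy_scale(ladder, effect_pct=5.0))   # a 5% claim
```

Under these assumptions a 20% claim first becomes defensible at 10 million rows, while a 5% claim returns None on this ladder, meaning none of the measured scales supports it.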

I owe a related essay on choosing that scale deliberately rather than by habit, on why the right row count is the one where your variation drops below your effect size and not a round number that looks serious (scale before you measure), but the short version lives in the numbers above: 55% variation at a million rows, 4% at a hundred million, same query, same engine, same machine. The benchmark didn't get more honest because the engine got faster, it got more honest because the noise stopped being able to hide inside it.

Rule two

Matching a codec is not config parity.

When you compare two formats or two engines, the thing you most want to hold constant is the data, and the trap is assuming that two tools writing "the same data with the same compression" have actually produced comparable files. They haven't. On identical input at the same codec, PyArrow wrote a Parquet file of 193 MB where DuckDB's writer produced 114 MB, a difference of roughly 1.7× from the encoder alone, with nothing changed about the logical rows or the compression algorithm named on the tin. The writer chooses dictionary encoding thresholds, page sizes, row-group boundaries, and how aggressively it applies run-length and delta encodings, and those choices move the file size and the read cost enough to swamp the format-versus-format difference you set out to measure. If you let each engine write its own files and then time the reads, you aren't comparing the formats, you're comparing the writers, and you probably can't tell which.

Config parity, every encoding knob matched across both writers, is the partial fix and it's worth doing, but matching every knob you can name still leaves the ones you can't, so the cleaner answer is to remove the writer from the comparison entirely. Write the data once, then register those same physical bytes into both catalogs. For an Iceberg-versus-DuckLake read comparison this is concrete: you point Iceberg's add_files and DuckLake's ducklake_add_data_files at the byte-identical Parquet, so both catalogs describe the same files on disk and the only thing left varying is the engine's read path. Done that way, Iceberg and DuckLake came out read-neutral on identical bytes, which is the true result, and one you can only see once you've stopped accidentally benchmarking two encoders against each other and calling it a format comparison.

The general form of the rule is that the variable you're testing has to be the only variable that moves, and "same data" is doing a lot of quiet work in that sentence. Same logical rows is not the same as same bytes, and same codec is not the same as same encoding, so when you can arrange to feed both sides the identical files you should, and when you can't you should at least know which uncontrolled knob is carrying your result.
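A cheap guard for the "identical files" case is to compare byte digests of the inputs each side will read before any timing starts. The sketch below uses hypothetical helper names and assumes each side can list the paths it reads. A byte hash is the right tool here only because these are meant to be the same physical files registered twice; it is not a way to compare files that were written twice.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """SHA-256 of a file's bytes, read in chunks to keep memory flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_same_bytes(files_a, files_b):
    """Raise unless both sides read byte-identical sets of files."""
    digests_a = sorted(file_sha256(path) for path in files_a)
    digests_b = sorted(file_sha256(path) for path in files_b)
    if digests_a != digests_b:
        raise ValueError("the two sides are not reading identical bytes")
```

Sorting the digests means the two lists may name the files differently or in a different order, as long as the bytes behind them match.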

Rule three

Isolate the run, or measure the contention instead.

A timing benchmark has to own the machine while it runs, because anything else heavy on the same host competes for the same cores, the same memory bandwidth, and the same disk queue, and that contention doesn't add a constant you can subtract out. It adds variance, and worse, it adds variance that lands unevenly across the engines depending on when each one happened to be scheduled against the competing load, which is precisely how a benchmark invents a difference that isn't there. I've watched a co-running job inflate the coefficient of variation enough to turn two engines that were genuinely within noise of each other into a clean-looking ranking, and the ranking flipped on the next run because it was never measuring the engines in the first place. It was measuring which one lost the fight for the cache that particular afternoon.

The practice is dull and it works. Run benchmarks one at a time, with the build jobs and the data generation and the other experiments stopped, and don't run a second benchmark in another terminal because you're impatient, which is the version of the mistake I'm most prone to. If you have to share the host, pin the work and account for it honestly rather than pretending the contention washes out, but the default should be a quiet machine. This rule earns its place next to the coefficient of variation because the two failure modes look identical from the outside, both showing up as a spread in the timings, and you'll waste a day chasing an engine difference that was really the compiler you left running in the background.

None of this requires a dedicated benchmarking cluster, which is fortunate because I don't have one. It requires the discipline to let the run have the machine to itself for the few minutes it takes, and to resist parallelizing the one thing whose whole purpose is to be measured serially.

Rule four

Control the environment, including the power plan.

The machine you measure on has settings that move your numbers, and the one that surprised me most was the operating-system power plan. On this Windows host running the benchmarks under WSL2, switching from the default balanced plan to High Performance dropped the coefficient of variation on a sustained workload from 5.8% to 0.8%, and on a short workload from 19.8% to 2.7%. That is not a small effect, and it came from a setting rather than from hardware, no faster CPU, no more memory, no migration to a proper bare-metal box. The balanced plan was throttling the clock between bursts and ramping it back up unevenly, so each run started from a slightly different frequency state and the timings scattered accordingly. Pinning the plan to High Performance kept the clock steady, and the steadier clock made the measurement repeatable.

The general lesson is that "control the environment" means more than closing your browser, because the modern stack is full of adaptive behaviors that trade steady-state performance for power and that stay invisible until you go looking for the variance they cause. CPU frequency scaling, turbo boost, thermal throttling, the laptop-versus-plugged-in governor, the WSL2 memory and CPU caps, the filesystem the temp files land on; any of these can put a wobble in your numbers that you'll misattribute to the engine. The honest move is to fix the ones you can, document the ones you can't, and report the coefficient of variation so the reader sees how steady the platform was while you measured. A benchmark run on an unpinned laptop power plan is reporting the governor's mood as much as the engine's speed.

I'll flag the obvious caveat here because it's the honest one: a setting that cut my coefficient of variation roughly sevenfold on this host might do less on a server with a fixed clock, and the specific numbers are this machine's. The transferable part is the instruction to go find the adaptive behaviors before they find you, and to treat the environment as something you configured on purpose rather than whatever the laptop happened to be doing.

Rule five

Hash the logical rows, not the bytes.

The last rule is about integrity checking, the part of the harness that confirms two runs operated on the same data, and it has a subtle trap. The obvious way to check that a Parquet file is unchanged is to hash the bytes, and for most file formats that's correct, but Parquet written by a parallel engine is not byte-reproducible by default. Write the same logical table twice with DuckDB and you can get two files whose bytes differ, not because the data differs but because the parallel writer interleaves row groups in whatever order the threads finished, so the row order on disk shifts between runs while every logical row is identical. A byte hash of those two files disagrees, and if your integrity check is a byte hash you'll either get spurious failures or, worse, you'll loosen the check until it stops protecting anything.

The cause is the parallelism, not the writer being wrong, and the fix is to hash what you actually care about, which is the logical content. Compute the hash over the rows in a defined order, or over an order-independent digest of the row set, so two files with the same data and different physical layout hash the same and two files with genuinely different data don't. As a side benefit, if you do want byte-level reproducibility for archival, writing single-threaded or with an explicit ORDER BY gets you there, and the reproducible layout came out roughly 20% smaller in my runs because the sorted order compresses better, so determinism and size pulled in the same direction rather than against each other. That's a pleasant surprise rather than the point, though. The point is that an integrity check has to verify the thing you mean by "same," and for Parquet "same" lives at the level of the rows, not the bytes.
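As an illustration of the order-independent option, the sketch below hashes each row and adds the per-row digests modulo 2^256, so physical row order can't affect the result. Several things here are assumptions: the names are hypothetical, rows are taken to be tuples of plain Python values, and the repr-based encoding is a stand-in for a canonical encoding that a real harness would have to define across engines (float formatting and timestamps especially). An additive digest like this guards against accidents, not against someone deliberately constructing a collision.

```python
import hashlib

_MODULUS = 1 << 256


def row_digest(row):
    """Hash one row; each value is tagged with its type name."""
    # repr-based encoding is an assumption for this sketch only.
    encoded = "\x1f".join(f"{type(v).__name__}:{v!r}" for v in row)
    return hashlib.sha256(encoded.encode("utf-8")).digest()


def logical_hash(rows):
    """Order-independent digest of a multiset of rows."""
    total = 0
    count = 0
    for row in rows:
        # Addition is commutative, so physical row order cannot matter.
        # Unlike XOR, duplicated rows do not cancel each other out.
        total = (total + int.from_bytes(row_digest(row), "big")) % _MODULUS
        count += 1
    return f"{count}:{total:064x}"


first_write = [(1, "alice", 3.5), (2, "bob", None)]
second_write = [(2, "bob", None), (1, "alice", 3.5)]  # same rows, new order
print(logical_hash(first_write) == logical_hash(second_write))
```

The row count is included in the output so that a truncated file is visible at a glance, and the sum streams over the rows without having to sort or hold them in memory.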

There's a matching risk one layer over. Hashing the logical rows proves two runs saw the same data, but it can't catch a corruption that both runs share, and Parquet's own defense against that, the per-page CRC32 checksum, is honored inconsistently across readers. In a probe where I flipped a single byte inside a checksummed page, chDB verified the checksum and raised an error, while DuckDB and DataFusion don't verify page checksums at all and PyArrow and Polars ship the check turned off by default, so four of the five returned a confident wrong sum. That only bites when a byte actually flips, which makes it an integrity backstop for cheap or cold storage rather than a routine worry, but it makes the same point one layer down: verifying the answer and verifying the bytes are different jobs, and an integrity story for evidence-grade logs wants both, the logical-row hash for run-to-run sameness and the page checksum, where the reader honors it, for a silent corruption underneath.
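What "honoring the checksum" amounts to can be shown on a plain byte buffer. The sketch below is not a Parquet reader and the buffer is a stand-in, not a real page; it only shows the recompute-and-compare step using Python's zlib.crc32, with a hypothetical helper name.

```python
import zlib


def page_is_intact(page_bytes, stored_crc):
    """Recompute the CRC32 of a page and compare it with the stored value."""
    return (zlib.crc32(page_bytes) & 0xFFFFFFFF) == stored_crc


page = bytes(range(256)) * 16            # stand-in buffer, not a real page
stored = zlib.crc32(page) & 0xFFFFFFFF   # what a writer would have recorded

corrupted = bytearray(page)
corrupted[100] ^= 0x01                   # flip one bit in a single byte

print(page_is_intact(page, stored))              # True
print(page_is_intact(bytes(corrupted), stored))  # False
```

A reader that skips this comparison decodes the corrupted bytes as if nothing had happened, which is how a flipped byte becomes a confident wrong sum.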

Put rule zero and rule five together and you have the correctness spine of the whole method: the answer-equality gate checks that two engines computed the same result, and the logical-content hash checks that they computed it over the same data, and between them they close off the two ways a benchmark can be precisely measuring the wrong thing. The speed number sits on top of that spine, and it's only worth reporting because the spine is holding it up.

What this is and isn't

One machine, Tier B, and the method is the part that travels.

I should be precise about what these results are, because the discipline I'm arguing for applies to me first. This is single-machine, first-party, Tier B work. The absolute milliseconds are this host's and nobody else's, the 1.7× PyArrow-versus-DuckDB file-size gap is from these particular writer defaults at these versions, and the power-plan variance numbers are a property of this Windows and WSL2 setup. I'd push back on anyone, including me, who quoted the raw numbers as a server-class result or a universal constant. What transfers is not the milliseconds but the ratios and, more durably, the method: the coefficient of variation has a scale below which your comparison is noise, encoders make "same codec" an incomplete control, contention inflates variance, adaptive power behaviors wobble an unpinned clock, and parallel writers make byte hashes the wrong integrity check for Parquet. Those are properties of the measurement problem, and they hold whatever host you run on.

The closest analogues for what a credible benchmark looks like come from outside security, in the TPC suite and in MLPerf, both of which publish their methodology openly and subject results to audited review rather than asking you to take a vendor's table on faith. Security data has lacked the equivalent, an independent layer with open methodology and a named reviewer, and the reasons it's lacked one are partly contractual, which I've written about separately: the schema-on-read SIEMs prohibit customers from running competitive performance tests at all, so the only benchmarks that exist are vendor-funded by construction. This piece is the other half of that argument. Even when no contract is stopping you, even on open engines you ran yourself, the benchmark will still mislead you if it skips rule zero, because open methodology you can read is no protection against an answer nobody verified.

So the method is the deliverable, more than any single milliseconds figure, and that's deliberate. A speed number ages the moment the next engine version ships, but a harness that gates on correctness, reports its noise, controls its data and its environment, and hashes the right thing keeps producing numbers you can stand behind across versions and hosts. The numbers are evidence; the method is what makes them admissible.

Evidence: Tier B (first-party, single machine, reproduced). Findings drawn from SDW Lab runs on a Windows/WSL2 host: the cross-engine answer-equality divergence (chDB 4.1.8 vs DuckDB over byte-identical Parquet); coefficient of variation up to 55% at 1M rows settling to ~5% at 10M and ~4% at 100M; PyArrow 193 MB vs DuckDB 114 MB at the same codec on identical data; the High-Performance power plan dropping CV from 5.8% to 0.8% (sustained) and 19.8% to 2.7% (short); and Parquet's lack of default byte-reproducibility from parallel row order, with the reproducible single-threaded / ORDER BY layout running roughly 20% smaller. Absolute timings are this host's; the ratios and the method transfer. Methodology is published with the lab.

A number you can stand behind, or a number that just came out.

The difference between the two isn't a faster machine. It's a harness that checks the answer before the clock, reports its own noise, feeds both sides the identical bytes, owns the host while it runs, and hashes the rows instead of the file. The lab publishes the method openly and gates the reference implementation, so the result is reproducible without anyone violating a license.
