Writing · Detection integrity

The query engine returned the wrong answer and didn't tell you.

The promise of the open lakehouse is that the engine is a commodity: you keep your data in Parquet under an Iceberg table and you point whatever engine you like at it, swapping ClickHouse for DuckDB for Trino as the workload or the price changes. I went to measure how true that is across a scale ladder, and the measurement turned up something I wasn't looking for, which is that one engine answered a simple filtered count tens of rows short of the others over the byte-identical file, returned that wrong number fast, and raised no error doing it.

What I was actually testing

Are the engines really interchangeable?

Interchangeability is the load that the whole open-architecture argument carries, so it's worth being precise about what it actually claims. Three things are supposed to be true when you store security telemetry in an open columnar format and read it with more than one engine: the SQL ports (the same query runs on each engine with at most cosmetic dialect changes), the latency is roughly comparable (no engine is mysteriously an order of magnitude slower on the same bytes), and the answer is the same (a count is a count, a sum is a sum, regardless of which engine computed it). The first two get all the attention because they're what benchmarks measure and what migration plans worry about. The third tends to be assumed, because it feels like it shouldn't even be a question: the file is the file, SQL has defined semantics, so why would two correct SQL engines disagree on how many rows match a predicate?

I was testing the first two. The run was a straightforward scale ladder (1 million, 10 million, 100 million rows of synthetic network-connection events written once to Parquet and read by ClickHouse's embedded engine, chDB, and by DuckDB), and the question in front of me was the boring methodological one of where the noise floor sits. At a million rows the queries finish in single-digit milliseconds and the run-to-run variation swamps any real difference, with the coefficient of variation up around 19%, so any "engine A beat engine B" call at that scale is a coin flip dressed up as a result. By 10 million the variation settles to around 5%, and by 100 million to around 4%, which is the actual point of running large: you scale until the signal you care about clears the noise you can't avoid. That finding alone is worth stating plainly, because a surprising amount of published benchmarking is done at a scale where the headline difference is inside the error bars.

To make the comparison fair I'd built in an answer-equality check, almost as an afterthought. Before timing anything, the harness runs each query on each engine once and compares the results, on the principle that there's no point reporting that engine A is 8% faster than engine B if they're not even computing the same thing. I expected the check to be a formality. It was not.

Where it broke

Forty-nine rows short, no error.

At 100 million rows the equality check failed. A selective lookup (a count(*) with an equality filter on a high-cardinality column, the kind of query that sits under any "how many events matched this indicator" question) came back from chDB 49 rows short of the count DuckDB computed over the same Parquet files. Not 49% short, which would have screamed; 49 rows out of a result in the hundreds, a quiet few-percent undercount on that particular slice. No exception, no warning, no log line flagging a skipped row group or a short read. The query compiled, the engine ran it, and it returned a number that was simply too small.

The thing I want to convey about that moment is how easily it could have not happened. If I'd trusted the timings (chDB was fast, and fast is what a performance benchmark is looking for) the run would have produced a clean, plausible, publishable table showing chDB holding its own against DuckDB at 100 million rows, and the 49 missing rows would have ridden along inside a result that looked completely normal. The only reason I saw it is that the equality gate I almost didn't bother building compared the two answers before it compared the two clocks.

So I stopped measuring speed and went to find out what was actually wrong, because a silent wrong answer is a far more interesting result than a benchmark, and a far more dangerous one if it were sitting in a detection pipeline instead of a test harness.

Pinning it down

It's the tail of the file.

A wrong answer at 100 million rows is expensive to study, so the first instinct was to shrink it, and that's where the bug got more interesting, because at 10 million rows it wouldn't reliably reproduce at all. The same generator at a tenth the scale, swept across all two thousand distinct filter values, came back clean: every count matched the ground truth. Only back at the original hundred-million-row scale did the undercount return, and there it was unmistakable, with a few hundred of the two thousand probe values each landing a handful of rows short. So the defect is scale-dependent, and that dependence is itself the first clue, because it tracks the number of row groups in the file (roughly eight thousand at a hundred million rows, only a few hundred at ten million), which points at the read path's row-group handling rather than at the data or the query. It also carries a methodology lesson I nearly tripped over: a cheaper, smaller reproduction can hide a scale-dependent bug, so you have to re-test at the scale the thing was first seen at rather than assume it shrinks.

With the bug pinned to scale, the shape of it came clear, because the trigger is the equality predicate specifically. When I rewrote the same filter as a LIKE against the same column on the same engine, chDB counted correctly, so it isn't that chDB can't read the file. When I loaded the same data into chDB's own native MergeTree storage instead of reading the Parquet, it counted correctly there too, so it isn't that chDB can't evaluate the predicate. And turning off the Bloom-filter pushdown, or falling back to the engine's older Parquet reader, made the undercount vanish while leaving everything else the same. So the defect lives at one precise intersection: the engine's newer Parquet reader, evaluating an equality (= or IN) filter pushed down into a Bloom-filter probe, on a file with enough row groups to trip it. That is a narrow, specific defect in one columnar read path, and the scale-and-Bloom fingerprint is the signature of a pushdown-pruning false negative rather than anything to do with the data or the SQL.

For the record, the engine and version are chDB 4.1.8, which embeds ClickHouse 26.3.9.1, and the point of naming it isn't to single out a project: it's open source, the reproduction is in the open, and the right next step was an upstream report, which is a contribution rather than a complaint. That report is filed (chdb-io/chdb#587), and the same probe confirmed the defect is already fixed in ClickHouse 26.5, so what's left of it lives only in builds still pinning the older embedded reader. The durable point survives the fix: the engine doing the undercounting is one of the genuinely good ones, fast and widely used, which is exactly why "we picked a solid engine" is not the same thing as "our answers are right."

Why silent is the dangerous part

A count is a detection threshold.

In a security context a count(*) WHERE is rarely just a count. It's the number behind "how many hosts beaconed to this domain," or "how many failed authentications preceded that success," or "how many records are in scope for this retention requirement." Those numbers feed thresholds, alert conditions, and compliance attestations, and the entire value of the substrate is that they're correct. An engine that returns a count a few percent low, silently, doesn't trip a threshold it should have tripped, doesn't escalate a beaconing pattern that crossed the line, and reports a scope figure that's quietly wrong on an audit. None of that announces itself, because the failure mode of a too-low count is indistinguishable from the world simply being quieter than you feared.

This is the same shape as the worst detection failure there is, the rule that never fires and does it quietly, where zero results from a search looks exactly like a calm network rather than like a broken pipe. A wrong-but-plausible count is worse than a crash, because a crash gets investigated and a plausible number gets believed. And the ordinary toolchain offers no protection here, because schema validation checks the shape of the result (an integer came back where an integer was expected) and never the magnitude (whether that integer is the right one). The query was valid SQL, the result was a valid integer, and everything in the normal path was satisfied while the answer was wrong.

The reason this matters more on an open lakehouse than on a single closed engine is precisely the thing that makes the lakehouse attractive. The whole point is that you can swap engines, and the more engines that can read your tables, the more independent read paths there are, each with its own pushdown implementation and its own set of seams. The portability that lets you escape one vendor's pricing is also surface area for exactly this class of disagreement, which is a trade worth making, but only if you know you're making it.

What this is and isn't evidence of

One engine, one version, one filter.

I want to be careful not to overclaim from a single finding, because the temptation to inflate this is real and I benefit from the inflated version. This is one engine, at one version, on one class of filter, against one data shape (a file with many small row groups). It is not evidence that engines routinely disagree, that chDB is unreliable in general, or that open formats can't be trusted, and I'd push back on anyone (including me) who tried to stretch it that far. What it is evidence of is narrower and, I think, durable: answer-equivalence across engines is not free, it is not guaranteed by SQL semantics or by an open format, and the only way to know you have it on your data, your shapes, and your engine versions is to check.

The honest open question is generality. I don't know yet whether the same equality-pushdown undercount appears in DataFusion, in Polars, in Trino, or in a newer chDB build, and the responsible next step is a small cross-engine probe that runs the same many-row-group equality test against each of them and reports who passes as loudly as who fails, because the finding here is the method, not a single vendor's name. Until that probe runs, "engines are not answer-interchangeable at scale" is a well-evidenced hypothesis with one clean datapoint, not a settled law, and I'd rather say that than dress one bug up as a movement.

It's also worth saying what held up. DuckDB matched ground truth on every probe; chDB's own MergeTree path was correct; LIKE was correct. The disagreement was localized and specific, which is the good news inside the bad news: this is a findable, fixable class of defect, not a fog of general unreliability, and a gate that's looking for it catches it cleanly.

What the follow-up found

I ran it down, and most of the stack fails safe.

The honest open question I left a moment ago was generality, and the responsible next step was to actually run it, so I did. The cross-engine probe grew to thirteen distinct Parquet readers, each reading the same byte-identical file and checked against the generator's ground truth, and the headline disciplines the original finding more than it amplifies it: eleven of the thirteen read the file correctly, and at the time only two were silently wrong. One was the chDB reader from before, and when I re-ran this on the current libraries that one had already been fixed in the very next point release, 4.1.8 to 4.1.9 — which fits what it always looked like, a version-scoped regression in one reader's Bloom-filter probe rather than a property of the engine, since the standalone ClickHouse server at an older version had read the same file correctly all along. The other is fastparquet, the retired pure-Python reader mis-decoding a dictionary-encoded column, and on the latest version it still gets it wrong. So the honest current count is one silently-wrong reader out of twelve rather than two, and that makes the case stronger, because the failure set is real, version-bound, and moving: we caught two, one was fixed by a point release inside the same cycle, one persists, and which it is on any given day is exactly the thing you cannot assume. That is the argument for running the equality check in CI rather than trusting that the readers agree.

Both of those bugs lived in the Parquet library rather than in anything SQL-shaped, so the natural move was to push the same verify-the-answer probe down below the engine into the storage format itself, and that's where the more interesting half of the follow-up is. Parquet pages can carry a CRC32 checksum, but writing one and checking it on read are separate decisions, made differently across readers: on a single deliberately flipped byte inside a checksummed page, chDB verifies the checksum and raises an error, while DuckDB and DataFusion don't verify page checksums at all and pyarrow and Polars ship the verifier turned off by default, so four of the five hand back a confident wrong sum. That extends the discipline one layer down. The cross-engine gate catches an answer one reader gets wrong, but only a checksum catches a corruption that every reader would agree on, so "verify the answer" has to grow into "verify the bytes" once evidence-grade logs are sitting on cheap or cold storage. The fair caveat is that this one bites only when a byte actually flips, which makes it an integrity backstop rather than a routine hazard.

Two of the deeper probes cut against the alarming reading, and those deserve saying as plainly as the bug did. Summing the same column of floating-point numbers on five engines produced three slightly different totals, but the integer sums, the counts, the minimums and the maximums were identical to the last bit, which is the tell that the float spread is ordinary IEEE-754 rounding from different accumulation orders and not a defect in any engine. The practical consequence is narrow and useful: an answer-equality check should compare integer-typed results exactly and floating-point results within a tolerance, rather than flag benign rounding as a disagreement. And when I went straight at the pruning machinery, the row-group statistics and the page index and the bloom filters that let an engine skip data it believes can't match, every engine was sound on an identical-data test built to expose exactly that failure, so the chDB undercount does not generalize into "pruning is unsafe." The silent-wrong mode stays concentrated in a couple of specific reader paths, and most of the layer, when it fails, fails loudly.

The one place the open read contract genuinely breaks turned out not to be a bug but a design choice. A Parquet file encrypted with the format's own modular encryption is readable only by the library that wrote it, holding the key, so a file pyarrow encrypts is simply unopenable by DuckDB, Polars, DataFusion or chDB. For regulated data that has to be encrypted at rest, turning encryption on inside the file quietly revokes the swap-any-engine promise the whole architecture rests on, and the fix is to encrypt at the volume or object-store layer instead, so the bytes the engines read stay portable. Put the pieces together and the follow-up sharpens the original argument rather than inflating it: the swap surface is mostly trustworthy and mostly fails loudly, the genuinely silent failures are few and findable, and the thing that surfaced every one of them is the same cheap habit of checking the answer, and a layer down the bytes, before trusting either.

Why the gate is the product

A timing benchmark would have called this a win.

Sit with the counterfactual for a second, because it's the whole argument. A benchmark that measures only speed would have looked at chDB returning that count quickly and recorded a good result, and the wrong answer would have been laundered into a performance win and published as one. The faster the wrong engine, the more convincing the bad number. That inverts the usual intuition that a fast result is a trustworthy one, and it means a performance benchmark without a correctness gate isn't a weaker version of a good benchmark, it's an instrument that can actively certify the wrong answer.

This is the part of independent measurement that doesn't get talked about. I've written before about the contract regime, the clauses that prohibit customers from benchmarking their own SIEM, and the way that leaves only vendor-funded numbers standing. That's one half of why you can't take a published performance claim on faith. This is the other half, and it applies even when nobody's contract is stopping you: a benchmark you ran yourself, on open engines, with nobody's license restricting you, will still lie to you if it only checks the clock. Open methodology is necessary and it isn't sufficient, because methodology-you-can-read is no protection against an answer-nobody-verified.

Which is why I keep coming back to the idea that the verification is the deliverable, not the speed number. The reason a security-data practice can credibly stand between vendors and buyers isn't that it runs faster benchmarks, it's that it runs benchmarks that check whether the answer is right before they report how fast it was wrong. The cross-engine answer-equality gate isn't ceremony bolted onto a performance test; on this run it was the only thing in the entire apparatus that did its job, and the performance test was the part that would have misled me.

What the gate actually is

Cheap to build, the only thing that caught it.

The gate is not sophisticated, which is rather the point. It needs a ground truth, which you get for free if you generate the data, since you know how many rows match any predicate by construction; on real data the equivalent is a trusted reference answer, computed once and pinned. It runs the same query on every engine and on the reference, and it fails loudly the moment any two disagree, before any timing is recorded, so a divergence stops the run rather than getting averaged into a result. The probe values are chosen in advance rather than after seeing the answers, so the check can't be quietly tuned to pass. And it reports the engines that agree just as prominently as the one that didn't, because the output you want is "here is which engines are answer-equivalent on these shapes," not "here is a vendor to embarrass."

If you run an open lakehouse for security data, the practical version of this is a small standing check in your pipeline: a fixed set of queries whose answers you've verified once, re-run against each engine you actually use and each time you bump an engine version, with a hard failure on any disagreement. It costs almost nothing to run and it's the difference between finding a pushdown bug in a test and finding it after a quarter of under-counted detections. The same instinct shows up one layer down in the data, where a field mapped to the wrong meaning fails just as silently as a count that reads a few rows short, and I keep a small runnable check for that case for the same reason.

The broader lesson I took from the run is almost embarrassingly old-fashioned: verify the answer before you trust the speed, and don't let an engine's reputation stand in for a check you could have run in an afternoon. The open lakehouse makes engines a commodity, which is genuinely good, but a commodity you swap freely is a commodity whose answers you have to keep verifying, and the verifying is not the tax on the architecture, it's the part that makes the architecture safe to use.

Evidence: Tier B (first-party, reproduced; single machine). The divergence was found in the SDW Lab clickhouse-vs-duckdb scale run and is scale-dependent: it reproduces at 100M rows / ~8,139 row groups (about 285 of 2,000 probe values short by 3–13 rows each) and does not reliably reproduce at 10M / ~814 groups; chDB 4.1.8 (embedding ClickHouse 26.3.9.1) Parquet equality-pushdown vs DuckDB, chDB-MergeTree, and LIKE, all matching the generator's ground truth. Methodology and the 100M reproduction are published with the lab. The generality follow-up has since run (twelve Parquet readers: ten correct, two silently wrong when caught — one since fixed in a chDB point release, one still wrong on the latest), alongside four lower-level bake-offs taking the same check down to page checksums, encodings, pruning soundness, and modular-encryption portability. The upstream report is filed (chdb-io/chdb#587), and the same probe confirmed the defect is already fixed in ClickHouse 26.5 — a version-scoped regression.

The edge this sits on

Edge: silent-wrong is the failure shape with negative warning time — the chDB 4.1.8 undercount returned 49 rows short at 100M with no error raised, scale-triggered (it doesn't reproduce at 10M), and nothing in the pipeline looked unhealthy.

Signal: none exists by construction, which is the point of this essay — the answer-equality gate is the signal you have to install, because a production estate without a cross-engine oracle cannot see this class of failure at all.

Caveat: one host, Tier B; what travels is ordering and shape. Locate yourself on the breaking-points map with your own numbers.

Fast and wrong is worse than slow and right.

The open lakehouse makes the engine a commodity you can swap. That portability is real, and so is the new surface it opens: more read paths, more places an answer can quietly diverge. The independent layer that's worth paying for is the one that checks the answer before it reports the speed.

Read the lab page → Why vendor benchmarks are the only benchmarks See engagements