Security Data Works

Writing · Measurement hygiene

Scale before you measure, or you're reporting noise.

Most of the security-tool micro-benchmarks I see are run at a scale where the queries finish in a few milliseconds and the run-to-run jitter is larger than the difference being reported, so the headline winner is a coin flip dressed up as a measurement. I ran the same workload across a scale ladder to find where that breaks, and on this host the answer was stark: the same engine's results were so unstable at 10 million rows that a "23% faster" claim there would have meant nothing, and only as the data grew did the numbers settle into something you could actually compare.

The number that gets quoted

A speedup with no error bar around it.

The benchmark that drives an architecture decision almost always arrives as a single ratio. Engine A is 23% faster than your SIEM on this query, or the new lakehouse answers a 30-day search in 1.8 seconds against the incumbent's 9, and the ratio gets pasted into a slide and a procurement memo and from there into a roadmap. What's missing from nearly every one of those numbers is the thing a statistician would ask for first, which is how much that figure moves when you run it again. A measurement without a sense of its own variability isn't really a measurement, it's a single draw from a distribution, and if the distribution is wide then the next draw could just as easily have put the other engine ahead.

The way you size that variability is the coefficient of variation, the run-to-run standard deviation expressed as a percentage of the mean, and it's the cheapest discipline in benchmarking because all it costs is running each query more than once and doing the arithmetic. If your CV is 5% and engine A beats engine B by 30%, the gap clears the noise and you can believe it. If your CV is 40% and the gap is 23%, the gap is inside the noise, which means the difference you're about to report could vanish or reverse on the next run, and reporting it as a finding is closer to reading tea leaves than to measuring anything.

So before I trust any speed comparison, mine or anyone's, I want to know where the noise floor sits for the system under test, because the noise floor is what decides whether a gap is signal. I went to measure exactly that across a scale ladder, expecting it to be a boring methodological warm-up before the real benchmark. The warm-up turned out to be the finding.

What the scale ladder showed

Fifty-five percent variation at ten million rows.

The run was straightforward. I took the same workload (synthetic network-connection events written once to Parquet) and read it with ClickHouse's embedded engine, chDB, across a scale ladder of 1 million, 10 million, and 100 million rows, running each query repeatedly and recording not the headline time but the coefficient of variation across the repeats. At the small end the result was almost comically unstable. The queries were finishing in something like 5 to 50 milliseconds, which is fast enough to be useless as a measurement, because at that duration the run-to-run jitter (scheduler hiccups, cache state, whatever else the operating system is doing in that 20-millisecond window) is a large fraction of the total time. On this host chDB's CV ran as high as 55% at 10 million rows on the shortest queries, which is to say the query time swung by more than half from run to run while nothing about the workload changed.

The instability collapsed in a clean, monotonic way as the data grew. At 1 million rows the variation sat around 19%, which is already too loose to trust a sub-25% gap. By 10 million rows the picture was mixed, with the shortest micro-queries still up around that 55% high-water mark while the heavier ones had settled toward the single digits, around 5%. By 100 million rows the CV had come down to roughly 4% and stayed there, which is finally tight enough that a real 10% or 20% difference between two engines would stand clearly outside the noise. The pattern is exactly what you'd expect once you say it out loud: the larger the query's real work, the smaller the fixed jitter is as a fraction of it, so scaling the data is how you drown the noise under signal.

The practical line that falls out of this is specific rather than philosophical. On this kind of workload, any query that finishes in under about 100 milliseconds and any scale below roughly 10 million rows is noise-dominated, so a comparison run there is measuring the machine's mood, not the engines. A "23% faster" headline produced at a million rows with no CV reported is an artifact rather than a weak result, and you should treat it the way you'd treat a coin that came up heads once.

The discipline this implies

Find the noise floor first, then report it.

The method that comes out of this is short and I've started treating it as non-negotiable. First, scale the data until the signal you actually care about clears the coefficient of variation, which usually means running large on purpose rather than running small for convenience, because the small run is the one that lies to you. Second, report the CV next to every number you publish, so the reader can see for themselves whether a gap is real, and so you can't quietly hide a 40% variation behind a clean-looking mean. Third, discount any comparison whose gap is smaller than the combined run-to-run variation of the two things being compared, because a 15% difference between two engines that each vary by 10% is not a 15% difference, it's a wash you happened to catch leaning one way.

A few corollaries ride along with that, and they're the unglamorous mechanics that separate a measurement from an anecdote. Warm versus cold matters, because the first run pays for cold caches and file-system reads that later runs don't, so mixing a cold first run into the average inflates both the mean and the variance and you should decide deliberately which regime you're measuring. Trial count matters, because you can't estimate a coefficient of variation from two runs, and the noisier the system the more trials you need before the CV itself stabilizes. And a single run is never a measurement, full stop, because one number carries no information about its own reliability, which is the entire failure I'm describing here.

None of this is exotic. It's the ordinary hygiene that any experimental field would insist on, and the only reason it bears repeating in this domain is that the security-tooling benchmark genre has largely skipped it, partly because a tight, scaled, CV-reported result is less flattering than a single fast number, and partly because the people producing the numbers are usually the people selling the engine.

Why this matters to a buyer

You can't tell a real 10x from jitter.

The reason this isn't merely an academic complaint is that these are the numbers driving real architecture decisions. The "X is faster than your SIEM" claims that show up in vendor decks and conference talks are frequently produced at modest scale with no coefficient of variation anywhere in sight, which leaves the buyer unable to do the one thing they most need to do, which is tell a genuine order-of-magnitude advantage apart from a result that happened to land favorably. A real 10x is worth reorganizing a pipeline around. A 23% gap that's actually a 40%-CV coin flip is worth nothing, and from the slide alone the two look identical, because the slide shows you the mean and hides the spread.

That asymmetry favors whoever is producing the number, because a small-scale run is cheap, it's fast to generate, and it tends to produce a dramatic-looking ratio precisely because the noise is large, so the incentive structure quietly rewards exactly the methodology you'd want to avoid. A buyer who knows to ask "at what scale, over how many trials, with what coefficient of variation" can deflate most of these in one question, and the discomfort that question produces is itself diagnostic, because a result that was measured properly has those numbers ready and a result that wasn't goes quiet.

This sits alongside a structural problem I've written about separately, which is that the contract regimes around incumbent SIEMs often prohibit customers from publishing benchmarks of their own tools at all, so the only numbers in circulation are the vendor-funded ones. That argument is about who is permitted to measure. This one is about whether the measuring was done correctly even when it was permitted, and the two failures compound, because a number that nobody independent was allowed to check and that wasn't scaled past its own noise floor is doubly worthless and still routinely treated as fact. The full version of the method, the harness and the checks end to end, is what I'm building toward in a companion piece on how to run a benchmark that doesn't lie to you, but the noise-floor discipline is the first gate and the one that catches the most.

What this is and isn't

One host's numbers, a portable principle.

I want to be precise about what I'm claiming, because the temptation to inflate a tidy result is real. The specific magnitudes here are this machine's: a 55% coefficient of variation at 10 million rows, 19% at a million, around 5% and then 4% as the data scaled, all measured on a single host with one engine on one workload. Run the same ladder on a quieter box, or a busier one, or against a different storage layout, and the exact percentages will move, possibly a lot, so I'd push back on anyone who quoted "55% at 10M" as a constant of nature rather than as one host's reading. This is Tier B evidence, first-party and reproducible but single-machine, and I'd rather say that plainly than dress one run up as a law.

What does transfer is the principle, and I think it transfers cleanly. Sub-100-millisecond queries are jitter-dominated on essentially any general-purpose machine, because the fixed costs that drive the variance (scheduling, cache behavior, background work) don't shrink just because your query is small, so the ratio of jitter to real work blows up at the small end no matter whose hardware you're on. The consequence, that you have to scale until the signal clears the noise and that you have to know your own noise floor before you report a difference, holds regardless of where the floor actually lands for you. The number is local; the discipline is general.

The honest open question is how the floor moves across engines and storage formats, because chDB's variance profile isn't guaranteed to match DuckDB's or Trino's, and a columnar engine reading a well-partitioned table may bottom out at a different scale than one reading many small files. The responsible next step is the same one good benchmarking always points to, which is running the ladder across more than one engine and reporting the floor for each, so that the noise floor becomes a published property of the test rather than an assumption hidden inside it.

I've taken the first step in that direction, and it confirms the worry that the floor is engine-specific rather than a property of the workload alone. Running the same kind of OCSF network-activity workload over a million rows on the MOAR reference stack, single host, median of four trials with the coefficient of variation reported next to every latency, the noise floor moves around by engine and by query in a way you'd have no way to see from the latencies on their own. A plain count over the table varies by about 10% on DuckDB, 10% on Trino, and 11% on ClickHouse, while StarRocks holds that same count to roughly 1%, so StarRocks is the steadiest of the four on the small-batch counts. A needle lookup on dst_port=3389 sits at 3% on DuckDB, 6% on Trino, 8% on ClickHouse, and again 1% on StarRocks. The group-by on dst_port is tighter and more even, around 7% on DuckDB and Trino, 5% on ClickHouse, and 11% on StarRocks, the one cell where StarRocks is the loose one. And the high-cardinality distinct over src_ip is the noisiest cell of all for the two embedded-style engines, 14% on DuckDB and 17% on Trino, against 6% on ClickHouse and 2% on StarRocks. No engine is uniformly the calmest, the noise floor depends on which engine is reading and which query it's running, which is exactly why the CV has to be reported per engine and per cell rather than assumed once for the whole test.

I've since run the same four engines up a hundredfold, to a hundred million rows on the same single host, and a second rung is enough to start seeing the per-engine floor take a shape rather than read as one scattered reading. The count holds tight as the data grows for the two engines that were already steady, 2% on ClickHouse and 2% on StarRocks, while DuckDB loosens to 9% and Trino sits at 4%, so the calm engines stay calm where it counts. The interesting move is the needle lookup on dst_port=3389, which is where DuckDB's timing wobbles most: its CV climbs from 3% at a million rows to 16% at a hundred million, the noisiest cell anywhere in the run, while the other three hold that selective point lookup in low single digits (4% on Trino, 5% on ClickHouse, 3% on StarRocks). The dst_port group-by tightens to 2% on DuckDB and lands at 3% / 6% / 6% across Trino, ClickHouse, and StarRocks. The high-cardinality distinct over src_ip is where the cost of pushing scale becomes its own datapoint, because Trino errored on it outright at a hundred million rows and produced no number at all, while DuckDB ran it at 9%, ClickHouse at 8%, and StarRocks at 9%, so the place an engine stops being measurable is itself a reading about that engine. Two rungs don't make a ladder, but they're enough to settle the direction of the question and to show the shape starting to form: StarRocks and ClickHouse hold low single-digit CVs on the scan and count work, DuckDB's selective point lookup is its noisy corner, and the floor is engine-and-query-and-scale specific rather than a single number you can quote once. The full 1M/10M/100M ladder run for every engine, and a second host to separate the machine from the engine, remain the larger open follow-up; quoting any one engine's CV as the test's noise floor would still be wrong.

Why the floor is the deliverable

The error bar is the honest part.

The thing I keep returning to is that the coefficient of variation is the most honest number in the whole report, and it's the one that vendor benchmarks almost never include, which I don't think is a coincidence. A wide error bar is an admission that the comparison isn't decisive yet, and admissions don't sell engines, so the genre has converged on the clean single ratio that conceals exactly the information a buyer needs. An independent measurement layer earns its keep precisely by refusing that, by scaling the data until the gap clears the noise and then publishing the noise next to the gap so the reader can audit the call rather than take it on faith.

I'd go further and say a speed benchmark with no reported variance isn't a weaker version of a good benchmark, it's a different and more dangerous object, because it carries all the authority of a measurement with none of the substance, and the more polished the slide the more convincing the artifact. The fix is cheap, which is the frustrating part. Running each query thirty times instead of once and printing the CV costs minutes, and it's the difference between a number a careful reader can trust and a number that should never have left the lab.

So the lesson I took off this run is almost old-fashioned: find your noise floor before you report a difference, scale until the signal clears it, and show the floor in the writeup so nobody has to take your word for it. The reason a security-data practice can credibly stand between vendors and buyers isn't that it runs faster benchmarks, it's that it runs benchmarks honest enough to tell you when the difference they found wasn't real, and that honesty starts with the unglamorous act of knowing how much your own measurements wobble.

Evidence: Tier B (first-party, reproducible; single machine). Measured in the SDW Lab on a scale ladder (1M / 10M / 100M synthetic network-connection events, written once to Parquet, read by chDB across repeated trials). Coefficient of variation up to 55% at 10M rows on sub-100ms micro-queries, ~19% at 1M, ~5% at 10M on heavier queries, ~4% at 100M. The CV magnitudes are this host's; the noise-floor principle is what transfers. A cross-engine cut (2026-06-07, MOAR reference stack, single host) confirms the floor is engine-specific: over a 1,000,000-row OCSF network_activity table, median of four trials with CV per cell, count(*) varied 10% / 10% / 11% / 1% across DuckDB / Trino / ClickHouse / StarRocks, the dst_port=3389 needle 3% / 6% / 8% / 1%, the dst_port group-by 7% / 7% / 5% / 11%, and the src_ip distinct 14% / 17% / 6% / 2%. A second rung at 100,000,000 rows on the same host (median of four trials) gives count(*) 9% / 4% / 2% / 2%, the dst_port=3389 needle 16% / 4% / 5% / 3% (DuckDB's noisiest cell), the dst_port group-by 2% / 3% / 6% / 6%, and the src_ip distinct 9% / err / 8% / 9% — Trino errored on the heavy high-cardinality distinct at 100M and returned no number, itself a reading about where that engine stops being measurable. Two scale points per engine, not the full 1M/10M/100M ladder and not a second host, which remain the larger open follow-up. Methodology is published with the lab.

Know your noise floor before you report a difference.

A speedup with no error bar around it is a coin flip dressed up as a measurement. The independent layer worth paying for scales the data until the signal clears the noise, reports the coefficient of variation next to every number, and tells you when the difference it found wasn't real.