Security Data Works

Why vendor benchmarks are the only benchmarks.

"You agree not to access or use an Offering to analyze, test, characterize, inspect, or monitor its availability, performance, or functionality for competitive purposes."

Splunk General Terms, Section 1.2(v)

That clause is in your contract today if you run Splunk, and Oracle has carried a version of it since the 1980s. It is not universal, though: Snowflake and Databricks leave benchmarking unrestricted. The implication for security architecture decisions is structural, and it explains a pattern most CISOs notice but rarely connect to its cause.

What the clause actually does

The customer can't run the comparison.

Section 1.2(v) of Splunk's General Terms (the agreement every Splunk Enterprise customer is bound by) prohibits using the product to analyze, test, characterize, inspect, or monitor its performance for competitive purposes. The Splunk Software License Agreement adds Section 3(f), which prohibits providing benchmark results to third parties without prior written consent. Both clauses bind, and Section 1.2(v) is the broader of the two because it restricts the act of running the comparison and not only the act of publishing the result.

The practical effect is that a Splunk customer who wants to evaluate whether ClickHouse, StarRocks, or Trino could run their security analytical workload faster (on their own data, on their own hardware, in their own environment) is contractually prohibited from doing it. Prohibited, not merely discouraged, is the operative word.

This isn't a quirk of Splunk's contract drafting. Oracle's database licensing has carried a benchmarks clause since the original "DeWitt clause" in the 1980s, named after David DeWitt, the database researcher Oracle prevented from publishing comparative benchmark results, and proprietary data infrastructure has been written this way for forty years. The newer open-platform vendors are the exception worth noting: Snowflake's Acceptable Use Policy does not restrict benchmark testing, and Databricks' Master Cloud Services Agreement carries no comparable competitive-testing clause. The schema-on-read SIEM you would be migrating away from is far more likely to bind you than the lakehouse you would be migrating toward.

The structural consequence

Every published benchmark is vendor-funded by design.

When customers can't legally produce comparative benchmarks against their current platform, the only benchmarks that exist are the ones produced by vendors who are not bound by that platform's contract. A startup competing with Splunk publishes "we run security workloads X× faster than Splunk" because the startup never agreed to Splunk's General Terms. The Splunk customer who could verify or refute the claim with their own data did agree, and is locked out.

Look at the published security-tool benchmark coverage through that lens. Almost every "X% faster than Splunk" benchmark you've read came from a vendor whose product appears favorably in the result, while the counter-benchmarks from Splunk were produced internally and published as customer-facing studies rather than as independent verification. The class of benchmark that doesn't exist (and structurally cannot exist under the current contract regime) is the customer-driven head-to-head on real workloads, published openly. The architects who would benefit most from independent measurement are the ones contractually prohibited from producing it.

This produces a recognizable failure mode in architecture decisions. CISOs evaluating SIEM modernization compare vendor-published claims against vendor-published claims. The most quantitatively rigorous number available comes from whichever vendor most recently funded a benchmark study. Analyst reports synthesize vendor-supplied performance characterizations because they don't have an independent source either. Decisions get made on the layer of measurement that happens to be available rather than the layer that would actually be useful.

A worked example from adjacent infrastructure

The Kafka cost-calculator wars.

The clearest contemporary case I've seen of this failure mode comes from outside security, from the streaming-infrastructure market, where the contract regime is looser but the vendor-benchmark dynamic is identical. The case is worth walking through because the mechanics are the same as in security; only the artifact is different. Stanislav Kozlovski, an Apache Kafka committer and ex-Confluent engineer (six years, including time on the Kafka Serverless team), published a forensic analysis in November 2024 of WarpStream's vendor cost calculator. Kozlovski's read-through of the calculator's methodology (what assumptions it locked in, where the math drifted, what the visible code actually did) produced one of the cleanest worked examples of vendor-benchmark mechanics available in public writing right now.

The summary, with sources at the bottom of this section: WarpStream's calculator, in the version Kozlovski analyzed, made Kafka look roughly 3.75× more expensive than what an experienced operator would actually pay on AWS for the same workload (1 GiB/s sustained, 7-day retention, replication factor 3). The drift came from five mechanical sources, each individually defensible and collectively directional: the calculator selected r4.4xlarge instances when r4.xlarge would have sufficed at the modeled load; it assumed gp2 storage at $0.10/GiB when sc1 HDD at $0.015/GiB was appropriate for the warm-tier workload; it shifted the compression ratio assumption from 5:1 to 4:1 in an unannounced update, inflating Kafka costs ~25% with no change to the WarpStream side; it provisioned 60% free space on each broker disk where 50% had been the prior default, with no operational justification documented; and the visible JavaScript multiplied replication cost by an extra (RF-1) factor, double-billing replication traffic at RF=3. Kozlovski's fair AWS Kafka cost for the same workload, before any optimization, came to ~$337k/month against the calculator's $1.264M, a 3.75× delta. With KIP-405 Tiered Storage (~3× disk reduction, ~4× compute reduction) and KIP-392 Fetch From Follower (eliminates consumer cross-AZ traffic), the optimized Kafka cost dropped to ~$167k/month. The calculator's headline "WarpStream is 10× cheaper than Kafka" claim survived in AWS only at 1.3× and reversed sign in GCP (Kafka 21% cheaper) and Azure (Kafka 77% cheaper) at the same workload.
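
The ratios in that summary are easy to check by hand. The sketch below is toy arithmetic using only the figures quoted above; it is not WarpStream's calculator or Kozlovski's. It assumes that stored bytes scale inversely with the compression ratio and that provisioned disk equals used disk divided by (1 - free fraction). Both scaling rules, and all function and variable names, are illustrative assumptions.

```python
# Toy arithmetic on the figures quoted in the paragraph above.
# The scaling rules are assumptions for illustration, not the calculator's real formulas.

def compression_inflation(old_ratio: float, new_ratio: float) -> float:
    """Relative growth in stored bytes when the assumed compression ratio drops."""
    return old_ratio / new_ratio - 1.0

def free_space_inflation(old_free: float, new_free: float) -> float:
    """Relative growth in provisioned disk when the reserved free fraction rises.
    Assumes provisioned = used / (1 - free)."""
    return (1.0 - old_free) / (1.0 - new_free) - 1.0

compression = compression_inflation(5.0, 4.0)   # 5:1 -> 4:1
free_space = free_space_inflation(0.50, 0.60)   # 50% -> 60% free
storage_price_ratio = 0.10 / 0.015              # gp2 vs. sc1, $/GiB
replication_factor = 3
extra_replication_multiplier = replication_factor - 1  # the extra (RF-1) factor
headline_delta = 1_264_000 / 337_000            # calculator vs. fair estimate

print(f"compression drift:      +{compression:.0%}")
print(f"free-space drift:       +{free_space:.0%}")
print(f"storage price ratio:    {storage_price_ratio:.2f}x")
print(f"extra replication mult: {extra_replication_multiplier}x")
print(f"headline delta:         {headline_delta:.2f}x")
```

None of these steps is large on its own, which is the point: each one moves the Kafka side of the comparison in the same direction.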

Whether WarpStream is or isn't a good product is beside the point. The issue is that the calculator was produced by the vendor who benefits from the comparison, using a methodology the vendor controls, with no independent reviewer who could have pushed back on any of the five mechanical drifts before the artifact shipped. Five small directional choices, each individually plausible, compounded into a roughly 3.75× overstatement of the comparison cost. None of the drifts required malice; they required only the ordinary gravity of incentive. This is the same failure mode that Section 1.2(v) protects in security: the customer who could verify can't, the vendor who could be wrong faces no structural check, and the published number becomes the industry reference anyway.

What's instructive is Kozlovski's response. Rather than publish a counter-narrative, he shipped a tool, the AKalculator, with the stated principle "incentivized to show the lowest possible cost for Apache Kafka that's also realistic" and the methodology fully disclosed: every assumption visible, every default toggleable, every cost line item attributed to a specific price-sheet entry. The AKalculator's bias is the opposite of the vendor calculator's, and Kozlovski says so on the page. That disclosure is what distinguishes a methodology-open artifact from a methodology-closed one. It doesn't make Kozlovski's tool neutral (nothing is), but it lets the reader adjust, which a closed methodology cannot.
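
One way to picture "every assumption visible, every default toggleable, every line item attributed" is a calculator whose inputs are named fields and whose outputs carry their price source. The following is a minimal sketch under stated assumptions, not the AKalculator's code: the class names, the field names, and the storage-only cost model are hypothetical, and the only prices used are the two $/GiB figures quoted earlier, treated here as monthly prices.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Assumptions:
    """Every input is a visible, overridable field (hypothetical names)."""
    ingest_gib_per_s: float = 1.0
    retention_days: int = 7
    replication_factor: int = 3
    compression_ratio: float = 5.0
    free_space_fraction: float = 0.50
    price_per_gib_month: float = 0.015
    price_source: str = "sc1 HDD, $0.015/GiB (figure quoted in this essay)"

@dataclass(frozen=True)
class LineItem:
    name: str
    monthly_usd: float
    source: str  # the price-sheet entry behind the number

def storage_line_item(a: Assumptions) -> LineItem:
    """Toy storage-only model: retained bytes, compressed, replicated, plus headroom."""
    raw_gib = a.ingest_gib_per_s * 86_400 * a.retention_days
    stored_gib = raw_gib / a.compression_ratio * a.replication_factor
    provisioned_gib = stored_gib / (1.0 - a.free_space_fraction)
    return LineItem("broker storage", provisioned_gib * a.price_per_gib_month, a.price_source)

baseline = Assumptions()
# A reader who disagrees with a default can change it and see the effect.
drifted = replace(
    baseline,
    compression_ratio=4.0,
    free_space_fraction=0.60,
    price_per_gib_month=0.10,
    price_source="gp2, $0.10/GiB (figure quoted in this essay)",
)

for label, a in (("baseline", baseline), ("drifted", drifted)):
    item = storage_line_item(a)
    print(f"{label}: {item.name} = ${item.monthly_usd:,.0f}/month [{item.source}]")
```

The design choice that matters is that the defaults live in one inspectable object rather than inside the arithmetic, so a reader can see which choices drive the result and reverse any of them.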

Translate the same dynamic to security and the structural problem becomes obvious. When the "ClickHouse is X× cheaper than Splunk" benchmark drops, every one of the five Kafka-calculator drift modes is available to the publisher (instance sizing, storage class, compression assumption, free-space allocation, replication accounting) plus the security-specific ones (ingest-source mix, query-suite composition, retention-schema-tax accounting, OCSF normalization cost). The Splunk customer who could run the comparison and see which choices were defensible is contractually prohibited from doing it. The practitioner who could publish a counter-tool with disclosed methodology (the equivalent of an AKalculator for security data platforms) faces the same Section 1.2(v) restriction that prevents the customer from running the original comparison. The structural fix isn't another counter-benchmark from a differently-incentivized vendor; what's needed is an independent measurement layer that doesn't share either side's incentive in the first place.

Sources, all Tier B (Apache Kafka committer + ex-Confluent engineer + tool published with fully disclosed methodology): Stanislav Kozlovski, "The Brutal Truth about Apache Kafka Cost Calculators" (long-form analysis); "no one will tell you the real cost of Kafka" (cross-AZ networking deep dive); AKalculator (the counter-tool with disclosed methodology); KIP-405 (Tiered Storage) and KIP-392 (Fetch From Follower) as the operator-controllable cost-reduction primitives Kozlovski's analysis quantifies.

What independent measurement requires

Methodology open. Reference implementation gated. Reviewer named.

The instinct to fix vendor-benchmark bias is "open-source the benchmark." Push the methodology to GitHub, push the Docker Compose definitions to GitHub, push the data generators to GitHub, and let the community verify. That instinct is right about the methodology, which should be open, but it's wrong about the reference implementation, because the executable artifact, in a comparison set that includes commercial software with restrictive licensing, can't ship publicly without putting either the publisher or the downloader in contract violation.

What does work is a split. The methodology, hardware spec, query suite, and result are published openly under a practitioner brand. The reference implementation is shared under a one-page mutual NDA with engagement prospects and qualifying reviewers. The work also goes through an annual external review, under NDA, by a named practitioner with relevant standing (security data engineer, OCSF contributor, analyst-firm researcher), whose signoff is published on the public methodology page. That structure preserves the audit trail that makes the result credible without forcing the publisher to violate the licensing terms of the platforms they're characterizing.

The closest analogies are the TPC benchmarks (TPC-H, TPC-DS) and MLPerf. Both publish methodology openly. Neither distributes a turn-key reference implementation that a random customer can clone and run against arbitrary commercial software. Both rely on qualified-implementer audit cycles and independent third-party review. Security data has been waiting for the equivalent layer for the last decade and hasn't gotten it; the contract structure that suppresses it is part of why.

There's a fourth requirement I'd add that the TPC analogy understates, and I'd add it because my own lab embarrassed me into it: the benchmark has to verify the answer, not just the clock. Open methodology is no protection against an unverified result, and a performance test without a correctness gate can certify the wrong number as a win the moment the wrong engine happens to be the fast one. I watched exactly that nearly happen when one engine returned a filtered count tens of rows short of the others on the identical Parquet, silently, and the only thing that caught it was a cross-engine answer-equality check running before the timing (the full case is its own essay). So the structure isn't only methodology-open, reference-gated, reviewer-named; it's also answer-verified, because the contract regime is one reason you can't take a published number on faith and speed-without-correctness is the other, and the second one applies even to the benchmark you ran yourself.
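
A correctness gate of that kind can be sketched in a few lines: collect every engine's answer first, and refuse to time anything if the answers disagree. The harness below is an illustration, not the lab's actual code. The function name and the two stand-in "engines" are hypothetical, and a real harness would compare something richer than a count (for example, a checksum over sorted result rows).

```python
import time
from typing import Callable, Dict, Hashable

def verified_benchmark(engines: Dict[str, Callable[[], Hashable]]) -> Dict[str, float]:
    """Check that all engines return the same answer, then time each one.
    Raises ValueError (and records no timings) if any answers differ."""
    answers = {name: run() for name, run in engines.items()}
    if len(set(answers.values())) != 1:
        raise ValueError(f"answers disagree, timing skipped: {answers}")
    timings: Dict[str, float] = {}
    for name, run in engines.items():
        start = time.perf_counter()
        run()
        timings[name] = time.perf_counter() - start
    return timings

# Hypothetical stand-ins for real query engines running the same filtered count.
ROWS = [{"status": s} for s in (200, 500, 503, 404, 500, 502)]

def engine_a() -> int:
    return sum(1 for r in ROWS if r["status"] >= 500)

def engine_b() -> int:
    # Simulates a silent discrepancy: this engine never sees the last row.
    return sum(1 for r in ROWS[:-1] if r["status"] >= 500)

try:
    verified_benchmark({"engine_a": engine_a, "engine_b": engine_b})
except ValueError as err:
    print(f"gate tripped: {err}")
```

Running the equality check before the clock starts means a short-counting engine can never appear in the results table as the fastest one.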

What this means for architecture decisions

Three things to do with this.

One: read your platform's actual contract. Splunk's clause language is published at splunk.com/legal/splunk-general-terms — Section 1.2(v), in plain language, about ten minutes to find. If you run Snowflake or Databricks instead, check the equivalent terms (snowflake.com/legal, databricks.com/legal) and confirm for yourself that no such competitive-testing restriction is there. Knowing what you can and can't do with your own platform under your existing license is the precondition for any architecture conversation that involves comparison.

Two: discount vendor-published comparative benchmarks accordingly. "X is 10× faster than your SIEM" reads as a marketing claim rather than a measurement when the customer can't independently verify it. That doesn't make the underlying performance claim false, but it does shift the burden of proof to whatever independent measurement source you can find, and that pool is small and getting smaller as the contract regime tightens.

Three: when an independent benchmark is offered, look at how the artifact is structured. Is the methodology actually published openly? Is the result reproducible by someone who runs the reference implementation under appropriate licensing? Is there a named external reviewer who has audited the methodology under NDA and signed off publicly? Those three questions are what separate independent measurement from vendor marketing dressed up as independent measurement, and most published security-tool benchmark coverage I've looked at fails at least one of them.

The comparison your team is structurally locked out of.

Independent benchmarks against your real workload, methodology open on the lab, reference implementation under one-page mutual NDA, annual external reviewer named publicly. The layer of measurement the current contract regime makes hard to produce in-house, produced in a way that respects every license involved.