H3-PERFORMANCE-01 · Tier B · 4.5/5
ClickHouse runs hunting-shaped aggregations 21–62× faster than a schema-on-read SIEM, 46.8× on the five-query average.
On a 10-million-event synthetic Zeek conn corpus, identical hardware and queries, ClickHouse native MergeTree
runs the hunting-shaped aggregations 21–62× faster than OpenSearch as the schema-on-read foil (top-source-IPs
21×, port-scan 62×), 46.8× on the five-query average — answer-equality verified, a single-node measurement so
far (Beelink 5800H, WSL2; Tier B, 2026-06-10). The average hides a clean split by query shape: OpenSearch's
index actually wins the simple lookups (protocol-distribution 3.4×, long-duration 1.8×, both in OpenSearch's favor). The
open-format Iceberg engines on the same workload came in at 3.6–10.1×, which is the price of keeping the storage
layer swappable. The columnar store under ZSTD-22 compresses 8.2× over raw JSON — the layout and the codec
together, not the codec alone. Methodology and result are public; the
reference implementation is shared under NDA with engagement prospects and qualifying reviewers.
You can see how this evidence rolls up into a recommendation on the worked scorecard.
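A minimal sketch of the measurement discipline described above: time both engines on the same query, refuse to report a ratio unless the answers match, and summarize with the spread as well as the average. The `Query` callables are hypothetical stand-ins for real engine clients, and the geometric mean is only a common companion summary; which kind of average the headline figure uses is not stated here.

```python
import statistics
import time
from typing import Callable, Dict, Hashable, Iterable, Tuple

Row = Tuple[Hashable, ...]
Query = Callable[[], Iterable[Row]]  # hypothetical: wraps one engine's client call


def timed(run_query: Query, repeats: int = 3) -> Tuple[float, frozenset]:
    """Run a query several times; return (best wall-clock seconds, result rows)."""
    best = float("inf")
    rows: frozenset = frozenset()
    for _ in range(repeats):
        start = time.perf_counter()
        # Aggregation rows are unique per group key, so a set gives an
        # order-insensitive comparison without losing information.
        rows = frozenset(run_query())
        best = min(best, time.perf_counter() - start)
    return best, rows


def compare(queries: Dict[str, Tuple[Query, Query]]) -> Dict[str, float]:
    """Per-query speedup of the first engine over the second, answer-equality enforced."""
    speedups: Dict[str, float] = {}
    for name, (columnar, foil) in queries.items():
        t_col, rows_col = timed(columnar)
        t_foil, rows_foil = timed(foil)
        if rows_col != rows_foil:
            raise ValueError(f"{name}: engines disagree, timings are not comparable")
        speedups[name] = t_foil / max(t_col, 1e-9)  # >1 means the first engine is faster
    return speedups


def summarize(speedups: Dict[str, float]) -> Dict[str, float]:
    """Report the spread alongside the averages so a split by query shape stays visible."""
    values = list(speedups.values())
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "geomean": statistics.geometric_mean(values),
    }
```

Reporting `min` and `max` next to the mean is the design choice that matters: a single average can hide queries where the ratio drops below 1 and the foil wins.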
The counter-position runs along two lines. First, "ClickHouse isn't a SIEM" — true, but the comparison is
on query-engine performance against the same workload, not against the broader SIEM feature set. Second,
"the workload was selected to favor columnar storage." The workload is a Zeek deployment shape, and the
result generalizes most cleanly to security workloads dominated by a small number of recurring queries.
A workload dominated by ad-hoc full-text search across raw events would narrow the gap.
What would change the answer: an independent benchmark on a workload with a meaningfully different shape
where the schema-on-read engine closes the gap, or evidence that the ClickHouse result fails to reproduce
on different hardware.
H1-COST-02 · Tier C · 3/5
Modern security data platforms cut downstream licensing costs 50–80%.
The 50–80% reduction is what a security data pipeline platform (Cribl, Tenzir, Vector, or equivalent)
saves on the SIEM licensing line by filtering, downsampling, and routing at ingest, before the per-GB
clock starts running. I anchor the magnitude to my own first-party MOAR cost model — a transparent
$/effective-GB build re-run against client data — and to the pipeline vendors' own published customer
reductions, not to a survey.
The counter-position is that "downstream licensing" isn't the only cost. The pipeline platform itself
carries an operational cost, and engineering time spent on routing logic is spend you'll feel. That's
correct; the 50–80% is the licensing line item, not the all-in TCO. Most engagements still come out
net-positive because the SIEM licensing line was the unforced error.
What would change the answer: a SIEM vendor moving to consumption pricing that prices the offset through,
or a pipeline platform pricing model that captures the savings on its own side.
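The distinction between the licensing line item and all-in TCO can be sketched as simple arithmetic. This is an illustration, not the MOAR cost model: the `PipelineScenario` class, its field names, and every figure passed to it are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class PipelineScenario:
    raw_gb_per_day: float         # volume arriving at the pipeline
    reduction_fraction: float     # share removed before the SIEM (0.5 to 0.8 per the claim)
    siem_price_per_gb: float      # hypothetical per-GB licensing rate
    pipeline_cost_per_day: float  # hypothetical platform plus engineering cost

    def licensing_before(self) -> float:
        return self.raw_gb_per_day * self.siem_price_per_gb

    def licensing_after(self) -> float:
        kept = self.raw_gb_per_day * (1 - self.reduction_fraction)
        return kept * self.siem_price_per_gb

    def licensing_savings_fraction(self) -> float:
        """Savings on the licensing line item only."""
        return 1 - self.licensing_after() / self.licensing_before()

    def net_savings_per_day(self) -> float:
        """Licensing savings minus what the pipeline itself costs to run."""
        return (self.licensing_before() - self.licensing_after()
                - self.pipeline_cost_per_day)
```

With per-GB pricing the licensing savings fraction equals the volume reduction by construction; the net figure is the one that turns negative if the pipeline's own cost grows large enough.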
H1-PLATFORM-01 · Tier B · 4.5/5
Iceberg + Dremio + Polaris is the strongest open-stack baseline for security data.
Apache Iceberg as the table format, Dremio as the query engine, and Polaris as the catalog (the metadata
layer that tells the engine what tables exist and where their data lives) is the configuration with the
most production-deployment evidence behind it. Netflix runs Iceberg at multi-petabyte scale; Insider
reports a 90% reduction in S3 storage cost after migrating onto this stack; Apple, AWS, Databricks, and
Snowflake have all announced first-class Iceberg support since early 2025.
The counter-position: Delta Lake (the Databricks-led table format) has comparable or better tooling
inside the Databricks ecosystem, and Apache Hudi has its adherents in heavy-streaming use cases. The
recommendation is workload-conditional, not categorical. But for the security-data workloads this practice
most often sees, the Iceberg stack ends up with the strongest combination of production validation and
engine portability.
What would change the answer: convergence of the table-format ecosystem (an active hypothesis on its own;
see contradictions below), or evidence that a Delta-native security-data deployment outperforms the Iceberg
equivalent on a comparable workload.
H-ARROW-01 · Tier B · 4/5
Apache Arrow has become the in-memory standard for cross-engine security data exchange.
Arrow is the columnar in-memory format engines use to hand data to each other without re-serialization.
The 2025–2026 adoption signal is consolidating: ClickHouse Inc. shipped an official ADBC driver in
February 2026 (the ClickHouse/adbc_clickhouse repo, replacing the prior community-driven
driver); Trino, DuckDB, and Polars use Arrow natively; Snowflake and Databricks return Arrow on the wire;
Elastic 8.16 added Arrow as a response format over REST. The ADBC Driver
Foundry (the adbc-drivers GitHub org) is the coordination point for cross-engine driver
development.
The counter-position is that Arrow doesn't replace JDBC or ODBC for legacy tools, and not every engine
reads or writes Arrow natively; some integrations still go through row-based intermediates with the
per-row serialization tax. That's correct; the claim is about trajectory, not universal adoption. The
operational read: "favor engines that speak Arrow on the wire," not "ban any engine that doesn't."
What would change the answer: a competing in-memory format gaining traction (no current evidence), or
Arrow's per-batch overhead at extreme cardinality proving worse than alternatives on a representative
security workload.
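The difference between a columnar handoff and a row-based intermediate can be illustrated with the standard library alone. This is not Arrow itself, only a sketch of the idea: a typed contiguous column shared by reference versus the same values encoded and re-parsed row by row. The column name `dst_port` and the values are made up for the example.

```python
import json
from array import array

# A column of destination ports held as one contiguous typed buffer
# (the layout idea behind a columnar in-memory format).
ports = array("H", [22, 443, 3389, 53])

# Columnar handoff: the consumer gets a view over the same memory.
# Nothing is copied and nothing is parsed.
shared = memoryview(ports)

# Row-based handoff: every row is encoded to text, then parsed again
# on the other side. This is the per-row serialization tax.
rows = [{"dst_port": p} for p in ports]
wire = [json.dumps(r) for r in rows]
decoded = [json.loads(w)["dst_port"] for w in wire]

# A write through the producer's buffer is visible through the view,
# which shows the two sides share memory; the decoded rows are a
# separate copy and do not change.
ports[0] = 2222
```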
H-AI-ASYMMETRY-01 · Tier B · 4.8/5 · Settled
AI-enabled offense is maturing 2–3× faster than AI-targeted defense.
Anthropic's Claude Mythos preview demonstrated autonomous vulnerability discovery at a 72.4% exploit
success rate on the Firefox SpiderMonkey engine, fully autonomous, no human guidance per target. The
AISLE response replicated the result with eight small open-weight models in zero-shot API calls, settling
the working assumption that the moat is the system, not the frontier model. Mandiant M-Trends 2026
records negative mean-time-to-exploit: exploitation is now arriving seven days before patch
release. CrowdStrike clocks its fastest recorded attacker breakout at 27 seconds from initial access to lateral movement.
GTG-1002, the first publicly confirmed AI-orchestrated state-sponsored campaign, was 80–90% AI-executed
across roughly 30 victims.
The counter-position is that defenders also benefit from AI tooling, and the asymmetry is temporary. Both
parts are true. The asymmetry is temporary; the working window is roughly 2024–2027. But "temporary" on
this scale means defenders need to compress years of detection-engineering modernization into a few
quarters, and the maturity gap between offensive and defensive AI tooling is wide enough that I treat
this as a planning constraint, not a watch-list item.
What would change the answer: a defensive AI breakthrough that produces measurable detection-cadence gains
across multiple production deployments, or evidence that the offensive results don't generalize beyond the
Mythos / AISLE benchmark workload.
H-COST-09 · Tier A · 5/5
Tiered storage cuts 55–90% of long-tail security data cost.
Splitting security data across hot, warm, cold, and archival storage tiers (hot on local high-performance storage,
warm on cheaper attached storage, cold on S3 Standard, archival on S3 Glacier or equivalent) reduces the
long-tail cost of retention by 55–90% in production environments. Netflix reports 70–80% of their storage
in cold tiers; Insider documents a 90% S3 cost reduction after tiering; Kafka supports tiered storage
natively (KIP-405, early access in 3.6 and production-ready in 3.9), which has materially changed the
operational story for streaming security telemetry.
The counter-position is the freshness trade-off. Warm and cold tiers have higher first-byte latency than
hot tiers; queries that span tier boundaries pay a complexity tax. For interactive threat hunting on
recent data, the tiering boundary needs to land further back than for compliance retention. Most
engagements end up with a tier boundary at 7–30 days; some land at 72 hours where the analyst hunting
window is short.
What would change the answer: object-storage pricing collapsing to the point where tiering's operational
complexity isn't worth the savings, or query-engine improvements that make cross-tier queries effectively
free.
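A rough sketch of the tiering arithmetic, assuming data is assigned to a tier purely by age and priced per GB. The 7 and 30 day boundaries come from the range above; the 365 day archive boundary, the function names, and any prices passed in are hypothetical, not quoted list prices.

```python
from bisect import bisect_right
from typing import Dict, Sequence

TIERS = ("hot", "warm", "cold", "archive")


def tier_for_age(age_days: float,
                 boundaries_days: Sequence[float] = (7, 30, 365)) -> str:
    """Pick a tier by data age; data exactly at a boundary moves to the colder tier."""
    return TIERS[bisect_right(boundaries_days, age_days)]


def tiering_savings(tier_share: Dict[str, float],
                    price_per_gb: Dict[str, float],
                    baseline_tier: str = "hot") -> float:
    """Fraction saved versus keeping every byte in the baseline tier."""
    if abs(sum(tier_share.values()) - 1.0) > 1e-9:
        raise ValueError("tier shares must sum to 1")
    blended = sum(share * price_per_gb[tier] for tier, share in tier_share.items())
    return 1 - blended / price_per_gb[baseline_tier]
```

The savings depend on two inputs only: how much of the estate sits past the hot boundary, and how steep the price ratio between tiers is. Moving the boundary further back for threat hunting raises the hot share and lowers the result.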
H3-INTEGRATION-03 · Tier B · 4/5
OCSF earns its keep as a normalization hygiene baseline, not as the schema everyone finally agrees on.
OCSF (the Open Cybersecurity Schema Framework) is the shared event-shape standard adopted by 180+
organizations, including AWS Security Lake, Cisco, Sumo Logic, IBM, Cloudflare, and many of the EDR
vendors, and ITU-T Study Group 17 is standardizing it as Recommendation X.icd-schemas, with member states
backing it for ratification in December 2025 and adoption targeted for mid-2026. Where I think OCSF actually
pays off is narrower than the adoption numbers suggest. As a fixed reference shape that a pipeline can be
graded against, it catches the silent field-mapping failures that schema-on-read hides until a detection
quietly stops firing. In one recursive-mapping pipeline I watched that conformance reference take an LLM's
mapping accuracy from roughly 80% to 95%, precisely because the model had something to be checked against.
That machine-grounding value is real, and it's the part I'd build on. I'm more bearish on OCSF as a human
lingua franca, the version where every vendor converges its analyst-facing semantics onto one shared model.
When I mapped six source schemas (CIM, UDM, ASIM, ECS, OpenTelemetry, Zeek) into OCSF 1.8.0, the same five
normalization gaps recurred across all of them, and those gaps are OCSF's own (among them a required
severity_id with no source field to fill it, missing disposition on several activity classes, and one-to-many
change routing). The mapping therefore stays expert labor rather than collapsing into a standard a SOC
analyst can wield by hand. Sigma plays the matching role one layer up, keeping detection logic portable
across engines so a schema change and an engine change don't both force a detection rewrite.
The honest counter-position isn't scale; OCSF runs in production at multi-petabyte-per-day volumes
today. The real objection is that an open schema doesn't by itself defeat lock-in: the governance and authorization layers
above the schema are where the remaining vendor capture lives, and OCSF says nothing about those. An open
event shape sitting on a closed catalog is still a closed system.
What would change the answer: a competing schema gaining vendor-consortium momentum, or OCSF normalization
proving materially lossy on a major source class at scale, forcing per-vendor escape hatches that
re-introduce the lock-in the schema was meant to remove.
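The "fixed reference shape a pipeline can be graded against" idea can be sketched as a conformance check that runs on every mapped event. The required-field list is an illustrative subset, the `map_zeek_conn` mapping and its numeric IDs are assumptions for the example, and the real per-class requirements should come from the OCSF schema itself.

```python
from typing import Any, Dict, List, Sequence

# Illustrative subset only; the authoritative list is per class in the OCSF schema.
REQUIRED_FIELDS = ("class_uid", "activity_id", "severity_id", "time")


def conformance_gaps(event: Dict[str, Any],
                     required: Sequence[str] = REQUIRED_FIELDS) -> List[str]:
    """Return required attributes that are absent or null in a mapped event."""
    return [field for field in required if event.get(field) is None]


def map_zeek_conn(record: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical Zeek conn mapping; IDs are assumed values for illustration."""
    return {
        "class_uid": 4001,
        "activity_id": 6,
        "time": record.get("ts"),
        "src_endpoint": {"ip": record.get("id.orig_h")},
        "dst_endpoint": {"ip": record.get("id.resp_h"),
                         "port": record.get("id.resp_p")},
        # No severity_id: the source record has no field to fill it.
    }
```

Run against a well-formed record, the check surfaces the structural gap (`severity_id`). Run against a record whose timestamp field was renamed upstream, it also surfaces `time`, which is the silent failure that would otherwise only show up as a detection that stops firing.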
H-SIGMA-01 · Tier B · 3.5/5 · Caveat
Sigma extends portability to the detection layer. Strong for atomic rules, bounded for correlation.
Open data formats and an open schema move security telemetry across tools without translation tax; what
they don't move is the detection content itself, which is still authored in whatever query language the
engine happens to speak. Sigma fills that gap: a rule is written once in YAML and compiled to ClickHouse SQL,
Splunk SPL, Sentinel KQL, or Elastic ES|QL via pySigma. SigmaHQ ships thousands of community rules, and
adoption is broadening (Anvilogic, Panther, and hand-rolled patterns in production SOCs).
Caveat: portability is strong for atomic detections (single-event matching), weaker for
stateful correlation (sliding windows, sequence matching, deduplication). Vendor-native languages tend to
be more expressive than Sigma's intermediate representation, and lossy conversions exist. Sigma 2.0
correlation extensions are not yet proven at production scale. The honest framing is "detection
portability is real, but bounded — author the atomic layer in Sigma, accept that the correlation
layer may stay engine-specific."
What would change the answer: a competing detection standard gaining vendor-consortium momentum, or Sigma
2.0 correlation proving unworkable on production-scale stateful detections.
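Why atomic rules port cleanly can be shown with a toy translator. This is deliberately not pySigma and not the Sigma rule format: it assumes a rule reduced to a flat field-equals-value selection, and the function names are hypothetical. Real conversion also handles modifiers, field-name mapping, and fuller escaping.

```python
from typing import Dict, Union

Value = Union[str, int]
Selection = Dict[str, Value]  # flat field == value conditions, implicitly ANDed


def _sql_literal(value: Value) -> str:
    if isinstance(value, int):
        return str(value)
    return "'" + value.replace("'", "''") + "'"  # double single quotes for SQL


def _spl_literal(value: Value) -> str:
    if isinstance(value, int):
        return str(value)
    return '"' + value.replace('"', '\\"') + '"'  # backslash-escape double quotes


def to_sql_where(selection: Selection) -> str:
    """Render the selection as a SQL WHERE-clause body."""
    return " AND ".join(f"{field} = {_sql_literal(v)}"
                        for field, v in selection.items())


def to_spl(selection: Selection) -> str:
    """Render the selection as SPL search terms (adjacent terms are ANDed)."""
    return " ".join(f"{field}={_spl_literal(v)}"
                    for field, v in selection.items())
```

A single-event match is a boolean expression over fields, so each target only needs its own literal and operator syntax. Sliding windows, sequences, and deduplication have no such one-to-one rendering, which is where the portability bound described above comes from.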
H-IMPL-01 · Tier D · 2/5 · Reasoned
Streaming architectures cost meaningfully more to operate than batch equivalents.
Real-time streaming architectures cost more to operate than equivalent batch architectures, and the
premium shows up in three places: specialized staffing (streaming competence is scarcer and more
expensive to hire and retain than batch ETL skills), infrastructure redundancy (always-on Kafka and Flink
clusters rather than scheduled spot batch), and incident-management surface (an always-on pipeline has
more that can fail unattended than a batch job that simply reruns). The exact multiplier depends on the
estate; the direction is consistent.
Caveat: this is a reasoned operational claim from the mechanism above, not a measured multiplier. I've
dropped the specific premiums some write-ups attach here (a DORA staffing figure, IDC / Enterprise Data
Quarterly incident rates) because they don't survive a primary-source check, so the direction stands on
the mechanism, not a cited number. Security workloads have particular shapes (high-cardinality entity
resolution, bursty incident-driven query loads, regulatory retention) that may
shift the gap in either direction. A security-specific TCO comparison (streaming SIEM vs batch lake on the
same workload) is on the research backlog.
The counter-position is that detection latency requirements force streaming for some workloads regardless
of cost. True; the implication isn't "don't stream" — it's "stream what needs streaming, batch the rest,
and don't pretend the operational cost differential isn't real."
What would change the answer: a published security-specific TCO comparison — either confirming the general
data-engineering finding or showing the security workload shape narrows the gap.
H-NDR-FEDERATION-01 · Tier B · 4/5
Federated search architecture determines NDR platform stickiness.
The network-detection-and-response (NDR) market is consolidating around platforms that can federate
search across multiple data sources without forcing centralization first. The argument is capability-led,
not cost-led: federated query enables cross-site joins, data sovereignty (EU data stays in EU
jurisdictions), 3.6–10.1× query performance against the right data platform, and 93–99.9% wide-area-network
traffic reduction by querying data where it lives rather than shipping it home. Centralized SIEMs cannot
deliver any of these at any price.
More than 50 federated security implementations are publicly documented. ExtraHop is an early Security
Lake federation partner; AWS Security Lake plus Athena is becoming the lowest-effort bridge for shops
already on AWS; CrowdStrike has not yet shipped a standardized federation API, which is a competitive
opening for the platforms that have.
What would change the answer: a centralized SIEM vendor solving the cross-site / data-sovereignty problem
without forcing centralization (no current evidence this is happening), or evidence that federation
performance overhead at scale is worse than the early benchmarks suggest.
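The wide-area-traffic argument rests on pushing the aggregation down to where the data lives and shipping only the partial results. A minimal sketch, assuming each site can run a local group-by and that JSON size is a fair proxy for bytes on the wire; the site layout, event fields, and function names are hypothetical.

```python
import json
from collections import Counter
from typing import Dict, List, Tuple

Event = Dict[str, object]
Sites = Dict[str, List[Event]]


def local_top_sources(events: List[Event]) -> Counter:
    """Runs at the site: count events per source IP without moving raw data."""
    return Counter(str(e["src_ip"]) for e in events)


def federated_top_sources(sites: Sites, k: int) -> Tuple[List[Tuple[str, int]], int]:
    """Merge per-site counts centrally; return (top-k, bytes shipped over the WAN)."""
    merged: Counter = Counter()
    shipped = 0
    for events in sites.values():
        # Ship the full per-site counter, not a local top-k, so the merge is exact.
        partial = local_top_sources(events)
        shipped += len(json.dumps(partial).encode())
        merged.update(partial)
    return merged.most_common(k), shipped


def centralized_bytes(sites: Sites) -> int:
    """Bytes shipped if every raw event is sent home before querying."""
    return sum(len(json.dumps(e).encode())
               for events in sites.values() for e in events)


def wan_reduction(sites: Sites, k: int = 10) -> float:
    """Fraction of WAN traffic avoided by querying in place."""
    _, shipped = federated_top_sources(sites, k)
    return 1 - shipped / centralized_bytes(sites)
```

The reduction scales with the ratio of raw events to distinct group keys, so it is largest for low-cardinality aggregations and shrinks as the partial results approach the size of the raw data.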