Security Data Works

Open notebook

Architecture research, in the open.

The active hypotheses, the contradictions where prior positions had to be revised, and the method in practice. The practice keeps a public update log because the willingness to be wrong on the record is the credibility move. Programs that never update their priors aren't testing them.

How positions form

Evidence tiers. Update on contact.

Every claim I put in a recommendation gets graded by the kind of evidence it rests on. The tiers determine whether a claim makes it into a recommendation at all, and how loudly I'll defend it under pushback.

  • Tier A is production evidence: a named operator running the system at scale and reporting what works and what doesn't. Netflix on Iceberg at multi-petabyte scale. Insider's 90% S3 cost reduction. My own engagements count even when the client is anonymized, but a published architecture with no named operator and no one to ask about production reality does not. Tier A claims anchor recommendations.
  • Tier B is peer-reviewed research, analyst reports from Gartner / Forrester / IDC, expert consensus across multiple independent sources, or a controlled first-party benchmark. My ClickHouse result on the Zeek analytical-workload benchmark — 46.8× on the five-query average, 21–62× on the hunting-shaped aggregations, single-node — sits here until a named production deployment corroborates it.
  • Tier C is expert opinion or framework inclusion. Useful but partial, usually a triangulation ingredient rather than the primary source.
  • Tier D is vendor marketing. Excluded by default unless corroborated by Tier A or B evidence.

The update rule: positions evolve when new evidence overturns them. Twenty-two contradictions are tracked across the portfolio, each one a place where a prior position got revised on contact with new data. Several of those changed what's in the matrix, what gets recommended in engagements, or how a benchmark gets framed. Surfacing them publicly is part of the deal. The alternative is a research practice that moves the goalposts and hopes no one notices.

Vendor claims get held to the same evidence standard as everyone else's. A vendor who publishes a benchmark with methodology and reproduction code (which is rare) earns Tier B, and reaches Tier A only when a named production deployment stands behind the result. A vendor who publishes a number with no methodology gets a Tier D rating until something independent corroborates it. This page is updated quarterly; significant new evidence gets folded in faster.
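The grading rules above can be written down as a small decision function. This is a minimal sketch of how I read the tiers, not a published tool: the `Evidence` field names and the `usable_in_recommendation` helper are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    A = "production evidence from a named operator"
    B = "research, analyst report, independent consensus, or controlled benchmark"
    C = "expert opinion or framework inclusion"
    D = "vendor marketing"


@dataclass
class Evidence:
    # Field names are hypothetical; they mirror the tier descriptions above.
    named_production_operator: bool = False  # named operator or first-party engagement
    peer_reviewed_or_analyst: bool = False   # peer review, Gartner / Forrester / IDC
    independent_consensus: bool = False      # multiple independent expert sources
    controlled_benchmark: bool = False       # methodology plus reproduction code
    expert_opinion: bool = False


def grade(e: Evidence) -> Tier:
    """Grade by the kind of evidence, regardless of who published it."""
    if e.named_production_operator:
        return Tier.A
    if e.peer_reviewed_or_analyst or e.independent_consensus or e.controlled_benchmark:
        return Tier.B
    if e.expert_opinion:
        return Tier.C
    return Tier.D  # a number with no methodology behind it


def usable_in_recommendation(e: Evidence, corroborated_by: Optional[Tier] = None) -> bool:
    """Tier D is excluded by default unless Tier A or B evidence corroborates it."""
    if grade(e) is Tier.D:
        return corroborated_by in (Tier.A, Tier.B)
    return True
```

A vendor benchmark with methodology and reproduction code grades as `Tier.B` here, and only moves to `Tier.A` once `named_production_operator` is also true, which is the same path any other source takes.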

Read more on the method → The methodology page covers the contradiction-hunting workflow, the five patterns that the 25 hypotheses in the methodology-validation sample revealed about vendor claims, and the four practices that move the floor on architecture decisions.

Active research — anchor hypotheses

Anchor hypotheses, with the evidence each one rests on.

Selected from the 70+ hypothesis portfolio (the 25 carried through full methodology validation are described on the methodology page). These are the claims most likely to show up in an engagement recommendation, including one per foundational standard (Arrow, Iceberg, OCSF, Sigma). Each card names what would change the answer.

H3-PERFORMANCE-01 · Tier B · 4.5/5

ClickHouse runs hunting-shaped aggregations 21–62× faster than a schema-on-read SIEM, 46.8× on the five-query average.

On a 10-million-event synthetic Zeek conn corpus, identical hardware and queries, ClickHouse native MergeTree runs the hunting-shaped aggregations 21–62× faster than OpenSearch as the schema-on-read foil (top-source-IPs 21×, port-scan 62×), 46.8× on the five-query average — answer-equality verified, a single-node measurement so far (Beelink 5800H, WSL2; Tier B, 2026-06-10). The average hides a clean split by query shape: the index actually wins the simple lookups (protocol-distribution 3.4×, long-duration 1.8× in OpenSearch's favor). The open-format Iceberg engines on the same workload came in at 3.6–10.1×, which is the price of keeping the storage layer swappable. The columnar store under ZSTD-22 compresses 8.2× over raw JSON — the layout and the codec together, not the codec alone. Methodology and result are public; the reference implementation is shared under NDA with engagement prospects and qualifying reviewers. You can see how this evidence rolls up into a recommendation on the worked scorecard.
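To see why a single average can hide a split by query shape, it helps to compute the speedup per query and then summarize per shape. The sketch below shows only the arithmetic. The timings are hypothetical placeholders, not the benchmark's measurements, and the query names and shape labels are assumptions for illustration.

```python
from statistics import geometric_mean, mean

# Hypothetical wall-clock timings in seconds: (shape, baseline engine, candidate engine).
# Placeholders to show the arithmetic only; NOT the published benchmark numbers.
TIMINGS = {
    "top_source_ips": ("aggregation", 4.0, 0.25),
    "port_scan": ("aggregation", 8.0, 0.125),
    "protocol_distribution": ("lookup", 0.5, 1.0),
}


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Ratio above 1 means the candidate is faster; below 1 means the baseline wins."""
    return baseline_s / candidate_s


def summarize(timings: dict) -> dict:
    ratios = []
    per_shape = {}
    for shape, baseline_s, candidate_s in timings.values():
        ratio = speedup(baseline_s, candidate_s)
        ratios.append(ratio)
        per_shape.setdefault(shape, []).append(ratio)
    return {
        # One headline number, two ways; outliers pull the arithmetic mean up.
        "arithmetic_mean": mean(ratios),
        "geometric_mean": geometric_mean(ratios),
        # The per-shape view is where a split between shapes shows up.
        "per_shape": {shape: mean(values) for shape, values in per_shape.items()},
    }
```

With these placeholder inputs the arithmetic mean lands far above the geometric mean, and the lookup shape comes out below 1 (the baseline wins), which is the kind of split a single headline number does not show.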

The counter-position runs along two lines. First, "ClickHouse isn't a SIEM" — true, but the comparison is on query-engine performance against the same workload, not against the broader SIEM feature set. Second, "the workload was selected to favor columnar storage." The workload is a Zeek deployment shape, and the result generalizes most cleanly to security workloads dominated by a small number of recurring queries. A workload dominated by ad-hoc full-text search across raw events would narrow the gap.

What would change the answer: an independent benchmark on a workload with a meaningfully different shape where the schema-on-read engine closes the gap, or evidence that the ClickHouse result fails to reproduce on different hardware.

H1-COST-02 · Tier C · 3/5

Modern security data platforms cut downstream licensing costs 50–80%.

The 50–80% reduction is what a security data pipeline platform (Cribl, Tenzir, Vector, or equivalent) saves on the SIEM licensing line by filtering, downsampling, and routing at ingest, before the per-GB clock starts running. I anchor the magnitude to my own first-party MOAR cost model — a transparent $/effective-GB build re-run against client data — and to the pipeline vendors' own published customer reductions, not to a survey.

The counter-position is that "downstream licensing" isn't the only cost. The pipeline platform itself carries an operational cost, and engineering time spent on routing logic is spend you'll feel. That's correct; the 50–80% is the licensing line item, not the all-in TCO. Most engagements still come out net-positive because the SIEM licensing line was the unforced error.
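The difference between the licensing line and the all-in number is simple arithmetic. This is a minimal sketch, assuming linear per-GB SIEM pricing; it is not the MOAR cost model, and every parameter name and input value is a hypothetical placeholder.

```python
def licensing_savings(daily_gb: float, price_per_gb: float, reduction: float) -> float:
    """Annual SIEM licensing avoided when `reduction` of ingest is dropped before the per-GB meter.

    Assumes linear per-GB pricing, which real contracts often are not.
    """
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be a fraction between 0 and 1")
    return daily_gb * 365 * price_per_gb * reduction


def net_savings(
    daily_gb: float,
    price_per_gb: float,
    reduction: float,
    pipeline_annual_cost: float,
    engineering_annual_cost: float,
) -> float:
    """Licensing savings minus what the pipeline platform and routing work cost to run."""
    gross = licensing_savings(daily_gb, price_per_gb, reduction)
    return gross - pipeline_annual_cost - engineering_annual_cost
```

The 50–80% claim corresponds to `reduction` on the licensing line; the counter-position is about the two cost terms that `net_savings` subtracts.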

What would change the answer: a SIEM vendor moving to consumption pricing that prices the offset through, or a pipeline platform pricing model that captures the savings on its own side.

H1-PLATFORM-01 · Tier B · 4.5/5

Iceberg + Dremio + Polaris is the strongest open-stack baseline for security data.

Apache Iceberg as the table format, Dremio as the query engine, and Polaris as the catalog (the metadata layer that tells the engine what tables exist and where their data lives) is the configuration with the most production-deployment evidence behind it. Netflix runs Iceberg at multi-petabyte scale; Insider reports a 90% reduction in S3 storage cost after migrating onto this stack; Apple, AWS, Databricks, and Snowflake have all announced first-class Iceberg support since early 2025.

The counter-position: Delta Lake (the Databricks-led table format) has comparable or better tooling inside the Databricks ecosystem, and Apache Hudi has its adherents in heavy-streaming use cases. The recommendation is workload-conditional, not categorical. But for the security-data workloads this practice most often sees, the Iceberg stack ends up with the strongest combination of production validation and engine portability.

What would change the answer: convergence of the table-format ecosystem (an active hypothesis on its own; see contradictions below), or evidence that a Delta-native security-data deployment outperforms the Iceberg equivalent on a comparable workload.

H-ARROW-01 · Tier B · 4/5

Apache Arrow has become the in-memory standard for cross-engine security data exchange.

Arrow is the columnar in-memory format engines use to hand data to each other without re-serialization. The 2025–2026 adoption signal is consolidating: ClickHouse Inc. shipped an official ADBC driver in February 2026 (the ClickHouse/adbc_clickhouse repo, replacing the prior community-driven driver); Trino, DuckDB, and Polars use Arrow natively; Snowflake and Databricks return Arrow on the wire; Elastic 8.16 added Arrow as a response format over REST. The ADBC Driver Foundry (the adbc-drivers GitHub org) is the coordination point for cross-engine driver development.

The counter-position is that Arrow doesn't replace JDBC or ODBC for legacy tools, and not every engine reads or writes Arrow natively; some integrations still go through row-based intermediates with the per-row serialization tax. That's correct; the claim is about trajectory, not universal adoption. The operational read: "favor engines that speak Arrow on the wire," not "ban any engine that doesn't."

What would change the answer: a competing in-memory format gaining traction (no current evidence), or Arrow's per-batch overhead at extreme cardinality proving worse than alternatives on a representative security workload.

H-AI-ASYMMETRY-01 · Tier B · 4.8/5 · Settled

AI-enabled offense is maturing 2–3× faster than AI-targeted defense.

Anthropic's Claude Mythos preview demonstrated autonomous vulnerability discovery at a 72.4% exploit success rate on the Firefox SpiderMonkey engine, fully autonomous, no human guidance per target. The AISLE response replicated the result with eight small open-weight models in zero-shot API calls, settling the working assumption that the moat is the system, not the frontier model. Mandiant M-Trends 2026 records negative mean-time-to-exploit: exploitation is now arriving seven days before patch release. CrowdStrike clocks its fastest recorded attacker breakout at 27 seconds from initial access to lateral movement. GTG-1002, the first publicly confirmed AI-orchestrated state-sponsored campaign, was 80–90% AI-executed across roughly 30 victims.

The counter-position is that defenders also benefit from AI tooling, and the asymmetry is temporary. Both parts are true. The asymmetry is temporary; the working window is roughly 2024–2027. But "temporary" on this scale means defenders need to compress years of detection-engineering modernization into a few quarters, and the maturity gap between offensive and defensive AI tooling is wide enough that I treat this as a planning constraint, not a watch-list item.

What would change the answer: a defensive AI breakthrough that produces measurable detection-cadence gains across multiple production deployments, or evidence that the offensive results don't generalize beyond the Mythos / AISLE benchmark workload.

H-COST-09 · Tier A · 5/5

Tiered storage cuts 55–90% of long-tail security data cost.

Splitting security data across hot, warm, cold, and archival storage tiers (hot on local high-performance storage, warm on cheaper attached storage, cold on S3 Standard, archival on S3 Glacier or equivalent) reduces the long-tail cost of retention by 55–90% in production environments. Netflix reports 70–80% of their storage in cold tiers; Insider documents a 90% S3 cost reduction after tiering; Apache Kafka now supports tiered storage natively (KIP-405, early access in 3.6 and production-ready in 3.9), which has materially changed the operational story for streaming security telemetry.

The counter-position is the freshness trade-off. Warm and cold tiers have higher first-byte latency than hot tiers; queries that span tier boundaries pay a complexity tax. For interactive threat hunting on recent data, the tiering boundary needs to land further back than for compliance retention. Most engagements end up with a tier boundary at 7–30 days; some land at 72 hours where the analyst hunting window is short.
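How far back the tier boundary sits drives how much of the retained data stays on expensive storage. The sketch below shows that relationship under two loud assumptions: daily volume is uniform across the retention window, and cost is storage price per GB only (no retrieval or request fees). The tier names and any prices passed in are hypothetical.

```python
def shares_from_boundaries(retention_days: int, hot_days: int, warm_until_day: int) -> dict:
    """Share of retained data in each tier, assuming uniform daily volume."""
    if not 0 < hot_days <= warm_until_day <= retention_days:
        raise ValueError("expected 0 < hot_days <= warm_until_day <= retention_days")
    return {
        "hot": hot_days / retention_days,
        "warm": (warm_until_day - hot_days) / retention_days,
        "cold": (retention_days - warm_until_day) / retention_days,
    }


def savings_vs_all_hot(shares: dict, price_per_gb: dict) -> float:
    """Fractional storage saving relative to keeping everything on the hot tier."""
    blended = sum(share * price_per_gb[tier] for tier, share in shares.items())
    return 1.0 - blended / price_per_gb["hot"]
```

For example, `shares_from_boundaries(365, 7, 30)` models a one-year retention window with a 7-day hot tier and a warm tier out to day 30; pushing `hot_days` out to cover a longer hunting window lowers the saving, which is the freshness trade-off in numeric form.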

What would change the answer: object-storage pricing collapsing to the point where tiering's operational complexity isn't worth the savings, or query-engine improvements that make cross-tier queries effectively free.

H3-INTEGRATION-03 · Tier B · 4/5

OCSF earns its keep as a normalization hygiene baseline, not as the schema everyone finally agrees on.

OCSF (the Open Cybersecurity Schema Framework) is the shared event-shape standard adopted by 180+ organizations, including AWS Security Lake, Cisco, Sumo Logic, IBM, Cloudflare, and many of the EDR vendors. ITU-T Study Group 17 is standardizing it as Recommendation X.icd-schemas, with member states backing it for ratification in December 2025 and adoption targeted for mid-2026.

Where I think OCSF actually pays off is narrower than the adoption numbers suggest. As a fixed reference shape that a pipeline can be graded against, it catches the silent field-mapping failures that schema-on-read hides until a detection quietly stops firing. In one recursive-mapping pipeline I watched that conformance reference take an LLM's mapping accuracy from roughly 80% to 95%, precisely because the model had something to be checked against. That machine-grounding value is real, and it's the part I'd build on.

I'm more bearish on OCSF as a human lingua franca, the version where every vendor converges its analyst-facing semantics onto one shared model. When I mapped six source schemas (CIM, UDM, ASIM, ECS, OpenTelemetry, Zeek) into OCSF 1.8.0, the same five normalization gaps recurred across all of them, and those gaps are OCSF's own (required severity_id with no source field to fill it, missing disposition on several activity classes, one-to-many change routing). The mapping therefore stays expert labor rather than collapsing into a standard a SOC analyst can wield by hand.

Sigma plays the matching role one layer up, keeping detection logic portable across engines so a schema change and an engine change don't both force a detection rewrite.
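The "fixed reference shape a pipeline can be graded against" idea can be sketched as a mapper plus a conformance check. This is an illustration, not the pipeline from the engagement: the `REQUIRED` list is a simplified assumption rather than the full OCSF 1.8.0 Network Activity definition, and the function names are hypothetical. The Zeek field names follow the standard conn.log layout.

```python
# Simplified required-field list for illustration only; an assumption,
# not the complete OCSF 1.8.0 Network Activity class definition.
REQUIRED = ("class_uid", "category_uid", "activity_id", "type_uid", "severity_id", "time")

NETWORK_ACTIVITY_CLASS_UID = 4001  # OCSF Network Activity class
NETWORK_CATEGORY_UID = 4           # OCSF Network Activity category


def zeek_conn_to_ocsf(rec: dict) -> dict:
    """Map one Zeek conn.log record onto an OCSF-shaped event."""
    activity_id = 0  # 0 = Unknown; conn.log does not say which activity applies
    return {
        "class_uid": NETWORK_ACTIVITY_CLASS_UID,
        "category_uid": NETWORK_CATEGORY_UID,
        "activity_id": activity_id,
        "type_uid": NETWORK_ACTIVITY_CLASS_UID * 100 + activity_id,
        "time": int(float(rec["ts"]) * 1000),  # OCSF timestamps are epoch milliseconds
        "src_endpoint": {"ip": rec.get("id.orig_h"), "port": rec.get("id.orig_p")},
        "dst_endpoint": {"ip": rec.get("id.resp_h"), "port": rec.get("id.resp_p")},
        "connection_info": {"protocol_name": rec.get("proto")},
        # severity_id is deliberately not set: Zeek conn.log has no field to
        # fill it from, so the check below makes that gap visible.
    }


def conformance_gaps(event: dict) -> list:
    """Return the reference fields the mapped event fails to populate."""
    gaps = [field for field in REQUIRED if event.get(field) is None]
    for side in ("src_endpoint", "dst_endpoint"):
        if not (event.get(side) or {}).get("ip"):
            gaps.append(f"{side}.ip")
    return gaps
```

Run against a well-formed conn record, the check reports only `severity_id`, which is the "required field with no source field to fill it" gap described above. Run against a record where an upstream rename dropped `id.orig_h`, it also reports `src_endpoint.ip`, which is the silent mapping failure a schema-on-read store would not surface.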

The honest counter-position isn't scale. OCSF runs in production at multi-petabyte-per-day volumes today. It's that an open schema doesn't by itself defeat lock-in: the governance and authorization layers above the schema are where the remaining vendor capture lives, and OCSF says nothing about those. An open event shape sitting on a closed catalog is still a closed system.

What would change the answer: a competing schema gaining vendor-consortium momentum, or OCSF normalization proving materially lossy on a major source class at scale, forcing per-vendor escape hatches that re-introduce the lock-in the schema was meant to remove.

H-SIGMA-01 · Tier B · 3.5/5 · Caveat

Sigma extends portability to the detection layer. Strong for atomic rules, bounded for correlation.

Open data formats and an open schema move security telemetry across tools without translation tax; what they don't move is the detection content itself, which is still authored in whatever query language the engine happens to speak. Sigma fills that gap: a YAML rule that compiles to ClickHouse SQL, Splunk SPL, Sentinel KQL, or Elastic ES|QL via pySigma. SigmaHQ ships thousands of community rules; vendor adoption is broadening (Anvilogic, Panther, hand-rolled patterns in production SOCs).

Caveat: portability is strong for atomic detections (single-event matching), weaker for stateful correlation (sliding windows, sequence matching, deduplication). Vendor-native languages tend to be more expressive than Sigma's intermediate representation, and lossy conversions exist. Sigma 2.0 correlation extensions are not yet proven at production scale. The honest framing is "detection portability is real, but bounded — author the atomic layer in Sigma, accept that the correlation layer may stay engine-specific."

What would change the answer: a competing detection standard gains vendor-consortium momentum, or Sigma 2.0 correlation proves unworkable on production-scale stateful detections.

H-IMPL-01 · Tier D · 2/5 · Reasoned

Streaming architectures cost meaningfully more to operate than batch equivalents.

Real-time streaming architectures cost more to operate than equivalent batch architectures, and the premium shows up in three places: specialized staffing (streaming competence is scarcer and more expensive to hire and retain than batch ETL skills), infrastructure redundancy (always-on Kafka and Flink clusters rather than scheduled spot batch), and incident-management surface (an always-on pipeline has more that can fail unattended than a batch job that simply reruns). The exact multiplier depends on the estate; the direction is consistent.

Caveat: this is a reasoned operational claim from the mechanism above, not a measured multiplier. I've dropped the specific premiums some write-ups attach here (a DORA staffing figure, IDC / Enterprise Data Quarterly incident rates) because they don't survive a primary-source check, so the direction stands on the mechanism, not a cited number. Security workloads have particular shapes (high-cardinality entity resolution, bursty incident-driven query loads, regulatory retention) that may shift the gap in either direction. A security-specific TCO comparison (streaming SIEM vs batch lake on the same workload) is on the research backlog.

The counter-position is that detection latency requirements force streaming for some workloads regardless of cost. True; the implication isn't "don't stream" — it's "stream what needs streaming, batch the rest, and don't pretend the operational cost differential isn't real."

What would change the answer: a published security-specific TCO comparison, either confirming the direction argued above or showing that the security workload shape narrows the gap.

H-NDR-FEDERATION-01 · Tier B · 4/5

Federated search architecture determines NDR platform stickiness.

The network-detection-and-response (NDR) market is consolidating around platforms that can federate search across multiple data sources without forcing centralization first. The argument is capability-led, not cost-led: federated query enables cross-site joins, data sovereignty (EU data stays in EU jurisdictions), 3.6–10.1× query performance against the right data platform, and 93–99.9% wide-area-network traffic reduction by querying data where it lives rather than shipping it home. Centralized SIEMs cannot deliver any of these at any price.
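The WAN-reduction argument comes from pushing the aggregation to where the data lives and shipping only partial results home. A minimal sketch of that pattern follows; the site names, event fields, and the use of JSON byte length as a stand-in for wire size are all assumptions for illustration, not a description of any vendor's federation API.

```python
import json
from collections import Counter


def site_partial(events: list, key: str) -> Counter:
    """Runs at the site: aggregate locally and return only the counts."""
    return Counter(event[key] for event in events)


def federated_top_n(sites: dict, key: str, n: int) -> list:
    """Runs centrally: merge per-site partial counts into a global top-N."""
    total = Counter()
    for events in sites.values():
        total.update(site_partial(events, key))
    return total.most_common(n)


def wan_reduction(sites: dict, key: str) -> float:
    """Fraction of bytes kept off the WAN by shipping partials instead of raw events.

    Uses JSON length as a rough proxy for wire size.
    """
    raw = sum(len(json.dumps(event)) for events in sites.values() for event in events)
    shipped = sum(len(json.dumps(site_partial(events, key))) for events in sites.values())
    return 1.0 - shipped / raw
```

The raw events never leave the site in `federated_top_n`, which is also what makes the data-sovereignty case: only aggregates cross the jurisdiction boundary. How close `wan_reduction` gets to the upper end of the quoted range depends on how many distinct keys each site returns relative to its raw event volume.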

More than 50 federated security implementations are publicly documented. ExtraHop is an early Security Lake federation partner; AWS Security Lake plus Athena is becoming the lowest-effort bridge for shops already on AWS; CrowdStrike has not yet shipped a standardized federation API, which is a competitive opening for the platforms that have.

What would change the answer: a centralized SIEM vendor solving the cross-site / data-sovereignty problem without forcing centralization (no current evidence this is happening), or evidence that federation performance overhead at scale is worse than the early benchmarks suggest.

Deeper analyses

Selected hypotheses extended into long-form deep dives.

Each links back to its anchor hypothesis above; each carries the same evidence-tier framing, with more space for the underlying argument and the conditions that would change it.

Things I changed my mind on

Two contradictions, with the evidence that forced the update.

Twenty-two contradictions are tracked across the research portfolio. Two are surfaced here in full prose; two more in summary. Each one is a place where a recommendation got revised because a specific person's specific argument wouldn't fit the prior frame.

ClickHouse is cheap. ClickHouse is expensive. The baseline determines which.

For most of 2025 my recommendation language was "ClickHouse is cost-effective for security workloads" — defensible at 30–90% cost reduction against per-GB-ingested licensing models. That language broke in May 2026 in a conversation with Lipyeow Lim, a Distinguished Engineer at Databricks. His framing: "ClickHouse is expensive" — when the comparison baseline isn't legacy SIEM licensing but Iceberg-on-S3 with a separate query engine.

From inside the lakehouse view, ClickHouse's MergeTree format duplicates data that could already sit in open formats on S3, requires always-on compute that can't scale independently of storage, demands fat-memory nodes for joins, multiplies storage for replication, and rivals Snowflake on managed pricing at sustained TB/day workloads. None of that is wrong. The prior framing was wrong because it assumed the baseline.

Updated position: ClickHouse is cheap versus per-GB SIEM licensing, comparable versus Snowflake or Databricks SQL, expensive versus Iceberg-on-S3 with a swappable engine. The recommendation now names the baseline. Hot tier still favors ClickHouse; decentralized open-format still favors the lakehouse. The question has two correct answers, and which one wins depends on what you're optimizing for, which is the kind of trade-off the worked scorecard resolves against a named baseline.

Centralized SIEM versus federated lake — it's a capabilities argument, not a cost argument.

The early framing of the centralized-vs-federated debate ran through cost: "the federated lake is cheaper, here are the savings." The savings turned out to be more modest than the early modeling suggested — somewhere between 2% (cost-neutral at 10 GB/day) and 64% (at 1 TB/day) on full TCO, per G-Cloud 14 validated analysis from April 2026. The earlier 360–725× framing was storage-only, not full TCO, and got revised down accordingly.

The reframe: federated query wins on capabilities a centralized SIEM cannot deliver at any price. Cross-site joins across regions where the data legally cannot leave its jurisdiction. Data sovereignty for multi-region operators in regulated industries. 3.6–10.1× query performance against the right data platform. 93–99.9% wide-area-network traffic reduction by querying data where it lives rather than shipping it home. None of these are cost arguments — they're things the centralized architecture cannot do at the architecture level.

The updated recommendation framework is a 4-stage maturity model:

  • Stage 0 is pure SIEM (baseline, zero savings, low complexity).
  • Stage 1 is tiered retention (40–60% savings, low-to-medium complexity).
  • Stage 2 is federated query (60–80% savings, medium complexity).
  • Stage 3–4 is data lake primary (80–95% savings, high complexity).

The right stage depends on security operations maturity and engineering team capabilities, not on cost optimization in isolation. Documented in ADR-002.

Two more, in brief

  • Spark 4.1 Real-Time Mode reframes the "RisingWave is the only path to streaming simplicity" narrative. Earlier framing positioned RisingWave as the single best answer to streaming complexity. Spark 4.1's Real-Time Mode, shipped February 2026, doesn't refute the RisingWave case but adds a credible alternative for shops already on the Spark ecosystem. The recommendation now branches by existing investment.
  • Iceberg and Delta Lake are converging, not displacing each other. Earlier framing leaned toward "Iceberg displaces Delta" based on the trajectory of major-vendor announcements. The reality through 2025–2026 has been convergence — both formats remain viable, both are gaining vendor support, and the recommendation depends on the surrounding ecosystem (Databricks vs the everything-else stack) rather than a categorical choice.

What's coming

The page is alive. Here's what's in flight.

Quarterly hypothesis confidence updates as evidence accumulates, with the anchor hypotheses above each carrying a "last reviewed" date and any tier movement noted in-line. New contradictions surfaced as they're written and defensible — several drafts are in the queue, including a security-specific TCO comparison for streaming versus batch architectures (which would resolve the caveat on H-IMPL-01 in either direction).

The next planned benchmark is a catalog comparison — Polaris, Nessie, Unity, Hive Metastore, Glue — scheduled for Q3 2026, with results landing on the lab page when complete. The Q4 topic is candidate-stage; the leaders are a query-engine bake-off on a longer-tail security workload, a Kafka-to-Iceberg latency characterization, and a federated-query stack comparison. Topic announcements come a quarter ahead.

These claims are testable on your environment.

The fastest test is to run the published benchmark methodology on your own workload — full code and spec, with a one-page TCO and performance readout you produce yourself. Or see how these hypotheses roll up into a recommendation on the worked scorecard, or read the program POV that connects them into a single argument.