Security Data Works

Open notebook

Architecture research, in the open.

The active hypotheses, the contradictions where prior positions had to be revised, and the method in practice. The practice keeps a public update log because the willingness to be wrong on the record is the credibility move — programs that never update their priors are programs that aren't actually testing them.

How positions form

Evidence tiers. Update on contact.

Every claim I put in a recommendation gets graded by the kind of evidence it rests on. The tiers determine whether a claim makes it into a recommendation at all, and how loudly I'll defend it under pushback.

Tier A is production-deployment evidence: someone running the system at scale and reporting the metrics. Netflix on Iceberg at multi-petabyte scale. Insider's 90% S3 cost reduction. The 145× ClickHouse result on the Zeek analytical-workload benchmark. Tier A claims are load-bearing. Tier B is peer-reviewed research, analyst reports from Gartner / Forrester / IDC, or expert consensus across multiple independent sources. Tier C is expert opinion or framework inclusion — useful but partial; usually a triangulation ingredient, rarely the load-bearing source. Tier D is vendor marketing, excluded by default unless corroborated by Tier A or B evidence.
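As a sketch of how that gate works mechanically (the names and structure here are illustrative, not the practice's actual tooling):

```python
from enum import Enum

class Tier(Enum):
    A = "production-deployment evidence"
    B = "peer review / analyst / multi-source consensus"
    C = "expert opinion or framework inclusion"
    D = "vendor marketing"

def admissible(claim_tier: Tier, corroborated_by: set[Tier]) -> bool:
    """Can this claim enter a recommendation at all?

    Tier D is excluded by default unless something at
    Tier A or B independently corroborates it.
    """
    if claim_tier is Tier.D:
        return bool(corroborated_by & {Tier.A, Tier.B})
    return True

# A bare vendor number stays out; the same number backed by an
# independent production-deployment report gets in.
assert not admissible(Tier.D, corroborated_by=set())
assert admissible(Tier.D, corroborated_by={Tier.A})
```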

The update rule: positions evolve when new evidence overturns them. Twenty-two contradictions are tracked across the portfolio, five of them documented below, each one a place where a prior position got revised on contact with new data. Several of those changed what's in the matrix, what gets recommended in engagements, or how a benchmark gets framed. Surfacing them publicly is part of the deal — the alternative is a research practice that quietly moves the goalposts and hopes no one notices.

Vendor claims get held to the same evidence standard as everyone else's. A vendor who publishes a benchmark with methodology and reproduction code (which is rare) gets evaluated as Tier A or B depending on how independent the workload selection looks. A vendor who publishes a number with no methodology gets a Tier D rating until something independent corroborates it. This page is updated quarterly; significant new evidence gets folded in faster.

Read more on the method → — the contradiction-hunting workflow, the five patterns that 25 hypotheses revealed about vendor claims, and the four practices that move the floor on architecture decisions.

Active research — anchor hypotheses

Eight load-bearing claims, with the evidence each one rests on.

Selected from the 70+ hypothesis portfolio. These are the claims most likely to show up in an engagement recommendation. Each card names what would change the answer.

H3-PERFORMANCE-01 · Tier A · 4.5/5

ClickHouse runs the same security workload 145× faster than the dominant schema-on-read SIEM.

On a 10-million-event Zeek workload, identical hardware and queries, ClickHouse Native completes the suite in 0.19 seconds against the schema-on-read SIEM's 27.52 seconds. ZSTD-22 compression delivers 8.2× reduction on top. Methodology and result are public; the reference implementation is shared under NDA with engagement prospects and qualifying reviewers.
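The compression half of that result can be expressed in the table definition itself. A minimal sketch using the clickhouse-connect client, assuming a hypothetical Zeek conn-log table; CODEC(ZSTD(22)) is the real ClickHouse column codec, everything else here is a placeholder:

```python
import clickhouse_connect  # pip install clickhouse-connect

client = clickhouse_connect.get_client(host="localhost")

# Hypothetical Zeek conn-log table; CODEC(ZSTD(22)) applies the
# maximum-level ZSTD compression measured at ~8.2x in the benchmark.
client.command("""
    CREATE TABLE IF NOT EXISTS zeek_conn (
        ts        DateTime64(3) CODEC(Delta, ZSTD(22)),
        uid       String        CODEC(ZSTD(22)),
        src_ip    IPv4          CODEC(ZSTD(22)),
        dst_ip    IPv4          CODEC(ZSTD(22)),
        dst_port  UInt16        CODEC(ZSTD(22)),
        proto     LowCardinality(String),
        bytes_out UInt64        CODEC(ZSTD(22))
    )
    ENGINE = MergeTree
    ORDER BY (dst_port, ts)
""")

# The benchmark's recurring-query shape: aggregation over a sorted,
# prunable column rather than an inverted-index scan over raw events.
rows = client.query(
    "SELECT dst_port, count() AS hits FROM zeek_conn "
    "GROUP BY dst_port ORDER BY hits DESC LIMIT 10"
).result_rows
```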

The counter-position runs along two lines. First: "ClickHouse isn't a SIEM" — true, but the comparison is on query-engine performance against the same workload, not against the broader SIEM feature set. Second: "the workload was selected to favor columnar storage" — the workload is a real Zeek deployment shape, but the result generalizes most cleanly to security workloads dominated by a small number of recurring queries. A workload dominated by ad-hoc full-text search across raw events would narrow the gap.

What would change the answer: an independent benchmark on a workload with a meaningfully different shape where the schema-on-read engine closes the gap, or evidence that the ClickHouse result fails to reproduce on different hardware.

H1-COST-02 · Tier B · 4/5

Modern security data platforms cut downstream licensing costs 50–80%.

Across fifteen-plus independent sources — Gartner advisories, Forrester reports, practitioner case studies, vendor-customer testimonials with verifiable specifics — the consistent finding is a 50–80% reduction in downstream SIEM licensing when a security data pipeline platform (Cribl, Tenzir, Vector, or equivalent) sits between source telemetry and the SIEM. The savings come from filtering, downsampling, and routing decisions made at ingest rather than after the per-GB clock has already started running.
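The arithmetic behind that range is simple enough to sketch. The volumes, split, and per-GB price below are illustrative placeholders, not engagement numbers:

```python
# Illustrative ingest-time routing: what reaches the SIEM's per-GB
# meter vs. what gets filtered, downsampled, or routed to cheap storage.
daily_gb = 1_000            # raw telemetry, GB/day (placeholder)
siem_price_per_gb = 5.00    # per-GB-ingested licensing (placeholder)

dropped  = 0.35 * daily_gb  # noise filtered at ingest
sampled  = 0.20 * daily_gb  # high-volume sources downsampled
rerouted = 0.15 * daily_gb  # compliance-only data sent to object storage
to_siem  = daily_gb - dropped - sampled - rerouted

baseline = daily_gb * siem_price_per_gb
actual   = to_siem  * siem_price_per_gb
print(f"licensing line item: {1 - actual / baseline:.0%} lower")  # 70% here
```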

The counter-position is that "downstream licensing" isn't the only cost — the pipeline platform itself carries an operational cost, and engineering time spent on routing logic is real spend. That's correct; the 50–80% is the licensing line item, not the all-in TCO. Most engagements still come out net-positive because the SIEM licensing line was the unforced error.

What would change the answer: a SIEM vendor moving to consumption pricing that prices the offset through, or a pipeline platform pricing model that captures the savings on its own side.

H1-PLATFORM-01 · Tier A · 4.5/5

Iceberg + Dremio + Polaris is the strongest open-stack baseline for security data.

Apache Iceberg as the table format, Dremio as the query engine, and Polaris as the catalog (the metadata layer that tells the engine what tables exist and where their data lives) is the configuration with the most production-deployment evidence behind it. Netflix runs Iceberg at multi-petabyte scale; Insider reports a 90% reduction in S3 storage cost after migrating onto this stack; Apple, AWS, Databricks, and Snowflake have all announced first-class Iceberg support since early 2025.
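What a neutral catalog buys in practice is that any REST-capable client resolves the same tables the engine does. A minimal sketch with PyIceberg against a Polaris-style REST catalog; the URI, credential, and table names are placeholders:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual

# Polaris speaks the Iceberg REST catalog protocol, so any REST-capable
# engine or client can resolve the same tables. Placeholders throughout.
catalog = load_catalog(
    "polaris",
    **{
        "type": "rest",
        "uri": "https://polaris.example.internal/api/catalog",
        "credential": "client-id:client-secret",
        "warehouse": "security_lake",
    },
)

table = catalog.load_table("security.zeek_conn")

# Engine portability in miniature: the same table Dremio queries is
# readable here, scoped by a metadata-pruned scan.
recent = table.scan(
    row_filter=GreaterThanOrEqual("ts", "2026-01-01T00:00:00"),
    selected_fields=("ts", "src_ip", "dst_ip", "dst_port"),
).to_arrow()
```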

The counter-position: Delta Lake (the Databricks-led table format) has comparable or better tooling inside the Databricks ecosystem; Apache Hudi has its adherents in heavy-streaming use cases. The recommendation is workload-conditional, not categorical — but for the security-data workloads this practice most often sees, the Iceberg stack ends up with the strongest combination of production validation and engine portability.

What would change the answer: convergence of the table-format ecosystem (an active hypothesis on its own — see contradictions below), or evidence that a Delta-native security-data deployment outperforms the Iceberg equivalent on a comparable workload.

H-AI-ASYMMETRY-01 · Tier A · 4.8/5 · Settled

AI-enabled offense is maturing 2–3× faster than AI-targeted defense.

Anthropic's Claude Mythos preview demonstrated autonomous vulnerability discovery at a 72.4% exploit success rate on the Firefox SpiderMonkey engine — fully autonomous, no human guidance per target. The AISLE response replicated the result with eight small open-weight models in zero-shot API calls, confirming the working assumption that the moat is the system, not the frontier model. Mandiant M-Trends 2026 records negative mean-time-to-exploit: exploitation is now landing seven days before patch release. CrowdStrike clocks attacker breakout at 51 seconds from initial access to lateral movement. GTG-1002 — the first publicly confirmed AI-orchestrated state-sponsored campaign — was 80–90% AI-executed across roughly 30 victims.

The counter-position is that defenders also benefit from AI tooling, and the asymmetry is temporary. Both parts are true. The asymmetry is temporary — the working window is roughly 2024–2027. But "temporary" on this scale means defenders need to compress years of detection-engineering modernization into a few quarters, and the maturity gap between offensive and defensive AI tooling is wide enough that I treat this as a planning constraint, not a watch-list item.

What would change the answer: a defensive AI breakthrough that produces measurable detection-cadence gains across multiple production deployments, or evidence that the offensive results don't generalize beyond the Mythos / AISLE benchmark workload.

H-COST-09 · Tier A · 5/5

Tiered storage cuts 55–90% of long-tail security data cost.

Splitting security data across hot, warm, and cold storage tiers — with hot kept on local high-performance storage, warm on cheaper attached storage, cold on S3 Standard, and archival on S3 Glacier or equivalent — reduces the long-tail cost of retention by 55–90% in production environments. Netflix reports 70–80% of their storage in cold tiers; Insider documents a 90% S3 cost reduction after tiering; Kafka 3.6+ supports tiered storage natively, which has materially changed the operational story for streaming security telemetry.

The counter-position is the freshness trade-off. Warm and cold tiers have higher first-byte latency than hot tiers; queries that span tier boundaries pay a complexity tax. For interactive threat hunting on recent data, the tiering boundary needs to land further back than for compliance retention. Most engagements end up with a tier boundary at 7–30 days; some land at 72 hours where the analyst hunting window is short.
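On AWS, the warm-to-cold half of that tiering is bucket configuration rather than custom code. A minimal boto3 sketch; the bucket, prefix, and transition days are placeholders to be tuned against the hunting window discussed above:

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket/prefix; the transition days encode the tier
# boundary: here 30 days to infrequent access, one year to Glacier.
s3.put_bucket_lifecycle_configuration(
    Bucket="security-telemetry-example",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-long-tail-telemetry",
                "Status": "Enabled",
                "Filter": {"Prefix": "zeek/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```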

What would change the answer: object-storage pricing collapsing to the point where tiering's operational complexity isn't worth the savings, or query-engine improvements that make cross-tier queries effectively free.

H3-INTEGRATION-03 · Tier B · 4/5

OCSF is the multi-vendor schema convergence that actually has a chance.

OCSF (the Open Cybersecurity Schema Framework) is the shared event-shape standard adopted by 180+ organizations, including AWS Security Lake, Cisco, Sumo Logic, IBM, Cloudflare, and many of the EDR vendors. ITU-T Study Group 17 confirmed OCSF as the basis for forthcoming international standards work in April 2026. The practical effect is that security data captured in OCSF format moves between tools without translation overhead — a shift from the prior decade's per-vendor schema lock-in.
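Concretely, "moves between tools without translation overhead" means events share a class shape. A minimal sketch of a Zeek connection record mapped onto OCSF's Network Activity class; the field selection is abbreviated, not the full schema:

```python
# Abbreviated OCSF Network Activity event (class_uid 4001, category 4).
# Field selection is illustrative; the full class carries many more
# attributes and validation constraints.
def zeek_conn_to_ocsf(conn: dict) -> dict:
    return {
        "category_uid": 4,               # Network Activity category
        "class_uid": 4001,               # Network Activity class
        "activity_id": 1,
        "time": int(conn["ts"] * 1000),  # epoch millis
        "src_endpoint": {"ip": conn["id.orig_h"], "port": conn["id.orig_p"]},
        "dst_endpoint": {"ip": conn["id.resp_h"], "port": conn["id.resp_p"]},
        "connection_info": {"protocol_name": conn["proto"]},
        "metadata": {"product": {"name": "Zeek"}, "version": "1.1.0"},
    }

event = zeek_conn_to_ocsf({
    "ts": 1767225600.0, "proto": "tcp",
    "id.orig_h": "10.0.0.5", "id.orig_p": 52314,
    "id.resp_h": "93.184.216.34", "id.resp_p": 443,
})
```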

The counter-position: vendor adoption claims at the petabyte-per-day scale lack independent verification (this is one of the documented contradictions below). The standard is real and adoption is broadening; the production-scale claims are softer evidence than the standardization trajectory itself. Caveat applied where load-bearing.

What would change the answer: a competing schema gaining vendor consortium momentum, or independent verification of the petabyte-scale OCSF deployment claims (either confirming or refuting them would update the recommendation).

H-IMPL-01 · Tier B · 4/5 · Caveat

Streaming architectures cost 2.5–3× more to operate than batch equivalents.

Real-time streaming architectures incur 2.5–3× higher operational costs than equivalent batch architectures, broken down across specialized staffing (DORA reports a 2.7× staffing premium for streaming competence), infrastructure redundancy (1.5–2× costs from running always-on Kafka and Flink clusters rather than scheduled spot batch), and incident management complexity (3–4× higher annual incident rates, per IDC and Enterprise Data Quarterly tracking).
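Rolled together, those multipliers compose roughly as follows. A worked sketch with an illustrative batch-cost split (the split varies by shop; the multipliers are the reported ones, taken at midpoints):

```python
# Illustrative batch-cost split; the multipliers are the reported ones.
batch = {"staffing": 0.50, "infrastructure": 0.35, "incidents": 0.15}
multiplier = {"staffing": 2.7, "infrastructure": 1.75, "incidents": 3.5}

streaming_total = sum(batch[k] * multiplier[k] for k in batch)
print(f"streaming / batch operational cost: {streaming_total:.1f}x")
# ~2.5x with this split; plausible splits land in the 2.5-3x band.
```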

Caveat: the underlying evidence comes from general data engineering deployments, not security-specific TCO studies. Security workloads have particular shapes — high-cardinality entity resolution, bursty incident-driven query loads, regulatory retention requirements — that may shift the ratio in either direction. A security-specific TCO comparison (streaming SIEM vs batch lake on the same workload) is on the research backlog.

The counter-position is that detection latency requirements force streaming for some workloads regardless of cost. True; the implication isn't "don't stream" — it's "stream what needs streaming, batch the rest, and don't pretend the operational cost differential isn't real."

What would change the answer: a published security-specific TCO comparison — either confirming the general data-engineering finding or showing the security workload shape narrows the gap.

H-NDR-FEDERATION-01 · Tier B · 4/5

Federated search architecture determines NDR platform stickiness.

The network-detection-and-response (NDR) market is consolidating around platforms that can federate search across multiple data sources without forcing centralization first. The argument is capability-led, not cost-led: federated query enables cross-site joins, data sovereignty (EU data stays in EU jurisdictions), 10–145× query performance against the right data platform, and 93–99.9% wide-area-network traffic reduction by querying data where it lives rather than shipping it home. Centralized SIEMs cannot deliver any of these at any price.

More than 50 federated security implementations are publicly documented. ExtraHop is an early Security Lake federation partner; AWS Security Lake plus Athena is becoming the lowest-effort bridge for shops already on AWS; CrowdStrike has not yet shipped a standardized federation API, which is a competitive opening for the platforms that have.
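The lowest-effort-bridge shape is a plain Athena query against Security Lake's OCSF tables. A minimal boto3 sketch; the database, table, and output location are placeholders:

```python
import boto3

athena = boto3.client("athena")

# Querying data where it lives: Security Lake exposes OCSF tables
# through the Glue catalog, so federation here is a standard Athena
# query. Database, table, and output bucket are placeholders.
run = athena.start_query_execution(
    QueryString="""
        SELECT src_endpoint.ip, count(*) AS conns
        FROM amazon_security_lake_table_example_network_activity
        WHERE time_dt > current_timestamp - interval '1' day
        GROUP BY src_endpoint.ip
        ORDER BY conns DESC
        LIMIT 20
    """,
    QueryExecutionContext={"Database": "amazon_security_lake_glue_db_example"},
    ResultConfiguration={"OutputLocation": "s3://athena-results-example/"},
)
print(run["QueryExecutionId"])
```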

What would change the answer: a centralized SIEM vendor solving the cross-site / data-sovereignty problem without forcing centralization (no current evidence this is happening), or evidence that federation performance overhead at scale is worse than the early benchmarks suggest.

Deeper analyses

Selected hypotheses extended into long-form deep dives.

Each links back to its anchor hypothesis above; each carries the same evidence-tier framing, with more space for the underlying argument and the conditions that would change it.

Things I changed my mind on

Three contradictions, with the evidence that forced the update.

Twenty-two contradictions are tracked across the research portfolio. Three are surfaced here in full prose; two more in summary. Each one is a place where a recommendation got revised because a specific person's specific argument wouldn't fit the prior frame.

ClickHouse is cheap. ClickHouse is expensive. The baseline determines which.

For most of 2025 my recommendation language was "ClickHouse is cost-effective for security workloads" — defensible at 30–90% cost reduction against per-GB-ingested licensing models. That language broke in May 2026 in a conversation with Lipyeow Lim, a Distinguished Engineer at Databricks. His framing: "ClickHouse is expensive" — when the comparison baseline isn't legacy SIEM licensing but Iceberg-on-S3 with a separate query engine.

From inside the lakehouse view, ClickHouse's MergeTree format duplicates data that could otherwise sit in open formats on S3, requires always-on compute that can't scale independently of storage, demands fat-memory nodes for joins, multiplies storage for replication, and rivals Snowflake on managed pricing at sustained TB/day workloads. None of that is wrong. The prior framing was wrong because it assumed the baseline.

Updated position: ClickHouse is cheap versus per-GB SIEM licensing, comparable versus Snowflake or Databricks SQL, expensive versus Iceberg-on-S3 with a swappable engine. The recommendation now names the baseline. Hot tier still favors ClickHouse; decentralized open-format still favors the lakehouse. The question has two correct answers, and which one wins depends on what you're optimizing for.
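Stated as a lookup rather than prose, a sketch of the updated decision rule (the baseline labels are shorthand, not matrix categories):

```python
# The same engine, three verdicts: the baseline is the hidden input.
VERDICT_BY_BASELINE = {
    "per-GB SIEM licensing":           "cheap (30-90% cost reduction)",
    "Snowflake / Databricks SQL":      "comparable",
    "Iceberg-on-S3, swappable engine": "expensive",
}

def clickhouse_cost_position(baseline: str) -> str:
    """A recommendation that doesn't name its baseline is incomplete."""
    return VERDICT_BY_BASELINE[baseline]
```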

Centralized SIEM versus federated lake — it's a capabilities argument, not a cost argument.

The early framing of the centralized-vs-federated debate ran through cost: "the federated lake is cheaper, here are the savings." The savings turned out to be more modest than the early modeling suggested — somewhere between 2% (cost-neutral at 10 GB/day) and 64% (at 1 TB/day) on full TCO, per G-Cloud 14 validated analysis from April 2026. The earlier 360–725× framing was storage-only, not full TCO, and got revised down accordingly.

The reframe: federated query wins on capabilities a centralized SIEM cannot deliver at any price. Cross-site joins across regions where the data legally cannot leave its jurisdiction. Data sovereignty for multi-region operators in regulated industries. 10–145× query performance against the right data platform. 93–99.9% wide-area-network traffic reduction by querying data where it lives rather than shipping it home. None of these are cost arguments — they're things the centralized architecture cannot do at the architecture level.

The updated recommendation framework is a 4-stage maturity model:

  • Stage 0 (pure SIEM): baseline, zero savings, low complexity.
  • Stage 1 (tiered retention): 40–60% savings, low-to-medium complexity.
  • Stage 2 (federated query): 60–80% savings, medium complexity.
  • Stage 3–4 (data lake primary): 80–95% savings, high complexity.

The right stage depends on security operations maturity and engineering team capabilities, not the cost optimization in isolation. Documented in ADR-002.

OCSF petabyte-scale claims have an evidence gap.

For a draft blog post in early 2026 the title read "OCSF at Petabyte Scale" — leaning on AWS Security Lake's marketing references to "petabytes of data," CISA's roughly 1 PB/day production deployment, customer names like Siemens, Sony Music, IPG, and DataBahn's "99% OCSF compliance at enterprise scale." That's a lot of corroboration, and the standardization trajectory itself (ITU-T recognition April 2026, 180+ organizations adopting) is genuinely strong evidence.

The gap surfaced through systematic verification work in March 2026. AWS provides no specific customer volume metrics for OCSF specifically. The Observe Inc. 1 PB/day claim is observability, not security. DataBahn's compliance claim is schema compliance, not volume throughput. No Jepsen-style independent audit of OCSF normalization at scale exists. The evidence behind the petabyte-scale framing was substantially Tier C-D — vendor marketing without published volume metrics — even though the underlying standardization trajectory was Tier A.

The blog post title got a caveat. The hypothesis confidence on H3-INTEGRATION-03 dropped from 4/5 to 3/5 on the production-scale dimension specifically; the standardization-trajectory dimension stayed at 4/5. The recommendation didn't change — OCSF is still the multi-vendor schema with the strongest momentum — but the specific load-bearing claim about scale got separated from the broader claim about adoption. This is the everyday work of empirical skepticism: not "OCSF is wrong," but "the part of the claim that was vendor-marketing needs to stop carrying load it can't bear."

Two more, in brief

  • Spark 4.1 Real-Time Mode reframes the "RisingWave is the only path to streaming simplicity" narrative. Earlier framing positioned RisingWave as the single best answer to streaming complexity. Spark 4.1's Real-Time Mode, shipped February 2026, doesn't refute the RisingWave case but adds a credible alternative for shops already on the Spark ecosystem. The recommendation now branches by existing investment.
  • Iceberg and Delta Lake are converging, not displacing each other. Earlier framing leaned toward "Iceberg displaces Delta" based on the trajectory of major-vendor announcements. The reality through 2025–2026 has been convergence — both formats remain viable, both are gaining vendor support, and the recommendation depends on the surrounding ecosystem (Databricks vs the everything-else stack) rather than a categorical choice.

The method, in practice

How an architecture decision actually gets made.

Architecture decisions in security data tend to fail in predictable places. The wrong decision rarely shows up as a single bad choice; it shows up as an early decision that quietly determined later ones. The method below is what I run through in an engagement, in roughly this order.

Foundational decisions first.

Where will the security data infrastructure live? In a dedicated environment isolated from corporate IT, on shared corporate infrastructure, or with a managed services provider? That single choice eliminates whole categories of downstream options. Shared corporate infrastructure, for example, requires a fine-grained access control model that some catalog options simply don't ship — the decision cascades. What table format will the data sit in (Iceberg, Delta Lake, Hudi)? What catalog will track it (Polaris, Nessie, Unity, Hive Metastore, Glue)? What query engine will read it (ClickHouse, Trino, Dremio, StarRocks, DuckDB)? Each foundational choice narrows the candidate set sharply; I run them as hard filters before getting to softer scoring.

Workload sizing as a hard gate.

Data volume per day, growth rate, source count, retention horizon, budget. These are the numbers that eliminate options that simply can't scale to the workload at hand — DuckDB drops out at hyperscale, serverless query engines drop out at steady-state high-volume workloads, ClickHouse Cloud crosses an unfavorable cost threshold past a certain TB/day. Sizing isn't a soft preference; it's the gate that decides whether a vendor can even be in the conversation.

Organizational fit as elimination criteria.

Team size and capability shape what's operable. A two-person data engineering team can't carry the same operational load as a twenty-person team; a vendor stack that requires deep Kafka and Flink expertise is not the right fit for a shop that hasn't built that bench. Vendor tolerance — open-source-only, commercial acceptable, legacy stack required — eliminates whole categories. Cloud environment (single-cloud, multi-cloud, on-prem). Data sovereignty constraints (US-only, GDPR, region-specific residency). Each of these is a gate, not a preference.

Use-case fit as soft scoring.

Once the hard filters have done their work, the surviving candidates get scored against the use cases the organization is actually optimizing for — threat detection, compliance retention, incident response, threat hunting, federated analytics. Use cases are scored, not filtered, because most environments need to support multiple in parallel, with different weights. This is where the recommendation finally narrows to the three to five candidates worth a deeper look.
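The funnel, hard gates first and soft scoring only on the survivors, is mechanically simple. A sketch with hypothetical candidates, gates, and weights; the real matrix carries far more criteria:

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    name: str
    max_tb_per_day: float
    min_team_size: int
    open_source: bool
    use_case_scores: dict = field(default_factory=dict)  # 0-5 per use case

# Hypothetical shop: 5 TB/day, 3 engineers, open source required,
# optimizing mostly for hunting with some compliance weight.
candidates = [
    Candidate("duckdb-stack",     0.5, 1, True,  {"hunting": 4, "compliance": 2}),
    Candidate("clickhouse-oss",  50.0, 3, True,  {"hunting": 5, "compliance": 3}),
    Candidate("iceberg-dremio", 100.0, 4, True,  {"hunting": 4, "compliance": 5}),
    Candidate("commercial-siem", 20.0, 2, False, {"hunting": 3, "compliance": 4}),
]

def hard_gates(c: Candidate) -> bool:
    # Sizing, team capacity, and vendor tolerance eliminate; they don't score.
    return c.max_tb_per_day >= 5 and c.min_team_size <= 3 and c.open_source

weights = {"hunting": 0.7, "compliance": 0.3}

survivors = [c for c in candidates if hard_gates(c)]
ranked = sorted(
    survivors,
    key=lambda c: sum(weights[u] * c.use_case_scores[u] for u in weights),
    reverse=True,
)
print([c.name for c in ranked])  # gates eliminate; weights only rank survivors
```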

The working version of this method — the one with the weighted scoring, the criterion-by-criterion reasoning, the vendor-claim-vs-shipped-reality deltas, and the recommended bundles per workload archetype — is the paid Capability Matrix offering. The structure exists publicly; the scored output is the paid product.

Why now

Four shifts that make today's data-platform decisions different.

The threat side is moving faster than defenders can absorb.

Offensive AI is commoditized. The AISLE replication of the Claude Mythos exploit work used 5.1-billion-parameter open-weight models — meaning the moat is the orchestration system, not privileged access to a frontier model. Mandiant M-Trends 2026 records negative mean-time-to-exploit: exploitation now lands seven days before patch release. CrowdStrike's measured attacker breakout time is 51 seconds from initial access to lateral movement. GTG-1002 — the first publicly confirmed AI-orchestrated state-sponsored campaign — was 80–90% AI-executed across roughly 30 victims. Defenders have to detect more, faster, on the same headcount; the data platform underneath the SOC is what decides whether that's possible.

Query performance has flipped, by orders of magnitude.

Structured data on columnar storage, scanned through vectorized engines that can prune partitions, beats schema-on-read inverted-index architectures on most security workloads — by orders of magnitude, not percentages. The ClickHouse-vs-schema-on-read-SIEM benchmark on the same 10-million-event Zeek workload comes in 145× faster on the ClickHouse side, with public methodology and an NDA-gated reference implementation. The schema-on-read architecture was sized for a different cost-of-query era; that era ended somewhere between 2022 and 2024 and most security data programs haven't rebuilt around the new economics yet.

Storage cost has flipped too.

Object storage (S3, MinIO, equivalent) plus columnar formats (Parquet) plus open table formats (Iceberg) breaks the per-GB-ingested SIEM economic model. ZSTD-22 compression delivers around 8.2× reduction in the practice's benchmark; production lakehouse references — Netflix, Insider, Huntress — operate at multi-petabyte scale at cost points current SIEM customers can't access. The trade-off is data freshness: first-byte latency for cold-tier queries is meaningfully higher than for hot-tier queries, which leads directly to the next thread.

Detection scale and the direction of travel.

Stream-processing engines — Apache Flink, Spark Structured Streaming, RisingWave — can run thousands of near-real-time detections within latency budgets at which SIEM correlation engines support dozens. The open engineering question is the maturity of streaming writes into Iceberg: how fresh can lakehouse data get before the cost of stream processing offsets the savings? The answer is changing every few quarters as new approaches ship (Apache Iceberg streaming writes, Tabular's commercial layer, Spark 4.1 Real-Time Mode, DuckLake's claimed 105× streaming improvement). The mid-term direction is more decentralization, not less: agentic federated query over source-retained data, with semantic understanding driving query plans across distributed data platforms. Platform choices made today need to be portable to that future. That portability — open formats, neutral catalogs, replaceable engines — is the architectural argument I keep returning to.
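The freshness knob in that question is the streaming trigger interval. A minimal Spark Structured Streaming sketch writing into an Iceberg table; the topic, paths, and table names are placeholders, and it assumes a session already configured with the Iceberg runtime and a catalog named lake:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("telemetry-to-iceberg").getOrCreate()

# Placeholder broker and topic; value parsing kept deliberately minimal.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker.example.internal:9092")
    .option("subscribe", "zeek-conn")
    .load()
    .selectExpr("CAST(value AS STRING) AS raw", "timestamp")
)

# The trigger interval is the freshness/cost dial: shorter commits mean
# fresher lakehouse data but more Iceberg snapshots and small files.
query = (
    events.writeStream.format("iceberg")
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .option("checkpointLocation", "s3://checkpoints-example/zeek-conn/")
    .toTable("lake.security.zeek_conn_raw")
)
query.awaitTermination()
```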

What's coming

The page is alive. Here's what's in flight.

Quarterly hypothesis confidence updates as evidence accumulates, with the eight cards above each carrying a "last reviewed" date and any tier movement noted in-line. New contradictions get surfaced once they're written up and defensible — several drafts are in the queue, including a security-specific TCO comparison for streaming versus batch architectures (which would resolve the caveat on H-IMPL-01 in either direction).

The next planned benchmark is a catalog comparison — Polaris, Nessie, Unity, Hive Metastore, Glue — scheduled for Q3 2026, with results landing on the lab page when complete. The Q4 topic is candidate-stage; the leaders are a query-engine bake-off on a longer-tail security workload, a Kafka-to-Iceberg latency characterization, and a federated-query stack comparison. Topic announcements come a quarter ahead.

The research is the receipts. The thesis is the program.

The program POV that connects these claims, and the engagements that put them to work on your data.