The bridge thesis

Security data is a data-engineering problem.

And that's the good news: data engineering already standardized most of it. Your team is already doing the work, pulling data out of the SIEM and wrangling it in tools they trust, just without the open standards the rest of the field agreed on. The highest-leverage move isn't building a lakehouse; it's requiring those standards (Iceberg, Arrow, OCSF) in everything you buy and run, then adopting them one tool at a time. Demand the standards and the rest follows: the parts get swappable, the migration gets incremental, and the lock-in the vendors count on goes away. The fair-broker practice below (a public lab, a versioned matrix, and disclosure-forward conflict handling) is what measures who actually honors those standards and who only claims to. The adoption risk is real too: a six-component stack only pays back if the team can run it, and architects who bridge data engineering and security operations are scarce. The lab and the matrix answer the technical question; the migration assessment answers the operational one before money is committed.

TL;DR

What's on this page

Why this practice exists. The data-engineering conference where I was the only security person in the room, and the structural gap that became this practice.
Three pillars. Trustworthy, Well-connected, Performant. Each a testable structural commitment with named evidence.
Open standards at every layer. Arrow, Iceberg, OCSF, Sigma. The four standards that keep the data, schema, and analysis layers swappable.
The method. Empirical skepticism, evidence tiers, update on contact with new data. Twenty-two documented contradictions on /research.
Four operating commitments. Compensation, placement, method, review. The fair-broker stance put into practice.
Foundation + three projects. MOAR (Modular Open Architecture, the open, swappable alternative to a single-vendor SIEM, delivered as the data platform), DetectFlow (detection at scale), MLOps-hunting (research). Foundation gates everything downstream.
Reference architectures. Patterns from client work, distilled and published through this site and CISA JCDC.

The three pillars

Three commitments. Each one is testable.

The pillars aren't aspirational language. They're structural commitments. Each one defines a property the data platform has to demonstrate, with evidence the program produces and updates over time.

Trustworthy

What this catches is the source feed that has been silently dropping events for three days while no dashboard alerts on it, the schema drift that makes half a detection rule's references resolve to nothing, and the asset database that is authoritative for ownership in 80% of cases, with a different source disagreeing on the remaining 20% and no one reconciling the two. The data is instrumented, validated, and lineage-traceable, so completeness, freshness, and schema conformance are measured per source and failures show up before analysts notice them in their queries.

The evidence is per-source data health reports running continuously, covering completeness, freshness, schema conformance, and OCSF conformance. (OCSF is the open standard for security data schemas; conformance means a given source's events match the shared shape rather than the vendor's proprietary format.) The artifact isn't "the vendor told us their pipeline is reliable." It's "we measured it over the last 30 days, and here is what we saw."
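A minimal sketch of what one per-source health check could compute, assuming events arrive as dictionaries with a `time` field and that each source declares an expected event count, a maximum lag, and a required field set. The names `SourceExpectation` and `health_report`, the field names, and the numbers are hypothetical illustrations, not part of OCSF or of any product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SourceExpectation:
    """What a source promised (all values are illustrative assumptions)."""
    name: str
    expected_events: int        # events expected in the reporting window
    max_lag: timedelta          # how stale the newest event may be
    required_fields: frozenset  # fields every event must carry


def health_report(expectation, events, now):
    """Measure completeness, freshness, and schema conformance for one source."""
    received = len(events)
    expected = expectation.expected_events
    completeness = received / expected if expected else 0.0

    # Freshness: age of the newest event that carries a timestamp.
    newest = max((e["time"] for e in events if "time" in e), default=None)
    fresh = newest is not None and (now - newest) <= expectation.max_lag

    # Schema conformance: share of events carrying every required field.
    conforming = sum(1 for e in events if expectation.required_fields.issubset(e))
    conformance = conforming / received if received else 0.0

    return {
        "source": expectation.name,
        "completeness": round(completeness, 3),
        "fresh": fresh,
        "schema_conformance": round(conformance, 3),
    }


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [
    {"time": now - timedelta(minutes=5), "event_type": "conn", "src_ip": "10.0.0.1"},
    {"time": now - timedelta(minutes=2), "event_type": "conn"},  # src_ip missing
]
expectation = SourceExpectation(
    "zeek_conn", 4, timedelta(minutes=15),
    frozenset({"time", "event_type", "src_ip"}),
)
report = health_report(expectation, events, now)
```

Here the source delivered half of what it promised and half of those events are missing a required field, which is the kind of failure the report is meant to surface before an analyst trips over it in a query.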

Well-connected

An analyst writes a hunt joining endpoint telemetry to the asset database and gets wrong results because the asset identifier field means three different things to three different tools, or the CMDB says 50,000 assets while the EDR sees 47,000 and the vulnerability scanner sees 52,000 and the delta gets papered over rather than named. Well-connected is the property that closes that gap: entities (assets, users, applications, configurations) resolve cleanly across sources, the data catalog knows which source is authoritative for which attribute with confidence and freshness scoring, and joins do what their JOIN clauses claim they do.

The evidence is cross-tool gap analysis. When the sources disagree, the delta is documented, the authoritative source per attribute is determined with confidence and freshness scoring, and the coverage holes are explicit. Without that cross-tool view, "well-connected" is an assertion, not a property.
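As a rough illustration of naming the delta rather than papering over it, the sketch below compares asset inventories from several tools. It assumes identifiers have already been normalized to one shared key, which is the hard part in practice; `gap_analysis` and the tool names are hypothetical.

```python
def gap_analysis(inventories):
    """Report which assets each tool fails to see.

    `inventories` maps a tool name to the set of asset identifiers it reports.
    Assumes identifiers are already normalized to a common key.
    """
    if not inventories:
        return {"total_distinct": 0, "seen_by_all": 0, "missing_by_tool": {}}

    asset_sets = list(inventories.values())
    everything = set().union(*asset_sets)
    seen_by_all = set.intersection(*asset_sets)

    # For each tool, list the assets other tools know about but it does not.
    missing = {
        tool: sorted(everything - assets)
        for tool, assets in inventories.items()
    }
    return {
        "total_distinct": len(everything),
        "seen_by_all": len(seen_by_all),
        "missing_by_tool": missing,
    }


gaps = gap_analysis({
    "cmdb": {"a", "b", "c"},
    "edr": {"a", "b"},
    "scanner": {"a", "b", "c", "d"},
})
```

The output makes the coverage holes explicit per tool. Deciding which source is authoritative for which attribute is a separate step that this sketch does not attempt.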

Performant

The data platform has to meet two distinct latency regimes on the same data: sub-second detection and response (operational), and petabyte-scale historical threat hunting (analytical). The petabyte regime rests on the documented production deployments below (Cloudflare, Comcast, and Pinterest publish theirs), not on my lab, which measures at smaller scale. The single-engine assumption (one tool, one platform, accept the compromise on either side) is the compromise the industry has been unbundling since 2022.

Most performance disappointment in security data isn't the vendor missing the spec. It's the spec being measured on a workload that doesn't match production. The brochure benchmark uses synthetic queries and idealized data shapes; the actual workload is dirty, skewed, and 40% of it is the same handful of queries running on different time ranges.
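One way to check how concentrated a real workload is, sketched under the assumption that a query log is available as a list of SQL strings. The regex-based normalization is a deliberate simplification (it is not a SQL parser), and `template_of` and `top_template_share` are hypothetical names.

```python
import re
from collections import Counter

# Replace quoted literals and bare numbers with a placeholder so that the
# same query run over a different time range collapses to one template.
_LITERAL = re.compile(r"'[^']*'|\b\d+(?:\.\d+)?\b")


def template_of(query):
    """Normalize whitespace and case, then strip literals."""
    return _LITERAL.sub("?", " ".join(query.lower().split()))


def top_template_share(queries, k=5):
    """Fraction of the query log covered by the k most common templates."""
    if not queries:
        return 0.0
    counts = Counter(template_of(q) for q in queries)
    return sum(n for _, n in counts.most_common(k)) / len(queries)


log = [
    "SELECT count(*) FROM conn WHERE ts > '2025-01-01'",
    "select count(*)  from conn where ts > '2025-02-01'",
    "SELECT * FROM dns WHERE id = 7",
]
share = top_template_share(log, k=1)
```

A high share for a small `k` suggests the benchmark should replay those templates over realistic time ranges rather than rely on synthetic queries.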

The evidence is reproducible benchmarks against the actual workload, not against the brochure. On a 10-million-event Zeek corpus on identical hardware, ClickHouse ran 46.8× faster than the dominant schema-on-read SIEM on the five-query average, with a 21–62× gap on the hunting-shaped aggregations (the index still wins the simple lookups), answer equality verified, and 8.2× compression. This is a single-node Tier B result so far; cluster and concurrent-load behavior are the honest open extension. The number is a tier-gap measurement: it shows where the schema-on-read architecture hits a wall, on a workload that can be rerun. The engine you land on inside the lakehouse tier is a separate decision that turns on catalog, concurrency, and operational cost, which is the trade-off the worked scorecard resolves against a named baseline. Methodology in the lab; reference implementation under NDA. Deeper read on what that means for petabyte-scale hunting.
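The shape of an answer-equality check can be sketched in a few lines. This is not the lab's harness (that implementation is not shown in the text); it is a toy comparison of two in-process functions standing in for two engines, assuming row order is not significant. `timed`, `compare`, `scan`, and `lookup` are hypothetical names.

```python
import time
from statistics import median


def timed(fn, repeats=5):
    """Return (last result, median wall-clock seconds) over several runs."""
    durations, result = [], None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        durations.append(time.perf_counter() - start)
    return result, median(durations)


def compare(baseline, candidate, repeats=5):
    """Return baseline/candidate speedup, but only if the answers match."""
    base_result, base_s = timed(baseline, repeats)
    cand_result, cand_s = timed(candidate, repeats)
    # Assumption: row order does not matter, so compare sorted results.
    if sorted(base_result) != sorted(cand_result):
        raise ValueError("answers differ; the speedup is not comparable")
    return base_s / cand_s if cand_s > 0 else float("inf")


# Stand-ins for two engines answering the same question.
rows = [(i, i % 7) for i in range(2000)]
index = {}
for row in rows:
    index.setdefault(row[1], []).append(row)


def scan():
    return [r for r in rows if r[1] == 3]


def lookup():
    return list(index[3])


speedup = compare(scan, lookup)
```

The ordering matters: the equality check runs before any ratio is reported, so a fast wrong answer never produces a number.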

Intellectual roots

The framing isn’t invented here. The first two pillars are the security reading of data-centricity, Dave McComb’s case for converging on one shared, extensible data model instead of letting every tool keep its own. Well-connected and trustworthy come straight from that single shared model. Performant is what security has to add: at petabyte volume the model has to be fast as well as coherent, which is why the platform here is a columnar lakehouse rather than a knowledge graph. The same lineage supplies the migration path. McComb’s incremental, stealth approach to retiring a legacy system is the same move as leaving a legacy SIEM one workload at a time, rather than a big-bang replacement.

Open standards

Open standards at every layer.

Performant only stays performant if the layers stay swappable. What makes a layer swappable without re-platforming everything above it is an open standard at that layer. Four carry the weight across the data, schema, and analysis layers.

Apache Arrow. The in-memory columnar format engines use to hand data to each other without serialization tax. Data layer.
Apache Iceberg. The open table format that lets ClickHouse, Trino, Dremio, and Spark read the same files on object storage. Data layer.
OCSF. A shared event schema, so a source's events have one shape across tools rather than a per-vendor format. Schema layer. I treat it as a normalization-hygiene baseline that earns its keep, not as the schema everyone finally agrees on or a semantic backbone that fixes meaning, since conformance checks the shape of an event and not what its fields actually mean. ITU-T Study Group 17 standardized OCSF as Recommendation X.icd-schemas; member states backed it for ratification in December 2025 and adoption followed in April 2026, so schema stability now carries multilateral institutional backing.
Sigma. Portable detection logic at the analysis layer. A Sigma rule compiles to ClickHouse SQL, Splunk SPL, or Sentinel KQL, so the detection content survives an engine change rather than being rewritten with it.

The point is the analysis layer specifically. Open formats below it (Arrow, Iceberg) and an open schema (OCSF) make the data portable. Without an open detection standard, the detections themselves are still locked to whatever engine wrote them, and re-authoring a detection backlog is the migration cost that actually stalls re-platforming. Sigma's portability is real but bounded; correlation and stateful detections are less cleanly portable than single-event rules, so I treat the standards-portability claim as Tier B, not absolute. The foundational-standards deep-dive works through each layer and where its standard holds.
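To make the portability claim concrete, here is a toy translator that renders one single-event selection into two query dialects. It is not the Sigma toolchain and covers only a fraction of the rule format: it borrows Sigma's `field|modifier` key style for `contains`, `startswith`, and `endswith`, ANDs all conditions, and does not escape wildcard characters inside values. The field names, the `events` table, and both function names are assumptions for illustration.

```python
def _split(key):
    """Split 'field|modifier' into its parts; no modifier means equality."""
    field, _, modifier = key.partition("|")
    return field, modifier or "equals"


def _pattern(modifier, value, wildcard):
    """Wrap a value in the dialect's wildcard according to the modifier."""
    return {
        "equals": value,
        "contains": f"{wildcard}{value}{wildcard}",
        "startswith": f"{value}{wildcard}",
        "endswith": f"{wildcard}{value}",
    }[modifier]


def to_sql(selection, table="events"):
    """Render the selection as a SQL query using LIKE for pattern modifiers."""
    clauses = []
    for key, value in selection.items():
        field, modifier = _split(key)
        escaped = str(value).replace("'", "''")  # escape single quotes only
        if modifier == "equals":
            clauses.append(f"{field} = '{escaped}'")
        else:
            clauses.append(f"{field} LIKE '{_pattern(modifier, escaped, '%')}'")
    return f"SELECT * FROM {table} WHERE " + " AND ".join(clauses)


def to_spl_style(selection):
    """Render the same selection as an SPL-style field="pattern" search."""
    terms = []
    for key, value in selection.items():
        field, modifier = _split(key)
        terms.append(f'{field}="{_pattern(modifier, str(value), "*")}"')
    return " ".join(terms)


selection = {
    "process_name|endswith": "powershell.exe",
    "command_line|contains": "-enc",
}
sql = to_sql(selection)
spl = to_spl_style(selection)
```

The single-event case translates mechanically, which is why it ports well. Nothing in this sketch handles correlation across events or state over time, which is where the text places the bound on portability.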

The method

Empirical skepticism. Evidence tiers. Update on contact with new data.

Every claim I put in a recommendation is graded by the kind of evidence it rests on, from production-deployment evidence (Tier A) down to vendor marketing (Tier D, excluded by default unless independently corroborated). I don't reproduce the full rubric here. The tier definitions and the current grade on every active hypothesis live on the research page, next to the claims they actually grade.
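The exclusion rule can be written down as a small filter. This sketch encodes only what the paragraph above states: Tier A is production-deployment evidence, and Tier D is vendor marketing, excluded unless independently corroborated. Tiers B and C are left as placeholders because their definitions live on the research page; the `Claim` type and the sample claims are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    A = 1  # production-deployment evidence
    B = 2  # placeholder: defined on the research page
    C = 3  # placeholder: defined on the research page
    D = 4  # vendor marketing


@dataclass(frozen=True)
class Claim:
    text: str
    tier: Tier
    independently_corroborated: bool = False


def admissible(claims):
    """Keep every claim except uncorroborated Tier D."""
    return [
        c for c in claims
        if c.tier is not Tier.D or c.independently_corroborated
    ]


claims = [
    Claim("runs at petabyte scale in production", Tier.A),
    Claim("fastest engine on the market", Tier.D),
    Claim("vendor figure reproduced in an independent test", Tier.D, True),
]
kept = admissible(claims)
```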

The update rule is simple. Positions evolve when new evidence overturns them. That update history is kept publicly on /research as a "things I changed my mind on" log, twenty-two contradictions documented and counting. The willingness to be wrong on the record is the credibility move; programs that never update their priors are programs that aren't actually testing them.

The full benchmark methodology, results, and reproducibility statement live on the lab page. Anyone can re-run it on their own workload.

If a future benchmark on a different workload reverses the result, I update. Vendor neutrality is a consequence of empirical skepticism, not the goal.

The four operating commitments

Compensation. No reseller margins. No commission on any product recommended.

Placement. No vendor-paid placements; the matrix scores against the disclosure, not around it.

Method. Benchmark spec, workload, and results are public on the lab; only the executable artifact is NDA-gated.

Review. Annual external review of the published results. First review Q4 2026, reviewer named on disclosures when it completes.

Already proven at scale

Published reference architectures.

The thesis above claims the data platform has to earn trust empirically. The teams below already ran the experiment. Internet/cloud scale, regulated industries, security-specific deployments, each one paired with the engine the workload runs on. Full reference architectures with pipelines, hero outcomes, and trade-offs live on the references page.

Internet & cloud scale

Cloudflare · ClickHouse

Comcast · Snowflake

Pinterest · Trino / Presto

Regulated industries

Standard Chartered · Databricks

Bank Hapoalim · Trino (via Starburst)

DNB · DuckDB (via Ibis)

Security-specific deployments

Palo Alto Networks · RisingWave

RunReveal · ClickHouse

Ziggiz.ai · Databricks

See the full reference catalog →

Foundation + three projects

The validation program is the gate. The three projects are optional, mix-and-match.

The projects here are the depth tracks (the architecture and research work the thesis rests on). The scoped, fixed-price ways to buy that work are the Service Offerings on the engagements page: the foundation gate is sold as Data Health Validation and MOAR as MOAR Architecture Design, while DetectFlow and MLOps-enabled hunting inform engagements today without yet being standalone offerings.

Production · Gate

The foundation: data health, broadly defined.

A two-stage sequence that gates everything downstream. You don't ship detection content, hunting workflows, or machine-learning models on top of a foundation you haven't validated.

Stage one is per-source data health. Continuous reports establishing that what's in the lakehouse matches what each source promised, on completeness, freshness, schema conformance, and OCSF conformance.

Stage two is cross-tool gap analysis. When the asset database, the EDR, and the vulnerability scanner disagree about what's on the network, the delta is named, the authoritative source per attribute is determined, and the coverage holes are explicit. The cross-tool view is where assurance lives. A program that produces clean per-source reports but never reconciles across tools cannot defend the claim that its connected dataset is complete.
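The gating logic itself is simple to express. The sketch below assumes stage one yields a pass/fail per source and stage two yields a list of deltas that are still unreconciled; `FoundationStatus` and `gate` are hypothetical names, and how each stage decides pass or fail is out of scope here.

```python
from dataclasses import dataclass, field


@dataclass
class FoundationStatus:
    # Stage one: source name -> whether its health checks currently pass.
    source_health: dict
    # Stage two: cross-tool deltas that nobody has reconciled yet.
    unreconciled_deltas: list = field(default_factory=list)


def gate(status):
    """Return (ok, reasons). Downstream work ships only if both stages are clean."""
    reasons = [
        f"stage 1: {name} failed health checks"
        for name, healthy in sorted(status.source_health.items())
        if not healthy
    ]
    reasons += [
        f"stage 2: unreconciled delta: {delta}"
        for delta in status.unreconciled_deltas
    ]
    if not status.source_health:
        reasons.append("stage 1: no sources validated")
    return (not reasons, reasons)


blocked, why = gate(FoundationStatus(
    {"edr": True, "cmdb": False},
    ["cmdb vs edr: asset counts disagree"],
))
clear, _ = gate(FoundationStatus({"edr": True}))
```

Returning the reasons alongside the verdict keeps the gate auditable: a blocked deployment says which source or which delta is holding it back.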

Read the foundation deep-dive →

Project 1

Production · Flagship

MOAR. Modular Open Architecture data infrastructure.

The data platform itself: an Iceberg-based lakehouse, a vendor-neutral catalog (Polaris, Nessie, or Hive Metastore), purpose-fit query engines selected against the workload (ClickHouse, Trino, Dremio, or StarRocks), vendor-neutral routing (Tenzir, Vector, or Cribl), and the assembly that lets the same data serve sub-second detection and petabyte-scale hunting. The petabyte end is proven by the published deployments on the references page rather than by my lab, whose own runs are single-node at up to a billion rows so far.

When to pick it. SIEM cost spiral, retention pain, multi-region or regulated query needs, or any infrastructure decision where vendor neutrality and open formats matter more than path-of-least-resistance.

Read the MOAR project →

Project 2

Production · Depth

DetectFlow. Detection at thousands-of-rules scale, without the operational debt.

Detection-as-code, CI/CD pipelines for detection content, telemetry feedback loops, automated regression testing. Each rule is a versioned, tested, deployable artifact whose performance and false-positive rate are continuously measured. The differentiator is scale. Most detection programs cap out at hundreds of rules because of operational debt; DetectFlow is the discipline that lets a SOC carry thousands without collapsing under maintenance load.
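A minimal sketch of a rule as a versioned, tested artifact, assuming each rule ships with sample events it must catch and benign samples it must ignore. This is an illustration of the idea, not DetectFlow's implementation; `Detection`, `regression_failures`, and the sample rule are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Detection:
    rule_id: str
    version: str
    matches: Callable[[dict], bool]  # the detection logic itself
    should_fire: tuple = ()          # sample events the rule must catch
    should_not_fire: tuple = ()      # benign samples it must ignore


def regression_failures(rule):
    """Run the rule against its own fixtures and list every failure."""
    failures = []
    for event in rule.should_fire:
        if not rule.matches(event):
            failures.append(("missed", event))
    for event in rule.should_not_fire:
        if rule.matches(event):
            failures.append(("false_positive", event))
    return failures


rule = Detection(
    rule_id="encoded-powershell",
    version="1.2.0",
    matches=lambda e: "-enc" in e.get("command_line", ""),
    should_fire=({"command_line": "powershell.exe -enc SQBFAFgA"},),
    should_not_fire=({"command_line": "powershell.exe -File backup.ps1"},),
)

# A later edit that silently breaks the rule is caught by the same fixtures.
broken = Detection("encoded-powershell", "1.3.0", lambda e: False, rule.should_fire)
```

Run in CI on every change, a check like this turns "detections that should have fired and didn't" into a failed build instead of an incident retro finding.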

When to pick it. Detection backlog growing faster than the team can maintain; analyst time consumed by tuning rather than hunting; incident retros showing detections that should have fired and didn't.

Read the DetectFlow project →

Project 3

Research · ~12–24 mo

MLOps-enabled model threat hunting.

Models surface anomalies and prioritize hunts; MLOps manages the model lifecycle (training, drift detection, retraining, evaluation) the same way DetectFlow manages the detection lifecycle. The point is not "AI in the SOC." It's treating models as production artifacts with the discipline that detection content has earned.
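The text does not say how drift is detected, so the sketch below swaps in one common choice, the population stability index (PSI), purely as an example. It assumes non-empty lists of numeric feature values, bins them over the range of the training sample, and floors empty bins at a small epsilon; the bin count and epsilon are arbitrary assumptions.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index between a training sample and a live sample.

    Both inputs must be non-empty. Bins span the range of `expected`;
    live values outside that range are clamped into the edge bins.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = list(range(100))
same = psi(training, training)
shifted = psi(training, [v + 50 for v in training])
```

Identical samples score zero and a shifted sample scores well above it. The threshold at which a score triggers retraining is a policy decision this sketch leaves open.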

Maturity: leading-edge. I haven't seen an incumbent doing this well at petabyte scale, though that is a read of the field rather than a lab-measured claim. Near-term posture is thought-leadership track first; service line follows once the foundation work has produced reference clients.

Read the MLOps-hunting project →

Reference architectures — in public

Working patterns published openly.

What gets discovered in client engagements gets distilled, anonymized and generalized, into reference architectures the rest of the community can use. Each architecture pairs the platform choices (lakehouse, catalog, engine, routing) with the matrix scores that justified them, and the failure modes that surfaced in production. Methodology open; reproducible against your own workload.

Two channels for distribution. First, through Security Data Works directly: the reference architectures catalog on this site, the matrix scoring that anchors each pattern, and the engagement deliverables that map the pattern to a client environment. Second, through the CISA Joint Cyber Defense Collaborative: working patterns contributed upstream to the federal cyber-defense community, where the fair-broker thesis lands at industry scale.

The bet is that reference architectures compound as public goods. A pattern that lets one bank avoid a multi-million-dollar vendor lock-in lets a peer bank do the same; the methodology that surfaces a silent telemetry-loss regression in one SOC surfaces the same class of problem in another. Hoarding working architectures inside proprietary engagements is the consulting incentive; publishing them is the fair-broker one.

The takeaway

“The next Splunk” is already here; it’s just not evenly distributed.

Open standards let you pick data engineering tools that match your work. The category isn’t “which product replaces Splunk” — it’s composing data-engineering tools per workload on open standards you own, with evidence for which engine for which job.

What you need is evidence that each engine can do the job you give it.

Or read the same line the way an optimist would, which might be the truer way. For the first time, every tool we need is already built and proven, the open formats, the fast engines, the shared schema, the cheap storage underneath, so the constraint was never really the technology. We are at an extraordinary moment, with everything in hand to improve life at a scale that wasn’t possible before, in security and far past it. The capability is already here; what’s left is the will to use it well.

Put the thesis to work on your data.

A 30-minute discovery call is the fastest way to find out whether the fair-broker approach fits your environment — and which of the three offerings it points to. Or read the research the thesis rests on.

Book a 30-min discovery call → Read the research