Security Data Works

Why this practice exists

I was the only one in the room with my hand down.

In 2023 I was sitting in a data-engineering talk when Joe Reis asked the room who was using the modern data stack. Hands went up everywhere, and mine didn't, because I didn't know what he meant. I looked around and worked out that I was the only person from cybersecurity in the room. Joe moved straight on without defining the term, which was unusual for him, and he could move on because everyone else already knew. So I sat there with a question the rest of the audience didn't have: what is a modern data stack, and is security so far behind that we don't even know what we're missing?

The gap I walked into

Data engineering had solved security's problems five years earlier.

It turned out security really was that far behind. As Joe described how miserable the old way had been (the cost pressure, the scaling walls, the vendor lock-in, the struggle to run flexible analytics across messy sources), he was describing my job. A mid-sized enterprise generates a couple hundred gigabytes to a few terabytes of security telemetry a day and watches it grow 25-60% a year depending on the estate. The SIEM is priced per gigabyte, so the bill scales linearly with that volume, and the volume grows with the threat surface. And there's a separate tool for correlation, for orchestration, for threat intel, for storage, each holding its own copy of the data. The modern data stack Joe was describing was the answer the data world had built to exactly that problem: separate the storage from the compute so each scales on its own, keep the data in open formats so any engine can read it, compose the pieces instead of buying one monolith. The data-engineering community had hit those walls around 2015, built that answer, and moved on. Security hadn't started.
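To make the growth arithmetic concrete, here is a minimal sketch of how daily volume compounds at the two ends of that 25-60% range. The 1 TB/day starting point and the five-year horizon are illustrative assumptions, not figures from any particular estate; under per-gigabyte pricing the bill follows the same curve.

```python
def projected_daily_volume(start_tb_per_day: float, annual_growth: float, years: int) -> list[float]:
    """Daily telemetry volume at year 0..years, compounding once per year."""
    return [start_tb_per_day * (1 + annual_growth) ** year for year in range(years + 1)]


# Assumed inputs, for illustration only.
START_TB_PER_DAY = 1.0
YEARS = 5

for growth in (0.25, 0.60):
    volumes = projected_daily_volume(START_TB_PER_DAY, growth, YEARS)
    series = ", ".join(f"{v:.2f}" for v in volumes)
    print(f"{growth:.0%} annual growth: {series} TB/day")
```

Under these assumptions the low end roughly triples in five years and the high end grows more than tenfold, which is the spread a per-gigabyte contract has to absorb.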

The deeper gap wasn't the tooling, though, it was the culture, and that's the part that has stayed with me. Data engineering had built a culture of evidence: it benchmarks in public, with TPC-H and TPC-DS and ClickBench comparing dozens of engines on results anyone can read; it shares architectures in conference talks and GitHub repos; it documents what failed, not just what shipped. When a data engineer wants to know whether DuckDB or ClickHouse fits a workload, they run the benchmark and read the numbers. When a security architect wants to know whether their SIEM is the bottleneck, they read the vendor's data sheet. One of those is a measurement and the other is a brochure, and security had come to treat the brochure as if it were the measurement. I'm not going to pretend data engineering has no blind spots, because it has plenty, but it had built a transparency around architecture decisions that security has never had, and I left that talk wanting to know why.

So I made a decision in that room, which was to go and actually learn data engineering, not skim blog posts about it but understand the architecture from the inside and work out which parts of it transferred to security. Two years later the thing that transferred wasn't a tool, it was a short list of principles the modern data stack rests on: separate storage from compute so each scales and gets paid for on its own; keep the data in open table formats so you can change engines without re-ingesting; compose modular, best-of-breed components instead of buying one monolith; and choose between them on benchmarked evidence rather than on the vendor's promise. None of those is exotic in data engineering, and every one of them cuts against the way security buys infrastructure, which is monolithic, proprietary, priced per gigabyte, and sold on a demo. The distance between those two lists is the whole problem, and closing it is the work.

Why the practitioner can't tell the story

Security mostly can't show its work, and three structural things keep it that way.

The honest answer, the one it took me a couple of years of digging to be sure of, is structural: three things keep the security practitioner from showing their work.

The first is that they're locked out of measurement. Splunk's General Terms, Section 1.2(v), tell every customer they agree not to use the product to “analyze, test, characterize, inspect, or monitor its availability, performance, or functionality for competitive purposes.” Read that plainly: a security team that wants to know whether ClickHouse or Trino could run their analytical workload faster, on their own data, on their own hardware, is contractually barred from finding out. It isn't a Splunk quirk. Snowflake, Databricks, and Oracle carry comparable language, and Oracle's traces back to the 1980s DeWitt clause, named for David DeWitt, the database researcher whose published comparative results prompted Oracle to forbid the practice in its license. The consequence compounds, because when customers can't legally benchmark, the only security benchmarks that exist come from vendors whose own products don't restrict them, so every “10× faster than your SIEM” figure you've read was funded, by structural design, by a party that comes out ahead in it. The architects who would gain the most from independent measurement are precisely the ones forbidden to produce it.

Sit in the room where SIEM modernization actually gets decided and you can watch what that produces. A CISO is comparing four platforms, each with a faster-than-the-incumbent number on its data sheet, each produced by the vendor that benefits from it, and none of them independently reproducible, because the customer who could reproduce them is bound by the clause that prohibits exactly that. The most rigorous-looking figure in the room is whichever vendor most recently funded a study, and the decision gets made on the measurement that happens to be available rather than the measurement that would be useful. I've written the long version of this on the economics of independent measurement; the short version is that the measurement layer security needs structurally cannot exist under the contracts security operates under.

The fix isn't to open-source a benchmark and tell everyone to run it, because the moment the comparison set includes commercial software under a restrictive license, shipping a turn-key reference implementation puts either the publisher or the downloader in breach. What works is the shape data engineering already settled on with TPC-H, TPC-DS, and MLPerf: publish the methodology, the hardware spec, the query suite, and the results in the open; share the executable artifact under a one-page mutual NDA with qualified reviewers; and put a named external reviewer on the methodology each year, with their signoff published. Security has been waiting for that layer for a decade, and the contract regime is most of why it never arrived.

The second is that the knowledge is tribal, and nowhere does that show more clearly than in the parsing layer that turns a vendor's raw logs into fields a SIEM can search. That layer is owned by no one in the chain, and it breaks where no one is looking. PAN-OS firewall syslog ships with no timezone value at all, and the tribal knowledge is that you set every firewall to UTC so the assumption downstream holds. That works until someone stands up a new pair of firewalls on local time, and I once watched a regulated utility end up with an 18-hour spread across its estate, a fabricated incident timeline that still rendered green on every dashboard. In 2023 I submitted a pull request to Palo Alto's own community Splunk app fixing well over a hundred broken and missing field extractions; it was never merged, and the repository has since been archived. I don't read that as malice, I read it as incentives: nobody was paid to merge it, the data being a little wrong in the SIEM isn't a problem the vendor experiences in its own console, and the integrators who know where every vendor hides the bodies are paid to keep that knowledge private. It isn't only Palo Alto, either; Splunk stamps Zeek logs with the wrong timestamp by default, taking the write time instead of the event-start time security actually needs, and Tenable's most useful vulnerability context arrives buried in an unparsed nested field, present and useless at the same time, and I hit and fixed both, so the shape is structural rather than one vendor's neglect. The parsing layer nobody owns is the full version of that story.
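The timezone failure is easy to reproduce on paper. The sketch below takes one wall-clock string with no zone information and shows what happens when a pipeline assumes UTC while some devices are actually set to local time. The timestamp format, the device names, and the offsets are all hypothetical, chosen only so the arithmetic is visible.

```python
from datetime import datetime, timedelta, timezone

# Illustrative zone-less timestamp; the format is an assumption for this sketch.
RAW = "2024/03/10 14:05:00"
FMT = "%Y/%m/%d %H:%M:%S"


def to_utc(raw: str, device_offset_hours: int) -> datetime:
    """Interpret a naive timestamp with the offset the device really uses, then convert to UTC."""
    naive = datetime.strptime(raw, FMT)
    local = naive.replace(tzinfo=timezone(timedelta(hours=device_offset_hours)))
    return local.astimezone(timezone.utc)


# Hypothetical estate: one firewall on UTC, two stood up later on local time.
device_offsets = {"fw-utc": 0, "fw-west": -8, "fw-east": 10}

# What the pipeline believes: every device logs in UTC.
assumed = datetime.strptime(RAW, FMT).replace(tzinfo=timezone.utc)

# What actually happened, per device.
true_times = {name: to_utc(RAW, offset) for name, offset in device_offsets.items()}

for name, actual in true_times.items():
    error_hours = (assumed - actual).total_seconds() / 3600
    print(f"{name}: pipeline is off by {error_hours:+.0f} h")

spread = max(true_times.values()) - min(true_times.values())
print(f"spread across estate: {spread}")
```

With these made-up offsets, three log lines that look simultaneous in the SIEM are 18 hours apart in reality, and nothing in the data itself signals the problem.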

The third is that there's no forum where the architecture conversation even happens for security. The structures we do have for sharing, the CISA Joint Cyber Defense Collaborative, the ISACs, the Chatham-House-rule groups, are organized around threat intelligence and incident response, and they're genuinely good at that, and they matter. But they aren't built around data architecture, so a security architect who wants to compare notes on lakehouse design or OCSF mapping has no equivalent of a Subsurface or a Data Council to walk into. Data engineering talks about its architecture in public; security talks about its adversaries in private, which is the right instinct for indicators of compromise and the wrong one for infrastructure decisions, and the result is that the architecture lessons never compound across the industry the way the threat intel does.

I ended up doing something about that gap in a small way, by convening a periodic group of security-data architects who compare notes on lakehouse design, OCSF mapping, and the internals of formats like Iceberg and DuckLake, because the venue I wanted to attend didn't exist and the only way to get one was to start it. It's a peer forum and not a sales channel, and the fact that it had to be built from nothing is telling: data engineering has Subsurface and Data Council and a dozen others, and security's architects have been improvising in direct messages and hallway conversations.

AI just made it urgent

You can't put an agent on top of a parsing layer no one maintains.

For a long time you could leave that foundation cracked and get away with it, because a human analyst would eventually notice the timezone was off or the useful field was buried, and would patch around it. AI takes that slack away. An agentic SOC, a natural-language-to-query layer, an automated triage pipeline, all of it is exactly as good as the data underneath it, and security's data is the messy, unparsed, time-broken layer nobody owns. The numbers already show the strain: in the work I've tracked, natural-language-to-KQL translation produces syntactically clean queries 97 to 99% of the time and returns the correct result set only around 58% of the time, and the gap is the data, not the model. You can't put an agent on top of a parsing layer no one maintains and trust what it tells you, because the agent has no instinct that the timestamp might be a lie; it reasons over what it's handed, and what it's handed is the second-class feed. So the urgency the whole market feels about AI in security is real, and it points, whether vendors say so or not, straight at the data-quality reckoning security has deferred for a decade. AI doesn't let you skip the foundation; it raises the cost of not having built one.
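The difference between "the query runs" and "the query returns the right rows" can be scored separately. Below is a minimal sketch of that two-part check, using SQLite from the Python standard library as a stand-in for a KQL engine. The table, the column names, and the three query pairs are invented for illustration and are not the evaluation behind the figures above.

```python
import sqlite3


def evaluate(conn: sqlite3.Connection, cases: list[tuple[str, str]]) -> tuple[float, float]:
    """Return (share of generated queries that execute, share whose rows match the reference)."""
    executed = 0
    matched = 0
    for generated_sql, reference_sql in cases:
        expected = sorted(conn.execute(reference_sql).fetchall())
        try:
            actual = sorted(conn.execute(generated_sql).fetchall())
        except sqlite3.Error:
            continue  # the generated query did not even run
        executed += 1
        if actual == expected:
            matched += 1
    return executed / len(cases), matched / len(cases)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (target_user TEXT, source_user TEXT, ok INTEGER)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?, ?)",
    [("alice", "svc-backup", 0), ("bob", "bob", 1), ("alice", "alice", 0)],
)

# Each pair is (generated query, reference query); all three are hypothetical.
cases = [
    ("SELECT target_user FROM logins WHERE ok = 0", "SELECT target_user FROM logins WHERE ok = 0"),
    # Runs fine, but reads the other "user" field, so the rows are wrong.
    ("SELECT source_user FROM logins WHERE ok = 0", "SELECT target_user FROM logins WHERE ok = 0"),
    # Does not parse.
    ("SELEC target_user FROM logins", "SELECT target_user FROM logins"),
]

valid_rate, correct_rate = evaluate(conn, cases)
print(f"executes: {valid_rate:.0%}, correct rows: {correct_rate:.0%}")
```

The second case is the interesting one: it is syntactically clean and still wrong, because two fields both look like "the user" and only someone who knows the source can say which one the question meant.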

It takes two hard things

Threat hunters are becoming data scientists without quite noticing.

The fix needs someone who is both a data engineer and a security practitioner, which is a strange and slightly unreasonable thing to ask for, because each of those is hard on its own and the combination is rare. The data scientist who's never run a SOC doesn't know that an alert's timestamp is a claim that has to be earned, or that the “user” field means three different things in three different tools. The security expert who's never built a pipeline doesn't know that storage and compute can scale separately, or that an open table format lets you change engines without re-ingesting a byte. Each side has a blind spot exactly where the other has a reflex, so the person who carries both can read a board that neither half can read alone. Why would anyone sign up for that? Because the bridge between them is exactly where the value sits, and because the crossing has already started, on its own, from the security side. Threat hunters are turning into data scientists without quite noticing it happen: they're living in Jupyter notebooks, building features over telemetry, versioning detections as code, reaching for MLflow to make a hunt reproducible six months later. I've mapped that path in the work on MLOps for threat hunters and moving from notebooks to reproducible artifacts. The honest move is to name the trend and lean into it, to treat the hunter who's learning to think like a data scientist as the future of the role rather than a curiosity, because that person is the one who can finally tell security's data story in a language the rest of the data world already speaks.

Where it matters most

In regulated industries, security is either the “no” or the thing that lets you move.

All of this matters most in the regulated industries, the banks and exchanges and utilities and hospitals, because that's where security sits squarely between caution and innovation. Done badly, the security team is the “no”: the reason the cloud migration stalls another quarter, the reason the AI pilot can't get clean data, the reason a new analytics project waits on a review that never quite finishes. Done on a modern data foundation, security becomes the thing that lets a regulated business move at all, because the governance and the lineage and the data-residency controls are built into the platform instead of bolted on afterward, and a CISO can say yes to the cloud and the model and the new data source because the evidence that they're safe is already being produced.

The trouble is that the regulated incumbents are the most rooted in their legacy SIEM, because the switching cost lives everywhere at once. I know a regulated market-infrastructure operator whose management is so anchored in Splunk, with so much built on top of it, that there's no real appetite to move: every detection, every dashboard, every runbook, every analyst's muscle memory assumes it. A rip-and-replace pitched into that room dies on contact, and it should, because the risk of a big-bang cutover in an environment like that is genuinely unacceptable.

Which is why the migration that actually works isn't a rip-and-replace at all. It's Dave McComb's incremental, stealth approach to retiring a legacy system, applied to the SIEM: you move one workload onto open architecture you own, prove it on real data, keep the lights on, and let the new foundation grow underneath the old system rather than pitching it against the old system. The sequence even has an order that works. You start with the workload the legacy SIEM does worst and charges most for, the long-retention hunting and the analytical queries that scan months of data, which is exactly where per-gigabyte licensing and ninety-day retention windows hurt the most and exactly where a columnar lakehouse wins by the widest margin. The real-time detection path, the part the SOC trusts and can't afford to have break, moves last or not at all. You're not betting the business on a cutover, you're moving the cheapest-to-move and highest-value workload first, proving it, and letting the evidence buy the next step. I've watched the federated version of this hold up in a regulated, multi-region environment, where a security-data architecture spanning two distributed sites validated twenty million OCSF events with every cross-site join completing in under ten seconds and WAN bandwidth cut by better than ninety percent, which is the sort of result that moves a skeptical CISO precisely because it comes from production rather than a vendor's slide. I've written the operational version of this as the federated rollout playbook and the hidden costs of migration; the point here is that the path off a 2015-era SIEM in a regulated shop is a sequence, not a switch.

What standing in the gap looks like

Evidence you can check, not opinion you have to trust.

All of which is why I built the practice the way I did, around evidence rather than opinion. Security Data Works exists to make public the things the rest of the chain has reason to keep to itself. The benchmark methodology is open on the lab page, the reference implementation is shared under a one-page mutual NDA so it can include commercial software without putting anyone in breach, and an external reviewer is named each year with their signoff published, so a number I publish is something you can check rather than something you have to trust. The first number through that process is a plain one. On a ten-million-event Zeek workload, on identical hardware, an open columnar engine ran a five-query security analytical suite 46.8 times faster on average than the dominant schema-on-read SIEM. The margin ran 21 to 62× on the hunting-shaped aggregations, and the SIEM's index won the simple lookups. Answer equality was verified on a single node, compression came in at 8.2×, and the methodology and caveats are published so you can argue with it line by line. The capability matrix scores the components of an open architecture against that same evidence, and it scores against the disclosure rather than around it, so an active partnership with one vendor doesn't move a competitor's score. The four operating commitments behind all of it are dull on purpose: no reseller margins, no vendor-paid placements, methodology public with only the executable artifact gated, and an annual external review. None of that is glamorous, and none of it resolves the structural problem overnight, but it is what standing in the gap, rather than just describing it, requires.
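For readers who want to see what "answer equality" means mechanically, here is a minimal sketch of a comparison loop that refuses to report a speedup unless both engines return the same rows. It is not the reference implementation described above; the two toy "engines", the data, and the query names are hypothetical stand-ins that run entirely in memory.

```python
import time
from collections import Counter
from statistics import mean

# Toy data: (host_id, event_type) pairs standing in for telemetry rows.
ROWS = [(i % 50, i % 7) for i in range(20_000)]


def baseline_engine(min_count: int) -> list[tuple[int, int]]:
    """Deliberately wasteful: rescans every row once per host."""
    hosts = {host for host, _ in ROWS}
    counts = [(host, sum(1 for row in ROWS if row[0] == host)) for host in hosts]
    return [pair for pair in counts if pair[1] >= min_count]


def candidate_engine(min_count: int) -> list[tuple[int, int]]:
    """Single pass over the rows."""
    counts = Counter(host for host, _ in ROWS)
    return [(host, n) for host, n in counts.items() if n >= min_count]


def timed(fn, arg):
    start = time.perf_counter()
    result = fn(arg)
    return result, time.perf_counter() - start


def compare(queries, run_baseline, run_candidate) -> dict[str, float]:
    """Per-query speedup of candidate over baseline; raises if the answers differ."""
    speedups = {}
    for name, query in queries.items():
        base_rows, base_s = timed(run_baseline, query)
        cand_rows, cand_s = timed(run_candidate, query)
        if sorted(base_rows) != sorted(cand_rows):
            raise ValueError(f"{name}: engines disagree, so a speedup would be meaningless")
        speedups[name] = base_s / max(cand_s, 1e-9)  # guard against a zero timing
    return speedups


queries = {"busy_hosts": 100, "very_busy_hosts": 500}
speedups = compare(queries, baseline_engine, candidate_engine)
print({name: round(s, 1) for name, s in speedups.items()})
print("mean:", round(mean(speedups.values()), 1), "min:", round(min(speedups.values()), 1))
```

Reporting the minimum alongside the mean is a deliberate choice in this sketch: an average can hide the queries where the incumbent wins, and a published number should show those too.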

The hand goes up either way

The question is whether security crosses on its own terms.

I think back to that room and the one hand that stayed down, which was mine, and what I notice now is that security's hand is going up whether security likes it or not. Cost will force it, and AI will force it faster. The only question left is whether security crosses the bridge on its own terms, practitioner-owned, evidence-based, one workload at a time, or gets dragged across late and expensively by a vendor's roadmap and a renewal it was never allowed to benchmark. Someone has to stand in the gap and say the quiet parts out loud, with numbers, without something else to sell you, and the absence of anyone doing that is most of why security has stayed a decade behind the data world it could have been learning from the whole time.

The next time someone asks a room of security architects who's running a modern data stack, I'd like more than one hand to go up.

This is the thesis. Here's the evidence it rests on.

The three pillars, the open standards, and the four operating commitments are on the thesis page. The benchmark receipts are in the lab. Or book a 30-minute call and put it to work on your own data.