Security Data Works

Position piece

Agentic analysis reinvented Kimball. It skipped the measurement.

Spend any time in the data-engineering corners of the internet right now and you'll watch the same observation get made over and over. The AI labs publish careful guidance on how to feed an agent, which comes down to three things: structure the context, disclose it a piece at a time, and give the model a clean, governed view of the data instead of a raw swamp. A chorus of data engineers replies that this is just Ralph Kimball's dimensional modeling from the 1990s, rediscovered because the consumer changed from a BI analyst to an agent while the requirement stayed put. The observation is basically right, and I think it's worth taking seriously rather than dunking on, because the people making it are correct that structured, governed, progressively-disclosed data is what lets an agent reason instead of hallucinating its way across a data lake.

Reading time: about 22 minutes. Evidence tier: B where I lean on published work (the text-to-SQL literature, the MCP security research, the two infrastructure-vendor announcements) and C where I'm characterizing a community consensus or a vendor claim I can't put a number on. I'll flag the extrapolations. And I should say plainly that I came to the contrarian half of this through my own grounding work, so read the last two-thirds as an argument I have a stake in rather than a neutral survey. Where I want to push is not on whether the consensus is right, since it mostly is, but on the one step it skips, which turns out to be the step that matters most when the data in question is evidence.

Giving the consensus its due

The consensus is mostly right, and worth saying so.

Start with what's true, because it's most of the argument and skipping past it would be cheap. An agent dropped onto a raw warehouse and asked to write SQL hallucinates: it joins the wrong tables, misreads obscure column names, and forgets the business logic that says you filter the cancelled orders before you sum the revenue. The fix the consensus reaches for is the one human analysts reached for thirty years ago, which is to separate the measurable events from the descriptive context, model the events as facts and the context as dimensions, and let the consumer pull only the slice it needs. That a large language model benefits from the same shape a BI tool benefited from is not a coincidence, since both are pattern-matchers that do better against clean, well-typed, well-labelled structure than against a swamp.

The strongest evidence for this isn't a vibe, it's the text-to-SQL benchmark literature. On BIRD and Spider, the public datasets that test natural-language-to-SQL against real schemas, even tuned frontier models drop sharply when you point them at large enterprise schemas, and the prevailing fix in the research is schema linking: algorithmically prune the schema down to the handful of tables and columns relevant to the question before the model ever writes a line of SQL. That requirement is the same one the dimensional modelers are describing from the other direction. The model can't navigate the swamp; it needs a pre-filtered, well-typed subset, which is what a star schema is. The BIRD ceiling sits around 81 to 82 percent execution accuracy against a roughly 93 percent human baseline, and even that overstates real reliability, so the message is not "models can't do this" but "models do this far better against structure than against raw breadth."
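To make the schema-linking step concrete, here is a minimal sketch, assuming a toy lexical scorer in place of the learned retrievers the research systems use. The schema, the table names, and the `link_schema` function are all hypothetical; the only point is the shape of the operation, which is to prune first and generate SQL second.

```python
import re

# Hypothetical toy schema: table name -> column names.
SCHEMA = {
    "fact_orders": ["order_id", "customer_id", "order_date", "status", "revenue"],
    "dim_customer": ["customer_id", "customer_name", "region"],
    "dim_product": ["product_id", "product_name", "category"],
    "raw_clickstream": ["session_id", "url", "user_agent", "ts"],
}


def tokens(text):
    """Lowercase alphabetic tokens; underscores and punctuation act as separators."""
    return set(re.findall(r"[a-z]+", text.lower()))


def link_schema(question, schema, top_k=2):
    """Keep the top_k tables whose table and column names overlap the question most."""
    question_tokens = tokens(question)
    scored = []
    for table, columns in schema.items():
        vocabulary = tokens(table)
        for column in columns:
            vocabulary |= tokens(column)
        overlap = len(question_tokens & vocabulary)
        if overlap:
            scored.append((overlap, table))
    # Highest overlap first; ties broken by table name so the result is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return {table: schema[table] for _, table in scored[:top_k]}


# Only the pruned subset would be placed in the model's prompt.
pruned = link_schema("total revenue by customer region", SCHEMA)
```

A star schema does this pruning by design rather than at query time, which is why the two camps end up describing the same requirement.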

The rest of the consensus follows from there and holds up. You want a semantic layer so the model calls a governed definition of "net revenue" or "beaconing" instead of guessing the formula from raw tables. You want a protocol like MCP to connect the agent to that layer without hand-building a brittle integration per source, with the caveat the security researchers have already raised, that MCP widens the supply-chain surface and needs real governance rather than blind trust (Narajala and Habler's 2025 enterprise-MCP-security paper is the careful version of that point, tier B). And for the relationship-heavy questions that dimensional modeling handles badly, like tracing a multi-stage intrusion across hosts and domains, you want a knowledge graph and the LLM-built GraphRAG tooling that now populates one semi-automatically. None of that is wrong. If a prospect told me they were building toward exactly this stack, I'd tell them they were pointed the right way.

So this isn't a takedown. The consensus has the destination right, and the people repeating it are better-read than the eye-rolling implies. My disagreement is narrow and it's about one thing the victory lap skips, which only becomes visible when you stop treating the data as a dashboard input and start treating it as evidence.

Three loose spots

Where the consensus gets loose.

Before the main point, three smaller corrections, because they're the places the consensus is misquotable and a careful security architect should not absorb the loose version.

Schema-on-read isn't the villain. Cost and unstructured-ness are.

The consensus often phrases the lesson as "schema-on-read failed, the lake became a swamp, so we're going back to schema-on-write." Two different things get collapsed there. Parsing fields at query time is a latency-and-cost choice; the swamp is what you get from storage that is unstructured and uncosted, and a lakehouse is still object storage underneath. The schema-on-read SIEM didn't fail because it parsed late, it failed because the per-gigabyte economics and the full-scan latency made the common queries unaffordable, and an Iceberg lake filled carelessly swamps just as fast as any other. The requirement to write down is "structured and costed at ingest," not "read-time schema bad." If you let the loose version stand, a vendor will happily quote it back at you to sell schema-on-write as a silver bullet, when the property you actually bought was discipline about structure and cost.

Chain of custody comes from the table format, not from adopting Data Vault.

Because security data is evidence, the consensus reaches for the heaviest historical-modeling pattern it knows, which is Data Vault: model each entity as a hub, hang its descriptive attributes off dated satellites, and let the satellite history stand in as the audit trail. Data Vault is a defensible modeling choice for estates that already run it, but adopting it specifically to obtain chain-of-custody buys a lot of extra JOIN work, since every reconstruction now stitches hubs to links to satellites, in exchange for a property the table format hands you directly. The evidence requirement is append-only with "what you knew and when you knew it," and Iceberg's append-only snapshots plus V3 row lineage already satisfy it, so the modeling tax pays a second time for history the metadata already carries. I worked through the row-lineage version of this in row lineage as the missing CDC primitive. The short version: get the forensic property from the format, and spend your modeling complexity where it actually pays back.
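The evidence property itself, append-only with "what you knew and when you knew it," can be sketched in a few lines. This is an illustration of the property only, not Iceberg's API: the `Observation` and `AppendOnlyLog` names are hypothetical, and a real table format carries the same information in snapshot metadata rather than in application code.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    entity: str            # e.g. a host or an IP address
    attribute: str
    value: str
    recorded_at: datetime  # when the estate learned this, not when it happened


class AppendOnlyLog:
    """Rows are only ever added, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def append(self, obs):
        # Refuse backdated writes so history cannot be quietly rewritten.
        if self._rows and obs.recorded_at < self._rows[-1].recorded_at:
            raise ValueError("append-only log cannot be backdated")
        self._rows.append(obs)

    def as_of(self, entity, attribute, when):
        """Latest value recorded for (entity, attribute) at or before `when`."""
        known = [
            row for row in self._rows
            if row.entity == entity
            and row.attribute == attribute
            and row.recorded_at <= when
        ]
        return known[-1].value if known else None


log = AppendOnlyLog()
log.append(Observation("10.0.0.5", "verdict", "benign", datetime(2025, 1, 1)))
log.append(Observation("10.0.0.5", "verdict", "malicious", datetime(2025, 1, 3)))

# What did we believe on January 2, before the verdict changed?
belief_jan_2 = log.as_of("10.0.0.5", "verdict", datetime(2025, 1, 2))
belief_jan_4 = log.as_of("10.0.0.5", "verdict", datetime(2025, 1, 4))
```

Nothing here needs hubs, links, or satellites; the history comes from never overwriting a row and from being able to ask the question as of a point in time.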

"Semantic layer" and "ontology" are not the same word.

The consensus uses "semantic" in two senses and slides between them without noticing. There's the dbt-MetricFlow / AtScale / Cube sense, where a semantic layer standardizes metric definitions so the agent asks for "net revenue" and gets a governed formula, and there's the formal-ontology sense, where you declare the classes and relationships that can exist and what they mean, so a reasoner can tell that a particular statement is impossible. Both are useful and they are not interchangeable. A metrics semantic layer keeps the agent from inventing a formula; it does nothing to tell you whether a field that fits the OCSF shape actually means what the schema says it means. Keeping the two apart matters for the next section, because the measurement I care about lives in the formal sense, not the metrics one.

The skipped step

The measurement the victory lap skips.

Here's where I get off the consensus train. The bleeding-edge version of the agentic-data argument is that you no longer hand-author the ontology and the knowledge graph at all: you point a reasoning model at your raw logs and threat intel, it deduces the classes and relationships, and it populates the graph on its own. The pitch calls this the holy grail, and the only downside it admits is the odd duplicate node, the model creating one entity for "Microsoft" and another for "MSFT" that you later dedupe. For a marketing dashboard, fine. For data that is going to be someone's evidence, that framing is dangerous, because it describes the wrong failure mode.

The failure that matters isn't a duplicate you can spot and merge. It's a confident wrong edge that fails silently. When I've run model-built and crosswalk-built mappings against a typed ontology, the errors that survive aren't garbled, they're plausible: a mapping that crosses a type boundary, an actor mapped where the schema expects an object, a relationship asserted in the wrong direction, all of it reading perfectly well to a human reviewer because a confident-but-wrong model produces exactly the output that looks right. In my own grounding corpus a measurable share of mappings came back sound-but-wrong, consistent enough to pass a reasoner's basic checks and still wrong, with a meaningful coverage gap where the mapping simply didn't reach the part of the schema it claimed to. A wrong IP-to-threat-actor edge doesn't announce itself. It just quietly corrupts the hunt, and worse, it can manufacture a false attribution that looks like signal.

The answer is not to distrust the model and crawl back to hand-authoring everything. The answer is to put a deductive check after the generation. Give a mapping two independent groundings, the type of the schema path it targets and the semantic type of the source field itself, write down which types are disjoint (an actor is not an object, a port is not a host), and a reasoner can derive that a mapping crossing those types is a contradiction and fail the build on it. It derives that from the ontology rather than from whether the code reads plausibly, and there's no model in that loop, so it returns the same verdict whether a frontier model, a small local model on air-gapped hardware, or a tired engineer at 2am produced the mapping. That property is what the rising-tide story can't give you: a deductive check doesn't get better or worse with the model, because the model isn't in it. I run an OWL reasoner (ELK, driven through ROBOT) over a hand-authored disjointness layer, and within its scope it does the one thing a human review pass structurally cannot, which is to fail a mapping that looks correct.
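The shape of that check can be shown without a reasoner. The sketch below is a plain-Python stand-in for the single inference described above (disjointness propagated up a subclass chain); it is not OWL, it does not use ELK or ROBOT, and the class names, field names, and target paths are illustrative assumptions rather than a real ontology or a complete OCSF mapping.

```python
# Hypothetical class hierarchy (child -> parent) and hand-written disjointness axioms.
SUBCLASS_OF = {
    "ThreatActor": "Actor",
    "User": "Actor",
    "File": "Object",
    "Process": "Object",
}
DISJOINT = {
    frozenset(("Actor", "Object")),  # an actor is not an object
    frozenset(("Port", "Host")),     # a port is not a host
}


def ancestors(cls):
    """The class itself plus every superclass up the chain."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return set(chain)


def contradiction(source_type, target_type):
    """Return the disjoint pair this mapping violates, or None if it is consistent."""
    for a in ancestors(source_type):
        for b in ancestors(target_type):
            if frozenset((a, b)) in DISJOINT:
                return tuple(sorted((a, b)))
    return None


def check_mappings(mappings):
    """Collect every mapping whose two groundings are provably disjoint."""
    failures = []
    for m in mappings:
        violated = contradiction(m["source_type"], m["target_type"])
        if violated:
            failures.append((m["source_field"], m["target_path"], violated))
    return failures


# Each mapping carries two independent groundings: the semantic type of the
# source field and the type of the schema path it targets.
mappings = [
    {"source_field": "src_user", "source_type": "User",
     "target_path": "actor.user", "target_type": "Actor"},
    {"source_field": "threat_group", "source_type": "ThreatActor",
     "target_path": "file.name", "target_type": "File"},
    {"source_field": "dst_port", "source_type": "Port",
     "target_path": "dst_endpoint.hostname", "target_type": "Host"},
]

failures = check_mappings(mappings)
if failures:
    print(f"{len(failures)} mapping(s) contradict the disjointness layer")
```

The verdict depends only on the declared types and axioms, so it is identical no matter what produced `mappings`. In a build pipeline the non-empty `failures` list would be the condition that fails the build.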

This isn't only theory, and I've felt the consequence first-hand. Years ago I found a vendor's own SIEM app silently mis-mapping the vendor's own logs: fields never reached the data models or the correlation searches, and the problem was invisible across the entire install base because it rode along in the integration the vendor shipped. I fixed a chunk of it in a public pull request to that vendor's app, which is the kind of war story every practitioner has a version of. What's new is that the silent failure now has a deterministic catcher: the same class of error I found by hand is a thing a reasoner can score on a real shipped integration, repeatably, with the gap between what the vendor claims it maps and what it actually maps written down as a number. I work through the mapping mechanics in LLM-assisted OCSF mapping and the silent-failure pattern in the field-mapping anti-pattern.

The cleanest instance

The cleanest instance: AI-generated integration code.

The abstract version of this argument is "trust the generation only after a deductive check." The concrete version that's easiest to reason about right now is the one playing out in security-data integration, where the same accuracy hinge decides whether a much-hyped pattern is transformative or merely useful. It's worth walking through carefully, because it's the cleanest worked example of the constrained-inference economics in the whole pillar.

In November 2025 two infrastructure vendors that don't usually compete signaled the move within two weeks of each other. Tenzir announced an MCP server on November 11 whose pitch is that you paste a single log sample (EDR, cloud service, identity provider, anything) and the server generates a complete parser, an OCSF mapping, a test suite, and a deployable package, so you get a production-ready integration in one conversation, which Tenzir frames as "100% hands-off keyboard." Their CEO Matthias Vallentin puts it more strongly, saying "the power dynamic just flipped," because vendors no longer control the integration catalog and customers do. Databricks announced an MCP Catalog four days earlier, a central registry for MCP servers wired into Unity Catalog governance, which points at a different use for the same protocol: the control plane for AI agents operating on governed enterprise data rather than a chat enhancement. That governance-catalog thread belongs to a different argument (the agentic-SOC reality-versus-marketing question I take up in what's real vs marketed in the agentic SOC), so I'll leave Databricks there and stay with the Tenzir case, because the parser-generation claim is the one that tests grounding directly.

Worth being precise about what MCP is, since the word gets used loosely. The Model Context Protocol is the standard Anthropic open-sourced in November 2024 to solve a connector explosion: every AI application needs to reach external systems (databases, APIs, file stores, SaaS), and before MCP every vendor built proprietary connectors, so Anthropic wrote a Google Calendar connector for Claude, OpenAI wrote a different one for ChatGPT, the same data source integrated twice, then four times and eight times as you add AI vendors. MCP standardizes the interface: build one server for your data, and any MCP-compatible client can use it. The original use case is the obvious one, Claude reading your Slack or ChatGPT querying your Postgres, which is table stakes for enterprise AI agents and the protocol's reason for existing. What changed in late 2025 is that infrastructure vendors, not GenAI application vendors, started using MCP for something other than chat, and the Tenzir move is MCP as the orchestration layer for AI-generated data pipelines rather than a smarter chatbot. The distinction that matters is control-plane versus chat: chat makes a human's questions easier, while the control-plane version puts the AI in the path that generates and operates the integration code your detections run on.

The economics are where this gets interesting, and they turn entirely on accuracy. The old workflow for OCSF integration ran 2 to 5 days for an experienced engineer per vendor source: read vendor documentation, map fields manually, write parser code, write tests, deploy, debug, plus the ongoing maintenance burden when the vendor changes their schema. The new workflow Tenzir is claiming runs in hours, with regeneration replacing manual updates when schemas drift. If that's accurate, it amounts to roughly a 10× shift in the economics of security-data integration, and that figure is doing a lot of work, because the whole reason it would matter is that integration is most of the job. Roughly 80 percent of security-data-engineering time is integration grunt work — vendor API connectors, schema mapping, parser maintenance, test-case generation — which leaves architecture as the other 20 percent. A 10× cut to the 80 percent is the difference between a vendor's roadmap deciding when your specific log source gets first-class support and you generating the parser the afternoon you need it. But if the accuracy is closer to 80 percent rather than production-grade, the gain is closer to 2 to 3×, which is real but a good deal less transformative, because the 20 percent you have to correct by hand is exactly the hard 20 percent, and correcting AI-generated parser code that no human wrote is not free.
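The arithmetic behind that hinge can be written out as an Amdahl-style model. This is a sketch under stated assumptions, not a derivation from the vendor's numbers: the `effective_speedup` function and the `correction_cost` multiplier are hypothetical, and the model simply assumes that the corrected share of the work costs at least as much as doing it by hand.

```python
def effective_speedup(auto_speedup, fraction_needing_correction, correction_cost=1.0):
    """Overall speedup when only part of the work is actually automated.

    auto_speedup: speedup on the share the generator gets right (e.g. 10).
    fraction_needing_correction: share of the work a human must redo (0..1).
    correction_cost: cost of fixing machine-written code relative to writing
        it by hand (1.0 = same cost, 1.5 = 50 percent more expensive).
    """
    automated_time = (1.0 - fraction_needing_correction) / auto_speedup
    manual_time = fraction_needing_correction * correction_cost
    return 1.0 / (automated_time + manual_time)


# Production-grade accuracy: the full 10x survives.
best_case = effective_speedup(10, 0.0)

# 20 percent needs correction at the same cost as hand-writing it.
same_cost = effective_speedup(10, 0.2)

# 20 percent needs correction, and fixing unfamiliar generated code costs more.
costlier_fix = effective_speedup(10, 0.2, correction_cost=1.5)
costliest_fix = effective_speedup(10, 0.2, correction_cost=2.0)

for label, value in [("0% corrected", best_case),
                     ("20% corrected, 1.0x cost", same_cost),
                     ("20% corrected, 1.5x cost", costlier_fix),
                     ("20% corrected, 2.0x cost", costliest_fix)]:
    print(f"{label}: {value:.2f}x")
```

Under this model the 20 percent that needs hand correction dominates the total: even at parity cost the gain falls to roughly 3.6×, and once the correction is priced at 1.5 to 2 times the manual baseline it lands in the 2 to 3× band.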

That accuracy hinge is the same deductive-verifiability thesis pointed at integration code instead of an ontology mapping. The question isn't whether the model can produce a parser that looks right — it plainly can, and the demo is convincing — but whether there's a check that fails the parser that looks right and is wrong. Without that check, the 80-percent and the 95-percent worlds are indistinguishable from the demo, and you find out which one you're in when a detection silently stops firing.

The integration-edition silent error

The three failure modes are the silent-error class, integration edition.

The errors that would matter here aren't the parser that crashes, which you'd notice. They're the integration-specific instance of the same sound-but-wrong class from the ontology section: plausible, shipped, and quietly off. Three of them are worth naming because each one is a place the demo passes and production doesn't.

The first is the 80-percent-solution problem. AI might generate parsers that work for 80 percent of common log samples and fail on the edge cases that are most of the actual difficulty: nested JSON structures, vendor-specific fields with unstable semantics (Okta's debugContext field is a classic example, where the shape and meaning shift across event types), multi-format logs from the same vendor across different products with incompatible schemas, and the enrichment logic that goes beyond field mapping into joining context from other sources. If AI-generated parsers need 20 percent manual correction in production, the productivity gain shrinks from 10× to 2 to 3×, which is still useful even if it stops short of transformative.

The second is the production-reliability problem. Generating a parser in a demo is easy; running one reliably in production is the hard part, because that's where you need error handling for malformed logs and unexpected fields, performance characterization across the 10 GB/day to 10 TB/day spectrum, and schema-drift detection — when the vendor changes format, does the parser break loudly or silently? — along with observability for debugging AI-generated code that no human wrote. If the generated parser fails in production and requires manual debugging by an engineer who didn't write it, the maintenance burden returns at higher cognitive cost than the manual baseline, because reading and repairing unfamiliar machine-written code is slower than fixing code you authored.
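The loud-versus-silent question is easy to show in miniature. The sketch below assumes a hypothetical four-field log format and a hypothetical `parse_event` function; it is not any vendor's parser, only an illustration of how the same drift either raises or disappears depending on one design choice.

```python
# Hypothetical field contract for one log source.
EXPECTED_FIELDS = {"timestamp", "user", "src_ip", "outcome"}


class SchemaDrift(Exception):
    """Raised when an event no longer matches the field contract."""


def parse_event(raw, strict=True):
    """Normalize one event dict, optionally failing loudly on drift."""
    present = set(raw)
    missing = EXPECTED_FIELDS - present
    unexpected = present - EXPECTED_FIELDS
    if strict and (missing or unexpected):
        raise SchemaDrift(
            f"missing={sorted(missing)} unexpected={sorted(unexpected)}"
        )
    # Lenient mode fills gaps with None, which is exactly how drift goes silent.
    return {field: raw.get(field) for field in sorted(EXPECTED_FIELDS)}


# The vendor renamed "outcome" to "result" in a new release.
drifted = {
    "timestamp": "2025-11-11T10:00:00Z",
    "user": "alice",
    "src_ip": "10.0.0.5",
    "result": "FAILURE",
}

try:
    parse_event(drifted, strict=True)
    loud_failure = None
except SchemaDrift as exc:
    loud_failure = str(exc)

silent_record = parse_event(drifted, strict=False)
```

In lenient mode the record still parses and `outcome` becomes `None`, so a detection that filters on failed logins simply stops matching. In strict mode the same change surfaces as an error naming the fields that moved.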

The third is the OCSF-complexity problem, and it's the one that's the cleanest match to the silent-edge failure from the ontology argument. OCSF has 40+ event classes, 200+ attributes, and inheritance hierarchies, and mapping vendor logs to it requires understanding which event class actually applies — Authentication versus Account Change versus User Access Management, which are distinct classes that vendors freely mix in their own logs — plus how to handle vendor-specific fields OCSF has no slot for, plus the enrichment logic again. If AI consistently picks the wrong OCSF class, downstream analytics break in subtle ways: detection rules don't match because the events are filed under the wrong category, and the failure mode is "we missed the alert" rather than "the parser crashed." It's the familiar garbage-in, garbage-out problem except that the garbage is OCSF-shaped and harder to spot, which is precisely the property that makes a deductive check worth more than a demo. A misclassification into the wrong OCSF class is a type-crossing the same way an actor-mapped-as-object is a type-crossing, and it's the same reasoner that could fail it.

Why customer control is the durable part

The air-gapped local model, and why customer control is the durable part.

There's a part of the Tenzir framing I think is right and durable independent of whether the parser-generation accuracy holds up, which is the power-dynamic point. When the integration catalog is something a customer can generate rather than something a vendor ships on its own roadmap, the customer controls the integration timeline, and that's a real shift in who holds leverage. Vallentin's "the power dynamic just flipped" is a vendor's line, so discount it accordingly, but the underlying claim — that customers controlling the integration catalog is more valuable than vendors controlling it — is one I'd make on the merits.

The version of that I can stand behind is the one that doesn't phone home. A self-hosted MCP server over your own lake, run by a model you control, is the posture that actually delivers customer control, as against a SaaS MCP that routes your integration work through someone else's endpoint and calls it customer-controlled. And the practitioner-owned version is demonstrable now, not a someday. There's a worked reference for building the server yourself in the build-your-own-MCP example, and in the companion MOAR stack a local model (Ollama gemma) ran a code-action hunt over the lakehouse end to end — wrote the SQL, read the result, answered — with the local model as the only endpoint, fully air-gapped. That's the proof I care about, because it's the fair-broker posture made concrete: the analytical loop closed over your own data with no frontier API in the path and nothing leaving the estate. It's also the setup for the next point, that the deductive check returns the same verdict whether a frontier model or that little air-gapped gemma produced the mapping, because the check isn't a model.

Why it's a security point

Evidence doesn't get the BI tolerance.

The reason this lands differently in security than in business intelligence is the cost of being quietly wrong. In BI, a dropped record or a slightly miscalculated dimension means a revenue dashboard is off by a fraction of a percent, someone notices at quarter-end, and it gets reconciled. In security the same data is the thing an investigation stands on, so an altered timestamp, a dropped log, or a field mapped the wrong direction doesn't shade a dashboard, it breaks a case. The autonomous-ontology consensus carries the BI tolerance for silent error into a domain that can't afford it, and that mismatch is the whole argument. A method whose errors are plausible and silent is fine where the consequence is a rounding error and disqualifying where the consequence is a missed intrusion or a wrong accusation. The same logic governs the parser case: an 80-percent parser is fine for a dashboard feed and not fine for the data a detection fires on.

That's also why the buyer for this is not the dashboard owner but the accountability layer, the person who answers to an auditor or a board when a control everyone assumed was covered turns out to have been silently blind. They are the ones who actually need the mapping verified rather than asserted, because they're the ones who carry the budget line when it fails. The deductive check is the thing you can hand that person: not a louder claim that the data is grounded, but a measurement of whether the grounding holds, model-independent and re-runnable, with the honest gaps named.

This is the part of the practice I'd rather be judged on than any single architecture opinion. The fair-broker stance only means something if the brokering is measured, so the field-mapping check sits inside the same discipline as the rest of the work: score what's actually shipped against what's claimed, write down the delta, and let the buyer see it. I make the broader version of that argument in independent measurement. The agentic-data consensus is the setup; the measurement is the position.

The falsification ledger

What would change my mind, stated so it can.

Because the AI-generated-parser claim is a vendor claim I can't yet check, I'd rather hold it as a falsifiable position than an opinion, so here's the ledger in standing form. What would move my confidence up: production case studies showing OCSF mapping accuracy above 90 percent on diverse sources; a second wave of vendors beyond Tenzir adopting MCP for data engineering rather than just chat, which would signal a pattern instead of a single bet; and independent benchmarks on parser accuracy across difficult log formats, run by someone without a product to sell. What would move it down: testing showing accuracy below 70 percent and requiring extensive manual correction; no production adoption beyond the announcements; or vendor positioning that quietly retreats from "100% hands-off" to "starting point requiring manual completion," which would be honest but would also reframe the value proposition from transformative to merely useful. The hypothesis is testable, and the lab roadmap covers it: install the server, run it against real security log sources (Okta, CrowdStrike EDR, AWS CloudTrail, and more), measure OCSF mapping accuracy against the 90 percent and 70 percent thresholds, document the failure modes specifically, and compare against the manual mapping effort for the same sources. Results will appear either way, because the value of publishing is being on the record whichever way it lands.
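The scoring step in that ledger can be sketched as follows, assuming a hand-built reference mapping per source to score against. The field names, target paths, and both function names are hypothetical; the 90 and 70 percent thresholds are the ones stated above.

```python
def mapping_accuracy(predicted, reference):
    """Fraction of reference fields whose predicted target matches exactly.

    A field the generator never mapped counts as wrong, so coverage gaps
    lower the score instead of hiding.
    """
    if not reference:
        raise ValueError("reference mapping is empty")
    correct = sum(
        1 for field, target in reference.items()
        if predicted.get(field) == target
    )
    return correct / len(reference)


def ledger_verdict(accuracy, up=0.90, down=0.70):
    """Translate a measured accuracy into the ledger's three outcomes."""
    if accuracy > up:
        return "confidence up"
    if accuracy < down:
        return "confidence down"
    return "inconclusive"


# Hypothetical reference mapping for one source, authored by hand.
reference = {
    "actor_name": "actor.user.name",
    "client_ip": "src_endpoint.ip",
    "event_time": "time",
    "result": "status",
    "target_app": "service.name",
}

# Hypothetical generated mapping: one wrong target and one field never mapped.
predicted = {
    "actor_name": "actor.user.name",
    "client_ip": "dst_endpoint.ip",
    "event_time": "time",
    "result": "status",
}

accuracy = mapping_accuracy(predicted, reference)
verdict = ledger_verdict(accuracy)
print(f"accuracy={accuracy:.0%} verdict={verdict}")
```

Scoring per source rather than in aggregate keeps the hard formats from being averaged away by the easy ones.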

Honest limits

What the deductive check doesn't do, and where I'm biased.

I'd undercut my own point if I let the check sound like more than it is. It catches actor-versus-object type-crossing, the contradiction you can derive from disjointness, and it does not yet catch the subtler lossiness, an enum collapsed to a coarser one or a structure quietly flattened, which is a different axis I won't claim it covers. The disjointness layer is hand-authored and I've adjudicated it over a modest set of artifacts so far, so a mapping scored over a much wider region of the schema carries lower confidence until that layer is extended, and someone has to do that authoring, which is the ontological judgment that doesn't come for free. The catch rate I trust most was measured against injected corruptions rather than a held-out set of confirmed human errors, so it's a strong first-pass validation, not a field trial. None of that retreats from the core claim, which is only that a deductive check fails on a mapping that looks correct where an inspection-based check can't, and that this property is worth the most exactly where the data is evidence.

And the bias disclosure I promised at the top. The consensus I'm describing is partly an echo chamber, including the conversations I have with my own tools, which tend to hand my framing back to me as if they'd arrived at it independently, so I try not to mistake agreement-by-mirror for corroboration. I also have a stake: the deductive check is the measurement my practice is built around, so I'm not a neutral party on whether it matters. Read the last two-thirds with that in mind. The thing I'd ask you to take from it isn't "use my tool," it's the narrower claim that for autonomously-built mappings and machine-written integration code on evidence-grade data, trusting the generation without a deductive check is the trap, because the errors are sound-but-wrong and silent, and silent-wrong is precisely what evidence work can't carry.

The narrow disagreement

We agree on the destination.

The agentic-data consensus and I want the same thing: grounded, structured, governed data that an agent can reason over without inventing its way into a wrong answer. Most of the way there we're walking the same road, and the people pointing out that this is Kimball again are doing useful work, because the industry does keep relearning that you can't point a model at a swamp and expect sense. The disagreement is narrow and it's about the last step. The consensus treats a grounded ontology as a thing you build and then trust, and increasingly as a thing a model builds and you trust, while I think for data that's going to be someone's evidence you build it and then prove it, with a check that doesn't run on a model's confidence. The same goes for the parser the model writes you.

So if you're standing up the stack the consensus describes, build it. Structure the data, write the semantic layer, wire the MCP server, let the model draft the mappings and the parsers and the graph. Then, before any of it carries weight in an investigation, put the deductive check behind it and measure whether the grounding actually holds, because the difference between a dashboard and a case is whether you measured that you arrived or trusted that you did. As for the AI-generated-integration pattern, I'd call it a promising one worth watching rather than a validated architectural shift — the capability is proven and the buyer side is untested, which is a reason to run the experiment, not to bet the integration strategy on the claim or to ignore it. For evidence, I'd measure.