Security Data Works

Practitioner deep-dive

What your data means vs what shape it is.

If you read the flagship essay in this series, you've already got the uncomfortable picture: a detection can compile clean, deploy without a complaint, and quietly match nothing for months, because somewhere upstream a field got mapped to a path that doesn't exist in the data it's actually running against. I won't re-tell that whole story here. I want to pull on one thread inside it, the one I think explains why the failure is so common and so hard to see: the tooling we trust to tell us a mapping is right only ever checks the shape of the data. Shape and meaning are two different things, and a mapping can get the shape exactly right while getting the meaning exactly wrong.

That's the part that fools everyone, me included. When a field validates, when it's present and it's the right type and the schema check goes green, it feels like correctness. The pipeline told you it's fine. But the schema check can't see the one thing that actually matters for your detection, which is whether the field means what your rule assumes it means, so a green check is reassurance about the wrong question. I want to make that distinction concrete, because once you can see it you start spotting it everywhere. I'll show you eight examples, surfaced by a check I built and ran over a real corpus, where the shape was fine and the meaning was crossed.

Grounding method · a term used two ways

[Figure: Two different things both called a semantic layer. Panel A, the metrics semantic layer (category examples: dbt, Cube, AtScale): defines how a number is computed, e.g. revenue = sum(net_amount), tracks which tables a metric is built from, and guarantees one definition across dashboards. Panel B, formal entity grounding (D3FEND artifacts, OWL classes, disjointness): asserts what kind of thing a field holds, e.g. a UserAccount is an actor, not a File, and lets a reasoner find a contradiction.]
"Semantic layer" gets used for two genuinely different things. Sense A is the business-intelligence metrics layer — consistent metric definitions and lineage over a warehouse, the category dbt, Cube, and AtScale sit in — and it answers how a number is computed and keeps that computation consistent across reports. Sense B is formal entity grounding, the ontology sense: asserting what a thing is and how it relates at the level of identity, through D3FEND artifacts, OWL classes, and disjointness. Machine grounding for security data needs Sense B, because a metrics layer keeps your aggregates consistent without telling you what kind of thing a field holds, which is the question entity grounding answers and the one the reasoner checks.

Shape is one thing

Shape is one thing, meaning is another.

Start with what a schema actually gives you. If you're on a modern stack your normalized events are probably in OCSF, the open shared shape for security events, and OCSF gives you a shape: a category, a class, an attribute name, a type. It says a network-connection event has a dst_endpoint with these sub-fields, that dst_endpoint.ip holds an IP address, that src_endpoint is its counterpart. That's genuinely useful, because it means your content gets written once against dst_endpoint.ip instead of once per vendor, and a normalization schema that does that well is worth having. I'm not down on OCSF as a schema. I just want to be precise about what kind of thing it is, because the precision is where the trouble hides.

The shape is "there's a field here called dst_endpoint.ip and it holds an IP address." The meaning is "this is the address the connection went to, not the one it came from." Those are different claims, and only the first one is something a schema can check for you. When some integration decides that a vendor's field becomes OCSF's dst_endpoint.ip, it's making a claim about meaning: that the thing the vendor called a destination is the same kind of thing OCSF means by the destination endpoint. That claim can be true or false independent of whether the field validates. If the vendor's field was actually the source address, the mapping is wrong, but it's wrong in a way nothing in the shape check can catch: the field is present, it's an IP, it's the right type, the schema goes green. The detection downstream inherits a backwards source-and-destination and never says a word about it.

This is what I mean by semantics, which is just a slightly academic word for what the data means rather than what shape it's in. A schema validates the shape. It does not validate the meaning, and it can't, because meaning isn't a property of the field's type. It's a property of what the field is about in the world, and there's no type signature for "this is the address the connection went to." That gap, between the shape a schema can check and the meaning it can't, is where the silent failure lives. Schema conformance feels like correctness because it's the only check most pipelines run, but what it actually tells you is narrower than that: one specific kind of complaint is absent.
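To make the swapped-address case concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the vendor field names (orig_addr, resp_addr), the two mapper functions, the shape_ok check, and the toy rule are all hypothetical, and shape_ok stands in for a schema validator rather than reproducing OCSF's real validation. It shows two mappings that both pass the shape check while only one carries the meaning the rule relies on.

```python
import ipaddress

# Hypothetical vendor record; these field names are invented for illustration.
vendor_event = {"orig_addr": "10.0.0.5", "resp_addr": "203.0.113.7"}

def map_correct(ev):
    """Originator becomes the source endpoint, responder the destination."""
    return {"src_endpoint": {"ip": ev["orig_addr"]},
            "dst_endpoint": {"ip": ev["resp_addr"]}}

def map_swapped(ev):
    """Same output shape, but source and destination are crossed."""
    return {"src_endpoint": {"ip": ev["resp_addr"]},
            "dst_endpoint": {"ip": ev["orig_addr"]}}

def shape_ok(event):
    """Shape check only: both paths exist and each value parses as an IP."""
    try:
        for side in ("src_endpoint", "dst_endpoint"):
            ipaddress.ip_address(event[side]["ip"])
    except (KeyError, TypeError, ValueError):
        return False
    return True

def rule_matches(event, watched_destination="203.0.113.7"):
    """Toy detection: fire when the connection went to a watched address."""
    return event["dst_endpoint"]["ip"] == watched_destination

good = map_correct(vendor_event)
bad = map_swapped(vendor_event)

print(shape_ok(good), rule_matches(good))  # True True
print(shape_ok(bad), rule_matches(bad))    # True False
```

Both events validate, and only the first one triggers the rule. Nothing in shape_ok could distinguish them, because the difference is not in the type of either field.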

Eight from a real corpus

A mapping that fits the shape and crosses the meaning.

Let me make this less abstract, because I think the examples are more convincing than the argument. I built a check, a deductive gate I describe in the next essay, that looks at a field mapping and instead of asking "is this the right shape" asks "do these two things actually mean the same kind of thing." It works by grounding each mapping two independent ways: once by the OCSF path it maps to, and once by what the source field itself actually means, and then it asks a reasoner, the program that works out what follows from what you've told it, whether the two groundings can both be true at once. When the source field and the target path describe different kinds of thing, a user where a process belongs, a file path where a process name belongs, the two groundings contradict and the gate fires. No machine learning anywhere in it, just logic over definitions.
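The two-groundings idea can be sketched in a few lines of Python. This is not the gate itself and not an OWL reasoner: it is a toy under stated assumptions, with a hand-written class hierarchy, hand-written disjointness pairs, and hypothetical grounding tables (SOURCE_GROUNDING, TARGET_GROUNDING) that I made up for the example. It only illustrates the question being asked, namely whether the class implied by the source field and the class implied by the target path can both hold.

```python
from itertools import product

# Hypothetical mini-ontology; class names and axioms are illustrative only.
SUBCLASS_OF = {
    "Process": "DigitalArtifact",
    "File": "DigitalArtifact",
    "Host": "DigitalArtifact",
    "Application": "DigitalArtifact",
}

# Pairs of classes asserted to share no individuals.
DISJOINT = {
    frozenset({"Process", "File"}),
    frozenset({"Host", "Application"}),
}

# Grounding 1: what the target path says the value is (assumed).
TARGET_GROUNDING = {"process.name": "Process", "dst_endpoint": "Host"}

# Grounding 2: what the source field itself means (assumed).
SOURCE_GROUNDING = {
    "file.path": "File",
    "service.name": "Application",
    "process.name": "Process",
}

def ancestors(cls):
    """Return cls plus every superclass reachable through SUBCLASS_OF."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return set(chain)

def contradiction(source_field, target_path):
    """Return the disjoint (source, target) class pair, or None if consistent."""
    src = ancestors(SOURCE_GROUNDING[source_field])
    tgt = ancestors(TARGET_GROUNDING[target_path])
    for a, b in product(src, tgt):
        if frozenset({a, b}) in DISJOINT:
            return (a, b)
    return None

print(contradiction("file.path", "process.name"))     # ('File', 'Process')
print(contradiction("service.name", "dst_endpoint"))  # ('Application', 'Host')
print(contradiction("process.name", "process.name"))  # None
```

A real reasoner derives far more than this lookup does, but the shape of the answer is the same: either the two groundings are jointly satisfiable, or a named pair of disjoint classes explains why they are not.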

I ran it against a real crosswalk corpus: six schemas mapped into OCSF (CIM, UDM, ASIM, ECS, OpenTelemetry, and Zeek), 925 mapping rows of the kind real people hand-build when they onboard sources. These weren't synthetic mistakes I planted; they were genuine hand-written mappings, the everyday work. The gate flagged eight of them as cases where the shape was fine and the meaning had crossed, the exact failure I've been describing, sitting in a real corpus where every one of those rows had presumably passed whatever shape check the authors ran. Here's what they were, because the specifics carry the case better than any summary.

Three of the eight were an application mapped to a destination host. A field that names a service or an app, service.name, app, target.application, got mapped into dst_endpoint, which is the shape OCSF uses for the host a connection went to. The shape fits, because an endpoint and an application both end up looking like an entity with a name, and the type validates. But an application is not a host. The thing running on a machine is a different kind of thing from the machine it runs on, and a detection that keys on dst_endpoint expecting "the host the connection reached" will, for these rows, get an application name instead, and either silently miss or silently match the wrong population, with nothing in the pipeline flagging that an app got filed where a host was expected.

Another three were UDM's principal used as a host. UDM's principal is the actor in an event, the entity that did the thing, and three rows mapped a principal's attribute, principal.mac, into src_endpoint.mac, the source host's hardware address. Sometimes the actor and the source host are the same machine and you'd get away with it, but the mapping asserts they're always the same kind of thing, and they aren't: the principal is who acted, the source endpoint is where the connection came from, and collapsing one into the other is a meaning-crossing that the shape check can't see because a MAC address is a MAC address whichever entity it belongs to.

The last two are the ones I find clearest. A file path mapped to process.name, which takes the path of a file and files it as the name of a process, two related things, because a process does run from a file, but not the same thing, and a rule matching on process name that's actually getting a file path will not match what it thinks it's matching. And a Zeek HTTP host-header mapped to a URL, where the Host: header (a hostname) got filed as a full URL (a located resource), again related, again not the same kind of thing, again invisible to a check that only asks whether the field holds a string of the right type. In every one of these the shape was correct. The field was present, it was the right type, it would have validated against the schema all day. What was wrong was the meaning, and the only reason I can show you these eight is that the gate checks the thing the schema can't.

Why it slips through

Why this isn't already caught for you.

The honest answer to "why doesn't my stack already catch this" is that the reference material doesn't carry the information a checker would need. To catch a meaning-crossing, a tool has to know that two kinds of thing genuinely can't be the same individual, that a process is not a user account, a host is not an application, a file path is not a process name, even when those things are closely related. That assertion, that two kinds can't be the same thing, is called disjointness, and it's what gives a reasoner grounds to object. Without it, nothing in the map says a process can't also be a user account, so a checker has no basis to complain when you map one to the other, and the wrong mapping sails through exactly the way it does in your pipeline today.
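A small Python sketch shows why the disjointness assertion is the piece that matters. It is a toy, not a reasoner: the clashes function and the class names are hypothetical, chosen only to mirror the process and user-account example above. The same mapped value is typed two ways, and the check is run once with no disjointness axioms and once with the one that applies.

```python
from itertools import combinations

def clashes(asserted_types, disjoint_pairs):
    """Return every pair of asserted types that is declared disjoint."""
    return [
        (a, b)
        for a, b in combinations(sorted(asserted_types), 2)
        if frozenset({a, b}) in disjoint_pairs
    ]

# One mapped value, typed two ways: by its source field and by its target path.
asserted = {"Process", "UserAccount"}

# An ontology that never says the two classes are disjoint.
no_axioms = set()

# The same ontology with the single missing assertion added.
with_axiom = {frozenset({"Process", "UserAccount"})}

print(clashes(asserted, no_axioms))   # []
print(clashes(asserted, with_axiom))  # [('Process', 'UserAccount')]
```

With no axioms the checker stays silent, which is not the same as the mapping being right. It only means nothing in the map forbids one individual from being both things.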

D3FEND, MITRE's map of defensive techniques and the digital artifacts they act on, is the one real formal ontology in this stack, by which I mean an agreed, machine-readable map of what a process and a file and a credential actually are and how they relate. It's the natural place to ground OCSF fields into, because it already defines those artifacts. But off the shelf it ships thin on exactly the assertions a checker needs, only a handful of disjointness pairs in the whole ontology and none among the core artifacts your OCSF objects map to, so even with D3FEND in the loop a reasoner has nothing to object to. The gate I ran adds that missing layer itself, which is why it could catch the eight crossings, and it's the subject of the next essay in this series. I'll keep the boundaries honest here: this is Tier B evidence, my own groundings and my own judgment about which artifacts are disjoint, run over a single corpus, and the eight flags are plausible coarse mappings worth a human's review, not confirmed-wrong-by-the-author errors. I'd rather you take them as the class of mistake the shape check misses, made visible, than as a defect count.

What to do with this

Ask the second question by hand.

The reframe is small and I think it holds: a green schema check tells you the shape is right and tells you nothing about whether the meaning is, and the gap between those is where a detection goes quietly dark. You don't need my gate to start using that distinction. The next time a mapping validates, you can ask the second question by hand, is the thing on the left actually the same kind of thing as the field on the right, an actor or a target, a host or the application running on it, a process or the file it ran from, and a surprising amount of the time the answer is no and the shape check never told you.

If you want to check it mechanically, the pieces are all open and you can use them today, and I've packaged the working version at the gate directory in security-data-that-works, pointed at the ocsf-mapping-fidelity corpus. OCSF gives you the shared shape. D3FEND gives you the formal map of what the artifacts are. Sigma and the pySigma OCSF pipeline are where your detection content meets that shape. ROBOT and ELK are the reasoner toolchain that does the working-out, and they run in seconds on a laptop, not a cluster. The honest gaps I hit, D3FEND shipping thin on disjointness among them, are tracked as open issues in those projects, so when you find a coarse mapping like the eight above the move is to take the fix upstream, file the missing disjointness assertion against D3FEND or the mapping correction against the relevant crosswalk, rather than patch it privately and let the next person hit the same crossing. The thing that makes this fixable for everyone is that the map is shared, so a correction you contribute once is a meaning-crossing nobody else has to discover after the breach.