Security Data Works

OCSF adoption

Six schemas into OCSF: the mapping is the hard part.

What field-level mapping of Splunk CIM, Google Chronicle UDM, Microsoft Sentinel ASIM, Elastic ECS, OpenTelemetry, and Zeek into OCSF 1.8.0 reveals about the standard's real adoption cost, and its own limits.

What this is based on: six field-level crosswalks I built against OCSF 1.8.0, one per source, for Splunk CIM 6.0, Google Chronicle UDM, Microsoft Sentinel ASIM, Elastic ECS 9.4.0, OpenTelemetry, and Zeek. Every source field routes to an OCSF attribute path, and every path is validated against the public OCSF schema and, on the network classes, against real mapper fixtures (Cisco ASA→OCSF JSON). The Zeek mapping is my own open-source contribution: executable mapping logic running at the sensor, validated on real packet capture rather than a paper field table.

MOAR security-data architecture: Source data flows through Ingest, Store, and Analysis to security Tasks, built on Apache Arrow and Apache Iceberg for data, OCSF for schema normalization, and Sigma for portable detection logic, with a Matrix-scored toolset at each layer.
OCSF is the schema layer of the open stack — the six field-level mappings below are what make that layer real.

Adoption of the Open Cybersecurity Schema Framework rarely stalls at the question the debates fixate on, because teams don't spend their political capital arguing whether OCSF is a good idea; most of them already accept that a shared event schema beats a thousand private ones. The work dies one step later, at mapping — the unglamorous job of getting your sources into OCSF field by field, with the enum bridges and the lossy edges and the required attributes your logs never carried. That is the step where the standard meets the actual data, and it is where the people who define OCSF and the people who have to adopt it stop speaking the same language.

I spent the past stretch doing that job in the open, against OCSF 1.8.0, for six source schemas: Splunk's Common Information Model 6.0, Google Chronicle's Unified Data Model, Microsoft Sentinel's Advanced Security Information Model, Elastic Common Schema 9.4.0, OpenTelemetry's logs data model and semantic conventions, and Zeek. Each got a field-level crosswalk in which every source field is routed to an OCSF attribute path, every mapping carries a confidence label, and every lossy seam is named rather than smoothed over. I checked the attribute paths against the published 1.8.0 schema and, for the network classes, against the JSON a real mapper actually emits. What came out is less a set of lookup tables than a map of where normalization is clean, where it collapses, and why, and the why turns out to be more about OCSF than about any of the six sources.

The architecture spectrum

The sources arrange themselves by architecture.

If you put the six sources on a line by how they are built, the mapping difficulty falls out of the architecture before you map a single field.

At one end sits Splunk CIM, which is flat, search-time, and alias-based: a CIM "data model" is just a set of field names that any sourcetype can claim by tagging itself, so a logon and a firewall deny end up in the same flat namespace of src, dest, user, and action. CIM is the hardest source to map well, and not because Splunk did anything wrong, but because the model is twenty years of pragmatic search convenience and it happens to make the structural assumptions OCSF does not. CIM carries no severity on most of its models, so OCSF's required severity_id has to be invented for something like two-thirds of the telemetry; its single action field conflates what happened with what the policy decided; and its flat src_*/dest_* namespace has to be lifted into OCSF's nested src_endpoint and dst_endpoint objects. Mapping CIM to OCSF is mostly a sequence of judgment calls, which is exactly the thing you cannot encode once and then trust.

Microsoft's ASIM sits in the middle, and it is the most interesting of the six because it is a genuine hybrid. It keeps CIM's flat, prefixed field names (SrcIpAddr, TargetUsername, DvcAction), so the flat-to-nested lift comes back, but it normalizes everything CIM left raw, carrying a native EventSeverity, splitting operation (EventType) from verdict (DvcAction), pre-splitting vendor and product, and typing its hashes per algorithm. So ASIM pays CIM's mechanical tax while escaping CIM's judgment taxes, and it adds one move neither neighbor makes: it tags the type of every identifier — whether a user ID is a SID, an Okta ID, an AWS ID, or an Entra object ID — which is actually richer than OCSF can represent, so that provenance falls into the unmapped bag on the way in.

At the far end are two schemas that share OCSF's own DNA. Google Chronicle's UDM and Elastic's ECS are both typed, nested, and built in the same era for the same purpose as OCSF, and they are the easiest on-ramps by a wide margin. UDM hands you severity_id natively on every event, which is the gap CIM forced you to invent, and ECS goes one better on classification, because its event.kind / event.category / event.type / event.outcome quartet is the cleanest routing key of any source I mapped and it pre-declares which event types are legal for each category, so the classification both routes the event to an OCSF class and validates it at the same time. If I had to stand up a green-field OCSF pipeline tomorrow I would start from UDM or ECS, and the crosswalks say so in field-level detail.
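
The route-and-validate property of the ECS classification fields can be sketched in a few lines. This is a minimal illustration, not the crosswalk's actual logic: the function and table names are hypothetical, the class_uid values are the ones named in this article, and the allowed-type sets reflect my reading of the ECS categorization reference, so re-check them against the ECS release you target.

```python
# Hypothetical routing table: class_uid values are the ones named in this article.
ECS_CATEGORY_TO_CLASS_UID = {
    "authentication": 3002,  # Authentication
    "network": 4001,         # Network Activity
    "process": 1007,         # Process Activity
}

# Allowed event.type values per category, as I read the ECS categorization
# reference. Treat this as an assumption and verify it for your ECS version.
ECS_ALLOWED_TYPES = {
    "authentication": {"start", "end", "info"},
    "network": {"access", "allowed", "connection", "denied",
                "end", "info", "protocol", "start"},
    "process": {"access", "change", "end", "info", "start"},
}


def _as_list(value):
    """These fields may arrive as a single string or as an array."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def route_ecs_event(doc: dict) -> int:
    """Return an OCSF class_uid for an ECS document, validating event.type."""
    event = doc.get("event", {})
    categories = _as_list(event.get("category"))
    types = set(_as_list(event.get("type")))
    for category in categories:
        class_uid = ECS_CATEGORY_TO_CLASS_UID.get(category)
        if class_uid is None:
            continue  # e.g. "session" has no class in this toy table
        illegal = types - ECS_ALLOWED_TYPES[category]
        if illegal:
            raise ValueError(f"{sorted(illegal)} not legal for {category!r}")
        return class_uid
    raise LookupError(f"no routable category in {categories}")
```

The first routable category wins here, which is one possible policy for multi-valued event.category; a real mapper would have to decide what to do with the categories it skipped.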

OpenTelemetry is not really on this line at all, and saying that plainly is half the value of having mapped it. OTel is an observability transport with a semantic-convention registry alongside it, so it has no security event taxonomy and no field that says what kind of event a record is, which means mapping it to OCSF is two jobs stacked together: an envelope mapping that takes the logs data model (SeverityNumber, Timestamp, Resource) onto OCSF's class-agnostic spine, plus an attribute mapping that takes the network.*, http.*, and process.* semantic conventions (many of them inherited from ECS, after Elastic donated it to OpenTelemetry) onto OCSF objects, with the entire class-selection problem left unguided in between. OTel will tell you that a connection happened and how severe the log line was, but it will not tell you whether you are looking at a network event, an authentication, or a detection, and that unguided gap is itself one of the more interesting findings.
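
The envelope half of that job is small enough to sketch. The severity bridge below is one possible convention and an assumption of mine, not something OTel or OCSF prescribes; the function names are hypothetical, and the caller has to pass class_uid in because OTel itself gives no hint about it.

```python
def otel_severity_to_ocsf(severity_number: int) -> int:
    """Bridge an OTel SeverityNumber (1-24) to an OCSF severity_id.

    The bucket boundaries follow the OTel logs data model; the OCSF
    targets are an assumed convention, not a published mapping.
    """
    if 1 <= severity_number <= 12:    # TRACE, DEBUG, INFO
        return 1                      # Informational
    if 13 <= severity_number <= 16:   # WARN
        return 3                      # Medium
    if 17 <= severity_number <= 20:   # ERROR
        return 4                      # High
    if 21 <= severity_number <= 24:   # FATAL
        return 6                      # Fatal
    return 0                          # Unknown (unspecified or out of range)


def otel_envelope_to_ocsf(log_record: dict, class_uid: int) -> dict:
    """Map the class-agnostic spine; class selection is left to the caller."""
    return {
        "class_uid": class_uid,
        # OTel Timestamp is nanoseconds since epoch; OCSF time is milliseconds.
        "time": log_record["Timestamp"] // 1_000_000,
        "severity_id": otel_severity_to_ocsf(log_record.get("SeverityNumber", 0)),
    }
```

Everything between this envelope and a usable event, starting with which class_uid to pass, is the unguided gap described above.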

Zeek is the sixth, and the one that taught me the most, partly because I built and open-sourced its OCSF mapping myself. Zeek is connection-centric rather than event-centric: it emits one record per protocol transaction across a dozen log files — conn.log, dns.log, http.log, ssl.log, files.log — all stitched together by a shared connection uid, and it is the only source in the set backed by executable mapping logic that runs at the sensor and is validated on real packet capture rather than on a paper field table. Its base logs carry no severity and no verdict, because Zeek's job is observation rather than adjudication, and only notice.log, its alerting layer, produces anything that maps to a finding, so the crosswalk ends up being a fairly precise picture of what a network sensor's view of the world can and cannot see.

The empty cells

Reading the gaps in the matrix.

If you lay all six mappings against OCSF as the spine, with OCSF attributes down the rows and the six sources across the columns and each cell holding that source's corresponding field, the table tells you more by what is missing than by what is filled, which is why that field-centric matrix is the one artifact I would put in front of someone deciding whether OCSF is worth the cost.

The emptiness comes in two kinds that mean opposite things. Some cells are empty because a source is blind: Zeek's columns for the authentication, process, and account-change classes are empty all the way down, because a passive network sensor cannot see a logon, a process launch, or an IAM change, and that is honest blindness you would never have asked Zeek to cover in the first place. OpenTelemetry's authentication column is nearly as sparse, for the deeper reason that OTel has no authentication semantics to offer.

The other kind of emptiness is the one that matters, because some rows are empty across every source, and once a row is blank for all six it has stopped telling you anything about the sources and started telling you about OCSF.

Key visual

Network Activity (4001), six sources against one OCSF spine.

Here is one class of the field matrix. OCSF Network Activity, class_uid 4001, is the spine; each column is one source; each row is one OCSF attribute, with the source's corresponding field in the cell. A dash (—) means the source has no field for that OCSF attribute, so the column cannot reach that part of OCSF. A trailing ~ flags a low-confidence or inferred mapping. Required attributes are marked req. Read across a row to see how the six sources each reach (or fail to reach) the same field; read down a column to see how much of the class a source can populate. These network mappings are fixture-validated against Cisco ASA→OCSF JSON.

| OCSF attribute | Splunk CIM | Google UDM | Microsoft ASIM | Elastic ECS | OpenTelemetry | Zeek |
|---|---|---|---|---|---|---|
| activity_id req | action (enum) | security_result.action (enum) | DvcAction+EventSubType (enum) | event.type (enum) | — (default Traffic) | conn_state (enum)~ |
| severity_id req | — | security_result.severity (enum) | EventSeverity (enum) | event.severity~ | SeverityNumber (enum) | — (invent) |
| time req | _time | metadata.event_timestamp | EventStartTime | @timestamp | Timestamp (ns→ms) | ts (s→ms) |
| metadata req | vendor_product (split) | metadata.product_name/vendor_name | EventProduct/EventVendor | observer.* | Resource/EventName | Zeek/Corelight (construct) |
| src_endpoint or dst_endpoint req | src/dest | principal.*/target.* | Src*/Dst* | source.*/destination.* | source.*/destination.* | id.orig_*/id.resp_* |
| category_uid / class_uid req | const | const | const | const | const | const |
| src_endpoint.ip | src/src_ip | principal.ip | SrcIpAddr | source.ip/source.address | source.address / client.address~ | id.orig_h |
| src_endpoint.port | src_port | principal.port | SrcPortNumber | source.port | source.port~ | id.orig_p |
| dst_endpoint.ip | dest/dest_ip | target.ip | DstIpAddr | destination.ip/destination.address | destination.address / server.address~ | id.resp_h |
| dst_endpoint.port | dest_port | target.port | DstPortNumber | destination.port | destination.port~ | id.resp_p |
| connection_info.protocol_name | transport/protocol | network.ip_protocol (enum) | NetworkProtocol (enum) | network.transport/network.iana_number | network.transport~ | proto (+derive num) |
| connection_info.direction_id | direction (enum) | network.direction (enum) | NetworkDirection (enum)~ | network.direction (enum)~ | network.io.direction~ | local_orig/local_resp (derive)~ |
| connection_info.uid | session_id/flow_id | network.session_id | NetworkSessionId | network.community_id~ | — | uid (join key) |
| connection_info.community_uid | — | — | — | network.community_id~ | — | community_id (1:1) |
| traffic.bytes_in | bytes_in | network.received_bytes | DstBytes | destination.bytes | — (in metrics) | resp_bytes |
| traffic.bytes_out | bytes_out | network.sent_bytes | SrcBytes | source.bytes | — (in metrics) | orig_bytes |
| status_id | action~ | security_result.action~ | EventResult~ | event.outcome~ | — | — |
| duration | duration (×1000) | network.session_duration (×1000) | NetworkDuration | — | — | duration (×1000) |
| firewall_rule.uid / .name | rule~ | — | NetworkRuleName/NetworkRuleNumber | rule.name/rule.id~ | — | — |
| src_endpoint.intermediate_ips[] | src_translated_ip~ | intermediary.ip/.hostname~ | SrcNatIpAddr~ | source.nat.ip/.port~ | — | — |
| device | dvc/dvc_ip~ | observer.ip/.hostname / intermediary.*~ | Dvc~ | host.*~ | Resource host.* | — |

The unmapped tail for this class, the meaningful source fields that have no OCSF home at all and land in unmapped / enrichments:

  • CIM: src_translated_ip/src_translated_port/dest_translated_ip/dest_translated_port (NAT — no OCSF symmetric pre/post pair); *_zone (src_zone/dest_zone/dvc_zone — admin labels, not the fixed boundary_id enum).
  • UDM: network.parent_session_id (no first-class parent-session link); BROADCAST/MULTICAST direction (no OCSF direction_id slot); NAT context (intermediary cleaner on input, still no OCSF pair).
  • ASIM: SrcNatIpAddr/DstNatIpAddr/*NatPortNumber (NAT, no OCSF pair); SrcZone/DstZone/DvcZone; Local/External directions (no OCSF slot); Encrypt/Decrypt/VPNroute DvcAction (belongs in Tunnel Activity 4014).
  • ECS: source.nat.*/destination.nat.* (NAT, no OCSF pair); observer.ingress.zone/egress.zone; over-rich network.direction (7 values → OCSF's 4, external collapses); multi-valued event.category/event.type overflow (["network","session"]); network.community_id when it can't claim connection_info.uid.
  • OTel: TraceId/SpanId (trace context); client.*/server.* second coordinate system (vantage vs packet-direction); byte/packet counters (live in OTel metrics, not log/span attributes); no flow lifecycle (Open/Close/Reset have no OTel source); no verdict/zone/NAT (OTel isn't a firewall).
  • Zeek: history (per-packet flag-history string such as ShADadFf; no native OCSF attribute, lands in connection_info.history/unmapped); conn_state nuance beyond activity_id (the 13-state TCP vocabulary collapses to 7 lifecycle slots, so the orig-vs-resp-aborted distinction RSTO/RSTR survives only in status_detail); local_orig/local_resp (the local/remote pair derives direction but has no symmetric OCSF home); orig_ip_bytes/resp_ip_bytes (IP-layer counts vs OCSF's payload-only traffic.bytes); missed_bytes (capture-loss counter). The seam Zeek alone carries is the cross-log uid re-stitch: conn/dns/http/ssl rows share a uid (mapped to connection_info.uid), but OCSF holds them as separate events with no native re-assembly, so reconstructing one flow is the consumer's query-time job.
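
To make the Zeek column concrete, here is a minimal sketch of a conn.log record going into Network Activity. It is not the open-source mapping itself: the conn_state bridge below is a hypothetical convention chosen for illustration, and the function name is mine. The field placements follow the table and the unmapped tail above.

```python
# Assumed conn_state bridge (illustrative only). OCSF Network Activity
# activity_id: 1 Open, 2 Close, 3 Reset, 4 Fail, 5 Refuse, 6 Traffic.
CONN_STATE_TO_ACTIVITY_ID = {
    "S1": 1,                              # established, not yet terminated
    "SF": 2, "S2": 2, "S3": 2,            # normal or half-acknowledged close
    "RSTO": 3, "RSTR": 3,                 # aborted by originator / responder
    "S0": 4, "RSTOS0": 4, "RSTRH": 4,     # never fully established
    "SH": 4, "SHR": 4,
    "REJ": 5,                             # attempt rejected
    "OTH": 6,                             # midstream traffic
}


def zeek_conn_to_ocsf(rec: dict) -> dict:
    """Map one Zeek conn.log JSON record to an OCSF Network Activity dict."""
    state = rec.get("conn_state")
    event = {
        "category_uid": 4,
        "class_uid": 4001,
        "activity_id": CONN_STATE_TO_ACTIVITY_ID.get(state, 0),  # 0 = Unknown
        "time": round(rec["ts"] * 1000),  # Zeek seconds -> OCSF milliseconds
        "src_endpoint": {"ip": rec["id.orig_h"], "port": rec["id.orig_p"]},
        "dst_endpoint": {"ip": rec["id.resp_h"], "port": rec["id.resp_p"]},
        "connection_info": {"uid": rec["uid"], "protocol_name": rec["proto"]},
        "traffic": {"bytes_out": rec.get("orig_bytes"),
                    "bytes_in": rec.get("resp_bytes")},
    }
    if state is not None:
        # Keeps the RSTO vs RSTR distinction that activity_id collapses.
        event["status_detail"] = state
    if "history" in rec:
        event["unmapped"] = {"history": rec["history"]}
    return event
```

Note that severity_id is absent on purpose: a conn.log row carries nothing to derive it from, so any value placed there would be invented.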

This is one class of seven. The full matrix spans Authentication (3002), Network Activity (4001), HTTP Activity (4002), Process Activity (1007), File System Activity (1001), Account Change (3001), and Detection Finding (2004), each with its own column-by-column table and unmapped tail.

The five seams

The seams that keep recurring.

Five seams kept recurring as I worked through the crosswalks, mostly independent of which source I was mapping, and together they were the most useful thing the six produced, because they are not bugs in CIM or gaps in UDM so much as the shape of OCSF 1.8.0 itself.

The first is severity, and it is the cleanest illustration of the pattern. OCSF makes severity_id a required attribute on its authentication, network, HTTP, process, and detection classes, but severity is a judgment a source either makes or it doesn't: CIM doesn't, on most models, so you invent it; ECS technically has an event.severity field, but it is a source-defined integer with no controlled vocabulary, so you are inventing again with only a number to anchor to; ASIM gives you four buckets and no "critical," while UDM gives you the full range natively. In the Cisco ASA output I validated the network mappings against, a denied connection had been assigned severity_id 5, Critical, while the allowed connections came through as Informational, and that assignment was the mapper author's convention rather than anything the firewall actually reported. So the same required field arrives free from one source, coarse from another, and fabricated from a third. The severity on an OCSF event can therefore tell you as much about the mapper as about the threat, and the downstream consumer has no way to know which.
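
One way to see the three cases side by side is a bridge that returns the severity together with a label for how it was obtained. This is a sketch under assumptions: the provenance label is my own device and not an OCSF attribute, the UDM table is a subset of its severity enum, UDM's security_result is simplified to a single object, and the fallback value is an arbitrary default.

```python
# Subset of UDM severity values (assumed spellings) -> OCSF severity_id.
UDM_SEVERITY = {"INFORMATIONAL": 1, "LOW": 2, "MEDIUM": 3, "HIGH": 4, "CRITICAL": 5}

# ASIM's four buckets; there is no "critical" to map to severity_id 5.
ASIM_SEVERITY = {"Informational": 1, "Low": 2, "Medium": 3, "High": 4}


def severity_with_provenance(source: str, record: dict) -> tuple:
    """Return (severity_id, provenance); provenance is a hypothetical label."""
    if source == "udm":
        raw = record.get("security_result", {}).get("severity")
        if raw in UDM_SEVERITY:
            return UDM_SEVERITY[raw], "native"
    elif source == "asim":
        raw = record.get("EventSeverity")
        if raw in ASIM_SEVERITY:
            return ASIM_SEVERITY[raw], "coarse"
    # CIM on most models, Zeek base logs, or a missing value: the mapper
    # has to make something up. Informational is an arbitrary default.
    return 1, "invented"
```

A consumer that only sees severity_id cannot tell these three return paths apart, which is the point of the seam.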

The second seam inverts the usual direction of loss, which is why I keep coming back to it. OCSF's network, HTTP, process, and file activity classes have no disposition_id, so there is no first-class place to record whether the firewall allowed or blocked, whether the proxy permitted or denied, or whether the EDR quarantined, even though that verdict is the single field a SOC analyst filters on and these are the busiest event classes in the schema. UDM, ASIM, and ECS all separate the verdict cleanly from the operation, which hands the mapper exactly the thing OCSF cannot receive, so the cleanest-modeled sources carry their verdict all the way to the destination and then lose it there. Disposition lives only on the findings classes, which is part of why deny-heavy firewall and proxy telemetry sometimes reads better as a detection than as the network event it actually is, a modeling contortion that the missing field quietly forces on you.

The third seam is change, and it starts from the fact that there is no single OCSF "change" class. An account modification, a group-membership edit, a registry write, and a policy update scatter across Account Change, Group Management, Entity Management, Device Config State Change, and the registry classes, so every source that logs "something was modified" has to be routed into the right class by reading a sibling field, and that routing is the seam. CIM routes on a free-string object_category that varies by vendor, which is guesswork, whereas UDM, ASIM, and ECS route on closed enumerations, which is reliable, but none of them escapes the fact that OCSF deliberately decomposed what those sources kept as a single surface, and the decomposition is now the adopter's problem.

The fourth seam is mechanical but relentless, and it is the flat-to-nested lift. Flat sources like CIM and ASIM name their fields src_ip and SrcIpAddr, and OCSF wants them nested inside src_endpoint, actor, and user objects. Nested sources skip the lift entirely, because ECS's source.ip becomes src_endpoint.ip as a rename rather than a restructure. The same content can therefore cost an afternoon of mapper engineering from one source and a week from another, decided by an architectural choice the source made years before anyone thought about OCSF.
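
The lift itself is simple to sketch for a handful of CIM network fields. The path table below is a small excerpt in the spirit of the crosswalk, and the helper name is hypothetical; anything without a declared path falls into an unmapped bag.

```python
# Excerpt of a flat-to-nested path table for CIM network fields.
CIM_TO_OCSF_PATHS = {
    "src_ip": "src_endpoint.ip",
    "src_port": "src_endpoint.port",
    "dest_ip": "dst_endpoint.ip",
    "dest_port": "dst_endpoint.port",
    "bytes_in": "traffic.bytes_in",
    "bytes_out": "traffic.bytes_out",
}


def lift_flat_record(flat: dict, paths: dict) -> dict:
    """Lift a flat record into nested objects along dotted OCSF paths."""
    event = {}
    for field, value in flat.items():
        path = paths.get(field)
        if path is None:
            # No declared home: keep the value, but outside the typed schema.
            event.setdefault("unmapped", {})[field] = value
            continue
        *parents, leaf = path.split(".")
        node = event
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return event
```

For example, `lift_flat_record({"src_ip": "10.0.0.1", "src_zone": "dmz"}, CIM_TO_OCSF_PATHS)` nests the address under src_endpoint and leaves src_zone in unmapped. The code is the easy part; the week goes into deciding and validating every entry in the path table.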

The fifth seam is the subtlest, and ASIM is the one that surfaced it: identifier provenance. OCSF gives you user.uid and user.sid as separate typed attributes, but it has no general way to record which directory issued an identifier, whereas ASIM tags every ID with its type — SID, Entra object ID, Okta ID, AWS ID — and ECS makes similar distinctions, so OCSF flattens all of them back into user.uid and the provenance drops into the unmapped bag. Here the source is richer than the standard and the mapping loses information on the way up, which is the reverse of the loss you normally expect from normalization, and it is worth saying out loud because it complicates the tidy story that mapping into a richer schema is always a gain.
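
A short sketch shows where the provenance goes. TargetUserId and TargetUserIdType are ASIM field names as I recall them from its user entity, so treat them as assumptions, and the function name is hypothetical.

```python
def asim_target_user_to_ocsf(record: dict) -> dict:
    """Map ASIM target-user fields to an OCSF user object (illustrative)."""
    event = {"user": {}}
    if "TargetUsername" in record:
        event["user"]["name"] = record["TargetUsername"]
    if "TargetUserId" in record:
        # Every identifier type collapses into the same untyped slot.
        event["user"]["uid"] = record["TargetUserId"]
    if "TargetUserIdType" in record:
        # Which directory issued the ID has no typed home in this sketch,
        # so the type tag survives only in the unmapped bag.
        event.setdefault("unmapped", {})["TargetUserIdType"] = record["TargetUserIdType"]
    return event
```

Two events whose user.uid values came from different directories look identical after this step unless the consumer knows to look in unmapped.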

Zeek added a sixth that the SIEM schemas never raised, because its whole model rests on the connection uid that stitches a flow's records together across log files, and OCSF's one-event-per-record model has no first-class cross-event key. You can map uid to connection_info.uid and preserve the value, but re-stitching the flow — reassembling the DNS lookup, the TLS handshake, and the HTTP request that all belong to one connection — becomes the consumer's problem. Zeek also exposed OCSF's missing certificate class, because its x509.log describes a certificate seen on the wire and OCSF models certificates as an object that has to ride inside some other event with no class of its own, which is a genuine gap in the standard and the kind of thing worth taking upstream rather than mapping around.
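
The consumer-side re-stitch can be sketched as a group-by on the preserved uid. This assumes each mapped event kept the Zeek uid in connection_info.uid and carries a time value; the function name is hypothetical, and at scale this would be a query-engine join rather than an in-memory loop.

```python
from collections import defaultdict


def restitch_flows(events: list) -> dict:
    """Group OCSF events by connection_info.uid and order each flow by time."""
    flows = defaultdict(list)
    for event in events:
        uid = event.get("connection_info", {}).get("uid")
        if uid is not None:
            flows[uid].append(event)
    for group in flows.values():
        group.sort(key=lambda e: e["time"])
    return dict(flows)
```

Events that lost the uid on the way in, for instance because a mapper dropped it into unmapped, silently fall out of the result, which is why preserving the value matters even though OCSF does nothing with it.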

I had derived those seams by reading crosswalks rather than running them, so to test whether they were real or just a tidy story I put two sources I had not mapped before, Okta's System Log and CrowdStrike's detection stream, through a small benchmark that scores every documented field into OCSF 1.8.0 as typed, coerced, or unmapped, with each target validated against the published schema so that a mapping cannot invent a home that is not there. The same seams came back. Okta's login event lands 58% of its fields on a typed attribute and loses the other 42% to coercion or the unmapped bag, and what it loses is what the seams predict: the seven-value outcome enum narrowing to three, the risk and threat signals sitting in a free-form map OCSF has no typed home for, the proxy chain flattening to a bare list of IPs, and the anonymizing-proxy flag that OCSF's endpoint object does not carry at all. CrowdStrike's detections map better, at 70% typed, because file hashes and command lines are well modeled, but its ATT&CK tactic and technique arrive as flat strings where OCSF wants an id and a name both, and its multi-action response bitmask collapses to a single disposition.
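
The scoring idea can be sketched as follows. This is not the benchmark's code: the function name, the mapping shape, and the toy schema set are all hypothetical, and the Okta field names are used only as illustrative inputs. What it shows is the one rule that matters, that a non-unmapped target must exist in the schema.

```python
from collections import Counter

KINDS = ("typed", "coerced", "unmapped")


def score_mapping(fields: list, mapping: dict, schema_paths: set) -> dict:
    """Return the share of source fields that land typed, coerced, or unmapped.

    mapping: source field -> (target_path, kind). Fields absent from the
    mapping count as unmapped. A typed or coerced target that is not in
    schema_paths is rejected, so a mapping cannot invent a home.
    """
    counts = Counter()
    for field in fields:
        target, kind = mapping.get(field, (None, "unmapped"))
        if kind not in KINDS:
            raise ValueError(f"unknown kind {kind!r} for {field!r}")
        if kind != "unmapped" and target not in schema_paths:
            raise ValueError(f"{field!r} targets {target!r}, not in the schema")
        counts[kind] += 1
    total = len(fields)
    return {kind: counts[kind] / total for kind in KINDS}


# Toy inputs, for illustration only.
toy_schema = {"user.uid", "src_endpoint.ip", "status_id"}
toy_fields = ["actor.id", "client.ipAddress", "outcome.result", "debugContext.debugData"]
toy_mapping = {
    "actor.id": ("user.uid", "typed"),
    "client.ipAddress": ("src_endpoint.ip", "typed"),
    "outcome.result": ("status_id", "coerced"),  # enum narrowing
}
toy_scores = score_mapping(toy_fields, toy_mapping, toy_schema)
```

Running the same scorer against a shipped mapper's output instead of a paper mapping is what separates what the schema allows from what an integration actually carries.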

What the running version showed that the paper ones could not was the distance between what the schema allows and what a shipped mapper does. OCSF 1.8.0 has a typed or coercible home for 36 of Okta's 50 login fields, but Okta's own open-source reference mapper carries only 18 of them into the event, leaving the autonomous-system, ISP, and credential-type fields, all of which feed real detections, unmapped even though the schema would hold them. So the seam is in the standard and also in the gap between the standard and the integration, and that second gap, which the benchmark now measures, is the expert-gating this piece is about. The benchmark, its per-field mappings, and the twenty-three of thirty-one named detections that lose a field they depend on are in the SDW Lab ocsf-mapping-fidelity benchmark.

A second mapper made the same point from the other side, and it is the cleaner example because it is the shipped one for the source I know best (Tier B, single host, synthetic corpus; Tenzir 6.0.0, library commit 671e049, against OCSF 1.8.0). I ran Tenzir's published zeek::ocsf::map unedited over a seeded Zeek conn corpus, and on the surface it does the job: it picks the right OCSF class on every record (100%, Network Activity 4001) and lands most of the values it carries (92%). Field-level scoring stops there and would call that a pass. But the mapper does not derive activity_id from Zeek's conn_state, so the activity classification (open versus close versus reset versus fail, the very thing a consumer filters connections on) is wrong on 83% of records, and a few fields like history, service, and the connection uid fall into unmapped rather than a typed home.

The harder number is the one next to it: of the four common sources I tried (Zeek, CloudTrail, Sysmon, generic auth), the shipped library produces a usable OCSF mapping for exactly one. There is no CloudTrail-management-events operator in the shipped Amazon package, the Sysmon mapping expects raw Windows EVTX/XML and refuses the pre-parsed JSON most EDR shippers carry, and a generic auth source has no mapping at all. So "maps to OCSF" turns out to be a coverage claim and not a fidelity guarantee, and once you measure both you find the binding constraint is availability before it is per-field accuracy: a mapping that gets the class right and the activity wrong is still ahead of three sources that ship no consumable mapping to grade in the first place. These figures ride with that library commit, so re-run on a newer release before repeating them. The mapping-fidelity bench is published on the lab, and the coverage-before-fidelity finding is the kind of mapping evidence that feeds a vendor's Capability Matrix score.

Scan-in-place

The scan-in-place objection deserves a straight answer.

There is a respectable argument that all of this is wasted effort, and it comes from the scan-in-place camp, of which scanner.dev is the cleanest expression: skip normalization and the lakehouse entirely, index your raw logs in S3, and scan them where they sit, because if mapping into OCSF is this expensive and this lossy, why pay for it at all.

The honest answer is that scan-in-place wins for search and loses for most of what depends on structure. When you want to hunt through a month of raw logs for an indicator, scanning the unstructured data in place is faster and cheaper than the pipeline it replaces, and the scan-in-place vendors are right about that, but correlation across sources, detection-as-code that has to reference a stable field, and any agent or model reasoning over security events all need the data to mean the same thing from one source to the next, which is the one thing raw logs do not do. When teams default to scan-in-place the reason is almost never that they have decided structure is worthless; it is that mapping into a structured schema is too hard, and that is a cost you can actually lower, so the cost of mapping sits right at the center of the architectural debate: bring it down and the structured side of the tradeoff becomes winnable, leave it where it is and scan-in-place keeps winning by default.

Where adoption dies

The on-ramp is expert-gated, and that is the bottleneck.

Which brings me back to where adoption dies. The community that produces OCSF skews toward standards authors and research-grade contributors, while the population that has to consume it is detection engineers and SOC teams who do not have a schema maintainer's fluency and should not need one. Because there is no accessible, validated, open library of mappings to start from, vendor after vendor and most large SOCs rebuild the same source-to-OCSF mappings privately, each paying the expensive part over again and sharing none of it. The expert-gating is the real bottleneck, and it stays invisible because each team pays it alone. It is the same bottleneck I described in the field-mapping anti-pattern, seen from the standard's side rather than the project plan's.

Six crosswalks do not fix that, but they do show the shape of the fix, because the mapping work turns out to be legible. It reduces to a small, nameable set of seams, much the same handful each time and most of them OCSF's own rather than any source's. It is validatable against the published schema and against real mapper output. And it does not require anyone's product, since the durable assets are the validated mappings, an OCSF validator, and a set of fixtures, none of which a vendor needs to own. The version of this that matters is open and neutral, governed where the standard is governed, so that the canonical mapping library belongs to the community rather than to any one vendor's lead-gen. That is why I built the Zeek mapping the way I did: open source, running at the sensor, validated on real traffic, contributed rather than held. The other five crosswalks are the same bet in paper form, with the field-centric matrix as the single page that makes the whole thing legible to someone weighing whether OCSF is worth the cost.

That cost is real, and pretending otherwise is how standards lose credibility with the practitioners they need, so the honest pitch is the narrower and more durable one: the hard part of OCSF is the mapping, that mapping is a finite and nameable problem rather than an open-ended one, and the seams that remain are mostly the standard's own to close. The certificate class is missing, the disposition field is absent on exactly the classes where the telemetry is densest, and the severity contract asks for a judgment most sources never made, so naming those and mapping around them in the open is what lets the on-ramp get a little less expert-gated for the next team, which is, in the end, the kind of schema adoption that has actually tended to stick.