Security Data Works

Project 1 · Foundational standards

Four open contracts make the lakehouse work.

The lakehouse rests on four open standards, and each one is a different kind of contract. Apache Iceberg is a contract between the producers and consumers of a table: producers can append data, change the schema, or evolve the partitioning, all without coordinating with the readers. Apache Arrow is a contract between processes that share the same columnar bytes without copying. OCSF is a contract between the security tools that emit events and the ones that consume them. Sigma is a contract between a written analytic and the backend that runs it. Each contract abstracts over one specific thing so the layer below it can change without breaking the layer above. That is the whole point of the architecture.

The shape of the argument

A contract is not a file format. A file format is the implementation.

Most explanations of Iceberg start with "it's a table format that stores Parquet files in object storage." That is true, and it is the wrong place to begin. Parquet is the implementation. The contract is the metadata above it: the snapshots, manifests, schema and partition spec that producers commit to and consumers trust. The reason an engine swap doesn't require re-ingesting is that the engine reads the contract, not the files directly.

I find this framing more useful than "open table format" because it sets up what the same argument looks like one layer up (OCSF: a schema contract between tools) and one layer down (Arrow: a memory contract between processes). Once you see the four standards as four contracts at four different scopes, the architecture stops feeling like an alphabet of acronyms and starts feeling like a layered design with one repeated idea.

Framing credit: I owe the "Iceberg as a contract" articulation to Streambased and Roman Kolesnev (2024), who use this language to explain why streaming-ingestion vendors can target Iceberg as a producer interface without coordinating with the analytics engines that read from it.

Iceberg · the table-format contract

Producers write through the contract. Consumers read through the contract. Files are abstracted from both sides.

The diagram below is the worked example for this whole series. Read it left to right: producers on one side perform four kinds of operation; consumers on the other side perform four different kinds of operation; neither side touches the files in object storage directly. The metadata in the middle (snapshot, manifest list, manifest, schema-and-partition spec) is the only interface either side commits to.

[Diagram. Producers (append rows, schema evolution, partition evolution, overwrite snapshot) write through the contract. Contract metadata: snapshot (current pointer, ID, timestamp); manifest list (snapshot → manifests); manifest (file paths, stats, bounds); schema + partition spec (field IDs, evolution rules). Consumers (snapshot read, time travel, predicate pushdown, schema enforcement) read through the contract. Implementation, which the contract abstracts over: Parquet files part-001 through part-007 in object storage (S3, MinIO, GCS).]
Figure 1. Iceberg as a contract. The metadata in the centre is what producers and consumers commit to; the Parquet files below are implementation, abstracted from both sides. Framing: Streambased · Roman Kolesnev (2024).

What each layer of the contract is doing.

Snapshot

A snapshot is the table's state at one point in time: an identifier, a timestamp, and a pointer to a single manifest list. The current snapshot is whichever one the table's pointer names. Time travel works because a snapshot is immutable. A consumer can ask for "snapshot 4732883" three weeks from now and get exactly the same data back, even if the table has been overwritten in the meantime, provided that snapshot has not been expired.
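The snapshot behaviour described above can be sketched in a few lines. This is a minimal in-memory model, not Iceberg's API: the `Table` and `Snapshot` classes, the `commit`, `read` and `expire` method names, and the use of a tuple of file names as a stand-in for the manifest-list pointer are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable table state: an ID, a timestamp, and one manifest-list pointer."""
    snapshot_id: int
    timestamp_ms: int
    # Hypothetical stand-in: a real snapshot points at a manifest-list file.
    manifest_list: Tuple[str, ...]


@dataclass
class Table:
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    current_id: Optional[int] = None

    def commit(self, snapshot: Snapshot) -> None:
        # A commit adds a new snapshot and moves the current pointer.
        # Earlier snapshots are left untouched.
        self.snapshots[snapshot.snapshot_id] = snapshot
        self.current_id = snapshot.snapshot_id

    def read(self, snapshot_id: Optional[int] = None) -> Tuple[str, ...]:
        # A default read follows the current pointer; time travel names an ID.
        wanted = self.current_id if snapshot_id is None else snapshot_id
        if wanted not in self.snapshots:
            raise KeyError(f"snapshot {wanted} not found (never committed or expired)")
        return self.snapshots[wanted].manifest_list

    def expire(self, snapshot_id: int) -> None:
        # Maintenance: an expired snapshot can no longer be read by ID.
        if snapshot_id == self.current_id:
            raise ValueError("cannot expire the current snapshot")
        del self.snapshots[snapshot_id]


table = Table()
table.commit(Snapshot(4732883, 1_700_000_000_000, ("part-001", "part-002")))
table.commit(Snapshot(4732884, 1_700_000_600_000, ("part-003",)))  # an overwrite

current_files = table.read()       # follows the current pointer
old_files = table.read(4732883)    # time travel to the earlier snapshot
```

The overwrite only moves the pointer, so the earlier snapshot still resolves to its original files until something calls `expire` on it.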

Manifest list

A manifest list collects all the manifest files that participate in a single snapshot, with partition-range summaries for each. This is the layer a query engine consults to skip whole partitions before it even looks at file-level stats.

Manifest

A manifest lists individual data files with per-file metadata: row counts, column-level upper and lower bounds, null counts. Predicate pushdown ("skip files where src_ip is outside 10.0.0.0/8") happens here. The query engine never opens a Parquet file whose manifest says nothing matches.
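The two pruning levels (partition summaries in the manifest list, then per-file bounds in the manifest) can be sketched as below. This is an illustrative model only: the class names, the `plan_scan` function, the day-string partition summaries, and the choice to store `src_ip` bounds as integers are assumptions, not how any particular engine or the Iceberg library represents them.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Tuple


def ip(addr: str) -> int:
    """Convert a dotted IPv4 address to an integer so bounds compare numerically."""
    return int(ipaddress.ip_address(addr))


@dataclass(frozen=True)
class DataFile:
    path: str
    row_count: int
    src_ip_min: int  # column lower bound recorded in the manifest
    src_ip_max: int  # column upper bound recorded in the manifest


@dataclass(frozen=True)
class Manifest:
    day_min: str  # partition-range summary kept in the manifest list (ISO dates)
    day_max: str
    files: Tuple[DataFile, ...]


def plan_scan(manifests: List[Manifest], day: str,
              net: ipaddress.IPv4Network) -> List[str]:
    lo, hi = int(net.network_address), int(net.broadcast_address)
    selected = []
    for manifest in manifests:
        # Level 1 (manifest list): skip manifests whose partition range excludes the day.
        if not (manifest.day_min <= day <= manifest.day_max):
            continue
        for data_file in manifest.files:
            # Level 2 (manifest): skip files whose bounds cannot overlap the network.
            if data_file.src_ip_max < lo or data_file.src_ip_min > hi:
                continue
            selected.append(data_file.path)
    return selected


manifests = [
    Manifest("2025-01-01", "2025-01-31", (
        DataFile("part-001", 100, ip("10.0.0.5"), ip("10.2.3.4")),
        DataFile("part-002", 80, ip("172.16.0.1"), ip("192.168.1.9")),
    )),
    Manifest("2025-02-01", "2025-02-28", (
        DataFile("part-003", 50, ip("10.1.1.1"), ip("10.9.9.9")),
    )),
]

to_open = plan_scan(manifests, "2025-01-15", ipaddress.ip_network("10.0.0.0/8"))
```

Only `part-001` survives both levels here: `part-003` is dropped by the partition summary and `part-002` by its column bounds, so neither file is ever opened.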

Schema + Partition spec

Schema is field-ID-based, not name-based: rename a column and the data already written stays readable without a rewrite, because columns are bound to the ID, not the string. Partition spec can evolve too: today's data partitions by day, last year's by month, and the same query reads both because the manifest list resolves which spec applies to which file. This is where the "no downtime, no rewrites" claim is paid for.
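The ID-based binding can be sketched as follows. This is a toy model under stated assumptions: the dictionaries, the `project` helper, and the column names are hypothetical, and real readers resolve IDs through the table metadata rather than plain dicts.

```python
from typing import Dict, List

# Rows as stored in an existing data file, keyed by field ID rather than name.
data_file_rows: List[Dict[int, object]] = [
    {1: "10.0.0.5", 2: 443},
    {1: "10.0.0.9", 2: 22},
]

# Two schema versions for the same table. Field 1 is renamed; its ID is unchanged.
schema_v1: Dict[int, str] = {1: "src_ip", 2: "dst_port"}
schema_v2: Dict[int, str] = {1: "source_address", 2: "dst_port"}


def project(rows: List[Dict[int, object]], schema: Dict[int, str],
            column: str) -> List[object]:
    """Resolve a column name to its field ID via the schema, then read by ID."""
    name_to_id = {name: field_id for field_id, name in schema.items()}
    field_id = name_to_id[column]
    return [row[field_id] for row in rows]


before_rename = project(data_file_rows, schema_v1, "src_ip")
after_rename = project(data_file_rows, schema_v2, "source_address")
```

Both calls return the same values from the same unmodified rows, because the rename changed only the name-to-ID mapping in the schema and never touched the stored data.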

Where the contract has limits.

The contract is real, but it isn't free. Three honest limits to keep in view:

  • Catalog choice still binds you. The contract describes what the metadata says, not who hosts it. A Glue-cataloged Iceberg table is still tied to AWS; a Unity-cataloged one to Databricks. The Databricks/Iceberg analysis walks through where the lock-in surface actually lives.
  • Maintenance is real work. Manifests grow, small files accumulate, snapshots need expiring. Engines that promise "auto-optimize" do this maintenance on your behalf, and the bill for it is real even if you don't run the job yourself.
  • V3 changed several of these tiers. Variant types, deletion vectors, and row lineage all landed in the Iceberg 1.8–1.10 releases through 2025–2026. The diagram above is the V2 / V3 shape; the V3 thesis-shift piece covers what moved.

Sources for the diagram: the Apache Iceberg specification (Tier A, Apache Software Foundation) and Streambased's "Iceberg as a contract" framing (Tier B, practitioner). I have used the contract metaphor with the framing author's blessing; the diagram is original to this site.

The other three contracts

Arrow, OCSF, and Sigma. The same pattern at three different scopes.

The Iceberg diagram is the worked example. The other three foundational standards repeat the same shape (producer side, contract in the middle, consumer side, implementation abstracted) but at different scopes. Each gets its own visual below — the framing first, then the diagram that delivers it.

Arrow · the in-memory contract

Between processes that share columnar bytes.

The contract is the in-memory layout: a known column representation, known null bitmap, known buffer alignment. Two processes on the same machine can hand off a batch without serializing or copying. Two services across an Arrow Flight connection still move bytes over the network, but they skip the convert step at each end because the wire format matches the in-memory layout. The implementation is per-language libraries (pyarrow, arrow-cpp, arrow-rs); the contract is what makes them interchangeable.

Foundational standard · in-memory format

[Diagram: Apache Arrow zero-copy sharing versus copy-and-convert. Left, copy and convert: Engine A (Python / pandas) serializes to a buffer and Engine B (Spark / JVM) deserializes it back to memory, paying a CPU and memory tax on every handoff. Right, Arrow zero-copy: Engine A and Engine B both point at one shared Arrow columnar buffer with one standardized memory layout. Both read the same bytes, with no serialization tax, which is what makes swapping engines cheap.]
Arrow is the shared memory format that lets independent query engines hand data back and forth without serializing and copying it each time. Fuller canonical diagrams come from Voltron Data and from Wes McKinney's PyData talks; this one is condensed to the single point that matters for engine portability.
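The difference between the two handoffs can be shown with the Python standard library alone. This sketch does not use Arrow: `array` plus `memoryview` stands in for a shared columnar buffer, and `pickle` stands in for a serialize-and-deserialize handoff. The mutation in the middle is there only to make the sharing visible; it is not how Arrow buffers are normally used.

```python
import pickle
from array import array

# One contiguous, typed column buffer (a stand-in for an Arrow column).
column = array("q", [443, 22, 8080, 53])  # signed 64-bit integers

# Copy and convert: serialize, then deserialize into a separate object.
copied = pickle.loads(pickle.dumps(column))

# Zero-copy: a memoryview exposes the same bytes without duplicating them.
shared = memoryview(column)

# Change the underlying buffer to show which consumer sees the same memory.
column[0] = 80

copy_sees = copied[0]     # the copy still holds the old value
shared_sees = shared[0]   # the view reads the producer's current bytes
shared_bytes = shared.nbytes
```

The copied object is independent of the original, so every handoff costs a full serialize and deserialize. The view costs nothing to create because both sides agree on the layout of the bytes, which is the part Arrow standardizes across languages and processes.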

OCSF · the schema contract

Between security tools that emit and consume events.

The contract is a set of event classes and field definitions. A producer (Zeek, EDR, identity provider) emits events that conform; a consumer (SIEM, lake, analyst query) reads them without per-vendor field-mapping work. The implementation is whatever wire format moves the bytes (JSON today, Parquet at rest, OCSF-over-Iceberg in the lakehouse).

Foundational standard · event schema

[Diagram: OCSF as a write-time contract for security events. Many dialects (endpoint / EDR, cloud control plane, identity provider) are normalized on write into one OCSF taxonomy, for example category: Identity & Access; class: Authentication (3002); attributes: user, src_endpoint, status_id, auth_protocol. A detection is then written once. Schema-on-read SIEM: the engine translates N dialects, with every rule written per source. Schema-on-write OCSF: normalize once at ingest, and one rule reads every mapped source.]
OCSF moves the translation work to ingest time (when data first lands in storage), so a detection is written against one shape of event instead of being rewritten for every vendor's format — the same lock-break Iceberg makes at the storage layer, applied here to the schema layer.
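Normalizing at write time can be sketched as a pair of mappers feeding one detection. The two vendor event shapes and the function names below are hypothetical, and the output is a simplified subset of the Authentication class attributes named in the figure (`user`, `src_endpoint`, `status_id`), not a validated OCSF event. The `status_id` values assume the OCSF convention of 1 for success and 2 for failure.

```python
from typing import Any, Dict, List

AUTHENTICATION_CLASS_UID = 3002  # Authentication class, as shown in the figure
STATUS_SUCCESS, STATUS_FAILURE = 1, 2  # assumed OCSF status_id values

Event = Dict[str, Any]


def from_edr(event: Event) -> Event:
    """Map a hypothetical EDR login record to the normalized shape."""
    return {
        "class_uid": AUTHENTICATION_CLASS_UID,
        "user": {"name": event["account"]},
        "src_endpoint": {"ip": event["remote_addr"]},
        "status_id": STATUS_SUCCESS if event["outcome"] == "ok" else STATUS_FAILURE,
    }


def from_idp(event: Event) -> Event:
    """Map a hypothetical identity-provider record to the normalized shape."""
    return {
        "class_uid": AUTHENTICATION_CLASS_UID,
        "user": {"name": event["actor"]["login"]},
        "src_endpoint": {"ip": event["client"]["ip"]},
        "status_id": STATUS_SUCCESS if event["result"] == "SUCCESS" else STATUS_FAILURE,
    }


def failed_logins(events: List[Event]) -> List[str]:
    """One detection, written once against the normalized shape."""
    return [
        e["user"]["name"]
        for e in events
        if e["class_uid"] == AUTHENTICATION_CLASS_UID
        and e["status_id"] == STATUS_FAILURE
    ]


normalized = [
    from_edr({"account": "alice", "remote_addr": "10.0.0.5", "outcome": "denied"}),
    from_edr({"account": "carol", "remote_addr": "10.0.0.7", "outcome": "ok"}),
    from_idp({"actor": {"login": "bob"}, "client": {"ip": "10.0.0.9"},
              "result": "FAILURE"}),
]

failed = failed_logins(normalized)
```

Adding a third source means writing one more mapper; `failed_logins` does not change, which is the point of moving translation to ingest.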

Sigma · the detection contract

Between a written analytic and the backend that runs it.

The contract is a portable rule description: log source, condition, fields. A single Sigma rule compiles to Splunk SPL, Elastic KQL, ClickHouse SQL, Sentinel KQL. The consumer is whichever backend an environment runs; the producer is the analyst or vendor who wrote the rule once. Correlation is the open frontier: rule portability is mature, but multi-event correlation across backends is not.

Foundational standard · detection portability

[Diagram: one Sigma rule compiled to multiple backend query languages. A single Sigma rule (YAML), written once, passes through the Sigma converter to Splunk (SPL), Elastic (ES|QL), ClickHouse (SQL), and OpenSearch (PPL). Ports cleanly: single-event matching rules translate to all four backends. Loses fidelity: correlation and temporal-ordered logic does not port reliably, which is the gap the portability benchmark measures.]
Sigma lets you write a detection once and run it on whichever backend you land on, so the logic isn't trapped in one vendor's query language — with the honest caveat that the simple rules port and the stateful ones still don't.
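The write-once, compile-many idea can be sketched with a toy converter. This is not pySigma or sigma-cli, and the strings it produces are illustrative rather than what those tools emit. The rule content, the `events` table name, and the function names are assumptions. The converter handles only a single equality-based selection, which matches the single-event case described above.

```python
from typing import Any, Dict

# A Sigma-style rule as a Python dict (the real format is YAML).
rule: Dict[str, Any] = {
    "title": "Failed network logon",
    "logsource": {"product": "windows", "service": "security"},
    "detection": {
        "selection": {"EventID": 4625, "LogonType": 3},
        "condition": "selection",
    },
}


def _selection(rule: Dict[str, Any]) -> Dict[str, Any]:
    detection = rule["detection"]
    # Only the simplest condition is supported; anything stateful is refused.
    if detection["condition"] != "selection":
        raise NotImplementedError("toy converter handles single selections only")
    return detection["selection"]


def to_spl(rule: Dict[str, Any]) -> str:
    """Render the selection as a Splunk-style search string (illustrative)."""
    def fmt(value: Any) -> str:
        return f'"{value}"' if isinstance(value, str) else str(value)
    return " ".join(f"{k}={fmt(v)}" for k, v in _selection(rule).items())


def to_sql(rule: Dict[str, Any], table: str = "events") -> str:
    """Render the selection as a SQL query (illustrative; table name assumed)."""
    def fmt(value: Any) -> str:
        if isinstance(value, str):
            return "'" + value.replace("'", "''") + "'"
        return str(value)
    where = " AND ".join(f"{k} = {fmt(v)}" for k, v in _selection(rule).items())
    return f"SELECT * FROM {table} WHERE {where}"


spl_query = to_spl(rule)
sql_query = to_sql(rule)
```

The rule is the only thing the analyst writes; each backend is one more rendering function. The `NotImplementedError` branch marks where the portable part ends, since correlation logic has no equally simple per-backend rendering.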

Where this fits

Foundational standards are the layer the MOAR architecture assumes. Every component on the capability matrix either honors one of these contracts or doesn't. Whether a candidate tool honors the contract is one of the criteria the matrix scores on.