Security Data Works

Component reference

Apache Spark — security data architecture

The batch-processing engine underneath dbt-on-lakehouse, OCSF normalization at TB/day, and ACID writes to Iceberg. Driver–executor parallelism, the Catalyst optimizer for predicate pushdown and column pruning, and Tungsten for off-heap memory management. It is the engine you end up running, whether you planned to or not, when dbt targets Iceberg or Delta.

$60/mo

Ephemeral cluster cost for typical batch dbt workloads: 10 workers (r5.4xlarge) spun up for ~30 min/day. The same hardware kept always-on for streaming costs ~$2,880/month. The cost gap is the batch-vs-streaming decision in one number.
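
The two figures are consistent with a blended rate of roughly $0.40 per instance-hour. That rate is an inference from the numbers here, not a quoted price; actual r5.4xlarge pricing varies by region and purchase option. A minimal sketch of the arithmetic, assuming a 30-day month:

```python
# Worker count and runtime come from the text above.
# ASSUMPTION: the hourly rate is back-derived from the $2,880/month figure;
# it is not a quoted AWS price.
WORKERS = 10
DAYS_PER_MONTH = 30
ALWAYS_ON_MONTHLY_USD = 2880.0

always_on_hours = WORKERS * 24 * DAYS_PER_MONTH        # instance-hours, always-on
implied_rate = ALWAYS_ON_MONTHLY_USD / always_on_hours  # USD per instance-hour

batch_hours = WORKERS * 0.5 * DAYS_PER_MONTH            # instance-hours at ~30 min/day
batch_monthly_usd = batch_hours * implied_rate

print(f"implied rate: ${implied_rate:.2f} per instance-hour")
print(f"ephemeral batch: ${batch_monthly_usd:.0f}/month")
print(f"always-on costs {ALWAYS_ON_MONTHLY_USD / batch_monthly_usd:.0f}x the batch run")
```

Swapping in your own hourly rate changes both totals but not the ratio, which depends only on hours of uptime (24 h versus 0.5 h per day).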

The pipeline

  1. Sources

    Raw logs in S3

    CloudTrail · VPC Flow · EDR · syslog

  2. Compile

    dbt models

    SQL → Catalyst logical plan

  3. Execute

    Spark cluster

    Driver + executors; ephemeral or always-on

  4. Write

    Iceberg ACID commit

    Atomic snapshot; readers see all or nothing

  5. Serve

    ClickHouse / Trino / Dremio

    Downstream query engines read the same tables
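
Step 4's "readers see all or nothing" can be illustrated with a toy model of snapshot commits. This is a sketch of the general idea (stage files, then publish them with a single pointer swap), not the Iceberg API; `ToyTable` and `run_job` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy model of snapshot-based commits; NOT the Iceberg API."""
    snapshots: dict = field(default_factory=lambda: {0: ()})  # snapshot id -> data files
    current: int = 0  # the pointer that readers follow

    def read(self):
        # Readers only ever see files listed in the current snapshot.
        return self.snapshots[self.current]

    def commit(self, new_files, base):
        # Optimistic check: refuse if another writer committed since `base`.
        if base != self.current:
            raise RuntimeError("stale base snapshot; retry the commit")
        new_id = self.current + 1
        self.snapshots[new_id] = self.snapshots[self.current] + tuple(new_files)
        self.current = new_id  # one pointer swap publishes every file at once


def run_job(table, files, fail_midway=False):
    """Hypothetical batch job: stage files, then commit them together."""
    base = table.current
    staged = []
    for i, name in enumerate(files):
        if fail_midway and i == 1:
            raise IOError("executor lost")  # staged files are never committed
        staged.append(name)
    table.commit(staged, base)


table = ToyTable()
run_job(table, ["a.parquet", "b.parquet"])
try:
    run_job(table, ["c.parquet", "d.parquet"], fail_midway=True)
except IOError:
    pass  # the failed job left the table untouched

print(table.read())
```

The failed second job staged one file but never reached `commit`, so the reader's view still contains only the first job's files.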

What composes, what’s brittle

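  • Catalyst optimizer. Predicate pushdown and column pruning are applied automatically to Spark SQL; cost-based join reordering additionally needs spark.sql.cbo.enabled and spark.sql.cbo.joinReorder.enabled (both off by default) plus table statistics.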
  • Why Spark for Iceberg writes. ACID snapshot commit; failed jobs leave no partial data visible to readers.
  • Schema evolution. ALTER TABLE ADD COLUMN is metadata-only; old files return NULL for new fields.
  • Shuffle is the killer. Often 50–80% of wall-clock time on large joins and aggregations; partition the data sensibly and broadcast small tables.
  • Spark vs ClickHouse. Spark writes Iceberg (batch, minutes); ClickHouse queries it (interactive, ms).
  • What's brittle. Real-Time Mode (GA March 2026) is Databricks-first and not independently benchmarked — batch positioning still holds.
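
The "broadcast small tables" advice in the shuffle bullet can be sketched in plain Python. This is a toy model of a broadcast hash join, not Spark's implementation; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each partition of the large side against a copy of the small side."""
    # Build the hash table once; on a cluster this copy would be shipped to every executor.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined = []
    for partition in large_partitions:  # each partition is probed in place: no shuffle
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
    return joined


# Hypothetical data: flow events split across two partitions, plus a small asset table.
events = [
    [{"ip": "10.0.0.1", "bytes": 120}, {"ip": "10.0.0.2", "bytes": 80}],
    [{"ip": "10.0.0.1", "bytes": 300}, {"ip": "10.0.0.9", "bytes": 5}],
]
assets = [
    {"ip": "10.0.0.1", "owner": "payments"},
    {"ip": "10.0.0.2", "owner": "web"},
]

result = broadcast_hash_join(events, assets, "ip")
for row in result:
    print(row)
```

The large side never moves between partitions; only the small table is copied. In PySpark the corresponding hint is `pyspark.sql.functions.broadcast(df)`, and tables below `spark.sql.autoBroadcastJoinThreshold` are broadcast automatically when size estimates are available.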

Sources: Apache Spark documentation · Apache Iceberg "Reliability" docs · Databricks engineering blog (Tungsten / whole-stage codegen) · Matei Zaharia confirmations on Real-Time Mode (vendor-architect, unvalidated)

See how the pattern lands on your workload.

The matrix scoring that justified each reference architecture's tool choices is the paid deliverable. The benchmark behind it is public — reproduce it on your own workload, then book a call to scope the work.