Apache Spark — security data architecture · Component references

The batch-processing engine underneath dbt-on-lakehouse, OCSF normalization at TB/day scale, and ACID writes to Iceberg. It combines driver–executor parallelism, the Catalyst optimizer for predicate pushdown and column pruning, and Tungsten for off-heap memory management. It is the engine you end up running, whether you wanted to or not, when you point dbt at Iceberg or Delta.

$60/mo

Ephemeral cluster cost for a typical batch dbt workload: 10 workers (r5.4xlarge) spun up for ~30 min/day. The same hardware kept always-on for streaming runs ~$2,880/mo. The cost gap is the batch-vs-streaming decision in one number.
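
The two figures are consistent with each other under a single hourly rate. A rough check, assuming a 30-day month and a flat rate of about $0.40 per worker-hour; that rate is back-derived from the $2,880 figure and is not a quoted AWS price:

```python
# Back-of-the-envelope check of the batch vs. always-on figures.
# ASSUMPTIONS (not taken from a price list): 30-day month, flat hourly rate.
WORKERS = 10
DAYS_PER_MONTH = 30
RATE_PER_WORKER_HOUR = 0.40  # USD, implied by $2,880 / (10 workers * 720 h)


def monthly_cost(hours_per_day: float) -> float:
    """Cluster cost in USD per month for a given daily runtime."""
    return WORKERS * RATE_PER_WORKER_HOUR * hours_per_day * DAYS_PER_MONTH


batch = monthly_cost(0.5)       # ephemeral cluster, ~30 min/day
always_on = monthly_cost(24.0)  # streaming cluster, never shut down

print(f"batch:     ${batch:,.0f}/month")
print(f"always-on: ${always_on:,.0f}/month")
print(f"ratio:     {always_on / batch:.0f}x")
```

The ratio is simply 24 h divided by 0.5 h, so it holds whatever the actual rate is; the absolute numbers move with purchase option and region.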

The pipeline

Sources. Raw logs in S3: CloudTrail · VPC Flow · EDR · syslog.
→ Compile. dbt models: SQL → Catalyst logical plan.
→ Execute. Spark cluster: driver + executors; ephemeral or always-on.
→ Write. Iceberg ACID commit: atomic snapshot; readers see all or nothing.
→ Serve. ClickHouse / Trino / Dremio: downstream query engines read the same tables.
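
The "all or nothing" property of the Write stage can be illustrated with a toy model. This is a sketch in plain Python, not the Iceberg API: `ToyTable`, `Snapshot`, and `run_job` are hypothetical names, and the real commit is an atomic swap of a metadata pointer in the catalog rather than a list append.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: Tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table whose state is a list of snapshots."""
    snapshots: List[Snapshot] = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit(self, base: Snapshot, new_files: List[str]) -> Snapshot:
        # Optimistic concurrency: refuse to commit on top of a stale snapshot.
        if base is not self.current():
            raise RuntimeError("base snapshot is stale; retry")
        snap = Snapshot(base.snapshot_id + 1, base.data_files + tuple(new_files))
        self.snapshots.append(snap)  # the only step readers can observe
        return snap


def run_job(table: ToyTable, files: List[str], fail_after: int = -1) -> None:
    """Stage files one by one, then commit them all in a single step."""
    base = table.current()
    staged: List[str] = []
    for i, name in enumerate(files):
        if i == fail_after:
            raise RuntimeError(f"job died after staging {i} file(s)")
        staged.append(name)
    table.commit(base, staged)


table = ToyTable()
run_job(table, ["a.parquet", "b.parquet"])
before = table.current()

try:
    run_job(table, ["c.parquet", "d.parquet"], fail_after=1)
except RuntimeError as err:
    print("job failed:", err)

# The failed job staged c.parquet but never committed, so readers see no change.
print(table.current() is before)
print(table.current().data_files)
```

Readers only ever resolve `current()`, so a job that dies before `commit` leaves at most orphaned staged files behind and never a half-written table.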

What composes, what’s brittle

Catalyst optimizer. Predicate pushdown, column pruning, and join reordering, applied automatically to Spark SQL queries; cost-based join reordering additionally depends on table statistics and the CBO settings being enabled.
Why Spark for Iceberg writes. ACID snapshot commit; a failed job leaves no partial data visible to readers.
Schema evolution. ALTER TABLE ADD COLUMN is metadata-only; rows in existing files read as NULL for the new column.
Shuffle is the killer. 50–80% of wall-clock time on large joins/aggregations; partition on the join or group keys and broadcast small tables.
Spark vs ClickHouse. Spark writes Iceberg (batch, minutes); ClickHouse queries it (interactive, ms).
What's brittle. Real-Time Mode (GA March 2026) is Databricks-first and not independently benchmarked — batch positioning still holds.
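
To make the shuffle point concrete, here is a toy comparison of a shuffle (hash-partitioned) join and a broadcast join, counting how many rows have to leave their original partition. It is a plain-Python sketch under simplifying assumptions: the function names and data are hypothetical, keys are integers, and "rows moved" stands in for network and disk I/O. It does not call Spark.

```python
from collections import defaultdict


def shuffle_join(left_parts, right_parts, num_parts):
    """Hash-partition both sides on the key, then join each bucket locally."""
    moved = 0
    buckets = {"left": defaultdict(list), "right": defaultdict(list)}
    for side, parts in (("left", left_parts), ("right", right_parts)):
        for src, part in enumerate(parts):
            for key, val in part:
                dest = hash(key) % num_parts
                if dest != src:
                    moved += 1  # row must travel to another partition
                buckets[side][dest].append((key, val))
    out = []
    for p in range(num_parts):
        lookup = defaultdict(list)
        for key, val in buckets["right"][p]:
            lookup[key].append(val)
        for key, val in buckets["left"][p]:
            for rval in lookup.get(key, []):
                out.append((key, val, rval))
    return out, moved


def broadcast_join(left_parts, small_rows):
    """Copy the small table to every partition; the large side never moves."""
    lookup = defaultdict(list)
    for key, val in small_rows:
        lookup[key].append(val)
    moved = len(small_rows) * len(left_parts)  # one full copy per partition
    out = [(k, v, rv) for part in left_parts for k, v in part
           for rv in lookup.get(k, [])]
    return out, moved


NUM_PARTS = 4
events = [[] for _ in range(NUM_PARTS)]
for i in range(1000):  # large side: events spread round-robin over partitions
    events[i % NUM_PARTS].append((i % 3, f"event-{i}"))
accounts = [(0, "prod"), (1, "staging"), (2, "dev")]  # small dimension table

s_out, s_moved = shuffle_join(events, [accounts, [], [], []], NUM_PARTS)
b_out, b_moved = broadcast_join(events, accounts)

print("same result:", sorted(s_out) == sorted(b_out))
print("rows moved, shuffle vs broadcast:", s_moved, "vs", b_moved)
```

Both strategies produce the same joined rows; the broadcast version moves only copies of the small table. In Spark itself the choice is governed by `spark.sql.autoBroadcastJoinThreshold` or an explicit broadcast hint.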

Sources: Apache Spark documentation · Apache Iceberg "Reliability" docs · Databricks engineering blog (Tungsten / whole-stage codegen) · Matei Zaharia confirmations on Real-Time Mode (vendor-architect, unvalidated)
