Component reference
Apache Spark — security data architecture
The batch-processing engine underneath dbt-on-lakehouse pipelines, OCSF normalization at TB/day scale, and ACID writes to Iceberg. It provides driver–executor parallelism, the Catalyst optimizer for predicate pushdown and column pruning, and Tungsten for off-heap memory management. It is the engine you end up running, whether you planned to or not, when you run dbt against Iceberg or Delta.
Typical batch dbt workloads run on an ephemeral cluster: 10 workers (r5.4xlarge) spun up for ~30 min/day. The same hardware left always-on for streaming costs ~$2,880/month. That cost gap is the batch-vs-streaming decision in one number.
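The gap can be made concrete with a little arithmetic. The sketch below back-derives an hourly cluster rate from the quoted always-on figure and applies it to the ~30 min/day batch run. The 30-day month and the idea that both modes pay the same hourly rate are assumptions made here, not figures from the source; it also ignores cluster startup time, the driver node, and storage.

```python
# Back-of-envelope cost comparison.
# Assumptions (not from the source): a 30-day month, and the same
# per-hour cluster rate for the ephemeral and always-on cases.
ALWAYS_ON_MONTHLY_USD = 2880.0  # always-on figure quoted above
WORKERS = 10                    # r5.4xlarge workers, as quoted above
BATCH_HOURS_PER_DAY = 0.5       # ~30 min/day, as quoted above
DAYS_PER_MONTH = 30             # assumption


def implied_cluster_rate(monthly_usd: float, days: int = DAYS_PER_MONTH) -> float:
    """Hourly cost of the whole cluster implied by an always-on monthly bill."""
    return monthly_usd / (days * 24)


def ephemeral_monthly_cost(rate_per_hour: float, hours_per_day: float,
                           days: int = DAYS_PER_MONTH) -> float:
    """Monthly cost when the cluster only exists for the daily batch window."""
    return rate_per_hour * hours_per_day * days


cluster_rate = implied_cluster_rate(ALWAYS_ON_MONTHLY_USD)
per_worker_rate = cluster_rate / WORKERS
batch_monthly = ephemeral_monthly_cost(cluster_rate, BATCH_HOURS_PER_DAY)
ratio = ALWAYS_ON_MONTHLY_USD / batch_monthly

print(f"cluster rate: ${cluster_rate:.2f}/h (${per_worker_rate:.2f}/h per worker)")
print(f"ephemeral: ${batch_monthly:.2f}/month, always-on is {ratio:.0f}x more")
```

Under these assumptions the implied rate is $4.00/hour for the cluster, the ephemeral run comes to about $60/month, and always-on costs 48 times as much. Actual instance pricing varies by region and purchase model, so treat the output as an order-of-magnitude check.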
The pipeline
- Sources. Raw logs in S3: CloudTrail · VPC Flow · EDR · syslog.
- Compile. dbt models: SQL → Catalyst logical plan.
- Execute. Spark cluster: driver + executors; ephemeral or always-on.
- Write. Iceberg ACID commit: atomic snapshot; readers see all or nothing.
- Serve. ClickHouse / Trino / Dremio: downstream query engines read the same tables.
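The stages above can be sketched as a single PySpark job. This is a minimal illustration, not how dbt itself submits work (dbt compiles SQL models rather than PySpark code). The catalog name `lake`, the bucket path, the table name, and the three-column projection are all hypothetical. It also assumes the Iceberg Spark runtime jar is on the classpath, that the raw input is one JSON event per line with CloudTrail-style field names, and that the target table already exists.

```python
from typing import Dict


def iceberg_session_conf(catalog: str, warehouse: str) -> Dict[str, str]:
    """Spark settings that register an Iceberg catalog (Hadoop-type, for simplicity)."""
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{catalog}.type": "hadoop",
        f"spark.sql.catalog.{catalog}.warehouse": warehouse,
    }


def run_batch(raw_path: str, table: str, catalog: str = "lake",
              warehouse: str = "s3a://example-bucket/warehouse") -> None:
    """Read raw JSON events, project a few columns, append to an Iceberg table."""
    # Imported lazily so the config helper above works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("security-batch-sketch")
    for key, value in iceberg_session_conf(catalog, warehouse).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    try:
        raw = spark.read.json(raw_path)  # assumes one JSON event per line
        # Hypothetical projection; a real OCSF mapping is far larger.
        events = raw.selectExpr(
            "eventTime AS time",
            "eventName AS activity_name",
            "sourceIPAddress AS src_ip",
        )
        # append() commits one new snapshot; readers see it only once it lands.
        events.writeTo(f"{catalog}.{table}").append()
    finally:
        spark.stop()
```

A call such as `run_batch("s3a://example-bucket/raw/cloudtrail/", "security.cloudtrail")` would run the whole job on whatever cluster the session attaches to, ephemeral or always-on.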
What composes, what’s brittle
- Catalyst optimizer. Predicate pushdown, column pruning, and join reordering, all applied automatically to Spark SQL queries.
- Why Spark for Iceberg writes. Each write is an ACID snapshot commit; a failed job leaves no partial data visible to readers.
- Schema evolution. ALTER TABLE ADD COLUMN is metadata-only; old files return NULL for new fields.
- Shuffle is the killer. It accounts for 50–80% of wall-clock time on large joins and aggregations; mitigate it by partitioning the data and broadcasting small tables.
- Spark vs ClickHouse. Spark writes Iceberg (batch, minutes); ClickHouse queries it (interactive, ms).
- What's brittle. Real-Time Mode (GA March 2026) is Databricks-first and not independently benchmarked, so the batch positioning still holds.
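The shuffle and Catalyst points above can be illustrated with a small PySpark sketch. It filters and projects early, so Catalyst has a predicate to push down and columns to prune, and it hints that the small side of the join should be broadcast rather than shuffled. The DataFrame and column names (`events`, `assets`, `event_date`, `asset_id`, `activity_name`) are hypothetical. The 10 MB constant mirrors Spark's documented default for `spark.sql.autoBroadcastJoinThreshold`, and the size check is a simplified stand-in for the planner's own statistics-based decision.

```python
# Spark's documented default for spark.sql.autoBroadcastJoinThreshold (10 MB).
AUTO_BROADCAST_DEFAULT_BYTES = 10 * 1024 * 1024


def fits_auto_broadcast(table_bytes: int,
                        threshold: int = AUTO_BROADCAST_DEFAULT_BYTES) -> bool:
    """Rough check: would a table of this estimated size be auto-broadcast?"""
    return 0 <= table_bytes <= threshold


def enrich_events(events, assets, since: str):
    """Join a large events DataFrame to a small asset lookup without shuffling it.

    `events` and `assets` are PySpark DataFrames; column names are hypothetical.
    """
    # Imported lazily so the size helper above works without pyspark installed.
    from pyspark.sql import functions as F

    recent = (
        events
        .where(F.col("event_date") >= since)                # candidate for pushdown
        .select("event_date", "asset_id", "activity_name")  # column pruning
    )
    # Explicit hint: ship the small table to every executor instead of
    # repartitioning the large one on the join key.
    return recent.join(F.broadcast(assets), on="asset_id", how="left")
```

Calling `.explain()` on the result is one way to check whether the plan uses a broadcast hash join and whether the filter reached the scan; the exact plan text varies by Spark version and data source.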
Sources: Apache Spark documentation · Apache Iceberg "Reliability" docs · Databricks engineering blog (Tungsten / whole-stage codegen) · Matei Zaharia confirmations on Real-Time Mode (vendor-architect, unvalidated)