Technology deep-dive
Iceberg vs Delta Lake for security data.
Choosing a lakehouse table format is the most consequential architectural decision for security data operations in 2026. It governs query-engine portability, vendor neutrality, operational complexity, and migration cost. The decision has narrowed to Iceberg or Delta — and Iceberg V3 changed the math.
Reading time: 23 minutes. Evidence tier: A (production validation from Netflix, Insider, Adobe, InMobi plus AWS and Databricks product announcements) with one Tier D update on Databricks Lakewatch.
V3 update first
What changed in 2026.
This essay was originally drafted assuming Iceberg V2 mechanics, where the "Iceberg merge-on-read is less mature than Delta" critique and the "compliance-first means choose Delta for GDPR erasure" recommendation were both defensible. Apache Iceberg V3 changes this materially through three load-bearing features:
- Puffin-based deletion vectors replace V2's position-delete and equality-delete files. Deletion vectors are the same mechanism behind Delta's merge-on-read maturity advantage (Delta stores them in its own format rather than Puffin). The "Iceberg MoR is immature" claim was a V2 statement.
- Row-level lineage (row IDs plus last-updated tracking) enables incremental processing and CDC-from-Iceberg without external metadata, weakening Delta's Change Data Feed advantage.
- Variant type (semi-structured / nested JSON, ratified October 2025 in Parquet, adopted in Iceberg V3) reframes the schema-evolution conversation for security log ingestion specifically.
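A minimal Spark SQL sketch of what these V3 mechanics look like at the table level. The table name and columns are hypothetical, and whether your Spark and Iceberg versions accept a variant column, the V3 format-version property, and the variant_get function is exactly the engine-support caveat discussed below — verify before relying on it.
-- Hypothetical V3 table for EDR telemetry: raw JSON lands in a variant column
CREATE TABLE security_events.edr_telemetry (
  event_time TIMESTAMP,
  host_id    STRING,
  raw        VARIANT              -- V3 variant type: semi-structured payload, no rigid schema
)
USING iceberg
PARTITIONED BY (days(event_time))
TBLPROPERTIES (
  'format-version'    = '3',
  'write.delete.mode' = 'merge-on-read'   -- deletes land as Puffin deletion vectors, not file rewrites
);
-- Query a nested field without having pre-modeled it in the schema
SELECT host_id, variant_get(raw, '$.process.name', 'string') AS process_name
FROM security_events.edr_telemetry
WHERE event_time > now() - interval '24 hours';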
Three caveats apply. Engine support for V3 features is rolling out across Spark, Trino, DuckDB, and Snowflake through 2026 — verify your engine version before assuming V3 mechanics in production. Delta Lake has a longer track record on these capabilities, which still matters for risk-averse compliance use cases. The vendor-neutrality and multi-engine arguments in this essay are unaffected by V3 — they're about ecosystem dynamics, not table-format mechanics, and they remain the strongest reasons to prefer Iceberg.
Specific in-line updates appear below the V2-era claims they affect. The original analysis is preserved so you can see what changed.
The decision matters
The number-one architectural decision for a security lakehouse.
Choosing between Apache Iceberg and Delta Lake determines:
- Query engine portability — can you swap Trino for Dremio without data migration?
- Vendor neutrality — are you locked to Databricks, or multi-cloud across AWS / Azure / GCP?
- Operational complexity — daily maintenance burden for 2 PB/day ingestion.
- Migration cost — switching table formats means reprocessing petabytes ($50K–500K).
This isn't a three-way shootout. For security operations in 2026, the decision has narrowed to Iceberg or Delta, with Hudi occupying a specialized CDC-heavy niche. Hudi's complexity (merge-on-read optimization, compaction tuning) creates operational overhead without proportional benefit for typical security workloads.
The evidence in this essay comes from Netflix (5 PB/day Iceberg), Insider (90% S3 cost reduction with Iceberg), Adobe (5,000+ Delta tables, petabyte scale), and InMobi (GDPR/CCPA compliance with Delta).
Architecture assumption
Dedicated security infrastructure.
This essay assumes dedicated security data infrastructure — separation-of-duties best practice. When security data lives on isolated infrastructure, separate from corporate data platforms containing PII and financial data, many architectural decisions simplify.
Why isolation matters for table-format choice
Network isolation plus IAM becomes the security boundary. A dedicated security VPC/VNet, team-only access (no cross-functional multi-tenancy), and a simplified security posture in which the network boundary and IAM provide the primary controls.
Encryption overhead becomes optional. Shared platforms must encrypt Iceberg metadata (10–20% query overhead) when PII or financial data mixes with security logs. Isolated platforms can treat metadata encryption as optional, performance-first.
RBAC complexity reduces. Shared platforms need fine-grained row-level security and column masking. Isolated platforms can use table-level permissions, avoiding the 5–30% RLS latency tax.
Compliance requirements simplify. Operational security logs (EDR telemetry, network flows, cloud audit trails) have lower compliance burden than PII. Isolation satisfies separation requirements without HIPAA/PCI-DSS-grade hardening on the security plane.
Production validation: Netflix runs dedicated observability infrastructure (5 PB/day, isolated from production systems). Huntress runs a dedicated security platform (3M endpoints, 93% cost reduction, isolated). Jake Thomas at Okta runs isolated analytics infrastructure (7.5T records, dedicated to the security team).
When isolation assumptions don't hold — a shared corporate data platform where security data mixes with finance and operations, multi-tenant security teams (an MSSP managing 50+ customer tenants), or PII / financial data in security logs (rare, but possible for fraud-detection workloads) — the compliance burden shifts and all the hardening measures come back on, which favors Unity Catalog plus Delta Lake for built-in governance.
Production evidence
Iceberg at scale.
Netflix: 5 PB/day validation (2018–2025)
Pre-2018, Netflix's petabyte-scale logging system ran on Apache Hive and struggled with millions of partitions causing slow metadata operations, schema evolution requiring full table rewrites, and no ACID guarantees during concurrent writes from 100+ pipelines.
Rather than accept the limitations, Netflix engineers (led by Ryan Blue) developed Apache Iceberg with a metadata layer for ACID transactions at petabyte scale, schema evolution without table rewrites, partition evolution, hidden partitioning, and time travel. Production validation: 5 PB/day sustained ingestion, millions of partitions without performance degradation, instant schema evolution (seconds, not weeks).
Iceberg was donated to the Apache Software Foundation, then adopted by Apple (petabyte-scale observability), AWS (native support in Glue / Athena / EMR), Snowflake (Polaris catalog), and Databricks (which acquired Tabular for $1B+). The security relevance: Netflix validated lakehouse plus specialized engines at the highest scales. Security teams should follow this pattern, not monolithic SIEM architectures.
Insider: 90% S3 cost reduction (2023)
E-commerce security operations. Baseline cost was an estimated $120K–150K/month on S3 (15–20 TB/day of event data × 90-day retention × 2–3× duplication across streaming, batch, and compliance archives ≈ 2,700–5,400 TB stored; at $0.023/GB S3 Standard, the upper end works out to roughly $125K/month).
Challenge: data duplication across streaming + batch processing pipelines. Same event data written by the streaming pipeline (Kafka → S3), the batch pipeline (daily aggregation → S3), and the compliance archive (immutable logs → S3 Glacier-IR) = 2–3× storage overhead.
Solution: migrate to Apache Iceberg lakehouse with unified table format eliminating duplication. Result: 90% reduction in Amazon S3 costs — $120K–150K/month became $12K–15K/month.
Cost reduction methodology: eliminating duplication drove ~70–80% of savings (streaming + batch converged into a single Iceberg table). Optimized query scans via Iceberg's intelligent file pruning added 5–10%. Automated lifecycle tiering (hot → warm → cold) added 10–15%. Schema evolution without rewrites avoided $50–100K in migration costs.
Relevance: security data workloads have the same duplication problem. SIEM + data lake + compliance archive = 3× storage costs. Insider's architecture eliminates this waste.
AWS S3 Tables: managed Iceberg (2024)
AWS announced Amazon S3 Tables in November 2024 — fully managed Iceberg tables with automatic maintenance, compaction, and snapshot management. Built-in Iceberg REST Catalog, queryable from Athena, EMR, Redshift, Glue, and third-party engines. AWS providing a managed service alongside Glue support validates long-term commitment. Security teams can adopt Iceberg without operating self-managed Spark compaction jobs.
Production evidence
Delta Lake at scale.
Adobe Experience Platform: 5,000+ tables (2024)
Adobe Experience Platform (customer data platform for marketing) runs 5,000+ active Delta tables (8,000+ total including historical), terabytes of data ingested daily, petabytes of data managed for customers (Unified Profile offering). Z-ORDER optimization reduced processing time from hours to minutes. ACID transactions enabled concurrent writes from 100+ pipelines. Schema evolution without downtime for customer-facing applications. Stack: Databricks + Unity Catalog + Delta Lake. Adobe validates Delta Lake at similar scale to Netflix's Iceberg — petabyte-scale, thousands of tables, high-concurrency writes.
InMobi: GDPR/CCPA compliance (2023–2024)
InMobi (mobile advertising platform, security-adjacent data privacy) faced GDPR/CCPA right-to-erasure requirements: delete user data across a petabyte data lake within 30 days. Solution: Databricks Lakehouse Platform with Delta Lake. Z-ORDER indexing optimized point deletes. Time travel allowed recovery for the 30-day compliance window before permanent deletion. Encryption at rest and in transit via FIPS 140-validated modules. Audit trails for all data access. InMobi's pattern — Delta + Unity Catalog + audit trails — validates compliance-heavy workflows for data-privacy regulations.
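The mechanics InMobi describes map to a short Databricks SQL sequence. This is a hedged sketch, not InMobi's actual implementation — the table name, deletion key, and retention window are placeholders, and deletion vectors must be enabled on the table for the delete to stay file-local.
-- Cluster data by the deletion key so point deletes touch few files
OPTIMIZE ad_events ZORDER BY (user_id);
-- Right-to-erasure request: with deletion vectors enabled, a cheap, file-local delete
DELETE FROM ad_events WHERE user_id = 'subject-42';
-- After the 30-day recovery window, purge deleted rows and old versions permanently
VACUUM ad_events RETAIN 720 HOURS;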
Anonymous SaaS: 250B messages/day (2024)
2 petabytes of actively queried data migrated from NoSQL to Delta Lake. 5,000+ Delta tables for multi-tenant architecture. 250 billion messages/day across regions. 3 trillion changes/day on Delta tables. Validates Delta Lake for extreme-scale write concurrency. Security data at scale (2 PB/day, high write concurrency, multi-tenant isolation) mirrors this workload.
Scale interpretation
Scale is not the differentiator.
What "petabyte-scale" means in security terms: 1 PB/day is roughly a 100,000–500,000-employee enterprise running comprehensive logging (EDR + CloudTrail + network + application). 5 PB/day is Netflix or Apple scale (billions of events, global operations). Security-typical: 10K employees produces 1–10 TB/day; 100K employees produces 10–100 TB/day.
Both table formats handle security data scale. If your organization generates under 100 TB/day, both Iceberg and Delta Lake are proven at 50–500× your scale. Between 100 TB/day and 1 PB/day, both are validated by production case studies. Above 1 PB/day, you're in Netflix / Apple tier and should consult their architectures directly.
The takeaway: choose based on query-engine flexibility (Iceberg) versus Spark optimization (Delta Lake), not on performance limits.
Architectural comparison
Five decision factors.
1. Query engine ecosystem
Iceberg: universal compatibility — Spark, Trino, Dremio, Flink, Athena, Snowflake, BigQuery, DuckDB, Presto. Vendor-neutral. Multi-cloud (query AWS Iceberg tables from Azure Databricks via Polaris).
Delta Lake: Spark-optimized, deepest integration with Apache Spark. Expanding ecosystem (Trino, Athena, BigQuery added 2023–2024). Databricks-first — best performance within Databricks via Unity Catalog optimization.
Decision: choose Iceberg for multi-cloud portability. Choose Delta if Databricks-committed for the deepest Unity Catalog integration.
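What "swap engines without data migration" means in practice: the same Iceberg table, registered once in a shared catalog, is read by whichever engine fits the job. The catalog and table names below are illustrative, assuming a Trino Iceberg catalog and an Athena workgroup pointed at the same Glue or REST catalog.
-- Trino (ad-hoc threat hunting), via its Iceberg connector
SELECT src_ip, count(*) AS hits
FROM iceberg.security_events.firewall_logs
WHERE event_time > now() - interval '1' day
GROUP BY src_ip ORDER BY hits DESC LIMIT 20;
-- Athena (compliance reporting), reading the very same table files -- no copy, no migration
SELECT count(*)
FROM security_events.firewall_logs
WHERE event_time > current_timestamp - interval '90' day;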
2. Metadata architecture
Iceberg: distributed metadata in Parquet/Avro manifest files. Query engines read only needed manifests (efficient at millions of partitions). Catalog-agnostic — works with Glue, Unity Catalog, Polaris, Hive Metastore.
Delta Lake: transaction log in _delta_log/ (JSON + Parquet checkpoints). Spark caching optimizes log reads. Checkpoint every 10 commits.
Benchmark (2023, 3 TB TPC-DS): Delta load time 1.68 hours, Iceberg load time 5.99 hours — Delta Lake 3.5× faster for Spark-based loads. Caveat: the benchmark is Spark-centric and does not reflect Trino, Dremio, or Athena workloads where Iceberg's distributed metadata may outperform Delta's transaction log.
Decision: Delta if Spark-dominant (ETL pipelines, batch). Iceberg if mixed engines (Athena for compliance, Trino for ad-hoc, Spark for ETL).
3. Schema and partition evolution
Iceberg: partition evolution — change strategy without rewriting data. Start with daily partitioning, switch to hourly as volume grows. Hidden partitioning — analysts query by timestamp, Iceberg applies partition filters automatically. Add, drop, rename, reorder columns without table rewrites.
Delta Lake: partition changes require rewriting the table. Analysts must specify partition columns in WHERE clauses. Columns can be added without rewrites; renames, drops, and reorders are possible on newer Delta versions with column mapping enabled, but on older tables they require recreating the table.
-- Iceberg: add hourly partition transform (does not rewrite existing data)
ALTER TABLE security_events.firewall_logs
ADD PARTITION FIELD hours(event_time);
-- Remove daily partition (deprecate, does not delete data)
ALTER TABLE security_events.firewall_logs
DROP PARTITION FIELD days(event_time);
-- Queries automatically use optimal partitioning (hidden)
SELECT * FROM security_events.firewall_logs
WHERE event_time > now() - interval '24 hours';
-- Iceberg uses hourly for new data, daily for historical
Decision: Iceberg if data volume is unpredictable. Delta if partition strategy is stable (daily partitioning sufficient for 5+ years).
4. CDC and streaming integration
Iceberg: Flink integration for streaming writes. Kafka via copy-based approaches (Kafka → Iceberg via Flink/Spark). V2 merge-on-read used position-delete and equality-delete files, operationally heavy at high update rates. V3 introduces Puffin-based deletion vectors that materially close the MoR maturity gap with Delta. V3 row-level lineage (row IDs + last-updated tracking) enables incremental processing and CDC-from-Iceberg without external metadata, weakening the historical "Delta has Change Data Feed, Iceberg does not" advantage.
Delta Lake: Change Data Feed for native CDC support. Iceberg V3 row lineage is the equivalent capability, though Delta's CDF has more production mileage. Mature MERGE INTO for upserts and SCD Type 2. Spark Structured Streaming optimized for Delta.
Decision: Delta for CDC-heavy workloads with a need for production track record. Iceberg for write-once workloads (immutable security logs, audit trails).
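For the CDC-style workloads at issue, the core operation is the same on both formats: a MERGE upsert from a change feed into an asset-inventory table. A Spark SQL sketch with hypothetical table names — the SET * / INSERT * shorthand expands to all columns, and the statement runs on Delta and (with Spark's Iceberg extensions) on Iceberg, though merge-on-read behavior and performance differ.
-- Upsert the latest asset state from a CDC staging table
MERGE INTO security.asset_inventory AS t
USING security.asset_changes_staging AS s
  ON t.asset_id = s.asset_id
WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;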
5. Multi-format catalogs (Unity Catalog)
Unity Catalog (Databricks, open-sourced June 2024) supports Delta Lake, Iceberg, and Hudi in a single catalog. Managed Iceberg tables via Unity Catalog's Iceberg REST Catalog API (Public Preview, Databricks Runtime 16.4+ LTS). Foreign catalog access — query Iceberg tables from AWS Glue, Hive Metastores, Snowflake via Unity Catalog. Delta Sharing for Iceberg in Private Preview.
Significance: Unity Catalog eliminates the "Iceberg OR Delta" binary. Security teams can use Delta for CDC-heavy tables (user inventory, asset tracking) and Iceberg for immutable logs (CloudTrail, firewall, EDR) under a single governance layer, so adopting Databricks no longer forces a table-format commitment.
Why Iceberg
Multi-engine, vendor-neutral, analyst-accessible.
The reference architecture I work with specifies Apache Iceberg as the lakehouse table format for four reasons.
1. Multi-engine design philosophy
Dual-engine architecture pairs StarRocks (ad-hoc queries, real-time threat hunting) with ClickHouse (scheduled queries, dashboards, compliance reporting). Both engines read Iceberg natively without vendor-specific connectors. A future engine swap — StarRocks to Trino, ClickHouse to Druid — requires zero data migration. Delta reads outside Spark depend on a smaller, less mature connector ecosystem, which reduces multi-engine flexibility.
2. Vendor neutrality
Exit strategy preservation matters. If Polaris underperforms, swap to AWS Glue (both support Iceberg). If Dremio pricing escalates, swap to Trino or Athena (all read Iceberg). No vendor lock-in at the data layer. 15+ query engines support Iceberg natively. Delta optimizes within Databricks, which trades portability for in-ecosystem performance.
3. Hidden partitioning for analyst accessibility
Security analysts shouldn't need data-engineering knowledge to write efficient queries.
-- Iceberg hidden partitioning
SELECT * FROM security_events WHERE event_time > now() - interval '7 days';
-- Iceberg automatically filters to optimal partitions
-- Delta Lake manual filtering
SELECT * FROM security_events
WHERE partition_date >= current_date() - interval '7' days
AND event_time > now() - interval '7 days';
Operational benefit: reduces SOC analyst training burden, prevents accidental full table scans.
4. Partition evolution for unpredictable growth
Security data volume is unpredictable. Year 1: 200 GB/day (daily partitioning optimal). Year 2: 1.5 TB/day (hourly partitioning needed). Year 3: 8 TB/day (hourly plus source partitioning). Iceberg partition evolution changes the strategy without rewriting petabytes. Delta requires a full rewrite — $50–200K cost for petabyte-scale tables.
When Delta wins
Three scenarios where Delta is the right call.
1. Databricks-first architecture
Committed to Databricks for 5+ years, Unity Catalog governance, MLflow for threat-detection models. Delta provides the deepest Unity Catalog integration (row-level security, column masking native), Spark performance optimizations (1.7–3.5× faster than Iceberg for Spark workloads), and Delta Sharing for secure external sharing. Trade-off: accept ecosystem lock-in for best-in-class performance within that ecosystem.
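The "row-level security and column masking native" point looks like the following in Databricks SQL. This is a hedged sketch with hypothetical function, schema, table, and group names — check the DDL against your Unity Catalog release before adopting it.
-- Row filter: analysts only see rows for tenants whose group they belong to
CREATE FUNCTION sec_gov.filters.tenant_filter(tenant STRING)
  RETURNS BOOLEAN
  RETURN is_account_group_member(tenant);
ALTER TABLE sec.assets.user_inventory
  SET ROW FILTER sec_gov.filters.tenant_filter ON (tenant);
-- Column mask: only the sec_admins group sees raw email addresses
CREATE FUNCTION sec_gov.filters.mask_email(email STRING)
  RETURNS STRING
  RETURN CASE WHEN is_account_group_member('sec_admins') THEN email ELSE '***' END;
ALTER TABLE sec.assets.user_inventory
  ALTER COLUMN email SET MASK sec_gov.filters.mask_email;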
2. CDC-heavy workloads (with caveat)
User behavior analytics, asset inventory tracking, CMDB synchronization. Delta has had deletion vectors in production longer, MERGE INTO has more production mileage, and Change Data Feed is production-validated. Iceberg V3 brings deletion vectors and row-level lineage to parity in the spec, but adoption lags the spec — engine support for V3 deletion vectors is rolling out across Spark, Trino, DuckDB, and Snowflake through 2026.
The "Iceberg MoR is immature" argument was a V2 statement. For V3 deployments, the gap is materially smaller. Verify your query engine version before assuming V3 mechanics in production.
3. Compliance-first security operations (also with caveat)
GDPR right-to-erasure, CCPA deletion, PCI-DSS masking. Delta's Z-ORDER indexing optimizes point deletes. Its deletion vectors make erasure cheap. Time travel plus vacuum handles compliance retention windows. Databricks holds FedRAMP, HITRUST, HIPAA, SOC 2 Type II certifications.
Iceberg V3 closes most of this gap. V3's Puffin-based deletion vectors give Iceberg an equivalent mechanism, making right-to-erasure a small deletion-vector write rather than a full file rewrite. The "compliance-first means choose Delta" guidance was a V2 recommendation. For V3 deployments where engine support has caught up, this is no longer a reason to prefer Delta — it's table stakes for both. InMobi validated Delta at advertising-platform scale; equivalent Iceberg V3 production references are still emerging (early 2026), so Delta's longer track record genuinely matters for a risk-averse compliance posture.
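As a sketch of the V3 erasure path — assuming a table already on format version 3 with merge-on-read deletes, and hypothetical catalog, table, user-ID, and timestamp values:
-- With 'write.delete.mode' = 'merge-on-read' on a V3 table, this writes a deletion vector,
-- not a rewrite of every affected data file
DELETE FROM security_events.web_logs WHERE user_id = 'subject-42';
-- Erasure is only physical once snapshots that still reference the old rows expire
CALL catalog.system.expire_snapshots(
  table      => 'security_events.web_logs',
  older_than => TIMESTAMP '2026-02-01 00:00:00'   -- end of the 30-day recovery window
);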
Decision framework
How to choose.
Choose Apache Iceberg if
- Multi-cloud strategy (AWS + Azure + GCP, querying across clouds).
- Query engine flexibility (Trino ↔ Dremio ↔ Athena swaps without data migration).
- Vendor neutrality is critical (preserving exit strategy).
- Partition evolution is likely (unpredictable data volume growth).
- Analyst accessibility matters (hidden partitioning reduces training burden).
- Write-once workloads dominate (immutable security logs, audit trails, compliance archives).
Production validation: Netflix (5 PB/day), Insider (90% cost reduction), AWS S3 Tables (managed service).
Choose Delta Lake if
- Databricks-committed (5+ year Unity Catalog roadmap, MLflow threat detection, existing Spark investment).
- CDC-heavy workloads with a mature production-track-record requirement.
- Spark-dominant architecture (ETL, batch, Spark Structured Streaming).
- Compliance-first operations with risk-averse posture (longer production deployment behind erasure features).
- Best-in-class Spark performance matters (1.7–3.5× faster for Spark workloads).
V3-era note on this list: items 2 and 4 used to be unambiguous "choose Delta" categories. With Iceberg V3, both become "choose Delta only if you specifically need the longer production track record" — the underlying capability gap has closed. Items 1, 3, and 5 (Databricks ecosystem commitment, Spark-dominant architecture, Spark-specific performance) are still real reasons that V3 doesn't change.
Or use both via Unity Catalog
Delta Lake for CDC tables (user inventory, asset tracking). Iceberg for immutable logs (CloudTrail, firewall, EDR). Single Unity Catalog governance layer. The multi-format strategy eliminates the binary choice and enables format-specific optimization without governance fragmentation.
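A hedged sketch of the multi-format pattern. Unity Catalog managed Iceberg DDL is still in Public Preview, so treat the USING iceberg form — and all catalog, schema, table, and group names here — as assumptions to verify against your runtime.
-- CDC-heavy table: Delta, for mature MERGE and Change Data Feed
CREATE TABLE sec.assets.user_inventory (
  user_id STRING, department STRING, last_seen TIMESTAMP
) USING delta
TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');
-- Immutable log table: Iceberg, for multi-engine reads (Trino, Athena, DuckDB)
CREATE TABLE sec.logs.cloudtrail (
  event_time TIMESTAMP, event_name STRING, raw STRING
) USING iceberg
PARTITIONED BY (days(event_time));
-- One governance layer: the same GRANT model applies to both
GRANT SELECT ON SCHEMA sec.logs TO `soc_analysts`;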
2026 update
Databricks Lakewatch changes the Delta calculus.
In late March 2026, Databricks launched Lakewatch — an open, agentic SIEM built on Delta Lake and Unity Catalog, with AI agents powered by Anthropic Claude. Partners include Cribl, Palo Alto Networks, Okta, Wiz, and Zscaler. This changes the competitive landscape for the Iceberg vs Delta decision in one specific way: choosing Delta Lake now comes with a security-specific product ecosystem. If your organization is Databricks-committed and evaluating SIEM alternatives, Lakewatch means your lakehouse table format and your security detection platform share the same foundation — Delta tables, Unity Catalog governance, Spark compute. That's a genuine integration advantage.
The flip side: this deepens the Databricks lock-in case against Delta. When your table format, governance catalog, compute engine, and security detection platform all come from one vendor, the exit costs compound. Databricks markets Lakewatch as "zero vendor lock-in via open formats" — but Delta Lake is Databricks-controlled, and an "open agentic SIEM" built on a single vendor's stack is a different kind of open than Apache Iceberg queried by 15+ independent engines. This is Tier D evidence (vendor launch, no production validation), so treat it as a factor in your framework, not a resolved answer.
Migration path
Start with Iceberg, evaluate Delta later.
Phase 1: pilot (months 1–3)
Deploy Polaris Catalog plus Iceberg tables for one or two data sources (firewall logs, CloudTrail). Validate query performance with StarRocks or ClickHouse. Measure operational complexity (compaction, snapshot expiration).
Phase 2: production expansion (months 4–9)
Expand to 10–20 data sources. Automate maintenance via Airflow or dbt plus Spark compaction jobs. Implement lifecycle policies (hot → warm → cold S3 tiering).
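Automated maintenance at this phase is mostly the standard Iceberg Spark procedures, scheduled from Airflow or dbt. A sketch with a placeholder catalog name and illustrative arguments — the procedures exist in Iceberg's Spark integration, but tune the thresholds to your volumes.
-- Compact small files written by the streaming pipeline into ~512 MB targets
CALL catalog.system.rewrite_data_files(
  table   => 'security_events.firewall_logs',
  options => map('target-file-size-bytes', '536870912')
);
-- Keep a bounded snapshot history (the time-travel window), drop the rest
CALL catalog.system.expire_snapshots(
  table       => 'security_events.firewall_logs',
  older_than  => TIMESTAMP '2026-01-01 00:00:00',
  retain_last => 50
);
-- Clean up files no snapshot references (failed writes, aborted compactions)
CALL catalog.system.remove_orphan_files(table => 'security_events.firewall_logs');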
Phase 3: evaluate Delta (months 10–12)
If a Databricks adoption path emerges, pilot Delta for CDC tables. If Unity Catalog is attractive, adopt the multi-format strategy. If Iceberg meets all needs, continue Iceberg-only. Starting with Iceberg doesn't prevent Delta adoption later — both coexist in Unity Catalog.
Conclusion
The table format chosen today determines tomorrow's flexibility.
Apache Iceberg and Delta Lake are both production-validated at petabyte scale for security-adjacent workloads. The choice depends on strategic priorities, not technical limits.
Iceberg strengths: universal query engine compatibility (15+ engines), vendor neutrality, partition evolution, hidden partitioning for analyst accessibility.
Delta Lake strengths: Spark performance optimization (1.7–3.5× faster), CDC maturity, Databricks ecosystem depth, compliance certifications (FedRAMP, HITRUST, HIPAA).
My recommendation: start with Apache Iceberg for vendor neutrality and multi-engine flexibility. Evaluate Delta Lake if Databricks commitment emerges or CDC-heavy workloads dominate.
Unity Catalog's multi-format support (Delta + Iceberg + Hudi) means this isn't a permanent decision. Choose based on your organization's strategic priorities, not just on technical benchmarks.