Technology deep-dive

Kafka to Iceberg: the hidden integration costs.

Streaming security telemetry from Apache Kafka into an Apache Iceberg lakehouse sounds like a wiring exercise. In practice, four hidden costs (the small-file problem, compaction throughput, exactly-once delivery semantics, and schema-evolution coordination) determine whether the lakehouse stays performant six months in or degrades into an expensive cold-storage tier.

Reading time: about 19 minutes. Evidence tier: B overall (Jack Vanlightly's independent analysis, practitioner discussions, vendor documentation cross-referenced) with one Tier A primary source (the Ursa paper at VLDB 2025) and one Tier A point from my own H-REALTIME-OCSF-01 proof-of-concept. Cost figures are directional and hedged throughout.

The mismatch

Kafka and Iceberg were designed for opposite access patterns.

Apache Kafka is optimized for streaming: append-only writes, offset-ordered reads, small messages on the order of 16 KB, short retention measured in days to weeks, throughput delivered through partition parallelism. It is, by design, not a data warehouse.

Apache Iceberg is optimized for analytics: large scans with column pruning and predicate pushdown, Parquet files in the hundreds-of-megabytes range, retention measured in months to years to decades, partitioning by business dimensions (event time, asset, severity) rather than offset order.

The integration question is what happens at the seam. Kafka's natural output unit, a 16 KB message on a partition, is roughly thirty thousand times smaller than Iceberg's natural input unit, a 512 MB Parquet file (512 MB / 16 KB = 32,768). Bridging that gap is an architectural decision that determines whether your lakehouse stays performant or drowns in small files.

The vendor pitches around this seam are loud ("zero-copy! Single source of truth! No duplication!"), and they obscure four practical costs that, in my experience and in the practitioner discussions I trust, are what determine whether the integration succeeds.

Hidden cost #1

The small-file problem.

"Small-file problem" is the term of art for what happens when a streaming ingestion pipeline lands each micro-batch as a separate Parquet file. Iceberg's metadata layer tracks every file. Query planning has to open and scan the metadata for each one. At a few hundred files per partition the overhead is invisible; at hundreds of thousands of small files per table, query planning starts to take longer than the query itself.

For security workloads the math is unforgiving. A modest Kafka deployment ingesting 50 million EDR events per day, committed to Iceberg every minute for query freshness, writes roughly 1,440 files per day per partition, assuming one file per commit. Over thirty days that's 43,200 files per partition. Threat-hunting queries that touch ninety days of history are now planning across roughly 130,000 files per partition before they read a single byte of event data.
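The file-count arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch, assuming one file per commit per partition and a single writer. The function names are illustrative and not part of any Iceberg or Kafka API.

```python
MINUTES_PER_DAY = 24 * 60


def files_written(commit_interval_min: float, days: int, files_per_commit: int = 1) -> int:
    """Files landed in one partition when every commit writes `files_per_commit` files."""
    commits_per_day = MINUTES_PER_DAY / commit_interval_min
    return round(commits_per_day * files_per_commit * days)


def events_per_file(events_per_day: int, commit_interval_min: float) -> float:
    """Average events per file for a single writer committing on a fixed interval."""
    return events_per_day * commit_interval_min / MINUTES_PER_DAY


# One-minute commits over the horizons discussed in the text.
for days in (1, 30, 90):
    print(f"{days:>2} days at 1-minute commits: {files_written(1, days):,} files")

# How little data each of those files holds at 50 million events per day.
print(f"events per file: {events_per_file(50_000_000, 1):,.0f}")

# Widening the commit interval trades freshness for fewer files.
print(f"90 days at 10-minute commits: {files_written(10, 90):,} files")
```

The commit interval is the cheapest lever here: moving from one-minute to ten-minute commits cuts the file count tenfold, at the price of query freshness.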

The fix is compaction, where a background job reads many small Parquet files and rewrites them as fewer, larger ones, then updates Iceberg's metadata to point at the new files and mark the old ones for expiration. It works, but it isn't free, and the cost is one of the things vendors don't put on the pricing page.

Compaction is covered in more depth in "Iceberg maintenance: compaction, snapshot expiry, and the operational tax." The short version: budget for it explicitly, monitor compaction lag the way you'd monitor Kafka consumer lag, and don't assume the defaults from a vendor's quickstart will hold at production volume.

Hidden cost #2

Compaction throughput is the real ceiling.

Once you accept that compaction has to run, the next question is whether it can keep up. Compaction throughput (measured in megabytes per second of small files rewritten into large ones) has to exceed the ingest write rate, on average, across the time window you care about. If it doesn't, the small-file backlog grows monotonically and query performance degrades on a schedule.
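The "has to exceed the ingest write rate, on average" condition is easy to model. The sketch below is a deliberately simple day-by-day backlog model with hypothetical rates. It ignores commit overhead, retries, and file-size targets, so treat it as a way to reason about the inequality rather than as a sizing tool.

```python
SECONDS_PER_DAY = 86_400


def backlog_by_day(ingest_mb_s: float, compact_mb_s: float, days: int,
                   compact_hours_per_day: float = 24.0) -> list[float]:
    """Uncompacted small-file backlog (MB) at the end of each day."""
    backlog_mb = 0.0
    history = []
    for _ in range(days):
        written = ingest_mb_s * SECONDS_PER_DAY
        rewritten = compact_mb_s * compact_hours_per_day * 3600
        # The backlog cannot go negative: compaction cannot rewrite data that is not there.
        backlog_mb = max(0.0, backlog_mb + written - rewritten)
        history.append(backlog_mb)
    return history


# Compaction slower than ingest: the backlog grows by the same amount every day.
print(backlog_by_day(ingest_mb_s=10, compact_mb_s=8, days=3))

# Compaction faster than ingest: the backlog stays at zero.
print(backlog_by_day(ingest_mb_s=10, compact_mb_s=12, days=3))

# A job three times faster than ingest that only runs six hours a day still falls behind.
print(backlog_by_day(ingest_mb_s=10, compact_mb_s=30, days=3, compact_hours_per_day=6))
```

The third case is the one that tends to surprise people: peak compaction speed matters less than the average over the window, which is why a nightly batch job can look generously sized and still lose ground.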

The Kafka Connect Iceberg sink (the open-source connector originally from Tabular, now widely deployed) lands data as small files on the streaming side and relies on a separate Spark or Flink job to compact them. That separation is operationally clean: the streaming side stays fast, the compaction side runs on a budget you control. The pitfall is that the compaction job is sized independently, and a quiet schema change or a sudden traffic spike can push it under capacity without any obvious alert until query latency degrades.

Confluent's TableFlow (the managed Kafka-to-Iceberg path inside Confluent Cloud) and Aiven's Iceberg Topics (the equivalent on Aiven's managed Kafka) bundle the compaction job into the managed service. That's convenient. It's also where the pricing gets opaque. Compaction compute is part of the bill; how much you pay depends on workload shape rather than a flat per-GB rate. I have not seen a public methodology disclosure from either vendor that lets me model compaction cost ahead of time. Treat their pricing as something to measure in a paid pilot, not something to model from the rate card. (Pricing claims here are directional; I have not independently verified 2026 quotes.)

For my own H-REALTIME-OCSF-01 proof-of-concept (Kafka Connect Iceberg sink writing OCSF Network Activity events to Iceberg through a Nessie REST catalog, on a single-node WSL2 environment), I measured roughly 7,000 events per second sustained, with zero data loss, before compaction lag became the binding constraint. That's a constrained environment and production hardware would scale higher, but it establishes a copy-based baseline I trust because I ran it myself, and what ran out first was compaction throughput rather than Kafka write rate or Iceberg metadata throughput.
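To put the 7,000 events-per-second figure in context, it helps to convert a daily event volume into an average rate and compare. The sketch below does that for the 50-million-events-per-day workload used elsewhere in this piece. The peak-to-average factor is a hypothetical parameter, not something I measured.

```python
SECONDS_PER_DAY = 86_400


def average_rate(events_per_day: float) -> float:
    """Average events per second implied by a daily volume."""
    return events_per_day / SECONDS_PER_DAY


def headroom(measured_eps: float, events_per_day: float, peak_to_average: float = 1.0) -> float:
    """Ratio of measured sustained throughput to the (optionally peak-adjusted) required rate."""
    return measured_eps / (average_rate(events_per_day) * peak_to_average)


print(f"average rate: {average_rate(50_000_000):.1f} events/s")
print(f"headroom at average load: {headroom(7_000, 50_000_000):.1f}x")

# Assumed burst factor of 5x; security telemetry is bursty, so the average understates the need.
print(f"headroom at an assumed 5x peak: {headroom(7_000, 50_000_000, peak_to_average=5):.2f}x")
```

Headroom against the average looks comfortable, but it shrinks quickly once burstiness is factored in, and the single-node result says nothing about how compaction scales on production hardware.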

Hidden cost #3

Exactly-once delivery is operationally expensive.

"Exactly-once semantics" is the property that each Kafka message lands in Iceberg exactly once: not zero times (data loss), not twice (duplicates that show up as inflated event counts in detection dashboards). It is what Kafka users want from a sink, and it is also one of the harder distributed-systems guarantees to deliver, at a cost that is not always visible in the architecture diagrams.

The Kafka Connect Iceberg sink achieves exactly-once by coordinating Kafka consumer offset commits with Iceberg snapshot commits. Each batch the connector writes becomes a new Iceberg snapshot only after the corresponding Kafka offsets have been recorded as consumed; if the connector crashes mid-batch, the next instance reads the same offsets and rewrites the same data, and the Iceberg commit protocol ensures only one of those attempts becomes part of the table.
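The "only one of those attempts becomes part of the table" property can be illustrated with a toy model. This is a sketch of the general idea, a commit that is conditional on the offsets the table has already absorbed, and not the connector's actual implementation. ToyTable and its fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """In-memory stand-in for a table whose commits record the next expected offset."""
    rows: list = field(default_factory=list)
    next_offset: dict = field(default_factory=dict)  # Kafka partition -> next offset to accept

    def commit(self, partition: int, start_offset: int, records: list) -> bool:
        """Apply a batch only if it starts exactly where the last commit ended."""
        expected = self.next_offset.get(partition, 0)
        if start_offset != expected:
            return False  # replayed or out-of-order batch: reject, table unchanged
        # In a real table format, data and offsets would land atomically in one snapshot.
        self.rows.extend(records)
        self.next_offset[partition] = start_offset + len(records)
        return True


table = ToyTable()
batch = ["evt-0", "evt-1", "evt-2"]

print(table.commit(partition=0, start_offset=0, records=batch))  # first attempt lands
print(table.commit(partition=0, start_offset=0, records=batch))  # replay after a crash is rejected
print(len(table.rows))
```

The design choice that matters is that the offset check and the data append succeed or fail together. Everything expensive about exactly-once comes from making that true across a Kafka cluster and a catalog rather than inside one Python object.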

That protocol works, but it also imposes two costs. First, the Iceberg commit becomes a serialization point; concurrent writers contend for the catalog's atomic commit, and commit throughput, rather than data throughput, can become the ceiling at high partition counts. Second, the connector needs to retain in-flight batches in a way that survives failover, which adds memory pressure and a tail-latency tax.

The shortcut some teams take is at-least-once semantics with downstream deduplication. Accept that some events may land twice, write a query-time or compaction-time dedup step, and avoid the coordination tax. That works for many security analytics workloads where small duplication rates are tolerable, but it does not work for compliance and forensic workflows where bit-for-bit fidelity is what they are after, so the honest call is made workload-by-workload rather than platform-wide.
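A deduplication step of the kind described above can be as small as the following. It is a minimal sketch that assumes each landed row still carries its Kafka coordinates (topic, partition, offset) as columns. That is an assumption about how the sink is configured, and the field names are illustrative.

```python
def dedupe(events: list[dict]) -> list[dict]:
    """Keep the first occurrence of each (topic, partition, offset), preserving order."""
    seen = set()
    unique = []
    for event in events:
        key = (event["topic"], event["partition"], event["offset"])
        if key in seen:
            continue  # duplicate delivery from an at-least-once sink
        seen.add(key)
        unique.append(event)
    return unique


landed = [
    {"topic": "edr", "partition": 0, "offset": 41, "host": "ws-17"},
    {"topic": "edr", "partition": 0, "offset": 42, "host": "ws-17"},
    {"topic": "edr", "partition": 0, "offset": 42, "host": "ws-17"},  # redelivered after a retry
    {"topic": "edr", "partition": 1, "offset": 42, "host": "ws-09"},  # same offset, other partition
]

print(len(dedupe(landed)))
```

Keying on Kafka coordinates rather than on event content is what keeps this cheap. Two different events with identical payloads survive, and a redelivered event is dropped without comparing any fields.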

Hidden cost #4

Schema evolution requires two teams to agree.

Kafka treats schema as an immutable contract. The schema registry says "this is what producers write and consumers expect," and the value of that contract is that it does not change underneath downstream consumers. Iceberg treats schema as evolvable: add a column, drop a column, rename a column, all without rewriting historical files.

Those two stances are not incompatible, but they require coordination. When the EDR vendor adds a new field to the event format, somebody has to decide whether the field flows into the Iceberg table as a new column, whether older Iceberg rows backfill with null, and whether downstream dashboards and detection rules need to be updated before or after the schema change lands. With copy-based integration (Kafka Connect Iceberg sink, Confluent TableFlow, Aiven Iceberg Topics), the materialization layer is the natural place to handle this, so you transform the schema during the write, add the column to Iceberg without touching Kafka, and keep the two evolutions independent.
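What "transform the schema during the write" looks like in the materialization layer can be sketched in plain Python. The helpers below are hypothetical, not any connector's API. They show a new field becoming a new column while older records simply read back as null for it.

```python
def evolve_columns(table_columns: list[str], record: dict) -> list[str]:
    """Return the column list extended with any fields the record introduces."""
    added = [name for name in record if name not in table_columns]
    return table_columns + added


def to_row(record: dict, table_columns: list[str]) -> dict:
    """Project a record onto the table's columns; missing fields become None (null)."""
    return {column: record.get(column) for column in table_columns}


columns = ["event_time", "host", "process"]

old_event = {"event_time": "2026-01-15T10:00:00Z", "host": "ws-17", "process": "powershell.exe"}
# The vendor ships a new field; the Kafka topic and its history are untouched.
new_event = {**old_event, "parent_pid": 4312}

columns = evolve_columns(columns, new_event)
print(columns)
print(to_row(old_event, columns))  # the older record gets parent_pid = None
print(to_row(new_event, columns))
```

Adding columns automatically is a policy choice, not a requirement. Many teams would rather have the connector reject or quarantine unknown fields until someone has decided what the dashboards and detection rules should do with them.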

With zero-copy integration (shared storage between Kafka and Iceberg), the coordination problem gets harder. Zero-copy systems including Bufstream, Aiven's tighter-shared-storage modes, and StreamNative's Ursa share Parquet files between the streaming layer and the analytics layer. That shared storage means you cannot evolve the Iceberg schema without affecting Kafka's view of the data, which leaves two bad options. Option A is the uber-schema approach where every change adds new nullable columns and the Parquet files bloat with deprecated fields. Option B is migrate-forward, where you rewrite historical files with the new schema and break Kafka's "what you wrote is what you read" guarantee, which matters for audit and forensics.

This is the schema-evolution argument Jack Vanlightly (distributed-systems specialist, formerly RabbitMQ, now Confluent) raised in his October 2025 critique of zero-copy Kafka-Iceberg integration, and it is the one critique I think still holds in full. Vanlightly's broader claim, that zero-copy shifts costs from storage to compute by an order of magnitude, has been weakened by the Ursa benchmarks below, but the schema-evolution coordination tax is structural rather than a benchmark question, so copy-based integration sidesteps it.

Two architectures

Copy-based vs zero-copy: where each one earns its keep.

The industry has split into two camps and they are worth understanding before evaluating specific products.

Copy-based (materialization)

Data is replicated from Kafka to Iceberg through a connector (Kafka Connect Iceberg Sink, Confluent's TableFlow, or a custom Flink or Spark job). There are two copies: the Kafka log (short retention, offset-ordered, optimized for streaming consumers) and the Iceberg table (long retention, business-dimension-partitioned, optimized for analytics). The two layers evolve independently. The Kafka Connect Iceberg Sink is the open-source path, reportedly in production at fifty-plus organizations as of late 2025, a claim I've seen in practitioner discussions but have not independently audited.

The trade-offs are predictable. Storage duplication is bounded, so you pay for Kafka's retention window plus the Iceberg copy rather than for both forever, and operational clarity is high because the streaming team owns Kafka, the analytics team owns Iceberg, and the materialization layer is the explicit boundary between them. Schema evolution stays clean because the connector is the place where transformations happen, and the cost you take on is the duplication and the connector itself as a thing to operate.

Zero-copy (shared storage)

Kafka and Iceberg read from the same Parquet files on object storage. There is one physical copy. The named products in this category are Bufstream, Aiven's Iceberg Topics in its tighter modes, and StreamNative's Ursa.

The trade-offs run the other way. Storage cost drops because the duplication goes away, but operational boundaries blur. When a query is slow, is that a Kafka problem or an Iceberg problem? Whose budget pays for the compaction work? And the schema-evolution conflict above becomes architecturally unavoidable. The "single source of truth" framing is elegant in slides; it gets harder when two teams with different optimization goals try to share one storage layer.

What changed in 2025

Ursa at VLDB and the limits of "zero-copy is hype."

The version of this essay I would have written a year ago framed zero-copy as vendor marketing wrapped around an architectural compromise, and that framing was reasonable in 2024, but it has weakened since.

Ursa is a lakehouse-native streaming engine published at VLDB 2025 by Merli et al., affiliated with StreamNative and the Apache Pulsar community. The architecture is zero-disk, leaderless, and Kafka-compatible, writing directly to Iceberg or Delta Lake on object storage with no local disk tier. The published numbers (Tier A, peer-reviewed venue): 5 GB/s sustained throughput for both publish and consume, invariant across 6-to-48 partitions and 1 KB-to-64 KB message sizes; P99 publish latency under one second at a 1:5 write-to-read fan-out ratio; reported 92% cost reduction versus disk-based Kafka and 78% versus Kafka's own tiered-storage option; infrastructure spend at roughly 5% of a comparable Kafka stack; CPU utilization at 30 to 60% under peak load with network I/O as the bottleneck rather than CPU.

The CPU number is the one that carries the most weight here, because Vanlightly's October 2025 critique relied on a claim that generating Parquet from Kafka log segments costs "at least an order of magnitude more CPU cycles" than copying log segments to object storage. I could not find a peer-reviewed source for that ten-times figure when I went looking, and the Ursa benchmarks show a network-bound architecture rather than a CPU-bound one. The directional intuition behind the critique may still be correct for some workloads; the specific ten-times number should be treated as a practitioner estimate, not measured fact.

What Ursa does not validate: it inherits Kafka's offset-ordered partitioning, so the partition inflexibility critique still applies. If your security analytics need partitioning by (event_time, endpoint_id, event_type) for threat hunting, Ursa does not give you that shape. The VLDB paper does not address the schema-evolution conflicts above. And the benchmarks come from StreamNative-affiliated authors; independent multi-org production validation would strengthen the claims considerably.

The honest update is narrow: peer-reviewed benchmarks now exist for a zero-copy architecture writing to open table formats, and they show economics that may justify accepting the partition and schema trade-offs for greenfield deployments. That is a different claim than "zero-copy is ready for security analytics." For brownfield Kafka deployments with custom partitioning needs, copy-based remains the safer call.

The market signal

Bufstream added a copy-based mode.

One data point that helped me calibrate the zero-copy versus copy-based debate: Bufstream, which launched as a zero-copy vendor, added a copy-based Iceberg export mode in a 2024 release. That is a vendor adding the other architecture's pattern as a first-class option, the kind of move that suggests customers were asking for the flexibility that copy-based provides, even after they bought into the zero-copy pitch.

I read that as confirmation that the "pick one philosophy" framing is too binary for production deployments. Hybrid approaches (zero-copy for some topics, copy-based for others, with the choice made workload-by-workload) are likely where the industry settles, because the honest architectural question is less about which philosophy wins and more about which trade-offs you can absorb on which workloads.

Iceberg V3 update

Row lineage may simplify CDC, but not the fundamentals.

Apache Iceberg V3 introduces row-level lineage (row IDs plus last-updated tracking), which enables CDC (change data capture) directly from Iceberg without external metadata. For Kafka-to-Iceberg integrations carrying update streams (rather than append-only telemetry), V3 row lineage may materially reduce the compute cost of keeping a streaming-fed Iceberg table consistent over time, because point updates become metadata operations rather than full snapshot diffs.
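The idea behind lineage-driven CDC can be shown with in-memory rows. This is a conceptual sketch only: the two underscore-prefixed field names follow the V3 spec's row-lineage fields as I understand them, while the rows, the helper, and the sequence numbers are hypothetical and no query engine is involved.

```python
def changes_since(rows: list[dict], last_seen_sequence: int) -> list[dict]:
    """Rows added or modified after the given sequence number."""
    return [r for r in rows if r["_last_updated_sequence_number"] > last_seen_sequence]


# A toy asset-inventory table after three commits (sequence numbers 1, 2, 3).
assets = [
    {"_row_id": 0, "_last_updated_sequence_number": 1, "host": "ws-17", "owner": "alice"},
    {"_row_id": 1, "_last_updated_sequence_number": 3, "host": "ws-09", "owner": "carol"},  # updated
    {"_row_id": 2, "_last_updated_sequence_number": 2, "host": "srv-02", "owner": "ops"},   # inserted
]

# A consumer that last synced at sequence 1 only needs the rows touched since then.
for row in changes_since(assets, last_seen_sequence=1):
    print(row["_row_id"], row["host"], row["owner"])
```

The stable row ID is what lets a downstream consumer tell "row 1 changed" apart from "a new row appeared", which is the piece that previously needed external metadata.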

For most security telemetry (EDR events, firewall logs, CloudTrail, network flow records) the workload is append-only and the V3 row lineage advantage is small. Where it may matter: asset inventory tables, user-identity tables, threat-intel feeds that update in place. Those are legitimately CDC-shaped and V3's mechanics may simplify the integration meaningfully. I cover the mechanics in more depth in "Row lineage and detection engineering."

The hedge is that engine support for V3 features is still rolling out across Spark, Trino, DuckDB, and Snowflake through 2026, and production references for streaming-fed Iceberg tables using V3 row lineage are sparse as of early 2026. The capability exists in the spec while the operational maturity is still forming, so verify your engine version before assuming V3 mechanics in production. None of this solves the fundamentals above (small files, compaction, exactly-once, schema evolution). Those are the structural costs of bridging streaming and analytics, and they persist regardless of V3.

Security workload economics

Three concrete patterns.

EDR telemetry: custom partitioning is the whole point

50,000 endpoints each emitting roughly 1,000 events per day is 50 million EDR events per day through Kafka. The analytics workload is threat hunting ("every PowerShell execution on host X in the last seven days"), and that query wants Iceberg partitioned by (event_date, endpoint_id, event_type). Kafka's offset-order partitioning is the wrong shape for that, and copy-based materialization is what lets you repartition. Kafka retention of seven days at compressed volume produces a small duplication window (directionally a few gigabytes), and the duplication cost runs in the cents-per-month range against typical S3 list pricing. That leaves the compaction compute as where the money actually goes.

Network flow logs: high cardinality, low value per row

10 Gbps of network traffic produces roughly 100 million flow records per day. The analytics shape is usually subnet-level ("all outbound SMB to internal subnet X"), and the partition strategy is (flow_date, source_subnet, dest_port). Same architectural call as EDR: copy-based for the partitioning flexibility, accept the duplication, watch compaction throughput closely because flow records are individually small and the small-file problem is most severe here.
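Deriving the (flow_date, source_subnet, dest_port) partition key from a flow record can be sketched with the standard library. The record layout and the /24 subnet width are assumptions made for illustration. A real table would more likely declare this through Iceberg partition transforms than compute it by hand.

```python
import ipaddress
from datetime import datetime, timezone


def flow_partition_key(flow: dict, prefix_len: int = 24) -> tuple[str, str, int]:
    """Map a flow record to (flow_date, source_subnet, dest_port)."""
    flow_date = datetime.fromtimestamp(flow["ts"], tz=timezone.utc).date().isoformat()
    # strict=False lets a host address stand in for its enclosing network.
    subnet = ipaddress.ip_network(f'{flow["src_ip"]}/{prefix_len}', strict=False)
    return flow_date, str(subnet), flow["dest_port"]


flow = {
    "ts": 1768438800,        # epoch seconds, 2026-01-15 01:00:00 UTC
    "src_ip": "10.1.2.77",
    "dst_ip": "10.9.0.5",
    "dest_port": 445,        # SMB
}

print(flow_partition_key(flow))
```

The subnet width is the knob to watch. A narrower prefix means more partitions and smaller files per partition, which makes the small-file problem worse for exactly the workload where it is already most severe.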

Cloud audit logs: schema evolution at the seam

AWS CloudTrail plus Azure Activity plus GCP Audit, normalized to the OCSF schema for unified analytics, with regulatory retention pushing the Iceberg side to multi-year horizons. This is where the schema-evolution argument matters most. Cloud providers add new event types, OCSF revisions change field semantics, and the immutable Kafka history needs to coexist with an Iceberg table that's been re-shaped multiple times. Copy-based with the materialization layer handling OCSF normalization is the architecturally clean answer, whereas zero-copy is the answer that gets you into the uber-schema-versus-migrate-forward dilemma.

Decision framework

How to choose between the integration approaches.

Choose copy-based if

You already operate Kafka and migration costs are the binding constraint.
Your security analytics require custom partitioning by event time, asset, or severity.
Schema evolution is frequent: quarterly EDR updates, OCSF revisions, new cloud audit fields.
Compliance and forensics workflows need immutable Kafka history plus an evolvable Iceberg view.
Operational clarity matters, with separate teams owning streaming and analytics layers.

Tooling: Kafka Connect Iceberg Sink for the open-source path. Confluent's TableFlow if already on Confluent Cloud. Aiven's Iceberg Topics in copy-based mode if already on Aiven. Compaction job (Spark or Flink) sized to exceed average write rate by a comfortable margin.

Evaluate zero-copy seriously if

Greenfield deployment with no existing Kafka commitment; partition strategy is still open.
Storage cost is the primary constraint and the Ursa-class economics are credible for your scale.
Offset-ordered partitioning is acceptable for your analytics workload.
Schema is stable, with well-defined data contracts and infrequent revisions.

Tooling: Ursa (StreamNative, open formats, VLDB-published benchmarks). Bufstream (Buf Technologies, dual-mode zero-copy plus copy-based as of late 2024). Aiven's tighter Iceberg Topics modes. Tabular File Loader is the open-source path for getting batched files into Iceberg without a Kafka intermediary at all, which is sometimes the right answer for non-streaming pipelines that have been forced through Kafka unnecessarily.

What I'd do

Practical guidance for 2026.

For most security teams with an existing Kafka deployment, I'd start with Kafka Connect Iceberg Sink, budget for compaction compute as a first-class line item, and instrument compaction lag the way you'd instrument Kafka consumer lag. The copy-based path is the well-trodden one, the partitioning flexibility is what threat hunting actually needs, and the schema-evolution coordination problem stays manageable.

For greenfield deployments where the cost ceiling is the binding constraint, evaluate Ursa-class zero-copy seriously, but run your own benchmark against your own workload before committing. StreamNative's VLDB numbers are Tier A evidence; whether they hold at security-telemetry shapes (bursty, high-cardinality, OCSF-normalized) is an open question I'd want a paid pilot to answer before betting architecture on it.

For hybrid workloads (append-only telemetry plus a handful of CDC-shaped reference tables: asset inventory, identity, threat intel) the multi-format pattern is increasingly viable, with copy-based Kafka-to-Iceberg for the telemetry, V3 row lineage for the CDC tables, and a single Iceberg governance layer across both. The "pick one philosophy" framing is the wrong shape for 2026 deployments, and hybrid is where the practitioner consensus is heading, which I think is the right read.

And on the related question of which table format underneath any of this (Iceberg or Delta Lake), see "Iceberg vs Delta Lake for security data." The Kafka-to-Iceberg integration question above is mostly format-agnostic; the equivalent Kafka-to-Delta integration question has the same four hidden costs with different tooling around them.

Conclusion

The integration tax is the architecture.

Kafka-to-Iceberg integration is more than a wiring exercise, because the four hidden costs (small files, compaction throughput, exactly-once delivery, schema-evolution coordination) are themselves the architecture. The copy-based versus zero-copy debate is a debate about which trade-offs you're willing to absorb on which workloads, and the honest answer in 2026 is workload-by-workload rather than platform-wide.

The Ursa VLDB paper is the strongest update to this picture in two years, and while it does not make zero-copy the right call for every security workload, it does make "zero-copy is just vendor marketing" an outdated framing. Iceberg V3 row lineage adds another option for the small subset of workloads that are CDC-shaped, but the fundamentals are unchanged, because bridging streaming and analytics still requires compaction, exactly-once, and schema coordination.

For the security teams I work with, copy-based Kafka Connect Iceberg Sink remains the default recommendation, with compaction compute as the line item I push hardest to size correctly. The duplication cost is small; the compaction cost is the one that determines whether the lakehouse stays performant.