Technology deep-dive

The streaming database decision for security data.

I've spent the last year watching four engines crowd into the same architectural slot: the place where a security data pipeline does its stateful streaming work. RisingWave, Apache Fluss, Kafka Streams, and Spark Real-Time Mode (generally available on Databricks since March 2026) are not interchangeable products so much as four different bets on how state, storage, and compute should be arranged. Picking one is therefore a strategic call rather than a feature comparison, and the honest answer for most security teams in early 2026 is still the boring one.

Reading time: about 18 minutes. Evidence tier: B/C overall. The evidence base is production deployments at Netflix (Kafka plus Iceberg), Alibaba (Fluss), and Atome (RisingWave for fraud detection), plus expert analysis from Jack Vanlightly and Anton Borisov. No engine on this list has a publicly disclosed Fortune 500 security deployment at 10 TB/day. Treat the cost and performance numbers as starting hypotheses to benchmark against your own workload, not as quoted figures.

The setup

The integration problem nobody wants to maintain.

A typical security data pipeline in 2026 looks like three systems pretending to be one. Kafka handles 10 TB/day of telemetry (EDR, network flow, cloud audit, application logs) with seven days of retention. Iceberg stores the same data for analytics at roughly $0.023/GB/month on S3 with multi-year retention. ClickHouse powers the hot tier for queries running against the last thirty days at sub-second latency.

It works, but it costs an estimated $360K/year for the components alone, before headcount. It also requires a materialization layer in between (Kafka Connect Iceberg Sink, a Flink job, or a custom Spark Structured Streaming pipeline) that has to be operated, monitored, and version-managed alongside the systems on either side. You end up running three systems with three failure modes and two integration surfaces.

Then a vendor demo shows up. RisingWave runs PostgreSQL-compatible SQL against a continuous query, writes the result to Iceberg with no separate sink, and gives you sub-second latency on materialized views. Or Apache Fluss promises unified streaming plus lakehouse in one engine with an 80% cost reduction claim from Alibaba. Or Databricks announces that Spark Real-Time Mode is generally available, eliminating the micro-batch latency tax that's haunted Structured Streaming since 2017, and inviting you to keep your existing Spark investment and just flip a configuration flag.

Each of these is a real architectural option, and each solves a real problem with the three-system status quo. None of them, though, is the obvious choice for a SOC running 10 TB/day of security telemetry in early 2026, and most of this essay works through why. For a closer look at what changes when detection logic moves into the stream processor, see pipeline-based detection in stream processing.

Four philosophies

The decision is about state management, not features.

Anton Borisov (data architect at Fresha) framed this well in a September 2025 piece: the streaming engines that look like competing products are actually competing philosophies of state management. State management means where the engine keeps the partial results it needs to answer "what's happening right now": the join history, the windowed aggregates, the sliding session counters. That decision shapes everything downstream. I'd extend Borisov's three-philosophy frame with a fourth category that Spark Real-Time Mode forces onto the page.

Philosophy 1: separation of concerns (Kafka plus Iceberg)

Kafka handles streaming with the state external to the streaming layer: Kafka itself stays ephemeral (offset-ordered, days of retention, row-based), Iceberg holds persistent state (business-partitioned, years of retention, columnar Parquet). A materialization layer in between (Kafka Connect, Flink, or Spark) does the format conversion and the partition-strategy rewrite.
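To make the partition-strategy rewrite concrete, here is a minimal Python sketch of what the materialization layer does conceptually. The record fields, the (day, event_type) partition key, and the function names are illustrative assumptions of mine; a real pipeline would use Kafka Connect, Flink, or Spark and write Parquet files rather than Python dicts.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical records as they might sit in one Kafka partition:
# ordered by offset only, stored row by row.
kafka_records = [
    {"offset": 0, "event_time": "2026-01-14T23:59:58Z", "event_type": "edr", "host": "h1"},
    {"offset": 1, "event_time": "2026-01-15T00:00:01Z", "event_type": "dns", "host": "h2"},
    {"offset": 2, "event_time": "2026-01-14T23:59:59Z", "event_type": "dns", "host": "h1"},
    {"offset": 3, "event_time": "2026-01-15T00:00:03Z", "event_type": "edr", "host": "h3"},
]


def partition_key(record):
    """Business partition key (assumed here): event day plus event type."""
    ts = datetime.strptime(record["event_time"], "%Y-%m-%dT%H:%M:%SZ")
    ts = ts.replace(tzinfo=timezone.utc)
    return (ts.date().isoformat(), record["event_type"])


def to_columnar(rows):
    """Turn a list of row dicts into one list per column."""
    columns = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


def materialize(records):
    """Regroup offset-ordered rows into business partitions of columns."""
    by_partition = defaultdict(list)
    for record in records:
        by_partition[partition_key(record)].append(record)
    return {key: to_columnar(rows) for key, rows in by_partition.items()}


tables = materialize(kafka_records)
```

Offset order and query order are different orders: records 0 and 2 land in the 2026-01-14 partitions even though record 1, which belongs to 2026-01-15, arrived between them.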

The trade-off is operational complexity (three systems, three failure modes) versus independent optimization (each system tuned for its job), and it's the architecture Netflix runs at 5 PB/day, the one Jake Thomas at Okta uses for 7.5 trillion records, and the one I've seen roughly fifty production security deployments converge on because it works.

Philosophy 2: hybrid tiered storage (Apache Fluss)

Fluss positions itself as a Kafka alternative with native lakehouse integration. Hot data lives in memory and on local SSD; cold data lives on S3 or ADLS. The client stitches the two together at query time. Operator state stays minimal: the engine leans on an external KV store plus the log itself rather than keeping a large operator state inside the streaming layer.

The trade-off is simplicity (one system instead of three) versus ecosystem loss (no Kafka Connect, no Schema Registry, no ksqlDB, no fifteen years of accreted tooling). Jack Vanlightly classified Fluss in September 2025 as "tiered storage with client-side stitching," which is to say it's neither a zero-copy lakehouse nor a Kafka clone but something specifically new.
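"Client-side stitching" is easier to picture with a toy model. The sketch below is not the Fluss API; the class, method names, and dict-backed tiers are hypothetical stand-ins I'm using only to show the shape of the idea: recent offsets sit in a hot tier, older ones are moved to a cold tier, and the reader assembles one ordered result from both.

```python
class TieredLog:
    """Toy tiered log: old offsets in a 'cold' tier, recent ones in a 'hot' tier."""

    def __init__(self):
        self.cold = {}  # offset -> record, stands in for S3 or ADLS
        self.hot = {}   # offset -> record, stands in for memory or local SSD
        self.next_offset = 0

    def append(self, record):
        """Append to the hot tier and return the assigned offset."""
        offset = self.next_offset
        self.hot[offset] = record
        self.next_offset += 1
        return offset

    def tier_down(self, up_to_offset):
        """Move every hot record with offset < up_to_offset to the cold tier."""
        for offset in sorted(o for o in self.hot if o < up_to_offset):
            self.cold[offset] = self.hot.pop(offset)


def stitched_read(log, start, end):
    """Client-side read of offsets [start, end), pulling from whichever tier holds each."""
    out = []
    for offset in range(start, min(end, log.next_offset)):
        tier = log.cold if offset in log.cold else log.hot
        out.append((offset, tier[offset]))
    return out


log = TieredLog()
for name in ["e0", "e1", "e2", "e3", "e4"]:
    log.append(name)
log.tier_down(3)  # offsets 0..2 are now cold, 3..4 stay hot
result = stitched_read(log, 1, 5)
```

The reader sees one continuous range of offsets; which tier served each record is an implementation detail, which is the property the classification is pointing at.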

Philosophy 3: stateless compute over shared storage (RisingWave)

RisingWave is a streaming database with PostgreSQL-compatible SQL on top and stateless compute actors over a shared storage layer (a log-structured-merge implementation called Hummock, running on S3). Because the state lives in shared storage and the compute is disposable, the same engine can offer materialized views with sub-second freshness, native Iceberg writes, and time-travel queries.
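The practical consequence of "state in shared storage, disposable compute" is that losing a worker loses no state. The following Python sketch illustrates only that property; the class names are hypothetical, the store is an in-memory dict standing in for the role Hummock on S3 plays, and a real engine would add caching and checkpointing that I'm leaving out.

```python
class SharedStore:
    """Stands in for durable shared storage; survives any individual worker."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=0):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class ComputeWorker:
    """Holds no durable state of its own; every update goes through the store."""

    def __init__(self, store):
        self.store = store

    def process(self, event):
        key = ("failed_logins", event["user"])
        self.store.put(key, self.store.get(key) + 1)


store = SharedStore()

worker_a = ComputeWorker(store)
for user in ["alice", "bob", "alice"]:
    worker_a.process({"user": user})

del worker_a  # simulate the worker being lost

# A replacement worker picks up where the first left off, with no state transfer.
worker_b = ComputeWorker(store)
worker_b.process({"user": "alice"})

alice_count = store.get(("failed_logins", "alice"))
```

The cost of this arrangement is that every state access is, in principle, a trip to shared storage rather than to local disk.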

The trade-off is operational simplicity (SQL-first, PostgreSQL ecosystem, one system) versus production validation specifically for security workloads (the validated case studies are fraud detection and manufacturing observability, not SIEM-scale telemetry). For an engine-level reference, see the RisingWave streaming component.

Philosophy 4: batch-streaming convergence (Spark Real-Time Mode)

Spark Real-Time Mode reached general availability on Databricks in March 2026. The premise is that the same SQL or DataFrame API that runs your batch jobs can run your streaming jobs, with the micro-batch latency overhead of Structured Streaming removed. Matei Zaharia (Spark co-creator at Databricks) called it the long-term vision for streaming the project has held since Structured Streaming shipped. Josh Bogran's Databricks demos position it as a unified engine and an Apache Flink replacement for many use cases.

The trade-off is zero migration cost (for the estimated 60 to 70% of data teams already running Spark, this is a configuration flag, not a new platform) versus unvalidated streaming performance for security workloads (no independent benchmarks compare Real-Time Mode against Flink, RisingWave, or Fluss for OCSF-shaped event data as of early 2026).

I want to flag the evidence quality here directly. "Amazing performance numbers" is Zaharia's own phrasing, and he is an architect promoting his own project, so it's useful Tier B input but not a substitute for a third-party benchmark. Real-Time Mode is also Databricks-first; the open-source Spark community will get there, but if you're running EMR or a self-managed Spark cluster without Databricks, this option may not be on your menu in 2026.

Translation layer

The streaming terms a security architect should know.

Streaming literature carries a vocabulary that's worth translating once before the comparisons land, because the same word means subtly different things across engines.

State stores are the engine's working memory: the partial join history, the rolling counters, the windowed aggregates the rule needs to fire correctly. Kafka Streams uses embedded RocksDB on each task's local disk. Flink uses a similar model. RisingWave puts state in shared storage on S3. Fluss leans on an external KV plus the log. These are not equivalent, because the local-disk model is faster for hot state but harder to scale and recover, while the shared-storage model is slower per operation but elastic and durable without operator intervention.

Exactly-once means the engine guarantees each event is reflected in the output exactly one time, even under failure. The standard alternatives are at-least-once (the event may appear more than once after a recovery) and at-most-once (the event may be dropped). For detection rules where a false positive costs analyst hours and a false negative costs a breach, exactly-once is what you want. The cost is engineering complexity in the engine, and every engine on this list claims it; the implementations differ in subtle ways that matter under partial failure.
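One common way to get an exactly-once result out of at-least-once delivery is to make the state update idempotent by remembering the last offset applied. The sketch below is a generic illustration, not any listed engine's implementation; it assumes a single partition with monotonically increasing offsets, and the class names are mine.

```python
class NaiveCounter:
    """Applies every delivery, so replays after a recovery are double-counted."""

    def __init__(self):
        self.total = 0

    def apply(self, offset, value):
        self.total += value


class IdempotentCounter:
    """Stores the last applied offset alongside the state and ignores replays."""

    def __init__(self):
        self.total = 0
        self.last_offset = -1

    def apply(self, offset, value):
        if offset <= self.last_offset:
            return  # replayed after recovery; already reflected in total
        self.total += value
        self.last_offset = offset


# At-least-once delivery: offsets 2 and 3 are redelivered after a simulated recovery.
delivered = [(0, 1), (1, 1), (2, 1), (3, 1), (2, 1), (3, 1), (4, 1)]

naive, idempotent = NaiveCounter(), IdempotentCounter()
for offset, value in delivered:
    naive.apply(offset, value)
    idempotent.apply(offset, value)
```

Five distinct events were sent; the naive counter reports 7 and the idempotent one reports 5. The subtle part in a real engine is making the offset and the state commit atomically, which is where the implementations diverge under partial failure.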

Watermarks are the engine's estimate of "how late an event can arrive and still be counted." A watermark of five minutes says the engine waits five minutes after a window closes before declaring the result final, to allow late-arriving events to land. Watermark tuning is one of the most common operational pain points in streaming systems: set it too tight and you miss late events; set it too loose and your detection latency floor rises.
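The tight-versus-loose trade-off shows up even in a toy model. The sketch below assumes one common formulation, where the watermark trails the largest event time seen so far by a fixed allowed lateness and a tumbling window is finalized once its end falls at or before the watermark. The function name, window size, and event times are illustrative, not taken from any specific engine.

```python
from collections import defaultdict

WINDOW = 60             # tumbling window size, in seconds
ALLOWED_LATENESS = 300  # the "watermark of five minutes" from the text


def run(events, window=WINDOW, lateness=ALLOWED_LATENESS):
    """Count events per tumbling window.

    events: event-time seconds, listed in arrival order.
    Returns (finalized window counts keyed by window start, dropped late events).
    """
    open_windows = defaultdict(int)
    finalized = {}
    dropped = 0
    max_event_time = float("-inf")
    for t in events:
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - lateness
        start = (t // window) * window
        if start + window <= watermark:
            dropped += 1  # its window is already final; the event is too late
        else:
            open_windows[start] += 1
        # Finalize every open window whose end is at or before the watermark.
        for s in sorted(w for w in open_windows if w + window <= watermark):
            finalized[s] = open_windows.pop(s)
    return finalized, dropped


# Event times in arrival order; 30 and 50 arrive out of order.
arrivals = [10, 20, 70, 30, 400, 50, 430]

loose = run(arrivals, lateness=300)  # ({0: 3, 60: 1}, 1)
tight = run(arrivals, lateness=0)    # ({0: 2, 60: 1, 360: 1}, 2)
```

With five minutes of lateness the out-of-order event at t=30 is still counted, but nothing is final until the stream has moved well past the window. With zero lateness results arrive sooner and both stragglers are dropped.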

Materialized views in a streaming database are continuously updated query results, kept fresh as new events arrive. RisingWave's central abstraction is a materialized view that stays sub-second-fresh against the input stream. Think of it as "the answer to a query, maintained incrementally, rather than recomputed from scratch each time you ask."
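A minimal sketch of "maintained incrementally rather than recomputed," in plain Python rather than RisingWave SQL. The view class, the event fields, and the query it mirrors are hypothetical; the point is only that each new event does a constant amount of work instead of rescanning history.

```python
from collections import Counter


class FailedLoginsView:
    """Incrementally maintained equivalent of a query along the lines of:
    SELECT user, COUNT(*) FROM events WHERE outcome = 'fail' GROUP BY user
    """

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        """Update the stored answer with one new event."""
        if event["outcome"] == "fail":
            self.counts[event["user"]] += 1

    def query(self):
        """Read the current answer without touching the event history."""
        return dict(self.counts)


def recompute(events):
    """The from-scratch alternative: rescan every event on every query."""
    return dict(Counter(e["user"] for e in events if e["outcome"] == "fail"))


events = [
    {"user": "alice", "outcome": "fail"},
    {"user": "bob", "outcome": "ok"},
    {"user": "alice", "outcome": "fail"},
    {"user": "carol", "outcome": "fail"},
]

view = FailedLoginsView()
for event in events:
    view.on_event(event)
```

Both paths give the same answer; the difference is that `view.query()` costs the same whether the stream holds four events or four billion.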

Production evidence

What each engine has actually shipped.

Kafka plus Iceberg

Maturity: production-validated at petabyte scale. Netflix runs 5 PB/day on a ClickHouse-plus-Iceberg architecture with Kafka in front. Jake Thomas at Okta queries 7.5 trillion records against DuckDB plus Iceberg. The Kafka Connect Iceberg Sink GitHub repository has 1,500-plus stars and a conservative estimate of fifty-plus production users; the actual number is almost certainly higher. Jack Vanlightly's October 2025 analysis is the canonical Tier-A piece on this architecture pattern.

Validated security use cases: EDR telemetry at 50 million events per day with custom partitioning by endpoint and event type; network flow logs at 100 million flows per day partitioned by source subnet and destination port; cloud audit logs at 50 million API calls per day with OCSF normalization and schema evolution applied at the Iceberg layer without touching the immutable Kafka archive.

Apache Fluss

Maturity: Apache Incubator, pre-1.0 as of early 2026. The single named production deployment is Alibaba at over 1 PB of data, with a claimed 80% cost reduction versus a Kafka-plus-separate-lakehouse baseline. The cost claim is single-vendor and has not been independently benchmarked.

Security use cases: theoretical. The potential fit is stateful stream processing for behavioral analytics (session tracking, user-activity sequences, primary-key tables for upserts), but no public security deployment of Fluss exists, and the Alibaba case is general-purpose streaming rather than security telemetry. The Flink integration story is the strongest part of the pitch, so if your team already has deep Flink expertise, the surface-area learning curve is smaller than it looks.

RisingWave

Maturity: open-sourced in 2022, $36M Series A in 2022, growing. RisingWave claims 1,000-plus organizations as users; the vendor number is unverified, and I'd discount it heavily as marketing signal. The named, public production references are Atome (buy-now-pay-later financial services, real-time risk management and fraud detection at sub-second latency) and CVTE (manufacturing, materialized views and real-time dashboards). VMware Skyline used Differential Datalog, RisingWave's technological predecessor, at scale for several years.

Security use cases: fraud detection and audit log analysis are validated, but SIEM-scale telemetry at 10 TB/day is not. As far as the public record goes, no Fortune 500 security customer is disclosed and no SIEM-scale benchmark with disclosed methodology is published. Fraud detection patterns are close enough to behavioral security analytics that the directional fit is plausible, but I would not bet a production security operation on the analogy without piloting first.

Spark Real-Time Mode

Maturity: general availability on Databricks in March 2026. No named public production deployments as of early 2026. Zaharia references unnamed customers seeing strong performance numbers; that's the only signal in the public record.

The unique advantage is the install base, since Spark is one of the most widely deployed data-processing engines globally, and for existing Spark users Real-Time Mode is a configuration change rather than a platform adoption. For a security team that already runs Spark for batch ETL, this is the lowest-friction streaming option on the list, with the caveat that no independent benchmark exists yet to validate the streaming latency or throughput against security workloads.

Cost modeling

10 TB/day, one year of retention.

The numbers in this section are public pricing applied to an illustrative workload. Treat them as a starting hypothesis for your own modeling, not as a quoted figure for any specific deployment. The scenario: a mid-size enterprise SOC ingesting 10 TB/day of EDR, network, cloud audit, and application logs; seven days of Kafka retention (70 TB total); one year of lakehouse retention (3.65 PB total); thirty days of hot tier on ClickHouse (300 TB).

Component	Kafka+Iceberg	Fluss (estimate)	RisingWave (estimate)
Streaming layer (1 year, compute)	$120K	included	$150-200K
Lakehouse storage S3 (3.65 PB)	$101K	included	$101K
Materialization (Flink/Spark + 0.5 FTE)	$60K	n/a	n/a
ClickHouse hot tier (300 TB SSD)	$80K	$80K	$80K
Storage duplication (7-day overlap)	$1.4K	n/a	n/a
Annual total	~$362K	~$152K (if 80% claim)	~$230-280K

A few honest caveats on this table. The Kafka-plus-Iceberg storage duplication for the seven-day overlap between Kafka retention and Iceberg landing is 0.4% of total cost, so the "data duplication is expensive" argument I hear in streaming-database pitches doesn't survive a spreadsheet. The Fluss column is a back-of-envelope application of Alibaba's 80% cost reduction claim, which is single-vendor and not independently validated, so I'd treat the Fluss number as a best case rather than a forecast. The RisingWave column assumes open-source self-hosting; RisingWave Cloud pricing is not publicly disclosed, and I'd expect it to carry a 20 to 40% premium.
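If you want to rerun the Kafka-plus-Iceberg column against your own numbers, the arithmetic is short enough to keep in a script. The component figures below are copied from the table (in thousands of dollars per year). The storage helper is my addition: the essay gives the S3 list price and the 3.65 PB raw volume but no compression ratio, so `compression_ratio` is an assumed knob, and the helper also assumes the full volume is stored for the full year.

```python
# Kafka-plus-Iceberg column from the table, in $K per year.
KAFKA_ICEBERG_COSTS_K = {
    "streaming_layer": 120.0,
    "lakehouse_s3": 101.0,
    "materialization": 60.0,
    "clickhouse_hot_tier": 80.0,
    "storage_duplication": 1.4,
}


def annual_total(costs):
    """Sum of all components, in the same unit as the inputs."""
    return sum(costs.values())


def share(costs, component):
    """Fraction of the annual total attributable to one component."""
    return costs[component] / annual_total(costs)


def s3_annual_cost_usd(raw_tb, price_per_gb_month=0.023, compression_ratio=1.0):
    """Yearly S3 cost in dollars for raw_tb of data (decimal TB).

    compression_ratio is an assumption to tune, not a figure from the essay.
    """
    stored_gb = raw_tb * 1000 / compression_ratio
    return stored_gb * price_per_gb_month * 12


total_k = annual_total(KAFKA_ICEBERG_COSTS_K)                     # about 362.4
dup_share = share(KAFKA_ICEBERG_COSTS_K, "storage_duplication")   # about 0.004
lakehouse_usd = s3_annual_cost_usd(3650, compression_ratio=10.0)  # about 100,740
```

A compression ratio near 10:1 lands close to the table's $101K lakehouse line, which is a reminder of how sensitive that line is to the ratio you actually achieve on your own telemetry.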

Spark Real-Time Mode doesn't fit cleanly on this table because the cost depends entirely on whether you're already on Databricks. If you are, the marginal cost is compute hours against your existing cluster, potentially the lowest total cost on the page. If you're not, the cost includes Databricks platform adoption, which is a different kind of decision.

Decision framework

Which engine fits which security team.

Choose Kafka plus Iceberg if

You already run Kafka and the migration cost of replacing it dominates any streaming-database benefit.
Query performance against custom partitioning matters — event_time plus asset_id plus severity is the partition key your analysts actually use.
Operational clarity matters and you'd rather have three systems with clear boundaries than one system with blurred ownership.
Multi-year retention with regulatory compliance is in scope (HIPAA, PCI-DSS, SOX) and the immutable Kafka archive plus clean Iceberg schemas is the pattern you want to defend in an audit.

Production validation: Netflix (5 PB/day), Jake Thomas at Okta (7.5T records), fifty-plus organizations on Kafka Connect Iceberg Sink. Risk level: low.

Choose RisingWave if

You're greenfield — no existing Kafka investment to migrate from.
Your team is SQL-first. Most of your SOC analysts and detection engineers can write SQL but would struggle with Flink's DataStream API.
Your highest-value use case is real-time analytics against materialized views — behavioral anomalies, fraud-detection-shaped patterns, audit log monitoring — rather than full SIEM-scale telemetry.
You can accept the security-validation gap and pilot before committing.

Production validation: Atome (fraud detection at sub-second latency), CVTE (manufacturing). No Fortune 500 security customer publicly disclosed. Risk level: medium. Validation steps before production: pilot on fraud or audit-log analysis at 10% of your event volume; benchmark query performance against ClickHouse for your SIEM-style partitioning; verify the materialized-view freshness under late-event conditions you actually see in production.

Choose Spark Real-Time Mode if

You're already on Databricks and Spark is the engine your data team operates today.
The marginal cost of streaming is the difference between batch and streaming on the same cluster, not platform adoption.
You can tolerate being an early adopter on a March 2026 GA — no named security deployments yet, no third-party benchmarks against Flink or RisingWave for security workloads.

Production validation: unnamed Databricks customers per Zaharia. Risk level: medium for Databricks-committed teams, high for open-source Spark teams (the feature is Databricks-first). The Databricks Lakewatch announcement in late March 2026 is a related signal, since it bundles Real-Time Mode plus Delta plus Unity Catalog into a security-specific product, which deepens both the integration argument and the lock-in argument simultaneously.

Monitor Apache Fluss if

Your team has deep Flink expertise and can read the source as the documentation matures.
You're greenfield and the 80% cost-reduction claim, even discounted heavily, would meaningfully change your business case.
You can wait. Apache graduation and three-plus non-Alibaba production deployments are the signals I'd watch for before recommending Fluss in production.

Production validation: Alibaba (single-vendor, general-purpose streaming, not security). Risk level: high. The right move for most teams is to track adoption signals through 2026 rather than to deploy.
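The framework above can be written down as a small rule function, which is a useful way to check whether your team profile is ambiguous. This is a sketch under my own assumptions: the field names are hypothetical, the criteria are reduced to booleans, and the order in which the rules are checked is my reading of the essay's priorities rather than something the framework dictates.

```python
from dataclasses import dataclass


@dataclass
class TeamProfile:
    runs_kafka: bool = False
    on_databricks: bool = False
    sql_first: bool = False
    greenfield: bool = False
    deep_flink_expertise: bool = False
    can_pilot_before_committing: bool = False


def recommend(p):
    """Return (engine, action, risk level) for a team profile."""
    if p.runs_kafka:
        # Existing Kafka investment dominates any streaming-database benefit.
        return ("Kafka plus Iceberg", "deploy", "low")
    if p.on_databricks:
        return ("Spark Real-Time Mode", "pilot", "medium")
    if p.greenfield and p.sql_first and p.can_pilot_before_committing:
        return ("RisingWave", "pilot", "medium")
    if p.greenfield and p.deep_flink_expertise:
        return ("Apache Fluss", "monitor", "high")
    # Nothing else matched: fall back to the production-validated pattern.
    return ("Kafka plus Iceberg", "deploy", "low")


kafka_shop = recommend(TeamProfile(runs_kafka=True, sql_first=True))
greenfield_sql = recommend(
    TeamProfile(greenfield=True, sql_first=True, can_pilot_before_committing=True)
)
```

Note that Fluss never comes back as "deploy" here, only "monitor", which matches the risk levels stated above.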

Migration risk

What each migration actually costs.

Kafka to Kafka-plus-Iceberg integration: low risk

You add an integration layer, and because Kafka Connect Iceberg Sink is open-source and mature, producers and consumers don't change and the rollout is incremental by data source. It's also reversible, since turning off the sink leaves Kafka unaffected. Estimated timeline: two to four weeks for a pilot on a single data source (DNS logs are the usual choice, high volume and low downstream value), three to six months for full migration across all sources.

Kafka to RisingWave: medium risk

You can keep Kafka for streaming and use RisingWave for analytics, or replace Kafka entirely. The partial migration is the lower-risk option, since RisingWave handles materialized views and continuous queries while Kafka stays in front of it as the ingestion buffer. PostgreSQL-compatible SQL eases the analyst migration and Iceberg integration is native, though producers require a rewrite if you replace Kafka outright. The security-validation gap means I'd pilot for three to six months before committing the full SOC operation.

Kafka to Apache Fluss: high risk

You replace Kafka entirely, so producers and consumers have to be rewritten (the Kafka API compatibility story is partial and changing) and topics have to be migrated to Fluss tables. Kafka Connect goes away and you adopt Flink for integration, Schema Registry and ksqlDB don't exist in the Fluss world, and rollback is hard once data lives in Fluss format. All-or-nothing cutover is the realistic deployment shape, and single-vendor validation is the broader risk, so I'd plan six to twelve months and a meaningful rewrite budget for any team considering this seriously.

Existing Spark to Real-Time Mode: low to medium risk

For Databricks-committed teams, the change is a configuration flag and a job rewrite from micro-batch Structured Streaming idioms to per-event Real-Time Mode idioms, which is meaningful but contained. The risk is the maturity of the GA itself; Databricks ships fast, and a March 2026 GA on a streaming engine warrants three to six months of monitoring before betting production detection on it. For open-source Spark teams not on Databricks, this isn't a migration path in early 2026.

Evidence gaps

What's still missing in the public record.

The honest summary of the evidence available in early 2026:

Kafka plus Iceberg has Tier A/B evidence. Jack Vanlightly's October 2025 validation, Netflix at 5 PB/day, fifty-plus organizations on Kafka Connect Iceberg Sink, public cost modeling that survives scrutiny.
Apache Fluss has Tier B/C evidence. Vanlightly's classification piece (September 2025) and Borisov's comparative analysis (September 2025) are credible expert signals. The Alibaba production deployment is real. The 80% cost reduction claim is single-vendor and unverified. No multi-organization adoption signal exists yet.
RisingWave has Tier B/C evidence. Borisov has analyzed it; Vanlightly and Ryan Blue have not published RisingWave deep-dives as of early 2026. Atome and CVTE are public production references. No Fortune 500 security customer, no SIEM-scale benchmark, no Tier-A expert validation.
Spark Real-Time Mode has Tier C/D evidence. Zaharia's framing and Bogran's demos are credible signals from architects with skin in the game, though there's no independent benchmark, no named public deployment, and no security-specific validation. GA was March 2026, so six months of public deployment maturity simply hasn't accumulated yet.

I track Vanlightly's blog, Blue's blog, and the RisingWave and Fluss community signals monthly. The gap I'd most like to see filled in 2026 is a methodology-disclosed benchmark against OCSF-shaped security event data at 10 TB/day comparing Flink, RisingWave, and Spark Real-Time Mode. The closest anyone has come in public is the Atome fraud-detection case study, which is directionally relevant but not the same workload.

Recommendation

The boring answer is still the right answer for most teams.

For most security teams running 10 TB/day in early 2026, Kafka plus Iceberg remains the production-proven choice. Storage duplication for the seven-day overlap is 0.4% of total cost, which is negligible, and the operational clarity of three systems with clean boundaries beats the theoretical simplicity of one new system with blurred ownership and incomplete tooling.

For greenfield teams that are SQL-first and whose primary use case is materialized-view analytics against fraud-or-behavior patterns, RisingWave is worth a pilot. The architecture is clean, the PostgreSQL ergonomics are real, and the Atome case study is close enough to a fraud-shaped security workload to be directionally useful. Pilot at 10% of your volume; validate against your real partitioning strategy; treat the security-validation gap as the load-bearing risk to retire.

For Databricks-committed teams, Spark Real-Time Mode is worth a pilot through late 2026 once the GA has more deployment mileage. The marginal cost is low and the integration story with Delta plus Unity Catalog plus Lakewatch is compelling on paper; whether it holds up against an independent benchmark on security workloads is the question to answer before committing.

For Apache Fluss: monitor. The architecture is interesting, Vanlightly's classification is credible, and the Alibaba reference is non-trivial. Multi-organization adoption and Apache graduation are the signals to wait for. Greenfield with Flink expertise is the only context in which I'd consider deploying Fluss in production in 2026, and even there I'd want to see three-plus non-Alibaba references first.

Meta-point

Four philosophies, not four products.

Borisov's framing is the right one, because the streaming engines in this comparison are not interchangeable products with overlapping feature lists but four different bets on how to arrange state, storage, and compute: separation of concerns, hybrid tiered storage, stateless compute over shared storage, and batch-streaming convergence. The right question for a security architect isn't "which is best" so much as "which philosophy fits the team I have, the workload I need to support, and the risk tolerance my organization can defend."

For production security data at scale in early 2026, the answer to that question is still separation of concerns, Kafka plus Iceberg, with the materialization layer paid for and operated as a first-class system. The newer philosophies are worth tracking and worth piloting where the fit is right, and they're worth pressuring vendors on for the missing security-specific benchmarks, but they aren't yet a default to bet a production SOC on.