Security Data Works

Technology deep-dive

NATS JetStream: the lightweight Kafka alternative, and why durability disqualifies it for security data.

NATS JetStream is impressive at one thing: low-latency, low-footprint streaming with a single binary and sub-second startup. For edge collection, that may be exactly the right shape. For the core of a security data pipeline, the December 2025 Jepsen analysis of v2.12.1 quantified durability failures that I think disqualify it: 14.1% message loss on coordinated power failure, 49.6% with single-bit corruption on one of five nodes, up to 78% of acknowledged writes lost on individual nodes under split-brain conditions, and a default fsync interval that violates the basic Raft invariant of syncing before acknowledgment. A v2.12.5 consumer-loss regression, surfaced in a March 2026 GitHub discussion, then added a new data loss path that Synadia's own tests missed.

Reading time: about 14 minutes. Evidence tiers mixed: the Jepsen analysis is Tier A (independent expert testing, transparent methodology), enterprise adoption claims are Tier B (vendor announcement with named customers), and the security-specific extrapolations are Tier C-D (no documented security production case studies exist). The Jepsen findings are specific to v2.12.1; Synadia's response blog names v2.12.3 fixes around peer-removal and cluster-membership plus contributions to the Jepsen test library, but the core filestore-corruption recovery gap is still open as of May 2026, and no independent post-v2.12.3 Jepsen re-test has been published. Notably, Synadia chose to document the 2-minute deferred-fsync default as a risk in that response rather than change the default.

The Kafka overhead question

Do you really need Kafka?

Jake Thomas asked a provocative question at Data Council 2024: "Do you really need Kafka?" He had processed 7.5 trillion records at Okta using DuckDB for certain workloads where Kafka would have been overkill. The question is fair, and it deserves a more careful answer than a knee-jerk yes.

For security data pipelines processing 1-10 TB/day, the answer is usually "yes, but with operational overhead." Kafka asks for multiple JVM processes (brokers plus ZooKeeper or KRaft), JVM tuning (GC configuration, heap sizing, thread pools), 20-30 second cold starts for broker startup, and the operational learning curve that comes with running a distributed JVM-based system at scale. The "40 YAML files for production deployment" jab is hyperbole, but the underlying complaint is fair.

So the question is fair. Is there a simpler alternative for specific use cases? NATS JetStream is the most credible candidate, and I want to take it seriously before I explain why I would not put it on the core path for security telemetry.

What NATS JetStream actually is

Simplicity as a feature.

NATS is a CNCF incubating project written in Go, with more than 40,000 GitHub stars and real enterprise adoption. JetStream is the persistence layer built on top of NATS's core pub-sub: it adds durability (in principle), message replay, and consumer state to what was previously a fire-and-forget messaging system. Synadia is the commercial entity behind NATS, with named customers including Walmart, Rivian, FinecoBank, Replit, and PowerFlex, and a Series B announcement in March 2025 claiming 200% YoY customer growth.

The operational story is attractive. Where Kafka asks for multiple JVMs plus ZooKeeper or KRaft, NATS is a single binary (nats-server). Where Kafka cold-starts in 20-30 seconds, NATS starts in under a second. Where Kafka needs 4-16 GB minimum memory, NATS typically runs in 1-4 GB. There are no external dependencies and no GC tuning to do, because Go's runtime handles it. Installing it on a dev machine is two commands:

# Install
brew install nats-server  # or download binary

# Start with JetStream enabled
nats-server -js

That's it: one install, one start command, one process, and streaming is enabled. The Go API surface for a publisher/consumer is similarly compact:

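// Connect and create JetStream context (imports: "log", "github.com/nats-io/nats.go")
nc, err := nats.Connect(nats.DefaultURL)
if err != nil {
    log.Fatal(err)
}
defer nc.Drain()
js, err := nc.JetStream()
if err != nil {
    log.Fatal(err)
}

// Publish security event; a nil error means the server acknowledged the publish
if _, err := js.Publish("security.events.auth", eventData); err != nil {
    log.Printf("publish failed: %v", err)
}

// Subscribe with durable consumer (a stream covering the subject must already exist)
_, err = js.Subscribe("security.events.auth", func(msg *nats.Msg) {
    processSecurityEvent(msg.Data)
    msg.Ack()
}, nats.Durable("alert-processor"), nats.ManualAck())
if err != nil {
    log.Fatal(err)
}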

I want to credit this honestly: the developer experience is excellent, and the operational footprint at the edge is a clear improvement over Kafka. If the only thing that mattered was time-to-first-message and resource overhead, NATS JetStream wins this comparison without a serious argument.

Performance shape

Latency wins, throughput loses.

The 2025 Onidel benchmark on standardized infrastructure (4 vCPU, 8GB RAM, NVMe) puts NATS JetStream at roughly 200,000-400,000 messages/second against Kafka's 500,000-1,000,000+. NATS wins on latency (sub-millisecond in memory, 1-5ms persistent) versus Kafka's 10-50ms with batching. Resource requirements scale similarly: NATS asks for 2+ vCPU and 4GB RAM in production where Kafka wants 8+ vCPU and 16GB RAM minimum.

Treat those numbers as directional. The Onidel methodology is not fully transparent and I have not independently benchmarked these workloads, so I cite this as Tier C evidence. The shape of the result (Kafka wins on raw throughput, NATS wins on latency and resource efficiency) is consistent with what you'd expect from the architectural differences, even if the specific numbers may shift on your hardware.

For an edge collection layer where sub-millisecond alerting matters and per-node resource budgets are tight, those tradeoffs may favor NATS. For a core security pipeline where you're sustaining hundreds of thousands of messages per second from EDR, network sensors, and cloud audit trails, Kafka's throughput headroom is the conservative call.
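To see where "hundreds of thousands of messages per second" comes from, it helps to convert daily volume into a message rate. The sketch below is back-of-envelope arithmetic only: the 1 KB average event size and the 3x peak-to-average factor are illustrative assumptions of mine, not measurements, and the function names are hypothetical.

```go
package main

import "fmt"

const secondsPerDay = 86_400

// sustainedRate converts a daily volume (decimal terabytes) into an average
// messages-per-second rate, given an assumed average event size in bytes.
func sustainedRate(tbPerDay, avgEventBytes float64) float64 {
	bytesPerDay := tbPerDay * 1e12
	return bytesPerDay / avgEventBytes / secondsPerDay
}

// peakRate applies an assumed peak-to-average multiplier to a sustained rate.
func peakRate(sustained, peakFactor float64) float64 {
	return sustained * peakFactor
}

func main() {
	const avgEventBytes = 1000.0 // assumption: roughly 1 KB per event
	const peakFactor = 3.0       // assumption: bursts reach 3x the average
	for _, tb := range []float64{1, 10} {
		avg := sustainedRate(tb, avgEventBytes)
		fmt.Printf("%4.0f TB/day: ~%.0f msg/s average, ~%.0f msg/s at peak\n",
			tb, avg, peakRate(avg, peakFactor))
	}
}
```

Under those assumptions, 10 TB/day averages about 116,000 messages/second and peaks near 350,000, which lands inside the range Onidel reports for NATS and leaves little headroom. Smaller events or spikier traffic push the number higher, so plug in your own figures.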

The Jepsen finding

Where the case against NATS for security data starts.

Jepsen, the consultancy run by Kyle Kingsbury, is the industry's gold standard for distributed-systems reliability testing. Jepsen has tested Kafka (multiple times, with documented fix cycles), CockroachDB, MongoDB, Redis, etcd, and dozens of others. The methodology is transparent, the test harness is open source, and the findings are independently reproducible. When Jepsen publishes an analysis, which I treat as Tier A evidence, the results carry more weight than vendor benchmarks or community blog posts, and the December 2025 analysis of NATS v2.12.1 is, in my reading, disqualifying for security data on the core path.

"NATS JetStream promises that once a publish call has been acknowledged, it is 'successfully persisted'. This is not exactly true."

That quote from the Jepsen analysis is the headline, and the quantified findings underneath it are stark: 14.1% message loss (131,418 of 930,005 acknowledged messages) in a coordinated power-failure scenario simulated with LazyFS, 49.6% total message loss (679,153 of 1,367,069) from a single-bit error in a .blk file on one of five nodes, individual nodes losing up to 78% of acknowledged writes under split-brain conditions, and 30-second write loss windows in typical failure scenarios. These are not edge cases dreamed up by an adversarial researcher: power failures happen, disk corruption happens, and network partitions happen. In a security telemetry pipeline, a 14% message loss rate means roughly 14% of the telemetry describing an adversary's activity can vanish without any alert firing. The 78% split-brain figure is the number that ought to be most alarming, because it means a single partition can leave a node blind to its own recent history.

The Jepsen report attributes the losses to four specific issues, each with a GitHub tracking ticket:

  • Lazy fsync (#7564). The default sync_interval is 2 minutes. That means when NATS acknowledges a write to your application, the data may still be sitting in volatile OS page cache for up to 2 minutes. This violates the foundational Raft invariant that you sync to disk before acknowledging. Kafka's defaults are safer here. Synadia's response to the Jepsen analysis documented this default as a risk rather than flipping it; the configuration that produced the 14.1% loss number is still what a fresh install ships with.
  • .blk file corruption (#7549). Minority node corruption cascades to cause majority data loss. A single-bit error in a .blk file on one of five nodes triggered 49.6% total message loss in Jepsen's testing.
  • Snapshot corruption triggering stream deletion (#7556). A corrupted snapshot can trigger complete stream deletion rather than a graceful recovery path. This is the kind of failure mode where the only acceptable behavior is "fail loudly," and the observed behavior is "silently destroy the data."
  • Split-brain from a single OS crash (#7567). A single OS crash plus pause produces divergent replicas. A consensus-based cluster is supposed to tolerate the failure of a minority of its nodes, so one misbehaving machine should not be enough to break agreement.
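The expectation behind the split-brain item is plain majority arithmetic. A minimal sketch, with hypothetical helper names, of how many nodes a majority-quorum cluster needs to commit and how many crashed nodes it is designed to ride out:

```go
package main

import "fmt"

// quorum returns the number of nodes that must agree for a majority decision.
func quorum(n int) int { return n/2 + 1 }

// tolerated returns how many nodes can fail while a majority still remains.
func tolerated(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{3, 5} {
		fmt.Printf("nodes=%d quorum=%d tolerated failures=%d\n", n, quorum(n), tolerated(n))
	}
}
```

On paper, a five-node cluster keeps a majority with two nodes down. That arithmetic only covers clean crash failures; it says nothing about corrupted files or unsynced writes, which is where the findings above bite.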

The lazy-fsync issue is the one I keep returning to. Most consumers of a streaming system assume that "acknowledged" means "persisted to durable storage." That's the contract Raft systems generally offer, the contract Kafka approximates with safe settings (it leans on acknowledged replication to in-sync replicas rather than a per-message fsync), and the contract the rest of the security data stack assumes. A 2-minute default fsync window means that contract is not actually being honored, and the documentation's "successfully persisted" wording glosses over that. The fact that this is the default, not an opt-in performance mode, is what moves this from "tuning concern" to "architectural objection."
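The gap between "acknowledged" and "on disk" is easy to show with a toy model. The sketch below is a single-node teaching model with invented names (toyLog, run); it does not reflect NATS internals or replication, and the periodic sync simply stands in for a timer-driven fsync.

```go
package main

import "fmt"

// toyLog models a write path with a volatile page cache in front of a disk.
type toyLog struct {
	syncBeforeAck bool
	cache         []string // written but not yet synced
	disk          []string // survives a power failure
	acked         []string // what the client was told is persisted
}

// publish writes a message and acknowledges it, syncing first only if configured.
func (l *toyLog) publish(msg string) {
	l.cache = append(l.cache, msg)
	if l.syncBeforeAck {
		l.sync()
	}
	l.acked = append(l.acked, msg)
}

// sync flushes the cache to disk, standing in for fsync.
func (l *toyLog) sync() {
	l.disk = append(l.disk, l.cache...)
	l.cache = nil
}

// powerFailure drops everything that was never synced.
func (l *toyLog) powerFailure() { l.cache = nil }

// lostAcked counts acknowledged messages that are missing from disk.
func (l *toyLog) lostAcked() int {
	onDisk := make(map[string]bool, len(l.disk))
	for _, m := range l.disk {
		onDisk[m] = true
	}
	lost := 0
	for _, m := range l.acked {
		if !onDisk[m] {
			lost++
		}
	}
	return lost
}

// run publishes n messages, syncing after every syncEvery messages
// (0 disables the periodic sync), then simulates a power failure.
func run(syncBeforeAck bool, n, syncEvery int) int {
	l := &toyLog{syncBeforeAck: syncBeforeAck}
	for i := 1; i <= n; i++ {
		l.publish(fmt.Sprintf("event-%d", i))
		if syncEvery > 0 && i%syncEvery == 0 {
			l.sync()
		}
	}
	l.powerFailure()
	return l.lostAcked()
}

func main() {
	fmt.Println("deferred sync, acked but lost:", run(false, 10, 4))
	fmt.Println("sync before ack, acked but lost:", run(true, 10, 4))
}
```

With deferred sync, the two messages acknowledged after the last periodic flush are gone after the crash; with sync-before-ack, none are. The real trade-off is throughput, since every acknowledgment then waits on the disk.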

The v2.12.5 regression

Three months later, a new consumer-loss path.

The Jepsen findings on v2.12.1 would be concerning in isolation, but Synadia responded with what they described as a sweep of durability fixes in PRs #7565 through #7610, and v2.12.3 was the version meant to incorporate them. The case for continued evaluation of NATS for non-core paths would have rested on those fixes holding up.

Three months after the Jepsen analysis, GitHub Discussion #7967 (opened 2026-03-18) surfaced a new data loss path in v2.12.5: an async meta-layer snapshot combined with asset updates and a server restart could silently drop consumers in clustered JetStream. The reporter (a production user) described the impact as material. Synadia provided a workaround flag rather than yanking the release and added a warning to the v2.12.5 release notes.

The shape of this should give a security architect serious pause. A discussion-tracked regression on a version that was supposed to incorporate Jepsen-prompted fixes, with the mitigation being an after-the-fact flag and release-note warning rather than a default change, is the kind of pattern where internal test coverage and external durability expectations have not converged. Synadia has cited "2+ years of Antithesis testing" (GitHub #3312) as evidence of internal verification, but no public results from that testing have been published, and no independent post-v2.12.3 Jepsen re-test exists at the time of writing. v2.12.7 (April 2026) fixed an MQTT auth regression unrelated to durability; the filestore-corruption recovery gap from the Jepsen report remains open.

I want to be fair here: it is plausible (likely, even) that NATS has patched some specific issues identified by Jepsen since the analysis was published. Synadia is an active commercial project with real engineering investment, and the Jepsen analysis itself is a roadmap for what to fix. The honest framing is that the durability evidence base is one independent analysis with 14% loss findings, plus one subsequent production regression on a version meant to address that analysis, plus vendor claims of additional fixes without independent verification. That evidence base is not where I would want it to be before putting security telemetry behind it.

Durability confidence

The gap with Kafka is a category difference.

Kafka is not perfect. The Kafka deep-dive I wrote separately catalogs its own operational complexity, resource overhead, and learning curve, and I would not pretend any of those are zero costs. But on the single dimension that matters most for security data ("did we lose the event?") Kafka has a decade of independent verification, multiple Jepsen analyses (2013, 2018, 2020) with documented fix cycles and subsequent re-verification, peer-reviewed research, and extensive multi-organization production deployment evidence at LinkedIn (1 trillion+ messages/day), Netflix, Uber, and Cloudflare.

NATS JetStream's durability evidence base sits in a different category. A single Jepsen analysis (v2.12.1) with findings that include 14.1% loss on coordinated power failure, 49.6% loss from a single-bit error on one of five nodes, and up to 78% acknowledged-write loss on individual nodes under split-brain. A "NATS-optimized" Raft implementation with no formal proof and no TLA+ specification published. A 2-minute default fsync interval that violates sync-before-acknowledgment and that Synadia has chosen to document rather than change. No independent post-v2.12.3 Jepsen re-test. A v2.12.5 consumer-loss regression (Discussion #7967) on a version meant to incorporate Jepsen-prompted fixes. Single-organization production validation (Synadia Cloud) without the multi-org, peer-published track record Kafka has accumulated.

These are not symmetric arguments. A security architect evaluating a streaming layer for the core path is making a multi-year commitment to a system whose failure modes will determine whether incident response and compliance reporting are reliable. The asymmetry of consequences (a missed event versus an unnecessary engineering investment) argues for the system with the longer independent verification track record, even at the cost of higher operational overhead at deployment time.

Operational caveats

And other rough edges.

Beyond the headline Jepsen findings, the NATS JetStream operations surface has additional rough edges that matter at security scale:

  • OOMKill during catchup. Replicas that fall behind have no rate limiters, so a node that needs to catch up can run itself out of memory and crash. This is the kind of failure mode that cascades. Losing a replica during recovery is exactly when you want recovery to be conservative.
  • HA asset limits. Global clusters are limited to roughly 2,000 highly-available assets. That ceiling may or may not be a real constraint for a given security deployment, but it's worth confirming against your stream-count expectations.
  • Storage requirements. Local SSD is required; NAS and NFS backing stores are explicitly discouraged. That's not unreasonable, but it's a deployment constraint that may not match existing infrastructure assumptions.
  • API limits. A 10,000 inflight request cap (v2.10.21+) is high enough not to bite most workloads, but worth knowing for high-fan-out consumer patterns.

None of these are individually disqualifying. Stacked on top of the durability concerns, they contribute to a picture of a system whose operational sharp edges have not yet been worn smooth by the kind of years-long, multi-organization production pressure that Kafka has been through.

If you must use it

Mitigations for the narrow path where NATS may still fit.

If you have a use case where NATS JetStream's operational simplicity earns its place (I describe one candidate below) there is a configuration discipline worth following. The minimum config changes I would not deploy without:

# nats-server.conf - REQUIRED for durability
jetstream {
  store_dir: "/var/nats/jetstream"
  max_memory_store: 4GB
  sync_interval: always  # Default is 2 minutes (dangerous)
}

Beyond sync_interval: always, the additional disciplines I'd insist on for any deployment carrying security-relevant data:

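  • R=5 replication rather than R=3. A JetStream stream is created with a single replica unless you ask for more, so the replica count has to be set explicitly. Higher replication doesn't fix the underlying minority-node corruption bug (#7549), and since Jepsen reproduced that cascade on a five-node cluster, R=5 should be treated as extra margin against ordinary node failures rather than as a fix.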
  • Application-level durability checks. Checksums on every message, sequence validation on every consumer, dead-letter queues for messages that fail validation. Treat the transport as potentially lossy and prove durability at the application layer.
  • Treat NATS as transport, not as store. Forward to Kafka or directly into Iceberg for durable retention as quickly as possible. Don't rely on NATS for the system of record.
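The application-level checks in the list above can be sketched without any broker at all. This is a minimal illustration under my own assumptions: the envelope format, the producer-assigned sequence numbers, and every name here are hypothetical, and a real consumer would also have to persist its position.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// envelope is a hypothetical wire format: the producer assigns a per-stream
// sequence number and a SHA-256 digest of the payload before publishing.
type envelope struct {
	Seq     uint64
	Payload []byte
	Digest  string // hex-encoded SHA-256 of Payload
}

// seal builds an envelope on the producer side.
func seal(seq uint64, payload []byte) envelope {
	sum := sha256.Sum256(payload)
	return envelope{Seq: seq, Payload: payload, Digest: hex.EncodeToString(sum[:])}
}

// validator tracks the next expected sequence number and collects problems.
type validator struct {
	next       uint64     // next sequence number we expect to see
	missing    []uint64   // sequence numbers the transport skipped
	deadLetter []envelope // failed the checksum, or arrived behind next
}

// check validates one envelope and reports whether it is safe to process.
// A corrupted message is dead-lettered without advancing next, so its
// sequence number is later reported as missing and can be re-fetched.
func (v *validator) check(e envelope) bool {
	sum := sha256.Sum256(e.Payload)
	if hex.EncodeToString(sum[:]) != e.Digest {
		v.deadLetter = append(v.deadLetter, e)
		return false
	}
	if e.Seq < v.next { // duplicate or replayed message
		v.deadLetter = append(v.deadLetter, e)
		return false
	}
	for s := v.next; s < e.Seq; s++ { // record every skipped sequence number
		v.missing = append(v.missing, s)
	}
	v.next = e.Seq + 1
	return true
}

func main() {
	v := &validator{next: 1}
	corrupted := seal(3, []byte("login failed"))
	corrupted.Payload[0] ^= 0x01 // simulate a single-bit flip
	stream := []envelope{
		seal(1, []byte("login ok")),
		seal(2, []byte("sudo")),
		corrupted,
		// sequence 4 never arrives
		seal(5, []byte("logout")),
	}
	for _, e := range stream {
		v.check(e)
	}
	fmt.Println("missing:", v.missing, "dead-lettered:", len(v.deadLetter))
}
```

The point of the design is that loss becomes visible: a gap or a bad digest produces a record someone can alert on, instead of the silence the Jepsen scenarios describe.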

I want to flag honestly: even with sync_interval: always and R=5, the .blk corruption and snapshot deletion issues (#7549, #7556) are not configuration problems. They were implementation bugs at the time the Jepsen analysis was published. Whether they remain unfixed in current releases is the kind of question that ought to be answered by independent re-verification before any architect bets on it.

Use case fit

Where NATS may earn its keep, and where it shouldn't.

The honest version of "when to use NATS JetStream" has two columns. The first is the narrow set of patterns where its operational simplicity may justify the durability risk, given the right mitigations. The second is the set of patterns where the durability risk is disqualifying regardless of configuration.

May be acceptable with mitigations

  • Edge preprocessing. Lightweight, low-resource collection at sensor sites where Kafka's footprint is impractical and the next hop is a durable store. Mitigations: sync_interval: always, R=5, forward to Kafka or Iceberg immediately, accept that the edge node is not the system of record.
  • IoT telemetry collection. High connection count, low latency, broadly the same shape as edge preprocessing. Application-level sequence validation matters even more when the upstream sources are themselves unreliable.
  • Development and testing environments. Fast iteration, low setup cost, data loss is tolerable because the data is synthetic.
  • Ephemeral real-time alerting. Sub-millisecond pub-sub for alert distribution, where the alerts also persist via a separate durable path. NATS as fast notification, Kafka or Iceberg as the audit record.
  • Microservice request-reply patterns. NATS supports request-reply alongside pub-sub in a single system, which is occasionally useful. Don't rely on JetStream persistence for state in this context.

Disqualified

  • Core security data pipelines. A 14% message loss rate translates directly into a 14% detection gap, and the compliance-adjacent retention obligations that sit on top of this only compound the consequences.
  • Financial transaction processing. The v2.12.5 regression hit payment workflows directly, so this is not a hypothetical concern.
  • Compliance-driven retention. PCI-DSS, SOC 2, NIST frameworks expect auditable data integrity. A "may have lost 14% of events" disclosure is not the kind of finding an audit goes well after.
  • Petabyte-scale retention. Kafka is designed for terabytes-per-week retention with tiered storage; that's not what NATS is shaped for.
  • Complex streaming ecosystems. Kafka Connect's 100+ connectors, ksqlDB, Schema Registry: none of these have NATS equivalents at comparable maturity.
  • Iceberg integration. Kafka-to-Iceberg connectors are mature and battle-tested. Equivalent NATS-to-Iceberg paths are not.

Hybrid architecture

NATS at the edge, Kafka at the core.

The pattern that makes the most architectural sense, if you want NATS in the picture at all, is to use it where its strengths line up (at the edge) and to use Kafka where durability is the binding constraint (at the core).

Reference pattern · edge-to-lakehouse (Tier D)

Figure: Edge-to-lakehouse, two tiers. Sensors / Agents / IoT → NATS edge collector (low latency, low footprint) → Kafka core (high durability, mature ecosystem) → Iceberg lakehouse (analytics, retention, compliance).
A NATS edge collector feeding a Kafka core feeding an Iceberg lakehouse. This is a Tier-D hypothesis: I have not seen a named security operation publish this exact pattern; validate on a lab benchmark before committing.

The edge layer collects from thousands of agents and sensors, pre-aggregates and filters at the collection point, and offers sub-millisecond latency for any alerting that needs to fire at the edge. The core layer provides durable storage for compliance, mature Iceberg connectors, proven petabyte-scale behavior, and the broader Kafka ecosystem (Connect, Schema Registry, ksqlDB) that security pipelines tend to grow into.

This is an informed hypothesis rather than a documented production pattern in security contexts; I have not yet seen a named security operation publicly describe deploying NATS-edge plus Kafka-core in production. It's Tier D evidence at best, and it's the kind of architecture I'd want to validate on a lab benchmark with security-shaped workloads before committing to it for a customer.
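One detail of the edge-to-core hop worth pinning down is ordering: the edge message should be acknowledged only after the core has confirmed the write. A minimal sketch of that rule follows; coreWriter, edgeMsg, and fakeCore are hypothetical stand-ins for illustration, not types from any NATS or Kafka client.

```go
package main

import (
	"errors"
	"fmt"
)

// coreWriter is a hypothetical interface over the durable core, for example
// a Kafka producer configured to wait for full acknowledgment.
type coreWriter interface {
	WriteDurably(event []byte) error
}

// edgeMsg is a hypothetical stand-in for an edge-broker message with
// explicit ack and negative-ack.
type edgeMsg struct {
	Data  []byte
	acked bool
	naked bool
}

func (m *edgeMsg) Ack() { m.acked = true }
func (m *edgeMsg) Nak() { m.naked = true }

// forward acknowledges the edge message only after the core confirms the
// write, so a failed core write leaves the message eligible for redelivery.
func forward(core coreWriter, m *edgeMsg) error {
	if err := core.WriteDurably(m.Data); err != nil {
		m.Nak()
		return fmt.Errorf("core write failed, requested redelivery: %w", err)
	}
	m.Ack()
	return nil
}

// fakeCore is a test double that rejects writes while down is true.
type fakeCore struct {
	down   bool
	stored [][]byte
}

func (c *fakeCore) WriteDurably(event []byte) error {
	if c.down {
		return errors.New("core unavailable")
	}
	c.stored = append(c.stored, event)
	return nil
}

func main() {
	core := &fakeCore{down: true}
	m := &edgeMsg{Data: []byte("auth event")}
	err := forward(core, m)
	fmt.Println("first attempt:", err, "acked:", m.acked)

	core.down = false
	retry := &edgeMsg{Data: m.Data}
	err = forward(core, retry)
	fmt.Println("redelivery:", err, "acked:", retry.acked)
}
```

This gives at-least-once delivery into the core, so duplicates are possible and the core side needs to deduplicate. It also does nothing for events the edge node loses before forwarding, which is why the hop should be as short-lived as possible.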

Comparison set

Other Kafka alternatives in the same conversation.

NATS JetStream is not the only Kafka alternative a security architect may evaluate. The shape of the competitive landscape:

  • NATS JetStream: lightweight streaming. Best for edge, IoT, microservice messaging. Durability concerns documented above.
  • Redpanda: Kafka-compatible wire protocol, written in C++ for lower latency and smaller resource footprint. The drop-in Kafka replacement story is the strongest claim in this set for security-data-shaped workloads.
  • Apache Pulsar: multi-tenant, tiered storage natively, cloud-native shape. Worth looking at for organizations whose constraints are governance and multi-region rather than raw performance.
  • Apache Fluss: lakehouse-native streaming designed to land directly in Iceberg. Earlier in its production trajectory but architecturally aligned with the Iceberg-first lakehouse pattern.

For security data at scale, Redpanda or Kafka remain the conservative choices for the core path. NATS may excel at the edge under the right configuration discipline. Pulsar and Fluss are worth tracking but have not yet accumulated the kind of independent durability verification record that would recommend them for the core today.

Decision framework

When NATS may be appropriate, and when to use Kafka.

NATS JetStream may be appropriate if

  • You are building an edge-only collection layer, not a core pipeline.
  • Loss of individual messages is tolerable for the use case, with durability proven at a downstream layer.
  • You configure sync_interval: always and R=5 replication.
  • You forward all data to a durable store (Kafka, Iceberg) immediately rather than relying on NATS retention.
  • Your per-stream volume is under roughly 100 GB/day.
  • Sub-millisecond latency genuinely justifies accepting the durability risk profile.
  • You do not have PCI-DSS, SOC 2, or NIST audit obligations on the data path NATS carries.

Use Kafka if

  • Any data loss compromises detection coverage, compliance reporting, or financial integrity.
  • You sustain more than 1 TB/day through a single pipeline.
  • You need the Kafka Connect ecosystem (100+ connectors with security-relevant integrations).
  • You need exactly-once semantics with transactions.
  • You're building lakehouse integration, where Kafka-to-Iceberg connectors are the mature path.
  • Your team already operates Kafka and the operational learning curve is sunk cost.
  • You need independently verified durability, with multi-Jepsen-cycle history and multi-organization production validation.
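The two checklists can be encoded as a single conservative rule: NATS is a candidate only if every precondition holds and no Kafka trigger fires. The sketch below is my own encoding of a subset of the lists above, with hypothetical field names and the rough thresholds quoted in the text; it is a conversation aid, not a substitute for a lab evaluation.

```go
package main

import "fmt"

// workload captures the checklist inputs. Field names are hypothetical.
type workload struct {
	EdgeOnly            bool
	LossTolerable       bool
	SyncAlwaysR5        bool
	ForwardsImmediately bool
	NeedsSubMsLatency   bool
	AuditObligations    bool
	NeedsExactlyOnce    bool
	PerStreamGBPerDay   float64
	PipelineTBPerDay    float64
}

// recommend returns "nats-edge-candidate" only when every NATS precondition
// holds and no Kafka trigger fires; otherwise it returns "kafka" plus the
// reasons that ruled NATS out.
func recommend(w workload) (string, []string) {
	var reasons []string
	add := func(failed bool, why string) {
		if failed {
			reasons = append(reasons, why)
		}
	}
	add(!w.EdgeOnly, "not an edge-only collection layer")
	add(!w.LossTolerable, "message loss is not tolerable")
	add(!w.SyncAlwaysR5, "sync_interval: always and R=5 not configured")
	add(!w.ForwardsImmediately, "data is not forwarded to a durable store immediately")
	add(!w.NeedsSubMsLatency, "latency does not justify the durability risk")
	add(w.AuditObligations, "PCI-DSS, SOC 2, or NIST obligations on the data path")
	add(w.NeedsExactlyOnce, "exactly-once semantics required")
	add(w.PerStreamGBPerDay >= 100, "per-stream volume at or above roughly 100 GB/day")
	add(w.PipelineTBPerDay > 1, "more than 1 TB/day through a single pipeline")
	if len(reasons) > 0 {
		return "kafka", reasons
	}
	return "nats-edge-candidate", nil
}

func main() {
	edge := workload{EdgeOnly: true, LossTolerable: true, SyncAlwaysR5: true,
		ForwardsImmediately: true, NeedsSubMsLatency: true,
		PerStreamGBPerDay: 20, PipelineTBPerDay: 0.2}
	choice, _ := recommend(edge)
	fmt.Println("edge collector:", choice)

	edge.AuditObligations = true
	choice, why := recommend(edge)
	fmt.Println("same workload under audit:", choice, why)
}
```

The rule is deliberately asymmetric: a single failed condition sends the workload to Kafka, which matches the asymmetry of consequences argued earlier.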

For most security operations evaluating a streaming layer today, the answer is Kafka at the core and a careful evaluation of NATS or Redpanda at the edge if specific deployment constraints make a single-binary, low-footprint collector necessary. I would not currently recommend NATS JetStream as a Kafka replacement for the core security pipeline.

Conclusion

Simplicity does not compensate for losing the event.

I drafted an earlier version of this analysis with a softer framing: "lightweight alternative with caveats." The Jepsen evidence forced a harder conclusion, and I'd rather be direct about that than hedge into vagueness.

NATS JetStream offers operational simplicity: a single binary, sub-second startup, minimal resources, and a clean Go API. These are valuable, and I credit them. But the December 2025 Jepsen v2.12.1 analysis did not find minor edge-case issues. It found 14.1% message loss on coordinated power failure, 49.6% loss from single-bit corruption on one of five nodes, up to 78% acknowledged-write loss on individual nodes under split-brain, and a default fsync interval that violates the most basic Raft invariant about syncing before acknowledgment. Three months later, GitHub Discussion #7967 surfaced a new consumer-loss regression in v2.12.5 (a version that was meant to incorporate the fixes prompted by Jepsen), which Synadia's own tests missed.

For a core security data pipeline, where losing the event means losing detection coverage and potentially failing compliance obligations, the durability evidence as it stands today disqualifies NATS JetStream from the core path. A 14% message loss rate in a security pipeline means roughly 14% of an adversary's activity goes silently undetected, and the asymmetry of consequences argues for the system with the longer independent verification track record.

There is a narrow path where NATS may still be the right tool, which is edge-only collection, with sync_interval: always, R=5 replication, application-level checksums and sequence validation, and immediate forwarding to a durable store. Even there, you are accepting that the .blk corruption and snapshot deletion bugs (#7549, #7556) may not be fully resolved in current releases, and you are betting on internal vendor testing rather than independent verification.

The pattern that holds up is to collect fast at the edge if your deployment shape genuinely demands it, and to store durably at the core where the evidence base is mature. As one commenter framed it, "if Kafka is a semi-truck, NATS JetStream is a rocket-powered motorcycle." The motorcycle is fast and nimble but, given the current evidence, it carries a non-trivial chance of dropping cargo on the trip, and for security data that's not the vehicle I'd choose for the load.