Technology deep-dive

Kafka architecture deep-dive for security data.

Apache Kafka shows up in nearly every security pipeline I evaluate, usually inherited from a data engineering team that picked the broker count, partition count, and replication factor without consulting security. Then lag spikes drop alerts, one partition takes 80% of the load, or a broker fails and events disappear. This essay walks through what topics, partitions, consumer groups, and replication actually do (the way I wish someone had explained it to me before I owned a Kafka cluster) and where partition strategy for security workloads tends to break.

Reading time: about 17 minutes. Evidence tier: B overall (Apache Kafka documentation plus published engineering blogs from Netflix, LinkedIn, Confluent), with directional throughput and cost numbers flagged for verification against your own workload.

TL;DR

If you only read one thing

Kafka shows up in nearly every security pipeline I evaluate, usually inherited from a data engineering team that picked broker count, partition count, and replication factor without consulting security. Then it breaks.
The hot partition is the most common failure mode. One partition takes 80% of load because the partition key was chosen for distribution, not for the security workload's access pattern. Lag spikes drop alerts.
Four concepts you have to understand: topics (logical streams), partitions (the unit of parallelism + ordering), consumer groups (the scaling abstraction), replication (the durability story). The defaults bite.
Retention misunderstood = compliance gap. Time-based vs. size-based retention behave differently under burst load; security teams need to know which is active and why.
When Kafka isn't the right tool. Lighter-weight options (NATS JetStream, Redpanda, Apache Pulsar) cost less to operate; pick by required ordering semantics, retention scale, and team expertise, not by which logo your data engineering team puts on slides.

Why security teams need this

Inheriting a Kafka cluster is not the same as understanding it.

The pattern I see most often: a data engineering team stood up a three-broker Kafka cluster with twelve partitions per topic and replication factor three, and the security team inherited it as a fait accompli. The configuration may be defensible, or it may be wildly mis-sized for security telemetry volume. Without understanding the why behind each setting, there's no way to tell which.

Problems then emerge in predictable shapes. Consumer lag spikes during peak traffic and detections arrive two hours late. One partition takes a disproportionate fraction of the events, usually because the partition key happens to map heavily to a single value, like an admin service account or a noisy host. A broker fails and events disappear because replication was set to one. Or the cluster is over-provisioned and costs more than the SIEM it feeds.

This essay walks through the architecture so security architects can size a Kafka cluster deliberately, design a partition strategy that avoids hotspots for security-shaped workloads, set replication for durability without paying three times for everything, and debug consumer lag before it becomes a detection failure. It's written for the architect making the buy/build/keep decision on Kafka as the routing spine of the pipeline, not for the engineer tuning JVM heap settings.

Fundamentals

Topics, partitions, consumer groups, offsets.

Topics: append-only event streams

A Kafka topic is a named event stream: a category or feed. In a security pipeline, the topics tend to map cleanly to event sources: cloudtrail-events for AWS CloudTrail logs, edr-telemetry for endpoint detection events, firewall-logs for network firewall events, saas-audit-logs for SaaS application audit trails. Producers write to topics; consumers read from them.

The structural property that matters most: topics are append-only commit logs. Events written to a topic are immutable. They can be appended, never modified, and Kafka retains them for a configured period (commonly seven days, sometimes thirty, occasionally forever). That immutability is what makes Kafka usable as a replay buffer for security investigations months after the original event landed.

Partitions: the parallelism unit

A single topic on a single broker can't absorb millions of events per second. The fix is to split the topic into partitions: independent ordered logs that can live on different brokers and be consumed in parallel. Twelve partitions on a topic means up to twelve consumers can read from it simultaneously, each one owning a slice.

How an event lands in a specific partition depends on the producer's choice of key. When the producer sends an event with a key (say user_id, src_ip, or endpoint_id), Kafka hashes the key and takes the result modulo the partition count. That formula guarantees two useful properties at once. Events with the same key always go to the same partition, which means ordering is preserved within that key (every event for a given user, in the order it happened). And events with different keys are spread across partitions, which provides the parallelism.

The trade-off matters. More partitions means more parallelism and higher throughput, but each partition costs file descriptors, memory, and metadata overhead on every broker. The Apache Kafka documentation has historically suggested a soft ceiling around 4,000 partitions per broker and 200,000 partitions per cluster as a practical limit before metadata operations slow down. These are guidance numbers, not hard caps, and they may move as KRaft (which replaced ZooKeeper for metadata) matures. Verify against your Kafka version before sizing close to them.

Consumer groups: load balancing without duplication

Most security pipelines need the same event stream fed to multiple destinations: the SIEM for real-time alerting, the data lake for historical hunting, and a real-time analytics database like ClickHouse for dashboards. Kafka handles this through consumer groups, which coordinate partition assignment so each partition is consumed by exactly one member of the group.

Three consumers in a single group named data-lake-writer, reading a topic with twelve partitions, will each be assigned four partitions. If one consumer dies, Kafka rebalances and the survivors split the dead consumer's partitions between them. If a new consumer joins, Kafka rebalances again to spread load. Within a group, no event is processed twice.

Multiple consumer groups read independently. A second group named splunk-ingest can read every partition of the same topic without knowing or caring that data-lake-writer exists. That fan-out is the architectural reason Kafka sits at the center of so many security pipelines: one event stream, multiple independent destinations, no duplication of the source data on disk.

Offsets: the consumer's bookmark

Within a partition, every event has a monotonically increasing offset. Consumers commit the highest offset they've successfully processed back to Kafka. On restart, they resume from the next offset rather than re-reading the entire topic from the start.

The commit strategy choice carries weight for security pipelines. Auto-commit, the default, commits every five seconds in the background, which is fast, but if the consumer crashes between commits you lose or duplicate up to five seconds of events. Manual commit, after every successful processing step, gives exactly-once semantics at the cost of throughput, while batch commit every thousand events is the common compromise between the two.

For security workloads I default to manual or batch commit, because auto-commit on a critical-alert pipeline is the kind of decision that produces a stack trace on the second-worst day of your on-call rotation, and the throughput cost you pay to avoid that is rarely the binding constraint when alert reliability is on the line.

Throughput and latency

What a Kafka cluster can actually carry.

A three-broker Kafka cluster with replication factor three can typically sustain 150–300 MB/sec aggregate write throughput, which translates to roughly 1.5–3 million events per second for 100-byte events. Single-broker write throughput sits around 50–100 MB/sec under realistic security workloads with replication and serialization overhead. Confluent's published benchmarks report higher numbers (605 MB/sec producer, 800 MB/sec consumer on a single broker under optimal conditions), but those numbers assume tuned hardware, minimal replication, and zero serialization tax. Real security workloads (JSON or Protobuf payloads, replication factor three, broker-side compression) usually land at the lower end.

End-to-end latency, producer through broker to consumer, runs in three rough tiers. With acks=1 (producer waits for the leader broker to acknowledge but not for replicas), two to five milliseconds is achievable. With acks=all (producer waits for every in-sync replica), ten to fifty milliseconds is more typical. The 99th-percentile latency, the number that matters for alerting SLAs, may push to 100–200 milliseconds under network congestion or disk I/O contention.

For most security alerting pipelines, ten to fifty milliseconds Kafka latency is invisible against the rest of the budget; the SIEM is taking seconds to evaluate the rule. But for detection pipelines targeting the breakout-time numbers reported in the 2026 CrowdStrike Global Threat Report (as fast as 27 seconds for the quickest recorded intrusion), the cumulative latency of every intermediate hop starts to matter. Kafka isn't usually the bottleneck (the downstream stream processor or SIEM is), but it's worth keeping the budget honest.

For batch analytics (hourly aggregations, daily reports, retrospective hunting) Kafka latency is irrelevant, so a five-minute lag against the same topic is fine when the consumer is a micro-batch Spark job feeding an Iceberg table for offline analysis rather than a real-time detector, and the reason Kafka works as a fan-out backbone is that one topic can serve both of those budgets at once.

Delivery guarantees

At-most-once, at-least-once, exactly-once.

Kafka offers three delivery semantics, with steepening trade-offs. At-most-once fires the event and moves on; if the broker crashes before writing, the event is lost. At-least-once waits for broker acknowledgement and retries on failure, which means an acknowledged-but-lost ack will cause the producer to retry and produce a duplicate. Exactly-once, introduced in Kafka 0.11 via KIP-98 and extended in KIP-447, uses idempotent producers and transactional writes to guarantee neither loss nor duplication.

Exactly-once is not free. Confluent's published numbers suggest roughly a 20–30% throughput reduction versus at-least-once delivery, with the overhead varying by batch size and replication factor. Older guidance cited up to 50% overhead; the current number depends on Kafka version and configuration, so I'd treat the 20–30% as directional and verify against your own workload before committing.

For security pipelines, the choice is per-topic, not per-cluster. Compliance-critical events (audit logs that must be defensible in regulatory review, financial transaction logs feeding fraud detection) may justify exactly-once. High-volume telemetry (EDR events, firewall logs, DNS queries) usually doesn't, because the downstream data lake or SIEM can deduplicate on event ID cheaper than Kafka can guarantee no-duplicates upstream.

At-most-once is rarely the right call for security data, because the throughput win is small while the forensic cost of a silently dropped event is unbounded.

Sizing

How big a Kafka cluster do I actually need?

Three inputs drive Kafka sizing for a security pipeline: peak event throughput, retention window, and replication factor. The arithmetic is straightforward once you commit to numbers.

A worked example. Peak throughput is 200,000 events per second at an average of 500 bytes per event, which is 100 MB/sec sustained. Retention is seven days. Replication factor is three for durability. Daily data volume is 100 MB/sec × 86,400 seconds, which is 8.64 TB/day. Seven days of retention is roughly 60.5 TB. Replicated three ways, that's about 181 TB of total storage across the cluster.

A three-broker cluster needs 60 TB per broker. On AWS, that may look like r5.4xlarge instances (sixteen vCPU, 128 GB RAM) with 64 TB EBS gp3 attached per broker. Instance cost is around $1,000 per broker per month; storage at $0.08/GB-month for gp3 is roughly $5,100 per broker per month. Total cluster cost lands in the $18K/month range, and that number is sensitive enough to instance family, EBS tier, and AWS region that I'd treat it as a starting point for a real quote, not a price.

Tiered storage (KIP-405, generally available in Kafka 3.6+) changes the math meaningfully. Hot data (the last day or two) stays on broker-local NVMe for low-latency consumer reads. Older data offloads to S3, which is cheaper per gigabyte and removes broker storage as the scaling constraint. For a seven-day retention window, keeping one day hot and six days in S3 may cut total cluster cost by roughly 30–35%. The trade-off is operational complexity (one more system path to monitor) and a small latency penalty when consumers replay old data from S3.

For most security pipelines I evaluate, the right starting point is a three-broker cluster sized for one week of hot retention, with tiered storage enabled for anything older. The Confluent-managed equivalent (Confluent Cloud) or alternatives like Aiven for Apache Kafka or Redpanda's BYOC offering trade per-byte cost for operational simplicity. Worth comparing if your team doesn't already have Kafka operations expertise on staff.

Partition strategy

Where security workloads break the default partition strategy.

This is the section where Kafka guides written for data-engineering audiences and Kafka guides written for security teams diverge. The standard rule of thumb (partitions equal target throughput divided by single-partition throughput) gets you to a reasonable count, but it doesn't tell you how to pick the partition key, and security telemetry has structural properties that punish the wrong choice.

High-cardinality keys are good. Low-cardinality keys cause hotspots.

The textbook partition keys for security data are user ID (for authentication logs), source IP (for network and firewall logs), and endpoint ID (for EDR telemetry). Each preserves ordering within the entity that matters for detection: same user's failed logins in sequence, same IP's port-scan probes in order, same endpoint's process tree intact.

The failure mode shows up when the key distribution is skewed. Authentication logs partitioned by user_id may have one service account or one CI/CD pipeline user generating 60% of all events, so that user's partition gets six times the load of the others and the consumer assigned to it falls behind, which means detection lag against that user (often the most security-relevant user) becomes the longest in the system.

Source IP has the same failure mode in different clothing. A handful of internal load balancers, proxies, or NAT gateways may emit a disproportionate fraction of network events. Partition by src_ip and those few IPs drown out everyone else. Endpoint ID is usually less skewed because endpoints are humans-shaped distributions, but a noisy server endpoint with an unbounded process-spawn loop can still create a hot partition for hours.

Composite keys may help, with caveats

The pattern I've seen used to spread hotspots is a composite key, for example user_id concatenated with a time bucket like floor(timestamp / 60s). That preserves per-minute ordering for a given user (usually enough for detection rules that look for bursts of activity) while spreading a heavy user's events across multiple partitions over time.

The caveat is that you lose strict cross-minute ordering. A detection rule that needs to see all failed logins for a user in order across an hour may have to re-sort downstream, which adds latency and complexity. For burst-detection rules (five failed logins in sixty seconds) the per-minute bucket is fine. For longer windows, the composite key may cause more pain than it saves.

Another option is random round-robin partitioning (no key), which maximizes parallelism and eliminates hotspots entirely at the cost of losing the per-entity ordering guarantees. That's the right choice when downstream processing is order-independent (write-into-Iceberg jobs, archive dumps to S3) and the wrong choice for stream-processing detection, so like the other partition questions it gets settled per-topic rather than once for the whole cluster.

Partition count: how many is enough?

Single-partition throughput on the consumer side is usually limited by consumer processing speed, not Kafka itself. A Python consumer that does ten milliseconds of per-event work tops out around a hundred events per second per partition; a Rust or Go consumer doing simple shape-and-forward work may handle five to ten thousand. So partition count depends as much on what the consumers do as on what the producers emit.

For EDR telemetry at 100,000 events per second with a consumer that processes 2,000 events per second per thread, you need at least fifty partitions, plus a margin on top of that for rebalances and consumer failures. Forty-eight partitions is a common starting number for heavy security topics (it's also a friendly divisor: works for 1, 2, 3, 4, 6, 8, 12, 16, 24, or 48 consumers). Twelve partitions is a common starting number for moderate topics like CloudTrail, and six is the usual starting number for low-volume topics like SaaS audit logs.

Over-partitioning is a softer mistake than under-partitioning. Increasing partition count later requires either creating a new topic (which means rewriting producers and consumers) or accepting that newly created partitions break key-based ordering for existing keys. Start with more partitions than you think you need, within the soft limit guidance above.

Replication and durability

What replication factor and acks actually buy you.

Replication factor is the number of copies of each partition stored across brokers. Replication factor one is no replication; if the broker dies, the partition is gone. Factor two survives one broker failure. Factor three, the conventional default, survives two simultaneous broker failures.

Replication factor three is the right default for security logs that must be defensible in audit or compliance review. The cost is 3x storage, but the durability case is easier to make to a regulator than a forensics tool. For high-volume telemetry where individual event loss is acceptable in exchange for cost reduction (uncommon, but real for some EDR archive topics), factor two may be defensible.

The producer's acks setting is where the durability decision gets made. acks=0 means fire-and-forget, which is the fastest and lowest-durability option and isn't recommended for security. acks=1 waits for the leader broker to write to its local disk before acknowledging, which is fast, but if the leader dies before replicating to followers the event is lost. acks=all (sometimes written acks=-1) waits for every in-sync replica to write before acknowledging, so it's the slowest and highest-durability setting and the right default for compliance-grade security topics.

The In-Sync Replica (ISR) concept matters here. A replica is "in sync" if it has caught up to the leader within the configured replica lag threshold. acks=all waits for every ISR, not every replica; if a follower has fallen behind and dropped out of the ISR set, the producer doesn't wait for it. The min.insync.replicas setting controls the floor: if the ISR set shrinks below it, the producer gets an error rather than a silent loss of durability. For replication factor three, min.insync.replicas=2 is the conventional pairing: survives one failure, refuses writes if two replicas drop, never silently degrades durability.

For security pipelines, the combination I default to is replication factor three with min.insync.replicas=2, then acks=all on critical alert topics and acks=1 on bulk telemetry where the latency win is worth the slightly weaker guarantee, which is again a choice I make per-topic rather than once across the cluster.

Retention per topic

Retention is a per-topic decision, not a cluster default.

The default Kafka retention is seven days, configured via log.retention.hours at the broker or retention.ms at the topic level. For security pipelines, the topic-level override is the one that matters; different event types have wildly different replay needs and wildly different costs.

High-volume topics like EDR telemetry may justify a shorter Kafka retention window (say, 24 to 48 hours) because anything beyond that is going to be served from the data lake anyway. Why keep eight terabytes per day in Kafka for a week when the same data is also being written to Iceberg for ninety-day hot retention and S3 Glacier for longer? Kafka's job in that pipeline is the replay buffer for fixing a broken consumer, not the long-term archive.

Lower-volume but higher-replay-value topics (CloudTrail audit logs, identity provider events, Sigma-rule-driven alerts) may justify longer Kafka retention (thirty days, sometimes longer) because the cost is small and the convenience of replaying directly from Kafka rather than rebuilding from the lake matters when a detection rule has a bug.

Tiered storage changes this calculus in the direction of "keep more, pay less." With KIP-405 in place, the cost difference between seven-day and thirty-day Kafka retention on a high-volume topic shrinks significantly because the older days move to S3 automatically. I'd revisit the retention defaults whenever tiered storage gets enabled.

Alternatives

When Kafka may not be the right tool.

Kafka is the safe default for a security routing spine, but it's not the only credible option, and the choice is more interesting than the typical Kafka-vs-everything-else framing suggests.

AWS Kinesis: managed simplicity at moderate scale

AWS Kinesis Data Streams trades flexibility for operational simplicity. AWS runs the infrastructure; you pay per million events plus per shard-hour. For an AWS-native security architecture that already lands logs in S3, runs detection in Lambda, and writes to Glue (common at smaller scale), Kinesis may eliminate an entire class of operational toil. The rough break-even versus Kafka is somewhere around ten to twenty million events per day; below that, Kinesis is usually cheaper, and above that, Kafka's per-byte cost starts to win.

Kinesis loses on retention (max 365 days, default 24 hours), on ecosystem (the Kafka Connect catalog is dramatically larger than the Kinesis equivalent), and on multi-cloud portability (it's AWS-only). For a security team committed to AWS, those losses may not matter. For a multi-cloud or hybrid security architecture, they do.

Redpanda: Kafka API, simpler operations

Redpanda is a Kafka-API-compatible streaming engine written in C++ that eliminates ZooKeeper (which Kafka itself did via KRaft, but later) and bills itself on lower latency and fewer operational moving parts. For security teams that want Kafka's ecosystem (the Kafka Connect catalog, existing client libraries, Schema Registry) without Kafka's JVM tuning surface, Redpanda is a real alternative worth benchmarking. I have not personally run Redpanda in a production security pipeline, so I'd flag the operational simplification claims as Tier C (vendor positioning) until I see independent production validation at the scale of the deployment I'm sizing.

Apache Pulsar: multi-tenancy and geo-replication built in

Pulsar separates brokers from storage (the storage layer is Apache BookKeeper), which allows independent scaling of compute and storage and makes multi-tenancy and geo-replication first-class concerns. For an MSSP running a security platform for fifty customer tenants on one cluster, Pulsar's tenancy model is genuinely better than what Kafka requires you to build on top of itself. For a single-tenant in-house security team, Pulsar's additional architectural complexity (separate broker and storage tiers) is unlikely to pay back versus Kafka's simpler single-binary-per-broker model.

Aiven and Confluent Cloud: managed Kafka

Confluent Cloud is the cloud-native version of Apache Kafka from the company that commercialized it, with the deepest tooling ecosystem (Schema Registry, ksqlDB, Connect) integrated. Aiven offers managed Kafka across AWS, Azure, GCP, and DigitalOcean with a lighter-touch operational model. For security teams that want Kafka's flexibility without the on-call burden of running brokers, both are credible. Confluent Cloud tends to be more expensive at the high end but gets you the broadest connector catalog; Aiven tends to be cheaper and works better in multi-cloud setups. The decision usually comes down to which ecosystem you're already in.

Operating it

What to monitor on a security Kafka cluster.

The two metrics that catch most production problems in security Kafka pipelines are consumer lag and partition skew, in that order.

Consumer lag (the difference between the latest offset on a partition and the consumer group's committed offset) is the canary for downstream processing health. If the SIEM-ingest consumer group's lag on the EDR topic starts climbing, detections are arriving late. The threshold worth alerting on depends on the topic: for real-time alert paths, anything more than a few seconds of lag is worth investigating; for archive-to-lake paths, several minutes may be fine.

Partition skew, the ratio of events on the hottest partition to the average, is the canary for partition-strategy problems. A 2x skew is usually fine. A 6x skew (one partition with 60% of the load that should be 17%) means the partition key is wrong for this workload. Most Kafka monitoring dashboards don't surface skew by default; it's worth building.

Other metrics worth tracking: under-replicated partitions (replication is failing; investigate immediately), broker disk usage (storage exhaustion takes the cluster down hard), ISR shrink rate (replication is degrading), and request handler idle ratio (broker is CPU-saturated). Prometheus plus Grafana with the Kafka JMX exporter is the open-source default; Confluent Control Center, Datadog, and New Relic offer commercial equivalents.

For security teams specifically, I'd add one more: the lag on the dead-letter-queue topic, if you have one. Events that failed downstream processing should not silently accumulate in the DLQ forever. A growing DLQ is a detection-coverage problem disguised as a Kafka problem.

Why this matters

Kafka is the routing spine, not the detection layer.

The architectural shape I keep coming back to is Kafka as the fan-out backbone: one topic per event source, multiple independent consumer groups feeding the SIEM for real-time alerting, the streaming processor (Flink, RisingWave, Tenzir) for detection, the data lake (Iceberg via Spark) for retrospective hunting, and the long-term archive (S3) for compliance retention. Each consumer pays only for what it needs and operates on its own latency budget.

What Kafka is not is a detection engine, a stream-processing engine, or a SIEM. Confluent's ksqlDB layer adds materialized views and stateful stream processing on top of Kafka, which can serve a subset of streaming detection use cases (failed-login counts per user, threshold alarms), but the moment your detection logic needs joins across three or more streams, stateful enrichment from reference tables, or windowed aggregations more complex than tumbling counts, a dedicated stream processor will probably serve you better. RisingWave and Flink are the common choices; I've written about RisingWave's role in the reference architecture separately.

The summary I'd offer to a security architect inheriting a Kafka cluster: understand the partition strategy, audit the replication and acks settings per topic, monitor consumer lag and partition skew, set retention per topic rather than cluster-wide, and resist the temptation to do detection inside Kafka itself, because Kafka is good at being the spine and the cluster runs better when you let the rest of the stack do the work each layer is good at.