Security Data Works

Technology deep-dive

ETL vs ELT for security data: who owns the schema, and when.

The acronym debate looks like an ordering preference, transform first or load first, but the question underneath it is schema ownership, which is to say who decides what the data looks like and at what point in the pipeline that decision gets locked in. For security data the answer drives roughly 30-70% of pipeline cost, and it determines whether your detection logic ends up sitting inside a vendor's compute platform or inside your own lakehouse.

Reading time: about 16 minutes. Evidence tier: B (vendor pricing analysis, practitioner validation, list rates from Databricks, Snowflake, and Cribl public documentation). Cost figures use list pricing as a worst-case anchor; real enterprise contracts typically negotiate 20-40% off that anchor. All comparisons exclude personnel costs unless explicitly called out.

At a glance

ETL vs ELT, side by side.

The headline differences for security data workloads. Every row is expanded in the prose below with the conditions that change the answer.

Dimension | ETL (Extract, Transform, Load) | ELT (Extract, Load, Transform)
Transformation timing | Transform in flight, before data lands. Schema and detection logic applied during streaming ingestion. | Transform after data lands. Raw events written first; schema applied later via SQL or notebooks against the destination store.
Where transformation runs | Streaming pipeline (Cribl Stream, Tenzir, Vector, or custom Flink/Spark Structured Streaming). | Destination platform's compute layer: Databricks DBUs or Snowflake credits running SQL/notebooks against the loaded raw layer.
Schema rigidity | Decided upfront and enforced at ingest. Filtered-out data is gone; sampled data is statistical, not complete. | Decided after the fact, frequently re-decided by different teams running different queries against the same raw layer.
Replay / reprocessing cost | Bounded. Only the surviving signals plus ~10-20% raw sample are in the lakehouse. New detection logic against historical raw is limited by what survived the filter. | Cheap in principle (raw is still there), expensive in practice. Re-running correlation rules against 900 TB of 90-day-retained data burns billable compute every pass.
Data sovereignty | Storage in your own S3 or ADLS bucket using open table formats (Iceberg, Delta). Pipeline vendor is an interchangeable extraction layer. | Data lives inside the vendor's managed storage. Egress runs $0.09-0.15/GB: moving 900 TB out of Snowflake at $0.12/GB costs roughly $108K before rewriting any queries.
OCSF normalization | Applied in flight inside the pipeline (Tenzir has native OCSF transforms; Cribl exposes it via packs). Lakehouse stores already-normalized records. | Applied after load via SQL or dbt models against the raw layer. Each query/dashboard pays the normalization cost, or you materialize OCSF views and pay storage twice.
Best-fit scenario | 5+ TB/day with validated detection rules; detection-tier latency requirements (sub-10-second event-to-alert); data sovereignty matters strategically. | Under 1 TB/day, exploration-phase detection R&D, or hunting-as-primary-workload where query-everything flexibility outweighs the compute bill.

The reframing

ETL and ELT are schema-ownership decisions in disguise.

ETL (Extract, Transform, Load) is the older pattern. You pull data out of a source system, apply a schema and any cleansing or detection logic in flight, and write the structured result into a destination store, so the schema is decided before the data lands and the destination store sees only what you chose to let through.

ELT (Extract, Load, Transform) inverts the last two steps. Pull data out of a source system, load it raw into a warehouse or lakehouse, and apply transformations later, usually via SQL or notebooks running inside the destination platform. The schema is decided after the data lands, and frequently decided multiple times by different teams running different queries against the same raw layer.

Stated that way, ELT sounds like the more flexible pattern, and most "modern data stack" marketing implies as much. For some workloads that is right. For security data specifically, though, the ordering question collapses into a sharper one: where does the schema live, and who pays to maintain it?

If the schema lives at the source (inside a vendor's connector library, a SaaS integration, or a transformation step you don't directly control), you've handed that vendor a structural dependency. When their schema changes, your pipeline breaks. When their pricing changes, your migration is painful. If the schema lives in your warehouse or lakehouse, in version-controlled SQL or dbt models you can audit and rewrite, the dependency runs the other direction. The vendor becomes an interchangeable extraction layer, and the analytical logic stays yours.
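A minimal sketch of what "the schema lives in your repo" can look like. The two feeds (`vendor_a`, `vendor_b`), every field name, and the version string are hypothetical and not taken from any real connector. The point is only that the internal field list and the per-vendor mappings are plain, version-controlled data, so swapping an extraction vendor means adding one mapping rather than rewriting downstream queries.

```python
from typing import Any

# Hypothetical internal schema, owned and versioned by the security team.
SCHEMA_VERSION = "1.2.0"
INTERNAL_FIELDS = ("event_time", "src_ip", "user", "action")

# Hypothetical per-vendor mappings: internal field name -> vendor field name.
FIELD_MAPS: dict[str, dict[str, str]] = {
    "vendor_a": {"event_time": "ts", "src_ip": "sourceAddress",
                 "user": "userName", "action": "evt"},
    "vendor_b": {"event_time": "@timestamp", "src_ip": "client.ip",
                 "user": "actor", "action": "operation"},
}


def to_internal(vendor: str, raw: dict[str, Any]) -> dict[str, Any]:
    """Project a raw vendor record onto the internal schema.

    Fields the vendor did not send become None, so every row has the same shape.
    """
    mapping = FIELD_MAPS[vendor]
    row = {field: raw.get(mapping[field]) for field in INTERNAL_FIELDS}
    row["schema_version"] = SCHEMA_VERSION
    return row


row_a = to_internal("vendor_a", {"ts": "2026-01-01T00:00:00Z", "sourceAddress": "10.0.0.5",
                                 "userName": "alice", "evt": "login"})
row_b = to_internal("vendor_b", {"@timestamp": "2026-01-01T00:00:00Z", "client.ip": "10.0.0.5",
                                 "actor": "alice", "operation": "login"})
print(row_a == row_b)  # True: same internal row regardless of which vendor delivered it
```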

The ELT story

Load first, transform later, and pay for compute on every query.

Databricks and Snowflake are the canonical ELT platforms. Their pitch is straightforward: ingest everything, store it in our managed object storage, and run any transformation you want against it via SQL, Spark, or notebooks. The architecture is convenient for small teams without a data engineering function, because the platform handles the storage layer and the compute layer, and you only have to write queries.

The pricing shape is where it gets interesting. ELT platforms make money on compute, not on storage. Storage is cheap (Snowflake's documented rate sits around $23-30 per terabyte per month depending on region and time-travel configuration). Ingestion is cheap or, in some vendor contracts, bundled. The recurring cost is the query layer: every correlation rule, every dashboard refresh, every analyst investigation runs through billable compute units (Databricks DBUs, Snowflake credits).

For security workloads, that cost shape compounds in a specific way. Detection rules run on a schedule (every 5 to 15 minutes for batch-query SIEM patterns) and they scan a rolling window of data each time. Across a 90-day retention window at 10 TB/day of raw security telemetry, that's roughly 900 TB of retained data, with each scheduled rule firing roughly 100 to 300 times per day against its rolling window, plus ad-hoc analyst queries, plus any ML feature engineering or model training. The compute bill grows with query activity rather than with stored volume alone, which is the part most architects underestimate.
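The schedule arithmetic can be sketched as follows. The retention and cadence figures come from the paragraph above; the 60-minute lookback window is an assumption for illustration, since the window each rule scans is not specified here.

```python
def runs_per_day(interval_minutes: int) -> int:
    """Number of scheduled executions of one rule in 24 hours."""
    return (24 * 60) // interval_minutes


def scans_per_event(window_minutes: int, interval_minutes: int) -> int:
    """How many scheduled runs re-read the same event.

    A rule that looks back `window_minutes` and fires every `interval_minutes`
    touches each event roughly window/interval times before it ages out.
    """
    return window_minutes // interval_minutes


retained_tb = 10 * 90            # 10 TB/day over a 90-day window
print(retained_tb)               # 900
print(runs_per_day(5))           # 288
print(runs_per_day(15))          # 96
print(scans_per_event(60, 5))    # 12, under the assumed 60-minute lookback
```

Under those assumptions, the same event is billed a dozen times by a single rule, which is the recurring cost that in-flight detection pays only once.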

The other piece of the ELT shape is egress. Cloud storage providers and managed lakehouse platforms typically charge $0.09-0.15 per gigabyte to move data out. Migrating 900 TB out of Snowflake at $0.12 per gigabyte lands around $108K just to move your own data, before you pay to rewrite the SQL, retrain analysts, and rebuild integrations on the new platform, which is where the flexibility story ELT was sold on turns into a lock-in story you didn't price at the start.
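The egress figure is a single multiplication, sketched here across the quoted rate range. It assumes decimal units (1 TB = 1,000 GB) and a flat per-GB rate with no tiering or discounts; the function name is illustrative.

```python
GB_PER_TB = 1_000  # assumption: decimal terabytes


def egress_cost_usd(volume_tb: float, usd_per_gb: float) -> float:
    """One-time cost to move `volume_tb` out at a flat per-GB egress rate."""
    return volume_tb * GB_PER_TB * usd_per_gb


for rate in (0.09, 0.12, 0.15):
    print(f"${rate:.2f}/GB -> ${egress_cost_usd(900, rate):,.0f}")
# $0.09/GB -> $81,000
# $0.12/GB -> $108,000
# $0.15/GB -> $135,000
```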

The ETL story

Transform in flight, store only what survives the filter.

The ETL pattern, modernized for security operations, looks like this: collect raw telemetry from EDR, firewall, DNS, identity, and cloud audit sources; run it through a streaming pipeline that applies detection logic, OCSF normalization, and sampling decisions in flight; and write only the surviving signals (plus a controlled sample of raw events) into the destination lakehouse.
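The in-flight decision described above (keep signals, keep a controlled sample of raw events, drop the rest) can be sketched as a small routing function. This is an illustration rather than how Cribl, Tenzir, or Vector implement it: the detection rule, the event fields, the table names, and the 10% rate are all hypothetical, and the sampling here is deterministic hash-based sampling swapped in for a random draw so that replaying the same stream keeps the same events.

```python
import hashlib
from typing import Any, Optional

SAMPLE_RATE = 0.10  # assumed fraction of non-matching raw events to keep


def is_signal(event: dict[str, Any]) -> bool:
    """Placeholder detection rule: failed logins for the admin account."""
    return event.get("action") == "login_failed" and event.get("user") == "admin"


def keep_in_sample(event_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Hash the event ID into [0, 1) and keep it if it falls below the rate."""
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def route(event: dict[str, Any]) -> Optional[str]:
    """Return the lakehouse table for an event, or None if it is dropped in flight."""
    if is_signal(event):
        return "signals"
    if keep_in_sample(event["id"]):
        return "raw_sample"
    return None


print(route({"id": "e-1", "action": "login_failed", "user": "admin"}))  # signals
```

Everything `route` returns `None` for never reaches storage, which is exactly the irreversibility the tradeoff discussion later in this section turns on.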

Three vendors anchor the practical implementation choices. Cribl Stream is the commercial routing platform with the deepest source/destination integration library and a routing-by-value model: high-value events to the SIEM, everything else to the lake. Tenzir is the detection-focused open-core pipeline with native OCSF transformation and streaming ML hooks. Vector (from Datadog) is the open-source observability pipeline that's grown a security following because the configuration story is more legible than custom Flink jobs. Apache Flink and Apache Spark Structured Streaming sit underneath as the heavy-lift options when commercial platforms don't fit.

The cost shape inverts from ELT. You pay once during ingestion for the pipeline compute, and then almost nothing per query against the resulting lakehouse, because the lakehouse is storing 5-15% of the raw volume (signals plus sample), not 100% of it. Storage at $0.02-0.025 per gigabyte per month on S3 with Iceberg metadata is nearly free at that reduced footprint. ClickHouse or DuckDB on top of that storage handles analyst queries at a small fraction of Snowflake compute costs, because there's less data to scan.
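To make "nearly free at that reduced footprint" concrete, here is the storage arithmetic for the quoted ranges, assuming the 10 TB/day, 90-day scenario used elsewhere in this piece and decimal units (1 TB = 1,000 GB). The function name is illustrative.

```python
GB_PER_TB = 1_000  # assumption: decimal terabytes


def storage_usd_per_month(raw_tb: float, kept_fraction: float, usd_per_gb_month: float) -> float:
    """Monthly object-storage cost for the fraction of raw volume that is retained."""
    kept_tb = raw_tb * kept_fraction
    return kept_tb * GB_PER_TB * usd_per_gb_month


raw_tb = 10 * 90  # 900 TB of raw telemetry over the retention window
low = storage_usd_per_month(raw_tb, 0.05, 0.02)     # 45 TB kept at the low rate
high = storage_usd_per_month(raw_tb, 0.15, 0.025)   # 135 TB kept at the high rate
print(f"${low:,.0f} to ${high:,.0f} per month")     # $900 to $3,375 per month
```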

The architectural cost is real, because you need a pipeline team (usually two to four data engineers depending on volume) and you need detection logic defined upfront, since the data you filtered out in flight is gone (or sampled, which is the same problem with statistics layered on top). So the tradeoff this essay is trying to make legible is that you give up query-time flexibility to get ingestion-time control, and you give up vendor compute lock-in to take on in-house data engineering investment instead.

The cost comparison

What 10 TB/day actually costs in each model.

The scenario I'll work through is a 10 TB/day security telemetry footprint, typical for a 10,000-50,000-employee enterprise running full logging across EDR, firewall, DNS, identity, and cloud audit trails. The retention window is 90 days, a common hot-retention floor: PCI DSS, for example, requires the most recent three months of audit logs to be immediately available for analysis (within a 12-month total), and many SOC 2 and GDPR-aligned policies settle on a similar window. The horizon is 3 years, the standard enterprise procurement cycle.

These are list-rate numbers. Real enterprise contracts negotiate 20-40% off list, sometimes more on committed-use agreements. The relative ratios hold; the absolute totals will be lower in practice. I'm deliberately using list rates as the worst-case anchor because that's the only number anyone can actually verify without an NDA.

ELT on Databricks or Snowflake (list-rate anchor)

Line item | Calculation | Result
Ingestion | 10 TB/day x 1,000 GB/TB x $0.15/GB x 30 days | $45K/month
Storage | 900 TB x $30/TB/month | $27K/month
Compute | Correlation rules | $300-500K/month
Compute | Analyst ad-hoc queries | $100-200K/month
Compute | ML feature engineering + training | $200-400K/month
Compute subtotal | Midpoint of the $600K-1.1M range | ~$850K/month
Monthly total (midpoint) | $45K + $27K + $850K | ~$922K/month
3-year TCO (list rate) | $922K x 36 months | ~$33.2M

The dominant line item is compute, at about 92% of the total. Ingestion is roughly 5% and storage roughly 3%. The reason that ordering matters: a "save money on ELT" pitch that focuses on storage tier optimization is solving the wrong problem. The bill is being driven by recurring query compute, and storage optimization has marginal impact at this shape.

ETL with a self-managed pipeline (Tenzir on Kubernetes)

Line item | Detail | Result
Pipeline compute | Tenzir self-managed K8s | $50K/month
Storage | 99 TB (signals + 10% sample) at $0.02/GB on S3 + Iceberg | $2K/month
Query compute | ClickHouse dashboards | $20K/month
Query compute | DuckDB ad-hoc | $5K/month
Monthly total | $50K + $2K + $20K + $5K | ~$77K/month
3-year TCO (infrastructure only) | $77K x 36 months | ~$2.77M

The infrastructure-only comparison is roughly $33.2M vs $2.77M, about a 92% reduction. That number holds only if you ignore personnel cost. Two to four data engineers at $250-400K fully loaded per year over a three-year horizon add $1.5M-4.8M. With personnel included, the savings narrow but stay material: roughly a 77-87% reduction depending on team size. The right way to think about this is that the savings are real but smaller than the infrastructure-only headline suggests, and they require building (or already having) a data engineering capability.

ETL with a commercial pipeline (Cribl Stream)

Line item | Detail | Result
Pipeline compute | Cribl Stream (10 TB/day x 1,000 GB/TB x $0.10/GB list x 30 days) | $30K/month
Storage | 99 TB on S3 + Iceberg | $2K/month
Query compute | ClickHouse + DuckDB | $25K/month
Monthly total | $30K + $2K + $25K | ~$57K/month
3-year TCO (infrastructure only) | $57K x 36 months | ~$2.05M

Commercial pipeline platforms collapse the operational burden: no Kubernetes cluster to manage, no on-call rotation for pipeline failures, fewer data engineers needed (often just one or two for integration work). At list rate, Cribl on this workload runs roughly $2.05M over 3 years versus Databricks/Snowflake at $33.2M, about a 94% reduction, infrastructure-only. Add an analyst or two instead of a data engineering team and the personnel delta narrows further. The Cribl-versus-Tenzir choice ends up being less about cost than about how much of the pipeline you want to control directly.
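Returning to the self-managed option, the roll-up from monthly line items to a 3-year figure with personnel included can be sketched as below. The line items and the personnel range are the ones quoted in the self-managed pipeline table and the paragraph that follows it; the function and key names are illustrative.

```python
MONTHS = 36  # 3-year horizon

# Monthly infrastructure line items (USD) for the self-managed pipeline.
self_managed_monthly = {
    "pipeline_compute": 50_000,
    "storage": 2_000,
    "clickhouse_dashboards": 20_000,
    "duckdb_ad_hoc": 5_000,
}


def infra_tco(monthly_items: dict[str, int], months: int = MONTHS) -> int:
    """Sum the monthly line items and extend them over the horizon."""
    return sum(monthly_items.values()) * months


def personnel_tco(engineers: int, loaded_cost_per_year: int, years: int = 3) -> int:
    """Fully loaded cost of the pipeline team over the horizon."""
    return engineers * loaded_cost_per_year * years


infra = infra_tco(self_managed_monthly)        # 77,000 x 36
low = infra + personnel_tco(2, 250_000)        # two engineers at the low rate
high = infra + personnel_tco(4, 400_000)       # four engineers at the high rate
print(f"infra ${infra:,}; with personnel ${low:,} to ${high:,}")
# infra $2,772,000; with personnel $4,272,000 to $7,572,000
```

At the high end, personnel is the larger share of the total, which is why the infrastructure-only headline needs the caveat it gets above.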

Detection latency

Cost is not the only axis; latency is the other one.

ELT batch-query detection runs on a schedule, typically 5-15 minute intervals for the correlation rules that fire on log volume in any decent SIEM. Add the actual query latency (30 seconds to a few minutes against 900 TB of data with reasonable partitioning), and you arrive at total event-to-alert latency in the 5-20 minute range. That was the accepted baseline for the last decade of SIEM design.
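The 5-20 minute range falls out of simple addition: an event waits up to one full schedule interval for the next run, then waits for the query to finish. A small sketch, assuming events arrive uniformly within an interval and taking "a few minutes" as 180 seconds for the slow case:

```python
def batch_latency_seconds(interval_s: float, query_s: float) -> tuple[float, float]:
    """(average, worst-case) event-to-alert latency for a scheduled query.

    Average wait is half an interval under uniform arrivals; worst case is an
    event that lands just after a run starts and waits the whole interval.
    """
    return interval_s / 2 + query_s, interval_s + query_s


for interval_min, query_s in ((5, 30), (15, 180)):
    avg, worst = batch_latency_seconds(interval_min * 60, query_s)
    print(f"{interval_min}-min schedule: avg {avg / 60:.1f} min, worst {worst / 60:.1f} min")
# 5-min schedule: avg 3.0 min, worst 5.5 min
# 15-min schedule: avg 10.5 min, worst 18.0 min
```

The worst-case column is the one that matters for detection, since an attacker does not time the intrusion to the scheduler.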

ETL pipeline detection runs against the event stream as it arrives. Correlation logic executes per event (or per micro-batch in Flink terms), with latency measured in seconds. Total event-to-alert latency lands in the 1-10 second range for well-tuned streaming pipelines.

Three 2026 data points have made the 5-20 minute tier difficult to defend for detection-tier workloads specifically. CrowdStrike's 2026 Global Threat Report observed a 27-second fastest recorded breakout time (the interval from initial compromise to lateral movement) in their telemetry. A detection pipeline that needs 5-15 minutes to surface the initial event has already lost that race by more than 10x. Mandiant's M-Trends 2026 documented a negative mean time-to-exploit: on average, vulnerabilities are being exploited 7 days before patch release, which moves the defender's window from "time to patch" to "time to detect and contain." And Anthropic's Claude Mythos preview (April 2026, benchmarked that June) demonstrated economically viable autonomous vulnerability discovery and exploit generation at roughly $2,000 per kernel-class exploit (eight distinct exploits for $15,700 in API credits), with the AISLE rebuttal showing eight small open-weight models all reproducing its flagship FreeBSD exploit, meaning machine-speed offensive capability is no longer a frontier-lab artifact.

The honest read on those data points is that they don't argue that every security workload needs sub-second latency, but rather that the detection tier specifically has to operate at machine speed, and that scheduled-query detection at 5-15 minute cadence has shifted from a tunable operational tradeoff to a structural exposure. Hunting, BI dashboards, baseline computation, and post-incident forensics are still fine at minute-to-hour latency, because there an analyst is in the loop and the timing tolerance is different.

That's the latency case for pipeline detection, and the claim isn't that ETL is faster so ETL wins. The claim is narrower, that the detection tier specifically may no longer be a fit for batch-query architecture, and the implementation pattern that satisfies that constraint is in-flight processing, which happens to be the ETL shape.

When ELT is the right answer

Three scenarios where I'd choose ELT.

1. Small scale where compute cost stays under the noise floor

Under roughly 1 TB/day, Databricks or Snowflake costs typically sit in the $50-100K/month range, which is small enough that operational simplicity wins. A two-person security team without a data engineering function is better served by a managed ELT platform than by standing up a Kubernetes-hosted streaming pipeline. The personnel cost of running the pipeline outweighs the infrastructure savings at that scale. The cutoff isn't a sharp line (it depends on team skills and query patterns), but as a rule of thumb, under 1 TB/day, the math usually favors ELT.

2. Exploration phase, before detection logic is validated

The hardest part of ETL is that filtering decisions are irreversible: data you didn't keep can't be queried later. If you don't yet know what to detect, an ELT platform is a more honest fit, because you can prototype detection logic against raw data, validate precision and recall, and only then decide what's worth running in a pipeline. The pattern I see work well: 6-12 months in an ELT platform with a smaller sample (1-2 TB/day, not the full 10 TB/day production volume) for detection R&D, followed by migration of validated rules into a pipeline. That's roughly the HMM3-to-HMM4 progression in Hunting Maturity Model terms: exploration in an ELT platform, then automation in an ETL pipeline.

3. Retroactive investigation as a primary workload

If your team's actual workload is hypothesis-driven hunting against historical data (pulling arbitrary fields from arbitrary log sources across 90+ day windows), ELT may fit better, because pipeline filtering throws away the data the hunter wants. The honest framing is that this is a small fraction of typical SOC workload; most investigations resolve with 100% of signals plus a 10-20% random sample of raw events. But if the team is structured around hunting rather than around detection-and-response, the ELT cost model may be worth paying.

When ETL is the right answer

Three scenarios where I'd choose ETL.

1. Large scale where ELT compute escalates non-linearly

Above 5 TB/day, ELT platform costs at list rates run from the mid-to-high six figures into seven figures per month. Even with negotiated discounts, the trajectory is uncomfortable. A 10-50 TB/day enterprise is paying for the same data to be scanned by the same detection rules every five minutes, and paying for it every time. Pipeline detection moves that compute upstream to a one-time-per-event cost. That's where the ETL economics start to dominate the conversation, and where the personnel cost of a data engineering team shrinks relative to the infrastructure savings.

2. Production operations with validated detection playbooks

Once detection rules are stable (meaning you know what you're looking for, you've validated precision above 90% and recall above 70% against historical data, and you have a maintained library of correlation logic), the case for running those rules in a pipeline strengthens. The exploration flexibility of ELT matters less when the analytical question is already known. Pipeline detection also satisfies the latency constraint the detection latency section makes explicit, which scheduled-query detection on an ELT platform may not.
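The validation gate can be written down directly. A minimal sketch using the standard precision and recall definitions and the 90% / 70% thresholds from the paragraph above; the confusion counts in the example are made up, and the function names are illustrative.

```python
def precision(tp: int, fp: int) -> float:
    """Share of fired alerts that were true positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of real incidents that the rule caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def ready_for_pipeline(tp: int, fp: int, fn: int,
                       min_precision: float = 0.90, min_recall: float = 0.70) -> bool:
    """True if a rule clears both thresholds on historical data."""
    return precision(tp, fp) > min_precision and recall(tp, fn) > min_recall


# Hypothetical backtest results for two rules.
print(ready_for_pipeline(tp=46, fp=4, fn=14))   # True: precision 0.92, recall ~0.77
print(ready_for_pipeline(tp=40, fp=10, fn=5))   # False: precision 0.80
```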

3. Cost predictability and egress portability matter strategically

Pipeline compute is fixed per gigabyte ingested, which makes capacity planning legible. ELT compute is variable per query, which makes capacity planning a conversation with the vendor's sales team. For organizations where security budget is fixed and predictable (most of them), that predictability is valuable. Combined with storing data in your own S3 or ADLS bucket using open table formats (Iceberg or Delta), the ETL pattern preserves the option to swap query engines or storage platforms without paying egress, which is a strategic asset on a 5-year horizon.

The hybrid pattern

Most production SOCs end up running both.

The cleanest mental model for the security data stack isn't "ETL or ELT" but a tiered architecture where each tier picks the pattern that fits its workload.

  • Tier 1, pipeline detection (ETL). High-confidence, validated detection rules run in flight against the event stream via Cribl, Tenzir, Vector, or custom Flink jobs. Output is signals plus a controlled sample of raw events, written to Iceberg or Delta on object storage.
  • Tier 2, lakehouse queries (ETL output, ELT pattern). ClickHouse or DuckDB on top of the Iceberg lakehouse handles analyst dashboards, signal investigation, and queries against the sampled raw layer. Cost is dominated by storage, which is cheap, and query compute, which is small because the data volume is.
  • Tier 3, exploration platform (ELT). Databricks or Snowflake with a 30-day retention window holds a higher-fidelity sample for hypothesis-driven hunting and detection R&D. The footprint is small enough that ELT cost is manageable, and the workload benefits from query-everything flexibility.

That structure puts the pattern with each workload's natural fit, instead of forcing a single architectural choice across workloads that have legitimately different requirements. The cost story improves at every tier because the expensive ELT compute layer is now sized for exploration (small) rather than for production detection (large), the latency story improves because the detection tier is moving at machine speed, and the data ownership story improves because the long-tail storage lives in open formats on object storage you control.

Common mistakes

Three pitfalls I see repeatedly.

1. "We need 100% raw data for investigations"

This is the most common objection to pipeline filtering, and it's usually false in practice, because the investigations that genuinely benefit from 100% raw data are a small minority while the ones that resolve with signals plus a 10-20% random sample of raw events are the large majority. The honest framing is to accept some bounded investigation limitation (5-10% of cases that would have benefited from full retention) in exchange for 80-90% cost reduction, and if the team can't accept that tradeoff explicitly, the ETL pattern is not the right fit and ELT is the more honest choice. The work is in making the tradeoff explicit rather than letting it stay buried.

2. "We'll migrate to ETL eventually, after we ingest everything for now"

This is the path that gets organizations stuck, because egress costs scale with the data volume already ingested, so twelve months of "we'll migrate later" can mean six- or seven-figure egress bills, plus the rewriting of the queries and the retraining of analysts on a new platform. The honest version is that if ETL is the eventual destination, you should start moving toward it within the first 6-12 months, before the migration cost compounds past the ELT savings it was supposed to capture, since the window where migration stays cheap is also the window where ELT still looks tolerable.

3. "We'll build a custom Flink pipeline from scratch"

Custom Apache Flink or Spark Structured Streaming is the right answer for a small number of organizations, typically those with existing data platform teams and very specific requirements that commercial pipelines don't fit. For most security teams the right starting point is a commercial or open-core platform (Cribl, Tenzir, Vector) for the first 12-24 months, with custom Flink reserved for components where the commercial option doesn't fit, because the economics of "build it ourselves" rarely survive contact with operational reality: the maintenance burden on a self-built streaming pipeline at 10 TB/day is substantial, and the time-to-production tends to run into quarters rather than weeks.

Decision framework

A short checklist for the next architecture conversation.

Four questions I'd ask in any conversation about ETL versus ELT for a security data stack:

  • What's the daily ingestion volume, and where's the trajectory? Under 1 TB/day, ELT is usually defensible. 1-5 TB/day is the gray zone where the choice depends on team skills and workload. Above 5 TB/day, the ELT compute bill at list rates is uncomfortable enough that pipeline detection becomes worth considering even with the data engineering investment.
  • How mature is the detection logic? If detection rules are still being prototyped and validated, ELT's query-everything flexibility is valuable. If rules are stable and well-validated, pipeline detection moves them upstream where they run faster and cheaper. The maturity question often answers the architecture question.
  • Where does the schema live, and who maintains it? If the schema lives at the source (vendor connector library, SaaS integration), changes are outside your control. If it lives in your warehouse or lakehouse as version-controlled dbt models or SQL, the vendor becomes an interchangeable extraction layer. The latter is the more durable position for a 5-year architecture.
  • What's the egress exposure? Calculate the cost to move your current data volume out of the current platform at $0.12/GB. If that number is uncomfortable, you're already paying the ELT-lock-in tax, whether or not you've realized it. Plan migration before the number gets worse.

None of these questions resolve the choice on their own, but taken together they tend to make the answer obvious, and the teams that get stuck are the ones that frame the question as "which architecture is better" instead of "which architecture fits this workload, at this scale, with this team." There isn't a universally correct answer so much as a workload-specific one, and the framing decides it more than the technology does.
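For teams that like the checklist in executable form, here is a sketch that turns the first three questions into talking points. It is a conversation starter under the thresholds stated above, not a decision procedure; the class, field, and message names are hypothetical, and the egress question is left out because it is a single multiplication of stored volume by the per-GB rate.

```python
from dataclasses import dataclass


@dataclass
class StackProfile:
    tb_per_day: float          # current daily ingestion volume
    rules_validated: bool      # are detection rules stable and validated?
    schema_in_own_repo: bool   # version-controlled SQL/dbt vs vendor connector


def first_pass(profile: StackProfile) -> list[str]:
    """Map a profile to one note per checklist question."""
    notes = []
    if profile.tb_per_day < 1:
        notes.append("volume: ELT is usually defensible")
    elif profile.tb_per_day <= 5:
        notes.append("volume: gray zone, depends on team skills and workload")
    else:
        notes.append("volume: pipeline detection worth considering")
    notes.append(
        "rules: move validated rules upstream" if profile.rules_validated
        else "rules: keep prototyping against raw data"
    )
    notes.append(
        "schema: vendor is an interchangeable extraction layer" if profile.schema_in_own_repo
        else "schema: changes are outside your control"
    )
    return notes


for note in first_pass(StackProfile(tb_per_day=10, rules_validated=True, schema_in_own_repo=False)):
    print(note)
```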

Conclusion

Schema ownership is the real decision.

The ETL-versus-ELT debate looks like an ordering preference, but underneath it is a decision about where your schema lives, who pays to maintain it, and what happens when the vendor's pricing or product direction changes, which makes it a strategic decision dressed up as a technical one.

ELT on Databricks or Snowflake makes sense for small teams, early-stage detection programs, and exploration-heavy workloads. The cost shape is uncomfortable above 5 TB/day at list rates, and the egress lock-in is a real strategic exposure on a 3-5 year horizon.

ETL via Cribl, Tenzir, Vector, or custom Flink pipelines makes sense at production scale with validated detection logic. The cost story is materially better (savings on the order of 80-90% at list rates, even after accounting for personnel) and the latency story is necessary, not optional, for detection-tier workloads in the 2026 threat environment.

Most production SOCs end up running a hybrid: pipeline detection for the validated rules, lakehouse queries for the analyst layer, and a smaller ELT platform retained for exploration and detection R&D. That's the architecture I recommend most often, because it puts each workload on the pattern that fits it.

The deeper point is that schema ownership compounds, so every quarter you spend with the schema living inside a vendor's compute platform is a quarter the vendor accumulates pricing power, while every quarter you spend with the schema living in version-controlled SQL against open table formats on your own object storage is a quarter you accumulate flexibility instead. That's an architectural claim rather than a technical one, and it's the claim I think most often gets missed in the ETL-versus-ELT framing.