Economics deep-dive

The index pays twice.

A hot search index stores more bytes per event than a columnar lakehouse table, and it stores those bytes on more expensive storage, so it loses on both factors at once and they multiply. On the same ten million Zeek connection events, an OpenSearch index on block storage costs about 14.8× a warm Iceberg-on-S3 table, and the gap is mostly the storage class rather than the codec. At thirty days that is a rounding error on a SOC budget; at the seven-year retention horizon a regulated firm actually has to plan for, it is the difference between a line item and a project.

Reading time: about 9 minutes. Evidence tier: B. The footprints are first-party measurements on a single host over the same sha256-pinned 10M-row Zeek conn corpus as the flagship rerun; the prices are AWS us-east-1 list, read off the live rate card on 2026-06-10 and re-checked unchanged on 2026-06-14. This is a storage floor, not a total-cost-of-ownership model, and I keep those two layers separate throughout because conflating them is how cost arguments go wrong.

The decomposition

Four point two times the bytes, three point five times the price.

The headline multiplier factors cleanly, and the factoring is the whole argument. The first factor is how many bytes each realization writes for the same event. I loaded the identical ten-million-row synthetic Zeek conn corpus six ways and measured the footprint, and the spread is wide: an OpenSearch 2.18.0 index with best_compression and a force-merge lands at 186.8 bytes per event, while the same events written as Iceberg Parquet with pyiceberg's zstd defaults land at 44.0 bytes. That is a 4.2× difference in bytes before anyone has been charged a cent, because an inverted index carries the posting lists and the doc-values and the structures that make a term lookup instant, and those structures are the thing you are paying for when you ask for sub-millisecond search.

Measured footprint · same 10M Zeek conn events in every realization

Realization	Bytes/event	Reduction vs raw
Raw JSONL	374.3	1×
OpenSearch 2.18.0 index (best_compression, force-merged)	186.8	2.0×
ClickHouse MergeTree, default LZ4	68.5	5.5×
Iceberg Parquet, pyiceberg zstd defaults (warm)	44.0	8.5×
ClickHouse MergeTree, blanket ZSTD(22) (tuned-hot)	41.5	9.0×
Single-file Parquet, zstd-19, 1M-row groups (cold)	38.6	9.7×

Footprints measured on a single host (Tier B) over the same sha256-pinned 10M-row Zeek conn corpus. The byte ratios are parameters of this corpus, a flat sixteen-column conn schema, so nested OCSF or long-message logs will compress differently and should be re-measured per workload.

The second factor is the price of the byte, and it is set by the storage class each realization has to run on rather than by anything you tune. A hot search index has to sit on block storage to serve interactive queries, so its bytes live on gp3 at $0.08 per GB-month. A lakehouse table is read by an engine that pulls from object storage, so its bytes live on S3 Standard at $0.023, and colder tiers drop to Glacier Instant Retrieval at $0.004. Block storage against S3 Standard is a 3.5× difference in dollars per byte, and that factor is independent of how many bytes you wrote.

So the index pays twice: 4.2× the bytes, at 3.5× the price per byte, which compounds to 14.8× the monthly storage bill for the same events held warm on Iceberg. The number is not a single dramatic measurement but the product of two ordinary ones, and writing it that way matters, because it tells you which factor to attack.
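The factoring can be checked in a few lines. This is a minimal sketch using only the footprints and list prices quoted above; the function and constant names are illustrative, not part of the benchmark harness.

```python
# Figures copied from the footprint table and the pricing paragraph above.
OPENSEARCH_BYTES_PER_EVENT = 186.8
ICEBERG_BYTES_PER_EVENT = 44.0
GP3_USD_PER_GB_MONTH = 0.08
S3_STANDARD_USD_PER_GB_MONTH = 0.023


def cost_multiplier(bytes_a, price_a, bytes_b, price_b):
    """Return (byte factor, price factor, combined factor) of A relative to B."""
    byte_factor = bytes_a / bytes_b
    price_factor = price_a / price_b
    return byte_factor, price_factor, byte_factor * price_factor


byte_factor, price_factor, combined = cost_multiplier(
    OPENSEARCH_BYTES_PER_EVENT, GP3_USD_PER_GB_MONTH,
    ICEBERG_BYTES_PER_EVENT, S3_STANDARD_USD_PER_GB_MONTH,
)
print(f"bytes {byte_factor:.1f}x, price {price_factor:.1f}x, combined {combined:.1f}x")
# prints: bytes 4.2x, price 3.5x, combined 14.8x
```

The combined figure comes from the unrounded factors, which is why it reads 14.8× rather than the 14.7× you would get by multiplying 4.2 by 3.5.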

The counterintuitive part

The storage class dominates the codec.

The instinct, when the storage bill is too high, is to squeeze the bytes harder, and the measurement says that instinct is aiming at the smaller lever. Tuned ClickHouse with a blanket ZSTD(22) codec actually compresses better than Iceberg's zstd default, 9.0× against 8.5×, so on bytes alone it should win. It still costs 3.3× more per month, because those well-compressed bytes are sitting on block storage to keep the hot OLAP table fast, while Iceberg's slightly larger bytes are sitting on S3. The better codec lost to the storage class.

That is the finding I would carry out of this whole exercise: compression tuning moves the bill by tens of percent, moving bytes from gp3 to S3 moves it by 3.5×, and moving them to Glacier Instant Retrieval moves it by 20×. The architecture decision, which is where old data is allowed to live, dominates the engineering decision, which is how hard you squeeze it. A team can spend a quarter tuning codecs and dictionary encodings and shave a fifth off the bill, or it can decide that data older than a few weeks belongs on object storage and shave most of it off, and the second decision is usually less work than the first.
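The two levers can be put side by side with the measured footprints. The sketch below is illustrative only: it compares a codec change at a fixed storage class against a storage-class change at fixed bytes, and the names are assumptions of this example rather than anything from the benchmark.

```python
# USD per GB-month, AWS us-east-1 list prices quoted in this article.
GP3, S3_STANDARD, GLACIER_IR = 0.08, 0.023, 0.004

# Measured bytes per event from the footprint table.
CLICKHOUSE_LZ4 = 68.5
CLICKHOUSE_ZSTD22 = 41.5
ICEBERG_ZSTD = 44.0


def monthly_cost_index(bytes_per_event, usd_per_gb_month):
    """Relative monthly storage cost per event (units cancel in ratios)."""
    return bytes_per_event * usd_per_gb_month


# Better codec on block storage against a slightly worse codec on S3.
zstd_vs_iceberg = (
    monthly_cost_index(CLICKHOUSE_ZSTD22, GP3)
    / monthly_cost_index(ICEBERG_ZSTD, S3_STANDARD)
)

# Codec lever: LZ4 -> ZSTD(22), same storage class.
codec_saving = 1 - CLICKHOUSE_ZSTD22 / CLICKHOUSE_LZ4
# Storage-class lever: same bytes, cheaper class.
s3_saving = 1 - S3_STANDARD / GP3
glacier_saving = 1 - GLACIER_IR / GP3

print(f"tuned ClickHouse on gp3 vs Iceberg on S3: {zstd_vs_iceberg:.1f}x")
print(f"codec lever saves {codec_saving:.0%}")
print(f"gp3 -> S3 saves {s3_saving:.0%}, gp3 -> Glacier IR saves {glacier_saving:.0%}")
```

On these inputs the codec lever is worth roughly 39% and the move to S3 roughly 71%. The storage-class figures hold for any byte count, because the price ratio does not depend on the footprint.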

It also reframes what an open table format buys you. The case for Iceberg or Delta is often pitched as portability, which is real, but the cost-to-serve angle is more immediate: an open table on object storage is the thing that lets old data fall to a cheaper storage class without losing queryability, and that fall is where the money is. A vendor-managed hot store that you can only read through the vendor's compute does not give you that lever, because the bytes never leave the expensive tier.

The retention lever

Annoying at thirty days, existential at seven years.

A multiplier on its own does not tell you whether to care. What makes the 14.8× matter is that both bills are linear in retained days, so the ratio holds steady while the dollar gap grows with the retention window, and the retention window for a regulated firm is not thirty days. Priced at one terabyte per day of raw ingest, held at steady state, storage only, the curves look like this:

Monthly storage cost · 1 TB/day raw, steady state, storage only · AWS us-east-1 list

Realization · class	30 d	90 d	365 d	7 y
OpenSearch index · gp3	$1,200	$3,600	$14,600	$102,200
ClickHouse LZ4 · gp3	$439	$1,316	$5,338	$37,367
ClickHouse ZSTD-22 · gp3	$266	$797	$3,234	$22,636
Iceberg zstd · S3	$81	$244	$988	$6,914
Cold Parquet · S3	$71	$214	$866	$6,064
Cold Parquet · Glacier IR	$12	$37	$151	$1,055

Each realization is priced on the storage class it actually has to run on, because that mapping is half the cost story. Storage only: no compute, licensing, ops labor, egress, or IOPS add-ons. The Glacier IR row is storage-only and excludes the $0.03/GB retrieval charge.
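The table can be approximately reproduced from the footprints and prices. The following is a sketch under stated assumptions: decimal terabytes (1 TB = 1,000 GB), seven years taken as 7 × 365 days, and the rounded bytes-per-event figures from the footprint table. The published table was presumably computed from unrounded measurements, so expect differences of a fraction of a percent.

```python
RAW_BYTES_PER_EVENT = 374.3   # raw JSONL footprint from the table above
GB_PER_TB = 1000              # assumption: decimal terabytes
WINDOWS = {"30 d": 30, "90 d": 90, "365 d": 365, "7 y": 7 * 365}

# (bytes per event, USD per GB-month) for each realization and its class.
REALIZATIONS = {
    "OpenSearch index · gp3": (186.8, 0.08),
    "ClickHouse LZ4 · gp3": (68.5, 0.08),
    "ClickHouse ZSTD-22 · gp3": (41.5, 0.08),
    "Iceberg zstd · S3": (44.0, 0.023),
    "Cold Parquet · S3": (38.6, 0.023),
    "Cold Parquet · Glacier IR": (38.6, 0.004),
}


def monthly_storage_usd(bytes_per_event, usd_per_gb_month,
                        retention_days, raw_tb_per_day=1.0):
    """Steady-state monthly storage bill for a given retention window."""
    raw_gb_retained = raw_tb_per_day * GB_PER_TB * retention_days
    stored_gb = raw_gb_retained * bytes_per_event / RAW_BYTES_PER_EVENT
    return stored_gb * usd_per_gb_month


for name, (bpe, price) in REALIZATIONS.items():
    cells = "  ".join(
        f"${monthly_storage_usd(bpe, price, days):>9,.0f}"
        for days in WINDOWS.values()
    )
    print(f"{name:<28}{cells}")
```

Changing `raw_tb_per_day` scales every cell linearly, which is the point of quoting the gap per terabyte per day of ingest.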

At thirty days the index-versus-lakehouse gap is about $1,100 a month, the difference between $1,200 on the index and $81 on warm Iceberg. That is the kind of number a SOC absorbs without a meeting. At the seven-year horizon that the SEC's 17a-4 and the EU's DORA push regulated firms toward, the same gap is about $95,000 a month per terabyte per day of ingest, $102,200 against $6,914, and that is the kind of number that starts a migration. Nothing about the architecture changed between those two columns; only the retention window did, and the window is set by a regulator rather than by an engineer.

The cold tier widens it further. Index-on-gp3 against cold Parquet on Glacier Instant Retrieval is 96.7× on storage alone, which is the largest number in the table and the most easily abused, so it comes with a condition. Glacier IR charges $0.03 per GB on retrieval, which makes it a tier for data you genuinely rarely read, and any claim that cites the 96.7× has to either stay storage-only or model how often the data actually gets pulled back. The honest version of the cold-tier number is that it is enormous for archival you touch a few times a year and shrinks fast if your analysts are querying it weekly.
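One simple way to model how often the data gets pulled back is to fold the retrieval charge into an effective price per GB-month. The sketch below assumes a single hypothetical parameter, the fraction of stored bytes retrieved each month, and ignores request fees and any minimum-duration charges, so treat it as a lower bound on the cold-tier cost.

```python
INDEX_BYTES, COLD_BYTES = 186.8, 38.6              # bytes per event, measured
GP3, S3_STANDARD, GLACIER_IR = 0.08, 0.023, 0.004  # USD per GB-month
GLACIER_IR_RETRIEVAL_USD_PER_GB = 0.03


def glacier_effective_price(read_fraction_per_month):
    """Storage plus retrieval, in USD per stored GB per month.

    read_fraction_per_month is a hypothetical workload parameter:
    0.1 means 10% of the stored bytes are retrieved every month.
    """
    return GLACIER_IR + read_fraction_per_month * GLACIER_IR_RETRIEVAL_USD_PER_GB


def index_vs_cold_ratio(read_fraction_per_month):
    """Index-on-gp3 cost divided by cold-Parquet-on-Glacier-IR cost."""
    cold = COLD_BYTES * glacier_effective_price(read_fraction_per_month)
    return (INDEX_BYTES * GP3) / cold


# Read fraction at which Glacier IR stops being cheaper than S3 Standard.
break_even_vs_s3 = (S3_STANDARD - GLACIER_IR) / GLACIER_IR_RETRIEVAL_USD_PER_GB

for fraction in (0.0, 0.01, 0.1, 0.5):
    print(f"read {fraction:>4.0%} per month -> {index_vs_cold_ratio(fraction):5.1f}x")
print(f"break-even against S3 Standard at {break_even_vs_s3:.0%} read per month")
```

Under this model the storage-only ratio of roughly 97× falls to the mid-fifties once a tenth of the archive is read back each month, and Glacier IR loses its price advantage over S3 Standard at about 63% read per month.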

What the number is not

A storage floor, not a TCO model.

I want to be precise about what this measures, because storage-cost arguments earn their bad reputation by sliding from one thing into everything. This is the storage floor and only the storage floor: no compute, no licensing, no ops labor, no egress, no IOPS or throughput add-ons beyond the gp3 baseline. A real platform spends on all of those, and on some workloads the compute and license lines dwarf the storage line, so the 14.8× is a floor under the storage share of the bill rather than a claim about the whole bill.

The byte ratios are also parameters of this particular corpus, a flat sixteen-column Zeek conn schema, and they will move on other data. Nested OCSF with deep object structure, or long free-text message fields, compress differently in an index than in columnar Parquet, sometimes narrowing the footprint gap and sometimes widening it, so the right move in an engagement is to re-measure on the actual workload rather than to carry these exact ratios across. The shape of the finding, that the storage class dominates the codec and that retention is the multiplier, is the portable part; the specific 4.2× is not.

And no real architecture runs everything on one tier. Nobody serves seven years of data from a hot index, and nobody should, so the all-hot row is a strawman if you read it as a proposal. The honest comparison the table enables is a tiered design, a hot tier for the recent seven-to-thirty days plus warm and cold for the long tail, against an all-hot design that keeps everything on the index because the platform makes tiering hard. That second design is more common than it should be, because a lot of SIEM pricing couples ingestion and retention so that old data keeps paying the hot rate, and the table is really a measurement of what that coupling costs.
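The tiered-against-all-hot comparison can be sketched with the same inputs. The tier boundaries here are hypothetical (thirty days hot, out to one year warm, the rest cold); the footprints and prices are the ones measured and quoted above, with decimal terabytes and no Glacier retrieval charges assumed.

```python
RAW_BYTES_PER_EVENT = 374.3
GB_PER_DAY = 1000          # 1 TB/day raw ingest, decimal terabytes assumed
RETENTION_DAYS = 7 * 365

# Hypothetical tier boundaries: (name, first day, last day, bytes/event, USD/GB-month)
TIERS = [
    ("hot: OpenSearch on gp3", 0, 30, 186.8, 0.08),
    ("warm: Iceberg on S3", 30, 365, 44.0, 0.023),
    ("cold: Parquet on Glacier IR", 365, RETENTION_DAYS, 38.6, 0.004),
]


def tier_cost(first_day, last_day, bytes_per_event, usd_per_gb_month):
    """Monthly storage cost of the data aged between first_day and last_day."""
    raw_gb = (last_day - first_day) * GB_PER_DAY
    return raw_gb * bytes_per_event / RAW_BYTES_PER_EVENT * usd_per_gb_month


tiered = sum(tier_cost(a, b, bpe, price) for _, a, b, bpe, price in TIERS)
all_hot = tier_cost(0, RETENTION_DAYS, 186.8, 0.08)

for name, a, b, bpe, price in TIERS:
    print(f"{name:<30} ${tier_cost(a, b, bpe, price):>9,.0f}")
print(f"{'tiered total':<30} ${tiered:>9,.0f}")
print(f"{'all-hot on the index':<30} ${all_hot:>9,.0f}")
```

With these boundaries the tiered design lands around $3,000 a month against roughly $102,000 all-hot, storage only. Moving the boundaries changes the total, but the hot tier stays bounded by its window while the all-hot bill keeps growing with retention.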

Finally, this is not the old desk-derived 130–227× storage-cost ratio I have used before, and these numbers do not confirm or replace it. That number came from list-price arithmetic, Splunk's indexed-storage rate of roughly $3 to $10 per GB-month divided by S3 Standard's $0.023, so it was always storage list price over storage list price, with compute and license sitting on top of it rather than inside it. This benchmark measures bytes and storage prices directly, and what it does is put a measured storage floor of roughly 15× to 97×, depending on tier, under that same storage ratio, which is a smaller and more defensible thing to stand on than the desk arithmetic.

What to do with it

Design the tiers before you tune the codec.

The practical reading is an ordering of decisions. The first decision, the one that moves the bill most, is where data is allowed to live as it ages, which means choosing a storage layout that lets old data fall from block storage to object storage and then to a cold class without becoming unqueryable. An open table on S3 gives you that fall; a vendor-managed hot store generally does not, because the bytes can't leave the tier the vendor serves them from. The second decision, codec and encoding tuning, is real and worth doing, but it is a tens-of-percent lever and it should not be where the project starts.

For a regulated firm the math is sharper still, because the retention window is fixed by rule and the all-hot bill grows linearly inside it. If 17a-4 or DORA obliges seven years of retained telemetry, the question is not whether to tier but how aggressively, and the $95,000-a-month gap per terabyte per day is the budget that a tiered design hands back. That is the regulated-firm case writing itself: the firms with the longest mandatory retention are exactly the firms with the most to save by getting the storage classes right, which is why the lakehouse argument is strongest in financial services rather than where thirty days of retention is plenty.

None of this says the index is the wrong tool. The index earns its bytes and its block storage on the recent data an analyst hits constantly, where instant term lookups are worth paying for. The latency side of that same trade, where the index wins the cheap lookups while the columnar engines win the heavy hunting aggregations, is measured in the lab. The argument here is narrower and only about cost: keep the index for the window where its speed is worth its price, and let everything older fall to a tier that charges what cold data should cost. The storage class is the lever, retention is the multiplier, and the codec is a detail you can tune after both of those are decided.