Technology deep-dive

Materialized views for security data.

Materialized views are the difference between “the lakehouse is fast enough” and “the lakehouse needs a cache layer.” The decision sounds binary; in practice it’s a per-workload choice across half a dozen engines whose implementations have almost nothing in common. ClickHouse’s AggregatingMergeTree, RisingWave’s incremental view maintenance, Dremio Reflections, Snowflake Dynamic Tables, Materialize, Trino, and ksqlDB each make different trade-offs, and the right answer depends on whether the view is feeding a detection rule, a behavioral baseline, or an analyst hunt.

Reading time: about 18 minutes. Evidence tier: A–B (vendor documentation, Netflix and Airbnb production case studies, peer-reviewed computational complexity research). One Tier A primary source on the q-hierarchical dichotomy that bounds what incremental maintenance can do.

Engine pattern · materialized views

A materialized view precomputes the expensive part of a detection once and refreshes it in the background, so the hot dashboard or recurring rule reads a small summary instead of re-scanning petabytes — paying in storage and a little lag for a large drop in query cost.

The promise

Compute once, query many times.

Vendor documentation and production case studies report impressive speedups, though most of these are vendor-published numbers rather than figures I’ve reproduced, so treat them as Tier C until someone measures them independently. A single-developer benchmark (Sid Ngeth, 2025) on a synthetic Rails dataset reports PostgreSQL materialized views running 350× to 9,000× faster than equivalent base-table queries, which is a wide spread that depends heavily on the query and the base-table size. Snowflake’s published figures (vendor docs; Tier C) report roughly 78% query improvement with 21% cost reduction on single-table aggregations. A practitioner write-up (unsafehex.com, “tstats: afterburners for your Splunk threat hunting”) documented a 270× improvement on Splunk Data Model Acceleration: 10.2 seconds dropping to 0.038 seconds for an accelerated tstats query. Airbnb’s Riverbed platform processes 2.4 billion events per day using around 50 materialized views in a lambda architecture.

My own lab number sits inside that range and sharpens which part of it is the durable win. Serving a materialized view of twelve to two thousand rows against a twenty-million-row base, the read speedup was 45× to 77×, comfortably above the 5–30× a lot of the vendor framing claims for the small-aggregate case. The honest counterweight is maintenance, because refreshing the view incrementally was about 2.9× cheaper than recomputing it for a low-cardinality aggregate but only broke even for a high-cardinality one at this base size, and the incremental win grows as the base-to-batch ratio does, so the read acceleration is the robust result and the maintenance economics are what actually decide whether a given view earns its place. Storage overhead was negligible, well under a hundredth of a percent. Tier B, single machine.

The theoretical value for security operations is obvious: dashboards refresh in milliseconds rather than minutes, compliance reports finish in seconds, and behavioral baselines that previously scanned terabytes complete sub-second. Three questions determine whether the promise holds: do refresh costs exceed query savings, does schema volatility invalidate views faster than they provide value, and do complex correlation rules force full refresh regardless of incremental capabilities?

The answer to all three is “it depends on the engine, the workload, and the architectural choices around the view.” Security data’s specific characteristics (high schema volatility, complex multi-table correlation, continuous append) create failure modes that general analytics workloads rarely hit. The rest of this essay walks through the per-engine differences and the four-tier strategy I recommend for getting it right.

Per-engine landscape

Seven engines, seven implementations.

The word “materialized view” suggests a uniform concept. In practice, each engine ships a materialization mechanism with different incremental maintenance capabilities, different refresh semantics, and different failure modes under schema change. The choice of engine constrains the choice of workload more than vendors typically advertise.

ClickHouse AggregatingMergeTree

ClickHouse implements materialized views as INSERT triggers on top of the MergeTree family of table engines. The standard pattern is to define an AggregatingMergeTree (or SummingMergeTree for simpler cases) as the target table, then attach a materialized view that pre-aggregates incoming rows on write. The aggregation state is stored alongside the data and merged in the background as parts compact.

The mechanics are different from the SQL standard idea of a materialized view. ClickHouse’s view is not refreshed; it is updated continuously on every INSERT against the source table. Aggregation functions (sum, count, uniqExact, quantileTDigest) maintain partial state in the target table’s AggregatingMergeTree rows, and the merge logic at read time combines those partial states into a final result. For append-only security telemetry, this is close to ideal: refresh latency tracks ingestion latency, and there is no separate refresh job to schedule.

The constraint is the workload shape. ClickHouse materialized views are most natural for single-table aggregations on append-only logs. Multi-table joins inside a materialized view are possible but operationally awkward: the view only sees the table it was attached to, so a JOIN against a slowly-changing dimension table requires either a dictionary lookup or careful ordering of writes. For the “authentication × endpoint inventory” correlation pattern, ClickHouse dictionaries may be the better tool than a materialized view.

A plausible read of Netflix’s reported ClickHouse architecture is that they run 5 PB/day through the ClickHouse layer with no materialized views on it — an inference from how they describe the deployment rather than a figure Netflix states as such, so I’d hold it as a working read, not a quoted claim. At sufficient scale, the operational complexity of managing view definitions and merge state can exceed the query performance benefit, and the public account has Netflix optimizing base-table partitioning and pushing materialization into downstream serving layers where staleness is tolerable. See ClickHouse at petabyte scale for the longer treatment.

RisingWave: streaming-native incremental view maintenance

RisingWave is a streaming database that treats materialized views as the primary abstraction. Every materialized view is incrementally maintained using the streaming engine’s dataflow graph: no separate refresh job, no batch recomputation. Incremental view maintenance (IVM) is the technical term: when a row arrives on a source stream, the engine computes the delta to every downstream materialized view and updates state in place.

The trade-off RisingWave makes is the trade-off Materialize and Feldera also make: state has to live somewhere. A streaming join between two tables requires keeping enough state to compute the join result as either side changes, which can mean materializing significant intermediate state on disk. For security workloads with append-heavy sources and bounded join keys (user IDs, IPs, hostnames), this is usually tractable. For unbounded joins (every event joined against every other event in a time window), state can grow faster than the value.

RisingWave’s sweet spot is the streaming detection tier: correlation rules that need sub-second to low-single-digit-second latency from event arrival to alert. The PostgreSQL-compatible SQL surface lowers the learning curve relative to Flink, and the IVM guarantees mean the latency budget is predictable. The directional claim of “1000× lower latency than traditional OLTP replicas” that Materialize publishes applies to RisingWave by similar mechanism, with the same caveat: that’s an in-vendor benchmark and security-specific numbers are not yet published.

Dremio Reflections

Dremio Reflections are a different shape entirely. A Reflection is a pre-computed result set (aggregation Reflection) or alternate physical layout (raw Reflection) stored as Parquet alongside the source data, transparent to the query. Analysts write SQL against the logical view; the query planner decides whether to use the base table or a Reflection. The user does not reference the Reflection by name.

Reflections support incremental refresh on Iceberg sources when the underlying table change pattern is compatible. The documented limitation is the multi-table case: “if two or more tables refreshed since the last refresh, full refresh is used.” For security correlation rules joining three to five tables that all change continuously, full refresh is the normal path, which means the Reflection’s cost-benefit is determined by query frequency rather than incremental efficiency.

Reflections are a strong fit for compliance dashboards and scheduled analytical reports against Iceberg-resident security data: queries that run dozens to hundreds of times per refresh cycle, against tables with stable schemas. They are a poor fit for volatile detection rules where the query shape changes weekly. The Dremio semantic layer makes the abstraction work in practice; see Dremio semantic layer for the architectural picture.

Snowflake materialized views and Dynamic Tables

Snowflake ships two different things under adjacent names. Snowflake Materialized Views are the older feature, with significant restrictions: single-table only, no joins, no window functions, no non-distributive aggregates (no MEDIAN, no PERCENTILE_CONT). Snowflake Dynamic Tables are the newer approach: they support joins, window functions, and target-lag freshness control, with automatic refresh strategy selection between incremental and full recompute.

Schema change behavior is where Snowflake’s aggressiveness shows. Any column modification or drop on a source table suspends all materialized views dependent on that table, even if the modified column is not referenced in the view definition. For security logs where OCSF schema evolution and vendor-driven field additions are routine, this is the operational tax. Dynamic Tables handle additions better when the view uses explicit column lists, but renames and type changes still require recreation.

Snowflake’s own deployment guidance is explicit: “start slowly with this feature, create only a few materialized views on selected tables, and monitor the costs over time.” The vendor-recommended caution is itself useful evidence; if the platform owner says selective deployment, that is a statement worth weighting, not marketing hedging.

Materialize and Feldera: differential dataflow

Materialize is built on differential dataflow, a computational model that maintains query results incrementally by tracking deltas through a dataflow graph rather than storing result sets and recomputing. The practical consequence is that Materialize supports incremental maintenance for multi-table joins with concurrent changes on all sides: the case that forces full refresh on most other engines.

For window functions, Materialize’s documentation states: “when an input record in a partition is added, removed, or changed, Materialize recomputes results for the entire window partition.” The performance constraint is that touched partitions stay under roughly one million rows per second of change. Security incident bursts (alert storms, widespread malware outbreaks) can exceed that, so the engine is well-matched to steady-state correlation but may degrade during incidents, which is exactly when latency matters most.

Feldera shares the differential-dataflow lineage with a different deployment model. Both are strong candidates for detection-tier correlation in teams with the engineering depth to operate a streaming database alongside the primary lakehouse.

Trino materialized view support

Trino supports materialized views through the Iceberg connector (and a few others), with refresh triggered explicitly via REFRESH MATERIALIZED VIEW. The refresh is full recompute by default; incremental refresh against Iceberg sources is possible when row lineage is available, which is a 2026 capability rolling out through Trino releases. The view definition is stored in the metadata catalog, and queries against the logical view name transparently route to the materialized data when it’s fresh.

Trino’s sweet spot for security workloads is the analyst-tier query layer over Iceberg: materialized views for known dashboards and compliance reports, base-table scans for ad-hoc hunting. The federation story matters here: Trino can materialize a view that joins Iceberg security data against a PostgreSQL asset inventory, which is a pattern several other engines on this list cannot do without separate ingestion.

ksqlDB

ksqlDB sits in a different category: it’s a Kafka-native streaming SQL engine that produces continuous queries materialized as Kafka topics or local state stores. For simple aggregations on a single stream (count failed authentications per user per five-minute window), ksqlDB is the lowest-friction option in the Kafka ecosystem.

The limitations are severe for complex correlation. ksqlDB supports only single-column join keys, has no SQL-style window functions (only Kafka Streams windowing primitives), implements COUNT_DISTINCT as an approximate HyperLogLog rather than exact, and has no native MEDIAN or PERCENTILE support. For detection-engineering teams whose rules already live in Kafka Streams, ksqlDB is a natural fit. For teams trying to express sophisticated multi-table correlation, it is not the right tool; Materialize, RisingWave, or Flink SQL are better matches.

Refresh strategy

Lazy, eager, incremental, full: what these words actually mean.

The vocabulary around refresh strategy is loaded with vendor-specific meanings. The honest taxonomy:

Eager refresh updates the view as part of the same transaction that writes to the source, so it stays consistent on read but it is expensive on write, which is why most analytical engines do not offer true eager refresh: the latency tax is too high for OLAP workloads.

Lazy refresh updates on a schedule (every five minutes, hourly, nightly) or when explicitly triggered. How stale the view can get is the freshness budget. Dremio Reflections, Snowflake Materialized Views, and Trino materialized views are lazy by default.

Continuous incremental maintenance (the IVM model used by RisingWave, Materialize, and ClickHouse’s INSERT-trigger views) updates the view as each source row arrives. There is no batch boundary; freshness budget equals streaming latency.

Incremental refresh (in the lazy sense) processes only rows that arrived since the last refresh rather than recomputing from scratch. Simple aggregations on append-only sources can almost always be incremental. Multi-table joins, window functions, and non-distributive aggregates frequently cannot.

Full refresh recomputes the entire view from the source: the universal fallback when the query pattern exceeds the engine’s IVM capabilities. Cost equals the cost of running the underlying query, so a view with $1 query cost and daily full refresh consumes $30 per month in refresh compute regardless of how often it’s read.

The hard limit

Some query patterns are provably unmaintainable incrementally.

This is not an engineering limitation. It’s a computational complexity result. Berkholz, Keppeler, and Schweikardt proved the q-hierarchical dichotomy for incremental view maintenance: a restricted class of join queries (the q-hierarchical class) can be maintained in O(1) time per update, and every other query provably requires at least O(N^1/2) update time under the standard Online Matrix-Vector Multiplication conjecture. That’s Tier A peer-reviewed evidence, not a vendor benchmark.

The practical translation is that some query patterns are fundamentally impossible to maintain incrementally in sub-linear time, regardless of how clever the implementation is, and the most common security-relevant instances are these:

Non-distributive aggregates (MEDIAN, PERCENTILE_CONT, MODE) cannot be computed from sub-aggregates. To calculate the 95th percentile of authentication latency, the engine needs access to all underlying values, sorted. When any row changes, the entire aggregate must be recomputed. This rules out incremental maintenance for the common “alert when this user’s authentication time is outside their 95th-percentile baseline” pattern, on almost every engine.

Window functions (ROW_NUMBER, RANK, LAG, LEAD) establish an ordering across an entire partition. Any insertion or deletion within a partition requires recomputing the window values for every row in that partition. For a process-ancestry view that uses LAG(process_name) OVER (PARTITION BY endpoint_id ORDER BY timestamp), every new process event on a high-activity endpoint triggers a partition-wide recomputation. Build servers and developer machines that generate 1,000+ process events per minute can keep the IVM engine continuously busy with no net query benefit.

Multi-table joins with concurrent changes are where most engines diverge from Materialize. Dremio Reflections explicitly document that “if two or more tables refreshed since the last refresh, full refresh is used.” Most lateral-movement detection rules join authentication logs, IP reputation, and malware detection tables (all of which change continuously). The result on most platforms is full refresh on every cycle, which means the materialized view provides zero incremental benefit.

The trap to avoid is contorting a sophisticated behavioral detection into an IVM-compatible shape and losing detection efficacy in the process, because if the rule needs PERCENTILE_CONT to express the threat model correctly, then accepting full refresh keeps the detection intact, whereas rewriting the rule in a weaker form just to get a materialized view to update incrementally trades away the thing you were trying to detect.

Schema volatility

Security logs change shape constantly. Most engines treat that as a refresh trigger.

Security log schemas evolve for reasons outside the data team’s control. New log sources arrive when vendors are acquired. EDR agents add fields on upgrade. OCSF v1.3 added 40+ optional fields over v1.2. The base rate of schema change on a security data lake is closer to weekly than yearly.

Engine responses vary widely. Snowflake suspends all materialized views dependent on a table when any column is dropped or modified, even when the modified column is not referenced by the view. Snowflake Dynamic Tables behave better with explicit column lists but still require recreation on rename and type change. Dremio Reflections mark themselves INVALID on schema change, requiring manual refresh that can take hours at petabyte scale. Databricks materialized views trigger full recompute on any schema change. ClickHouse’s INSERT-trigger model is more forgiving in the additive case (new source columns can be ignored without breaking anything) but renames and type changes still require careful migration.

The production-validated mitigation is the staging-view pattern: define a logical (non-materialized) staging view that absorbs vendor field-name variations and schema evolution, then materialize from the stable staging view. Schema changes in the raw table only require updating the staging view’s coalesce logic, not rebuilding the materialized view.

Microsoft Sentinel handles schema volatility at the ingestion layer using Data Collection Rules and ASIM. Google Chronicle uses the Universal Data Model for schema-on-write normalization. Normalizing at ingestion to absorb upstream churn before any materialization may be the right architectural choice when the materialized view layer is fragile and the ingestion layer is robust.

The economics

When refresh costs exceed query savings.

The cost equation for a materialized view is roughly: query cost times query frequency must exceed refresh cost times data change rate plus storage cost. That phrasing is simple to write down and hard to evaluate in practice, because the four terms vary independently across workloads.

Consider an authentication-failures dashboard. 50 million events per day at 1 KB each is 50 GB/day. Without materialization, the dashboard query scans 50 GB, runs the aggregation, and returns in 30–90 seconds at roughly $0.50–1.50 of Snowflake compute. With a materialized view, the dashboard returns a pre-computed aggregation in 1–3 seconds at $0.01–0.05 per query.

The apparent saving is 95%+ on query cost. But refresh is the offset. Full refresh re-aggregates all 50 GB daily at the same $0.50–1.50 cost. Incremental refresh processes only the new 50 GB partition, same dollar figure. If the dashboard refreshes every 15 minutes (96 times per day), the view delivers a roughly 95× cost saving over the unmaterialized path. If the dashboard refreshes once per week (~0.14 times per day), refresh cost exceeds query savings and the view is a net loss.

Multi-table correlation rules where all source tables change continuously land in the worst case. Dremio’s “two or more tables refreshed since last refresh” rule means full refresh every cycle on a lateral-movement detection joining auth logs (50 GB/day appends), IP reputation (2 GB/day updates), and malware detection (5 GB/day appends). The refresh scans 57 GB daily (the same scan the unmaterialized query would do) with no incremental benefit.

AWS Redshift’s AutoMV feature acknowledges this directly: it creates materialized views with the highest cost-benefit ratio and automatically drops views when net benefit falls below threshold. The automation is itself an admission that views don’t stay economically viable indefinitely. Databricks Predictive Optimization automates OPTIMIZE, VACUUM, and ANALYZE but notably does not automatically create materialized views; the platform optimizes table layout but leaves materialization decisions to engineers, an implicit acknowledgment that automated MV creation is fraught with risk when the workload mix is unpredictable.

The framework

Four tiers, four workload classes.

The framework I recommend is a four-tier materialization strategy where each tier maps to a workload class rather than to a comfort level, so the tier a workload belongs in is determined by whether it serves the detection budget (machine-speed, seconds to sub-second) or one of the analyst budgets (hunting, analysis, baseline computation, which tolerate minutes to hours).

One calibration note before the tiers. After CrowdStrike’s 2026 Global Threat Report reported a 27-second fastest recorded adversary breakout time, any correlation rule used for detection has to live in the streaming tier, not the micro-batch tier. A 5–15 minute refresh cadence on a correlation rule is structurally a 5–15 minute floor on time-to-alert, incompatible with the modern threat model.

Tier 1: streaming detection (sub-second to low-single-digit-second)

Platforms: RisingWave, Materialize, Feldera, Flink SQL, or pipeline-detection engines like Tenzir or Cribl. Workload: every detection rule that fires alerts. Example: count failed authentications per user per five-minute window, alert when greater than 10. Refresh model: continuous incremental maintenance. There is no refresh, only stream processing latency.

The property that makes this tier non-negotiable is that at a 27-second adversary breakout time a detection that takes five-plus minutes to fire has already missed the response window by more than 10×, so the tier exists precisely because the detection budget cannot tolerate batch boundaries.

Tier 2: micro-batch baselines (5–15 minute refresh)

Platforms: Dremio Reflections, Databricks materialized views with incremental refresh, ClickHouse AggregatingMergeTree. Workload: behavioral baselines, peer-group statistics, rolling-window aggregates, near-real-time dashboards. The output of this tier feeds the detection tier (for example, as a baseline the streaming layer compares against) but does not itself fire alerts on the detection budget.

Example: median processes created per user per day. Simple aggregation, append-only source, incremental refresh is straightforward.

Iceberg V3 row lineage is worth checking here: rather than relying on a watermark column that silently drops late arrivals, an engine that surfaces row-level timestamps can query by last-updated directly, which makes incremental refresh more correct, and the mechanics and per-engine support timeline are in the Iceberg V3 thesis-shift piece.

Tier 3: batch analysis (daily or weekly full refresh)

Platforms: scheduled Spark or Trino jobs, dbt models with full refresh, ClickHouse with scheduled materialization. Workload: complex analytics, historical baselines, compliance reporting, 95th percentile authentication time per user over a 30-day rolling window. This is the work that exceeds IVM capabilities; the refresh cadence acknowledges the complexity rather than fighting it.

Tier 4: data lake hunting (no materialization)

Platforms: Iceberg tables queried directly via ClickHouse, Trino, or DuckDB. Workload: ad-hoc threat hunting where queries are written during investigations and the analyst is in the loop. Example: “show me all DNS queries to domains registered in the last 48 hours.” No materialization. The query pattern is unpredictable, schema-on-read flexibility is essential, and the analyst can tolerate the scan latency because they’re actively waiting for the result, and a materialized view has little to offer here because the next query is rarely the previous query.

Engine selection

Matching engines to tiers.

The per-engine analysis and the four-tier framework collapse into a small set of practical recommendations.

Tier 1 (detection): RisingWave (PostgreSQL syntax, accessible), Materialize (most expressive multi-table correlation), or Flink SQL (deepest ecosystem, steepest operational curve). ksqlDB sits below in capability: usable for simple aggregations, insufficient for complex correlation.
Tier 2 (baselines): ClickHouse with AggregatingMergeTree for append-heavy single-table aggregation, Dremio Reflections for dashboard-style acceleration over Iceberg, or Snowflake Dynamic Tables if the organization is Snowflake-committed and schemas are stable.
Tier 3 (batch analysis): prefer scheduled jobs over materialized views. The work is complex enough that incremental maintenance doesn’t apply, and explicit scheduling makes refresh cost visible rather than hidden in a view definition.
Tier 4 (hunting): do not materialize. Query Iceberg directly via Trino or ClickHouse, driven by latency tolerance and SQL-dialect familiarity. See Iceberg vs Delta Lake for the underlying table-format choice.

Deployment heuristics

When to deploy materialized views, and when to walk away.

Deploy materialized views when all of the following hold:

Query frequency dominates data change rate: a dashboard refreshing every five minutes against data updating hourly is a 12:1 ratio.
Refresh cost is materially lower than query cost savings: $50/day refresh against $500/day in query savings is a 10× benefit.
Schema is stable: log source schema changes less than once per month.
Query patterns are IVM-compatible: simple aggregations, single-table or append-only multi-table.
The deployment is piloted: 3–5 views measured for actual performance gain and refresh cost before expanding.

Avoid materialized views when any of the following hold:

Schema volatility is high: new log sources weekly, vendor format changes monthly.
Complex correlation requirements dominate: five-table joins with window functions and non-distributive aggregates.
Query patterns are unpredictable: threat-hunting workloads where queries are written during investigations.
Data change rates are high across all source tables: multi-table concurrent changes force full refresh on most engines.
The team lacks the operational expertise to debug materialized view refresh failures at the rate they’ll occur.

The read that Netflix runs 5 PB/day with no materialized views at the ClickHouse layer (again, an inference from their reported architecture rather than a stated Netflix claim) points to a deliberate trade-off worth taking seriously, where operational simplicity and schema flexibility outweigh query performance optimization once base-table performance is already acceptable.

Conclusion

Materialized views are tools, not solutions.

The architectural constraints are real. Refresh costs can exceed query benefits for high-change-rate workloads. Schema volatility does trigger aggressive invalidation on most engines. Complex queries do force full refresh regardless of data delta size. Selective deployment is best practice even when automation is available.

The right MV strategy depends on the workload class, so detection-rule MVs belong in streaming engines (RisingWave, Materialize, Flink), while analyst-hunt work has no materialized view at all because it lives in the data-lake layer. Compliance dashboards and behavioral baselines are where materialized views earn their keep, and even there the engine choice (ClickHouse AggregatingMergeTree, Dremio Reflections, Snowflake Dynamic Tables) constrains the workload more than vendor marketing suggests, because the q-hierarchical dichotomy is a complexity result rather than an engineering limitation, and it will not be engineered away.

The discipline is to measure rather than assume, so deploy 3–5 materialized views for high-value use cases, instrument refresh costs and query savings, and expand only where the economics and operational stability justify it. For security data infrastructure that prioritizes manageability over extreme performance, this selective approach aligned with the four-tier framework provides the best balance of performance, cost, and operational sustainability.