Security Data Works

Detection engineering deep-dive

Row lineage as the missing CDC primitive for detection engineering.

Iceberg V3's row-level lineage (row IDs plus last-updated tracking carried in metadata) is being marketed as a Change Data Capture enabler, which undersells it, because for detection engineering specifically row lineage is the piece that turns four otherwise painful workflows into ordinary metadata queries. It is also the V3 feature with the least mature engine support, which is why I'm careful to frame this as a roadmap-shaping argument rather than a production guide.

Reading time: about 16 minutes. Evidence tier: B for the V3 spec mechanics (Apache Iceberg specification, vendor documentation), C to D for the security-engineering use cases. Those are my synthesis of where row lineage may matter, not validated against production deployments. As of mid-2026, row lineage has experimental Spark support and roadmap status in Trino and DuckDB; Snowflake added catalog-side support in May 2026, but query-engine surfacing of the row-level columns lags the catalog work. Production-ready use cases are forecast, not observed.

The short version

What row lineage actually changes.

Iceberg V3 introduces two related pieces of metadata that the table format now tracks for every row. Row IDs are stable, table-scoped identifiers assigned at insert time. They survive compaction, file splits, and merges, because they live in the metadata layer rather than in the data file. Last-updated timestamps record the snapshot timestamp at which each row was last modified (insert, update, or mark-deleted), again carried in metadata rather than in the row payload itself.

The phrase "carried in metadata" is doing real work in that paragraph. It means a query like SELECT * FROM cloudtrail WHERE _row_last_updated > '2026-04-01' can be answered by reading manifest files (the small index files that describe what the table contains) and only touching the data files for rows that actually qualify, so there is no full table scan and no separate audit table, because the metadata layer answers the "what changed since" question and the data files are read only for the rows that match.

For detection engineering, that one capability ("tell me which rows have changed since timestamp T, cheaply, without external infrastructure") turns out to be the missing piece behind four otherwise painful workflows:

  • Re-firing a changed detection rule against historical data without re-evaluating already-processed rows.
  • Handling late-arriving events in dbt incremental models without silently dropping the late ones.
  • Answering compliance audit questions ("which rows are governed by which policy version") without an external audit log.
  • Feeding a CDC (Change Data Capture) stream from an Iceberg table to a downstream consumer without Debezium or snapshot diffing in the middle.

None of these are new problems, and all of them have V2-era workarounds that involve external metadata, watermark columns, or full-table re-evaluation, so what's new is that row lineage makes them solvable inside the same Iceberg tables the detection queries already run against. The catch (and it is a catch worth flagging) is that the query engine has to surface the row-level columns, which makes engine support the gating factor, and as of mid-2026 only a few engines have shipped it.

Use case 1

Re-firing changed detection rules.

This is the use case I think will matter most for detection-engineering teams operating at higher maturity levels: the teams that iterate on detections weekly rather than annually.

Picture a detection rule running against a 90-day rolling window of CloudTrail data. The rule has been live for six months. Last week, an analyst updated the rule to add a new indicator of compromise: a recently discovered command-and-control domain. They want to re-evaluate the new version of the rule against the existing 90 days of CloudTrail data, but only against events that haven't already been evaluated by this version. They don't want to re-fire alerts that already fired under the previous version, and they don't want to spend compute re-processing events that haven't changed.

The V2-era answer is brute force: run the new rule against the full 90 days, deduplicate alerts against the existing alert store, and accept the cost of re-evaluating every event. For high-volume tables (CloudTrail at enterprise scale routinely produces 100+ GB per day), this is hours of compute and tens of dollars per re-evaluation. Detection engineers do it grudgingly, and only when the rule change is significant enough to justify the cost. The friction is real enough that I've watched teams batch rule updates into weekly windows specifically to amortize the re-fire cost, which is exactly the kind of process artifact that slows down the rule-iteration loop.

With V3 row lineage, the new rule can be run only against rows whose _row_last_updated is newer than the deployment timestamp of the previous version of the rule:

-- Pseudo-SQL: exact syntax depends on engine V3 support
WITH rule_versions AS (
    SELECT 'v1.7' AS version, TIMESTAMP '2026-04-10 09:00:00' AS deployed_at
    UNION ALL
    SELECT 'v1.6', TIMESTAMP '2026-03-22 15:00:00'
)
SELECT
    eventTime,
    userIdentity_userName,
    eventName,
    'v1.7' AS evaluated_by_version
FROM cloudtrail
WHERE _row_last_updated > (SELECT deployed_at FROM rule_versions WHERE version = 'v1.6')
  AND sourceIPAddress IN (SELECT ip FROM new_iocs);

For an append-only table like CloudTrail, the rows whose _row_last_updated exceeds the previous deployment timestamp are exactly the new rows ingested since then. The compute cost drops from "re-evaluate the entire 90-day window" to "evaluate the delta since the previous deployment." For a rule that's been live for six months and changes weekly, that's typically a 50 to 100 times reduction.

For detection-engineering organizations operating at higher maturity, rule updates happen multiple times per week, and the compute cost of re-evaluation is currently a friction that slows down the rule-iteration loop, so row lineage removes that friction without changing the underlying table layout or requiring an external metadata service, because the same Iceberg table that the production rule queries against already carries the information needed to do the cheap incremental re-fire.

Use case 2

Late-arriving events without silent drops.

The standard dbt incremental pattern uses a watermark column. A dbt model is told to load only rows newer than the highest event time it has already processed:

SELECT * FROM source
WHERE event_time > (SELECT MAX(event_time) FROM {{ this }})

This pattern silently drops late-arriving events. If a CloudTrail event has an event time of 2026-04-12 10:00:00 but does not arrive in your S3 bucket until 2026-04-13 14:00:00 (a 28-hour lag, which is not unusual for CloudTrail's eventually-consistent delivery model), and the next dbt run after the late arrival is at 2026-04-13 16:00:00, then the watermark MAX(event_time) is already past the late event's event_time and the late event is permanently excluded from the incremental table.

For most business intelligence workloads this is a tolerable bug, because late events are rare and the impact on a daily dashboard is small, but for security workloads it is a correctness failure, since a late-arriving CloudTrail event may be the one that proves a privilege escalation actually happened, and dropping it silently is the kind of failure that makes auditors unhappy and incident responders unhappier.

With row lineage, the incremental query keys off the row's last-updated timestamp (when the row was written into the Iceberg table) rather than the event time embedded inside the row:

SELECT * FROM source
WHERE _row_last_updated > (SELECT MAX(_row_last_updated_at_last_run) FROM {{ this }}_state)

The row last-updated timestamp is processing time, not event time. A late-arriving event has a recent _row_last_updated (because it was just inserted), so the incremental query picks it up regardless of the embedded event time. The dbt model then processes it correctly, stores the event with its actual event time, and updates the state table with the new _row_last_updated watermark.

The detection-engineering version of this pattern: rules that look back over a window (for example "alert if 5 failed logins in 10 minutes") need to handle late-arriving events to avoid false negatives. The V2 workaround is to re-evaluate the window every time new data arrives, which is wasteful. Row lineage lets the rule evaluator know exactly which events have changed since the last evaluation, so it can re-evaluate only the windows that contain changed events. The correctness improves and the compute cost drops at the same time.

Use case 3

Compliance audit trails without external infrastructure.

PCI-DSS, HIPAA, and GDPR each have requirements that read roughly as: demonstrate that all data covered by policy version X has been processed under policy version X. When the policy changes (a new retention rule, a new masking requirement, a new export restriction), you need to point at every row in the table and say which policy version applied at the time of last modification.

This is the audit-trail problem that security teams have struggled with for as long as security data has lived in lakehouses. In the V2 world, answering it requires either an external audit log (every write to the Iceberg table mirrored to a separate audit table that records the timestamp and the policy version active at that time) or full table snapshots, one per day, retained for the audit period and queried via Iceberg time travel. Both are operationally expensive, both create their own consistency problems, and both ship a lot of bytes around to answer what is at heart a metadata-shaped question.

There is a heavier version of the same instinct worth naming, because it is the one the current "agentic analysis needs governed history" conversation keeps reaching for: adopt Data Vault for the integration layer, model each entity as a hub with its descriptive attributes in dated satellites, and let the satellite history stand in as the audit trail. Data Vault is a defensible modeling choice for estates that already run it, but adopting it specifically to obtain chain-of-custody trades a lot of extra JOIN work (every reconstruction now stitches hubs to links to satellites) for a property the table format hands you directly. The evidence requirement is append-only with "what you knew and when you knew it," and append-only snapshots plus row lineage already satisfy it, so the modeling tax is paying a second time for history the metadata already carries.

V3 row lineage makes it a metadata query:

-- Rows last touched after the new GDPR policy version went live
SELECT COUNT(*) AS rows_under_new_policy
FROM customer_events
WHERE _row_last_updated >= '2026-05-01';

-- Rows still under the old policy
SELECT COUNT(*) AS rows_under_old_policy
FROM customer_events
WHERE _row_last_updated < '2026-05-01';

The auditor's question (how many rows in this table are governed by which version of the policy) is answered by metadata, in milliseconds, without any external infrastructure. The same answer that used to require running an audit pipeline becomes a SELECT against the table the auditor already cares about.

One caveat worth naming: this assumes that touching a row (insert, update, mark-deleted) is the right semantic for "policy applies to this row." For some compliance regimes, the relevant timestamp is when the row first entered the system, not when it was last touched. V3 row lineage tracks the latter. Row IDs are stable across operations, though, so you can compose row IDs with the snapshot history to derive insert times for audit queries that need them. It is more work, but the infrastructure is still inside the Iceberg metadata layer rather than outside it.

Use case 4

CDC from Iceberg without external infrastructure.

The original V3 row-lineage marketing pitch is "CDC from Iceberg," and I want to be careful here, because this is the use case I am least excited about for detection engineering specifically. It overlaps with what streaming-first architectures already give you, and the production tooling won't catch up until 2027 at the earliest.

The use case: a downstream system (a data warehouse, a search index, a feature store) wants to stay incrementally synchronized with an Iceberg table without re-reading the entire table on every refresh cycle. CDC (Change Data Capture) is the umbrella term for the family of techniques that extract just the changes.

The V2 toolkit has three answers, each with downsides. Snapshot diffing reads the manifest list for snapshot N and snapshot N+1, identifies the added, removed, and modified files, and ships the difference downstream, but it is operationally heavy and only as granular as the file. External CDC tools like Debezium or Fivetran run against the source database that feeds the Iceberg table rather than against Iceberg itself, which adds an entire infrastructure layer. And watermark-based incremental (the dbt pattern from use case 2) drops late-arriving events.

With V3 row lineage, the downstream consumer can query the Iceberg table directly for rows changed since the last checkpoint, where each consumer maintains its own watermark on _row_last_updated, and late arrivals are picked up by construction, so there is no Debezium, no snapshot diffing, and no per-consumer audit table.

Why I am cautious about this specific use case: production CDC consumers (Debezium, Fivetran, Hightouch, RisingWave's Iceberg source) will need months to years to integrate row lineage into their existing CDC plumbing. In the meantime, teams that want CDC from Iceberg will keep using snapshot diffing or external metadata. The row-lineage CDC story is available in 2026 but probably not production-mature until 2027.

If you want CDC from a security data lake right now, the V2-era answer is still the right one: emit events to Kafka first, write Kafka into Iceberg via a streaming sink, and let downstream consumers subscribe to Kafka instead of trying to extract CDC from Iceberg directly, because Kafka is the unified log while Iceberg is the cold-tier archive, which keeps CDC a Kafka problem rather than an Iceberg problem. Row lineage may eventually displace this pattern for some use cases, but I would not bet on it before 2027.

Adoption reality

Engine support is the bottleneck.

V3 row lineage requires the query engine to surface the _row_last_updated column (or the equivalent identifier) in queries. The spec defines the metadata; the engine has to expose it. As of mid-2026, my read of the public roadmaps and release notes is:

  • Spark 4.1: experimental support for row lineage, working in development environments. This is the most advanced engine on the row-lineage curve, and the one I would point at for proof-of-concept work today.
  • Trino: on the V3 roadmap, no committed timeline as of this writing. Trino is the most-deployed query engine for security data lakes, so its timing matters disproportionately for when row lineage becomes practical in production.
  • DuckDB: on the roadmap, expected late 2026. DuckDB is the natural early-adopter environment for analyst-side row-lineage experimentation because it runs on a laptop and the iteration loop is fast.
  • Snowflake: added catalog-side V3 support in May 2026, meaning Polaris and the managed Iceberg surface can store and serve row-lineage metadata. Query-engine surfacing of the row-level columns lags the catalog work, so verify your Snowflake version's behavior before assuming parity.
  • ClickHouse: not on the V3 row-lineage roadmap. ClickHouse has its own MergeTree-native solutions for incremental and last-updated semantics, so the Iceberg V3 mechanism is less of a priority for the project.

For most production deployments, this means the use cases above are realistic only if you can run Spark 4.1 or newer for the queries that need row lineage. For multi-engine deployments (which the broader vendor-neutrality argument in Iceberg vs Delta Lake recommends), there will be a period where some engines support row lineage and others do not, and the architecture has to account for that asymmetry. I would not design a critical detection workflow around row lineage today if my production engine were Trino, ClickHouse, or DuckDB.

I am hedging deliberately throughout this section because engine support claims age badly, so verify against the release notes for the engine version you actually run before committing to a row-lineage design. The spec is real, but whether your engine writes and reads the row IDs is the question that decides whether row lineage is something you can use in production or only something you can talk about.

Honest caveats

Storage overhead and edge cases.

Storage overhead is small but not zero

Row lineage tracking is not free. Every row in a V3 table carries a row ID and a last-updated timestamp in metadata. For wide tables (security event tables typically have 50 to 200 columns) the metadata overhead is small, maybe 1 to 3 percent of total table size. For narrow tables (counter tables, lookup tables, threat-intel feeds), it can be 10 to 15 percent. That is a smaller overhead than running a separate audit log table, and it is much smaller than running Debezium. But it is not zero, and for very large narrow tables it is worth measuring before assuming row lineage is effectively free.

Compaction does not update _row_last_updated

A compaction operation (the maintenance task that rewrites small files into larger files for query performance) does not update _row_last_updated, because the row contents have not changed, so the timestamp is preserved, which is correct behavior. It also means you cannot use _row_last_updated as a proxy for "when was the underlying data file last written," which is sometimes what people want when they are diagnosing storage or write-path issues, and for those questions the snapshot timestamp and manifest history are still the right place to look.

Schema evolution does not update _row_last_updated for existing rows

If you add a column to the table, existing rows do not have their last-updated timestamp bumped. The new column is simply NULL (or the V3 default value, if you set one) for old rows. This is also correct behavior, because nothing about the existing row data changed. It does mean a query like "show me all rows that have the new column populated" cannot be answered by row lineage alone. For that, you need a predicate on the new column itself, which is a different question and a different shape of query.

Decision framework

When to reach for row lineage.

Use row lineage when

  • Your query needs to identify rows that have changed since a specific point in time.
  • You are trying to avoid re-processing rows that have not changed.
  • The cost of re-processing the entire table is non-trivial, typically true for security event tables at 100+ GB per day.
  • Your query engine supports V3 row lineage. As of mid-2026, that's effectively Spark 4.1.

Don't reach for row lineage when

  • You need detection-tier latency (sub-second to a few seconds). Row lineage is a lakehouse-tier piece. It lives in the analytical layer, not the streaming layer. For real-time detection, the streaming bus is still the right place to do the work.
  • You need CDC for an external system that can subscribe to Kafka. The Kafka-front pattern remains simpler and more production-mature than Iceberg-native CDC.
  • Your query engine does not support V3 row lineage yet. The spec without engine support is not a feature you can use.
  • The table is small enough that re-processing it on every run is cheap. Row lineage adds complexity; spend that complexity budget where it pays back.

The framing I keep coming back to is that row lineage is the right piece for hunting-tier and analysis-tier workloads on the lakehouse, while it is not a substitute for streaming infrastructure and not a way to speed up tables that are already small. For the detection-engineering teams whose iteration loop is slowed by re-evaluation cost or by late-event correctness, it may close real gaps, and for the teams without those constraints it is worth tracking before it is worth adopting.

Leading indicators

What I'm watching through 2026.

Three signals will tell us whether row lineage delivers on its promise or stays a spec-level curiosity. I treat these as the gates that decide whether the optimistic framing in this essay holds up by year-end.

The first is a production reference deployment that uses row lineage to feed a downstream system without snapshot diffing or external metadata, and as of mid-2026 I have not seen one in public, so if this has not happened by Q4 2026, the CDC story will have shifted from "promising" to "spec-only." Conference talks and engineering blog posts are the place to watch for this, because the security teams most likely to adopt early tend to write about it after the fact.

The second is dbt-iceberg adapter support for row-lineage-based incremental models, because the dbt project is the most likely vector by which row-lineage-based incremental will reach detection-engineering teams. If dbt-iceberg ships row-lineage incremental in 2026, I would expect rapid adoption, since the dbt user base will absorb it the way it absorbed incremental models in the first place, and if it does not, the user base will keep using watermark-based incremental and the row-lineage advantage will not reach the people who would benefit most.

The third is Trino V3 row-lineage support, because Trino is the most-deployed query engine for security data lakes, and its V3 roadmap is the leading indicator for whether row lineage becomes a production-grade capability in 2026 or stays experimental until 2027, so if Trino ships, the rest of the ecosystem follows quickly, and if Trino does not, expect another 12 months of "available in Spark, experimental everywhere else."

If all three land by Q4 2026, the optimistic framing in this essay holds up. If none of them do, treat row lineage as interesting in principle and not yet usable in production through 2026, and revisit the question in 2027. I am open to either outcome; the V3 spec is real either way, and the underlying argument (that row lineage closes a real gap in detection engineering) does not depend on a specific calendar quarter for engine support.

The longer arc

Why this is the V3 feature that changes the most.

Iceberg V3 ships two other features that get more airtime than row lineage, the Puffin-based deletion vectors and the Variant type, and both of those matter. I cover the three-feature recalibration in Iceberg V3 changes the thesis, and I cover how those features change the table-format decision in Iceberg vs Delta Lake.

Row lineage is the V3 feature I think changes the most for detection engineering specifically, because it turns a class of expensive workarounds into ordinary queries. Re-firing rules, handling late events, answering audit questions, feeding downstream systems: these were all solved problems in V2, but the solutions were external infrastructure layered on top of the lakehouse. Row lineage moves the answer into the metadata, which is where it should have been all along.

The reason it gets the least airtime is that the use cases are subtle. Deletion vectors and the Variant type have clean before-and-after pictures: "you couldn't do compliant erasure, now you can," or "you couldn't ingest nested JSON without a schema, now you can." Row lineage looks more like "you could already do these things, but now they're cheaper and more correct," and that framing is less viral, though I think it is the framing that matters most for the teams running production detection workflows.

If I am wrong about row lineage's importance, it will probably be because engine support takes longer than expected and the dbt-iceberg path stalls. If I am right, the next 18 months will see a slow shift in how detection-engineering teams structure their rule-iteration and incremental workflows: less external tooling, more metadata queries against the same Iceberg tables their detections already read from. That is the kind of shift that does not generate a launch event but does compound over the long arc of how security data lakes get built.