State of the field, 2026

The catalog became the control plane. For security, who enforces is still an open question.

Three independent sources in the same week of June 2026 land on one claim from three directions: the table-format fight is over, Iceberg won, and the contested ground has moved up a layer to the catalog, which now resolves metadata, vends credentials, sequences commits, and increasingly decides what an AI agent is allowed to see. For analytics that reading is right. For security data it is the question I spend my time interrogating rather than an answer I'd repeat, because the part everyone is calling the new control plane is, on close reading of the people selling it, distributed, delegated, still in beta, and non-portable across vendors.

Reading time: about 14 minutes. Evidence tier: mixed, and I flag it inline. The vendor catalog surveys I lean on are tier C and one of them is Dremio's, which sells a catalog, so I read its framing as direction and its dated facts as checkable. The measurements I cite from my own lab are tier B, first-party, single-host, and synthetic, and I say so on every number.

The convergence

Three independent sources, one week, the same claim.

Alex Merced, writing as Dremio's head of developer relations, opens his state-of-the-catalogs survey with the flat assertion that the table format question is settled and Apache Iceberg won, so the interesting fight has moved up to the catalog, which he describes as the single API boundary between every engine and every byte you own and, more pointedly, as the place that is becoming the AI control plane: every catalog roadmap he surveys is bending toward agent support, absorbing semantic definitions as governed assets, exposing machine-readable context through MCP servers, and promising governance that holds without a human in the loop.

Daniel Beach, an independent practitioner who has spent years pushing DuckDB into production, shows the same gravity pulling on the other table format. In his walkthrough of Unity Catalog commits he quotes Databricks' own commit documentation conceding that external engines writing to Delta directly in object storage cause catalog metadata to silently diverge from the actual table state, and that across engines, tools, and agents there is no standardized enforcement of row or column-level controls, which is precisely why Delta is now following Iceberg into routing writes through the catalog. When the vendor describes the failure mode its new product fixes, the admission is worth more than the pitch, and the admission here is that direct writes diverge and that nothing enforces consistent controls across the tools.

And Filip Stojkovski, who works on SecOps AI strategy and has every commercial reason to talk up the AI layer rather than the plumbing beneath it, arrives at the same place from the SOC side. His advice for a SIEM migration is to optimize the pipeline, decouple the detection engine, and only then move the data, because half-finished migrations leave logs sprawled across three places with nobody sure where they live. His line about AI is that agents sit downstream of your data, so bad data in means hallucinations out, and no AI analyst fixes a broken data architecture. A data-engineering vendor, an independent data engineer, and a working SecOps practitioner are converging on the same structural point from three vantages, which is what makes it worth taking seriously.

The steelman

For analytics, the consensus is right.

I want to grant the position at its strongest before I turn on it, because the strongest version is genuinely persuasive and most of it is true. The reason the catalog can be the control plane at all is that the Iceberg REST specification turned it into a swappable interface: start with REST and you can change catalog backends later without touching engine configuration, while if you start with the old Thrift-based Hive Metastore you inherit a migration the day you outgrow it. Once the catalog is the API boundary rather than a pile of static metadata files, a lot of capability can live behind that one boundary, and Merced's survey is a good tour of what already does.

Server-side scan planning, which is the catalog working out which files and rows a query needs before any engine touches the data, is the feature he flags to watch this year, and as of Iceberg 1.11.0 in late May 2026 the REST spec carries a reference implementation of that planning endpoint, so it is arriving rather than hypothetical. The security version of the argument is real. If the catalog applies row filters and column masks while it plans that scan and returns only the rows an engine is permitted to see, then the engine never sees what it is not allowed to see, because the filtering happened before the plan existed. That holds only as long as the engine cannot go around the catalog, though. The same vended credentials that let it read a table also let it read the underlying Parquet directly, so the catalog-applied filter binds only where cloud IAM denies the engine direct bucket access, and where direct-to-storage reads are allowed the filter is a policy the storage layer never sees. Credential vending hands an engine a short-lived token scoped to a table. The stricter pattern, remote signing, hands it no token at all and pre-signs each individual file read, scoped to one file and one operation, which is a meaningful difference for regulated data where even a few minutes of broad access is unacceptable. Treat the catalog with the seriousness you'd treat a production database, Merced says, because functionally that is what it is. On that last point he is simply correct: the catalog is a tier-1 dependency now, and a metadata-tamper bug against it is a plausible attack surface rather than a feature concern.
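The plan-time filtering argument, and its bypass caveat, can be shown with a toy model. This is a minimal sketch under stated assumptions and not the Iceberg REST planning endpoint or any vendor's implementation: `Policy`, `plan_scan`, and `read_direct` are hypothetical names, the table is an in-memory list of dicts, and the rows are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

Row = Dict[str, object]


@dataclass
class Policy:
    """Hypothetical per-principal policy: a row predicate plus columns to mask."""
    row_filter: Callable[[Row], bool]
    masked_columns: Set[str] = field(default_factory=set)


def plan_scan(table: List[Row], policy: Policy) -> List[Row]:
    """Toy catalog-side planning: filter rows and mask columns
    before the engine receives anything."""
    planned = []
    for row in table:
        if not policy.row_filter(row):
            continue  # the engine never learns this row exists
        planned.append(
            {k: ("***" if k in policy.masked_columns else v) for k, v in row.items()}
        )
    return planned


def read_direct(table: List[Row]) -> List[Row]:
    """Toy direct-to-storage read: the policy is never consulted."""
    return [dict(row) for row in table]


# Illustrative data only.
TABLE = [
    {"tenant": "a", "user": "alice", "src_ip": "10.0.0.1"},
    {"tenant": "b", "user": "bob", "src_ip": "10.0.0.2"},
]
ANALYST_A = Policy(row_filter=lambda r: r["tenant"] == "a", masked_columns={"src_ip"})

via_catalog = plan_scan(TABLE, ANALYST_A)  # one row, src_ip masked
via_bucket = read_direct(TABLE)            # both rows, nothing masked
```

The two calls read the same bytes. The only difference is whether the path runs through the function that knows the policy, which is why the filter binds only when the direct path is denied at the storage layer.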

So I'm not disputing that the catalog matters more than it did, or that for a data-platform team standardizing analytics on a lakehouse the catalog choice is now the decision that determines governance reach and engine portability. If you've already concluded that and you're choosing between the credible options, I've written that comparison separately, along with a companion piece on what to do when the catalog you picked won't enforce what you need. This essay takes up the prior question of whether the catalog is even the layer to bet on for security, and what nobody is scoring once you decide it is. The disagreement is narrower and it is specific to security workloads.

The turn

For security data, the catalog is not the moat.

The hypothesis I hold, at moderate confidence and with the evidence still short of where I'd want it, is that the access-control granularity security actually needs (per-event row-level decisions, classification labels that travel with the record, dynamic masking that responds to who's asking), together with the provenance and audit-retention requirements that regulated security data carries, push the lock-in back down to the engine and pipeline layer rather than up to the catalog. The reason is mechanical. A record's classification label and the masking logic apply where the data is shaped and the engine runs, and a provenance primitive like Iceberg V3 row lineage lives in the table format itself, so all three survive a change of catalog, which means the lock-in that actually costs you sits below the catalog rather than in it. The catalog-as-moat thesis describes analytics workloads well, and security workloads diverge from it, and the most useful evidence for that divergence comes from the source most invested in the convergence story, because Merced's own catalog-by-catalog detail keeps undercutting the present tense of his framing.

Read the survey for what enforcement actually does today rather than what the roadmap promises, and the cross-engine access control that would make the catalog the enforcement point turns out to be four different kinds of not-yet.

Where the catalog computes policy at all, the enforcement is usually distributed. AWS documents Lake Formation's own design as exactly that, with the catalog working out the policy and vending the credentials while the engine is the part that actually applies the row, column, and cell filtering, under a fail-close trust contract whose own wording says integrated services are trusted to enforce it properly. The thing that applies the rule is the engine, not the catalog, by the vendor's own description.
On Polaris and Lakekeeper it is delegated, because they don't decide fine-grained access themselves and instead hand it to a separate rules engine that lives outside the catalog (OPA, OpenFGA, Cedar), so the catalog is the caller and something else is the decision.
On Databricks Unity, the one catalog Merced calls the most complete, the controls that key off who is asking and what the data is (the column masks and row filters the catalog applies while it plans the scan) are still in beta, and that beta is a narrow one: external engines querying under those controls are read-only, a VARIANT column can't be projected and a BINARY column can't be filtered, and the catalog returns server-sanitized rows rather than rewriting the data. The open-source build is a separate, slower-moving project with a real feature gap, so the controls a security team would actually need live in the managed tier only.
And across all of them it stays non-portable, because there is no industry standard for moving a governance policy from one catalog to another, the first of the unsolved problems Merced lists himself, so a policy written for Polaris does not move to Unity or to Glue.

The one design that genuinely puts enforcement inside the catalog is Unity's server-side planning, and that is the beta exception, which makes the honest claim narrower than a flat "catalogs don't enforce." Today the mature enforcement points are engine-side or delegated to an outside rules engine, and the single catalog-side implementation is not generally available yet. Read that way, the catalog-as-control-plane is arriving, in beta, and externally delegated, which sharpens the who-enforces question rather than closing it. That is why I keep this at moderate confidence and won't promote it. The path to calling it settled runs through a practitioner deployment I can audit at regulated scale, and three blog posts plus one demonstration moving a few hundred files a day don't clear that bar, so vendor and practitioner writing consolidates the read without graduating it. The honest status is that the question is sharper than it was a month ago and still open.

What to keep, not resolve

Two contradictions worth keeping open.

The first is whether managed catalogs are genuinely open or open as marketing, and the sources give a per-vendor split rather than a verdict, which is the honest answer. Snowflake has been explicit that its managed catalog runs the same Polaris backbone the community downloads, not a stripped-down fork, so there managed and open really are the same thing. In the same survey Databricks Unity's open-source build is the feature-gapped poor cousin to the managed one, which is the textbook shape of an open interface wrapped around a proprietary core. Beach supplies the mechanism in miniature, where DuckDB can write to Delta now, which sounds like openness, but it writes through a managed workspace URL and a personal access token, so the engine-openness sits on top of a managed control point. Collapsing that into either "openness won" or "openness is theater" would be wrong in both directions; the accurate read is that it varies by vendor and you have to check.

The second contradiction is the one a security buyer should price most carefully, because it interacts with the first to produce lock-in at exactly the layer the analytics story calls open. Granting the catalog-as-control-plane thesis at full strength, a governance policy still does not travel from Polaris to Unity to Glue, so the catalog can be the enforcement boundary within a vendor while remaining non-portable across vendors. Combine that with the managed-versus-open gap and the trap comes into view, because the controls a security team needs are concentrated in the managed tier, and the managed tier's policies don't move, so a team that adopts catalog-level enforcement is buying per-vendor lock-in at the governance layer while being told that layer is the open one. This is the catalog-layer version of a problem security already knows from detection content, where a Sigma rule that reads cleanly does not reliably write to every backend, and it is the kind of contradiction a fair scorer keeps on the board rather than resolving in the vendor's favor.

The market signal

What Cyera paid for Ryft.

The genuinely new datapoint in the window is a price. On the 23rd of April 2026, Cyera, an AI-security platform carrying a nine-billion-dollar valuation, acquired Ryft, an Iceberg-operations startup, for somewhere around a hundred to a hundred and thirty million dollars. Ryft's product was the unglamorous half of the lakehouse: detect fragmentation, run compaction, optimize layout, manage the snapshot lifecycle, handle deletion for privacy requests, and, from early 2026, a context layer that turned schema, query patterns, freshness, and table statistics into something an agent could read. The facts of the deal are independently checkable, and the strategic reading I'd borrow is Merced's, that a security vendor paid nine figures for an Iceberg-operations startup so it could give AI agents traceable, governed, secure access to lakehouse data.

The durable lesson is narrow and I want to keep it narrow. Table health plus agent-readable context is now being priced as security spend, not just analytics spend, which is the first market-priced confirmation I've seen that the data-health layer security teams treat as someone else's problem has acquisition value when an agent has to stand on it. The over-read to resist is the temptation to turn one acquisition with an unproven integration into a license to go build an agent-governance product, and I'm not going to, because a single deal prices a direction, it doesn't validate a roadmap. It also fits a pattern I keep watching, where an open capability that a team could run itself (Iceberg-ops, an agent-context layer) gets absorbed behind a SaaS boundary, which is the better business model and the worse ownership model, and the self-managed counterparts to Ryft's function still exist for a team that would rather own the half it depends on.

The unscored half

The half that's still up for grabs is verification.

Merced ends on a line I'd underline if it were mine, that resolving metadata is only half the job, and keeping the tables healthy and the agents accountable is the other half, and that half is still up for grabs. That is the half I've been measuring, and the runs are public. The measurements are small, first-party, and synthetic, single-host runs on planted corpora, tier B and not production rates, but they point the same way every time, which is that the catalog resolving metadata correctly tells you nothing about whether the answer an engine or an agent gets is the right one.

In one run I put the same ten-million-row table behind twelve query engines and Parquet readers and asked each the same pre-registered questions against a ground truth computed from the data generator. Ten of the twelve returned the correct answer on every probe. Two, an embedded ClickHouse reader and fastparquet, silently returned wrong answers on at least one probe, in one case dropping rows from an equality filter so that the count came back low with no error and no warning. The catalog was identical across all twelve; it resolved the same metadata and handed back the same files, and it had nothing to say about the reader that miscounted. The only thing that caught the miscount was running more than one engine and comparing, which is verification, and verification is not something the catalog provides.
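The comparison step itself is small. The following is a minimal sketch of it, not the harness behind the run above: `cross_check` is a hypothetical helper, the reader names are placeholders, and the counts are invented for illustration. It assumes each reader has already answered one probe with a single integer count.

```python
from collections import Counter
from typing import Dict, List, Optional


def cross_check(results: Dict[str, int], ground_truth: Optional[int] = None) -> Dict[str, object]:
    """Compare one probe's answer across readers.

    results maps a reader name to the count it returned. When a ground truth
    is supplied it is the reference; otherwise the majority answer is used,
    which is weaker because a majority of readers can share the same bug.
    """
    votes = Counter(results.values())
    majority_value, _ = votes.most_common(1)[0]
    reference = ground_truth if ground_truth is not None else majority_value
    disagreeing: List[str] = sorted(
        name for name, value in results.items() if value != reference
    )
    return {
        "reference": reference,
        "unanimous": len(votes) == 1,
        "disagreeing": disagreeing,
    }


# Placeholder readers and made-up counts, for illustration only.
probe = {"reader_a": 1000, "reader_b": 1000, "reader_c": 987}
report = cross_check(probe, ground_truth=1000)
```

A reader that drops rows without raising an error shows up only in `disagreeing`; nothing in the metadata path would have flagged it. Pre-registering the probes and computing the reference from the generator is what keeps the check from degrading into a popularity vote.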

In another I tested what flattening security data into a coarse schema does to detection, by scoring a fidelity-preserving store against a normalized one on the same planted APT29-style attack chain, where a recovery score of 1.0 means the flattened store answered an attack-hunting query exactly as well as the full-fidelity store and 0 means it missed entirely. Routine, high-volume queries came back identical, a delta of +0.000, exactly as they should. The adversary queries, the ones that depend on a timestamp's timezone offset, on absence encoded as something other than null, on the inter-arrival jitter of a beacon, degraded: a delta of +0.188 on the de-gamed run against real MITRE APT29 telemetry with unmodified SigmaHQ rules, and a larger +0.719 on the gameable synthetic testbed. The fidelity that prevents that loss is not free, because on that synthetic testbed the full-fidelity store ran 1.93 times the size of the coarse one (5.36 MB against 2.78 MB on this corpus). Flattening leaves the dashboard intact while it eats the adversary tail, the small fraction of events that carry the actual attack, and the catalog sees none of that, because the loss happened in how the data was shaped on the way in.
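Two of those losses are easy to reproduce in miniature. This is an illustrative sketch, not the scoring harness from the run: the timestamp, the field values, and the `flatten_field` helper are all assumptions chosen to show the mechanism. It normalizes a timestamp to naive UTC and collapses several "absent" encodings to a single null, which is what a coarse schema typically does on ingest.

```python
from datetime import datetime, timezone

# Illustrative event time carrying a UTC-5 offset.
RAW_TS = "2026-03-01T23:30:00-05:00"

original = datetime.fromisoformat(RAW_TS)
# Flattening step: convert to UTC, then drop the timezone entirely.
flattened_ts = original.astimezone(timezone.utc).replace(tzinfo=None)

offset_before = original.utcoffset()     # the offset the source recorded
offset_after = flattened_ts.utcoffset()  # None: the offset is gone

# Hypothetical set of encodings a coarse schema treats as "missing".
ABSENT_MARKERS = ("", "-", None)


def flatten_field(value):
    """Collapse every absent-like encoding to None; pass real values through."""
    return None if value in ABSENT_MARKERS else value


# Four distinct raw encodings of a parent-process field (illustrative).
raw_parents = ["-", "", None, "explorer.exe"]
flattened_parents = [flatten_field(v) for v in raw_parents]

distinct_before = len(set(raw_parents))
distinct_after = len(set(flattened_parents))
```

The instant survives, so a dashboard bucketing by hour is unchanged, but a hunt that keys on the recorded offset or on which flavor of "missing" a field carried no longer has anything to query.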

And in a third I measured how much of an asset inventory you can actually reconstruct from imperfect tools, observing twenty thousand assets across seven attributes through four overlapping sources. The best single tool recovered 47.7% of the true cells. Merging across tools by freshness and authority recovered 75.6%, a gain of nearly twenty-eight points, and it still left a quarter of the picture — 24.4% — that no tool got right. Assurance lives in the join across tools, not in any one tool's view, and certainly not in the catalog that indexes them. None of these numbers is large or production-grade, and I'd trade all three for one audited deployment at scale, but they're consistent, and they say the contested half of the lakehouse is the half nobody is scoring.
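A cell-level merge of that kind can be sketched in a few lines. This is a minimal sketch under assumptions, not the code behind the numbers above: the source names, the `AUTHORITY` ranking, and the rule that freshness wins with authority as the tie-breaker are all hypothetical choices, and the observations are invented.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

Cell = Tuple[str, str]  # (asset_id, attribute)


@dataclass(frozen=True)
class Observation:
    source: str
    asset_id: str
    attribute: str
    value: Optional[str]
    observed_at: int  # epoch seconds


# Hypothetical authority ranking; higher wins a tie on freshness.
AUTHORITY = {"edr": 3, "cmdb": 2, "scanner": 1}


def merge(observations: Iterable[Observation]) -> Dict[Cell, str]:
    """Keep one value per cell: the freshest, with authority breaking ties."""
    best: Dict[Cell, Tuple[Tuple[int, int], str]] = {}
    for obs in observations:
        if obs.value is None:
            continue  # a source that saw nothing contributes nothing
        key = (obs.asset_id, obs.attribute)
        rank = (obs.observed_at, AUTHORITY.get(obs.source, 0))
        if key not in best or rank > best[key][0]:
            best[key] = (rank, obs.value)
    return {key: value for key, (_, value) in best.items()}


def coverage(merged: Dict[Cell, str], truth: Dict[Cell, str]) -> float:
    """Fraction of true cells the merged view gets exactly right."""
    correct = sum(1 for cell, value in truth.items() if merged.get(cell) == value)
    return correct / len(truth)


# Invented ground truth and observations, for illustration only.
TRUTH = {("h1", "os"): "linux", ("h1", "owner"): "secops",
         ("h2", "os"): "windows", ("h2", "owner"): "it"}
OBS = [
    Observation("cmdb", "h1", "os", "linux", 100),
    Observation("scanner", "h1", "os", "linux", 200),
    Observation("cmdb", "h1", "owner", "secops", 100),
    Observation("edr", "h2", "os", "windows", 300),
    Observation("scanner", "h2", "os", "win", 300),  # same time, lower authority
]

cmdb_only = coverage(merge(o for o in OBS if o.source == "cmdb"), TRUTH)
merged_all = coverage(merge(OBS), TRUTH)
```

In this toy the single source covers half the cells and the merge covers three quarters, with one cell that no source observed at all. The shape matches the measurement: the join raises the ceiling, and the remainder is invisible to every tool that fed it.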

Why this is the work

Why I can measure the lakehouse and not the SIEM.

There's a reason the numbers above run on lakehouse engines and an open catalog rather than on the incumbent SIEM most security teams are actually trying to leave, and it isn't that the SIEM is uninteresting. Splunk's general terms restrict a customer from publishing competitive benchmark results that name Splunk, and Oracle's terms carry a similar clause, so every published benchmark that names those products is, by construction, run by the vendor or its partners. Snowflake's acceptable-use policy and Databricks' master terms don't carry that restriction. The practical consequence is the sharp end of the whole argument: the lakehouse you'd move to lets an independent party check the vendor's claims, and the SIEM you'd leave binds you from publishing what you find, which is why independent measurement has value precisely where the terms forbid it.

That's the practice this site is, and it's why I keep the catalog story in its place. The catalog convergence is a dimension I score in the capability matrix, alongside the enforcement model, the credential pattern, the governance-portability gap, and the data-health half nobody else is grading; it is evidence that feeds the scorecard, not a new thing to be bullish about. For analytics the catalog may well be the moat. For security it's the question I opened with, the access-control and provenance requirements push the decision back down to the engine and the pipeline, and the useful contribution isn't another vendor telling you the catalog won — it's measuring the half they've all agreed to leave unscored, with terms open enough that you don't have to take their word for it.