Security Data Works

Technology deep-dive

Unity Catalog vs Polaris vs Nessie: choosing the catalog for security data.

Picking Apache Iceberg as the table format settles one decision. The catalog choice is the next, and it is the one that determines how access control, governance, multi-engine portability, and vendor lock-in actually play out. In 2026 the field has tightened to three credible options: Unity Catalog (Databricks, Apache 2.0 since June 2024), Apache Polaris (Snowflake-stewarded, Iceberg REST), and Project Nessie (Dremio-incubated, Git-style versioning), with DuckLake sitting as a 2026 wildcard that may or may not consolidate the catalog layer entirely.

Reading time: about 19 minutes. Evidence tier: B (vendor documentation, Apache project incubation records, and architectural analysis from security practitioners). Where I cite production deployments, I name the source. Where I'm reading vendor tea leaves on roadmap, I flag it.

The decision after the decision

Catalog choice is the governance contract.

The Iceberg catalog is more than a metadata registry, because it's also the enforcement point for who can read which tables, whether row-level security or column masking can be expressed at all, how schema changes are audited, and which query engines can participate without a custom connector. Pick the wrong catalog and the next year is spent writing authorization middleware, while picking the right one means the access-control questions a SOC has to answer collapse into table grants and namespace policies.

That's why I treat catalog as an Iceberg-pillar-adjacent decision. The table format gives you ACID transactions, schema evolution, and time travel, but the catalog decides whether those features are governable. Without a catalog you have files in S3. With one you have governed data assets and an answer to the auditor's question about who queried CloudTrail last Tuesday.

One qualifier I want to set down before the comparison, because it changes how much weight to put on the choice. In a general enterprise data platform the catalog can become the moat, the place that owns governance and therefore owns the customer. In security specifically I think that's mostly wrong. The governance security teams actually fight over is per-event: row-level ABAC on individual records, provenance on where a given event came from, and audit-retention that survives an investigation timeline. Those concerns push the real lock-in down to the engine and pipeline layer, where the policy is enforced and the data is shaped, not up to the catalog that registers tables. The counter-evidence is concrete: Iceberg V3 row lineage (Iceberg 1.9.0, April 2025) puts per-row provenance and last-updated tracking in the table format itself, readable by any engine that surfaces it and independent of which catalog wrote it. When the audit primitive lives in the open format, swapping Polaris for Nessie doesn't cost you the audit trail. So the catalog choice (Unity vs Polaris vs Nessie vs DuckLake) matters less for lock-in than the engine choice does. It still matters a great deal for operational fit, which is what the rest of this piece is about.

I'd been asserting that swap so I ran it. On a single host I wrote and read the same OCSF table through three independent catalog implementations over one MinIO store: the iceberg-rest reference fixture (the Java reference implementation), Nessie (Java and Quarkus, with the Git-style branching above), and Lakekeeper (a Rust catalog backed by Postgres). The query returned the identical answer, rdp=125, through all three, with the data never moving and only the catalog changing underneath it (./moar swap-catalog in the MOAR reference stack, 2026-06-07). I ran the same check one layer down with ./moar swap-format, and the same OCSF batch returned the identical answer across Iceberg and DuckLake on the same store, so the table format is replaceable on the same footing as the catalog. Three separate codebases agreeing on the answer is a portability signal rather than a single-implementation coincidence, which is the central claim of this piece demonstrated rather than asserted. I want to keep the framing honest about what it is: an answer-equality check on one host, not a production migration or a performance result, and Unity Catalog isn't in that stack, so its behavior here stays documented rather than first-party.
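The shape of that check is simple enough to sketch. This is not the ./moar tooling itself, whose internals aren't shown here. The function name answers_agree and the stand-in fake_run_query are hypothetical, and a real run would open one engine session per catalog against the same object store and execute the same SQL.

```python
from typing import Callable, Dict, List


def answers_agree(run_query: Callable[[str], int], catalogs: List[str]) -> Dict[str, int]:
    """Run the same query through each catalog and fail if any answer differs."""
    answers = {name: run_query(name) for name in catalogs}
    if len(set(answers.values())) != 1:
        raise AssertionError(f"catalogs disagree: {answers}")
    return answers


def fake_run_query(catalog_name: str) -> int:
    """Hypothetical stand-in for 'run the rdp count through this catalog'."""
    return 125  # the value reported in the text above


result = answers_agree(fake_run_query, ["iceberg-rest", "nessie", "lakekeeper"])
```

The useful property is that the check knows nothing about any catalog's internals. It only compares answers, which is why agreement across separate codebases carries weight.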

For background on why Iceberg over Delta in the first place, see Iceberg vs Delta Lake for security data. This piece picks up where that one ends, with the table format chosen and the catalog now on the table.

Architecture assumption

Dedicated security infrastructure changes the catalog math.

Most of the security architectures I work with assume a dedicated security data plane: a separate VPC or VNet, IAM scoped to the security team, and no shared multi-tenant platform mixing PII with EDR telemetry. The assumption rests on separation of duties, which is sound practice in its own right. When it holds, the catalog evaluation looks different from the textbook enterprise data-governance pitch.

Start with row-level security, which often becomes unnecessary here. Shared corporate data platforms genuinely need RLS to restrict analysts to their region's logs or their customer's tenant. On an isolated security platform, the entire security team is authorized to query every security log, so table-level permissions are sufficient. That avoids the 5-30% query latency overhead that Unity Catalog row filters add for every query that hits a filtered table.

Column masking tends to fall away for the same reason. Shared platforms mask source IP, user agent, and other quasi-PII for non-privileged users, but isolated security teams need to see those fields to do incident response, so skipping masking avoids the 3-10% query-rewriting overhead and keeps the data fully usable for threat hunting.

So the catalog priorities shift. On shared platforms, Unity Catalog's catalog-enforced fine-grained access is the differentiator and probably the right answer. On isolated platforms, Polaris and Nessie become genuinely viable, because table-level RBAC plus network isolation is enough. That frees you to optimize for vendor neutrality (Polaris) or multi-table transactional rigor (Nessie) instead.

Production validation for the isolation pattern: Netflix runs a dedicated security observability plane (ClickHouse hot tier plus Iceberg cold tier, network-isolated) with table-level permissions via a vendor-neutral catalog. Daniel Muino's QCon 2024 talk on ClickHouse at Netflix is the public source. Huntress runs its EDR data lake on an isolated AWS footprint with Iceberg plus table-level RBAC. Jake Thomas at Okta has confirmed an isolated security analytics platform using DuckDB against Iceberg tables with table-level permissions and no column masking.

Even on isolated infrastructure, fine-grained access still matters in three cases I keep running into: managed security service providers hosting multiple customer tenants on shared infrastructure (row filters per customer are non-negotiable), GDPR "right to access" workflows that need column-level lineage and masking audit trails, and federated global security teams where regional log access is legally restricted. For everyone else (single-tenant, dedicated infrastructure, team-wide access) table-level catalog security is enough, which opens the field beyond Unity Catalog.

What a catalog actually does

Metadata registry plus enforcement point.

An Iceberg catalog is the entry point to four things: which S3 paths hold which tables, the version history of schema changes, the snapshot list that enables time travel queries, and how data is physically partitioned. Strictly, the catalog stores a pointer from each table name to that table's current metadata file, and the schema history, snapshot list, and partition spec hang off that pointer. That is the registry function, and it's table-stakes for any of the options here.
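A minimal sketch of that registry function, assuming the usual Iceberg pattern in which a commit is an atomic swap of the table's current-metadata pointer. ToyCatalog, CommitConflict, and the S3 paths are hypothetical names for illustration, not any real catalog's API.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyCatalog:
    """Maps a table name to the location of its current metadata file."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        return self._pointers.get(table)

    def commit(self, table, expected, new):
        """Swap the pointer only if it still equals what the writer read."""
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(table)
            self._pointers[table] = new


V1 = "s3://bucket/cloudtrail/metadata/v1.json"  # hypothetical paths
V2 = "s3://bucket/cloudtrail/metadata/v2.json"

catalog = ToyCatalog()
catalog.commit("security.cloudtrail", None, V1)  # create the table
catalog.commit("security.cloudtrail", V1, V2)    # a normal commit
```

A writer that still believes V1 is current would now get CommitConflict and have to retry. That single compare-and-swap is what turns files in object storage into a table with a history.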

For security teams, the catalog is also the enforcement surface for governance. It decides who can query the CloudTrail tables, whether the source IP column is visible to a particular role, and what gets written to the audit log when a user runs a query. It also determines how OCSF normalized tables stay consistent when fifteen of them have to update atomically, and which dashboards depend on which raw log tables, so that a schema change doesn't silently break a detection rule.

Not every catalog does every one of these, though, because some assume a single cloud, some require specific query engines, and some give you row-level filters while others stop at table-level grants. The trade-offs are real, and they're the reason this decision deserves more than a Slack-thread skim.

Option one

Unity Catalog: Databricks' governance layer.

Databricks open-sourced Unity Catalog in June 2024 under Apache 2.0. The repository is unitycatalog/unitycatalog on GitHub. The licensing matters because before that move, Unity was a Databricks-only commercial feature; afterward, the catalog REST API server, table and schema management, basic RBAC, AWS Glue and Hive Metastore integration, and multi-format support (Delta plus Iceberg plus Hudi) became available to anyone willing to self-host.

The asterisk on that move: row-level security, column masking, attribute-based access control, column-level lineage, and Delta Sharing remain Databricks-platform features. If you self-host open-source Unity Catalog, you get table-level governance, but the moment you want catalog-enforced row filters like the canonical "users only see logs from their region" rule, you need a Databricks workspace. That is by design, because Databricks is competing with Polaris and Nessie on the open layer while keeping the fine-grained features as a commercial differentiator.

Where Unity earns its seat is the catalog-enforced ABAC story. A row filter or column mask defined in Unity is enforced across every engine that goes through the catalog (Databricks SQL, Spark, Trino with the Unity connector). That's a stronger guarantee than implementing the same logic at the query engine layer, where every new engine adds a new place the policy can drift.
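To make "defined once, enforced for every engine" concrete, here is a toy policy object holding a row filter and a column mask that a single enforcement function applies. This is a sketch of the idea only, not Unity Catalog's API. TablePolicy, enforce, and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass
class TablePolicy:
    row_filter: Callable[[Row, dict], bool]              # keep this row for this user?
    column_masks: Dict[str, Callable[[str, dict], str]]  # column name -> masking function


def enforce(rows: List[Row], policy: TablePolicy, user: dict) -> List[Row]:
    """Apply the row filter, then mask columns, in one place for every caller."""
    visible = []
    for row in rows:
        if not policy.row_filter(row, user):
            continue
        visible.append({
            col: policy.column_masks[col](val, user) if col in policy.column_masks else val
            for col, val in row.items()
        })
    return visible


policy = TablePolicy(
    row_filter=lambda row, user: row["region"] == user["region"],
    column_masks={"src_ip": lambda val, user: val if user["privileged"] else "REDACTED"},
)
rows = [
    {"region": "eu", "src_ip": "10.0.0.1", "event": "ConsoleLogin"},
    {"region": "us", "src_ip": "10.0.0.2", "event": "ConsoleLogin"},
]
eu_analyst = {"region": "eu", "privileged": False}
visible_rows = enforce(rows, policy, eu_analyst)
```

If each engine carried its own copy of the filter and mask lambdas instead, every new engine would be one more copy to keep in sync, which is the drift problem described above.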

A few other features are worth naming. Federated identity integrates with Azure AD, Okta, and Ping Identity via SCIM and OAuth 2.0. Audit logging records every query with user identity, tables accessed, and rows returned, and the logs export to S3 for downstream SIEM ingestion. Column-level lineage tracks impact analysis, useful for the "if I change this CloudTrail schema, which detection rules break" question. Delta Sharing lets you share Iceberg tables with external partners (MSSPs, IR vendors) without copying data, with revocable tokens.

The trade-off is real: Unity's fine-grained features are Databricks-coupled. For organizations already on Databricks, that's a non-issue and Unity is probably the right answer. For multi-cloud teams explicitly avoiding vendor lock-in at the data layer, that coupling is the reason to look at Polaris next.

Option two

Apache Polaris: Snowflake's open Iceberg catalog.

Snowflake donated Polaris to the Apache Software Foundation in 2024 as the vendor-neutral counterweight to Unity Catalog. The design principles are deliberately narrow: pure Iceberg (no Delta, no Hudi), standard Iceberg REST catalog protocol, cloud-agnostic deployment, and explicit support for any Iceberg-compatible query engine (Spark, Trino, Dremio, StarRocks, ClickHouse). Snowflake's own Iceberg Tables product uses a Polaris-compatible API, and customers can self-host Polaris or use the Snowflake-managed service.

Polaris's access model is table-level and namespace-level RBAC. Privileges are granted to catalog roles, and catalog roles are assigned to the principal roles that users and services hold. You grant read on cloudtrail_logs to the role behind security_analysts, grant full table privileges on detection_rules to the role behind soc_lead, and revoke the role assignment when an analyst leaves. Namespace-level grants are useful for multi-cloud splits: give cloud_security_team access to everything in the aws_logs namespace, and give azure_security_team access to everything in azure_logs. The privilege model is granular at the operation level (for example TABLE_READ_DATA, TABLE_WRITE_DATA, TABLE_CREATE, TABLE_DROP, CATALOG_MANAGE_CONTENT) and supports principal-based authentication with service principals and OAuth or OIDC identity providers.
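The grant model can be sketched as a lookup, assuming a two-level role chain (principal role to catalog role to privilege) and that a namespace-level grant covers the tables beneath it. The dictionaries and is_allowed are hypothetical illustrations rather than Polaris's management API. The privilege names follow the TABLE_READ_DATA style as I understand the Polaris documentation, so treat them as illustrative.

```python
# catalog role -> set of (privilege, securable); a securable is a namespace or table
GRANTS = {
    "analyst_read": {("TABLE_READ_DATA", "aws_logs")},  # namespace-level grant
    "soc_admin": {("TABLE_WRITE_DATA", "detections.detection_rules")},
}

# principal role -> catalog roles assigned to it
ROLE_BINDINGS = {
    "security_analysts": {"analyst_read"},
    "soc_lead": {"analyst_read", "soc_admin"},
}


def is_allowed(principal_role: str, privilege: str, table: str) -> bool:
    """True if a bound catalog role holds the privilege on the table or an enclosing namespace."""
    parts = table.split(".")
    securables = {".".join(parts[:i]) for i in range(1, len(parts) + 1)}
    for catalog_role in ROLE_BINDINGS.get(principal_role, ()):
        for granted_privilege, securable in GRANTS.get(catalog_role, ()):
            if granted_privilege == privilege and securable in securables:
                return True
    return False


analyst_can_read = is_allowed("security_analysts", "TABLE_READ_DATA", "aws_logs.cloudtrail_logs")
analyst_can_write = is_allowed("security_analysts", "TABLE_WRITE_DATA", "detections.detection_rules")
```

Offboarding an analyst is then a change to the role binding, not to any table grant, which keeps the grant list stable as the team changes.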

What Polaris does not have is row-level filters and column masking, so if you need those, you implement them at the query engine layer (Trino row filters, Dremio column policies) or via dynamic views, and you accept that the policy now lives in two places instead of one. For a single-tenant security platform on isolated infrastructure that's an acceptable trade, though for a multi-tenant MSSP it isn't.

Audit logging in Polaris is catalog-level (who accessed which tables, when) and integrates with CloudWatch and Azure Monitor. Query-level audit is delegated to the engine, which means a complete audit trail requires correlating Polaris logs with engine logs. For SOC 2 or ISO 27001 the catalog-level trail is usually enough, but for GDPR right-to-access requests that need column-level lineage it isn't.
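The correlation step is where teams tend to underestimate the work, so here is a rough sketch of it. The record shapes, field names, and the 30-second window are all assumptions of mine. Real catalog and engine logs will need their own parsing, and a shared request or session identifier is a better join key when both sides expose one.

```python
from datetime import datetime, timedelta


def correlate(catalog_events, engine_events, window=timedelta(seconds=30)):
    """Pair each engine query with a catalog table-load by the same principal
    on the same table that happened at most `window` earlier."""
    joined = []
    for eng in engine_events:
        for cat in catalog_events:
            lag = eng["ts"] - cat["ts"]
            if (cat["principal"] == eng["principal"]
                    and cat["table"] in eng["tables"]
                    and timedelta(0) <= lag <= window):
                joined.append({
                    "principal": eng["principal"],
                    "table": cat["table"],
                    "query": eng["query"],
                    "loaded_at": cat["ts"],
                    "ran_at": eng["ts"],
                })
    return joined


# Hypothetical log records for illustration.
catalog_log = [
    {"principal": "alice", "table": "security_logs.cloudtrail", "ts": datetime(2026, 5, 12, 9, 0, 0)},
    {"principal": "bob", "table": "security_logs.cloudtrail", "ts": datetime(2026, 5, 12, 9, 5, 0)},
]
engine_log = [
    {"principal": "alice", "tables": ["security_logs.cloudtrail"],
     "query": "SELECT count(*) FROM security_logs.cloudtrail",
     "ts": datetime(2026, 5, 12, 9, 0, 4)},
]
trail = correlate(catalog_log, engine_log)
```

Bob's catalog access has no matching engine query here, which is itself a finding: the catalog trail alone says he loaded the table, not what he ran against it.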

The strategic read on Polaris is that Snowflake is using it the way Databricks used Spark a decade ago: donate the core, build commercial value on top, and let the open standard pull the ecosystem along. That dynamic is not by itself a reason to choose or avoid Polaris, but it shapes how the roadmap is likely to evolve. I'd hedge on long-horizon claims about Polaris's feature trajectory; the catalog is Apache-governed but the practical roadmap pressure comes from Snowflake's commercial priorities.

Option three

Nessie: Git semantics for the lakehouse.

Project Nessie came out of Dremio in 2020 as an Apache 2.0-licensed open-source project. Unlike Polaris, it is not an Apache Software Foundation project. Its defining bet is applying Git concepts (branches, tags, commits, merges) to Iceberg catalog metadata. That sounds like a software-engineering aesthetic choice until you watch a security data engineer try to roll back a detection rule deployment that's polluted six normalized tables.

The workflow Nessie enables: create a branch off main for a new detection rule, update the relevant tables within that branch in isolation, run validation queries to confirm the rule produces expected volume and shape, then merge into main once it passes. If it produces false positives in production anyway, reset main to a known-good tag from before the merge. Production state recovers atomically across every table the rule touched.

-- Nessie Spark SQL extensions; "nessie" is the Spark catalog name

-- Create branch for a new detection rule
CREATE BRANCH IF NOT EXISTS detection_rule_v2 IN nessie FROM main;

-- Stage updates in isolation
USE REFERENCE detection_rule_v2 IN nessie;
INSERT INTO suspicious_logins
SELECT * FROM cloudtrail_logs
WHERE eventName = 'ConsoleLogin' AND mfaUsed = false;

-- Validate expected volume before merging
SELECT COUNT(*) FROM suspicious_logins;

-- Promote to production
MERGE BRANCH detection_rule_v2 INTO main IN nessie;

-- Or roll back if production behaves badly: point main back at the known-good tag
ASSIGN BRANCH main TO production_2026_05_14 IN nessie;

Three Nessie capabilities matter specifically for security workloads. First, multi-table transactions across an arbitrary number of Iceberg tables. When OCSF normalization rewrites fifteen tables, Nessie guarantees that either all fifteen commit or none do. Polaris and Unity handle ACID per table; Nessie handles it across the catalog. Second, cross-table time travel: query every table as it existed at a specific commit hash, which is the right shape for "show me all security data as it looked when the breach was detected." Third, branch-level RBAC. Grant the dev branch to security engineers and the prod branch to SOC leads, so testing changes doesn't require giving everyone production write access.
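The mechanism behind the first two capabilities is easier to see in a toy model. The sketch below treats every commit as an immutable map from table name to snapshot ID, with branches and tags as pointers to commits. ToyVersionedCatalog and its methods are hypothetical, and the merge is a naive overlay. Real Nessie performs conflict detection that this omits.

```python
class ToyVersionedCatalog:
    """Each commit is an immutable {table: snapshot_id} map; refs point at commits."""

    def __init__(self):
        self.commits = [{}]      # commit 0 is the empty catalog
        self.refs = {"main": 0}  # branch or tag name -> commit index

    def create_ref(self, name, from_ref):
        """Create a branch or tag pointing at the same commit as from_ref."""
        self.refs[name] = self.refs[from_ref]

    def commit(self, branch, changes):
        """Apply changes to several tables as one new commit on the branch."""
        state = dict(self.commits[self.refs[branch]])
        state.update(changes)
        self.commits.append(state)
        self.refs[branch] = len(self.commits) - 1

    def merge(self, source, target):
        """Naive merge: overlay the source branch's tables onto target in one commit."""
        self.commit(target, self.commits[self.refs[source]])

    def assign(self, branch, to_ref):
        """Move a branch pointer, for example back to a known-good tag."""
        self.refs[branch] = self.refs[to_ref]

    def read(self, ref):
        return dict(self.commits[self.refs[ref]])


versioned = ToyVersionedCatalog()
versioned.commit("main", {"authentication": "snap-1", "network_activity": "snap-1"})
versioned.create_ref("production_2026_05_14", "main")  # tag the known-good state
versioned.create_ref("detection_rule_v2", "main")      # branch for the new rule
versioned.commit("detection_rule_v2", {"authentication": "snap-2", "network_activity": "snap-2"})
before_merge = versioned.read("main")
versioned.merge("detection_rule_v2", "main")
after_merge = versioned.read("main")
versioned.assign("main", "production_2026_05_14")
after_rollback = versioned.read("main")
```

Because a reader resolves a ref to exactly one commit, it sees either every table at snap-1 or every table at snap-2, never a mix. Cross-table time travel falls out of the same structure, since reading at an old commit returns all tables as of that commit.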

Production validation: Dremio's blog has Nessie-at-Netflix material from 2021, and the original contributors cite Apple and LinkedIn as scale users. The financial-services adoption I've heard about anecdotally (for OCSF transformation pipelines that need atomic multi-table commits) is consistent with the design intent. I have not independently validated those deployments end-to-end.

Nessie's limits are the same shape as Polaris's, plus a learning curve. Branch-level RBAC is the finest-grained access control on offer; row filters and column masking are not in scope. And the Git mental model is genuinely a new thing for teams that haven't worked with feature-branch workflows on data before. Adoption usually lives or dies on whether the data engineering team has Git fluency to begin with.

The 2026 wildcard

DuckLake may consolidate the catalog layer entirely.

DuckLake is the 2026 entrant I'm watching closely, and the framing I've landed on is less "separate wildcard" than "different point on a spectrum." Catalog metadata can be realized three ways behind one Iceberg read contract: as static metadata files (the Unity, Polaris, and Nessie default), as metadata held in a SQL database, or as a virtual layer synthesized on demand (Streambased's ISK is the early articulation of that virtual end). DuckLake is the database-backed point. Instead of a catalog service sitting beside the table format, it keeps catalog metadata in a single SQL database queried directly by DuckDB. The early benchmark claims are striking. DuckDB Labs has cited two-orders-of-magnitude improvements on streaming workloads compared to file-based catalog metadata. That's the kind of number that deserves scrutiny, and I haven't independently benchmarked it against OCSF-shaped security data.

There's a caveat that decides where DuckLake fits, and it carries most of the weight here. DuckLake reaches outside engines through DuckLake-native clients rather than through an Iceberg REST endpoint, and its published roadmap through v2.0 doesn't add one. So a Trino, Spark, or ClickHouse process can't simply point an Iceberg REST connector at DuckLake the way it can at Polaris. Cross-engine consumption is copy-bridged: you replicate metadata out to an Iceberg-REST-fronted tier rather than reading DuckLake in place. That places DuckLake squarely in the warm, operational tier, fast single-engine access close to ingest, with an Iceberg-native catalog still doing the cross-engine retention work behind it.

The strategic question DuckLake raises: if catalog metadata can live in a SQL database queried alongside the data, do you need a dedicated catalog service for the operational tier at all? For small to mid-scale security deployments where the catalog itself is not a scaling bottleneck, DuckLake may simplify the warm-tier stack enough to compete on operational footprint rather than feature parity. For petabyte-scale deployments running concurrent Trino, Spark, and ClickHouse workloads against the same tables, the dedicated Iceberg-REST catalog service is probably still the right answer in 2026, precisely because DuckLake can't yet serve those engines without the copy.

I'd flag this as a watch-don't-bet item. DuckLake is early. The DuckDB ecosystem is real and the performance work is credible, but a category-redefining shift takes more than a v1.0 release. For architects making catalog decisions in 2026, the prudent move is to design the catalog layer so that a future DuckLake operational tier is possible without rewriting governance, which is one more argument for Iceberg-native, vendor-neutral choices over deep Unity Catalog coupling. For the deeper treatment of the metadata-realization spectrum and the format-war stakes, see Iceberg V4 vs DuckLake.

Honorable mention

Apache Gravitino sits above the catalog layer.

Apache Gravitino, developed by Datastrato, open-sourced in 2023, and accepted into the Apache Incubator in 2024, is a meta-catalog. It doesn't store metadata itself but provides a unified API across multiple underlying catalogs. A single Gravitino instance can federate Polaris for Iceberg tables, Unity for Delta tables, Hive Metastore for legacy Hadoop tables, and PostgreSQL for relational data, all under one discovery surface.

The pattern that justifies Gravitino: federated enterprises with multiple business units already on different catalog standards, who want unified data discovery and cross-catalog lineage without forcing a migration. That's a real situation in large organizations, though it's rarely a security-team-specific one.

There are two things to know about Gravitino. First, access control is delegated to the underlying catalogs (Unity row filters, Polaris table grants) so Gravitino adds discoverability, not enforcement. Second, the meta-catalog abstraction adds a network hop, typically 10-50 ms of latency on metadata operations. That's negligible against query execution time but worth knowing if you're tuning a high-throughput ingest pipeline.
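A back-of-envelope check on when the extra hop matters, assuming the worst case in which every metadata operation is serial and pays the full hop. The one-commit-per-second ingest rate is my illustrative assumption, not a measured figure.

```python
def added_metadata_latency_s(metadata_ops: int, hop_ms: float) -> float:
    """Extra wall-clock seconds if every metadata op pays one more serial hop."""
    return metadata_ops * hop_ms / 1000.0


OPS_PER_HOUR = 3600  # assumed: one catalog commit per second from streaming ingest

low = added_metadata_latency_s(OPS_PER_HOUR, 10)   # 10 ms hop
high = added_metadata_latency_s(OPS_PER_HOUR, 50)  # 50 ms hop
```

Under those assumptions the hop adds between 36 and 180 seconds of commit latency per hour of ingest. That is small for a query path but large enough to notice on a pipeline that commits every second.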

Operational reality

Catalogs track metadata. They don't maintain tables.

One thing the catalog comparison can obscure: regardless of which catalog you pick, Iceberg tables require periodic maintenance to keep query performance and storage costs under control. Catalogs track snapshots; they don't compact small files, expire old snapshots, or clean up orphans. The one exception is Unity Catalog managed tables on Databricks, which run maintenance automatically. Everywhere else, the maintenance job is yours to schedule.

Three operations have to be running somewhere. File compaction: streaming ingestion produces thousands of small Parquet files that slow scans; daily or weekly compaction into 1-2 GB files restores performance. Snapshot expiration: Iceberg keeps snapshots indefinitely, which bloats storage; weekly expiration beyond your retention window (often 90 days for compliance, 7 days for ops tables) keeps metadata reasonable. Orphan file cleanup: failed writes leave Parquet files in S3 that no snapshot references; monthly cleanup typically recovers 2-5% of total table size.

-- Daily compaction target 1 GB files
CALL spark_catalog.system.rewrite_data_files(
  table => 'security_logs.cloudtrail',
  strategy => 'binpack',
  options => map('target-file-size-bytes', '1073741824')
);

-- Weekly snapshot expiration past 90-day window
-- (older_than is a literal timestamp; the scheduler computes the cutoff for each run)
CALL spark_catalog.system.expire_snapshots(
  table => 'security_logs.cloudtrail',
  older_than => TIMESTAMP '2026-02-13 00:00:00',
  retain_last => 10
);

-- Monthly orphan file cleanup
CALL spark_catalog.system.remove_orphan_files(
  table => 'security_logs.cloudtrail',
  older_than => TIMESTAMP '2026-04-01 00:00:00'
);

The catalog-specific story breaks down like this. Unity Catalog managed tables on Databricks automate all three operations via TBLPROPERTIES. Polaris delegates maintenance to your query engine: schedule Spark jobs via Airflow, dbt, or Kubernetes CronJob. Nessie does the same, with the added wrinkle that you maintain each branch independently, though Nessie's multi-table transactions let you compact across all fifteen OCSF tables atomically.
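When the maintenance job is yours to schedule, the scheduler usually renders the procedure call with a cutoff computed at run time. A minimal sketch, assuming the spark_catalog name used above and a literal timestamp argument. expire_snapshots_sql is a hypothetical helper, and submitting the statement to Spark is left to whatever runs your jobs.

```python
from datetime import datetime, timedelta, timezone


def expire_snapshots_sql(table: str, retention_days: int, now: datetime,
                         retain_last: int = 10) -> str:
    """Render an Iceberg expire_snapshots call with a cutoff computed from `now`."""
    cutoff = (now - timedelta(days=retention_days)).strftime("%Y-%m-%d %H:%M:%S")
    return (
        "CALL spark_catalog.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff}', "
        f"retain_last => {retain_last})"
    )


run_time = datetime(2026, 5, 14, tzinfo=timezone.utc)  # example run date
statement = expire_snapshots_sql("security_logs.cloudtrail", 90, run_time)
```

Passing `now` in rather than reading the clock inside the function keeps the job reproducible, so a rerun of last week's task expires the same snapshots the original run would have.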

Operational cost is modest, roughly 30-60 minutes per week for ten to twenty Iceberg tables once the automation is wired up, and the performance return is worth it because compaction can improve query speed 2-10x when fewer files means faster metadata reads and better predicate pushdown. Catalogs that automate this for you are buying back engineer time at the cost of vendor coupling.

Decision framework

How to choose.

Choose Unity Catalog if

  • You're already on Databricks or migrating there.
  • You need catalog-enforced row-level security and column masking (multi-tenant MSSP, regulated workloads, federated regional teams).
  • You manage AI and ML assets alongside security data (MLflow models, vector embeddings, LLM training data).
  • You want Delta Sharing for secure external partner access without copying data.
  • Multi-format support (Delta plus Iceberg plus Hudi in one catalog) is a real requirement.

Where it fits poorly: organizations explicitly avoiding vendor lock-in at the catalog layer, or multi-cloud teams that need non-Databricks engines as first-class citizens.

Choose Apache Polaris if

  • Vendor neutrality is a hard requirement (no Databricks or Snowflake lock-in at the catalog).
  • You've standardized on Iceberg only (no Delta or Hudi in scope).
  • You run multiple query engines (Trino, Dremio, StarRocks, ClickHouse) and need them all first-class.
  • You deploy multi-cloud (AWS plus Azure plus GCP, possibly on-premises).
  • Table-level and namespace-level security is sufficient (typically the case for isolated security platforms).

Where it fits poorly: when catalog-enforced fine-grained access is non-negotiable and pushing it down to the query engine isn't acceptable.

Choose Nessie if

  • Multi-table transactions across OCSF normalization (fifteen-plus tables updating atomically) are part of your pipeline.
  • You want Git-style version control (branches for testing detection logic, tags for compliance snapshots, rollback for bad deploys).
  • Cross-table time travel for incident investigation matters.
  • Your data engineering team is already Git-fluent and the branching model will land naturally.
  • Table-level and branch-level RBAC is sufficient.

Where it fits poorly: when fine-grained row or column access is required, or when the team doesn't have the appetite to learn a new operational mental model.

Consider Apache Gravitino if

  • You're managing multiple catalog types already (Iceberg plus Delta plus Hive plus PostgreSQL).
  • Unified data discovery across business units matters more than enforcement.
  • You're in a federated enterprise where consolidating to one catalog isn't realistic.
  • Cross-catalog lineage tracking is a real requirement, not a nice-to-have.

Watch DuckLake separately. It's not a 2026 production answer, but the trajectory may matter for 2027 and beyond, so design the catalog layer so a future DuckLake migration is at least possible.
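The framework above compresses into a few ordered checks. This is my own reduction, not a formal rule set. The function name, the flag names, and especially the order in which the checks fire are assumptions, and real decisions will weigh more factors than five booleans.

```python
def recommend_catalog(*, on_databricks: bool, needs_row_or_column_policy: bool,
                      needs_multi_table_commits: bool, git_fluent_team: bool,
                      federates_multiple_catalogs: bool) -> str:
    """Return a first-pass catalog suggestion from a handful of yes/no answers."""
    if federates_multiple_catalogs:
        return "Apache Gravitino (over the existing catalogs)"
    if on_databricks or needs_row_or_column_policy:
        return "Unity Catalog"
    if needs_multi_table_commits and git_fluent_team:
        return "Nessie"
    return "Apache Polaris"


# An isolated, single-tenant security platform with a multi-engine stack.
isolated_default = recommend_catalog(
    on_databricks=False,
    needs_row_or_column_policy=False,
    needs_multi_table_commits=False,
    git_fluent_team=True,
    federates_multiple_catalogs=False,
)
```

The fall-through to Polaris is deliberate: it is the choice that needs no special condition to justify it on isolated infrastructure.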

My recommendation

Default to Polaris on isolated infrastructure.

For the architectures I most often work with (dedicated security data plane, single tenant, table-level RBAC sufficient, multi-engine query stack) Apache Polaris is the default, because vendor neutrality preserves the exit path, Iceberg REST is a real standard, table-level access control plus network isolation covers the governance requirement, and maintenance automation via Spark plus Airflow is a known pattern.

I'd choose Unity Catalog over Polaris in two specific cases. First, organizations already on Databricks where the integration benefits compound, where Unity is strictly additive rather than a new vendor decision. Second, security platforms that genuinely need catalog-enforced row filters or column masks: multi-tenant MSSPs, GDPR-driven column-level lineage, federated regional teams with legally restricted access patterns.

I'd choose Nessie when the data engineering team treats detection logic as code: version control, feature branches, atomic rollback. That's a real shape, more common in mature security data engineering teams than in average SOCs, and the Git mental model pays off there in a way it doesn't elsewhere.

The wrong answer is running no catalog at all. An Iceberg deployment without governance is a data swamp with extra steps. Pick one of the three, wire up the audit trail, and move on to the next decision.

Honest gaps

What I'm still uncertain about.

Three things I'd want to test against more evidence before treating any of this as settled. First, production case studies comparing all three catalogs in security deployments are thin. Most of the public material is single-catalog deployment narratives, not head-to-head comparisons. The catalog evaluation in this piece is structured on architectural analysis plus vendor documentation, not side-by-side production benchmarks.

Second, the Polaris roadmap under Apache governance is genuinely uncertain. Snowflake's commercial priorities will pull in one direction, the broader Iceberg community in another, so I'd hedge on any claim about Polaris's feature trajectory more than a year out, because "may add row filters in 2027" is currently speculation rather than commitment.

Third, DuckLake's two-orders-of-magnitude streaming performance claim deserves independent benchmarking against OCSF-shaped data before anyone designs an architecture around it. The DuckDB Labs team is credible and their benchmarks are usually defensible, but no third party I trust has yet reproduced the catalog-collapsed-into-SQL numbers on security workloads.

None of those gaps change the headline recommendation for 2026. They do shape how confidently to invest in any particular catalog choice as a five-year bet versus a two-year bet.