Technology deep-dive

Meta-catalogs and asset context in federated environments.

A federated security architecture has two visibility problems that look related but live at different layers. Where does the data physically live across multiple data catalogs (Hive, Polaris, Unity, PostgreSQL)? And what does any given asset in that data actually mean to the organization (who owns it, how critical is it, where is it supposed to be)? Meta-catalogs answer the first question. CAASM (Cyber Asset Attack Surface Management) answers the second. Together they're how a federated SOC gets cross-organization visibility without flattening every business unit onto the same platform.

Reading time: about 17 minutes. Evidence tier: B overall (vendor documentation from Apache Gravitino, Apache Polaris, and Databricks Unity Catalog; Axonius customer case studies; practitioner reports). Gravitino-specific claims are flagged as early-maturity throughout: the project is in Apache incubation and production references are still emerging.

The multi-catalog problem

Four catalogs, four APIs, one detection engineer.

The typical security data platform that's been alive for more than three years has usually grown by accretion rather than by design. The pattern I see most often has four parts: a legacy Hive metastore from the Hadoop era, still holding hundreds of terabytes of Zeek and Suricata logs that older investigations occasionally need; a Delta Lake on Databricks for current SOC analytics, governed by Unity Catalog; an Apache Iceberg deployment on S3, added in the last two years and governed by Apache Polaris; and a PostgreSQL instance somewhere holding threat intel feeds, asset inventory exports, and user context tables that everything else joins against.

That leaves four catalog types with four different APIs and four different governance models. It becomes a problem in practice on the day a detection engineer asks how to join CloudTrail in Iceberg with user context in PostgreSQL and legacy Zeek logs in Hive in a single query. The answer is that they need a layer above the catalogs.

That layer is the meta-catalog, and it's where the rest of this piece picks up.

Vocabulary

Unified catalog versus meta-catalog.

These terms get used loosely in vendor marketing, but the distinction matters once you're choosing components.

A unified catalog stores and manages metadata for one format or one ecosystem. Apache Polaris is a pure Iceberg catalog: vendor-neutral, Apache Software Foundation governance, REST API. Databricks Unity Catalog is multi-format (Delta Lake natively, plus Iceberg and Hudi) but Databricks-centric in deployment and lifecycle. The Hive metastore is a unified catalog for the legacy Hadoop ecosystem and still widely deployed.

A meta-catalog doesn't store table metadata itself. It federates access to multiple underlying catalogs and exposes a single API that knows how to route each metadata request to the right one. Apache Gravitino is the most prominent open-source example. The mental model is:

Gravitino API
   |-- Polaris Catalog (Iceberg tables)
   |-- Unity Catalog (Delta Lake tables)
   |-- Hive Metastore (legacy Hadoop tables)
   |-- PostgreSQL (relational data)

Gravitino is a metadata router rather than a data proxy, so the data still flows directly from storage (S3, ADLS, GCS) to the query engine (Trino, Spark, Dremio) while Gravitino only routes the metadata lookup. Governance is delegated to the underlying catalogs, which means Gravitino checks catalog-level role-based access control, the underlying catalog enforces table-level permissions, and the query engine applies row filters or column masks where the catalog supports them.
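The routing idea can be sketched in a few lines of Python. This is a toy illustration, not Gravitino's implementation or API: the CATALOG_BACKENDS mapping and the route_metadata_lookup function are hypothetical names, and the catalog names simply mirror the mental model above.

```python
from typing import Tuple

# Toy sketch of metadata routing; not Gravitino's actual code or API.
# Catalog names and backend labels are hypothetical, mirroring the diagram.
CATALOG_BACKENDS = {
    "polaris_catalog": "Polaris (Iceberg tables)",
    "unity_catalog": "Unity Catalog (Delta Lake tables)",
    "hive_catalog": "Hive Metastore (legacy Hadoop tables)",
    "postgres_catalog": "PostgreSQL (relational data)",
}


def route_metadata_lookup(qualified_name: str) -> Tuple[str, str]:
    """Return (backend, table) for a name like 'gravitino.hive_catalog.zeek_conn'.

    Only the metadata lookup is routed here; in the architecture described
    above, the query engine still reads the data directly from storage.
    """
    parts = qualified_name.split(".")
    if len(parts) != 3 or parts[0] != "gravitino":
        raise ValueError(
            f"expected gravitino.<catalog>.<table>, got {qualified_name!r}"
        )
    _, catalog, table = parts
    if catalog not in CATALOG_BACKENDS:
        raise KeyError(f"unknown catalog: {catalog}")
    return CATALOG_BACKENDS[catalog], table
```

The point of the sketch is the shape of the responsibility: a name goes in, a pointer to the owning catalog comes out, and no table data passes through the router.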

A practical example. Imagine a detection engineer wants to correlate CloudTrail events in Polaris-backed Iceberg, user records in PostgreSQL, and legacy Zeek connection logs in Hive, all in one query through Trino:

-- Join Iceberg (Polaris), PostgreSQL, and Hive via Gravitino
SELECT
    c.eventName,
    c.userIdentity.principalId AS user_id,
    u.email,
    u.department,
    z.src_ip,
    z.dest_ip
FROM gravitino.polaris_catalog.cloudtrail AS c
JOIN gravitino.postgres_catalog.users AS u
    ON c.userIdentity.principalId = u.user_id
LEFT JOIN gravitino.hive_catalog.zeek_conn AS z
    ON c.sourceIPAddress = z.src_ip
WHERE c.eventTime > NOW() - INTERVAL '1' DAY;

Trino talks to Gravitino. Gravitino routes the metadata lookups for each table to Polaris, PostgreSQL, and Hive. The actual join happens in Trino's query engine, reading data directly from S3 and the underlying PostgreSQL instance.

Maturity hedge

Gravitino is early. Take the production claims with caution.

Apache Gravitino entered the Apache Software Foundation incubator in 2024. The original development came out of Datastrato, the project is Apache 2.0 licensed, and the community has grown to a few thousand GitHub stars and several dozen contributors. That is a healthy early trajectory, but it is not the same maturity profile as Iceberg or Delta Lake, both of which have multi-year production deployments at Netflix, Apple, and Adobe scale behind them.

What that means in practice is that production references for Gravitino are still emerging. The performance numbers vendors cite (single-digit-percent cold-query overhead, near-zero warm-query overhead with metadata caching, 85%-plus cache hit rates on stable schemas) match the reference architecture and match practitioner reports I've seen from a handful of organizations testing it. They are not yet the kind of multi-petabyte, five-year-track-record validations that Iceberg has from Netflix or Delta has from Adobe, so I'd treat the performance claims as plausible and directionally correct rather than as settled industry consensus.
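To see how the three cited figures combine, here is a small worked calculation. The inputs are assumptions picked from inside the quoted bands (5% cold overhead as one "single-digit" value, 0% warm overhead, an 85% cache hit rate); they are not measurements, and the function name is hypothetical.

```python
def expected_overhead_pct(hit_rate: float, cold_pct: float, warm_pct: float = 0.0) -> float:
    """Blend warm and cold overhead by the metadata cache hit rate.

    hit_rate is a fraction in [0, 1]; cold_pct and warm_pct are the
    percentage overheads of a cache miss and a cache hit respectively.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * warm_pct + (1.0 - hit_rate) * cold_pct


# Assumed inputs: 85% hit rate, 5% cold overhead, 0% warm overhead.
# The blended figure comes out to about 0.75% average overhead.
blended = expected_overhead_pct(hit_rate=0.85, cold_pct=5.0)
```

Under those assumptions the average overhead is well under one percent, which is why the hit rate on your own schemas is the number worth measuring before trusting the headline claim.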

The alternative pattern, when Gravitino feels too early, is materializing cross-catalog joins through dbt rather than federating at query time. The dbt approach is operationally heavier but uses entirely mature tooling. If you only need cross-catalog queries occasionally (quarterly compliance reports, ad-hoc investigations) dbt may be the lower-risk choice. Gravitino's value rises sharply when cross-catalog queries are a daily operational need, not a quarterly one.

When meta-catalogs earn their keep

Three scenarios where federation pays back.

Meta-catalogs earn their keep on organizational complexity more than on technical complexity, because the technical complexity (joining tables across heterogeneous catalogs) can usually be solved by dbt and patience, while the organizational complexity is what makes Gravitino worth deploying.

1. Multi-decade data platform evolution

A financial services SOC with fifteen years of data infrastructure. Roughly 1.5 PB in Hive (still queried for historical investigations), 800 TB in Delta Lake (active SOC analytics), 500 TB in Iceberg (the modern lakehouse), and 100 GB in PostgreSQL (threat intel, asset inventory, CMDB).

Without a meta-catalog, engineers write four queries (Hive via Beeline, Delta via Databricks SQL, Iceberg via Trino, PostgreSQL via psql) and then combine the results in a spreadsheet. The composite case studies I've seen put that work at two to four hours per cross-platform investigation. With Gravitino, the same investigation collapses to a single SQL query and lands in the fifteen-to-twenty minute range, so the savings show up in incident response time rather than in storage cost.

2. Federated conglomerate with independent business units

A global conglomerate with twelve acquired companies, each running independent security operations. BU-A on Databricks with Unity Catalog. BU-B on Snowflake. BU-C on Iceberg with Polaris. BU-D on Google BigQuery. When corporate security investigates a supply chain attack that touches three of those BUs, the choice without a meta-catalog is CSV exports and manual joins in a spreadsheet, measured in days. With a meta-catalog, it's a single query that joins all three sources in seconds. This is the scenario where Gravitino's value is least ambiguous, because the alternative is so painful.

3. Data migration with a dual-write window

A security team migrating from Hive to Iceberg over eighteen months. During the dual-write phase, detection engineers need to query both catalogs to validate that the migration preserved correctness. A simple consistency check looks like this:

-- Validate Hive -> Iceberg migration consistency
WITH hive_counts AS (
    SELECT COUNT(*) AS cnt
    FROM gravitino.hive_catalog.cloudtrail
    WHERE date = '2025-11-01'
),
iceberg_counts AS (
    SELECT COUNT(*) AS cnt
    FROM gravitino.polaris_catalog.cloudtrail
    WHERE CAST(eventTime AS date) = DATE '2025-11-01'
)
SELECT
    h.cnt AS hive_row_count,
    i.cnt AS iceberg_row_count,
    ABS(h.cnt - i.cnt) AS discrepancy
FROM hive_counts h, iceberg_counts i;

Any nonzero discrepancy means rows were lost or duplicated in the migration; a zero only shows that the counts match, not that the contents do. Gravitino enables this kind of systematic validation across the whole dual-write window without rewriting the validation harness for each catalog.
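Extending the single-day check to a whole window is mostly a matter of generating one query per day. The sketch below only builds SQL strings; running them needs a Trino client, which is deliberately left out. The function names are hypothetical, and the table and column names are the ones used in the query above.

```python
from datetime import date, timedelta
from typing import List


def daily_check_sql(day: date) -> str:
    """Build the row-count comparison for one day as a single SQL string."""
    d = day.isoformat()
    return (
        "SELECT h.cnt AS hive_row_count, i.cnt AS iceberg_row_count, "
        "ABS(h.cnt - i.cnt) AS discrepancy "
        "FROM (SELECT COUNT(*) AS cnt FROM gravitino.hive_catalog.cloudtrail "
        f"WHERE date = '{d}') AS h "
        "CROSS JOIN (SELECT COUNT(*) AS cnt FROM gravitino.polaris_catalog.cloudtrail "
        f"WHERE CAST(eventTime AS date) = DATE '{d}') AS i"
    )


def checks_for_window(start: date, end: date) -> List[str]:
    """Return one check per day, with both start and end included."""
    span = (end - start).days
    if span < 0:
        raise ValueError("end must not be before start")
    return [daily_check_sql(start + timedelta(days=n)) for n in range(span + 1)]
```

A harness would run each string, record the discrepancy column, and flag any day where it is not zero.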

Picking the primary catalog

Polaris versus Unity Catalog, before you even consider Gravitino.

The question of whether to deploy a meta-catalog only makes sense after you've decided what the primary catalog is. For most security teams in 2026, that decision narrows to Apache Polaris or Databricks Unity Catalog. See catalog decision for the longer version of this comparison; the short version below is enough for the meta-catalog conversation.

Apache Polaris

The vendor-neutral Iceberg catalog, with Apache Software Foundation governance, a REST API, and multi-cloud deployment. It is the right choice for pure Iceberg deployments where the priority is query-engine flexibility (Trino, Dremio, Spark, StarRocks, Athena) and avoiding vendor lock-in. Governance here is table-level role-based access control, so there are no row filters or column masks at the catalog layer; pair it with engine-level policies (Trino row filtering) when you need finer-grained control.

Unity Catalog

The Databricks-centric governance layer, supporting Delta Lake natively plus Iceberg and Hudi. It is the right choice when the deployment is already Databricks-committed and the team needs fine-grained access control at the catalog layer, meaning row-level security and column masking enforced before the query engine sees the data. The cost is a roughly five-to-fifteen-percent query overhead from catalog-side policy evaluation; see catalog governance for the deeper policy-versus-performance treatment.

When Gravitino enters the picture

Choose a unified catalog (Polaris or Unity, not both, not Gravitino) when a single catalog type is sufficient, when governance consistency matters more than catalog diversity, and when the team values simplicity. The honest answer is that most security teams under 10K assets fit this profile and should not deploy Gravitino.

Choose a meta-catalog like Gravitino when three or more catalog types are already in production, when the organization is federated across business units with independent catalogs, when a data migration is in progress and side-by-side validation is needed, when cross-catalog lineage is a requirement, or when unified data discovery (a single search across all catalogs) is part of the team's mandate.
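The two paragraphs above reduce to a short decision rule, sketched here as a function. The function and parameter names are hypothetical, and the rule encodes only the criteria listed in this section; it ignores budget, staffing, and maturity concerns discussed elsewhere.

```python
def recommend_catalog_layer(
    catalog_types_in_production: int,
    federated_business_units: bool = False,
    migration_in_progress: bool = False,
    needs_cross_catalog_lineage: bool = False,
    needs_unified_discovery: bool = False,
) -> str:
    """Return 'meta-catalog' if any trigger from the text applies."""
    triggers = [
        catalog_types_in_production >= 3,  # three or more catalog types
        federated_business_units,          # independent catalogs per BU
        migration_in_progress,             # side-by-side validation needed
        needs_cross_catalog_lineage,
        needs_unified_discovery,           # one search across all catalogs
    ]
    return "meta-catalog" if any(triggers) else "unified catalog"
```

For example, a team running only Polaris with no migration under way gets "unified catalog", while the same team in the middle of a Hive-to-Iceberg migration gets "meta-catalog".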

The asset context gap

Knowing where the data lives is not knowing what it means.

Meta-catalogs solve a structural problem, in that they tell the query engine where data lives across multiple catalogs, but they do not tell anyone what that data means in the context of the organization, and that gap is where CAASM (Cyber Asset Attack Surface Management) comes in.

The cleanest way to motivate CAASM is to walk the same alert twice: once without asset context, and once with it. At 10:47 AM on a Tuesday, the SIEM fires. Sensitive S3 bucket accessed from an unusual IP. User alice@company.com. IP 203.0.113.45. Action s3:GetObject. Bucket finance-pii-customer-data. Severity HIGH.

Without asset context, the analyst's playbook is roughly: check Active Directory, check the CMDB for the IP (no results), query thirty days of CloudTrail, call Alice's manager (voicemail), get transferred through three teams, and resolve the alert as a false positive fifty minutes later. The IP turns out to be Alice's home office, an approved work-from-home setup. The alerting pipeline had never recognized it because nothing had ever joined her IP record against her bucket access pattern.

With CAASM enrichment, the same alert arrives with annotations attached. The user is Executive Assistant in the CFO Office. The IP owner is Alice, Home Office Router, approved work-from-home, first seen five months ago. The bucket is classified Critical, holding PII and Financial Data. The expected behavior baseline says Alice accesses this bucket daily from her home office. The investigation collapses to twelve seconds. The severity auto-adjusts from HIGH to INFO.

CAASM doesn't eliminate alerts here, but it does move the false-positive filtering out of the analyst's head and into the data, where the same logic can run across every alert instead of being re-derived by hand each time.
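The severity auto-adjustment in the example can be sketched as a small function. Everything here is hypothetical: the AssetContext fields, the adjust_severity name, and the rule itself are an illustration of the idea, not any vendor's logic.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class AssetContext:
    """Hypothetical slice of CAASM context attached to a source IP."""
    ip_owner: str                    # user the IP is attributed to
    ip_approved: bool                # e.g. approved work-from-home setup
    usual_buckets: FrozenSet[str]    # buckets routinely accessed from this IP


def adjust_severity(
    base_severity: str,
    user: str,
    bucket: str,
    ctx: Optional[AssetContext],
) -> str:
    """Downgrade to INFO only when every piece of context matches.

    With no context at all, the original severity is kept unchanged.
    """
    if ctx is None:
        return base_severity
    expected = (
        ctx.ip_owner == user
        and ctx.ip_approved
        and bucket in ctx.usual_buckets
    )
    return "INFO" if expected else base_severity
```

Note the conservative default: missing or partially matching context leaves the alert at its original severity, so the enrichment can only remove noise that it can positively explain.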

What CAASM does

One question, twenty-plus sources, one answer.

CAASM (Cyber Asset Attack Surface Management) is a category Gartner formalized in 2021. The product shape is consistent across vendors. Each one aggregates asset data from twenty-plus sources to answer a single question: what do we own, who owns it, and what's the risk?

The four jobs CAASM platforms do:

Asset discovery: pull every device, user, application, and cloud resource from identity providers, EDR, cloud APIs, vulnerability scanners, and network discovery.
Asset normalization: merge "Alice's laptop" from five tools into a single canonical record, deduplicating across data sources that don't share keys.
Risk context: attach criticality, vulnerability status, owner, and location to every record.
Ownership attribution: answer "who do I call?" during an incident, which is usually the bottleneck in federated response.
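The normalization job is the least intuitive of the four, so here is a deliberately simplified sketch. It assumes every source reports a MAC address to merge on, which is the easy case; real platforms correlate on several weak identifiers precisely because sources often do not share a key. The function name and record fields are hypothetical.

```python
from typing import Dict, List


def merge_asset_records(records: List[dict]) -> List[dict]:
    """Merge records that share a MAC address into one canonical record.

    For each field, the first non-empty value seen wins. The list of
    contributing sources is kept so the merge can be audited later.
    """
    merged: Dict[str, dict] = {}
    for rec in records:
        key = rec["mac_address"].lower()  # normalize case before matching
        canonical = merged.setdefault(key, {"mac_address": key, "sources": []})
        canonical["sources"].append(rec["source"])
        for field, value in rec.items():
            if field in ("mac_address", "source"):
                continue
            if value and not canonical.get(field):
                canonical[field] = value
    return list(merged.values())
```

Fed an EDR record that knows the hostname and an identity-provider record that knows the owner, this returns one record carrying both.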

The integration count matters less than what the platform exposes via API. The value to the security data platform comes from what the CAASM system can export into an Iceberg table for the logs to join against, which is what the integration pattern below covers.

Federation amplifies the value

Why CAASM matters more when the organization is federated.

CAASM is useful in most organizations, but it becomes close to essential in a federated one, because federation introduces four asset problems that single-platform organizations don't have at the same severity.

Siloed CMDBs. BU-A uses ServiceNow, updated quarterly. BU-B keeps a spreadsheet, updated "when we remember." BU-C runs BMC Remedy. The central SOC alerts on device 10.15.23.87 and the owner-identification workflow turns into emails to fifteen IT teams and a three-day wait. With CAASM, the device is auto-discovered from BU-C's network scanners and Active Directory, and the owner is identified in twelve seconds. The SOC value comes less from the discovery itself than from the cross-BU ownership graph that ties it together.

Inconsistent naming. BU-A calls a thing a "laptop," BU-B an "endpoint," BU-C a "workstation," BU-D a "mobile device." CAASM normalizes those to a canonical "Windows Device" so the SOC can write a single detection that doesn't need four OR-clauses.

Stale data. The quarterly CMDB update means a device that appeared on February 5 may not show up in the system of record until April 1, which leaves the security team blind to it for two months, whereas CAASM platforms run continuous discovery (typical refresh intervals are fifteen minutes to twenty-four hours, depending on the data source) so that gap closes from months to hours.

Shadow IT. BU-C spins up an AWS account for a proof-of-concept, never reports it to Central IT, runs fifty EC2 instances at $15K a month, and is invisible to the SOC until something goes wrong. CAASM that's been wired into AWS Organizations discovers the shadow account on its next sync.
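The shadow IT case comes down to a set difference between what the cloud organization reports and what the inventory already knows. The sketch below assumes both account lists have already been fetched (for AWS, the Organizations ListAccounts API is one possible source for the first list); the function name and the sample account IDs are hypothetical.

```python
from typing import Iterable, List


def find_shadow_accounts(
    org_account_ids: Iterable[str],
    inventoried_account_ids: Iterable[str],
) -> List[str]:
    """Return accounts present in the organization but absent from inventory."""
    return sorted(set(org_account_ids) - set(inventoried_account_ids))


# Hypothetical IDs: the middle account was never reported to Central IT.
shadow = find_shadow_accounts(
    ["111111111111", "222222222222", "333333333333"],
    ["111111111111", "333333333333"],
)
```

Here shadow contains only "222222222222". The comparison is trivial; the hard part in practice is getting read access to every organization root so that the first list is complete.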

Integration pattern

Wiring CAASM into the security data lake.

The integration pattern I see working most reliably is a daily export from the CAASM platform into an Iceberg table, enrichment via an incremental dbt model, and context-aware detection rules that filter on the enriched columns.

Step 1: export CAASM inventory

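Every major CAASM platform exposes an export API. The shape is roughly as follows; treat the endpoint path, payload, and field names as illustrative and check the vendor's API reference for the real ones: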

curl -X POST https://api.axonius.com/api/assets/export \
  -H "Authorization: Bearer $AXONIUS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "format": "json",
    "fields": ["asset_id", "asset_type", "owner", "department",
               "location", "ip_addresses", "mac_address",
               "criticality", "vulnerabilities", "last_seen"]
  }' > caasm_inventory.json

Daily cadence is sufficient for most workflows. Exporting faster than daily adds complexity with diminishing returns, and anything faster than hourly buys nothing, because the CAASM platform's upstream sources usually don't refresh faster than once an hour anyway.
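Before loading, it can help to flatten the export into one row per IP address, so that downstream joins can match on a plain column. The sketch assumes the export file is a JSON array of objects carrying the fields requested above; that shape is an assumption, not a documented vendor format, and flatten_inventory is a hypothetical name.

```python
import json
from typing import List


def flatten_inventory(raw_json: str) -> List[dict]:
    """Turn each asset into one row per IP address.

    Assets with no IP addresses produce no rows, since they cannot be
    matched against log events by IP anyway.
    """
    assets = json.loads(raw_json)
    rows = []
    for asset in assets:
        for ip in asset.get("ip_addresses") or []:
            # Copy every field except the array, then add the single IP.
            row = {k: v for k, v in asset.items() if k != "ip_addresses"}
            row["ip_address"] = ip
            rows.append(row)
    return rows
```

A laptop seen on both an office and a home network becomes two rows with the same asset_id, which is the form an equality join on IP needs.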

Step 2: dbt enrichment

Load the inventory into an Iceberg table and join security logs against it in an incremental dbt model, so that detection queries read rows that are already enriched. The dbt pattern looks like this:

-- models/enriched/ocsf_with_asset_context.sql
-- Written for Trino; adjust interval and array syntax for other engines.
{{ config(materialized='incremental') }}

SELECT
    ocsf.time,
    ocsf.src_endpoint_ip,
    ocsf.activity_name,
    ocsf.actor_user_uid,
    -- CAASM enrichment (the export in Step 1 names this field "owner")
    caasm.owner AS asset_owner,
    caasm.asset_type,
    caasm.criticality,
    caasm.department,
    caasm.location,
    -- expected_location is assumed to exist in the inventory table;
    -- it is not in the Step 1 export field list
    caasm.expected_location,
    caasm.vulnerabilities,
    caasm.last_seen,
    -- Context flags (the first matching branch wins)
    CASE
        WHEN caasm.criticality = 'Critical'
             AND ocsf.activity_name IN ('DeleteBucket', 'ModifySecurityGroup')
            THEN 'high_risk_action_on_critical_asset'
        WHEN caasm.location != caasm.expected_location
            THEN 'asset_location_anomaly'
        WHEN caasm.last_seen < CURRENT_TIMESTAMP - INTERVAL '7' DAY
            THEN 'stale_asset_activity'
        ELSE 'normal'
    END AS risk_flag
FROM {{ ref('cloudtrail_ocsf') }} AS ocsf
LEFT JOIN {{ ref('caasm_asset_inventory') }} AS caasm
    -- ip_addresses is an array column; contains() is Trino's membership test
    ON contains(caasm.ip_addresses, ocsf.src_endpoint_ip)
{% if is_incremental() %}
-- Only process events newer than the latest row already in this model
WHERE ocsf.time > (SELECT MAX(time) FROM {{ this }})
{% endif %}

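The performance characteristic that matters: the CAASM inventory is typically 10K to 100K rows, while the security logs it joins against run 100M to 10B rows. This is the canonical broadcast-join pattern: replicate the small CAASM table to every query node and stream the large log table past it. Query engines like Trino and Spark handle this automatically when the small side is below the broadcast threshold. One caveat: an array-membership predicate like the one above is not an equality join, so the engine cannot hash on it and falls back to comparing each event against the inventory rows. If that join becomes the bottleneck, flatten ip_addresses into one row per IP in the inventory model and join on equality.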
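The broadcast pattern itself is easy to show in miniature. The sketch below is a single-process illustration of what each query node does, not how Trino or Spark implement it: build a hash map from the small table, then stream the large one past it. It assumes the inventory has one row per IP with an ip_address column; the function and field names are hypothetical.

```python
from typing import Iterable, Iterator, List


def broadcast_join(log_events: Iterable[dict], asset_rows: List[dict]) -> Iterator[dict]:
    """Left-join streamed log events to a small in-memory asset table."""
    # Build side: hash the small table once, keyed by IP address.
    assets_by_ip = {row["ip_address"]: row for row in asset_rows}
    # Probe side: one constant-time lookup per event, no sorting or shuffling.
    for event in log_events:
        asset = assets_by_ip.get(event["src_endpoint_ip"])
        yield {
            **event,
            "asset_owner": asset["owner"] if asset else None,
            "criticality": asset["criticality"] if asset else None,
        }
```

Because log_events can be any iterable, the large side is never held in memory; only the small table is. Events with no matching asset still come through with empty context, which mirrors the LEFT JOIN in the model.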

Step 3: context-aware detection rules

The payoff is detection rules that filter on context rather than on raw event signatures alone. A traditional, context-free rule:

SELECT * FROM cloudtrail_ocsf
WHERE activity_name = 'DeleteBucket'
  AND time > NOW() - INTERVAL '1' HOUR;

The composite case studies I work from suggest this kind of rule produces somewhere in the range of 40 to 50 alerts per hour in a mid-size enterprise, most of which are Alice deleting yesterday's backup bucket. The context-aware version:

SELECT
    time, src_endpoint_ip, activity_name,
    asset_owner, location AS current_location,
    expected_location, criticality
FROM enriched.ocsf_with_asset_context
WHERE activity_name = 'DeleteBucket'
    AND time > NOW() - INTERVAL '1' HOUR
    -- Events with no matching asset have NULL context and are
    -- excluded by the filters below
    AND (
        (criticality = 'Critical' AND location != expected_location)
        OR (last_seen < NOW() - INTERVAL '30' DAY)
        OR (vulnerabilities LIKE '%CRITICAL%')
    )
    AND NOT (asset_owner = 'alice@company.com'
             AND location = 'Home Office');

Vendors and practitioners cite false-positive reductions loosely in the 70% to 95% range for context-aware rewrites, though I haven't found a single rigorous published study behind those numbers, so treat the band as directional. My honest read: the upper end of that range is achievable for a handful of well-understood rules, and the typical result across a full detection library is closer to the middle of the range once you account for rules that don't have obvious context to add.

Cost and ROI

What CAASM actually costs and where it doesn't pay back.

CAASM pricing in 2026 is per-asset, with material discounts at volume. The bands I see in RFP data for the major vendors run roughly:

Deployment size	Assets	Price per asset / yr
Small	under 1,000	$40 to $50
Mid-market	1,000 to 10,000	$15 to $30
Enterprise	10,000 to 50,000	$8 to $15
Global	above 50,000	$5 to $10

A 10,000-asset deployment sits on the boundary between the mid-market and enterprise bands, so it lands somewhere in the $80K to $300K per year range depending on vendor, contract, and features included.
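The arithmetic behind a figure like that is worth making explicit, because boundary sizes fall into two bands at once. The sketch below encodes the table above as data and multiplies through; it covers list-price bands only and ignores contract discounts and feature tiers. The names are hypothetical.

```python
from typing import Dict, Optional, Tuple

# (band, min assets, max assets or None, low $/asset/yr, high $/asset/yr)
# Boundary values such as 10,000 appear in both neighbouring bands,
# matching the overlap in the table.
PRICE_BANDS = [
    ("Small", 0, 999, 40, 50),
    ("Mid-market", 1_000, 10_000, 15, 30),
    ("Enterprise", 10_000, 50_000, 8, 15),
    ("Global", 50_001, None, 5, 10),
]


def annual_cost_ranges(assets: int) -> Dict[str, Tuple[int, int]]:
    """Return {band: (low, high)} annual cost for every band that applies."""
    result: Dict[str, Tuple[int, int]] = {}
    for name, lo, hi, price_lo, price_hi in PRICE_BANDS:
        upper: Optional[int] = hi
        if assets >= lo and (upper is None or assets <= upper):
            result[name] = (assets * price_lo, assets * price_hi)
    return result
```

For 10,000 assets this yields $150,000 to $300,000 at mid-market rates and $80,000 to $150,000 at enterprise rates, which is why the band a vendor places you in matters as much as the per-asset price inside it.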

The ROI math vendors publish (a 40% to 60% reduction in incident investigation time, $300K to $500K in annual analyst productivity for a 50-person SOC) only holds when the baseline is bad. If the existing CMDB is reasonably accurate and the SOC already has decent cross-BU coordination, the marginal value is meaningfully smaller. CAASM works best when it replaces hours of manual asset chasing per incident, not when it adds a fourth source of truth to an already-functional ownership graph.

Three honest limitations of CAASM that vendors don't lead with:

Cost is real for small organizations. Under 1,000 assets, the per-asset pricing makes manual CMDB curation or an open-source tool like NetBox more cost-effective.
Integration overhead is non-trivial. CAASM platforms need API access to a dozen or more source systems. If BU IT teams won't issue API keys, the platform aggregates nothing.
Data quality is upstream. Gartner's 2022 CMDB survey found 40% to 60% accuracy is typical, and CAASM correlation inherits that ceiling. Continuous discovery raises the ceiling modestly; it doesn't replace the work of cleaning the underlying sources.

The framing that matters most here is that CAASM is not a detection tool and does not replace the SIEM or the EDR. It is the data foundation that makes detection more accurate, giving each rule the organizational context it needs to filter false positives. Teams that treat CAASM as a detection product often end up disappointed by it.

Putting it together

Meta-catalog plus CAASM equals federated visibility.

The two layers answer different questions. The meta-catalog answers where does the data physically live across our catalogs. CAASM answers what does any given asset in that data mean to the organization. A federated security architecture needs both because each one alone is incomplete.

A meta-catalog without CAASM lets a detection engineer write a single SQL query against four catalog types and still not know who owns the device the query returned. A CAASM platform without a meta-catalog produces a beautifully curated asset inventory that nobody can join against the four different log stores where the actual events live. The combination is what makes "we have visibility across the federation" a defensible statement rather than a marketing one.

For the rollout sequence (when to deploy which layer first, how to phase the budget, how to avoid the common ordering mistakes) see the federated rollout playbook, though the short version is that you stand up the primary catalog first (Polaris or Unity), wire CAASM into it as the first enrichment source, and only add Gravitino once you have three or more catalog types that need to be queried in a single statement on a daily basis.

Takeaways

Six things to walk away with.

Meta-catalogs solve organizational complexity, not just technical complexity. Three-plus catalog types across federated business units is when Gravitino starts to earn its operational footprint, and the cold-query overhead in the testing I've seen is in the single-digit percent range, with near-zero overhead once the metadata cache is warm.
Choose Polaris for vendor-neutral Iceberg or Unity Catalog for fine-grained governance first, and only then consider Gravitino. Most security teams under 10K assets do not need a meta-catalog and should resist deploying one for theoretical-future-use reasons.
Gravitino is early. Apache incubation status, emerging production references, performance benchmarks that match the reference architecture but lack the multi-year multi-petabyte validation that Iceberg and Delta have. Treat it as plausible and directionally correct, not as settled.
CAASM transforms detection quality by moving false-positive filtering from analyst heads into the data. The commonly cited 70% to 95% false-positive reductions are an upper bound; expect the middle of that range across a typical detection library, with the bigger wins on rules that have obvious context to add.
Federated enterprises see disproportionate CAASM value because siloed CMDBs, inconsistent naming, stale data, and shadow IT are federation-amplified problems. The marginal value is smaller for single-platform organizations with disciplined CMDB hygiene.
Start with daily dbt-based enrichment, not real-time. Daily CAASM export into Iceberg plus dbt joins covers most use cases. Real-time lookup (Cribl plus a Redis cache, for example) is the next increment of complexity, and most teams reach for it before they have evidence the daily pattern is the bottleneck.