Security Data Works

Technology deep-dive

Catalog governance without native support.

Catalog choice often dictates governance, so if you chose Apache Polaris, Nessie, or Hive Metastore for vendor neutrality or multi-cloud reach, you walked away from the built-in row-level security and column masking that Unity Catalog and Snowflake provide, though you didn't walk away from governance itself, because you can still implement it one layer up at the query engine as long as you accept the overhead that comes with that decision honestly.

Reading time: about 18 minutes. Evidence tier: B (vendor documentation from Trino, Dremio, Databricks, and Apache Polaris, plus architectural patterns I've validated with practitioners). Performance numbers are vendor-published and have not been independently reproduced on security workloads, so hedge accordingly.

The governance gap

Polaris enforces table-level permissions. That isn't enough.

Picture the scenario. You've deployed Apache Polaris in front of your Iceberg tables. You've achieved vendor neutrality: Trino queries the same tables from AWS that Spark queries from Azure. The architecture diagram looks clean. Then a compliance officer asks the question that every multi-regional SOC eventually faces: "How do we ensure EU analysts only see logs from EU regions?"

The honest answer is that Polaris, in its current form, can't help you with that question, because it enforces table-level permissions, so it can decide whether an analyst can read the cloudtrail table at all but it cannot filter rows inside that table based on who's asking.

Two governance primitives address this gap. Row-level security (often abbreviated RLS) filters which rows a user is allowed to see based on their identity attributes: region, business unit, customer tenant, clearance level. Column masking redacts or obfuscates sensitive column values for users who don't have access to the underlying data. IP addresses become "REDACTED," usernames get hashed, email addresses get truncated. Both fall under what's sometimes called attribute-based access control (ABAC), where the policy decision depends on user attributes rather than only on role membership.

Unity Catalog provides both natively, enforced at the catalog layer. If you chose Polaris, Nessie, or another catalog without that machinery, you have to implement equivalent governance somewhere else in the stack. This essay covers where that "somewhere else" can be, what it costs in query latency, and which patterns I'd actually recommend.

Four enforcement layers

Where governance can live.

Data access governance can be enforced at four layers of the stack. The trade-offs are different at each one.

Catalog layer

Policies live in catalog metadata and apply to every query engine that uses the catalog, which is what Databricks Unity Catalog and Snowflake provide, and the advantage is that you write the policy once and it enforces everywhere, though the cost is that you've now coupled your governance story to a specific catalog vendor, so for organizations that chose Polaris or Nessie precisely to avoid that coupling, this layer isn't an option.

Query engine layer

Policies are configured in Trino, Dremio, or Starburst and applied during query execution, and because this works with any catalog it preserves vendor neutrality, though the downside is that each query engine has its own policy syntax, so Trino policies don't transfer to Dremio and Dremio policies don't transfer to Spark, which means that if you're a multi-engine shop, you may end up maintaining the same policy in two or three places.

Lakehouse platform layer

This is platform-native governance from Databricks, Snowflake, or Dremio, and it integrates well with the platform's identity management (Azure AD, Okta, native RBAC), with performance that is often better because the platform can optimize the policy enforcement path, though the lock-in cost is real because these policies don't survive a platform migration.

View layer

These are SQL views with WHERE clauses that encode the policy, and because they work with any catalog and any query engine they're the most portable option, but they're also the most operationally painful for reasons I'll cover in detail later, so treat this as a last resort rather than a primary strategy.

For Polaris-plus-Trino stacks, my recommendation is the query engine layer: Trino row filters and column masks defined in file-based access control configuration. This preserves vendor neutrality at the catalog layer (which was why you chose Polaris) while still giving you fine-grained access control. See my catalog decision essay for how this choice interacts with the broader Unity vs Polaris vs Nessie comparison.

Trino row filters

How regional access actually gets enforced.

Trino (formerly PrestoSQL) supports row filters: policies that inject a WHERE clause into the user's query based on their identity. The mechanism is straightforward once you've seen it.

Take the scenario the compliance officer asked about. Your security team operates across three regions: US, EU, and APAC. Analysts should only see CloudTrail events from their assigned region. The policy lives in a JSON file Trino reads at startup.

{
  "catalogs": [
    {
      "catalog": "iceberg_polaris",
      "allow": "all",
      "rowFilters": [
        {
          "table": "security_logs.cloudtrail",
          "filter": "aws_region IN (SELECT region FROM user_attributes WHERE username = '${USER}')"
        }
      ]
    }
  ]
}

The $${USER} variable is the identity Trino received from whatever authentication plugin it's running (LDAP, OAuth, Kerberos). The user_attributes table maps usernames to regions, and you populate it from whatever HR or directory system holds the authoritative answer.

When an analyst named Alice runs a typical threat-hunting query asking for AssumeRole events across all regions, Trino rewrites the query to inject the regional filter before execution, so Alice gets back the rows from her region (us-east-1, say) and nothing else, and the dashboard and detection rules she's using never have to know about the policy because no code changes ripple through downstream tools.

The same pattern extends naturally. Multi-tenant MSSPs filter by customer_id against a user_customers mapping table. Compliance teams enforce time-based access (only the last 90 days, satisfying GDPR retention rules) by injecting a timestamp predicate, and tiered SOC operations filter junior analysts to low and medium severity alerts while letting tier-2 leads see critical events, and in each of these cases the mechanism is a WHERE clause Trino injects, sourced from a lookup table you maintain.

Trino column masks

Redacting IP addresses without breaking correlation.

Column masking is the second half of the pattern, and the mechanism is similar, because Trino reads a JSON configuration that maps columns to CASE expressions and applies those expressions during query execution.

{
  "columnMasks": [
    {
      "table": "security_logs.cloudtrail",
      "column": "sourceIPAddress",
      "mask": "CASE WHEN is_member('pii_access') THEN sourceIPAddress ELSE 'REDACTED' END"
    }
  ]
}

A tier-1 SOC analyst investigating console logins without MFA sees the source IP column as "REDACTED" and can do their triage work without seeing the raw address. A tier-2 lead with the pii_access role sees actual IP addresses and can correlate, so when five failed logins from 203.0.113.45 likely mean a brute-force attempt that needs blocking, tier-1 can identify that something's wrong while tier-2 can identify what to block, which is the escalation pattern encoded into the data layer by the masking itself.

Three masking patterns are worth knowing. Partial masking shows the last octet of an IP address (xxx.xxx.xxx.45) which is sometimes enough to spot patterns without exposing the full address. Hash-based masking replaces the IP with a deterministic hash, so IP-a1b2c3d4 means the same IP every time it appears, and an analyst can correlate without ever seeing the actual value. Conditional masking leaves internal RFC 1918 addresses visible (those are usually safe to expose) and only redacts external IPs.

The hash-based pattern is the one I see most often in mature SOCs, because it preserves the analytical utility of the column. Counting distinct IPs, joining against threat intelligence feeds, looking for repeated offenders: all of those work on a deterministic hash exactly the same way they work on the raw value, with the privacy property that the hash never reverses back into an identifiable address.

Spark and Dremio

What changes when the query engine isn't Trino.

Spark SQL session filters

Spark doesn't have a file-based access control system the way Trino does. The closest equivalent is to inject session-level configuration that wraps every query in a view derived from the user's identity. In practice this means either building a custom Spark SQL extension that intercepts query planning and rewrites it, or wrapping tables in dynamic views whose definitions reference current_user(). Apache Ranger plugs into Spark to do this externally, though Ranger's Iceberg support is community-driven rather than vendor-backed and the production references are mostly Hadoop-era.

The honest read is that Spark-only governance is the hardest of the three engines I'm covering here. If your stack is Spark-dominant and you need fine-grained governance, the path of least resistance is either to put Trino in front of Spark for analyst queries, or to adopt Unity Catalog and accept the Databricks coupling.

Dremio's semantic layer

Dremio takes a different approach. Policies are defined in a visual UI (Catalog → Table → Access Policies, then Catalog → Table → Column → Masking Policy) rather than JSON files, and they live in Dremio's unified semantic layer. The advantage that matters for security teams is that downstream BI tools (Tableau, Power BI, Grafana) inherit those policies automatically, so when a Tableau dashboard queries CloudTrail through Dremio the row filter applies and you don't have to duplicate the policy in every BI tool's data source configuration.

The cost is that the policies are Dremio-specific, so if you swap Dremio for Trino later the policies don't migrate, and Dremio is also commercial software, with Dremio Cloud priced at a list rate of $0.20 per DCU (a 32-DCU engine runs $6.40/hour), which adds a recurring cost the Trino path doesn't have.

OPA integration

When policy logic outgrows JSON files.

Trino's file-based access control works well when your policies are stable and the lookup logic is simple. It works less well when you have hundreds of policies, frequent updates, or policy logic that depends on context the JSON can't easily express: time-of-day rules, break-glass exemptions, or compliance frameworks that require explicit justification before access.

Open Policy Agent (OPA) is the pattern I'd reach for at that scale. OPA is a general-purpose policy engine that evaluates policies written in a language called Rego. Trino integrates with OPA through a plugin that, instead of reading static JSON files, calls out to an OPA service on every authorization decision. The policies live in Rego, which is much more expressive than JSON, and they can pull attributes from external systems (HR data, ticketing systems, threat intelligence feeds) at decision time.

The trade-off is latency. Every Trino query that touches a governed table now makes an out-of-process call to OPA. That call may add a few milliseconds in the best case (OPA is fast, and it's designed for high-throughput decision making) but it's another network hop, another service to monitor, and another failure mode to handle (what does Trino do when OPA is down? You need to decide that, because the default of "deny everything" may be the wrong answer for break-glass scenarios).

I'd reach for OPA when policy complexity makes the JSON file unmaintainable, or when you need to unify authorization decisions across more than one system (Trino plus Kubernetes plus the API gateway, all reading the same Rego policies). For a single-engine stack with stable regional and role-based policies, file-based access control is usually enough.

The view anti-pattern

Why "just use views" stops working past a handful of roles.

Before row filters and column masks existed in query engines, teams used SQL views for governance. Create a cloudtrail_us view that filters to aws_region = 'us-east-1', a cloudtrail_eu view for the EU equivalent, a cloudtrail_masked view that redacts PII columns, and grant analysts SELECT on the right view for their role.

This works at very small scale, but it collapses once you reach real scale, and it collapses for predictable reasons, because ten regions times five tables times three roles is 150 views to maintain, so adding a new region means writing fifteen new view definitions and remembering to grant the right permissions on each. Views also can't dynamically filter based on current_user() in a way that does what you'd want. Standard SQL views are static definitions, not parameterized policies, so the filtering logic has to be baked into the view definition itself, which is what produces the combinatorial explosion in the first place.

There's an audit-trail problem too. When a row filter is applied through Trino's access control system, Trino logs which filter applied to which query. When the same logic is baked into a view, the audit log just shows that the user queried a view; there's no record of what the view filtered out. That's a real compliance gap for SOC 2 or HIPAA audits, where you may need to demonstrate which policy applied to a given access event.

Views are acceptable for narrow cases: two or three roles, static policies, simple single-column predicates. A alerts_soc view that filters to the last 90 days at severity HIGH or above is fine. For anything beyond that, the row-filter-and-column-mask pattern is the right tool.

The cost honestly

Row-level security has measurable query overhead.

This is the section where the marketing material usually gets quiet. Row-level security and column masking aren't free. They add query-planning work, they add subquery overhead, and depending on how well your filter columns are partitioned, they may force scans that wouldn't otherwise happen.

Vendor-published benchmarks suggest the overhead lands somewhere between 5% and 30% for typical workloads. Unity Catalog claims 5–15% because the predicate pushdown happens at the catalog layer where the file statistics already live, while Trino's file-based access control documentation suggests 15–50% depending on how the filter column is partitioned. Dremio's semantic layer governance falls roughly in the middle. These numbers come from vendor sources, they haven't been independently reproduced on security workloads at scale, and the spread is wide enough that I'd treat any specific figure with skepticism. The directionally honest claim is that row-level security may add measurable query overhead in the 5–30% range for well-partitioned workloads, and substantially more when the filter column doesn't align with the partition scheme. Verify on your own data before quoting a number.

The most important optimization is partitioning. If your CloudTrail table is partitioned by aws_region, a row filter that constrains aws_region only scans the relevant partitions; partition pruning happens before the filter is even evaluated. If the filter column isn't a partition column, every query has to scan the whole table and then filter, which is where the higher end of the overhead range lives.

Practical advice: align your governance dimensions with your partitioning strategy, so if you know you'll filter by region then partition by region, and if you know you'll filter by tenant then partition by tenant, because the overhead of getting this wrong compounds with every query, and at security-data scale, that compounding is what turns a 15% tax into an operational problem.

Audit logging

Policies without audit trails fail compliance reviews.

A row filter that works without leaving a trail isn't enough for most compliance frameworks. GDPR, HIPAA, and SOC 2 all expect you to demonstrate which policy applied to which access event, who could see what, and when policies changed.

Trino logs every query to /var/log/trino/query.log, and the log entry includes which row filters were applied, which user ran the query, and which tables were accessed. Export those logs to your SIEM (or to whatever lakehouse table you use for SOC-on-SOC analytics) and you have a queryable audit trail. The key bits to capture are the policy enforcement events themselves (which filter was applied to which query), policy changes (who modified a policy, what the old and new versions were), and access denials (which users were blocked from which tables).

Unity Catalog provides a more integrated path. It exposes a system.access.audit table that you can query directly with SQL, including which row filters and column masks applied to each query and how many rows were filtered out. That's a cleaner story for a SOC 2 auditor than digging through Trino log files, and it's one of the genuine reasons Unity Catalog is worth its lock-in cost if compliance auditing is high-frequency in your environment.

Implementation path

A pragmatic rollout for Polaris plus Trino.

Step 1: build the user attributes table

Create a single source of truth that maps usernames to whatever attributes your policies need: region, business unit, customer tenant, clearance role. Populate it from your directory or HR system, and refresh on a cadence that matches how often those attributes change (daily is usually enough; for contractor-heavy environments, hourly may be worth it). This table is the foundation everything else leans on.

Step 2: define row filters in Trino access control

Edit /etc/trino/access-control/policies.json, add the row filter referencing user_attributes, restart Trino. Start with one table and one filter pattern (usually regional access for the highest-volume table) and validate end to end before generalizing.

Step 3: define column masks for PII columns

Identify the PII columns in your security telemetry: usually source IP, usernames in principalId-style fields, and any email addresses or session tokens. Add column masks using is_member('role') CASE expressions. The hash-based masking pattern is usually the right default because it preserves analytical utility.

Step 4: test, measure, log

Log in as a non-privileged user and verify the filter and mask both apply. Measure query latency before and after on a representative workload. This is where the 5–30% overhead estimate becomes either a real concern or a non-event for your specific tables. Export Trino's query log to wherever your audit trail lives, and confirm the filter-applied events are captured.

Conclusion

Governance is non-negotiable. The layer is the choice.

You don't get to skip governance because your catalog doesn't support row-level security natively, because the moment sensitive logs land in a data lake you need row-level filtering and column masking, which compliance frameworks expect and security operations require, with the audit trail still having to be there when someone asks.

The choice is which layer enforces the policy. Catalog-level governance via Unity Catalog or Snowflake gives you the cleanest implementation and the best performance, at the cost of vendor coupling. Query engine-level governance via Trino, Dremio, or OPA preserves catalog neutrality and works with Polaris or Nessie, at the cost of per-engine policy duplication and somewhat higher query overhead. View-based governance is an anti-pattern for anything beyond trivial scale, but it remains useful for narrow static policies.

My recommendation for the multi-cloud Polaris stack: row filters and column masks at the Trino layer, file-based access control with policies referenced against a maintained user_attributes table, partition-aligned schemas to keep the overhead manageable, and OPA when policy complexity outgrows the JSON file. It's not as clean as Unity Catalog, but it preserves the vendor-neutrality choice you made when you picked Polaris in the first place, and the governance still gets done, it just lives one layer up.

One part of governance the row-filtering conversation tends to skip is the audit trail itself, and the table format hands you most of it for free. I checked this on the stack: three commits to an Iceberg table leave three immutable snapshots in an unbroken parent-to-child chain, and time-travel reads back the exact state as of each one, so "what did this table say when the analyst queried it" has a verifiable answer and a deletion or backfill can't quietly rewrite history. A schema-on-read SIEM that stores telemetry in a mutable index has no equivalent — there's no tamper-evident record of the past — which is precisely the property Reg SCI and 17a-4(f) evidence demand, and it holds under any of the REST catalogs because it's a property of the table format, not the catalog vendor.