Security Data Works

Technology deep-dive

dbt for security data.

dbt is the SQL transformation framework the analytics-engineering world has been using for almost a decade, and detection engineers haven't picked it up, mostly because dbt Labs markets the tool to people building customer-360 dashboards rather than to the people normalizing CloudTrail to OCSF, even though the fit is much better than the marketing suggests. This essay walks through how I'd actually use dbt to manage OCSF normalization, detection logic, and data-quality checks against a security data lake, written for engineers who know SQL and Git but have never run dbt run.

Reading time: about 18 minutes. Evidence tier: B (SnowAlert as historical production deployment, LinkedIn's Moonbase as conceptual parallel, Databricks and Snowflake vendor architectures), with one Tier C hedge on dbt Cloud pricing and one explicit Tier D speculation about row-lineage-based incremental models that haven't shipped in the dbt-iceberg adapter yet.

The pain

Eighty SQL files in a shared drive is not a pipeline.

The pattern I see often enough that it qualifies as a category: a detection engineering team maintains some number of SQL files that transform raw log sources into a normalized schema. Eighty files is roughly typical. Each file is a few hundred lines, covers one log source (CloudTrail, Azure AD, Windows Security events, VPC Flow, EDR telemetry), and produces a table the detection rules query against. The files live in a shared drive or a Git repo that nobody runs CI on.

The schema is OCSF, or some internal pre-OCSF normalization, or both. When OCSF v1.3 ships and the team needs to migrate, the work is exactly what you'd expect: you open eighty files, find the field mappings that changed, edit each one, and hope you got them all, then you test by running the downstream detection rules and watching for the ones that go silently to zero, and you discover the missed field three weeks later during a postmortem.

The data engineering team next door doesn't work this way. They've been using dbt for years. When the warehouse schema changes, they update one shared piece of SQL (dbt calls it a macro), run the test suite, watch the build pass or fail in CI, and ship. The discipline isn't novel, because it's the same version control, modular code, and automated testing that any working software team uses, and dbt is mostly packaging it for SQL.

Detection engineers have rebuilt this pattern badly several times. LinkedIn's Moonbase platform (their internally-built CI/CD for security detections) is the most public example, and the engineering-blog write-up reads as parallel to dbt almost line-for-line, just with custom tooling. dbt is the off-the-shelf version of the same idea, and the cost to adopt is a Python install plus a few hours of learning curve, not a year of platform engineering.

What dbt actually is

A SQL build tool, not a database.

dbt (the name stands for "data build tool") is an open-source framework from dbt Labs that handles the T in ELT (extract, load, transform). It isn't a database or a query engine or a warehouse, because what it does is compile SQL files into queries and submit them to whatever engine you already have: Spark on Iceberg, Snowflake, BigQuery, Databricks, ClickHouse via the community adapter, and a long tail besides, so the engine does the work while dbt manages the SQL.

Four concepts cover roughly ninety percent of what you'll touch in the first three months:

  • Models are SQL files that each describe one transformation. A model is a SELECT statement. dbt wraps it in the right CREATE TABLE or CREATE VIEW boilerplate for your engine and runs it. One model per output table is the default mental model.
  • Materializations are how dbt builds the model. A view materialization re-runs the SQL every time someone queries it. A table materialization runs the SQL once, stores the output, and serves queries from the stored copy. An incremental materialization runs the SQL only over new rows since the last build. That last one matters most for high-volume security logs.
  • Tests are assertions about the data. dbt ships with four built-in tests (not_null, unique, accepted_values, relationships) and lets you write custom ones in SQL. A test is a SELECT statement that should return zero rows; if it returns rows, the test fails and the build can be configured to stop.
  • Exposures are how you declare what downstream system uses a model. For detection engineering, an exposure is the link from a normalized OCSF table to the detection rules that depend on it. When you change the table, dbt's lineage graph tells you which rules will be affected.

The reason these four primitives cover so much ground is that dbt is opinionated about Git. Every model, every test, every macro lives in a Git repo. dbt run is reproducible against any commit. CI runs dbt test before merging. The auditability story for detection-as-code falls out of the tool's defaults, not as a feature you have to bolt on.

Pattern 1

Incremental models for high-volume logs.

The single biggest reason dbt makes sense for security data is the incremental materialization. CloudTrail at a 10,000-employee enterprise produces somewhere around 1 TB of raw events per day. Over a year of retention, that's roughly 365 TB of accumulated history. A naive transformation pipeline reprocesses the whole 365 TB every night to produce the next day's normalized table. That's ridiculous, and it's also exactly what a lot of homegrown SQL pipelines do, because the alternative is hand-rolling watermark logic.

An incremental model in dbt looks roughly like this:

-- models/normalized/cloudtrail_ocsf.sql
{{ config(materialized='incremental', unique_key='metadata_uid') }}

SELECT
  eventTime AS time,
  '3005' AS class_uid,
  eventName AS activity_name,
  userIdentity.principalId AS actor_user_uid,
  sourceIPAddress AS src_endpoint_ip,
  awsRegion AS cloud_region,
  eventID AS metadata_uid
FROM {{ source('raw_logs', 'cloudtrail_raw') }}

{% if is_incremental() %}
  WHERE eventTime > (SELECT MAX(time) FROM {{ this }})
{% endif %}

The config block at the top tells dbt to use the incremental strategy. The is_incremental() block is dbt's way of saying "only include this WHERE clause on incremental runs, and on the first build, scan everything." The double-braces are Jinja templates, which dbt uses to inject references and conditionals into otherwise-plain SQL.

On the first dbt run, the model processes the entire history. On every subsequent run, it processes only events newer than the latest row already in the target table. For a 365 TB historical dataset with 1 TB/day of new data, the daily delta is 1/365th of the work. The compute savings are roughly two orders of magnitude. The dbt community reports 85% to 99% reductions depending on workload shape, and that range matches my experience for batch-shaped security workloads. I'd treat any single percentage you see quoted as directional rather than precise; the real number depends on your retention window, your data volume, and how often you have to do a full refresh (schema changes, data-quality issues, occasional cleanups).

The honest caveats: incremental models require idempotency. If the same source row gets processed twice, the unique key tells dbt to merge rather than duplicate. They also require a reliable watermark column. CloudTrail's eventTime works because AWS guarantees it; a log source where timestamps are unreliable will silently drop late-arriving events that show up after the next dbt run. And schema changes typically force a full refresh, which means the historical compute bill comes due once every few months. None of this is a dealbreaker, but it's the kind of caveat you need to know going in.

One forward-looking note, labeled as such: Iceberg V3's row lineage opens a path to incremental models that don't need a watermark column at all, because the dbt model could query by row-lineage timestamp instead of by eventTime, which is robust to late-arriving data in a way the watermark approach is not. This is Tier D evidence: the dbt-iceberg adapter support for row-lineage-based incremental hasn't shipped as of early 2026, and the V3 features themselves are still rolling out across query engines through the year. Worth tracking if Iceberg is your table format. The feature mechanics are in the Iceberg V3 thesis-shift piece.

Pattern 2

Macros for the OCSF migration that would otherwise take a week.

A macro is a reusable piece of SQL, the analog of a function in any other programming language. The canonical example for security data is user-identity normalization. CloudTrail gives you an IAM ARN. Azure AD gives you a user principal name with an @onmicrosoft.com suffix. Windows Security events give you DOMAIN\\username. A macro that handles all three looks like:

-- macros/ocsf_normalize_user.sql
{% macro ocsf_normalize_user(user_field) %}
CASE
  WHEN {{ user_field }} LIKE 'arn:aws:iam%'
    THEN REGEXP_EXTRACT({{ user_field }}, 'user/(.+)')
  WHEN {{ user_field }} LIKE '%@%onmicrosoft.com'
    THEN SPLIT_PART({{ user_field }}, '@', 1)
  WHEN {{ user_field }} LIKE '%\\%'
    THEN SPLIT_PART({{ user_field }}, '\\', 2)
  ELSE {{ user_field }}
END
{% endmacro %}

Each downstream model calls the macro instead of repeating the CASE statement: {{ ocsf_normalize_user('userPrincipalName') }} AS actor_user_name. When OCSF v1.4 ships and the user-name field gets renamed or its format expectations change, you edit the macro once. Every model that uses it inherits the change on the next dbt run.

The migration story this enables is straightforward: an OCSF schema bump that would have taken a week of hand-editing eighty SQL files becomes a half-hour of macro edits, a CI run, and a deploy. I don't want to oversell the number; there's usually some logic that varies by log source and can't be macro-ized, so a real migration is partly macro work and partly per-model work. But the ratio of per-source edits to shared edits shifts substantially. In my experience the edits-per-migration count drops by roughly an order of magnitude.

The other thing macros do is enforce consistency. If three different models had three slightly different ways of parsing the same field, collapsing them into a single macro forces the team to pick one canonical interpretation, an alignment exercise that often turns up bugs nobody had noticed.

Pattern 3

Tests as detection assertions.

This is the pattern I think detection engineers will find most directly useful, because it maps onto a problem they already have: detection rules that fail silently when upstream data shifts. The textbook example: a CloudTrail field gets renamed, the rule's WHERE clause matches nothing, the rule produces zero alerts forever, and nobody notices until the postmortem on an incident that should have fired.

dbt tests are written in YAML for the simple cases and in SQL for the complex ones. A YAML test declaration looks like this:

# models/normalized/schema.yml
version: 2
models:
  - name: cloudtrail_ocsf
    columns:
      - name: time
        tests:
          - not_null
      - name: class_uid
        tests:
          - not_null
          - accepted_values:
              values: ['3005']
      - name: actor_user_uid
        tests:
          - not_null

Running dbt test against this schema produces pass-or-fail output for each assertion. If the upstream CloudTrail schema changes and userIdentity.principalId is suddenly null on a meaningful share of rows, the not_null test fails before any detection rule runs against the broken data. In a CI-gated workflow, the deploy blocks. In a scheduled-run workflow, the alert goes to the detection engineer rather than going silent.

Custom tests are SQL files in tests/ that return zero rows on success. The detection analog is direct: a custom test can encode "this detection rule should never fire below threshold X on known-good test accounts" or "yesterday's CloudTrail volume should not have dropped more than 50% from the trailing seven-day average." Both are assertions about data shape that you'd otherwise have to check manually, and both fail loudly when the upstream pipeline breaks rather than letting the failure go unnoticed until someone goes looking.

The freshness check is a special case worth calling out. Source freshness is a built-in dbt feature that runs SELECT MAX(loaded_at) FROM source_table and warns or errors if the answer is older than a configured threshold. "Warn if CloudTrail is more than 2 hours stale, error if more than 6 hours" turns into three lines of YAML. The threat-hunter complaint that opens half of the incident retrospectives I've seen ("I queried CloudTrail and got zero results; was my query wrong or was the data missing?") has a structural answer that doesn't require a human to remember to check.

Pattern 4

Exposures as the detection-as-code linkage.

Exposures are the dbt primitive I think most detection teams don't realize they want. An exposure is a declaration, in YAML, that says "this downstream thing (a dashboard, a detection rule, a notebook, an API consumer) depends on these models." It doesn't change how dbt builds anything. What it does is wire downstream consumers into dbt's lineage graph.

# models/detection/exposures.yml
version: 2
exposures:
  - name: credential_stuffing_detection
    type: application
    owner:
      name: SOC Detection Team
    description: >
      Detects credential stuffing patterns in authentication logs.
      Threshold: >10 failed attempts from 3+ distinct IPs in 1 hour.
    depends_on:
      - ref('authentication_logs_ocsf')
      - ref('threat_intel_ips')

Why this matters: when a detection engineer changes the authentication_logs_ocsf model, dbt docs generate produces a lineage graph that includes the credential_stuffing_detection exposure as a downstream consumer. The change-impact question ("what detection rules will I affect if I rename this field?") has a graph answer rather than a tribal-knowledge answer. New team members can read the graph and see what depends on what.

The detection-as-code framing this enables is straightforward: the detection rules themselves can also live as dbt models, with their own tests, and the exposure links them to the upstream normalized tables. A rule becomes a tested, version-controlled SQL artifact with declared inputs and a visible place in the lineage graph. That's the LinkedIn Moonbase model, expressed as ordinary dbt rather than as custom platform code.

I want to be careful here about the limits. Detection rules implemented as dbt models run on dbt's schedule, typically every fifteen minutes for hourly builds, or every hour for daily ones, which is fine for threat-hunting analytics and post-event review but not fine for real-time alerting, so nobody should treat dbt as a replacement for stream-processing tools like Cribl or Tenzir when latency matters. The pattern works for the slow-tier detections (beaconing, anomalous access, long-window behavioral patterns) and you keep your streaming pipeline for the fast tier, so the two run alongside each other rather than competing for the same job.

Engine layer

dbt is the build tool; Spark or Snowflake does the work.

A point of confusion I see often: dbt is not the engine. It compiles your SQL and submits it to whatever engine your profiles.yml points at. For a security data lake on Iceberg, that engine is most often Apache Spark, which handles the ACID writes, partition management, and distributed execution, while dbt's job is to manage the SQL, the test suite, the dependency order, and the documentation without ever moving the data itself.

The practical implication is that the choice of engine matters as much as the choice of dbt. If your team is on Databricks, the dbt-databricks adapter ships Spark execution against Delta or Iceberg tables under Unity Catalog. If you're on Snowflake, dbt-snowflake handles native Snowflake SQL. For a self-managed Iceberg lakehouse, dbt-spark with Iceberg configuration is the typical path. For a ClickHouse-fronted analytical layer, the community-maintained dbt-clickhouse adapter exists but is less mature. I'd treat it as usable but not as battle-tested as dbt-spark or dbt-snowflake.

A typical Iceberg incremental config in dbt looks like:

{{ config(
  materialized='incremental',
  file_format='iceberg',
  partition_by=['date(time)'],
  incremental_strategy='merge',
  unique_key='metadata_uid'
) }}

The incremental_strategy='merge' path is the one that benefits most from Iceberg V3's deletion vectors. Under V2, merge-strategy incremental models accumulated delete files that needed periodic compaction, which is one of the honest reasons dbt teams sometimes avoided merge-strategy on Iceberg, and V3 turns those point updates into metadata operations so the compaction tax mostly disappears. As with everything V3-related, engine support is rolling out through 2026, so check your Spark or Trino version before assuming the simpler operational profile. The deletion-vector mechanics are covered in the Iceberg V3 thesis-shift piece.

Where dbt fits in the schema debate

The transformation layer in a schema-on-read world.

The longer thesis I work with (covered separately in schema-on-read versus schema-on-write) is that security data benefits from landing in the lake in its native shape and being transformed into analyst-ready forms on demand. dbt is the practical answer to "and how do you actually run those transformations." Raw logs land in S3 (or equivalent object storage) in their native JSON or Parquet form, registered as Iceberg tables. dbt models produce the OCSF-normalized views downstream. Detection rules query those views. Threat hunters query whichever layer makes sense for their question.

This is the opposite of a schema-on-write SIEM, where the parsing rules are applied at ingest and the original event is often discarded, because the dbt model is repeatable, testable, and reversible, so if the normalization logic was wrong, you re-run against the raw data with corrected logic, and the downstream consumers see the fix on the next build, which you can't do when the SIEM threw away the raw event at ingest.

The composition point is where dbt earns its keep as a foundational piece, because Iceberg holds the data while Spark or Snowflake executes the SQL and OCSF defines the target schema, and dbt is the build tool that ties the three together into a reproducible, tested pipeline rather than a folder of ad-hoc SQL files, so none of the four pieces does the others' job and the value is in how they compose.

Production references

Who's actually done this, and what's still missing.

I don't have a SOC team blog post titled "How we use dbt for OCSF normalization" to cite as Tier A validation. What I have is three Tier B references that make the case directionally.

SnowAlert. Snowflake's open-source security analytics framework, active from 2018 to 2024, used dbt to manage detection rules as version-controlled SQL views. Detection rules lived in a snowalert.rules schema, with pre- and post-hooks for ingest automation, and the full dbt machinery (tests, dependency graph, modular SQL) applied to security detections rather than business intelligence. Snowflake deprecated SnowAlert in 2024 when they built first-party Alerts and Notifications, which I read as productization rather than abandonment, because Snowflake saw enough demand for detection-as-code to bring it into the platform, so the pattern was validated even though the specific framework was superseded.

LinkedIn Moonbase. LinkedIn rebuilt its threat detection infrastructure between 2020 and 2022 using "engineering-first" practices: version control for detection rules, CI/CD before deployment, modular code, automated testing. The LinkedIn Engineering blog write-up reads as a parallel to dbt's design without naming dbt. Data collection expanded roughly tenfold (gigabytes to petabytes) and triage time dropped from hours to minutes, and the gap LinkedIn filled with custom tooling is the same gap I'd argue dbt fills off the shelf.

Databricks and Snowflake security architectures. Both vendors document dbt as a recommended transformation layer in their security data lake reference architectures. The dbt-databricks and dbt-snowflake adapters are first-party and well-supported. The case studies published by the vendors mention dbt for security log processing, usually anonymized. This is Tier B (vendor documentation and anonymized case studies) rather than Tier A (named production deployment with disclosed methodology).

What's missing: a named SOC team writing publicly about how they manage their OCSF normalization and detection-as-code with dbt. I'd treat that gap as an opportunity for the team that's willing to write the blog post, not as a reason to wait. The structural argument (that detection engineers have the same software-engineering problems analytics engineers solved with dbt a decade ago) is strong enough that I'd start now and build the case study as you go.

Cost and adoption

What it actually costs to adopt.

dbt Core is open source under Apache 2.0. You install it with pip install dbt-core plus whichever adapter matches your warehouse: dbt-spark, dbt-snowflake, dbt-databricks, dbt-bigquery. The Core install is free, runs on a laptop or a CI runner, and produces the same compiled SQL that the paid product produces.

dbt Cloud is the managed offering from dbt Labs: browser-based IDE, scheduled job runs, a web-served lineage graph, enterprise governance. Pricing has shifted multiple times since 2023, so I'd consider any specific dollar figure stale by the time you read this; check the dbt Labs site for current numbers. Cloud is useful for teams without existing CI/CD infrastructure, but it is not required. A team running CI on GitHub Actions or GitLab can self-host every Cloud capability except the IDE.

The learning curve, in my experience, is moderate for someone fluent in SQL and Git. The dbt official tutorial takes a few hours. Building your first three or four models (pick the easiest log source first, usually CloudTrail or Azure AD sign-in logs) takes another five to ten hours including writing the tests. Productive independent use lands somewhere around the ten-to-fifteen-hour mark, and the skill gap that slows people down most often isn't SQL but the version-control discipline (branches, pull requests, code review) that dbt assumes you already have, so a team that's been writing SQL into a shared drive without Git will spend more time on the Git muscle memory than on dbt itself.

My recommendation for the first three months: install dbt Core. Pick one log source. Write a staging model that flattens the raw JSON. Write a normalized model that produces the OCSF table. Write five tests: not_null on the required OCSF fields, accepted_values on the class_uid, a custom volume-anomaly test. Run dbt docs serve and click through the lineage graph. That's enough to know whether the pattern fits your team. If it does, the next ninety days are about expanding the model count and converting your detection rules into dbt-managed artifacts with exposures linking them to upstream.

What dbt isn't

The honest limits.

The gaps and constraints worth knowing before adoption, in roughly the order they tend to surface:

  • Real-time alerting is out of scope. dbt's smallest meaningful unit of work is a scheduled run, which lands at the fifteen-minute-to-hourly cadence in practice. Stream-processing tools like Cribl or Tenzir do the sub-second detection work. Use both, not one.
  • Complex parsing belongs in Python. dbt is SQL-first. If your transformation needs non-trivial JSON parsing, ML feature engineering, or external API enrichment, the right tool is Python: either as a pre-step (a Python script that lands cleaned data in the lake, which dbt then transforms) or as a dbt model in Python (the dbt-python feature, available in dbt-spark and dbt-databricks, less mature elsewhere). The split is per-step, not per-pipeline.
  • Orchestration is separate. dbt schedules dbt runs. It does not orchestrate upstream ingest, downstream alerts, or dependencies across different systems. Airflow, Dagster, or Prefect (or dbt Cloud's scheduler if you're paying for it) handle the orchestration layer.
  • Some adapters are less mature than others. dbt-spark, dbt-snowflake, dbt-databricks, and dbt-bigquery are first-party and well-tested. dbt-clickhouse and dbt-duckdb are community-maintained and usable but rougher around the edges. If your stack is built around a less-common engine, expect to find more sharp edges than the marketing suggests.
  • Tests are assertions, not guarantees. A dbt test that passes today on yesterday's data does not guarantee tomorrow's data passes. The test is a structural check at build time. You still need monitoring on the runs themselves, not just confidence that the SQL compiled.

None of these are reasons to skip dbt, but they're reasons to set expectations correctly and to build the surrounding infrastructure (orchestration, streaming, monitoring) in parallel rather than assuming dbt covers all of it.

Conclusion

The detection-as-code argument is the dbt argument.

Detection engineers want their rules to be version-controlled, tested before deployment, traceable to their data sources, and documented enough that the next analyst doesn't have to start from scratch. Analytics engineers solved the analogous set of problems for business intelligence between 2016 and 2020, and the tool they consolidated around was dbt. The reason the same tool hasn't been picked up by security teams is mostly marketing: dbt Labs sells to analytics engineers, the dbt documentation uses ecommerce examples, and security teams don't see themselves in the framing.

The fit is good, because incremental models cut the daily compute bill on high-volume logs by something like an order of magnitude or two, and macros collapse OCSF schema migrations from a week of hand-editing eighty files to half a day of macro edits plus CI, while tests catch upstream schema drift before detection rules go silent and exposures wire detection rules into a lineage graph so the change-impact question has a graph answer rather than a tribal-knowledge one.

None of this is novel, because it's the same software-engineering discipline that working teams have used for a decade, applied to a domain that hasn't adopted it yet. My recommendation: pick one log source, install dbt Core, build a working pipeline including tests and one exposure, and use that as the evaluation, so the decision after ninety days rests on what your team experienced rather than on this essay. The structural argument is strong, but the proof is whether your detection engineers find it easier to ship and harder to break things, and you only get that proof by running it.