Security Data Works

Implementation anti-patterns

Flattening away your detection logic.

When migrating from SIEM to lakehouse, teams flatten nested JSON to optimize columnar storage. In the process they lose semantic relationships — and detections silently break. This isn't schema conversion. It's semantic translation.

Reading time: 17 minutes. Evidence tier: B–C (architectural analysis plus practitioner patterns; production case studies pending).

V3 update first

Iceberg V3's Variant type changes the fix.

This essay argues that flattening CloudTrail's nested JSON loses critical semantic relationships — most notably, the difference between "field absent" and "field is NULL" for mfaAuthenticated, which silently broke a privilege-escalation detection. The diagnosis is unchanged: flattening still loses semantics. What changed is the recommended fix.

The original recommendation was ClickHouse Nested types as the primary architectural answer. Apache Iceberg V3 introduces the Variant type (semi-structured / nested JSON, ratified in Parquet October 2025), purpose-built for exactly this scenario: it preserves JSON absence-vs-null semantics natively, supports queryable nested structures inside a columnar table format, and works across multiple query engines.

For new lakehouse deployments in 2026 and beyond, Variant should be the default for security log types where field presence carries information (CloudTrail, Kubernetes audit, Azure Activity, Okta system logs). ClickHouse Nested remains valid for ClickHouse-only stacks; Variant is the cross-engine answer for Iceberg-based deployments. Both preserve the relationships this essay warns about.

Caveat: V3 Variant engine support is still rolling out across Spark, Trino, DuckDB, and Snowflake through 2026. Verify your engine supports variant_get() and equivalent semantics before relying on Variant in production.

The opening case

The migration that broke 23 detections.

A financial services firm migrated 50 TB/day of AWS CloudTrail logs from Splunk to ClickHouse in December 2024. The team celebrated: query performance improved 10×, infrastructure costs dropped 85%, and analyst complaints about slow searches vanished overnight.

Six weeks later, during a routine SOC 2 Type II audit, the auditor asked: "Show me all IAM privilege escalations without MFA for the past 90 days." The detection engineer ran the query. Zero results.

That couldn't be right. The team manually reviewed CloudTrail logs and found 23 privilege escalation events where users attached IAM policies without multi-factor authentication — exactly the scenario the detection rule was supposed to catch.

The detection logic hadn't failed. It had been quietly broken during migration, returning empty results instead of alerts. For six weeks, a critical security control was blind.

What went wrong? The migration team flattened CloudTrail's nested JSON structure to optimize ClickHouse's columnar storage. In the process, they lost a critical semantic relationship: in CloudTrail, the absence of the mfaAuthenticated field means "no MFA," not "NULL." The flattened schema treated absence as NULL, and the detection rule — checking WHERE mfaAuthenticated = false — returned nothing.

Why it matters

The flattening tax.

When security teams migrate from traditional SIEMs (Splunk, Elastic) to modern lakehouses (ClickHouse, Iceberg, DuckDB), they face a fundamental architectural trade-off.

Traditional SIEM: stores logs as nested JSON documents. Relationships preserved — userIdentity.sessionContext.mfaAuthenticated remains hierarchical. Semantic meaning intact: field presence and absence carry information. Detection logic naturally expresses security relationships. Weakness: query performance degrades at scale (billions of events become 30+ second queries).

Modern lakehouse: stores data in columnar format for compression and speed. Flattening common — userIdentity.sessionContext.mfaAuthenticated becomes userIdentity_sessionContext_mfaAuthenticated (flat column). Semantic meaning obscured: NULL vs absent vs false distinctions lost. Strength: 10–100× faster queries, 10–20× better compression. Weakness: complex hierarchical relationships harder to express, easier to break.

The migration from SIEM to lakehouse isn't just schema conversion — it's semantic translation. And most teams don't realize that until detections fail.

Pattern 1

User → session → context → attributes.

Security decisions require context — who did what, under what circumstances. User identity alone isn't enough; you need session context (MFA status, assumed role, source identity).

AWS CloudTrail example:

{
  "eventName": "AttachUserPolicy",
  "userIdentity": {
    "type": "IAMUser",
    "userName": "alice@example.com",
    "sessionContext": {
      "attributes": {
        "mfaAuthenticated": "false",
        "creationDate": "2025-02-10T14:23:01Z"
      },
      "sessionIssuer": {
        "type": "Role",
        "arn": "arn:aws:iam::123456789:role/AdminRole"
      }
    }
  },
  "requestParameters": {
    "userName": "bob",
    "policyArn": "arn:aws:iam::aws:policy/AdministratorAccess"
  }
}

Detection rule: "Alert on IAM privilege escalation (AttachUserPolicy, PutUserPolicy, AddUserToGroup) by users without MFA." Requires traversing userName → sessionContext → attributes → mfaAuthenticated. The critical semantic: the field mfaAuthenticated is only present when MFA was used. Absence equals no MFA.

The naive ClickHouse translation fails silently:

-- Naive ClickHouse translation (FAILS SILENTLY)
SELECT eventTime, userIdentity_userName, eventName, requestParameters_policyArn
FROM cloudtrail_flat
WHERE eventName IN ('AttachUserPolicy', 'PutUserPolicy', 'AddUserToGroup')
  AND userIdentity_sessionContext_attributes_mfaAuthenticated = 'false';  -- Never matches!

When MFA is used the field is present with value "true". When MFA is not used the field is absent from JSON — not present, not "false". The flattened schema converts absent to NULL. The detection query checks = 'false' against a NULL column. Result: zero alerts, silent failure.

The fix: either correct the query semantics or preserve the structure.

-- Approach 1: fix query logic — treat NULL as "no MFA"
SELECT eventTime, userIdentity_userName, eventName, requestParameters_policyArn
FROM cloudtrail_flat
WHERE eventName IN ('AttachUserPolicy', 'PutUserPolicy', 'AddUserToGroup')
  AND (userIdentity_sessionContext_attributes_mfaAuthenticated != 'true'
       OR userIdentity_sessionContext_attributes_mfaAuthenticated IS NULL);

-- Approach 2: use Nested types
CREATE TABLE cloudtrail_nested (
    eventTime DateTime,
    eventName String,
    userIdentity Nested(
        type String,
        userName String,
        sessionContext Nested(
            attributes Nested(
                mfaAuthenticated String,
                creationDate DateTime
            )
        )
    ),
    requestParameters Nested(
        userName String,
        policyArn String
    )
);

Lesson: absence is meaningful in security logs. Flattening must preserve semantic distinctions (NULL vs absent vs false).

Pattern 2

Event → action → target → target attributes.

Security detections often filter on action plus target combination — delete a sensitive S3 bucket, modify an admin user, access a PII database table. Flattening can break the association between action and target attributes.

CloudTrail events have 1–N resources (an S3 event touches the bucket plus the IAM user). Flattening to a single resource_ARN column loses every resource beyond the first. A detection rule checking "sensitive" in ARN misses events where the S3 bucket sits at index 1 instead of index 0.

Approach 1 (pick first resource only) and Approach 2 (fixed-width array of three) both fail:

-- Approach 1: wrong — loses data
CREATE TABLE cloudtrail_flat (
    eventTime DateTime,
    eventName String,
    resource_ARN String,     -- only resources[0].ARN
    resource_type String     -- only resources[0].type
);

-- Approach 2: wrong — brittle
CREATE TABLE cloudtrail_flat_array (
    eventTime DateTime,
    eventName String,
    resource_ARN_0 String,
    resource_ARN_1 String,
    resource_ARN_2 String   -- what if event has 4 resources?
);

The correct approach uses nested arrays:

CREATE TABLE cloudtrail_nested (
    eventTime DateTime,
    eventName String,
    userIdentity_userName String,
    resources Array(Tuple(
        ARN String,
        accountId String,
        type String
    ))
);

-- Query: Alert on DeleteBucket for S3 buckets with "sensitive" in ARN
SELECT eventTime, userIdentity_userName, eventName, resources
FROM cloudtrail_nested
WHERE eventName = 'DeleteBucket'
  AND arrayExists(r -> r.3 = 'AWS::S3::Bucket' AND r.1 LIKE '%sensitive%', resources);

A real-world pattern: a SaaS security team used Approach 1 for "simplicity" after migrating CloudTrail from Elasticsearch to ClickHouse. Detection rule alerting on S3 bucket deletions for buckets with "backup" or "archive" in the name missed 17 deletion events over 4 weeks — the backup buckets were always resources[1], not resources[0]. Discovered when a customer reported missing backups. Four-week blind spot, customer data loss incident, CISO escalation.

Pattern 3

Process → parent → network → destination.

EDR detection logic chains process execution to network activity ("PowerShell downloads script from external IP," "Office app spawns cmd.exe then connects to C2"). Flattening can break the association between a process and its network connections.

Wrong approach: denormalize (one row per network connection):

CREATE TABLE edr_flat (
    eventTime DateTime,
    hostname String,
    process_name String,
    process_pid UInt32,
    process_commandLine String,
    network_destinationIp String,
    network_destinationPort UInt16,
    network_destinationDomain String,
    network_protocol String
);
-- Result: multiple rows per event (one per network connection)

The query "process with HTTP connection to non-CDN" requires aggregation across rows: 45 seconds on ClickHouse versus 3 seconds on Elasticsearch (where the nesting was preserved). PID reuse can mix unrelated processes, generating false negatives.

Correct approach: nested arrays:

CREATE TABLE edr_nested (
    eventTime DateTime,
    hostname String,
    process Nested(
        name String,
        pid UInt32,
        commandLine String,
        parentProcess Nested(
            name String,
            pid UInt32
        )
    ),
    network Array(Tuple(
        direction String,
        protocol String,
        destinationIp String,
        destinationPort UInt16,
        destinationDomain String,
        bytesTransferred UInt64
    ))
);

-- Query: PowerShell with encoded command connecting to non-CDN
SELECT eventTime, hostname,
       process.name[1] as proc_name,
       process.commandLine[1] as cmd,
       arrayFilter(n -> n.2 = 'HTTP' AND n.5 NOT IN ('cloudfront.net', 'akamai.net'), network)
FROM edr_nested
WHERE process.name[1] = 'powershell.exe'
  AND process.commandLine[1] LIKE '%EncodedCommand%'
  AND arrayExists(n -> n.2 = 'HTTP' AND n.5 NOT IN ('cloudfront.net', 'akamai.net', 'fastly.net'), network);

Performance: no aggregation overhead, query executes in under a second on billions of events. Denormalizing arrays for "simplicity" breaks performance at scale. Nested types preserve relationships and perform better.

Pattern 4

Entity resolution across data sources.

Advanced detections correlate user activity across multiple data sources (O365 login + VPN connection + endpoint activity). Entity resolution — mapping the same user across different identifier schemes — requires dimension tables.

O365 sign-in log: userPrincipalName = alice@corp.com. VPN connection log: username = CORP\alice. EDR process execution: userSid = S-1-5-21-...-1105. Detection rule: "Alert on O365 login from VPN IP where endpoint shows suspicious process execution within 5 minutes." Requires correlating all three identifiers as the same person.

The naive approach joins on mismatched identifiers and fails. The correct approach builds a user dimension table:

CREATE TABLE users (
    user_id String PRIMARY KEY,
    upn String,                -- alice@corp.com (O365)
    domain_username String,    -- CORP\alice (VPN)
    sid String,                -- S-1-5-21-...-1105 (EDR)
    department String,
    manager String,
    risk_score UInt8,
    mfa_enrolled Bool
);

-- Fact tables reference user_id
CREATE TABLE o365_logins_normalized (
    eventTime DateTime, user_id String, ipAddress String, appDisplayName String);
CREATE TABLE vpn_logs_normalized (
    eventTime DateTime, user_id String, sourceIp String, assignedIp String);
CREATE TABLE edr_logs_normalized (
    eventTime DateTime, user_id String, hostname String, processName String);

-- Query with entity resolution
SELECT o.eventTime, u.upn, u.department, u.risk_score, v.assignedIp, e.processName
FROM o365_logins_normalized o
JOIN users u ON o.user_id = u.user_id
JOIN vpn_logs_normalized v ON v.user_id = u.user_id AND v.assignedIp = o.ipAddress
JOIN edr_logs_normalized e ON e.user_id = u.user_id
WHERE u.risk_score > 6
  AND e.eventTime BETWEEN o.eventTime AND o.eventTime + INTERVAL 5 MINUTE;

Entity resolution happens once during ingestion, not on every query. The user dimension is small (10K–100K rows), indexed lookups are fast, and 200+ detection rules leverage shared user context. Flattening doesn't directly break cross-source correlation, but the absence of an entity dimension table does. Normalize identifiers at ingest; enrich at query time.

Pattern 5

Temporal chains (event sequences).

Advanced threats follow multi-step patterns. Temporal chains within events (authentication attempts) are broken by denormalizing arrays. Temporal chains across events (lateral movement: RDP → SMB → exec) are not schema problems — they're query-complexity problems.

Azure AD authentication example with detection rule "2+ failed password attempts followed by successful SMS (MFA bypass indicator)":

-- Correct: nested array preserves sequence
CREATE TABLE azuread_auth_nested (
    eventTime DateTime,
    userPrincipalName String,
    appDisplayName String,
    authenticationDetails Array(Tuple(
        authMethod String,
        succeeded Bool,
        attemptTimestamp DateTime
    ))
);

-- Query: 2+ failed password, then successful SMS
SELECT eventTime, userPrincipalName, appDisplayName, authenticationDetails
FROM azuread_auth_nested
WHERE arrayCount(a -> a.1 = 'Password' AND a.2 = false, authenticationDetails) >= 2
  AND arrayExists(a -> a.1 = 'SMS' AND a.2 = true, authenticationDetails);

Denormalizing to one row per attempt loses array ordering — the sequence has to be reconstructed via timestamp grouping, which is fragile. Use nested types for within-event sequences.

Decision framework

When to flatten vs. preserve.

The question isn't "should I flatten?" — it's "which fields can I safely flatten without breaking detection logic?"

Flatten to columns when

  • Field is always present (not optional).
  • 1:1 relationship (one value per event, not an array).
  • No semantic dependency on field presence (NULL = missing data, not "false").
  • Query pattern is simple filters (WHERE field = value).
  • High query frequency (80%+ of queries touch these fields).

Preserve as nested/array when

  • Variable-length arrays (0–N elements).
  • Hierarchical structure with semantic dependencies.
  • Field presence is meaningful (absence ≠ NULL).
  • Detection logic depends on array ordering or grouping.
  • Query pattern is "ANY element in array matches condition."

Use dimension tables when

  • Entity data (users, hosts, services) referenced across many log types.
  • Reusable context (attributes change infrequently).
  • Slowly-changing dimensions (weekly, not per-event).
  • JOIN overhead acceptable (query latency measured in seconds).

Solutions

Architectures that preserve both performance and relationships.

Solution 0: Iceberg V3 Variant type (recommended for new deployments)

For semi-structured logs where field presence carries information — CloudTrail, Kubernetes audit, Azure Activity, Okta system logs — and for multi-engine query environments, this is now the preferred answer. Iceberg V3 (with Parquet's matching Variant type, ratified October 2025) introduces a native semi-structured column type that preserves JSON's absence-vs-null distinction, supports queryable nested structures, and works across all engines that adopt V3 (Spark, Trino, DuckDB, Snowflake — engine support rolling out through 2026). Unlike ClickHouse Nested, Variant is engine-portable.

CREATE TABLE cloudtrail_v3 (
    eventTime TIMESTAMP,
    eventName STRING,
    userIdentity VARIANT,        -- preserves the entire nested userIdentity object
    requestParameters VARIANT,
    sourceIPAddress STRING,
    awsRegion STRING
) USING iceberg
TBLPROPERTIES ('format-version' = '3');

-- Detection rule: still works, semantics preserved
SELECT eventTime,
       variant_get(userIdentity, '$.userName', 'STRING') as userName,
       eventName
FROM cloudtrail_v3
WHERE eventName IN ('AttachUserPolicy', 'PutUserPolicy', 'AddUserToGroup')
  AND variant_get(userIdentity, '$.sessionContext.attributes.mfaAuthenticated', 'STRING') IS NULL;
-- IS NULL correctly captures "field absent = no MFA"

Pros: cross-engine portable; preserves field-absence semantics by construction; no per-engine learning curve; future-proofed. Cons: engine adoption uneven through 2026 (verify your query engine supports variant_get and Variant-aware predicate pushdown); fewer production references than ClickHouse Nested or Hydrolix.

Solution 1: ClickHouse Nested types

Still valid for ClickHouse-only stacks. Relationships preserved, performance nearly as good as flat columns, production-validated at Netflix scale (tag sharding 3s → 700ms). Trade-offs: Nested syntax learning curve, limited JOIN support on arrays, ClickHouse-specific.

CREATE TABLE cloudtrail_nested (
    eventTime DateTime,
    eventName String,
    userIdentity_userName String,
    resources Nested(
        ARN String,
        type String
    )
) ENGINE = MergeTree()
ORDER BY (eventTime, eventName);

SELECT eventTime, userIdentity_userName, eventName
FROM cloudtrail_nested
WHERE eventName = 'DeleteBucket'
  AND arrayExists(r -> r LIKE '%sensitive%', resources.ARN)
  AND arrayExists(t -> t = 'AWS::S3::Bucket', resources.type);

Solution 2: Hybrid architecture (hot flat + cold nested)

For mixed workloads (80% simple aggregations, 20% complex forensics):

Logs → Kafka → Dual-Write
  ├─→ ClickHouse Hot Tier (Flattened, 7–30 days)
  └─→ Iceberg Cold Tier (Nested, 90–365 days)

Query Router:
  - Time < 30 days → ClickHouse (sub-second, flattened)
  - Time > 30 days → Trino + Iceberg (5–10 seconds, nested)

Benefits: performance for recent data, structure for forensics, cost-optimized. Trade-offs: dual maintenance, storage duplication during overlap window.

Validation checklist

Before you flatten.

Schema validation

  • Document all optional fields (presence/absence meaningful vs NULL).
  • Map all array fields.
  • Document hierarchical relationships.
  • Flag fields where NULL ≠ absent ≠ false.

Detection logic validation

  • Export all detection rules from current SIEM.
  • Translate each rule to lakehouse SQL.
  • Test translated rules on sample data.
  • Validate results match SIEM.

Parallel run

  • Dual-write to SIEM and lakehouse for 30+ days.
  • Run detection rules in both systems.
  • Compare alert volume (zero missed detections).
  • Validate alert parity achieved.

Performance vs. correctness

  • Measure query latency: flat vs nested.
  • Decide: is 2–5× slower acceptable to preserve relationships?
  • Priority: correctness over performance. Missed detections are unacceptable.

Takeaways

The flattening tax is real.

Flattening isn't just schema conversion — it's semantic translation. Relationships matter as much as field values.

  1. Absence is meaningful in security logs. MFA field absent equals no MFA, not NULL. Flattened schemas must preserve semantic distinctions.
  2. Arrays are common in security data. CloudTrail resources[], EDR network_connections[], Azure AD authenticationDetails[]. Denormalizing breaks relationships and performance. Use nested types — or Iceberg V3 Variant.
  3. Test correctness, not just performance. Parallel runs must validate alert parity. If SIEM produces 1,250 alerts/week and lakehouse produces 1,248, investigate the missing two. Slow and correct beats fast and wrong.
  4. Hybrid architectures balance trade-offs. Hot flat plus cold nested solves the 80/20 problem. 80% of queries hit recent data with simple filters (sub-second, flattened). 20% hit historical forensics (5–10 seconds acceptable, nested).
  5. Document relationship dependencies before flattening. Map detection logic dependencies first. Schema translation comes second. Teams that focus on schema mapping miss semantic dependencies — and break detections.