Writing
Essays from a practitioner.
Essays where the analysis is more prescriptive than the research surface, or where the topic doesn't yet have enough evidence to anchor as a tracked hypothesis. The voice is the same; the framing is essayistic. Updated as the work warrants, not on a schedule. These essays are the reasoning behind the Capability Matrix scores, and the head-to-head benchmark evidence they cite lives in the Lab.
Organized by pillar. The ordering below reads top-to-bottom as the dependency stack: the foundations (lakehouse formats, catalogs), then the OCSF and Sigma standards, then the engines and pipelines that sit on top of them, then the detection and migration practices that consume them, and finally the economics and vendor-watch layer that frames the whole.
Reading paths
If you read three, read these.
The full essay collection is a lot to land on cold. Each path is a three-essay arc through one question, cross-pillar, in order.
Leaving Splunk without breaking detections
The cost case, the migration trap most teams walk into, and what the timeline actually costs.
Whether you can trust your data
The quietest failures in security data — the parsing layer, and the measurement problem underneath it.
Picking the query engine
Where each engine wins, from petabyte-scale detection down to an analyst's laptop.
When hunting becomes data science
The path threat hunters are already on, made reproducible.
Pillar · 12 essays
Lakehouse foundations.
Open table formats and the interop layer beneath them. Iceberg, Delta, V3/V4 features, and the Arrow standards that make engine portability real.
-
Iceberg V3 changed the security lakehouse thesis.
Earlier essays argued Iceberg over Delta on vendor-neutrality. V3's Puffin deletion vectors, Variant type, row-level lineage, and default values, plus the V4 proposals, change the trade-off space. Some prior recommendations need revision; others are reinforced.
Read →
-
The encoder is the read lever, not the table format.
Register byte-identical Parquet into both an Iceberg and a DuckLake catalog, read both with DuckDB, and at a billion rows three of four queries come back at parity. The read-speed difference people attribute to the format is the Parquet writer: at a matched codec PyArrow writes 193 MB where DuckDB writes 114 MB on the same data. Pick the format on the write path, not a read benchmark.
Read →
-
Same codec, different sizes.
Two Parquet writers told to use the same codec at the same level produce different file sizes on identical data, because compression is a set of encoding choices the writer makes, not one knob. The 'Iceberg is more storage-efficient' claim turns out to be a ZSTD-vs-Snappy default, and at a matched codec the DuckLake-written files were smaller. Only registering identical bytes removes the writer variable.
Read →
-
The write pattern is the architectural decision.
Iceberg and DuckLake are nearly read-neutral on identical data but diverge sharply on how data enters. Over a 10-to-200-file ladder Iceberg's planning grows 17.6x while DuckLake's SQL-catalog stays flat; on tiny streaming commits DuckLake's inlining runs 2-4x faster with zero data files. Map the format to the write contract: streaming hot tier versus bulk and forensic cold tier.
Read →
-
Iceberg vs Delta Lake for security data.
Production evidence from Netflix (5 PB/day Iceberg), Insider (90% cost reduction), Adobe (5,000+ Delta tables), and InMobi (GDPR/CCPA on Delta) anchors a format decision that governs engine portability, vendor neutrality, and migration cost.
Read →
-
V4 relative paths vs DuckLake's database-metadata.
Different facets of the same problem space, not the same fight. What V4's relative paths solve, where DuckLake's database-as-metadata wins, and which security-data workloads each fits.
Read →
-
Iceberg table maintenance at scale.
Compaction, snapshot expiration, orphan file cleanup. What goes wrong at petabyte scale, what the maintenance budget looks like, and where vendor catalogs help or hurt.
Read →
-
Deletion vectors and GDPR.
Iceberg's right-to-erasure story, honestly. What Puffin deletion vectors actually deliver, where the gaps remain, and what an audit looks like.
Read →
-
Variant type ends the flattening wars.
The Iceberg V3 Variant type is the structural fix for what flattening anti-patterns broke. CloudTrail nested JSON, EDR alerts, anything semi-structured: store it native, query it native.
Read →
-
Row lineage as the missing CDC primitive.
Iceberg V3 row lineage may close the audit-trail gap that has dogged lakehouse detection engineering. What works Spark-side today, what's blocked on pyiceberg + Nessie, what an end-to-end CDC pattern looks like.
Read →
-
Arrow and ADBC: a foundational pillar.
Why Arrow Database Connectivity is the wire-protocol layer that lets you swap engines without rewriting client code. The columnar-throughput case versus JDBC/ODBC.
Read →
-
Arrow Flight and Flight SQL.
The columnar wire protocol for security data. Where Flight SQL fits in a federated stack, what it costs at the analyst's tool layer, and which engines actually speak it.
Read →
Pillar · 4 essays
Catalogs.
Polaris, Unity, Nessie. Governance reach, RBAC depth, meta-catalogs for asset context, and the lock-in surface where most lakehouse buyers underestimate the risk.
-
The catalog became the control plane.
The 2026 consensus is that the catalog is the new control plane and the AI grounding layer. Right for analytics, an open question for security: enforcement is distributed, delegated, beta, and non-portable. With first-party lab evidence on the half nobody scores.
Read →
-
Unity Catalog vs Polaris vs Nessie.
Choosing the catalog for security data. RBAC depth, multi-engine reach, branching semantics, the procurement-defensibility math. With first-party evidence from the Q3 2026 catalog benchmark.
Read →
-
Catalog governance without native support.
When the catalog you picked doesn't carry the governance reach you need. Compensating patterns, what they cost, and which gaps don't close cheaply.
Read →
-
Meta-catalogs and asset context in federated environments.
When a single catalog can't span the estate, the meta-catalog layer becomes load-bearing. Patterns, the asset-context join problem, and what production deployments actually do.
Read →
Pillar · 10 essays
OCSF & schema.
Normalization, mapping, and the anti-patterns from migrations gone sideways. Schema-on-read vs schema-on-write, OCSF reverse mapping, flattening detection logic.
-
Schema-on-read vs schema-on-write.
Splunk at $31K/month for 1 TB/day. Elasticsearch with ECS at $8–12K. Hybrid lakehouse (raw on cheap object storage, OCSF on warm) at $7.5K with parity on detection-engineer workflows. The schema-on-read tax compounds at retention scale.
Read →
-
LLM-assisted OCSF mapping.
What the migration tax actually looks like when you use LLMs to translate vendor schemas to OCSF. Where it accelerates, where it produces silent-loss errors, and how to validate.
Read →
-
OCSF ontological grounding: D3FEND for federal-ready.
OCSF anchored to MITRE D3FEND gives the ontology you need for federal-grade detection. What works today, what's still aspirational, and where the gaps sit.
Read →
-
OCSF reverse mapping.
Answering the legal-team objection. When OCSF normalization erases the original-event fidelity required for chain-of-custody, here's how reverse-mapping preserves the evidentiary record.
Read →
-
OCSF and operational technology.
OCSF has no native fields for Modbus, DNP3, BACnet, or S7comm. The Issue #1515 proposal adds six fields under an ics namespace, anchored on production-validated Zeek-to-OCSF mapping work. The architecture argument for IT/OT convergence at the schema layer.
Read →
-
The field-mapping anti-pattern.
Field-by-field mapping during SIEM-to-lakehouse migration looks like the safe play and drops detection coverage on the way. Five patterns that break and what to do instead.
Read →
-
Flattening away your detection logic.
Migrating from SIEM to lakehouse is semantic translation. Treating it as schema conversion is how detection coverage breaks: flattening CloudTrail's nested JSON silently broke a privilege-escalation detection for six weeks at a financial services firm.
Read →
-
Context collapse, measured on real attack data.
The companion measurement to the flattening essay. Unmodified SigmaHQ rules on real MITRE APT29 telemetry, scored over a coarse store and a faithful one: the adversary-tagged rules lose nearly twice the recall the routine rules do (Δ +0.19), and 9 of 29 go fully blind. The honest part is the gap came out smaller than my own lab testbed had shown.
Read →
-
Six schemas into OCSF: the mapping is the hard part.
Field-level crosswalks of Splunk CIM, Google Chronicle UDM, Microsoft Sentinel ASIM, Elastic ECS, OpenTelemetry, and Zeek into OCSF 1.8.0. The empty cells are the finding: five recurring seams, most of them the standard's own (missing disposition, missing certificate class, an invented-severity contract).
Read →
-
From field mappings to the controls layer.
Up one layer from the schema crosswalks: OCSF class to digital artifact to D3FEND defense to ATT&CK offense to NIST 800-53 / SCF control. Measured hop by hop (79 D3FEND techniques to 402 controls via 606 SKOS edges; 98% of the defensive matrix reaching a governance control through ATT&CK), and honest about the one direct link that does not exist.
Read →
Pillar · 3 essays
Sigma & detection portability.
Sigma 2.0 correlations, pySigma backend reality, and Sigma as the fourth foundational standard alongside Iceberg, Arrow, and OCSF.
-
Why Sigma won the detection-sharing decade.
A decade of attempts to share security use cases — Sysmon configs, hunt notebooks, MITRE CAR, Atomic Red Team, Sigma — and only some endured. The difference wasn't quality; the best Sysmon config froze in 2021. Five structural properties predict what lasts, and Sigma has them by construction.
Read →
-
Sigma and detection portability.
The fourth foundational standard. Why Sigma sits alongside Iceberg, Arrow, and OCSF as the standards that decouple security data from any single vendor's analytics engine.
Read →
-
Sigma 2.0 correlations and the pySigma backend reality.
Sigma 2.0 added correlation semantics that the backends haven't fully caught up to. Which backends ship the new constructs, which don't, and where that breaks in production.
Read →
Pillar · 6 essays
Query engines.
ClickHouse at petabyte scale, DuckDB for analyst hunting, materialized views, push vs pull, dbt as the SQL-transformation layer.
-
One engine in front: StarRocks over shared Iceberg.
The strongest one-engine answer to the question a SOC actually asks: who is going to run four engines? StarRocks alone over the same Iceberg tables covers three of the four analyst-facing workloads from one MySQL-wire endpoint, and the scored join bench (Tier B, single host) prices the giveaways: no SOC-shaped join exceeded 1.5 s on any engine, so the spread that justified four endpoints is single-digit. The Splunk DB Connect leg stays labeled Tier D, untested.
Read →
-
ClickHouse at petabyte scale.
Netflix's 5 PB/day ClickHouse optimization journey — fingerprinting (216 µs → 23 µs), native protocol serialization, tag sharding (3 s → 700 ms). Plus Huntress's 93% cost reduction migrating from Elastic.
Read →
-
DuckDB for analyst-driven threat hunting.
Single-node SQL on Parquet that an analyst can run from a laptop. Where DuckDB fits as the analyst-overlay tier, what it doesn't replace, and how the Lambda-native pattern works for ad-hoc hunts.
Read →
-
Materialized views for security data.
What actually works at petabyte scale. Where MVs are the right answer (recurring dashboards, Sigma correlations), where they fall apart (high-cardinality, schema churn), and which engines deliver the lifecycle math.
Read →
-
Push vs pull query engines.
Vectorized execution models for security analytics. Push engines win on throughput; pull engines win on adaptive replanning. What the difference means for your workload mix.
Read →
-
dbt for security data.
Transformation patterns for detection engineers. The dbt-Sigma-OCSF model layer, where dbt is a clean fit, and where it competes with your pipeline tooling.
Read →
Pillar · 12 essays
Pipelines & streaming.
Cribl, Tenzir, Vector. Kafka, NATS, streaming-database decisions. Where pipeline lock-in moved after the SIEM lock-in eased.
-
Cribl vs Tenzir vs alternatives.
Choosing your security data pipeline. The procurement-evidence-vs-OCSF-fidelity split, where Vector fits as the open-source third option, and what the v1 Capability Matrix puts at #1 across three archetypes.
Read →
-
The pipe layer: what's missing from your AI security platform.
Tenzir as the OCSF-native pipe layer. What it ships today, what the production evidence floor actually is, and where it's the upgrade path from Cribl.
Read →
-
Vector: the data router Datadog open-sourced.
Vector at the Archetype-C #1 spot for cost-and-lock-in-led shops. What VRL fluency costs, where Datadog stewardship raises trust questions, and the OCSF gap.
Read →
-
Pipeline lock-in.
Where switching costs moved next. The SIEM lock-in eased; the pipeline-tooling lock-in took its place. Which patterns reduce it; which vendors aggravate it.
Read →
-
The parsing layer nobody owns.
Security data quality breaks at the boundary between vendors, time handling most of all, and no one in the chain is paid to fix it. A first-hand case from a Palo Alto Splunk-app pull request, Zeek timestamps, and Tenable's nested data, and the argument for a fair broker.
Read →
-
Pipeline-based detection in stream processing.
Detection at the pipeline tier, before the lake. When it's appropriate, what it costs in operational complexity, and the trade-off against retroactive lake-side detection.
Read →
-
Observability pipelines and the security overlap.
Datadog OP, Edge Delta, and the boundary question. Where observability pipeline tooling does the security job credibly, and where the security workload still needs purpose-built tools.
Read →
-
ETL vs ELT for security data.
Who owns the schema, and when. The shift from upstream-normalized ETL to downstream-normalized ELT, what it costs in storage, and what it earns in flexibility.
Read →
-
Kafka architecture deep-dive.
Kafka as the security-data stream bus. Partition design, retention, the broker-replication math, and where it leaks operational complexity at scale.
Read →
-
Kafka to Iceberg: the integration hidden costs.
Streaming Kafka into Iceberg looks straightforward and isn't. Connector quality, schema evolution under load, exactly-once semantics, and the small-files problem.
Read →
-
The streaming database decision.
Materialize, RisingWave, Flink SQL. Where streaming databases earn their keep for security workloads and where the batch lakehouse is still the right call.
Read →
-
NATS JetStream: lightweight Kafka alternative, disqualified.
An honest disqualification. NATS JetStream looks like a Kafka simplification for security workloads; the durability and retention story disqualifies it. What it's actually good for instead.
Read →
Pillar · 17 essays
Detection & hunting.
Detection-engineering maturity ladder, MLOps for hunters, latency tiers, feature stores, PEAK methodology on a modern data stack.
-
The ground you're already standing on.
Detection engineers are doing data modeling every day without naming it, and a rule can compile clean and quietly never fire because one of those mappings was wrong with nothing set to catch it. The vocabulary, the silent-failure mode, and a deductive check that caught 100% of injected type-crossings on a 925-row corpus.
Read →
-
The assurance gap no single tool closes.
No single security tool knows your estate: on a planted 140,000-cell benchmark the best single tool recovers 47.7% of the truth and a freshness-scored cross-tool merge recovers 75.6%, while a 24.4% residual stays dark no matter how many tools you own. The transferable claim is the ordering, which holds across the whole parameter sweep, plus the entity-resolution tax a contested join key adds.
Read →
-
Detecting the OT you can't parse.
Most industrial protocols will never get a deep parser, so the question that decides whether you can monitor an OT network is how much you can detect from behavior — timing, flow, who talks to whom — before a parser ever runs. Why behavioral-first gives immediate east-west coverage across the long tail, why it fits the NERC CIP-015-1 internal-monitoring mandate, where it stops, and the falsifier that would change my mind. A practitioner argument, Tier B, not a benchmark.
Read →
-
What your data means vs what shape it is.
Schema conformance checks the shape of a field, not its meaning, so a mapping can fit OCSF exactly and still mean the wrong thing. Eight real meaning-crossings from a six-schema crosswalk corpus, made visible by a deductive check the syntax pass waves through.
Read →
-
Catching the mistake that kills a detection.
The wrong field mapping that silently kills a detection is catchable with a reasoner and a few disjointness assertions D3FEND never shipped. How to run that check yourself against your own mappings, and the honest limits of what it catches.
Read →
-
The query engine returned the wrong answer and didn't tell you.
Over an open format the engine is supposed to be interchangeable. In my lab one returned a filtered count tens of rows short of the others on the identical Parquet, no error raised, and a timing-only benchmark would have published it as a win. Why a cross-engine answer-equality gate is the only thing that caught it.
Read →
-
The better the model, the quieter the wrong answer.
I answered one OCSF question battery six ways across three model tiers. On compute-over-population questions no LLM-authored arm was ever correct, and the more capable model made the wrong answer quieter rather than rarer; an execution loop and self-consistency both failed the same way. Safety lived in the layer that refuses a query it can't answer, not in the model.
Read →
-
Parquet doesn't hash the way your security tools assume.
Chain-of-custody, WORM, and dedup workflows key on a file hash, but Parquet isn't byte-reproducible by default: parallel row-group order makes the same logical data produce a different SHA-256 every write. A faithful re-export breaks the hash and looks like tampering. Force determinism (threads=1 or ORDER BY, ~20% smaller too), and hash the logical content, not the bytes.
Read →
-
Who actually does the hunting.
The analyst reverse-engineering a broken field mapping at 2am is doing applied ontology, and naming the work hands them open tooling and a community that's been building the same map from the other side.
Read →
-
The tools you can use today.
A practitioner's map of the open detection-grounding stack: OCSF, D3FEND, Sigma, the ROBOT/ELK reasoner toolchain, and the six-schema crosswalks, with an honest read of what each piece is for and exactly where it's thin.
Read →
-
The detection engineering maturity ladder.
From ad-hoc rules to coded detections to detection-as-code with CI. What each rung costs, what becomes possible, and where most security teams stall.
Read →
-
PEAK and the lakehouse.
How modern data stacks enable threat hunting. The Splunk PEAK framework ported to lakehouse infrastructure, what works, and what doesn't.
Read →
-
Three latency tiers: detection, hunting, analysis.
Sub-second for detection, sub-minute for hunting, sub-hour for analysis. The architecture that serves all three honestly, and the trade-offs at each tier.
Read →
-
MLOps tools for threat hunters.
Reproducibility for threat-hunt notebooks. What MLOps tooling brings to hunters that they don't already have, and where the friction lives.
Read →
-
Jupyter to MLflow for reproducible threat hunting.
From notebook-driven hunting to versioned, reproducible artifacts. The pattern that turns a hunt into a tracked experiment.
Read →
-
Where detection-as-code notebooks should live.
The notebook is how security analytics gets shared, and the live decision is the proprietary-but-production Databricks notebook versus the open, reproducible marimo .py. Why the lock-in moved up to the authoring layer, what marimo provably fixes, and why the detection you ship is text either way.
Read →
-
Feature stores for security.
When Feast and Kedro pay off, and when they don't. The case for feature stores in security ML workloads, and the cases where they add ceremony without value.
Read →
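The Parquet hashing essay listed above recommends hashing the logical content rather than the file bytes. A minimal sketch of that idea in plain Python, with no Parquet library involved: rows are modeled as dictionaries, and `canonical_row` and `logical_hash` are hypothetical helper names, not anything taken from the essay. It assumes rows are JSON-serializable and that row order carries no meaning.

```python
import hashlib
import json


def canonical_row(row: dict) -> bytes:
    """Serialize one row deterministically: sorted keys, fixed separators."""
    return json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")


def logical_hash(rows: list[dict]) -> str:
    """SHA-256 over the sorted canonical rows, so write order does not matter."""
    digest = hashlib.sha256()
    for encoded in sorted(canonical_row(r) for r in rows):
        # Length-prefix each row so row boundaries cannot be confused.
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()


# Illustrative events only; not data from any essay.
events = [
    {"time": 1700000000, "user": "alice", "action": "login"},
    {"time": 1700000005, "user": "bob", "action": "logout"},
]
reordered = list(reversed(events))  # same logical data, different physical order
tampered = [dict(events[0], user="mallory"), events[1]]

same = logical_hash(events) == logical_hash(reordered)
changed = logical_hash(events) != logical_hash(tampered)
print(same, changed)
```

Sorting the canonical rows before hashing is what makes a faithful re-export compare equal while a changed field still changes the digest. A real table would also need a rule for floats, nulls, and timestamps, which this sketch leaves out.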
Pillar · 4 essays
Migration & federation.
Migrating 800 detection rules across seven parallel ingest buses. Hidden costs. The federated rollout playbook. Splunk Federated Search as bridge or lock-in extension.
-
Migration: hidden costs and timeline reality.
The $300K project that became $1.2M. 67% of security data platform migrations require external consulting; actual costs run 40–100% above technology-only estimates. With a phased funding playbook your CFO will approve.
Read →
-
Migrating 800 detection rules across seven parallel ingest buses.
The conversion playbook from one detection rule corpus to another while seven ingest pipelines run in parallel. What automation closes, what it doesn't, and how to phase the cutover.
Read →
-
The federated rollout playbook.
Federated detection at scale across business units and clouds. The ordering that works, the common mistakes that compound, and the governance discipline that keeps it coherent.
Read →
-
Splunk Federated Search: bridge or lock-in extension?
Splunk's federated-search story positioned as the bridge to a lakehouse. What it actually does, where it's a credible bridge, and where it's a lock-in extension.
Read →
Pillar · 9 essays
Economics & measurement.
The cost optimization paradox in security data, the storage-media economics under the bill, and the cloud-versus-on-prem case for security telemetry. Why vendor benchmarks are the only benchmarks, and what to do about it.
-
The write endurance security data never spends.
Drive media is the majority of a security data platform's bill, which makes the NVMe endurance tier a first-order cost decision. The contrarian read: mixed-use and write-intensive drives are sold at a premium a write-once-read-rarely security lake almost never consumes, and the industry stopped publishing the data that would let you check.
Read →
-
The index pays twice.
A hot search index stores 4.2× the bytes at 3.5× the price per byte, so it costs about 14.8× a warm Iceberg-on-S3 lakehouse for the same events. At thirty days the gap is a rounding error; at the seven-year retention horizon a regulated firm has to plan for, it is the difference between a line item and a project. A measured storage floor, Tier B and first-party single-host, explicitly not a TCO model.
Read →
-
How to run a benchmark that doesn't lie to you.
Rule zero, learned the hard way: verify the answer before you trust the clock, because one engine returned a filtered count tens of rows short over byte-identical Parquet and a timing-only run would have published it as a win. Then the rest of the method — report the CV, scale until the signal clears the noise, register identical bytes, isolate the run, control the power plan, hash logical rows not file bytes.
Read →
-
Scale before you measure.
Most security-tool micro-benchmarks run at a scale where the result is inside the noise. The same workload showed a 55% coefficient of variation at small scale where queries finished in milliseconds, collapsing to ~4% only at a hundred million rows. Find your noise floor first, report the CV with every number, and discount any gap smaller than the run-to-run variation.
Read →
-
The repatriation case for security data.
Cloud repatriation is well-covered macro ground. The narrower, stronger claim: security telemetry is the textbook workload to bring home — steady-state, write-heavy, retained for years, increasingly bound by sovereignty rules — exactly the profile cloud's variable pricing overcharges. The cheap, correctly-specced on-prem media is what closes the math.
Read →
-
The storage is moving, the engine isn't (yet).
The loud off-Splunk story says the open lakehouse is replacing the SIEM analytics engine at scale, now. The quieter, better-sourced read: storage and pipeline are moving fast and measurably while the engine stays put, and the gap between the two is where a buyer gets oversold.
Read →
-
The "This Changes Everything" Index.
I tried to measure the AI hype cycle in Google Trends and the instrument kept breaking. The way it broke was the finding: a measurement goes blind in precisely the direction the technology is moving, and the blind spot itself is the reading. The same instrument-skepticism security telemetry demands.
Read →
-
Why vendor benchmarks are the only benchmarks.
Most enterprise security data platforms restrict their customers from running competitive performance tests. The implication: every published 'X% faster' benchmark is vendor-funded by structural design. The fix is a different distribution model.
Read →
-
The cost optimization paradox in security data.
Cost optimization in security data often makes the overall security posture worse. The patterns where tighter budgets degrade detection coverage, and the patterns that release headroom.
Read →
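Two of the benchmarking rules in the essays above, verify the answer before trusting the clock and discount any gap smaller than the run-to-run variation, can be sketched with the standard library. The function names, the sample timings, and the decision rule (relative gap in means must exceed the larger coefficient of variation) are assumptions made for illustration, a simplified reading of those blurbs rather than the essays' actual method.

```python
import statistics


def coefficient_of_variation(samples: list[float]) -> float:
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


def answers_agree(results: dict[str, int]) -> bool:
    """Answer-equality gate: every engine must return the same value."""
    return len(set(results.values())) == 1


def gap_clears_noise(a: list[float], b: list[float]) -> bool:
    """True only if the relative gap in means exceeds the larger CV."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    relative_gap = abs(mean_a - mean_b) / max(mean_a, mean_b)
    noise = max(coefficient_of_variation(a), coefficient_of_variation(b))
    return relative_gap > noise


# Hypothetical filtered counts from three engines over the same files.
counts = {"engine_a": 1000, "engine_b": 1000, "engine_c": 962}

# Hypothetical wall-clock timings in seconds, five runs each.
timings_a = [10.0, 10.4, 9.6, 10.2, 9.8]
timings_b = [10.3, 10.1, 10.5, 9.9, 10.2]
timings_c = [5.0, 5.1, 4.9, 5.0, 5.0]

if not answers_agree(counts):
    print("answers differ; timings are not comparable yet")
print(f"CV of engine_a runs: {coefficient_of_variation(timings_a):.3f}")
print("a vs b clears noise:", gap_clears_noise(timings_a, timings_b))
print("a vs c clears noise:", gap_clears_noise(timings_a, timings_c))
```

Running the gate first means a short count never reaches the timing comparison, and reporting the CV next to each mean lets a reader judge whether a published gap is signal or noise.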
Pillar · 4 essays
AI, automation & vendor watch.
Emerging analysis, tracked to read direction rather than claimed as core thesis: the NANDA agent-identity question, RAPTOR and the duct-tape era of agentic security, MCP beyond chat, and vendor watch on Databricks Lakewatch and the SDPP cohort. The security-data measurements are the anchor; the broad-AI framing is here to map where the field is heading.
-
The security market has no one defining what you can own.
Vendors could ship security tools you can run, inspect, and air-gap, but nothing makes them, and no independent force scores whether a given tool lets you own it. The agentic rush is widening the gap. Sigma is the proof the open pattern wins when it has a champion; vendor MCP servers are the tell; the missing piece is a referee with a practitioner-ownability axis.
Read →
-
The Gatsby Summer of AI.
AI maturity read through early-automotive history: past the horseless-carriage stage, building AI-native, but in a chaotic pre-seatbelt era where capability outran the safety infrastructure. The bill the glamour hides is measured, not felt: NL2KQL runs clean 97–99% of the time and returns the correct result set only about 58% of the time.
Read →
-
Agentic analysis reinvented Kimball. It skipped the measurement.
The 2026 consensus that AI agents need dimensional modeling, semantic layers, and governed ontologies is mostly right — and quiet about the one step that matters when the data is evidence. Autonomously-built mappings fail sound-but-wrong and silently; the answer is a deductive check that fails on a mapping that looks correct, model-independent, where a human review pass structurally can't. The AI-generated-OCSF-parser claim is the cleanest worked example: transformative only at production-grade accuracy.
Read →
-
What's real vs marketed in the agentic SOC.
Claims about agentic security-data deserve a definition demand and an evidence tier, not a headline. The honest practitioner ceiling sits at 30-40% end-to-end, not the 90%+ the slides claim; the binding constraint is cross-vendor identity and trust (MIT's NANDA), not the automation percentage; the pipeline layer is where the migration risk consolidates. The durable fair-broker read on what ships, what doesn't, and what to plan for.
Read →
Notifications
Get a note when a new essay or benchmark publishes.
Low-volume. Essays as they ship; quarterly benchmark reports; nothing else. No drip campaigns.
The hypothesis-grounded work (ten anchor hypotheses with evidence tiers, twenty-two contradictions tracked over time, and the method-in-practice essay) lives on the research page. The program POV that connects them is on the thesis page.