Detection engineering

Where detection-as-code notebooks should live.

A notebook is how a hunt, an investigation, or a candidate detection gets shared so a second analyst can re-run it six months later. Security teams have done this for years, so the open question isn't whether to use notebooks, it's which notebook to commit the durable work to. There's a real fork here: the Databricks notebook that works in production today but moves the lock-in up to the authoring layer, and the open, reproducible marimo notebook that fixes the things that make Jupyter untrustworthy in production but has no security ecosystem standing behind it yet. Underneath both, the detection that actually ships is text.

Reading time: about 18 minutes. Evidence tier: B–C overall, with the reproducibility mechanics graded higher. The notebook-mechanics claims rest on Pimentel et al. (MSR 2019, Tier A) and the marimo and Jupyter project docs; the Databricks platform claims are vendor marketing and flagged Tier C inline; the "marimo is better for security work" claim is labelled Tier D, because no security team has published it yet.

The decision under the decision

Choosing a notebook is choosing where lock-in and reproducibility sit.

When a team decides to put detection logic in notebooks, the decision feels like a tooling choice, the kind you make once and stop thinking about. It isn't, because the notebook format determines two things that matter for years: whether the work is reproducible (does the file produce the same answer when someone else runs it next quarter?) and whether it's portable (does the detection logic survive a change of engine, or is it welded to one vendor's runtime?). Those are architecture questions wearing a tooling costume, and the security-notebook world has spent most of a decade discovering that the default answer, plain Jupyter, is weak on both.

I want to walk through the fork honestly, because the two serious options are good at opposite things. Databricks has made the notebook genuinely production-ready for security, with shared accelerators that run real detections, and the cost of that is a proprietary authoring layer that re-creates exactly the lock-in the open table formats spent 2024 dismantling one level down. marimo, the reactive Python notebook, fixes the reproducibility and review problems that have dogged Jupyter from the start, and the cost of that is an ecosystem that, for security specifically, does not exist yet. Neither is a clean win, and the empirically honest version of this essay refuses to pretend otherwise.

The reconciliation, which I'll get to at the end, is that the thing you actually ship, the detection that fires at 3 a.m., is a Sigma rule or a SQL statement or a small Python module, and the notebook is the document that explains how that detection was derived. Keep that distinction clear and the fork gets a lot easier to work through, because you stop asking which notebook owns your detections and start asking which notebook is the better place to do the work that produces them.

The corpus that cooled

The security-Jupyter movement peaked around 2021 and went quiet.

If you went looking, a few years ago, for security use cases shared as notebooks, you found one cluster and one library. The cluster is Roberto Rodriguez's: the ThreatHunter-Playbook (ATT&CK-organized hunt analytics published as interactive notebooks and a Jupyter Book, around 4,600 stars), Security-Datasets, formerly Mordor (replayable event data that made those notebooks reproducible, around 1,760 stars), HELK (an ELK-plus-Spark-plus-Jupyter hunting stack, around 3,900 stars), and the Infosec Jupyterthon, the online event where researchers presented their favorite notebooks. The library is Microsoft's msticpy, which gives a hunting notebook data providers for Microsoft Sentinel, Defender, Graph, and Splunk, plus enrichment and analysis. (Repo metadata Tier A, pulled from GitHub in June 2026; the maintenance reads below are from commit history.)

Here is the part that surprised me when I went back to check. Most of that corpus is dormant. Security-Datasets has not had a commit since March 2024, HELK since mid-2024, and the Infosec Jupyter Book primer since 2020; the Jupyterthon's last confirmed edition was February 2024. The ThreatHunter-Playbook still shows a 2026 commit, but it's mid-reinvention toward LLM agent tooling rather than steady notebook curation. The one project that's unambiguously healthy is msticpy, last touched the day before I wrote this, and msticpy is a library, not a notebook collection. So the durable thing that survived is the code you import into a notebook, not the notebooks themselves.

I don't read that as "notebooks failed." I read it as notebooks finding their level. They turned out to be very good for exploration and for sharing a worked analysis, and weak as the system of record for production detections, and the engineering that needed to be durable migrated into libraries (msticpy, Google's secops SDK) and into text-based tooling that makes notebooks behave in git (jupytext, nbdime, both actively maintained). That a thriving toolchain exists specifically to paper over notebooks-in-version-control is the tell that the version-control problem is real, and it's the problem the rest of this essay is about.

What Jupyter actually costs you

Hidden state and JSON-in-git are the reproducibility tax.

The classic critique is Joel Grus's "I Don't Like Notebooks" (JupyterCon 2018, Tier B), and its core is the gap between what a notebook shows and what its program state actually is. Cells run in whatever order you click them, every assignment stays alive in one mutable namespace, and so a notebook's on-screen output can reflect a cell you deleted ten minutes ago, or cells run out of order, with no warning. The empirical measure people cite is Pimentel et al. (MSR 2019, Tier A), who analyzed close to a million public notebooks and found that only about 24% of those with a recorded execution order re-ran top to bottom without error, and only about 4% reproduced their original results. For a hunt notebook that someone is supposed to trust and re-run during an audit, a 4% reproduction rate is not a footnote, it's the whole problem.
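The hidden-state failure is easy to reproduce outside Jupyter. The sketch below is a toy model, assuming a kernel can be approximated as one shared dict that cell sources are exec'd into; the cell contents and variable names are hypothetical. It shows a session whose on-screen result depends on a cell that is no longer in the saved file, and a clean top-to-bottom re-run that fails.

```python
# Toy model of a notebook kernel: every cell runs in one shared namespace.
# Cell contents and names are hypothetical, for illustration only.

def run_cells(cells, namespace):
    """Execute cell sources in the given order against a shared namespace."""
    for source in cells:
        exec(source, namespace)
    return namespace

# The cells that are actually saved in the notebook file.
saved_cells = [
    "scores = {'a.example': 0.95, 'b.example': 0.6, 'c.example': 0.85}",
    "matches = sorted(d for d, s in scores.items() if s > threshold)",
]

# Interactive session: the analyst ran a cell defining `threshold`, then
# deleted that cell. The variable is still alive in the kernel.
kernel = {}
run_cells(["threshold = 0.5"], kernel)   # this cell was later deleted
run_cells(saved_cells, kernel)
interactive_matches = kernel["matches"]  # what the notebook displays

# A second analyst restarts and runs the saved file top to bottom.
try:
    run_cells(saved_cells, {})
    rerun_error = None
except NameError as exc:
    rerun_error = str(exc)

print("interactive:", interactive_matches)
print("clean re-run error:", rerun_error)
```

The interactive result looks fine and nothing on screen warns that it depends on deleted state, which is the failure mode the 24% and 4% figures are measuring at scale.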

The second cost is in version control. A .ipynb file is JSON containing code, rendered outputs, metadata, and an execution counter, so a one-line change to a query produces a diff full of JSON noise, rendered charts and images are unreviewable in a normal pull request, and large notebooks frequently fail to render a diff on GitHub at all. A security team trying to run detection content through change management, with review and approval, hits this immediately, because the reviewer can't see what changed. The mitigations are real and widely used (restart-and-run-all as a discipline, nbstripout to drop outputs before commit, jupytext to mirror the notebook as a plain script), and the fact that a whole tooling layer exists to make notebooks reviewable is the strongest evidence that, untreated, they aren't.
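The diff noise can be made concrete with the standard library alone. This is a minimal sketch, assuming a heavily trimmed notebook dict in the nbformat 4 layout; `strip_outputs` only approximates what nbstripout does, and the query and result rows are hypothetical. It counts the changed lines a reviewer would face for a one-line query edit, with and without outputs in the file.

```python
import difflib
import json

def make_notebook(query, execution_count, rows):
    """Build a minimal .ipynb-shaped dict (nbformat 4 layout, heavily trimmed)."""
    return {
        "cells": [{
            "cell_type": "code",
            "execution_count": execution_count,
            "metadata": {},
            "source": [query],
            "outputs": [{
                "output_type": "execute_result",
                "execution_count": execution_count,
                "metadata": {},
                "data": {"text/plain": [f"{r}\n" for r in rows]},
            }],
        }],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }

def strip_outputs(nb):
    """Return a copy with outputs and execution counts cleared."""
    clean = json.loads(json.dumps(nb))  # deep copy via a JSON round trip
    for cell in clean["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return clean

def changed_lines(before, after):
    """Count added and removed lines between two serialized notebooks."""
    a = json.dumps(before, indent=1, sort_keys=True).splitlines()
    b = json.dumps(after, indent=1, sort_keys=True).splitlines()
    diff = difflib.unified_diff(a, b, lineterm="", n=0)
    return sum(
        1 for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )

# One-line change to the query; re-running also changes counters and output.
v1 = make_notebook("SELECT * FROM dns WHERE score > 0.9", 3,
                   ["a.example", "b.example"])
v2 = make_notebook("SELECT * FROM dns WHERE score > 0.8", 7,
                   ["a.example", "b.example", "c.example"])

raw_diff = changed_lines(v1, v2)
stripped_diff = changed_lines(strip_outputs(v1), strip_outputs(v2))
print("changed lines with outputs:", raw_diff)
print("changed lines after stripping:", stripped_diff)
```

Even in this tiny case the unstripped diff is several times larger than the change the author actually made, and real notebooks carry base64 images and much larger outputs.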

I covered the day-to-day discipline that keeps Jupyter honest (restart-and-run-all, commit the notebook, freeze the dataset) in the Jupyter-to-MLflow piece, and everything there still holds: if Jupyter is what your team knows, those habits get you most of the way. The point of the next two sections is that the two production paths, Databricks and marimo, move the reproducibility and portability problems out of the realm of discipline and into the realm of architecture, in opposite directions.

The Databricks path

Databricks made the notebook a production sharing surface.

Databricks' security story runs on notebooks in a way the Jupyter corpus never quite managed, because the notebook there is also the production runtime. The security solution accelerators ship as runnable Databricks notebooks: the DNS-analytics accelerator ("Detecting Criminals and Nation States through DNS Analytics") and the IOC-matching accelerator are GitHub repos you clone into Databricks Repos and run, with the IOC one wiring continuous matching through Delta Live Tables. So a shared, runnable security use case at Databricks is a notebook, and unlike a ThreatHunter-Playbook notebook you run on a laptop, it's already attached to the engine that holds the data and the scheduler that runs it in production.

The platform has leaned into this hard. Over fourteen months the framing went from "augment your SIEM" to a "Data Intelligence for Cybersecurity" agent platform (September 2025) to Lakewatch, announced 24 March 2026 as an "open agentic SIEM," with Anthropic's Claude as the reasoning layer and acquisitions of Antimatter and SiftD.ai alongside (all Tier C, vendor announcements; SiftD.ai's claimed pedigree as Splunk SPL's creator is their claim, not independently verified). Zerobus, generally available 23 February 2026, is a serverless API that writes events straight into Delta tables and skips the message bus. The honest read on adoption is that named customers have presented their own migrations, most usefully Standard Chartered's "self-managed SIEM on Databricks" talk at the 2025 Data and AI Summit (Tier B, a customer speaking for itself), but nearly every quantified claim routes through Databricks marketing and the specific numbers disagree across sources, so I'll cite the direction and not the percentages.

Taken at face value, this is the most complete answer to "where do production detection notebooks live" that anyone is shipping. It offers real-time co-editing, per-notebook permissions, git integration, a one-click path from notebook to scheduled job, and natural-language querying through Genie. If the question were only about capability, Databricks would be the recommendation and the essay would end here. The reason it doesn't is the format the capability is delivered in.

The catch

The lock-in moved up one more layer, to the notebook itself.

The lock-in story in security data has a clear shape. For years it lived in the storage layer, in proprietary formats you couldn't read with anything but the vendor's engine. Open table formats broke that in 2024: with Iceberg and Delta, Snowflake's Polaris, and Unity Catalog federation, the data became genuinely portable. The honest follow-on, which I've written about as pipeline lock-in, is that the control point didn't disappear, it migrated up the stack to the pipeline and control plane, through a wave of acquisitions where "vendor-neutral" routing tools became single-vendor on-ramps. The Databricks notebook adds a third stop to that migration, one layer higher still: the authoring layer.

Databricks deserves real credit at the storage layer. Managed Iceberg, Iceberg v3, and an Iceberg REST catalog mean Trino, DuckDB, or Snowflake can read your security data without moving it, which is a genuine improvement and undercuts the lazy "it's all proprietary" critique. But a Databricks security notebook is not a Jupyter notebook. It's Spark plus dbutils plus Unity Catalog bindings plus display() and widgets plus Delta Live Tables decorators, and the native bundle format, .dbc, is a proprietary archive that carries outputs and doesn't open in plain Jupyter without conversion (Tier A, Databricks export-format docs). The data is portable; the detection logic written on top of it is not. A cell like this runs in exactly one place:

# Databricks notebook source
# A detection cell that only runs inside Databricks.
# spark, dbutils, and display() are injected by the Databricks runtime;
# beaconing_score is assumed to be a UDF defined in an earlier cell.

creds = dbutils.secrets.get(scope="soc", key="ti_api")     # dbutils
df = spark.read.table("security.ocsf.dns_activity")        # spark + UC
matches = df.filter(beaconing_score(df["query"]) > 0.9)
display(matches)                                           # display()

# Schedule as a job, write back to a Delta table: all Databricks-bound.

Move that to another engine and you rewrite it. The portable unit of a detection program isn't the notebook, it's the Sigma rule or the SQL underneath, and the .dbc wraps the portable thing in a container that isn't. None of this makes Databricks a bad choice; for a shop that has already standardized on it, the notebook-as-production-runtime is a strong, coherent answer. It does mean the "open, no lock-in" line is true about where the bytes sit and misleading about where your detection engineering sits, and a team buying in on the open-formats promise should know the promise stops at the table layer.
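One way to keep the portable part portable is to hold the detection as SQL text and hand it to whatever engine is available. This is a minimal sketch, assuming the score is already materialized as a column and using Python's built-in sqlite3 purely as a stand-in engine. The table, columns, and thresholds are hypothetical, and real engines differ in SQL dialect, so treat it as the shape of the idea rather than a portability guarantee.

```python
import sqlite3

# Hypothetical detection kept as plain SQL text; in a repo this would be
# a .sql file under version control, reviewed like any other code.
DETECTION_SQL = """
SELECT src_ip, COUNT(*) AS query_count
FROM dns_activity
WHERE score > 0.9
GROUP BY src_ip
HAVING COUNT(*) >= 2
ORDER BY src_ip
"""

def run_detection(conn, sql=DETECTION_SQL):
    """Run the detection over a DB-API 2.0 connection and return the rows."""
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()

def demo_connection():
    """Build an in-memory table of made-up events to exercise the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dns_activity (src_ip TEXT, query TEXT, score REAL)"
    )
    conn.executemany(
        "INSERT INTO dns_activity VALUES (?, ?, ?)",
        [
            ("10.0.0.5", "x1.example", 0.95),
            ("10.0.0.5", "x2.example", 0.97),
            ("10.0.0.9", "y.example", 0.99),
            ("10.0.0.7", "z.example", 0.20),
        ],
    )
    return conn

hits = run_detection(demo_connection())
print(hits)
```

The runner knows nothing about the engine beyond the connection it is given, so the notebook that derived the query can change without touching the detection file.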

The marimo path

marimo fixes the reproducibility problem by construction.

marimo is a reactive Python notebook, and the reactive part is the answer to Grus's complaint. It parses each cell statically into a dataflow graph and runs cells in dependency order rather than click order, so when a cell changes, the cells that depend on its variables re-run, and when you delete a cell, its variables are scrubbed from memory. Hidden state, the thing that makes a Jupyter notebook lie about what produced its output, is gone by construction rather than by discipline. The price is two structural rules (no cycles between cells, no variable defined in more than one cell), which is exactly the friction that trips up someone porting a messy notebook, and exactly the constraint that makes the execution deterministic.
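The idea of deriving run order from variable use can be sketched in a few lines. This is a toy illustration, not marimo's implementation: it assumes each cell is a string of Python, treats every stored name as a cell-level definition (ignoring function and comprehension scoping), and the cell names are hypothetical. It enforces the same two rules, one defining cell per variable and no cycles.

```python
import ast
from graphlib import TopologicalSorter

def cell_defs_and_refs(source):
    """Return (names defined, names read from other cells) for one cell."""
    defs, refs = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)
            else:
                refs.add(node.id)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defs.add((alias.asname or alias.name).split(".")[0])
    # Names both defined and read inside the cell are treated as local to it.
    return defs, refs - defs

def execution_order(cells):
    """cells maps cell name -> source. Returns cell names in dependency order."""
    analysis = {name: cell_defs_and_refs(src) for name, src in cells.items()}
    owner = {}
    for name, (defs, _) in analysis.items():
        for var in defs:
            if var in owner:
                raise ValueError(
                    f"{var!r} is defined in both {owner[var]!r} and {name!r}"
                )
            owner[var] = name
    # Each cell depends on the cells that own the variables it reads.
    graph = {
        name: {owner[var] for var in refs if var in owner}
        for name, (_, refs) in analysis.items()
    }
    # static_order() raises graphlib.CycleError if cells depend on each other.
    return list(TopologicalSorter(graph).static_order())

# Cells listed in "click order"; the graph decides the real order.
cells = {
    "filter": "matches = [r for r in rows if r['score'] > threshold]",
    "load": "rows = [{'q': 'a.example', 'score': 0.95}]",
    "config": "threshold = 0.9",
}
order = execution_order(cells)
print(order)
```

Because the order comes from the code rather than from the session history, two people running the same file get the same sequence, which is the property the two structural rules buy.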

The second thing marimo does is store the notebook as a pure .py file, no JSON envelope and no embedded output blobs, so a one-line change to a detection is a one-line diff a reviewer can actually read. SQL is a first-class cell type that serializes as a normal Python call and runs against DuckDB or against in-memory dataframes, which fits a lakehouse-backed SOC well. And the same file runs three ways: marimo edit to author interactively, marimo run to serve it as a locked-down app, and python notebook.py to run it headless as a scheduled job, with no marimo server in the loop. The "notebook to production job" step that's a rewrite elsewhere is, here, the same file.

# A marimo notebook IS this .py file, so git sees readable code.

import marimo
app = marimo.App()

@app.cell
def _():
    import duckdb
    # beaconing_score is assumed to be a scoring function registered
    # with DuckDB elsewhere; it is not a DuckDB built-in.
    df = duckdb.sql("""
        SELECT src_ip, query, beaconing_score(query) AS score
        FROM read_parquet('s3://security-lake/dns/*.parquet')
        WHERE event_time > '2026-01-01'
    """).df()
    return (df,)

@app.cell
def _(df):
    # Reactive: edit the threshold and this cell, plus only its
    # dependents, re-runs. No stale state, no out-of-order surprises.
    matches = df[df.score > 0.9]
    return (matches,)

if __name__ == "__main__":
    app.run()  # lets `python notebook.py` execute the cells headless

There's one more piece that turns "git-friendly" into "reproducible months later": marimo can inline its dependencies into the file as PEP 723 metadata and run in a uv-managed sandbox, so the notebook is pinned down to its packages. For detection content you need to trust and re-run during an investigation or an audit, that combination (deterministic execution, readable diffs, pinned dependencies) is the property you actually want, and it's the property plain Jupyter has never had without a stack of bolt-ons. Graded on the notebook mechanics alone, this is the better-built tool, and that claim is Tier A.

The honest gap

There is no security ecosystem on marimo yet.

Here's where I have to slow down, because the temptation, for someone who values open formats and reproducibility as much as I do, is to call marimo the answer and move on. It isn't the answer for security, at least not today. The pre-built security-notebook ecosystem is effectively all Jupyter. msticpy's data providers, the Microsoft Sentinel and Defender integrations, the OTRF hunt content, the Jupyterthon corpus, all of it assumes the Jupyter and IPython display model and the %magic conventions. None of that is marimo-native. A team that adopts marimo for hunting today loses the pre-built security toolchain on day one and becomes the team that builds the missing layer: the data providers to Splunk, Sentinel, and the EDR APIs, the secrets handling, the scheduling glue, a thin equivalent of what msticpy already gives the Jupyter world. That's undifferentiated work that the incumbent ecosystem has already done.

So the claim "marimo is better for security detection notebooks in production" is Tier D, aspirational. I searched for a counterexample, a documented case of a security team running detection or hunting on marimo, and found one-off demos and nothing operational. The architectural fit is genuine, every property in the previous section maps onto something a detection-as-code program needs, but architectural fit is an argument, not a deployment, and I'd be overclaiming to present it as more than a bet you'd have to prove.

There's also a governance wrinkle worth tracking. CoreWeave announced its acquisition of marimo on 30 October 2025, and the project folds into the Weights and Biases developer platform. It stays Apache-2.0 and the open-source commitments are stated, and the license already in the wild can't be clawed back, but the project's strategic direction now serves a GPU-cloud company's agenda rather than a neutral notebook mission. That's a reason to watch the roadmap, not a reason to distrust the tool, and it belongs in any honest evaluation of betting a security practice on it.

The reconciliation

The detection that ships is text, and the notebook is how you got there.

The way out of the fork is to stop treating the notebook as the place detections live and start treating it as the place detections are derived. The artifact that runs in production, that fires an alert and that an analyst reads during triage, should be text: a Sigma rule, a SQL statement, a small versioned Python module, committed to the same repository as the rest of your detection content and reviewable like any other code. The notebook is the lab record, the document that explains why the rule has an entropy check and a signed-binary filter and an off-hours condition, so that when the rule fires six months from now and someone asks why it exists, the answer is in the notebook rather than in someone's memory. That's the same point I made about Jupyter and MLflow; the notebook doesn't replace the detection rule, it justifies it.

Hold that distinction and the fork loses most of its teeth. On Databricks, the answer is to keep the durable detection in text, a Sigma rule or a SQL file under version control, and treat the .dbc notebook as scratch and derivation rather than the system of record, so that the proprietary authoring layer never becomes the thing you can't leave. On marimo, the .py file is already text and already diffable, which makes the derivation itself a first-class reviewable artifact, and the missing-ecosystem problem shrinks to "can I get the data into a dataframe," which DuckDB over Parquet or Iceberg largely solves. Either way, the test of whether a team has reached real detection-engineering maturity is the same: a new analyst can answer "why does this detection exist" by reading the file that produced it, and the production detection is portable text that doesn't care which notebook wrote it.
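A "small versioned Python module" can be very small indeed. This is a minimal sketch of the entropy check mentioned above, written as pure functions with no notebook or engine dependency. The function names, the 3.5-bit threshold, and the 12-character minimum are hypothetical choices that a real derivation notebook would have to justify against data.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy in bits per character; 0.0 for empty input."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def is_suspicious_query(query, threshold=3.5, min_length=12):
    """Flag a DNS query whose leftmost label is long and high-entropy."""
    label = query.split(".", 1)[0].lower()
    return len(label) >= min_length and shannon_entropy(label) > threshold

# Usage: the same functions can be imported by a notebook, a scheduled
# job, or a unit test without modification.
examples = [
    "www.example.com",
    "mail-server-01.example.com",
    "x7f3k9q2m4z8b1c6.example.com",
]
flagged = [q for q in examples if is_suspicious_query(q)]
print(flagged)
```

The module is the thing that ships; the notebook that explains why the threshold sits where it does is the lab record beside it.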

This is the applied-bridge move I keep coming back to, taking a discipline the data-engineering world already worked out, reproducible analysis as reviewable, version-controlled, dependency-pinned code, and carrying it into security, where the default is still a query pasted into a console and lost when the tab closes. The data world solved the reproducible-notebook problem; security mostly hasn't adopted the solution, and naming the fork honestly is the first step toward carrying it across.

The agentic wrinkle

Agents don't moot the format question, they raise the stakes on it.

The strongest objection to this whole framing is that it's about to be obsolete. Roberto Rodriguez's recent work is on agentic SOC workflows where an LLM executes code inside notebooks to drive the analysis, and Databricks' Genie and Lakewatch push the same direction, the analyst asking in natural language and an agent generating and running the query. If the durable interface becomes the agent plus its tool calls, then arguing about Jupyter versus marimo, or open versus proprietary notebooks, looks like arguing about a layer that's being abstracted away.

I think that gets it backwards. An agent that writes detection code needs the reproducible, reviewable, importable form even more than a human does, because the failure mode of an agent is confident wrong output, and the only defense is being able to re-run what it produced and read the diff of what it changed. A notebook format with hidden state and unreviewable diffs is a worse foundation for agent-generated detections than for human-written ones. So the format question doesn't dissolve under agents; it becomes the thing that determines whether you can trust what the agent did. That's an argument for the reproducible, text-first model, whichever notebook you run it in.

Where to start

Keep the detection in text; pick the notebook for the work it does.

If you're already on Databricks, the practical guidance is to use the notebook for derivation and keep the production detection as a Sigma rule or SQL file in version control, so the accelerators stay a fast start rather than a place your logic gets trapped. If you're building fresh and you weight open formats and reproducibility the way I do, marimo over DuckDB is the bet worth making, with eyes open about the ecosystem you'll be building yourself and the CoreWeave roadmap you'll be watching. And if you're on Jupyter and it's working, the discipline in the Jupyter-to-MLflow piece gets you most of the reproducibility without a tool change.

The experiment that would settle the open side of this is small and concrete, and it's on my lab list: take a real Zeek or OCSF hunt, build it as a marimo .py notebook over DuckDB, run it all three ways, and put the git diff and the reproducibility next to the same hunt as a .ipynb and a Databricks .dbc. If the one-line-change-equals-one-line-diff claim holds and the headless run reproduces, that moves "marimo is better for security work" off the Tier-D shelf. Until someone runs that, the honest position is the one this essay lands on: marimo provably fixes what's broken about notebooks, security hasn't built on it yet, and the detection you ship should be text either way.