Practical implementation

Jupyter to MLflow for reproducible threat hunting.

Analysts who hunt in a SIEM console lose their work the moment they close the browser tab. Analysts who hunt in Jupyter notebooks keep the hypothesis, the SQL, the pandas transforms, and the charts in one file that another analyst can re-run six months later. MLflow is what comes next: the step where ten variants of a detection become a comparable record instead of a row in a spreadsheet. This is the HMM3 bridge, written for people who are already doing the work.

Reading time: about 16 minutes. Evidence tier: B–C overall (tool documentation from Project Jupyter and MLflow plus practitioner patterns from Microsoft Sentinel, Secureworks Taegis, and SpecterOps). Learning timeline figures are practitioner estimates, flagged in line. No vendor benchmarks are claimed.

The Tuesday morning problem

Reproducibility is the gap between hunting and detection engineering.

It's Tuesday. I open the SIEM console, paste a hunting query I spent an hour crafting yesterday, and run it against two weeks of CloudTrail logs. The query completes. Zero results. I stare at the screen. Was the query wrong? Is the environment actually clean? I cannot remember the exact parameters from yesterday: was it fourteen days or seven? Did I filter out service accounts?

A week later my manager asks: "Can you re-run that CloudTrail hunt from last Tuesday? We need to validate the results." I try. But the data has changed (new logs arrived, old logs aged out), the exact query logic is gone, and my notes are scattered across three Slack messages and a Word doc.

This is the reproducibility problem in threat hunting, and it's the same problem data scientists faced ten years ago, so the data-science community already built the tools to solve it. They are free, open source, and well-documented, but most threat hunters haven't picked them up yet.

The Hunting Maturity Model (HMM) describes a ladder. At HMM1 and HMM2, hunts live in heads and chat logs. At HMM3, the team starts treating hunts as artifacts that are documented, repeatable, and reviewable. At HMM4, those artifacts feed automated detection pipelines. The Jupyter-to-MLflow path is the HMM3 work. You can't skip it, and you don't need anything more elaborate until it is in place.

Why two tools, in this order

Jupyter captures one hunt. MLflow compares many.

Jupyter and MLflow solve adjacent problems and they're easy to confuse. Jupyter is for the single-hunt artifact: hypothesis, dataset, query, refinement, finding, prototype detection rule, in one file. MLflow is for the comparative record: when I run the same hunt with ten different filter combinations, MLflow gives me a sortable table so I can see which variant produced the best precision-recall balance.

Many of the analysts I see try to adopt MLflow before they are comfortable in Jupyter, and they end up frustrated. MLflow assumes you already have a script or notebook that runs end to end, so if the hunt still lives in the SIEM console there is nothing for MLflow to track. Start with Jupyter, and add MLflow once you find yourself running the same hunt with three or four variants and losing track of which one performed best.

Both tools are free and open source. Both have moderate learning curves. Jupyter is approachable for anyone who knows Python and SQL; MLflow takes a bit longer because the conceptual model (runs, experiments, parameters, metrics, artifacts) is new even if the API is straightforward. Practitioner estimates put the combined time-to-productive at roughly fifteen to twenty-five hours of focused work across both tools, based on training feedback from RSA and similar conferences. Your mileage will vary with prior Python and Git fluency.

Jupyter, concretely

A hunt notebook is a lab notebook for SOC work.

The notebook mechanics get a fuller treatment in the MLOps-tools overview. The relevant point here is that security analysts use Python and SQL inside notebooks, querying a data lake through DuckDB, PySpark, or pandas. The pattern is well established: Microsoft Sentinel ships Jupyter integration, Secureworks Taegis uses it, and SpecterOps publishes notebook-based hunt content.

What a notebook captures that a SIEM console doesn't:

Query logic. SQL saved as a code cell, not lost when the session expires.
Analysis code. The pandas filter that removed the compliance scanner false positives, with a comment explaining why.
Visualizations. Charts rendered inline at the point in the workflow where they matter, not screenshotted into a separate slide deck.
Narrative. Markdown cells explain why each step happened, not just what it produced. This is the difference between an artifact a teammate can use and one they have to reverse-engineer.
Reproducibility. Six months from now, another analyst opens the notebook, runs every cell, and gets the same answer, assuming the underlying data is preserved (which is what the dataset-versioning conversation is for in a later post).

The notebook doesn't replace the detection rule that eventually runs in production. It's the document that explains how the rule was derived, so when the rule fires six months from now and someone asks "why does this exist?", the notebook is where they find the answer.

A worked example

What a CloudTrail hunt notebook looks like end to end.

Here is the structure I use for a hunt campaign that starts with the hypothesis "adversaries are using compromised cloud credentials to disable logging and evade detection." Cells alternate between Markdown for narrative and code for execution.

The first Markdown cell sets the frame:

# Hunt Campaign: Suspicious Cloud API Activity

## Hypothesis
Adversaries are using compromised cloud credentials to disable
logging and evade detection.

## Dataset
- Source: CloudTrail logs (S3 / Iceberg)
- Timeframe: 2026-01-01 to 2026-01-14 (14 days)
- Volume: ~2.3 TB, 847M events
- Dataset version: cloudtrail_baseline_v2.3 (frozen for reproducibility)

The first query asks the broad question. Across the data lake, who has been calling the APIs that suppress or remove logging?

SELECT
    eventTime,
    userIdentity.principalId,
    eventName,
    sourceIPAddress,
    userAgent
FROM iceberg_scan('s3://security-lake/cloudtrail/')
WHERE
    eventName IN ('StopLogging', 'DeleteTrail', 'PutEventSelectors')
    -- Half-open range so events on 2026-01-14 are included
    AND eventTime >= '2026-01-01' AND eventTime < '2026-01-15'
ORDER BY eventTime DESC;

Forty-seven results come back, which is too many to triage without filtering, so the next code cell is pandas narrowing the candidate set against two known patterns: an automated compliance scanner and routine business-hours activity. The cell is intentionally a separate step from the SQL, because the filters encode operational knowledge that belongs in narrative rather than buried in a WHERE clause.

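import pandas as pd

# query_1 holds the SQL above; engine is the notebook's existing database connection
df = pd.read_sql(query_1, engine)

# Compliance scanner: known automated false positive (na=False keeps rows with no userAgent)
df = df[~df['userAgent'].str.contains('ComplianceScanner', na=False, regex=False)]

# CloudTrail timestamps are UTC, so convert to Eastern time before filtering on the hour
local_hour = pd.to_datetime(df['eventTime'], utc=True).dt.tz_convert('America/New_York').dt.hour

# Drop business hours (9 AM to 5 PM Eastern), which are dominated by legitimate use
df = df[(local_hour < 9) | (local_hour >= 17)]

print(f"Remaining candidates: {len(df)}")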

Twelve candidates remain. A Markdown cell records what the analyst learned: the true-positive pattern is off-hours activity plus an unfamiliar geographic source plus the specific logging-disable APIs, and the false-positive pattern is the compliance scanner during scheduled maintenance windows. A second code cell prototypes the detection rule that follows from those findings: the same logic, now expressed as production SQL with the filters inline.
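The triage logic described above can also be written as a single predicate, which is a handy intermediate step before translating it into production SQL. This is a minimal sketch under stated assumptions, not the rule from the actual notebook: the name `is_candidate`, the event dictionary shape, and the fixed UTC-5 offset for EST are all illustrative, and the geographic check is left out because it depends on an enrichment source this post doesn't describe.

```python
from datetime import datetime, timedelta, timezone

# The logging-suppression APIs from the broad query.
LOGGING_APIS = {"StopLogging", "DeleteTrail", "PutEventSelectors"}
# Assumption: a fixed UTC-5 offset for EST; a real rule would handle daylight saving.
EST = timezone(timedelta(hours=-5))


def is_candidate(event: dict) -> bool:
    """Return True when an event survives the notebook's filters."""
    if event["eventName"] not in LOGGING_APIS:
        return False
    # Known automated false positive: the compliance scanner.
    if "ComplianceScanner" in (event.get("userAgent") or ""):
        return False
    # CloudTrail timestamps are UTC (ISO 8601 with a trailing Z).
    event_time = datetime.strptime(event["eventTime"], "%Y-%m-%dT%H:%M:%SZ")
    local_hour = event_time.replace(tzinfo=timezone.utc).astimezone(EST).hour
    # Keep only activity outside 9 AM to 5 PM.
    return local_hour < 9 or local_hour >= 17


# Hypothetical events for illustration only.
off_hours = {"eventName": "StopLogging", "userAgent": "aws-cli/2.15",
             "eventTime": "2026-01-05T07:30:00Z"}   # 02:30 EST
scanner = {"eventName": "StopLogging", "userAgent": "ComplianceScanner/1.2",
           "eventTime": "2026-01-05T07:30:00Z"}
business = {"eventName": "DeleteTrail", "userAgent": "aws-cli/2.15",
            "eventTime": "2026-01-05T19:00:00Z"}    # 14:00 EST

print([is_candidate(e) for e in (off_hours, scanner, business)])
# prints [True, False, False]
```

Keeping the predicate next to the Markdown cell that explains it makes it easier to check that the production SQL and the notebook finding still agree.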

The structure is what matters here: hypothesis, dataset, broad query, filter, finding, prototype detection. Six months later a different analyst can open this file and see what I did and why, which makes the notebook the audit trail for the eventual detection rule.

Connecting to the lake

DuckDB is the cheapest path from a notebook to Iceberg.

A notebook is only useful if it can reach the data. The lowest-friction option for a lakehouse-backed SOC is DuckDB: you install it as a Python package, point it at S3, and query Iceberg tables directly from a notebook cell. There is no server, no driver-installation argument with the SRE team, and no JDBC connection string. I cover DuckDB for threat hunting, including the limits of this pattern, in a separate post.

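import duckdb

con = duckdb.connect()
# Iceberg and S3 support ship as DuckDB extensions; S3 credentials are assumed to be configured
for ext in ("httpfs", "iceberg"):
    con.install_extension(ext)
    con.load_extension(ext)

df = con.sql("""
    SELECT * FROM iceberg_scan('s3://security-lake/cloudtrail/')
    WHERE eventTime > '2026-01-01'
""").df()

print(df.head())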

For larger workloads, the notebook can connect to a shared engine instead: StarRocks or ClickHouse via SQLAlchemy, Trino via its Python client, or PySpark for very large jobs. The notebook code stays structurally the same because only the connection target changes. That portability is one of the reasons I push analysts toward notebooks early, since the same code, apart from minor SQL dialect differences, runs against a local DuckDB cache for development and a production engine for the real hunt.
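One way to keep that portability honest is to put the hunt in a function that takes the connection as an argument. The sketch below is an illustration under assumptions, not a reference implementation: it uses the standard-library `sqlite3` module as a stand-in for the development engine, and the table name `cloudtrail`, the function `run_hunt`, and the sample rows are all hypothetical. A DuckDB connection accepts the same `execute(sql, params).fetchall()` call shape, though timestamp handling and dialect details will differ between engines.

```python
import sqlite3

HUNT_SQL = """
    SELECT eventName, COUNT(*) AS events
    FROM cloudtrail
    WHERE eventName IN ('StopLogging', 'DeleteTrail', 'PutEventSelectors')
      AND eventTime >= ? AND eventTime < ?
    GROUP BY eventName
    ORDER BY eventName
"""


def run_hunt(conn, start: str, end: str) -> list[tuple]:
    """Run the hunt against whichever connection the caller supplies."""
    return conn.execute(HUNT_SQL, (start, end)).fetchall()


# Development stand-in: an in-memory table with made-up rows.
dev = sqlite3.connect(":memory:")
dev.execute("CREATE TABLE cloudtrail (eventTime TEXT, eventName TEXT)")
dev.executemany("INSERT INTO cloudtrail VALUES (?, ?)", [
    ("2026-01-03T02:14:00Z", "StopLogging"),
    ("2026-01-09T23:40:00Z", "DeleteTrail"),
    ("2026-01-10T11:05:00Z", "ListBuckets"),   # not a logging API
    ("2026-01-20T01:00:00Z", "StopLogging"),   # outside the window
])

results = run_hunt(dev, "2026-01-01", "2026-01-15")
print(results)
# prints [('DeleteTrail', 1), ('StopLogging', 1)]
```

Swapping the development connection for a production one then touches a single line of the notebook, which is the property the paragraph above is describing.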

Getting started takes about fifteen minutes if you already have Python: install JupyterLab and DuckDB with pip, launch jupyter lab, open a new notebook, and run a SELECT. If you know SQL and basic Python, you'll be productive within a working day. Practitioner training data suggests two to four hours for the first useful hunt, longer for analysts new to pandas. The interface is intentionally simple: Shift+Enter executes a cell, Markdown cells render inline, and charts appear where the code runs.

MLflow, concretely

A run is a hunt variant. An experiment is a hunt campaign.

MLflow is the open-source experiment tracker, and the MLOps-tools overview covers its data model in full. What matters for hunting is that the model carries over cleanly: a run (one execution of a script or notebook, logging parameters, metrics, and artifacts) maps to a hunt variant, and an experiment maps to a hunt campaign. The web UI lets you sort, filter, and compare runs side by side.

The case for MLflow becomes obvious the third or fourth time you find yourself tracking ten variants of a PowerShell detection rule in a spreadsheet. Variant one alerts on any Invoke-Expression. Variant two adds a signed-script filter. Variant three excludes known admin users. Variant four adds an entropy score for obfuscated commands. Variant five adds parent-process checks. Each variant has a true-positive count, a false-positive count, a precision number, and an estimated recall. Without MLflow, you track that in a Google Sheet, and the link between each number and the exact query and dataset that produced it is lost almost immediately. With MLflow, the notebook itself records the run as a side effect of executing.

import mlflow

with mlflow.start_run(run_name="powershell_hunt_v3"):
    # Inputs: what configuration was this run?
    mlflow.log_param("dataset", "windows_events_v2.1")
    mlflow.log_param("timeframe_days", 14)
    mlflow.log_param("detection_logic",
                     "invoke_expression + signed_filter + admin_filter")

    # Execute the hunt (run_detection_query stands in for your own query function)
    results = run_detection_query()

    # Outputs: how effective was it? The counts come from analyst triage of the results.
    true_positives, false_positives = 12, 47
    mlflow.log_metric("true_positives", true_positives)
    mlflow.log_metric("false_positives", false_positives)
    mlflow.log_metric("precision",
                      true_positives / (true_positives + false_positives))
    mlflow.log_metric("recall_estimated", 0.85)

    # Files: what did this run produce? Both files must already exist on disk.
    mlflow.log_artifact("detection_rule.sql")
    mlflow.log_artifact("results_chart.png")

After ten variants, you open the MLflow UI and sort by precision. The leading candidate is the row at the top (check that its estimated recall is still acceptable), so you click it, download the attached detection_rule.sql, and you have the production artifact. The comparison view overlays parameters and metrics across selected runs, which is the part that's hard to replicate in a spreadsheet: you can see two runs that differ only in the admin filter, with their precision deltas plotted on the same chart.

A few honest hedges are worth stating here. The precision and recall numbers above are illustrative, and real recall is hard to compute in hunting because you rarely know the full denominator (every true positive in the environment), so what gets logged is usually an "estimated recall" against a labelled subset, and that estimate may carry a wider error bar than the precision figure. MLflow doesn't fix that, but it does ensure that whatever number you computed, the inputs and the computation are recoverable.
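To make the error-bar point concrete, here is a small sketch that puts an interval around both numbers. It is an illustration, not something MLflow does for you: the Wilson score interval is my choice of method, and the labelled subset (20 known-malicious events, 17 caught) is a made-up assumption chosen to match the 0.85 figure logged above.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


# Counts from the run above: 12 true positives, 47 false positives.
tp, fp = 12, 47
precision = tp / (tp + fp)
precision_ci = wilson_interval(tp, tp + fp)

# Assumption: a labelled subset of 20 known-malicious events, 17 of them caught.
labelled_positives, caught = 20, 17
recall_estimated = caught / labelled_positives
recall_ci = wilson_interval(caught, labelled_positives)

print(f"precision {precision:.2f}, "
      f"95% interval {precision_ci[0]:.2f} to {precision_ci[1]:.2f}")
print(f"estimated recall {recall_estimated:.2f}, "
      f"95% interval {recall_ci[0]:.2f} to {recall_ci[1]:.2f}")
```

With these illustrative counts the recall interval comes out wider than the precision interval, simply because the labelled subset is smaller. Logging the interval bounds as extra metrics next to the point estimates keeps that uncertainty visible in the run record.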

Reading the UI

Side-by-side comparison is the feature analysts under-use.

The MLflow UI shows four things worth knowing about: the run list (every variant with parameters and metrics inline), the comparison view (select two or more runs, see them stacked), the charts panel (metric trajectories across runs, so you can see if precision is rising with each iteration), and the artifacts panel (download any file attached to any run).

A typical comparison table from a PowerShell-detection campaign:

Run	Detection logic	TP	FP	Precision
v1	invoke_expression	85	412	17%
v2	+ signed_filter	73	189	28%
v7	+ signed_filter + entropy_scoring	68	6	92%

The jump from v1 (17%) to v7 (92%) is the story this table tells, and so is its cost: true positives fall from 85 to 68 as the filters tighten. Without MLflow, that story tends to get reconstructed in retrospect from chat-log fragments, after someone asks "why does the production rule include the entropy check?" With MLflow, the answer is two clicks away.
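The precision column is plain arithmetic on the TP and FP counts, and it is worth recomputing rather than typing by hand. The sketch below reproduces the table's numbers and ranks the runs; the `tp_retained` ratio is my own illustrative addition (true positives kept relative to v1) and is only a rough proxy, not a recall measurement.

```python
# Rows copied from the comparison table.
runs = [
    {"run": "v1", "logic": "invoke_expression", "tp": 85, "fp": 412},
    {"run": "v2", "logic": "+ signed_filter", "tp": 73, "fp": 189},
    {"run": "v7", "logic": "+ signed_filter + entropy_scoring", "tp": 68, "fp": 6},
]

baseline_tp = runs[0]["tp"]
for r in runs:
    r["precision"] = r["tp"] / (r["tp"] + r["fp"])
    # Illustrative proxy: share of v1's true positives this variant still catches.
    r["tp_retained"] = r["tp"] / baseline_tp

# Highest precision first, which is the same ordering as sorting the UI column.
ranked = sorted(runs, key=lambda r: r["precision"], reverse=True)

for r in ranked:
    print(f"{r['run']}: precision {r['precision']:.0%}, "
          f"TP retained vs v1 {r['tp_retained']:.0%}")
```

Reading the two columns together shows the trade: v7 keeps 80% of the v1 true positives while removing almost all of the false positives.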

Getting MLflow running

Local first, shared backend later.

For a first run, MLflow needs almost nothing. Install with pip, start the UI on localhost, point your notebook at it, and log a test run.

pip install mlflow
mlflow ui   # serves http://localhost:5000

# In a notebook
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")
with mlflow.start_run(run_name="test_hunt"):
    mlflow.log_param("test_param", "hello")
    mlflow.log_metric("test_metric", 42)

That's enough to learn the concepts. For a team, you'll want a shared deployment so multiple analysts log runs to the same place. The standard pattern is a PostgreSQL backend for the run metadata plus an S3 bucket for artifacts, both of which a security team usually already has provisioned.

mlflow server \
    --backend-store-uri postgresql://mlflow:password@localhost/mlflow \
    --default-artifact-root s3://security-lake/mlflow/artifacts/

Production hosting cost is modest. A small PostgreSQL instance plus modest artifact storage runs in the range of fifty to two hundred dollars a month at typical SOC scale, based on AWS list pricing for a db.t3.medium RDS plus a few hundred gigabytes of S3. Managed offerings exist (Databricks MLflow, AWS SageMaker Experiments) if you'd rather not run it yourself, and they cost two to five times the self-hosted figure. I lean self-hosted for security teams because the artifacts are the detection rules themselves, and keeping them inside your existing S3 footprint avoids a separate data-residency conversation.

There is a security catch worth stating plainly, since the whole premise here is that you will run this yourself. The MLflow tracking server has a documented remote-attack-surface record, and the things it holds (your detection rules, your dataset pointers, sometimes the warehouse credentials a run needs) are exactly what an attacker would want. The 2025–2026 advisories are not theoretical. CVE-2025-11201 is an unauthenticated directory traversal that reaches remote code execution through an unvalidated source path, and CVE-2026-2614 lets an unauthenticated request read arbitrary files by smuggling a file:// source past a validation bypass. CVE-2026-2635 shipped exploitable default credentials in basic_auth.ini that hand over admin to whoever finds the port before you rotate them (fixed ahead of 3.8.0rc0), while CVE-2026-2734 lets any authenticated low-privilege tenant read model metadata across tenant boundaries it should not cross. A service that stores the logic your SOC runs on should not be the easiest box on the network.

None of that is a reason to avoid MLflow; it is a reason to run it like the production service it is. Keep the tracking server off the public internet, bound to localhost or behind the same VPN the rest of your SOC tooling already sits behind, and rotate the default credentials on day one, ideally replacing basic auth with your existing SSO. Constrain artifact source paths to an allowlist and reject file:// and absolute filesystem paths, so a traversal has nowhere to escape to. And because a model artifact is executable code, prefer pickle-free serialization like ONNX or Safetensors over raw pickle, joblib, or torch weights, which run arbitrary code on load and turn a poisoned model file into an entry point. The principle is the one this site keeps returning to: the tooling that holds your security logic is security-relevant in its own right, and it earns the same scanning, patching, and network hygiene you would give anything else in production.
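The allowlist idea can be sketched in a few lines. To be clear about the assumptions: this is not an MLflow setting or API. It is a hypothetical check (`is_allowed_source`, with a made-up `ALLOWED_PREFIXES` tuple built from the bucket path used earlier) that you might place in a wrapper or proxy in front of the tracking server, and it does not replace patching.

```python
from urllib.parse import unquote, urlparse

# Assumption: the only place artifacts should ever come from.
ALLOWED_PREFIXES = ("s3://security-lake/mlflow/artifacts/",)


def is_allowed_source(uri: str) -> bool:
    """Accept only s3:// URIs under an allowlisted prefix, with no traversal."""
    parsed = urlparse(uri)
    # Rejects file://, http://, and bare filesystem paths (which have no scheme).
    if parsed.scheme != "s3":
        return False
    # Decode percent-escapes once so an encoded ".." is caught as well.
    path = unquote(parsed.path)
    if ".." in path.split("/"):
        return False
    normalized = f"{parsed.scheme}://{parsed.netloc}{path}"
    return normalized.startswith(ALLOWED_PREFIXES)


checks = {
    "s3://security-lake/mlflow/artifacts/run1/detection_rule.sql": True,
    "file:///etc/passwd": False,
    "/etc/passwd": False,
    "s3://security-lake/mlflow/artifacts/../../secrets": False,
    "s3://security-lake/mlflow/artifacts/%2e%2e/secrets": False,
    "s3://security-lake-evil/mlflow/artifacts/x": False,
}
for uri, expected in checks.items():
    print(uri, is_allowed_source(uri), expected)
```

The design choice is deny-by-default: anything that is not positively recognised is rejected, so a new URI scheme or encoding trick fails closed rather than open.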

What MLflow doesn't solve

Hunt reproducibility needs more than experiment tracking.

MLflow tracks the run, but it does not freeze the dataset the run ran against. If the underlying CloudTrail data changes between one Tuesday and the next (new events arrive, old events age out of hot storage), re-executing the run will produce different numbers, and MLflow will dutifully log them without telling you the data drifted.
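A cheap stopgap until real dataset versioning is in place is to fingerprint what the hunt actually saw and record that alongside the run. This is a minimal sketch under assumptions: `dataset_fingerprint` is a hypothetical helper, the rows are made up, and it hashes the CloudTrail `eventID` values of a result set, which only works when that set fits in memory. If the table is Iceberg, recording the snapshot ID serves a similar purpose.

```python
import hashlib


def dataset_fingerprint(rows: list[dict], key: str = "eventID") -> dict:
    """Summarise a result set so two executions can be compared for drift."""
    # Sort first so row order does not change the fingerprint.
    ids = sorted(str(row[key]) for row in rows)
    digest = hashlib.sha256("\n".join(ids).encode("utf-8")).hexdigest()
    return {"row_count": len(ids), "sha256": digest}


# Hypothetical result sets one week apart.
tuesday = [{"eventID": "a1"}, {"eventID": "b2"}, {"eventID": "c3"}]
next_tuesday = [{"eventID": "b2"}, {"eventID": "c3"}, {"eventID": "d4"}]  # a1 aged out, d4 arrived
same_reordered = [{"eventID": "c3"}, {"eventID": "a1"}, {"eventID": "b2"}]

fp_a = dataset_fingerprint(tuesday)
fp_b = dataset_fingerprint(next_tuesday)
drifted = fp_a != fp_b

print("Tuesday:     ", fp_a["row_count"], fp_a["sha256"][:12])
print("Next Tuesday:", fp_b["row_count"], fp_b["sha256"][:12])
print("Data drifted:", drifted)
```

Logging the two values as run parameters (for example with `mlflow.log_param`) means a later re-run with a different hash is flagged as a data change rather than mistaken for a change in the detection logic. Note that the row count alone would have missed the drift here, since both sets contain three events.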

The two pieces that complete the HMM3 toolkit are dataset versioning (so the data the hunt ran against can be reproduced) and data-quality validation (so silent schema changes don't break every detection at once). DVC and Great Expectations are the open-source options I'd reach for first; I'll cover both in a follow-up post focused on dataset reproducibility and the ways schema drift can cause detections to fail without anyone noticing.

The shortest version: do Jupyter first, add MLflow when you have more variants than your spreadsheet can track, and add dataset versioning when reproducibility across time windows becomes a stated requirement (audit, retro investigation, compliance review). Don't try to adopt all four at once. The combined learning curve is real, and analysts who try to do everything at once tend to bounce off all of it.

Patterns that fail

Four ways notebook adoption tends to go sideways.

I've watched a number of SOCs try to roll out notebook-based hunting and stall. The failure modes cluster into a handful of patterns that are worth naming, because they're avoidable once you know to look for them.

The shared-notebook-server trap. Some teams stand up a multi-user JupyterHub deployment as the first step, because IT culture defaults to "shared infrastructure." This usually slows adoption rather than accelerating it. The friction of arguing with the SRE team about authentication, the SSO integration, the storage quota, and the kernel-environment management consumes the window where analysts would otherwise be writing their first hunt notebook. Start with analysts running JupyterLab locally on their laptops against a development data slice. Move to shared infrastructure only once people are demanding it because the workloads outgrew a laptop.

Notebooks as throwaway scratch. The opposite failure mode is treating notebooks as disposable. An analyst opens a fresh notebook, writes a hunt, finds something, closes the laptop. The notebook never enters version control. Six months later it's gone, the detection rule it produced is in production with no audit trail, and the next analyst has to re-derive the same logic from scratch. The fix is procedural: every hunt notebook gets committed to a Git repository before the hunt is considered finished. The repository can be the same one that holds detection rules. Treat the notebook as a first-class deliverable, not a side artifact.

Hidden state. Notebooks let cells execute out of order, which means the on-screen state can diverge from the state you'd get by running the cells top to bottom. Six months later, a teammate opens the file, runs every cell sequentially, and gets a different result. The discipline is boring but effective: before declaring a notebook "done," restart the kernel and run all cells in order. If it doesn't reproduce, fix it before checking it in. JupyterLab has a single menu command for this, "Restart Kernel and Run All Cells."
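A lightweight pre-commit check can catch the most obvious symptom of hidden state. The sketch below reads the notebook's JSON and confirms that the non-empty code cells carry execution counts 1, 2, 3 and so on from top to bottom, which is what a fresh restart-and-run-all leaves behind. The function name `executed_in_order` and the two sample notebooks are hypothetical, and passing this check is a necessary sign rather than proof: only re-executing the notebook shows that it reproduces.

```python
import json


def executed_in_order(notebook: dict) -> bool:
    """True when non-empty code cells are numbered 1..n from top to bottom."""
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code" and cell.get("source")
    ]
    return counts == list(range(1, len(counts) + 1))


# Hypothetical notebook fragments, trimmed to the fields this check reads.
clean = json.loads("""
{"cells": [
  {"cell_type": "markdown", "source": "## Hypothesis"},
  {"cell_type": "code", "execution_count": 1, "source": "df = load()"},
  {"cell_type": "code", "execution_count": 2, "source": "df = df[mask]"}
]}
""")
stale = json.loads("""
{"cells": [
  {"cell_type": "code", "execution_count": 5, "source": "df = load()"},
  {"cell_type": "code", "execution_count": 3, "source": "df = df[mask]"}
]}
""")

print(executed_in_order(clean), executed_in_order(stale))
# prints True False
```

For a real file, load it with `json.load` on the `.ipynb` path and fail the commit when the function returns False.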

MLflow as another spreadsheet. MLflow's value compounds when teams agree on metric names and parameter names. If one analyst logs precision and another logs prec and a third logs positive_predictive_value, the sortable table the UI is supposed to produce becomes three sortable tables that can't be compared. A short style guide (roughly a one-pager listing the canonical names for true positives, false positives, precision, estimated recall, dataset version, timeframe) pays for itself within the first month. It's a small investment that prevents a slow drift into MLflow-as-spreadsheet.
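The style guide is easier to follow when it is also code. Here is a minimal sketch of that idea, with assumptions labeled: the canonical names mirror the metrics used earlier in this post, while the alias table and the `normalize_metrics` helper are hypothetical and would be whatever your team agrees on.

```python
# Canonical metric names, matching the ones logged earlier in this post.
CANONICAL = {"true_positives", "false_positives", "precision", "recall_estimated"}

# Assumption: aliases your team has seen in the wild.
ALIASES = {
    "prec": "precision",
    "positive_predictive_value": "precision",
    "tp": "true_positives",
    "fp": "false_positives",
}


def normalize_metrics(metrics: dict[str, float]) -> dict[str, float]:
    """Map aliases to canonical names and reject anything unrecognised."""
    normalized: dict[str, float] = {}
    for name, value in metrics.items():
        canonical = ALIASES.get(name, name)
        if canonical not in CANONICAL:
            raise ValueError(f"Unknown metric name: {name!r}")
        if canonical in normalized:
            raise ValueError(f"Duplicate metric after normalization: {canonical!r}")
        normalized[canonical] = value
    return normalized


clean_metrics = normalize_metrics({"prec": 0.20, "tp": 12, "false_positives": 47})
print(clean_metrics)
# prints {'precision': 0.2, 'true_positives': 12, 'false_positives': 47}
```

Passing the result to `mlflow.log_metrics` instead of calling `log_metric` with free-form names means a typo fails loudly at logging time rather than quietly creating a new column in the UI.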

None of these failures kills adoption outright, but they produce a slow-motion stall where the tools are nominally in use while the team isn't realizing the benefits. The signal that a team has crossed into actual HMM3 work is concrete: a new analyst joins and can answer "why does this detection exist?" by reading the notebook that produced it rather than asking the analyst who wrote it. Until you can pass that test, the work isn't done.

The HMM3 to HMM4 boundary

MLOps tooling sets a ceiling on automation. That's by design.

I want to be explicit about what this toolkit does and doesn't enable. Jupyter plus MLflow (plus, eventually, dataset versioning and data-quality validation) are the foundation for the kind of automation that mature SOCs achieve. That automation tops out around the thirty-to-forty percent of investigation steps that Expel reports for its Ruxie pipeline, while the louder agent-native claims in the high nineties are still short of production validation. I work through why the ceiling sits there, and why the agent-native numbers don't yet hold up, in the agentic-SOC reality piece; the detection-maturity essay puts the same ceiling in HMM terms.

For teams at HMM3 today, the practical guidance is unchanged: start with notebooks, add MLflow when you find yourself comparing variants, and add dataset versioning when reproducibility crosses a quarterly boundary. Agents may or may not change the picture, but what's clear is that an HMM3 team without a notebook habit cannot leap directly to agent-orchestrated investigation, so the intermediate work remains the only path validated today.

Where to start tomorrow

Open Jupyter. Document your next hunt. That is the entire first step.

The minimum viable adoption looks like this. Pick the next hunt on your queue. Before you open the SIEM, open JupyterLab, write the hypothesis as a Markdown cell, write the SQL as a code cell, and run it. Filter the results in pandas in a separate cell so the rationale for the filter has a place to live, then write up the findings in another Markdown cell. Save the notebook into a Git repository alongside your detection rules. At that point you've produced the most reproducible artifact your hunt program has ever generated.
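If a blank notebook feels like friction, a small script can stamp out the skeleton for you. The sketch below builds a notebook in the version 4 JSON format with the cells described above already in place. The helper `hunt_notebook`, the section wording, and the sample hypothesis and query are illustrative assumptions; adjust them to your own template.

```python
import json


def hunt_notebook(hypothesis: str, sql: str) -> dict:
    """Build a hunt notebook skeleton as a version 4 notebook dictionary."""

    def markdown(text: str) -> dict:
        return {"cell_type": "markdown", "metadata": {}, "source": text}

    def code(text: str) -> dict:
        return {"cell_type": "code", "metadata": {}, "execution_count": None,
                "outputs": [], "source": text}

    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            markdown(f"# Hunt Campaign\n\n## Hypothesis\n{hypothesis}"),
            # Assumes the SQL itself contains no triple quotes.
            code(f'query_1 = """\n{sql}\n"""'),
            markdown("## Filters\nExplain each filter and why it exists."),
            code("# pandas filtering goes here"),
            markdown("## Findings\nRecord true-positive and false-positive patterns."),
        ],
    }


notebook = hunt_notebook(
    hypothesis="Adversaries are using compromised cloud credentials to disable logging.",
    sql="SELECT eventTime, eventName FROM cloudtrail WHERE eventName = 'StopLogging'",
)
serialized = json.dumps(notebook, indent=1)
print(serialized[:80])
```

Writing `serialized` to a file with an `.ipynb` extension inside the detection-rules repository gives the analyst a notebook that already has a place for the hypothesis, the filter rationale, and the findings.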

Do that three or four times before reaching for MLflow. The point of MLflow is to compare variants, and you won't feel the need to compare variants until you've already started generating them. Adding MLflow before that point usually produces empty-experiment fatigue, clicking through a UI that has nothing useful to show.

Once the notebooks become normal and the variant-comparison problem becomes real, install MLflow locally, instrument one hunt with start_run, and let the value show itself. The conceptual model is small enough to learn in an afternoon. Everything that comes after that (shared backends, dataset versioning, data-quality validation, eventual agent integration) is incremental on top of work that already exists. None of it is incremental on top of nothing, which is why the notebook step is the one that matters.