Security Data Works

Project 3 · Research territory

MLOps-enabled, model-driven threat hunting.

Models surface anomalies and prioritize hunts; MLOps manages the model lifecycle — training, drift detection, retraining, evaluation — with the same discipline DetectFlow applies to detection content. The point isn't "AI in the SOC." It's treating models as production artifacts with the discipline that detection content has earned, in an industry that has earned a healthy skepticism about both. This page names what isn't yet known more directly than the other two.

The framing

Why this is research territory rather than a productized service.

The current AI-in-the-SOC vendor wave overwhelmingly fails the empirical-skepticism test the thesis runs every other claim through. "Autonomous SOC analyst." "AI-driven triage." "Self-healing detection." The marketing pattern is the AI-augmented side of the AI-augmented-versus-AI-native distinction captured on the research surface — chatbots layered on top of the same BI-era infrastructure, marketed as if the chatbot changed the data platform underneath. The infrastructure doesn't yet carry agent-scale workloads cleanly. The model accuracy claims don't survive production validation. The "AI talent reduction" promises shrink from 60–80% to 40–50% once the human-oversight workload is added back in. The category isn't ready, and pretending otherwise would compromise every other recommendation I make.

At the same time, the underlying argument has merit. Models do surface anomalies that rules don't catch. Models do prioritize hunts in ways that human analysts can't scale to. At a mature SOC, analyst time is the hunting program's binding constraint, and the model-prioritized hunt is a real productivity lever — when it works. That "when it works" is carrying a lot of weight.

My position: this is the right direction; the field hasn't yet produced the evidence to operationalize it; I track the territory as research and ship thought-leadership rather than commercial services until the foundation work has produced reference clients and the surrounding evidence has caught up. The MLOps-enabled hunting category becomes a Phase 2 service line after that — not on a fixed timeline, but on a fixed evidence threshold.

The discipline

Models as production artifacts. The lifecycle is the work.

The foundation of "MLOps for hunting" isn't novel — MLOps as a discipline is mature in adjacent fields. Reproducible training pipelines. Lineage-traceable features. Versioned model artifacts. Drift detection against production traffic. Retraining on a defined cadence with documented evaluation. Rollback through version control. The disciplines exist; the application to security-data hunting is what's underspecified.
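To make the list concrete, a minimal sketch of what "versioned model artifact" means in practice — every field name here is illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelArtifact:
    """Illustrative metadata a production hunting model carries.
    Each lifecycle item above becomes a recorded, queryable fact."""
    model_version: str          # versioned artifact; the rollback target
    training_snapshot_id: int   # lakehouse snapshot the model was trained on
    ocsf_schema_version: str    # normalization schema of the training data
    feature_sources: tuple      # lineage: upstream sources that fed the features
    eval_suite_id: str          # evaluation suite this version passed
    retrain_cadence_days: int   # defined retraining cadence
```

The point isn't the schema; it's that every lifecycle question — which data, which schema, which evaluation — has a recorded answer rather than tribal knowledge.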

Training data lineage.

Models are no better than the training data they're built on. Hunting models trained on historical telemetry need to know exactly which telemetry, in which schema version, with which OCSF normalization applied. The Iceberg lakehouse gives this for free — time-travel queries reconstruct the exact data state a model was trained on, and lineage tracking back through the pipeline shows which sources fed it. Without that surface, model retraining drifts silently across schema changes, and the model's behavior in production diverges from its training-time evaluation in ways nobody can debug.
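A minimal sketch of the lineage hook using PyIceberg, assuming a catalog named lakehouse and a table security.ocsf_events — both names hypothetical:

```python
from pyiceberg.catalog import load_catalog

# Hypothetical catalog and table names; the pattern is what matters.
catalog = load_catalog("lakehouse")
events = catalog.load_table("security.ocsf_events")

# At training time: pin the snapshot ID and store it with the model artifact.
snapshot_id = events.current_snapshot().snapshot_id

# Later: replay the exact table state the model was trained on.
training_state = events.scan(snapshot_id=snapshot_id).to_arrow()
```

Storing the snapshot ID alongside the model artifact is the whole trick; time travel does the rest.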

Drift detection.

Production telemetry doesn't stay shaped like the training set. New cloud services land. New attack patterns surface. EDR vendors update their event schemas. Without active drift detection, the model's accuracy degrades across all of these without anyone noticing. The MLOps discipline runs continuous statistical comparison between the production telemetry distribution and the training distribution, with alerting when the divergence crosses defined thresholds. The alert says "this model is now operating outside its training envelope; treat its output with caution until retraining lands."
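A minimal sketch of the per-feature comparison, here using a two-sample Kolmogorov–Smirnov test — one of several reasonable drift statistics, with an illustrative threshold:

```python
import numpy as np
from scipy.stats import ks_2samp

P_VALUE_FLOOR = 0.01  # illustrative; real thresholds are tuned per feature

def drifted_features(train: dict[str, np.ndarray],
                     prod: dict[str, np.ndarray]) -> list[str]:
    """Two-sample KS test per feature: which production distributions
    have diverged from the training distributions?"""
    return [name for name in train
            if ks_2samp(train[name], prod[name]).pvalue < P_VALUE_FLOOR]

def check_envelope(train, prod, alert) -> None:
    """Alert when any feature drifts past threshold: the model is now
    operating outside its training envelope."""
    drifted = drifted_features(train, prod)
    if drifted:
        alert(f"drift detected; treat model output with caution: {drifted}")
```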

Evaluation discipline.

Every model carries a defined evaluation suite — known-bad samples it should flag, known-good samples it should not flag, and a measurable false-positive rate against unlabeled production data. Promotion to production requires the evaluation suite to pass; demotion is automatic if production performance drifts below threshold. The same discipline DetectFlow applies to detection content, applied to model output. The mechanism that prevents "the model worked in the notebook; we have no idea what it's doing in production."
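A sketch of the promotion gate, with illustrative thresholds — the real values come from the program's risk tolerance:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    known_bad_recall: float  # fraction of known-bad samples the model flagged
    known_good_fpr: float    # fraction of known-good samples wrongly flagged
    production_fpr: float    # measured FP rate on sampled production data

# Illustrative gates; real values come from the program's risk tolerance.
MIN_RECALL, MAX_GOOD_FPR, MAX_PROD_FPR = 0.95, 0.02, 0.05

def passes_gate(r: EvalResult) -> bool:
    """Promotion to production requires every gate to pass."""
    return (r.known_bad_recall >= MIN_RECALL
            and r.known_good_fpr <= MAX_GOOD_FPR
            and r.production_fpr <= MAX_PROD_FPR)
```

The same predicate, run continuously against production performance, is what makes demotion automatic rather than a judgment call.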

Why most current attempts fail

Three patterns that recur across the AI-in-the-SOC failure cases.

Notebook research, not production engineering.

A data scientist trains a model in a Jupyter notebook against a curated subset of historical telemetry. The model performs well on the curated subset. It gets shipped to production with no MLOps infrastructure underneath — no drift detection, no retraining cadence, no evaluation suite. Six months later the model's output is noisy in ways nobody can characterize, the analysts have learned to ignore it, and the program quietly admits the experiment failed. The model wasn't bad; the operationalization wasn't there.

Models that surface anomalies the analysts can't act on.

Anomaly detection without context is alert fatigue dressed up as insight. If the model surfaces a behavioral pattern that's statistically rare but the analyst can't reasonably determine whether it's malicious, the model has produced workload rather than insight. The mature pattern: models that surface anomalies with the supporting context graph attached, so the analyst can triage in seconds rather than start a five-tab investigation. The supporting context comes from the data-health foundation — entity resolution, asset ownership, recent activity. Without that foundation, the model output is noise.
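A sketch of the shape of that contract, assuming a hypothetical context_store backed by the entity-resolution and asset-inventory data:

```python
from dataclasses import dataclass, field

@dataclass
class ContextualAnomaly:
    """An anomaly the analyst can triage in seconds: the model score plus
    the supporting context from the data-health foundation."""
    entity: str                   # resolved entity, from entity resolution
    score: float                  # model anomaly score
    asset_owner: str              # ownership, from the asset inventory
    recent_activity: list = field(default_factory=list)  # related events

def surface(entity: str, score: float, context_store):
    """Refuse to surface context-free anomalies: without the foundation,
    model output is workload, not insight."""
    ctx = context_store.lookup(entity)  # hypothetical context-graph lookup
    if ctx is None:
        return None
    return ContextualAnomaly(entity, score, ctx.owner, ctx.recent_events)
```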

Treating model output as authoritative.

The "autonomous SOC" framing assumes model output can drive action without human review. The empirical evidence — including the AI-asymmetry hypothesis tracked on the research page — does not support that framing for security-critical decisions in 2026. Models hallucinate. Models exhibit goal misalignment under adversarial pressure. Models encode biases from training data that are invisible until they fire on production data and produce false positives at category-level rates. Human-in-the-loop oversight is architectural, not transitional. Programs that try to skip it produce expensive incidents that get retroactively rebadged as "model tuning issues" rather than recognized as the architecture decision they actually were.

When the work becomes scoped

The conditions that have to be in place first.

The MLOps-hunting service line gets built when three conditions stack: the data-health foundation is in place, so model output arrives with the entity and asset context that makes it triageable; the lakehouse data platform is carrying production telemetry, so training lineage and snapshot reconstruction are available; and the DetectFlow lifecycle discipline is running in production, so the model lifecycle has a proven operational template to extend.

For shops where these conditions aren't yet stacked, the right work is the foundation first, then DetectFlow on the data platform. The MLOps-hunting work waits until the conditions land naturally — usually 12–24 months after the foundation engagement, sometimes longer.

The current track

Thought-leadership now. Service line when the evidence catches up.

The current posture: track the territory as research, publish what's defensible, hold off on selling a commercial service until the foundation engagements have produced reference clients and the surrounding evidence has caught up. The thought-leadership surface lives across three artifacts: the AI-native-versus-augmented research tracking the infrastructure rebuild signals, the MCP-for-data-engineering essay tracking the AI-generated-infrastructure pattern, and the long-form treatment in the forthcoming MOAR book where one chapter covers the model-as-production-artifact discipline applied to security data.

The lab roadmap covers the testable parts. Q1–Q4 2026 work on AI-generated parser accuracy (Tenzir's MCP claims), production query patterns under agent workloads, and OCSF mapping accuracy on real EDR and cloud logs all feed the underlying evidence base. When the lab work produces reproducible numbers, the thesis updates and the service-line scoping conversation becomes operationally grounded rather than speculative.

For prospects whose current programs are running into the binding constraints this project addresses — analyst time as the limit, hunting backlog dominated by prioritization decisions, willingness to invest in model lifecycle discipline — I can run a focused workshop or advisory engagement scoping the model lifecycle posture, the evaluation framework, and the MLOps tooling decisions. That scoping work is the only commercial vehicle for this project today; the productized service line comes later. Selling Phase 2 as if it were Phase 1 is exactly the failure mode the empirical-skepticism method is meant to prevent.

The right project, at the wrong time, is still the wrong project.

The foundation gates everything; MOAR is the data platform; DetectFlow is the discipline that makes production model deployment operationally tractable. MLOps-hunting is where the program ships next.