MLOps for security

Feature stores for security: when Feast and Kedro pay off, and when they don't.

Feature stores solve a real problem in ML platforms: multiple models reading the same raw data and defining the same derived signal three different ways. In security, that problem is real but usually hasn't arrived yet, because most SOC teams I talk to are running zero supervised ML models in production rather than the eighty that would make the problem bite. So the honest framing is that most security teams don't need a feature store. The ones that do are running supervised ML for triage or hunting at the upper end of the SANS Hunting Maturity Model, and public production evidence from that corner is still thin.

Reading time: about 20 minutes. Evidence tier: B for Feast and Kedro as tools (Linux Foundation AI & Data governance, mature ecosystem documentation, fraud-detection production deployments). C to D for security-specific application. As of early 2026, I could not find a single publicly documented production deployment of Feast inside a SOC or detection-engineering team. Treat the security patterns here as architectural hypotheses, not validated practice.

The honest framing

Most security teams don't need this yet.

I want to put the recommendation first, because the rest of this essay covers tools you may not need. SANS research puts roughly 8% of organizations at HMM4, "Leading" maturity, where systematic automation, ML pipelines, and reproducible workflows are the working pattern. The other 92% sit below that, and the teams most likely to be weighing a feature store are at HMM2 (procedural hunting on community playbooks) or HMM3 (innovative analytics on home-grown queries). At HMM2 and HMM3, the right toolchain is Jupyter for hunt prototyping and MLflow for experiment tracking, because a feature store at that stage is mostly overhead you pay for before there's enough model count to give it something to do.

The clearest signal you need a feature store is operational rather than architectural. If your detection team maintains more than ten or fifteen supervised ML models and the same derived features (failed_logins_24h, login_geo_diversity_score, rare_process_count, parent_process_chain_depth) are being extracted independently by each model, with slightly different windowing logic in each, that's when the duplication and inconsistency cost starts to outweigh the platform cost. Below that threshold, a shared SQL view or a dbt model probably does the job.

The second signal is trust erosion. An analyst investigating an alert from Model A runs an ad-hoc query and gets a different number from the one the alert reports, because Model A counts a rolling 24-hour window and the query counts the calendar day. After a few of those, the analyst stops trusting the alerts. That's the feature-definition-drift problem, and a feature store is one solution to it, though not the only one.

Vocabulary

What a feature store actually is.

Three pieces of vocabulary do most of the work in this conversation. A feature is a derived numeric or categorical signal computed from raw data: failed_logins_24h is a feature, login_geo_diversity_score is a feature, rare_process_count is a feature. The raw authentication log line is not a feature; the count of failures in a 24-hour window for a given user is.

A feature store is a system that centralizes the definition, computation, and serving of those features, so there's one definition and one computation served consistently to every model that needs it. The store has two faces. The offline store holds historical features for training (typically Iceberg tables, Snowflake, BigQuery). The online store holds the latest precomputed values for real-time inference at low latency (typically Redis or DynamoDB), because production detection cannot wait several seconds for a feature extraction query to run against raw logs.

The third piece is point-in-time correctness. This one is subtle and matters more than people realize. When I train a model to predict whether a user session was compromised at 09:00 on Tuesday, I can only use features whose values would have been computable at 09:00 on Tuesday. If I accidentally let the training set see failed_logins values that were aggregated as of 17:00 the same day, the model "cheats." It has access to future data that won't exist at inference time. The production model will then perform far worse than the training metrics promised. A feature store handles this correctly by joining historical entity-and-timestamp rows against the feature value as of that timestamp. Hand-rolled feature pipelines almost always get this subtly wrong on the first few iterations.

Point-in-time correctness is one of the strongest standalone arguments for using a feature store rather than rolling your own. The SQL isn't impossible to write; it's easy to write incorrectly, and the failure mode (inflated training metrics, underperforming production model) is hard to debug after the fact.
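To make the leak concrete, here is a minimal sketch of an as-of lookup in plain Python, with no feature-store library involved. The function names, the snapshot layout, and the sample values are all assumptions made up for illustration, and this is not how Feast implements it. It only shows the rule: a training row may see the latest feature value computed at or before its own timestamp, and nothing later.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history: per user, (computed_at, failed_logins_24h)
# snapshots, sorted by computed_at.
FEATURE_HISTORY = {
    "alice": [
        (datetime(2026, 1, 13, 8, 0), 2),
        (datetime(2026, 1, 13, 12, 0), 9),
        (datetime(2026, 1, 13, 17, 0), 31),
    ],
}


def value_as_of(history, user_id, event_time):
    """Return the latest snapshot value computed at or before event_time."""
    snapshots = history.get(user_id, [])
    times = [computed_at for computed_at, _ in snapshots]
    idx = bisect_right(times, event_time)
    if idx == 0:
        return None  # no value existed yet at event_time
    return snapshots[idx - 1][1]


def latest_value(history, user_id):
    """The leaky alternative: ignore event_time and take the newest value."""
    snapshots = history.get(user_id, [])
    return snapshots[-1][1] if snapshots else None


label_time = datetime(2026, 1, 13, 9, 0)  # the 09:00 Tuesday training row
correct = value_as_of(FEATURE_HISTORY, "alice", label_time)
leaky = latest_value(FEATURE_HISTORY, "alice")
print(correct, leaky)
```

The correct lookup returns the 08:00 snapshot, while the leaky one hands the training row the 17:00 aggregate that would not have existed at inference time.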

The duplication problem

Eighty models, eighty definitions of "failed logins in 24 hours."

The motivating scenario is useful to picture. Imagine a detection engineering team maintaining eighty supervised ML models (credential stuffing, account compromise, brute force, impossible travel, suspicious privilege escalation, rare process execution, beaconing, exfiltration, a long tail of variants). Each model needs a handful of derived features, and many of those features overlap across the eighty.

One detection counts failed logins in a rolling 24-hour window from the current time. Another counts failed logins since midnight today (calendar day). A third counts failed logins in the last 1,440 minutes, which is functionally equivalent to the rolling window but written by a different engineer in different code. All three call the result failed_logins_24h. The first time an analyst investigates an alert and finds the model's reported count disagrees with their ad-hoc query, the trust erosion starts.
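The disagreement is easy to reproduce. The sketch below applies the three definitions to the same handful of timestamps; the function names and the sample events are hypothetical, chosen only to show that two of the three agree and the calendar-day version does not.

```python
from datetime import datetime, timedelta

# Hypothetical failed-login timestamps for one user.
FAILED_LOGINS = [
    datetime(2026, 1, 12, 14, 30),
    datetime(2026, 1, 12, 22, 15),
    datetime(2026, 1, 13, 1, 5),
    datetime(2026, 1, 13, 8, 40),
]


def rolling_24h(events, now):
    """Count events in the trailing 24 hours ending at `now`."""
    start = now - timedelta(hours=24)
    return sum(1 for ts in events if start < ts <= now)


def calendar_day(events, now):
    """Count events since midnight of the day containing `now`."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return sum(1 for ts in events if midnight <= ts <= now)


def last_1440_minutes(events, now):
    """Same window as rolling_24h, written by a different engineer."""
    start = now - timedelta(minutes=1440)
    return sum(1 for ts in events if start < ts <= now)


now = datetime(2026, 1, 13, 9, 0)
counts = {
    "rolling_24h": rolling_24h(FAILED_LOGINS, now),
    "calendar_day": calendar_day(FAILED_LOGINS, now),
    "last_1440_minutes": last_1440_minutes(FAILED_LOGINS, now),
}
print(counts)
```

At 09:00 the rolling definitions still see yesterday afternoon's failures and the calendar-day definition has already forgotten them, which is exactly the gap an analyst runs into.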

There's also a compute cost. Eighty models each running their own feature extraction against the same raw authentication logs is roughly eighty times the cluster work for one feature, repeated for every shared feature. I've seen architecture reviews where this duplication ran to a sustained 60–90× over what a single shared extraction would cost, on EMR or Databricks clusters that were running 24/7 specifically to keep these features fresh, which is the strongest dollar argument for centralizing.

I want to flag the size of the leap, though. The "eighty supervised ML models" scenario shows up in fraud detection and at the largest, most ML-mature security teams, and it is uncommon elsewhere: most of the SOC teams I see have zero models in production. The cost of duplication only matters once you have enough models for it to bite.

Feast

The open-source feature store, briefly.

Feast began at Google and Gojek, was open-sourced in 2019, and now lives under Linux Foundation AI & Data governance with an Apache 2.0 license. It supports more than fifteen offline stores (BigQuery, Snowflake, Redshift, Iceberg, DuckDB, and others) and around ten online stores (Redis, DynamoDB, Cassandra, PostgreSQL). The maturity story on the platform itself is solid, but the maturity story on security-specific deployments is the gap I keep coming back to.

Feast has four moving parts. A feature registry in S3 or Postgres holds the definitions: the YAML and Python that say what features exist and how they're computed. The offline store (Iceberg in the lakehouse architectures I work with) holds the historical feature values for training. The online store (Redis is the typical choice) holds the latest values for low-latency inference. A feature server is an optional REST API for non-Python clients to read features.

A minimum-viable feature definition in Feast looks like this:

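from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Stand-in batch source so the snippet is self-contained. A Parquet FileSource
# is an assumption here; in an Iceberg lakehouse you would swap in the data
# source class for your configured offline store. The path is hypothetical.
iceberg_source = FileSource(
    path="data/user_behavior_features.parquet",
    timestamp_field="event_timestamp",
)

# The entity is the join key that features are looked up by.
user = Entity(name="user", join_keys=["user_id"])

user_behavior_fv = FeatureView(
    name="user_behavior_features",
    entities=[user],
    ttl=timedelta(days=7),  # how far back a lookup may reach for a value
    schema=[
        Field(name="failed_logins_24h", dtype=Int64),
        Field(name="successful_logins_7d", dtype=Int64),
        Field(name="login_geo_diversity_score", dtype=Float32),
        Field(name="privileged_action_count", dtype=Int64),
    ],
    online=True,  # also materialize into the online store
    source=iceberg_source,
)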

Two things to notice. First, the schema definition is the single source of truth, so every model that reads user_behavior_features:failed_logins_24h reads the same value computed by the same definition. Second, online=True tells Feast to materialize the values into the online store, which is what makes the feature available for sub-50ms inference rather than only for batch training.

The training-side API is where point-in-time correctness shows up. You pass an entity dataframe of (user_id, event_timestamp) rows and the feature names you want, and Feast joins back against historical feature values as of each timestamp:

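import pandas as pd
from feast import FeatureStore

# Assumes a Feast repo in the current directory with user_behavior_features applied.
store = FeatureStore(repo_path=".")

# One row per training example; user IDs and timestamps are illustrative.
entity_df = pd.DataFrame({
    "user_id": ["u-1001", "u-1002"],
    "event_timestamp": pd.to_datetime(["2026-01-13 09:00", "2026-01-13 09:05"], utc=True),
})

# Each row is joined to the feature values as of its own event_timestamp.
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_behavior_features:failed_logins_24h",
        "user_behavior_features:login_geo_diversity_score",
        "user_behavior_features:privileged_action_count",
    ],
).to_df()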

The corresponding online call returns the latest precomputed values for a given entity in a few milliseconds from Redis, which gives you the basic shape: you define a feature once, train against the historical features and infer against the online ones, and you stop copy-pasting the same SQL into eighty different model repos.

The vendor landscape

Feast, Tecton, and the build-versus-buy line.

Feast is not the only feature store in the market. Tecton is the commercial managed offering founded by the team that built Uber's Michelangelo platform. Databricks ships a feature store as part of Unity Catalog. Hopsworks is an independent open-source option with a strong on-premises story. The build-versus-buy decision matters more in security than in fraud-detection ML, because security teams almost never have data engineering as a core skill set.

My read: if you're already a Databricks shop and you have one team that owns both the lakehouse and the detection-engineering ML work, the Unity Catalog feature store is the path of least resistance. You pay for the integration, not for re-implementing point-in-time correctness. If you're not on Databricks and you have engineering capacity to operate Redis and a scheduled materialization pipeline, Feast is the open-source default; the cost of ownership is real but the lock-in is minimal. Tecton is the right call when you need a managed offering with strong point-in-time correctness guarantees and you have budget that won't survive a build-it-yourself delay.

I have not seen a security-specific deployment of Tecton or Databricks Feature Store either. The vendor neutrality call here is essentially "pick the one that fits the surrounding stack and the team's skill profile," because there is no security-domain validation that would tip the scales for any of them yet.

Kedro

Pipeline orchestration, in the place Airflow doesn't quite fit.

Kedro is an open-source framework for organizing data science and ML pipeline code into modular nodes and pipelines. It was developed at QuantumBlack (McKinsey's analytics arm) and open-sourced under Apache 2.0. The honest one-liner is that Kedro structures data-science workflows to enforce modularity, configuration management, and reproducibility on code that would otherwise be a pile of notebooks.

The functional structure is simple. A node is a Python function with declared inputs and outputs. A pipeline is a directed acyclic graph of nodes wired together by name. A data catalog defines where each named input or output lives: an Iceberg table, a parquet file, a feature view, a Redis key. Running kedro run executes the pipeline; running kedro viz opens an interactive DAG view in the browser.

A condensed detection pipeline in Kedro looks like:

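from kedro.pipeline import Pipeline, node

from .nodes import (
    ingest_cloudtrail, normalize_ocsf, extract_features,
    load_feature_store, run_detection_models,
    filter_high_confidence, send_alerts,
)


def create_detection_pipeline() -> Pipeline:
    # Kedro orders nodes by their named inputs and outputs, not by list position.
    return Pipeline([
        node(ingest_cloudtrail, "params:cloudtrail_path", "raw_cloudtrail"),
        node(normalize_ocsf, "raw_cloudtrail", "ocsf_events"),
        node(extract_features, "ocsf_events", "user_features"),
        # The marker output (hypothetical name) makes scoring wait until the
        # feature store has been loaded; with no output, nothing enforces that.
        node(load_feature_store, "user_features", "feature_store_ready"),
        node(run_detection_models, ["ocsf_events", "feature_store_ready"], "raw_alerts"),
        node(filter_high_confidence, "raw_alerts", "validated_alerts"),
        # Terminal node: sends alerts as a side effect and produces no dataset.
        node(send_alerts, "validated_alerts", None),
    ])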

What Kedro buys you, specifically: the OCSF normalization step can be changed without re-testing the feature extraction step. The data catalog tracks lineage automatically, so when an alert fires, you can walk backwards through the pipeline and see exactly which input dataset, which feature materialization, and which model version produced it. Each node is unit-testable in isolation. That last point is the one that matters most for security: detection-engineering code that's been through a Kedro refactor tends to have meaningfully higher test coverage than the equivalent notebook-and-cron sprawl, and the production deployment story improves with it.
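Because a node is just a function, testing one in isolation needs no Kedro machinery at all. The sketch below is a guess at what a filter_high_confidence node might look like, plus two plain tests for it. The function body, the alert shape (a dict with a score key), and the 0.9 threshold are my assumptions, not anything taken from a real pipeline.

```python
def filter_high_confidence(raw_alerts, threshold=0.9):
    """Keep alerts whose model score meets the threshold.

    Hypothetical node body: assumes each alert is a dict with a 'score' key.
    """
    return [alert for alert in raw_alerts if alert["score"] >= threshold]


def test_keeps_only_alerts_at_or_above_threshold():
    raw = [
        {"id": "a1", "score": 0.95},
        {"id": "a2", "score": 0.40},
        {"id": "a3", "score": 0.90},  # exactly at the threshold: kept
    ]
    kept = filter_high_confidence(raw, threshold=0.9)
    assert [alert["id"] for alert in kept] == ["a1", "a3"]


def test_empty_input_yields_empty_output():
    assert filter_high_confidence([]) == []
```

A test runner such as pytest would pick these up by name. The same function can then be wired into the pipeline unchanged, which is the property that notebook-and-cron code usually lacks.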

The natural question is "why not use Airflow." Airflow is the dominant orchestrator for batch ETL and it deserves to be. The place Airflow does not fit cleanly is the ML-development inner loop: the iteration cycle where a data scientist is changing the feature logic, retraining, evaluating, and redeploying multiple times a day. Kedro is built around that inner loop. The typical mature pattern is to develop the pipeline in Kedro and deploy it to Airflow (or AWS Step Functions, or Kubeflow) for scheduled production execution, so the two end up complementing each other rather than competing.

Kedro adoption in security specifically is rare but not nonexistent. Two security data engineering teams I've talked to (financial services and technology, both anonymized) migrated detection pipeline code from manual Python scripts to Kedro and reported faster deployment, higher test coverage, and better lineage for incident-response forensics. Both also flagged the same thing: the learning curve is real and the data-catalog concept is the part that bounces people, so a team that commits without engineering buy-in tends to abandon the Kedro work partway through.

OCSF and the feature schema

Normalize once, define features once.

One argument for feature stores in security that I do find genuinely persuasive is the interaction with OCSF (Open Cybersecurity Schema Framework). If your authentication, network, and process telemetry are normalized to OCSF at ingestion, your feature definitions stop being vendor-specific. A failed_logins_24h feature is "count where class_uid = 3002 (Authentication) and status_id = 2 (Failure) in the trailing 24-hour window," which is one definition that works across CloudTrail, Azure AD, Okta, Windows Event Logs, and any other authentication source you normalize.
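As a sketch of that single definition, here it is in plain Python over a list of event dicts. The class_uid and status_id values come from the definition above. The flattened event shape is my simplification: I am assuming time is epoch milliseconds and using a hypothetical user_uid key in place of the nested user object, so treat the field layout as illustrative rather than schema-exact.

```python
from datetime import datetime, timedelta, timezone

AUTHENTICATION_CLASS_UID = 3002  # OCSF Authentication class
STATUS_FAILURE = 2               # OCSF status_id for Failure


def to_ms(dt):
    """Convert an aware datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)


def failed_logins_24h(events, user_id, now):
    """Count failed authentication events for user_id in the trailing 24 hours.

    Assumes simplified, flattened events; 'user_uid' is a hypothetical key.
    """
    start_ms = to_ms(now - timedelta(hours=24))
    now_ms = to_ms(now)
    return sum(
        1
        for e in events
        if e["class_uid"] == AUTHENTICATION_CLASS_UID
        and e["status_id"] == STATUS_FAILURE
        and e["user_uid"] == user_id
        and start_ms < e["time"] <= now_ms
    )


now = datetime(2026, 1, 13, 9, 0, tzinfo=timezone.utc)
events = [
    # Same shape whether the source was CloudTrail, Okta, or Windows logs.
    {"class_uid": 3002, "status_id": 2, "user_uid": "alice", "time": to_ms(now - timedelta(hours=1))},
    {"class_uid": 3002, "status_id": 1, "user_uid": "alice", "time": to_ms(now - timedelta(hours=2))},   # success
    {"class_uid": 3002, "status_id": 2, "user_uid": "alice", "time": to_ms(now - timedelta(hours=30))},  # too old
    {"class_uid": 3002, "status_id": 2, "user_uid": "bob", "time": to_ms(now - timedelta(hours=3))},
    {"class_uid": 1007, "status_id": 2, "user_uid": "alice", "time": to_ms(now - timedelta(hours=4))},   # other class
]
print(failed_logins_24h(events, "alice", now))
```

Nothing in the function knows which vendor produced an event, which is the whole point: the per-source work happens once, in the OCSF mapping, instead of once per feature.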

Without OCSF, the same feature needs ten different per-source implementations because each vendor names the outcome field differently and encodes "failure" with a different enum value. The practitioners I've talked to who are running OCSF-normalized ML pipelines report something like 50–70% reduction in feature engineering effort once the OCSF mapping is paid down. I want to flag the sample size on that number: two organizations, both with mature OCSF adoption and a self-selection bias toward teams who would talk to me about it. The directional claim, that OCSF collapses per-source feature work, is sound. The exact percentage is more uncertain than the number suggests.

The complementary observation is that without OCSF, a feature store is partially fighting the wrong fight. If your feature definitions are still vendor-specific underneath the store, you've centralized the symptom (duplicate code in eighty model repos) without centralizing the cause (vendor-specific schemas at ingestion). The combination that pays off is OCSF normalization in the ingestion layer together with feature-store centralization in the ML layer, because either one on its own only gets you partway.

Cost shape

What a production deployment costs to run.

The cost line is one of the places I have to be careful. I have modeled cost estimates from AWS list pricing for Redis ElastiCache, Lambda or ECS for materialization, and S3 for the offline store. I do not have measured cost data from a production security deployment, because I have not been able to find one to measure. With that caveat, the modeled envelope:

Component	Approx. monthly cost (USD, list price)	Cost driver and sizing
Online store (ElastiCache for Redis)	$50–500	r6g.large to r6g.2xlarge instances, sized to hold 10K to 100K precomputed feature rows. Scales roughly linearly with entity count.
Materialization compute	$100–500	Lambda or ECS jobs running on an hourly or daily cadence over 1M to 100M source rows. Cost is dominated by materialization frequency more than data volume.
Offline store and feature registry on S3	$50–200	100GB to 1TB of Parquet feature data and registry metadata.
Total envelope	$200–1,200 typical; $500–2,000 high-scale	High-scale means sub-minute refresh and hundreds of millions of features. I would size at the high end for any production security deployment until measured data argues otherwise.

These numbers are list-price and single-region; reserved-instance discounts and committed-use agreements compress them noticeably. The reason I quote a wide range is that I would rather over-communicate uncertainty than under-communicate it on a number this central to budgeting. Run the spreadsheet for your specific scale before quoting these to a CFO.
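The typical-deployment envelope is just the sum of the three component ranges, and it is worth being able to rerun that sum with your own figures. A minimal sketch, with the modeled ranges from the table hard-coded and the dictionary keys being my own labels:

```python
# Modeled monthly ranges in USD at list price, copied from the table above.
COMPONENTS = {
    "online_store": (50, 500),
    "materialization_compute": (100, 500),
    "offline_store_and_registry": (50, 200),
}


def envelope(components):
    """Sum the low ends and the high ends of each component range."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


low, high = envelope(COMPONENTS)
print(f"${low:,} to ${high:,} per month")
```

Replacing any tuple with a quoted or measured figure for your environment gives a revised envelope without touching the rest.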

The evidence gap

Zero publicly documented production SOC deployments.

This is the part I want to be most direct about. As of early 2026, I could not find a single publicly documented production deployment of Feast inside a SOC or detection-engineering team. I searched the Feast GitHub issues for the "security" tag (eighteen results, all fraud detection, no SOC use cases). I searched the Feast Slack community (around 15,000 members, three threads mentioning security, all exploratory). I searched LinkedIn for profiles listing both Feast and security experience (eight results, six in fraud detection, two in cybersecurity roles both marked as proof-of-concept or exploration).

My best estimate is that fewer than 5% of HMM4 teams are using a feature store in production, and HMM4 itself is only 8% of organizations, which works out to under 0.4% of organizations overall. The aggregate population of "security teams running a production feature store" is small enough that the absence of public case studies isn't surprising. It is also a constraint on how strongly I can recommend this pattern.

The adjacent validation that I do trust: Feast is widely deployed in fraud detection at fintech and e-commerce, and the workload shape (real-time risk scoring on user-and-event entities, behavior analytics, point-in-time correctness for training) is similar to threat detection. The technical fit is plausible. What's missing is knowledge transfer from fintech deployment experience into security teams, where ML literacy is concentrated in a few people and the operational rhythm is different.

The code examples in this essay are illustrative patterns adapted from fraud-detection use cases. I would call them informed hypotheses rather than validated practice, so treat them accordingly. If you adopt this pattern, you are part of the first generation of security teams to do it in public. That is a defensible position, with upside (you'll have novel operational learnings worth talking about) and cost (you will not have a peer reference to call when something breaks).

MLOps vs agents

The 30–40% ceiling versus the agent-native claim.

There's a second architectural option worth naming, because some readers will ask. The Feast-plus-Kedro pattern is a human-centric MLOps architecture: automation augments analysts, models score events, analysts review and approve actions, and the proven ceiling for that shape is the roughly 30–40% of analyst time Expel reports for its Ruxie pipeline. The alternative is agent-native investigation, where AI agents orchestrate end-to-end and some vendor and academic framings put the cited numbers in the high nineties, though those figures come from controlled or marketing settings rather than production SOC deployments. I work through both numbers, and why I trust the ceiling but not the high-nineties claim, in the agentic-SOC reality piece.

My recommendation is to plan for MLOps as the proven path and treat agent-native architectures as an option to monitor rather than commit to, because the MLOps tooling is mature while the agent tooling is moving fast but still has the operational and security-of-agents questions open. So Feast and Kedro are real investments with bounded learning curves, and agent platforms in 2026 are still a research-grade bet.

Maturity gate

When to actually adopt this.

The shortest version of the maturity gate: do not adopt Feast and Kedro if your team is at HMM2 or HMM3. The order I'd recommend for a detection-engineering team that's building toward HMM4 is roughly this:

HMM2 baseline: Jupyter notebooks for hunt prototyping. Community playbooks. SQL against the lakehouse. No ML in production. This is the right place for most teams to be, and there is nothing wrong with staying here long enough to mature.
HMM3 capabilities: Add MLflow for experiment tracking. Add DVC for dataset versioning. Add Great Expectations for data quality. These three tools cover the foundational "reproducible analytics" capability that has to exist before a feature store earns its keep.
HMM4 readiness signal: Ten or more supervised ML models in production with measurably overlapping feature definitions, ML-literate detection engineering staffing (not one data scientist on loan, but a real team), and an OCSF normalization layer in the ingestion pipeline. When all three are true, a feature store starts to pay off.
HMM4 adoption: Add Feast for centralized feature definitions and Kedro for pipeline orchestration, and keep the existing MLflow / DVC / Great Expectations stack for the development and quality side. Combined learning curve is roughly 40–60 hours across the team for the full toolchain.
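The HMM4 readiness signal is an all-three-or-nothing check, which makes it easy to write down. The sketch below encodes it as a small function. The field names are mine, and the inputs are judgment calls a team has to make honestly, so this is a checklist in code form rather than a measurement.

```python
from dataclasses import dataclass


@dataclass
class TeamState:
    supervised_models_in_prod: int
    features_overlap_across_models: bool
    dedicated_ml_detection_team: bool
    ocsf_normalized_ingestion: bool


def feature_store_ready(team, min_models=10):
    """Return (ready, unmet) for the three-part readiness signal."""
    checks = {
        "ten or more models with overlapping features": (
            team.supervised_models_in_prod >= min_models
            and team.features_overlap_across_models
        ),
        "ML-literate detection engineering team": team.dedicated_ml_detection_team,
        "OCSF normalization at ingestion": team.ocsf_normalized_ingestion,
    }
    unmet = [name for name, ok in checks.items() if not ok]
    return (not unmet, unmet)


# Hypothetical HMM3 team: three models, OCSF in place, one data scientist on loan.
ready, unmet = feature_store_ready(
    TeamState(
        supervised_models_in_prod=3,
        features_overlap_across_models=False,
        dedicated_ml_detection_team=False,
        ocsf_normalized_ingestion=True,
    )
)
print(ready, unmet)
```

The useful output is the unmet list, not the boolean: it names what to build before a feature store is worth revisiting.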

The reach-for-it-too-early failure mode I see most often is a team at HMM2 adopting a feature store because the architecture diagrams look impressive, then abandoning it twelve months later because the operational cost is real and the value would only have materialized at five times their current model count. That is why the maturity gate is worth holding to.

Verification flags

What I'd ask a vendor or practitioner before committing.

I want to leave a short list of the questions I'd put to anyone selling or recommending this pattern, because the evidence gap means a buyer has to do more of the validation work than usual:

Show me one named production deployment of a feature store in a SOC or detection-engineering team. If the answer is fraud detection, that's adjacent but not the deployment. If the answer is a vendor case study without a customer name, that's marketing.
Show me the point-in-time correctness implementation. This is the place hand-rolled feature pipelines most often go subtly wrong. A vendor or team that can't walk you through how they handle it is selling you a feature store that hasn't earned the name.
Where's the OCSF normalization layer? If features are still vendor-specific underneath, the feature store is centralizing the symptom and not the cause. The combination is what pays off.
What's the operational cost of materialization? Modeled cost from list pricing is a starting point. Measured cost from a real deployment at your scale is the answer that matters.
What's the failure mode when the online store goes down? Redis is reliable but not infallible. The detection pipeline's behavior when feature reads fail (fail-open, fail-closed, fail-stale) needs to be a deliberate design decision, not something that comes up during an incident.
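On the last question, here is a minimal sketch of what making the failure mode a deliberate decision can look like. Everything in it is hypothetical: the FeatureReader class, the fetch callable standing in for a real online-store client, the fifteen-minute staleness limit, and my reading of the three modes (fail-open skips scoring, fail-closed escalates to an analyst, fail-stale reuses the last good value while it is fresh enough).

```python
import time
from enum import Enum


class FailMode(Enum):
    FAIL_OPEN = "fail-open"      # skip scoring; the event passes unscored
    FAIL_CLOSED = "fail-closed"  # treat as suspicious; escalate to an analyst
    FAIL_STALE = "fail-stale"    # reuse the last good value if fresh enough


class FeatureReader:
    """Wraps an online-store read with an explicit failure policy.

    `fetch` is any callable taking a user_id and returning a feature dict.
    """

    def __init__(self, fetch, mode, max_stale_seconds=900, clock=time.time):
        self._fetch = fetch
        self._mode = mode
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._cache = {}  # user_id -> (fetched_at, features)

    def read(self, user_id):
        """Return (features_or_None, decision) naming the path taken."""
        try:
            features = self._fetch(user_id)
        except Exception:
            return self._on_failure(user_id)
        self._cache[user_id] = (self._clock(), features)
        return features, "fresh"

    def _on_failure(self, user_id):
        if self._mode is FailMode.FAIL_STALE:
            cached = self._cache.get(user_id)
            if cached and self._clock() - cached[0] <= self._max_stale:
                return cached[1], "stale"
            return None, "escalate"  # nothing usable cached: behave as closed
        if self._mode is FailMode.FAIL_CLOSED:
            return None, "escalate"
        return None, "skip"
```

The design choice worth arguing about is the fall-through in fail-stale: when the cache is empty or too old, this sketch escalates rather than skips, on the assumption that an unscored event is worse than an extra analyst review. Your team may reasonably decide the opposite.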

None of these are reasons to dismiss feature stores. They are reasons to size expectations correctly and to commit deliberately. Feast and Kedro do production work in adjacent domains, and the security application is a sound architectural hypothesis that stays a hypothesis until more SOC teams put it into production and publish what they learn.

Conclusion

A small fraction of security teams need this. The ones who do should adopt it deliberately.

Feature stores solve a real problem at HMM4 scale: dozens of supervised ML models converging on the same handful of derived features, with definition drift and duplicate compute as the predictable cost. Feast plus Kedro is the open-source toolchain that addresses that problem end-to-end, because it puts feature definitions in one place, materializes them to both an offline training store and an online inference store, orchestrates the pipeline with lineage and modularity, and handles point-in-time correctness rather than leaving you to re-invent it.

The qualifier I keep coming back to: this is HMM4 territory, and public evidence of HMM4 security deployments using these tools is essentially absent in early 2026. The pattern is sound in fraud detection, the workload shape is similar, and the structural argument for centralization is the same one that has held on other ML platforms. Security is just earlier on the curve.

If you're at HMM2 or HMM3, stay there long enough to mature the foundations: Jupyter, MLflow, DVC, Great Expectations, OCSF normalization. If you're at HMM4 readiness, Feast and Kedro are worth the 40–60 hour combined learning curve. And if you adopt this pattern in production, please write about it publicly. The next generation of security teams trying to make this decision will thank you.