The agentic SOC
What's real vs marketed in the agentic SOC.
Every pitch for the agentic SOC skips the boring part. We talk about agents triaging alerts, writing detections, and closing tickets, and we wave past the two questions that decide whether any of it ships: who vouches for an agent the SOC didn't write, and what share of investigations automation actually closes versus what number a slide rounds up. Every SOC-automation vendor I have talked to in the last eighteen months has put a number on a slide, and the numbers cluster between 90% and 99%; the most aggressive pitch I have personally sat through claimed "near-complete tier-1 automation." Meanwhile the practitioner number with the most public daylight around it sits closer to a third. That gap, where the marketing lives, is what this essay is about.
The way I want to handle the whole topic is to treat it as one dimension of the fair-broker discipline I apply everywhere else, not as a separate AI story. A claim about agentic security data gets the same two-part demand as any vendor claim: a definition demand (what counted, against what baseline, in which scenarios) and an evidence tier. The agentic SOC isn't a new identity for the practice; it's the place where the durable questions (source health, schema portability, who can prove what an actor is allowed to do) show up wearing newer vocabulary. So this is a consolidation rather than a fresh take. I'm folding what I've written across several vendor moves (Databricks Lakewatch, MIT's NANDA, the RAPTOR demo, the security-data-pipeline vendors crowding behind Cribl) into one read, because the news in any single sweep belongs under the spine of the argument rather than on a page of its own.
Reading time: about 30 minutes. Evidence tier: B overall, with Tier A grounding on CVE volume and remediation statistics, a Tier A primary source for NANDA's architecture, and explicit Tier C and Tier D flags on vendor positioning and on my own forward-looking estimates. Security-specific deployment evidence for most of what's described here is not yet published; flagged throughout.
One disclosure
One disclosure, applied to every vendor named below.
Before any vendor name appears, the disclosure that governs the whole essay. The practice takes no reseller margins, no kickbacks, and no privileged access from any vendor discussed here. The Capability Matrix scores each vendor against the alternatives for a given workload archetype, not against itself, and if the matrix ranks an alternative higher for a workload, that is the recommendation regardless of any relationship. The fair-broker thesis applies before any partnership consideration, and it's the discipline that makes a vendor-watch read forwardable rather than a pitch.
There is one live relationship worth naming concretely, because the rest of the essay discusses the vendor in question. The practice is in GTM-partnership discussions with Databricks around Lakewatch. No reseller margins, no kickbacks, no privileged access. The matrix evaluates Lakewatch against Snowflake-with-Hunters, Cribl-augmented Splunk, ClickHouse-on-Iceberg, and StarRocks; if the matrix ranks an alternative higher for a given workload, that's the recommendation. That disclosure sits here, once, instead of as a banner repeated above every vendor section, which is the right shape for it: disclosure first, opinion second, and the same rule applied to the vendor I have a relationship with and the ones I don't.
The practitioner ceiling
The practitioner ceiling, and the sleight-of-hand that hides it.
The practitioner figure with the most public daylight around it is Expel's. Expel is a Managed Detection and Response (MDR) provider that ranks highly in analyst coverage (the Forrester Wave: Managed Detection and Response, Q1 2025, and the Gartner Market Guide for MDR Services, 2025), and its Ruxie automation system has been described in the company's Transparency Report and in conference talks by Expel's leadership. The headline metric is that roughly 30-40% of investigations are handled end-to-end by automation, with a mean time to respond under 20 minutes for critical-severity incidents, and what makes that number worth citing is that it is a production figure rather than a demo figure, one that has stayed consistent across multiple reporting periods.
The part that gets quoted less is the other side of the same number: 60-70% of investigations still need human analysts. Expel's own framing, attributed to leadership in industry talks, is that "technology must supplement human expertise, not replace it." I don't have an audited transcript of that line, so I treat it as Tier B paraphrase rather than a verbatim quote, but the underlying philosophy shows up consistently in their public material.
There is a separate Expel number that is sometimes conflated with the investigation figure, and the conflation is the core vendor move worth naming. Alert enrichment automation runs much higher, roughly 95-97% by Expel's reporting, but enrichment and investigation are different work. Enrichment is "decode this PowerShell, look up this IP, attach the ATT&CK technique," which is mostly deterministic plumbing, while investigation is "is this a real incident, what's the scope, what's the response," which requires judgment. Conflating the two by citing 95% enrichment as if it were 95% investigation is one of the most common ways the vendor pitch decks lose the plot, and it's the specific category confusion to listen for.
I want to be careful about how I criticize the 90%+ claims, because the failure mode I'm pushing back against is not lying so much as category confusion. When you trace a 90%+ automation number back, it usually decomposes into one of three things, none of which is "90% of investigations handled end-to-end without an analyst." The first is enrichment automation rebranded as investigation automation: if you measure "alerts the system touched in some way before reaching an analyst," 90% is achievable today and Expel already hits 95-97%, which is real work because it saves analyst time, but it isn't closing the investigation, and the slide that says "automation rate" doesn't say what was actually automated. The second is auto-closure of obvious false positives: a high percentage of alerts in any mature SOC are noise the detection logic should have suppressed upstream, and auto-closing those is valuable, but it's closer to "tuning we didn't bother to push back into the detection rule" than to "AI replaced the analyst." The third is demo-environment numbers generalized: a controlled environment with curated alert types, clean log sources, and known attack patterns can run very high automation rates, and the same system degrades quickly against messy logs, novel attacks, and business-context judgment calls. I have not seen a vendor publish the methodology that would let me distinguish a demo number from a production number, and that absence is the verification flag.
So the verification question is concrete. When a vendor cites 90%+, ask which of those three definitions they mean, and ask for the production-environment number broken down by enrichment versus end-to-end investigation closure, because the honest vendors can give you both while the ones who can't are quoting marketing. The SANS 2024 SOC Survey reports average SOC automation rates of 25-35% across enterprises surveyed, with no segment cluster above 50%, which corroborates the 30-40% practitioner ceiling rather than the slide-deck cluster. I treat 30-40% as the publicly-documented practitioner ceiling for end-to-end investigation automation by a high-performing MDR, not as a hard physical ceiling, but as the highest credible number I can point to in public reporting, stable long enough that I put the burden of proof on anyone claiming materially higher. This essay is the single owner of that number; everything else here cross-references it rather than re-deriving it.
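To make the breakdown request concrete, here is a minimal sketch of the three separate rates I would ask a vendor to report against the same denominator. The record fields and the sample data are hypothetical and synthetic; nothing here reflects Expel's or any vendor's actual schema or figures.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    """One alert or investigation record. All field names are hypothetical."""
    enriched_automatically: bool   # context attached without an analyst
    closed_without_analyst: bool   # investigated and closed end-to-end by automation
    auto_closed_as_noise: bool     # suppressed as an obvious false positive


def automation_breakdown(items):
    """Return the three rates a single 'automation rate' slide can blur together."""
    total = len(items)
    if total == 0:
        raise ValueError("no work items to measure")
    enriched = sum(i.enriched_automatically for i in items)
    noise = sum(i.auto_closed_as_noise for i in items)
    # Noise suppression is excluded here so it is not counted as investigation closure.
    end_to_end = sum(
        i.closed_without_analyst and not i.auto_closed_as_noise for i in items
    )
    return {
        "enrichment_rate": enriched / total,
        "noise_auto_close_rate": noise / total,
        "end_to_end_closure_rate": end_to_end / total,
    }


# Synthetic sample of 100 items, chosen only to show how the rates diverge.
sample = (
    [WorkItem(True, True, False)] * 35      # closed end-to-end by automation
    + [WorkItem(True, False, True)] * 20    # auto-closed as noise
    + [WorkItem(True, False, False)] * 41   # enriched, then handed to an analyst
    + [WorkItem(False, False, False)] * 4   # untouched by automation
)
for name, rate in automation_breakdown(sample).items():
    print(f"{name}: {rate:.0%}")
```

On this synthetic sample the same system can truthfully be described as 96% automated or 35% automated, depending on which definition the slide picked, which is why the question has to name the definition.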
The field anecdote
You can't trust the layer until you can measure the feed.
I watched a sharper version of the validation failure from the field, and it's the bridge between the ceiling number and everything downstream of it. A large SOC that had just turned on an AI triage tool asked me how to undo all of their alert-reduction tuning, the human suppression that exists so the detection engineers aren't buried, so that the tool, sitting at the very top of the stack, could see the full raw feed and do the filtering magic itself. The raw feed at that layer is enormous; a network-telemetry source can run to millions of events a day against a tuned alert stream that is a tiny curated fraction, so "let the AI see everything" means trusting it to replace detection-engineering discipline rather than measure it.
And the tool sits at the top of a long pipeline, downstream of every parsing, mapping, and correlation stage, so it can only act on what survives all of them, which means reverting the human tuning floods it with volume without answering whether the signals that matter even arrive. You cannot trust that layer until you can measure that the data actually flows through the pipeline to it, and almost nobody measures that. Trusting the AI on an unverified pipeline is the same swamp relocated under a fresh coat of paint. That's the connection back to the data-quality work in the ground you're standing on: the top-of-stack agent is only as good as the parse-map-correlate stages beneath it, and the buyer who hasn't instrumented those stages is buying confidence, not coverage.
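One way to measure whether data actually reaches the top of the stack is to inject known canary events at the source and check which of them survive each stage. This is my own illustrative technique, not something the SOC in the anecdote was running; the stage names and event IDs below are hypothetical.

```python
def canary_coverage(injected_ids, seen_by_stage):
    """Trace canary events through an ordered pipeline.

    injected_ids:  set of canary event IDs injected at the source.
    seen_by_stage: ordered list of (stage_name, set of canary IDs observed
                   at that stage's output).

    Returns a list of (stage_name, fraction_of_injected_arriving, ids_lost_here).
    """
    if not injected_ids:
        raise ValueError("inject at least one canary event")
    report = []
    surviving = set(injected_ids)
    for stage, seen in seen_by_stage:
        arrived = surviving & set(seen)
        lost_here = sorted(surviving - arrived)
        report.append((stage, len(arrived) / len(injected_ids), lost_here))
        surviving = arrived  # an event lost upstream cannot reappear downstream
    return report


# Hypothetical run: four canaries, one dropped at mapping, one at correlation.
injected = {"c1", "c2", "c3", "c4"}
observed = [
    ("parse", {"c1", "c2", "c3", "c4"}),
    ("map", {"c1", "c2", "c3"}),
    ("correlate", {"c1", "c3"}),
    ("triage-agent", {"c1", "c3"}),
]
for stage, fraction, lost in canary_coverage(injected, observed):
    print(f"{stage:13s} {fraction:.0%} arriving, lost here: {lost}")
```

In this made-up run the agent only ever sees half of what was injected, and no amount of reverting human tuning would change that, because the loss happens in the mapping and correlation stages beneath it.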
Where automation works and fails
Where automation works, where it fails, and why the ratio holds.
The 30-40% isn't random; there's a shape to it, and the shape generalizes beyond Ruxie because the structural reasons travel. Three categories of work account for most of what Expel-style automation actually handles. Context enrichment is the first and the largest: decoding obfuscated PowerShell, resolving WHOIS for suspicious domains, cross-referencing IP reputation feeds and known-bad hashes, mapping to MITRE ATT&CK, pulling user/host/asset metadata. That's the 95-97% bucket, and it works because the logic is deterministic, the data sources are well-defined, and the downside of being wrong is negligible: enrichment doesn't require a yes-or-no decision about whether something is bad, it just attaches context and the analyst still decides. High-confidence routine playbooks are the second: password spray, brute-force authentication against externally-exposed services, well-characterized commodity malware downloading from known-bad infrastructure, all investigation types where the decision tree is short, the precision is high, and the action is reversible, and where a confidence threshold (Expel-style numbers around 80% are common) escalates to a human when any step drops below it. This category works because adversaries have used password spray for fifteen years, so the detection logic needs to be precise rather than creative, and the 30-40% number lives mostly here. The third is the one that gets the least attention in vendor pitches and is arguably the highest-leverage automation in a mature SOC: continuous detection-efficacy monitoring, tracking precision over time, flagging rules whose true-positive rate has drifted, auto-disabling rules whose precision has fallen below an organization-specific floor. It doesn't investigate alerts; it watches the alert factory and tells the humans when something is broken, which directly addresses the alert-fatigue-from-drift failure mode that consumes analyst time.
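The third category, detection-efficacy monitoring, is simple enough to sketch. The version below compares each rule's recent precision to its baseline, flags drift, and disables rules under an organization-specific floor. The floor (0.5), the drift tolerance (0.15), the rule names, and the counts are all hypothetical values I picked for illustration, not anyone's production settings.

```python
def precision(true_positives, false_positives):
    """Precision of a rule, or None when it has not fired at all."""
    fired = true_positives + false_positives
    return None if fired == 0 else true_positives / fired


def review_rules(rule_stats, floor=0.5, drift=0.15):
    """Decide an action per detection rule.

    rule_stats: {rule_name: {"baseline": (tp, fp), "recent": (tp, fp)}}
    Returns {rule_name: "keep" | "flag_drift" | "disable"}.
    """
    actions = {}
    for rule, stats in rule_stats.items():
        base = precision(*stats["baseline"])
        recent = precision(*stats["recent"])
        if base is None or recent is None:
            actions[rule] = "keep"        # not enough data to judge
        elif recent < floor:
            actions[rule] = "disable"     # below the organization's floor
        elif base - recent > drift:
            actions[rule] = "flag_drift"  # still usable, but a human should look
        else:
            actions[rule] = "keep"
    return actions


# Synthetic counts of (true positives, false positives) per period.
stats = {
    "password_spray":    {"baseline": (90, 10), "recent": (88, 12)},
    "encoded_powershell": {"baseline": (80, 20), "recent": (60, 40)},
    "impossible_travel": {"baseline": (70, 30), "recent": (30, 70)},
}
print(review_rules(stats))
```

Note that nothing in this loop investigates an alert. It only watches the rules, which is the point of the category.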
The unhandled 60-70% is unhandled for structural reasons too, not for lack of investment, and three patterns keep showing up. Novel techniques have no playbook by definition, because automation needs a playbook and playbooks come from past investigations, so a genuinely new living-off-the-land binary pattern or a previously-unseen exploitation chain against a recent CVE has nothing to match against; even the best ML model is backward-looking, and the determination of "anomaly that matters" still requires creative judgment about adversary intent. Business-context false positives are the second: an IT admin running a legitimate PowerShell script to modify GPO registry keys looks exactly like an adversary doing the same thing, the signal in the logs is identical, and the difference is whether there's a change-control ticket and whether the activity matches the admin's normal pattern. Industry research (Gartner's Market Guide for Security Operations, 2024; Ponemon's "Cost of False Positives" study, 2024; ESG Research's "Life and Times of Security Operations" survey, 2024) pegs false-positive rates across enterprise SOCs in the 30-60% range, and determining which ones are false requires picking up the phone or reading a change-control system, which automation can't do reliably yet. The third is detection engineering itself, the part of the pipeline with the least overlap with what automation is good at: hypothesizing about adversary motivation, designing rules that balance precision against recall, prototyping against historical data, validating across a red-team exercise. Automation can monitor whether detections are working but can't decide what detections should exist, so as organizations shift routine investigation onto automation, the proportion of analyst time spent on detection engineering goes up rather than down. The operational shift is not that there are fewer analysts but that the analysts are doing different work.
There's an unglamorous architectural reason 30-40% is the ceiling rather than 90%, and it's worth naming because it's structural rather than a function of model quality. The performance constraint that determines whether end-to-end automation is feasible is latency per alert: if automated enrichment-plus-decision takes 30-60 seconds per alert and the SOC sees 10,000 alerts a day, the system needs roughly 83-167 hours of compute per day, which means four to seven workers running around the clock just to keep up, even assuming perfect parallelization. The way real automation closes that gap is by pre-computing the expensive parts in materialized views (MVs), so the decision query against the MV returns in milliseconds rather than scanning raw event data. The difference between scanning 50 billion raw auth events and querying 720 pre-aggregated hourly rows is roughly four orders of magnitude in latency, which is what separates automation that can run end-to-end inside a 20-minute MTTR window from automation that cannot keep up. But MVs accelerate known patterns and don't help novel investigations that need to scan raw event data with creative query shapes, so the routine playbooks live in the MV-acceleratable zone and the novel investigations don't, and the ratio between them reflects the underlying mix of routine versus novel work in a real SOC. (The MV specifics are Tier D inference from the shape of the problem and analogous observability architectures; Expel hasn't published its internal stack. The structural argument is Tier B.)
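The compute arithmetic above is easy to check. The sketch below reproduces it from the paragraph's own inputs (10,000 alerts a day, 30-60 seconds each); the function names are mine, and the worker count assumes perfect parallelization with no queueing or idle time.

```python
import math


def compute_hours_per_day(alerts_per_day, seconds_per_alert):
    """Total processing time per day if every alert is handled by automation."""
    return alerts_per_day * seconds_per_alert / 3600


def min_parallel_workers(alerts_per_day, seconds_per_alert):
    """Fewest always-on workers that can keep up, assuming perfect parallelization."""
    return math.ceil(compute_hours_per_day(alerts_per_day, seconds_per_alert) / 24)


for seconds in (30, 60):
    hours = compute_hours_per_day(10_000, seconds)
    workers = min_parallel_workers(10_000, seconds)
    print(f"{seconds}s per alert: {hours:.1f} compute-hours/day, at least {workers} workers")
# 30s per alert: 83.3 compute-hours/day, at least 4 workers
# 60s per alert: 166.7 compute-hours/day, at least 7 workers
```

The worker count is a floor, not a sizing recommendation: real alert volume arrives in bursts, so keeping per-alert latency inside a response window takes more headroom than the daily average suggests.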
The selection criteria for an automation candidate follow from all of this and aren't subtle: precision above 90% in validation, a repeatable procedure rather than a one-off investigation, clear decision logic that can be codified, and volume high enough to justify the engineering cost. Most novel detections don't meet all four; the ones that do are the ones that should be automated, and that's the channel by which the 30-40% gets reached, slowly and one validated hunt at a time rather than in a big-bang ML deployment. The order matters: validate the hunt, measure precision, codify the playbook, automate the execution, monitor for drift. Skipping the first three steps and going straight to the fourth is how teams generate the disabled-automation-projects post-mortems you can find on conference talk slides if you look for them, because automation amplifies whatever signal exists, including the false-positive signal.
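The four criteria can be written down as a gate that reports which ones a candidate fails. This is a minimal sketch: the 90% precision bar comes from the paragraph above, while the volume threshold (50 a month), the field names, and the two example hunts are hypothetical, since "volume high enough to justify the engineering cost" depends on the team.

```python
from dataclasses import dataclass


@dataclass
class HuntCandidate:
    """A validated hunt being considered for automation (hypothetical fields)."""
    name: str
    validated_precision: float  # measured in validation, 0.0 to 1.0
    repeatable: bool            # a procedure, not a one-off investigation
    codifiable_logic: bool      # decision logic can be written down
    monthly_volume: int         # how often the hunt fires


def automation_gaps(candidate, min_precision=0.90, min_monthly_volume=50):
    """Return the criteria the candidate fails; an empty list means it qualifies."""
    gaps = []
    if candidate.validated_precision <= min_precision:
        gaps.append("precision")
    if not candidate.repeatable:
        gaps.append("repeatability")
    if not candidate.codifiable_logic:
        gaps.append("codifiable logic")
    if candidate.monthly_volume < min_monthly_volume:
        gaps.append("volume")
    return gaps


routine = HuntCandidate("password-spray", 0.97, True, True, 400)
novel = HuntCandidate("new-lolbin-pattern", 0.62, False, False, 3)
print(routine.name, automation_gaps(routine))  # password-spray []
print(novel.name, automation_gaps(novel))
```

Returning the list of gaps rather than a yes-or-no keeps the order of operations visible: a hunt that fails only on precision goes back to validation, not into the automation queue.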
The binding constraint
Identity and trust is the binding constraint, not the percentage.
When I picture the agentic SOC two or three years out, the bottleneck isn't how much of tier-1 a single agent can automate. It's that the agents won't come from one place. You'll have a triage agent that ships with your SIEM, an enrichment agent from your EDR, a ticketing agent in your ITSM platform, a threat-intel agent from a feed vendor, and quite possibly a model you host yourself for the data you can't send anywhere. Each of those is a different runtime, a different trust boundary, and a different idea of what "authenticated" means, and the moment one of them needs to ask another to run a query or hand over a credential, you're in the problem the cross-vendor trust layer is built for. This is why a single-vendor demo is so misleading about the hard part: inside one vendor's fabric the agents already trust each other because they share an identity provider and a runtime, so discovery is a config file and trust is assumed, and the demo looks clean because it has erased the cross-boundary problem, which is the one you actually have.
The clearest articulation of this layer I've found is MIT's NANDA project, and it's worth being precise about what it is, because this is exactly the kind of research that gets mis-cited as it travels. NANDA is a research initiative at the MIT Media Lab focused on the infrastructure AI agents would need in order to discover, authenticate, and coordinate across organizational boundaries, framed deliberately by analogy to the internet's own naming and trust layer: where DNS and the certificate system let machines that have never met find and trust each other, NANDA asks what the equivalent looks like for agents. The public output is mostly academic (protocol design, an arXiv paper, 2507.14263, Raskar et al., "Beyond DNS: Unlocking the Internet of AI Agents via the NANDA Index and Verified AgentFacts", and reference implementations under the projnanda GitHub organization). The architecture has three pieces worth naming: a NANDA Index that lets an agent discover others by the capability it needs rather than by a hard-coded endpoint; a registry layer that resolves a discovered agent to an actual identity and address, with Ed25519 keys underpinning that identity; and Verified AgentFacts, which are signed attestations of what an agent claims it can do, so a capability lookup returns something a relying party can check rather than something it has to take on faith. It's a clean separation of concerns that borrows the right ideas from DNS, PKI, and service discovery without trying to be any of them, and it is not a SOC-automation benchmark and does not publish an automation percentage for security operations.
The way I'd put the contribution in security terms is that we spent the last decade rebuilding access control around the idea that you don't trust a request because of where it came from, you trust it because the actor proved its identity and its authorization for that specific action. That's zero-trust, and we built it for humans and for services. Agents are a third class of actor, and they break the assumptions of both, because an agent is neither a human with an SSO session nor a service with a pinned certificate and a fixed scope; it's a thing that shows up, claims a capability, and asks to act, and the question of whether to believe it is exactly the question zero-trust was supposed to answer. Verified AgentFacts maps most directly onto that: a signed attestation of what an agent can do, anchored to a cryptographic identity, is the agent-world version of an authorization claim you can verify rather than assume. It doesn't solve the whole problem, because you still have to decide whose attestations you trust (the bootstrap problem), but it moves the question from "take the agent's word for it" to "check a signature against an identity you've decided to trust," which is the right shape for a security control. This is the same instinct I write about in defining what you can own: the durable questions are about what an actor is allowed to own and prove, not about how clever the model is.
Read as infrastructure rather than as an automation claim, NANDA is trying to solve four specific problems that anyone who has built a SOAR integration knows by heart. The first is the N-by-M integration tax: a typical enterprise SOC runs 40-60 security tools, each with its own API, authentication scheme, and rate-limit gotchas, and connecting them point-to-point creates N-by-M complexity where every new tool requires integrations against every other tool. Gartner's 2024 Hype Cycle for Security Operations puts the average tool count at 45-65 across enterprises surveyed, consistent with what I see in practice, and capability-based discovery reduces this to N-plus-M: each tool registers its capabilities once, and other agents query by capability rather than by tool name, the same architectural shift DNS made for hostnames decades ago. The second is trust without infrastructure: most current SOAR integrations authenticate with long-lived API keys, OAuth tokens, or service accounts, and when one is compromised, revocation takes hours to days because the credential is referenced from dozens of places, whereas the cryptographic identity model (Ed25519 keys plus attestation-based trust plus CRDT-backed sub-second revocation) is an upgrade worth taking seriously. The CRDT (Conflict-free Replicated Data Type) piece matters specifically because revocation events can propagate across a distributed agent network without a central coordinator that would itself become a target, and this is the part of the architecture I think security teams should pay attention to even if the agent-coordination layer on top of it doesn't pan out. 
The third is capability discovery instead of endpoint binding: current automation hard-codes "Splunk API v2.3" or "CrowdStrike Falcon v2.1" into every workflow, and when the vendor updates the API the workflow breaks, whereas a discovery model lets an orchestrator ask for "siem-query" as a capability and get whichever agent currently provides it, analogous to using a service mesh instead of hard-coding IP addresses. The fourth is composability across vendors: Google's Agent-to-Agent (A2A) protocol and Anthropic's Model Context Protocol (MCP) are complementary efforts at the same problem from different angles (A2A handling inter-agent messaging, MCP handling model-to-tool context), while NANDA's contribution is the directory and trust layer that makes the messaging and context layers safer to use across organizational boundaries. None of the three is the final answer yet, but the fact that they're converging is what I'd watch.
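The shift from endpoint binding to capability discovery, and the N-by-M versus N-plus-M arithmetic behind it, fits in a toy example. To be clear, this is an in-memory illustration of the idea and not the NANDA Index API or any real protocol; the class, the agent IDs, and the count of 20 workflows are hypothetical, and only the "siem-query" capability name and the 45-tool figure come from the text above.

```python
from collections import defaultdict


class CapabilityIndex:
    """Toy directory mapping capability names to the agents that provide them."""

    def __init__(self):
        self._providers = defaultdict(list)

    def register(self, agent_id, capabilities):
        """Each agent registers once, listing what it can do."""
        for capability in capabilities:
            self._providers[capability].append(agent_id)

    def discover(self, capability):
        """Look up providers by capability rather than by tool name or endpoint."""
        return list(self._providers.get(capability, []))


def point_to_point_integrations(consumers, providers):
    """Every consumer integrates with every provider: N x M."""
    return consumers * providers


def registry_integrations(consumers, providers):
    """Every consumer and provider integrates once with the directory: N + M."""
    return consumers + providers


index = CapabilityIndex()
index.register("siem-agent-a", ["siem-query"])
index.register("edr-agent", ["edr-query", "host-isolate"])
print(index.discover("siem-query"))   # ['siem-agent-a']

# A second provider appears; workflows asking for "siem-query" need no change.
index.register("siem-agent-b", ["siem-query"])
print(index.discover("siem-query"))   # ['siem-agent-a', 'siem-agent-b']

print(point_to_point_integrations(20, 45), registry_integrations(20, 45))  # 900 65
```

What the toy leaves out is the hard part the next paragraph turns to: nothing here checks that an agent registering "siem-query" is what it claims to be.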
Any architecture that introduces a new coordination layer introduces a new attack surface, and the threat model has four classes a security architect should think about. The first is agent impersonation, where a malicious agent registers false capabilities, gets discovered by orchestrators, and either exfiltrates the queries it receives or returns attacker-controlled results; attestation and reputation help, but the bootstrap problem (how do you trust the first attester) is non-trivial. The second is discovery enumeration, where an attacker who gets read access to the NANDA Index gets a complete map of your agent network, a more compact recon target than scanning your tools individually; the mitigation is access control on the directory itself, which carries the same access-control complexity DNS has. The third is trust-chain compromise: if attestation depends on other agents vouching, an attacker who compromises a high-reputation agent gets to vouch for malicious agents, and reputation systems are notoriously hard to design against this, which NANDA does not solve in any way I find architecturally convincing yet. The fourth is coordination disruption, where denial-of-service against the NANDA Index degrades every agent network depending on it; the CRDT model helps with eventual consistency but doesn't eliminate the dependency. The mitigations NANDA proposes (sub-second revocation, attestation, reputation, audit trails) are sensible and in some cases stronger than what current SOAR integrations offer, but this is not the kind of architecture where "the threat model is solved" is a fair statement; it's the kind where the threat model is well-articulated and the mitigations are partial, which is the right honest place for a research project to be.
This is also the same argument seen from the vendor side. Databricks acquired Antimatter into the Lakewatch launch specifically for agent authentication and authorization, on the underlying assumption that agentic SIEM needs identity and policy primitives that today's SOC infrastructure doesn't have. That's the NANDA argument restated as a product gap: the infrastructure lacks identity primitives for non-human actors, and whether you read it from MIT's research framing or from a vendor's acquisition, the binding constraint is the same. An identity-and-discovery layer like NANDA does not, by itself, move the investigation-automation ceiling; it makes a multi-vendor agent SOC safer to assemble, which is necessary but not sufficient, and the analyst-judgment work that caps the ceiling (is this real, what's the scope, do we notify regulators) is untouched by better discovery and trust. So if anyone shows you a high-nineties SOC-automation figure and attributes it to NANDA or to "agent networks" generally, the right response is that they've crossed two different questions: how agents find and trust each other, which NANDA addresses, and how much of an investigation an agent can close, which it doesn't.
The duct-tape era
The duct-tape era: an existence proof, honestly labeled.
The most honest thing in agentic security right now isn't a product, it's a pile of rules files. RAPTOR is what Gadi Evron, Daniel Cuthbert, Halvar Flake, and Michael Bargury built by pointing Claude Code at security work, and it isn't a CVE-to-patch pipeline with a clean diagram so much as a harness: a CLAUDE.md, a set of sub-agents and skills, and the willingness to let an agent try. The four are people with serious security résumés (Evron, founder of the Israeli CERT; Cuthbert, the OWASP appsec veteran; Flake, who you may know as Thomas Dullien; and Bargury, who has spent the last couple of years on agent and copilot security), and the cleanest way to describe what they built is that it turns Claude Code into a security researcher by giving it the same scaffolding you'd give a junior analyst: a written brief, a set of standing rules, some specialist sub-agents, and a library of skills it can reach for. There's no Kubernetes cluster, no distributed agent fabric, and no MLOps platform underneath it, because the whole point is that the capability now lives in the general-purpose agent and the harness is just the configuration that aims it.
That distinction matters because it changes where the interesting thing is. If RAPTOR were a custom CVE-to-patch engine, the story would be about the engine; because it's a harness over a general agent, the story is that the underlying model got good enough that four people could compose existing pieces into something that drafts a real patch, and the composition took rules and skills rather than a new platform. The demonstration that traveled is Flake's: the patching capability drafted a candidate fix for an FFmpeg vulnerability that then needed hand tweaks before it was finished. FFmpeg is a roughly 1.4-million-line C/C++ codebase that handles media parsing, a frequent source of remote-code-execution vulnerabilities because the parser surface is enormous, and manual patches against it typically require deep understanding of codec internals. Two years ago the available models couldn't reason about a codebase that size at the level required to draft a logic change addressing a vulnerability's root cause; the bar was closer to "suggest a formatting fix" or "bump a dependency version." Drafting an actual logic patch that compiles and survives the existing test suite, even one a human still has to finish, is a different category of capability, and the creators described the whole thing as a "duct tape MVP" with "embarrassingly simple" plumbing.
I want to be careful here, because one drafted patch is not a benchmark but an existence proof, and the difference matters: the existence proof tells you the capability is available, while the benchmark would tell you how often it succeeds. RAPTOR has not been independently benchmarked, and its creators have not published a success rate across a representative CVE corpus, so treating one patch as a hit rate would be the same mistake I criticize vendors for making. That self-discipline is the bridge back to the 60-70%-needs-humans argument: the FFmpeg draft needing hand tweaks is the small, concrete version of "no playbook for novel work," because a one-off patch against a 1.4-million-line codebase is exactly the kind of creative, non-routine work that sits in the unautomatable fraction.
The framing the RAPTOR team reaches for, and the one I think is right, pulls from the early-web era. In the mid-1990s much of the dynamic web ran on Perl CGI scripts: a request hit a web server, the server forked a Perl process, the process generated HTML and exited, with no application server, no ORM, no request lifecycle, no proper session management, just a wrapper of glue around a scripting language that happened to be good at text manipulation. That "good enough" architecture is what much of the modern web-framework lineage descends from, so Rails, Django, Express, and Next.js are each the polish on top of the Perl/CGI shape, and the polish came after a decade of the messy version being used in production rather than before it; the pattern that gets remembered as revolutionary started out as duct tape. Early AI security automation is in the same place: a harness like RAPTOR is the Perl/CGI of agentic security tooling, it works sometimes, it fails often enough to require human review and finishing, and it's getting better quickly because the iteration cycle is short. The analogy isn't perfect, because Perl CGI didn't make wrong predictions with confidence the way LLMs do, but the structural point holds, so if a capability requires duct tape to get the first production deployment, the right move is to build the duct tape rather than wait for the framework that will eventually replace it.
Patch economics
Patch economics: why this specific workflow is worth the duct tape.
The CVE-to-patch workflow is one of the few in security where the gap between need and capacity is large enough that even partial automation is valuable, and the numbers (from sources I'd treat as Tier A) make the case. The National Vulnerability Database published roughly 28,800 CVEs in 2023, up from about 25,100 in 2022, on the order of 15% year-over-year growth. Mandiant's M-Trends 2024 reports median time-to-remediate for critical vulnerabilities in the 60-day range across surveyed organizations. Qualys's TruRisk research shows roughly 47% of organizations take longer than 30 days to apply patches that already exist. The piece that's structurally hard to close is the gap between CVE disclosure and vendor patch release, frequently 30 to 45 days, during which the organization knows the vulnerability exists and knows there's no official fix, with options limited to deploying a compensating control, accepting the exposure, or developing a custom patch. Custom patches are rare because they require senior engineering time against a vendor's codebase, and few security teams have that capacity.
The math on team capacity is what the argument rests on. A small security engineering team that can develop two or three custom patches per week, against 50+ critical vulnerabilities per quarter, has a backlog that grows faster than it can be drained. AI-assisted patching may not close that gap, but it changes the shape of the constraint, because senior engineers reviewing and finishing candidate patches is a different workload from senior engineers writing patches from scratch, and reviewing scales better than authoring.
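The backlog arithmetic can be run directly. The sketch below uses the paragraph's figures (two or three patches a week against 50 critical vulnerabilities a quarter) plus two simplifying assumptions of mine: a 13-week quarter, and every critical vulnerability needing a custom patch, which makes this an upper bound rather than a forecast.

```python
def backlog_after(quarters, new_critical_per_quarter, patches_per_week,
                  weeks_per_quarter=13):
    """Unpatched critical vulnerabilities left after the given number of quarters."""
    backlog = 0
    capacity = patches_per_week * weeks_per_quarter
    for _ in range(quarters):
        backlog += new_critical_per_quarter
        backlog -= min(backlog, capacity)  # cannot patch more than exists
    return backlog


for rate in (2, 3):
    print(f"{rate} patches/week: backlog of {backlog_after(4, 50, rate)} after one year")
# 2 patches/week: backlog of 96 after one year
# 3 patches/week: backlog of 44 after one year
```

Under these assumptions the team breaks even at 50 / 13, or just under four patches a week, so moving from authoring two or three patches to reviewing and finishing four or more is the difference between a backlog that grows every quarter and one that doesn't.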
A candidate patch that compiles and passes the existing test suite is not the same as a candidate patch that's safe to deploy, and the FFmpeg draft needing hand tweaks is the small version of exactly this, so the risk catalog splits into three categories. Patch quality: AI-generated code requires mandatory human review, because edge cases in complex codebases get missed and the agent may fix the vulnerability and break adjacent functionality in a way the existing tests don't catch, and there's a meta-risk that a generated patch introduces a new vulnerability while addressing the original one, because the model doesn't reason about security properties the way a human auditor does but pattern-matches against examples in its training data. False confidence: not all vulnerabilities are patchable via code changes, since some require architectural redesign (a sandbox boundary moved, a protocol changed, an authorization model rebuilt), and a patch that addresses the symptom while leaving the design flaw in place can be worse than no patch because it discharges the urgency without closing the root cause; "it compiles and the tests pass" is not the same as "it's secure." Testing gaps: automated tests aren't exhaustive, security-specific test cases are frequently missing from the suites the validation step relies on, so manual penetration testing is still required for critical patches, and production monitoring after deployment matters more than usual because the failure mode you can't catch in test is the attacker bypassing the fix through a different code path. The mitigation I'd recommend is structural: AI drafts the candidate, a human reviews and finishes it before deployment, deployment carries enhanced monitoring for the first 30 days, and the feedback from those 30 days flows back to inform when to trust the workflow next time.
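The structural mitigation described above (AI drafts, a human reviews, deployment carries 30 days of enhanced monitoring) can be sketched as a simple gate. This is a hypothetical illustration, not RAPTOR's implementation: the class, field names, and placeholder CVE identifier are all assumptions of mine.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

MONITORING_WINDOW = timedelta(days=30)  # the 30-day enhanced-monitoring period


@dataclass
class CandidatePatch:
    cve_id: str                      # placeholder identifier, hypothetical
    human_reviewed: bool = False     # set only after a senior engineer signs off
    deployed_on: Optional[date] = None

    def deploy(self, today: date) -> None:
        # Refuse to deploy an AI-drafted patch that no human has reviewed.
        if not self.human_reviewed:
            raise PermissionError("AI-drafted patch requires human review before deployment")
        self.deployed_on = today

    def under_enhanced_monitoring(self, today: date) -> bool:
        # True for the first 30 days after deployment.
        return self.deployed_on is not None and today - self.deployed_on < MONITORING_WINDOW


if __name__ == "__main__":
    patch = CandidatePatch("CVE-0000-00000")
    patch.human_reviewed = True
    patch.deploy(date(2026, 1, 1))
    print(patch.under_enhanced_monitoring(date(2026, 1, 15)))
```

The point of encoding the gate is that the review step cannot be skipped under deadline pressure. The feedback loop from the monitoring window back into trust decisions is left out here because it depends on what your monitoring actually records.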
Not every vulnerability is a good fit, and not every codebase is. Good candidates: well-understood vulnerability classes where the fix pattern is documented (buffer overflows with bounds checks, XSS with output encoding, SQL injection with parameterized queries); codebases with decent test coverage so the validation step has something to validate against; non-critical systems where a bad patch can be rolled back without major consequence; vulnerabilities with a clear root cause rather than ones that span multiple files or require redesign. Poor candidates: novel vulnerability classes where the model has no pattern to match against; architectural issues where the fix requires moving a boundary, not changing a line; safety-critical systems where the cost of a wrong patch is unacceptable even with human review; vulnerabilities that require coordinated changes across services or repositories, where the agent's context window is the ceiling on how broad a change it can reason about. Any productivity-multiplier claim ("3-5x throughput for patch development") should come from measured outcomes on your own codebase, not extrapolated from adjacent-domain studies on general-purpose coding assistants; I treat the 3-5x figure as Tier D until I see your numbers.
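The good-candidate and poor-candidate criteria above can be written down as a triage rule, which is sometimes easier to argue with than prose. A minimal sketch follows; the field names and the middle "needs-judgment" bucket are my own assumptions, since the text only names the two ends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vulnerability:
    documented_fix_pattern: bool   # e.g. bounds check, output encoding, parameterized query
    decent_test_coverage: bool     # validation step has something to validate against
    cheap_rollback: bool           # non-critical system, bad patch can be reverted
    clear_root_cause: bool         # not multi-file, not a redesign
    safety_critical: bool
    spans_services_or_repos: bool  # needs coordinated changes beyond one codebase


def ai_patch_fit(v: Vulnerability) -> str:
    # Any poor-candidate condition disqualifies, regardless of the rest.
    if (v.safety_critical or v.spans_services_or_repos
            or not v.documented_fix_pattern or not v.clear_root_cause):
        return "poor"
    # All good-candidate conditions hold.
    if v.decent_test_coverage and v.cheap_rollback:
        return "good"
    # Assumption: everything in between goes to a human to decide.
    return "needs-judgment"


if __name__ == "__main__":
    sqli_internal_tool = Vulnerability(True, True, True, True, False, False)
    print(ai_patch_fit(sqli_internal_tool))
```

Treating the poor-candidate conditions as disqualifying rather than as weights is a deliberate choice: a safety-critical system does not become a good fit because its test coverage is excellent.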
Category snapshot: the pipeline layer
The pipeline layer is where the migration risk lives.
Below the agent layer sits the pipeline layer, and it's where a structural consolidation is underway that matters for any agentic-SOC decision. The most informative single event in this category in two years was the September 2025 SentinelOne acquisition of Observo AI for approximately $225M, and it predicts a specific shape of consolidation: SIEM and EDR vendors are buying pipeline-layer companies because the pipeline is where the cost story plays out, where OCSF normalization happens, and where data-routing decisions either lock customers into a downstream stack or set them free to migrate. The vendor that controls the pipeline largely controls the migration risk for whatever sits downstream. This is the same structural move CrowdStrike made when it acquired Flow Security and Adaptive Shield, and the same one Cisco made with Splunk.
The following is a clearly-dated evidence snapshot, refreshable as the category moves, and subordinate to the consolidation-pattern thesis above rather than a standing per-vendor survey. As of mid-2026 the roster crowding into the space behind Cribl and Tenzir is six vendors: DataBahn, Observo AI (now part of SentinelOne), Monad (now incorporating Tarsal), Abstract Security, Datadog Observability Pipelines, and Edge Delta. The category is real and the funding is real; what's thinner than the marketing suggests is independent security-specific production validation. The strongest Tier B reference in the entire group is the ~$225M Observo acquisition (September 2025), which is a public market validation of the category but not a customer reference. DataBahn is the cleanest on-theme example: its pitch is agentic AI for pipeline operations, autonomous agents that learn from enterprise data flows to automate the manual data-engineering work (parser generation for new log sources, schema inference, pipeline optimization, blind-spot detection), which maps directly to the operational pain in Cribl deployments where Pack configuration and parser maintenance consume engineering capacity. That's a structural cost advantage if the agents work as positioned, but the security-specific public evidence behind it is the thinnest of the six, which makes DataBahn the canonical "capability proven on paper, production and buyer untested" case in this whole essay: the technical pitch is among the most interesting in the group while the evidence is among the weakest. Monad is the strongest independent on named-customer evidence, with Robinhood, Navan, and Upstart (from the Tarsal acquisition) plus a Snowflake relationship and a Wiz partnership publicly stated, the only company-name-granular references in the group, though still short of the Fortune-100 named-and-scaled level that anchors Cribl.
Datadog Observability Pipelines and Edge Delta belong in the conversation by capability but not by category, because they're observability products with security extensions and the procurement motion reflects that. The acquisition-prediction layer on top of all this is Tier D speculation projected from one data point, and the only reason to raise it is that it's a real factor in the decision: if you commit to DataBahn or Monad in 2026 and that vendor is acquired by a SIEM player in 2027, your pipeline strategy is now downstream of someone else's stack consolidation plan.
Lakewatch
Lakewatch as the largest-vendor worked example.
Databricks Lakewatch is the largest-vendor instance of "lakehouse-native security stops being a self-build conversation," and it's worth working through in detail because the question it raises (does the category change when a vendor of Databricks' scale productizes it) is the live one. The launch inventory: Databricks publicly positioned Lakewatch on Unity Catalog, Delta Lake, and Apache Iceberg, open table formats throughout, with OCSF on the Silver layer, under the framing "Open Agentic SIEM." Agentic triage runs on Mosaic AI with Anthropic Claude as the underlying model; Detection-as-Code carries detection content; Genie translates analyst questions into SQL against the lake; Lakeflow Declarative Pipelines and Expectations carry data quality; Lakebase (serverless Postgres) carries case management; DASF 2.0 (62 risks, 64 controls) is the governance overlay sitting on Unity Catalog primitives. Two acquisitions closed into launch: Antimatter brought agent authentication and authorization (the agent-identity gap discussed above), and SiftD.ai brought SPL translation by Tylor Murray, the original SPL author, which is the migration on-ramp Splunk customers will actually use if anyone's going to displace Splunk at scale.
Four durable whitespace gaps are what the launch doesn't ship, and they're worth naming because they're where independent practitioners and partner shops still have to do real work. The first is a first-party asset and identity graph: the Open Agentic SIEM framing implies the agent reasons across entities, yet the entity-resolution surface (CMDB integration, identity-provider linkage, vulnerability-source reconciliation) is still partner territory rather than a Lakewatch primitive. The second is productized OCSF conformance: OCSF shows up on the Silver layer in the marketing diagram, but the validation, mapping coverage, and semantic-drift detection that production OCSF deployments actually need still isn't there, and this is the DataBahn whitespace, the gap where the pipeline-layer vendors and independent shops still earn their keep. The third is a CISO-language maturity model: the launch is engineered for the security engineer who can read the lakehouse stack diagram, while the translation layer that explains the same architecture in compliance, risk, and audit vocabulary (the language a CISO has to use to defend the decision) is still partner-and-PS territory. The fourth is a lakehouse-fluent buyer-enablement curriculum, and it's the one I'd treat as the most important, because the gating constraint on how fast Lakewatch can take share is buyer education rather than technology: even at Databricks' scale the customer base capable of running the platform is smaller than the customer base capable of running a Splunk deployment, so until the curriculum exists the addressable market is "teams that already think lakehouse-native," a smaller cohort than "teams that have heard 'lakehouse-native' is the future."
Three independent analyst reads landed within weeks of the announcement, each surfacing something the marketing didn't, and I read them as things to plan around rather than pitches to repeat. HFS Research's point is that the 80% TCO framing reads as "more efficient," not "automatically cheaper," because for an organization already running Databricks at scale Lakewatch reduces marginal cost, while for an organization standing up Databricks specifically to run Lakewatch the platform cost is additive rather than subtractive, so the TCO math has to be modeled against the customer's starting state rather than against a generic SIEM baseline. Hugo Lu's point is that the "build your own SIEM" GTM motion is structurally different from the one that built Splunk's base: Splunk grew on "ingest everything, search later," a category that didn't require buyers to architect anything, whereas Lakewatch requires buyers to think in lakehouse primitives, a legitimate value proposition for sophisticated buyers but close to a non-starter for much of Splunk's installed base, so the category may take share from the top of the market while leaving most of the middle untouched. InfoTech's is the sharpest reframe of the three: for shops that already run Databricks, Lakewatch reads less as a new vendor decision than as a utilization decision on infrastructure they've already committed to, which is the framing that fits procurement (where "we're using more of what we already paid for" travels better than "we're adding another vendor"), while for shops that don't already run Databricks the framing is the opposite of helpful because the platform cost has to be underwritten before Lakewatch's value can show up.
The migration-from-Splunk story is the thread to track, because SiftD.ai's SPL translation isn't the same as the manual rule conversion that existing detection-content migrations have to do. If the translation works at the corner cases (macros, lookups, the weird SPL conventions that pile up at scale), the migration cost curve flattens enough that the migration cost reality analysis may need a revision, and once the evidence lands that revision is probably worth a piece of its own. The matrix scores Lakewatch against the same workload archetypes it scores everything against, which means scoring it against the alternatives (Snowflake-with-Hunters, Cribl-augmented Splunk, ClickHouse-on-Iceberg, StarRocks) rather than against itself, and that's the discipline the disclosure at the top of this essay commits to.
Procurement
How to procure in this category.
The shortlist construction discipline matters more in this category than in mature ones because the public evidence is thin enough that vendor positioning carries disproportionate weight, and three principles consistently hold. Run a paid PoC against your actual workload: free PoCs are usually scoped narrowly enough that they don't surface the operational issues that matter at production scale, whereas a paid PoC (three to six months, against real production telemetry volume, with the vendor's professional services engaged) is the only structure I've seen reliably distinguish vendor positioning from vendor capability, and for a 10 TB/day deployment a paid PoC costs $50-150K and saves roughly 10x that in selection mistakes if it surfaces a structural issue before you commit. Test the distinctive claim, not the table-stakes claims: every vendor can route data from a source to a destination, so the PoC should verify the differentiator (agentic AI for parser generation, edge processing for bandwidth-constrained deployments, platform-embedded analytics, Kubernetes-native deployment with broad integration coverage), and if the distinctive claim doesn't hold up under test the vendor falls back to a generic SDPP, at which point Cribl's operational maturity is the better default. Plan and contract for the acquisition scenario: the contract terms that matter under the Observo precedent are data-portability guarantees, OCSF-as-canonical-schema commitment, pipeline-language openness (proprietary DSLs are the lock-in vector that survives acquisition), and exit clauses tied to change-of-control, none of which are unusual procurement asks but all of which get overlooked when the buying motivation is "this vendor is independent today," exactly the assumption the Observo precedent should make you doubt.
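The paid-PoC economics above reduce to a break-even question: how likely does a structural issue have to be before the PoC pays for itself? A minimal sketch follows. It assumes the roughly 10x saving is realized only when the PoC surfaces such an issue and is zero otherwise; that simplification and the function names are mine.

```python
def poc_expected_net(cost: float, savings_multiple: float, p_issue: float) -> float:
    """Expected net value of a paid PoC, given the chance it surfaces a structural issue."""
    return p_issue * savings_multiple * cost - cost


def break_even_probability(savings_multiple: float) -> float:
    """Chance of surfacing a structural issue at which the PoC exactly pays for itself."""
    return 1 / savings_multiple


if __name__ == "__main__":
    print(break_even_probability(10))
    for cost in (50_000, 150_000):
        print(cost, poc_expected_net(cost, 10, 0.25))
```

Under those assumptions the break-even probability is one in ten, independent of where in the $50-150K range the PoC lands. If you think the chance of a structural surprise in a thin-evidence category is higher than that, the paid PoC is the cheap option.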
The open-format mitigation is the same as for any vendor concentration risk: prefer OCSF for schema and Iceberg or Delta for storage, avoid proprietary pipeline languages where you can, and run your evaluation on the assumption that the vendor you choose may be a subsidiary of a larger security platform within 18-24 months. The honest default for most security operations in 2026 remains Cribl for enterprise scale, Tenzir for pipeline-detection priority, or Vector for cost-constrained smaller deployments; the six vendors above are credible shortlist additions for specific scenarios, not the default recommendation, because the public evidence isn't yet strong enough to make them the default.
The fair-broker close
The fair-broker close, and what a maturity model can and can't do.
Pull it back to the spine. Any vendor pitching 90%+ owes you the scope statement: what counted, against what baseline, in which scenarios, and the production number broken down by enrichment versus end-to-end investigation closure. If they attribute the number to NANDA specifically, or to "agent networks" generally, they've confused a discovery-and-trust result for an automation one, and that tells you how carefully they read. Honest numbers are a competitive advantage, because the discipline of citing a publicly-defensible practitioner number rather than the marketing-claim equivalent is the same discipline that lets you build automation that actually works: the teams that buy into 90%+ narratives tend to under-invest in the validation work that would have moved them from 10% to 30%, while the teams that internalize "the realistic ceiling is 30-40% and most of us are nowhere near it" tend to spend their budget on the analyst skills and detection engineering that get them there.
The maturity argument is the structural reason none of this is a shortcut. SANS publishes a Hunting Maturity Model (HMM) running from HMM0 (no hunting capability) through HMM4 (automated hunting), and the SANS 2024 SOC Survey reports that most organizations sit at HMM1 or HMM2 (ad-hoc and procedural hunting), with HMM3 (innovative, custom analytics) and HMM4 (automated) rare in the field, not because the tooling doesn't exist but because the data foundations and analyst skills don't. An agent-coordination layer doesn't move an organization up the maturity model on its own. If the underlying data is incomplete, the detection logic is stale, or the analyst workflow is built around tier-1 ticket-pushing rather than hypothesis-driven hunting, deploying an agent network on top of all that just automates the existing dysfunction faster, and now with a new identity layer to secure. There's no shortcut from HMM1 to HMM4, and the agent layer doesn't manufacture one. NANDA-style architectures may eventually let HMM3 and HMM4 organizations run their existing capability more efficiently and across more tools, and they may make the multi-vendor version of that safe to build, but they're not a substitute for the foundations, which is the same point at a different altitude as the field anecdote: fix the parse-map-correlate pipeline and verify that the data actually flows before you trust the layer on top of it.
I'm not re-arguing two adjacent pieces here, because they carry their own weight. On where AI maturity actually sits, the early-automotive frame (past horseless-carriage to AI-native, but pre-seatbelt, capability outran safety) is the summer-of-AI piece. On grounding and deterministic verifiability (why "below the frontier" is targeting rather than weakness), the dimensional-modeling-for-AI argument is reinvented Kimball. Both are upstream of this one; this essay is the agentic-SOC-reality dimension of the same fair-broker frame.
A short coda on what to actually track, so this page doesn't re-accrete weight per future sweep. Watch the identity layer rather than the automation headline: NANDA, A2A, and MCP converging on agent-coordination primitives, and whether SIEM and EDR vendors ship capability registration. Watch the pipeline-layer consolidation: which of the six independents gets acquired, and whether the acquirers keep the data path open. Watch SPL-translation accuracy in production, because that's the single signal that moves the Splunk-migration cost curve. And watch for the named, scaled, production security reference that would move any of these vendors from Tier C to Tier B, because that's the evidence that's missing today, and its arrival is what would change the read. The honest framing isn't "agent networks will solve SOC automation"; it's that agent networks will need an identity and trust layer before they're safe to run across vendors, the practitioner ceiling for end-to-end investigation sits at 30-40% with most of the field below it, and the durable work is the same work it always was, which is the foundation under the agent rather than the agent on top.
A number that does exist
A number you may have seen, attached to the right claim.
There's a real 98.7% figure floating around the agent conversation, and it's worth pinning down because it gets stapled onto the wrong paper. It comes from Anthropic's work on code execution with MCP, where letting an agent write and run code to call tools, rather than threading every tool call and result back through the model's context, cut the token cost of a tool-heavy task by about 98.7% (roughly 150,000 tokens down to 2,000). That's an efficiency result about how an agent uses tools, and it has nothing to do with NANDA or with SOC automation. It's still relevant to the agentic SOC, just to a different question, which is whether running a fleet of agents over your security data is affordable: token economics is one of the quiet constraints on agent-heavy architectures, and a reduction of nearly two orders of magnitude (about 75x) in the token cost of tool use changes what's financially viable to automate. So keep the number if it's useful to you, but keep it attached to the claim it actually supports (agent tool-use is getting much cheaper) and not to a claim about how much of the SOC an agent network can run.
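The arithmetic behind the figure is worth checking once, because the percentage and the multiplier are easy to conflate. A minimal sketch using only the two token counts quoted above; the function name is illustrative.

```python
import math


def token_reduction(before: int, after: int) -> tuple[float, float]:
    """Return (fractional reduction, shrink factor) for a token budget."""
    return 1 - after / before, before / after


if __name__ == "__main__":
    reduction, factor = token_reduction(150_000, 2_000)
    print(f"reduction: {reduction:.1%}")                    # 98.7%
    print(f"shrink factor: {factor:.0f}x")                  # 75x
    print(f"orders of magnitude: {math.log10(factor):.2f}")
```

A 98.7% cut is a 75x shrink, a little under two orders of magnitude. That is a statement about the cost of one tool-heavy task, not about what share of investigations an agent can close.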
Related