Security Data Works

Writing · AI

The Gatsby Summer of AI.

The capability is real, the glamour is earned, and we built the car before we built the seatbelts. What the dazzle hides shows up in security telemetry as a confident, well-formatted, wrong answer, measured rather than felt.

Mid-2026

The dazzling summer.

Everyone remembers the parties. The orchestra on the lawn, the champagne by the case, the cars nosed up the drive all night, and at the center of it a man who had made himself out of nothing and was throwing the most dazzling summer of the decade for a crowd that mostly did not know his name. That is the image The Great Gatsby leaves you with first, the glamour, the sheer seductive surplus of it, and it is the most honest picture I can find of where applied AI sits in the middle of 2026. The capability is real and the glamour is earned, because the demos are genuinely beautiful, the agents do things that would have been science fiction three years ago, and everyone wants to be near it. Fewer people remember how that summer ended, and the way it ended is what I am circling toward, but the glamour has to come first because the glamour is what makes the ending possible.

What the glamour hides is the bill, and the whole reason this essay can be more than a culture riff is that in security telemetry that bill is measured, not merely felt. Take Microsoft's natural-language-to-KQL feature in Security Copilot, the thing that lets an analyst type "show me failed logons from new countries in the last 24 hours" and get back a Kusto query against the lake. The published evaluation of the NL2KQL system reports something that should stop a SOC manager cold: the generated queries are syntactically valid and execute cleanly something like 97 to 99 percent of the time, and they return the correct result set only about 58 percent of the time (Tier B, Microsoft research evaluation). The arithmetic is worth sitting with, because roughly two out of five queries run perfectly, look perfectly reasonable, produce a clean table of results, and are wrong. They are not wrong in a way that throws an error and asks you to try again, but wrong in a way that hands you a confident, well-formatted answer to a question you didn't ask, in the one domain where the gap between "looks right" and "is right" is the gap between catching the intrusion and writing the breach notification.
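The arithmetic can be written down in a few lines. This is a minimal sketch, not anything from the NL2KQL evaluation itself: the function name `silent_wrong_share` is made up for illustration, and it assumes every correct query is also one that executed cleanly, so the silently wrong share is simply the valid rate minus the correct rate.

```python
def silent_wrong_share(valid_rate: float, correct_rate: float) -> float:
    """Share of all generated queries that run cleanly but return the wrong result.

    Assumption (mine, not the evaluation's): a correct query is always a valid one,
    so the silently wrong share is valid_rate - correct_rate.
    """
    if not 0.0 <= correct_rate <= valid_rate <= 1.0:
        raise ValueError("expected 0 <= correct_rate <= valid_rate <= 1")
    return valid_rate - correct_rate


# Figures quoted in the essay: 97 to 99 percent valid, about 58 percent correct.
for valid in (0.97, 0.99):
    share = silent_wrong_share(valid, 0.58)
    of_clean_runs = share / valid  # what the analyst actually sees: clean runs only
    print(f"valid={valid:.0%}  silently wrong={share:.0%} of all queries, "
          f"{of_clean_runs:.0%} of clean runs")
```

On the quoted figures this lands at roughly 39 to 41 percent of all queries, which is where "two out of five" comes from.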

This is the silent-error tax, and it is the seatbelt we forgot to build. A loud error is cheap, because it fails, you see it fail, you fix it or you fall back to doing the query by hand, and you have lost ten minutes. A silent error is expensive precisely because the surface validity is so high: the query parses, the schema columns exist, the result set is plausible, and the analyst who is three hours into a shift and forty alerts deep has no signal that the floor isn't there. So the 97-to-99-percent surface validity is not the good news but the mechanism of the harm, the chrome on a car with no restraint system, exactly the kind of dazzle that gets a bystander shot for someone else's bad belief about who was driving.
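The loud-versus-silent distinction is easy to measure once you have ground truth. The sketch below is a hypothetical harness, not a description of any vendor's evaluation: `EvalCase`, `classify`, and `outcome_rates` are names invented here, and it assumes you have a hand-verified expected result set for each question and compare result sets exactly.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_rows: FrozenSet[str]            # ground truth from a hand-written query
    returned_rows: Optional[FrozenSet[str]]  # None means the generated query raised an error


def classify(case: EvalCase) -> str:
    """Label one case as correct, a loud error, or a silent error."""
    if case.returned_rows is None:
        return "loud_error"    # failed visibly; the analyst knows to retry
    if case.returned_rows == case.expected_rows:
        return "correct"
    return "silent_error"      # ran cleanly, looks plausible, is wrong


def outcome_rates(cases: List[EvalCase]) -> dict:
    """Fraction of cases in each outcome bucket."""
    if not cases:
        raise ValueError("need at least one evaluation case")
    counts = Counter(classify(c) for c in cases)
    return {k: counts[k] / len(cases) for k in ("correct", "loud_error", "silent_error")}


# Toy, made-up cases purely to show the shape of the output.
demo = [
    EvalCase("failed logons, new countries", frozenset({"r1", "r2"}), frozenset({"r1", "r2"})),
    EvalCase("new admin accounts", frozenset({"r3"}), None),
    EvalCase("rare parent processes", frozenset({"r4"}), frozenset()),
]
print(outcome_rates(demo))
```

The number to watch is `silent_error`, not `loud_error`: a clean-looking empty result, as in the third toy case, is exactly the failure that gives the analyst no signal.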

So that is the present, the cold open, the glamour. Now I want to rewind, because the only way to understand why the glamour is dangerous is to remember how recently the car was a hostile, sputtering, barely-controllable thing, and how genuinely far it has come, since the progress is real and the realness is what sets the trap.

Rewind · the agentic turn

The Model T, and a cockpit that fought the driver.

Back up to the stretch the market started calling the agentic turn, roughly late 2024 through 2025. This is the period when the industry stopped bolting language models onto chat windows and started building machines designed from the ground up to act: Anthropic's Model Context Protocol arriving in November 2024 as the first real wiring standard, Claude Code and the broader coding-agent wave through 2025, the skills-and-tools layer maturing on top. The vocabulary shifted from what the model could say to what the model could do, and that shift is not marketing. It is a different category of machine.

The right analogue is not the sleek roadster. It is the Model T. The Model T was a real automobile, purpose-built, mass-producible, the genuine article in a way the things that came before it were not. And it had a cockpit that actively fought the person trying to drive it. There were three pedals on the floor and not one of them was the accelerator you'd expect: left pedal worked the two-speed planetary transmission, center pedal was reverse, right pedal was the brake, and the throttle lived on a lever on the steering column. You started it by retarding the spark with another lever, then walking around to the front and hand-cranking the engine, which would, with some regularity, kick back hard enough to break the cranker's arm. None of the controls were where a modern driver's hands and feet expect them, because the convention for where controls go had not yet been negotiated. The machine was real. The interface was a negotiation still in progress, and the negotiation was conducted, often, in broken wrists.

That is precisely the state of agentic AI's controls, and the security data says so out loud. The machine is real; the harness is not settled. Consider what happens when you point one of these systems at the actual surface a security team works against, which is an enterprise data warehouse with thousands of columns, not a tidy textbook schema. On Spider 2.0, a text-to-SQL benchmark built from real enterprise databases with 3,000-plus columns and the kind of nested, dialect-specific, business-logic-encrusted schemas that an OCSF-normalized security lakehouse actually resembles, the frontier models solve on the order of 6 to 17 percent of tasks (GPT-4 class near the bottom of that range, o1 near the top; Tier B). That is six to seventeen, not sixty and not eighty, and the toy benchmarks where these systems score in the nineties are the equivalent of a test track with no other cars on it. Put the same engine on a real road with real traffic and real intersections and the controls reveal themselves as what they are: unsettled, unlabeled, and capable of putting you in the ditch with full confidence that they took the correct turn.

Even on the friendlier, more mature benchmarks the ceiling is visible and the gap is not closing. On BIRD, a large text-to-SQL benchmark closer to realistic complexity, the best automated systems cluster around 81 to 82 percent execution accuracy on the test set (Agentar-Scale-SQL at 81.67 percent in May 2026, AskData with GPT-4o at 81.95 percent the prior September) against a 92.96 percent human baseline (BIRD leaderboard and the NeurIPS 2023 paper, Tier A). An eleven-point gap is survivable for a Q&A toy. It is not survivable for a control surface, and the part that matters for anyone deciding whether to wire this into a SOC is that the gap has been stubborn. It has not narrowed the way the demo cadence implies it should. The releases keep coming, the leaderboard keeps inching, and the distance to human reliability on realistic schemas has held roughly where it was, so the machine got faster while the steering did not get more trustworthy at the same rate, and on a control surface the steering is what you are actually buying.

The part of the Model T that maps most exactly onto where we are is the one that looks at first like a footnote. The Model T did not ship finished. Ford built a purpose-built, mass-producible chassis and left a startling amount of the car for someone else to supply, and an enormous aftermarket grew up in the gap. Sears sold pages of Model T accessories. People bought aftermarket ahooga horns, the klaxons that became the car's signature sound, because Ford had not built a real one in. That aftermarket was not the sign of a failed product but of a real chassis whose subassemblies had not been standardized yet, a platform good enough to build an industry of add-ons around and unfinished enough to need one. That is exactly the agentic-AI moment. The chassis is real and the aftermarket is roaring: Model Context Protocol servers, agent frameworks, eval harnesses, guardrail startups, observability layers, a whole ecosystem bolting on the components the platform labs shipped without. And the components being sold aftermarket are, tellingly, the safety ones. The horn was how you warned the world you were coming. Today the warning systems, the evals that measure your confidently-wrong rate, the provenance that reconstructs what an agent touched, the human-review gate, are aftermarket parts you buy or build yourself, because the chassis came without them.

The glamour creeps in right here, at the top of this stretch. By late 2025 the capability genuinely turned sleek. The coding agents got good enough that experienced engineers started restructuring their workflows around them, the multi-step tool use started actually completing multi-step tasks, and the whole thing acquired the seductive sheen of a technology that has crossed from "interesting" to "I can't work without it." This is the roaring-twenties part of the story, the part where the car stops being a curiosity that scares the horses and becomes the object of desire, the thing you are judged by for owning, and the seduction is what does the damage. A Model T with the spark-advance lever and the arm-breaking crank does not seduce anyone, since you respect it or you fear it but you watch it, whereas a fast, beautiful machine invites you to stop watching, and that invitation, extended before the restraints exist, is the road to the pool.

Rewind · 2024

The carriage breaks down.

Back up again, to 2024, the year the first version of the machine hit its ceiling in public. This was the slop era, and the failures were loud and a little funny. Image and video generators producing uncanny, six-fingered nonsense. The hardware gadgets that were supposed to be the AI-native iPhone, the Rabbit R1 and the Humane Ai Pin, shipping to scathing reviews and, in Humane's case, an outright safety recall over the battery. Apple Intelligence arriving late and underwhelming relative to the keynote. A fast-food drive-thru AI pilot getting yanked after it kept adding hundreds of dollars of chicken nuggets to people's orders. Goldman Sachs publishing the skeptical note about whether the capex would ever earn its return, and the market taking a beat to wonder if the whole thing was a bubble.

The trough is worth insisting on, because every one of those failures was cosmetic, and survivable for exactly that reason. A bad image model produces a bad image and you don't print it. A flopped gadget gets returned. A drive-thru bot gets switched off and a human takes the headset back. A skeptical analyst note moves a stock price and the stock price recovers. The 2024 reality checks were embarrassing and low-consequence, and they were low-consequence because nobody yet trusted the machine to do anything that mattered. The carriage broke down by the side of the road and the worst that happened was you were late and you looked foolish. You walked home.

That is the failure mode that the agentic turn left behind, and leaving it behind is the whole danger. When the machine only talks, a wrong answer is a wrong answer and a human decides what to do with it. When the machine acts, a wrong answer is a wrong action, taken, in the world, before a human is in the loop. So the next reality check will not be a six-fingered hand but an agent that confidently quarantined the wrong host, or confidently cleared the alert that was the actual intrusion, or confidently ran the 58-percent-correct query and closed the investigation on the strength of a clean-looking empty result. The 2024 crashes were the carriage failing safely. The crash this metaphor is built to warn about is the fast car failing at speed, with the failure now operational rather than embarrassing, and with a bystander in the path.

I'll put a number on why I don't think the human-in-the-loop reflex saves us here, because the people closest to the work have already voted with their behavior. In a field study of SOC analysts working alongside an LLM assistant, the analysts asked the model for an actual verdict on a query in only about 4 percent of cases (Tier B). The rest of the time they were using it to draft, to summarize, to remember syntax, to get unstuck; they almost never let it decide. These are not Luddites but the practitioners with the most exposure to the tool, and their revealed preference is that they do not trust it to call the question that matters, so they are drivers who have noticed the brakes are spongy and have quietly decided to coast. The optimistic story about agentic security operations assumes a trusting operator, but the most experienced operators are already not trusting it, which tells you the seatbelt problem is not a perception gap to be marketed away so much as a measured reliability gap that the people on the road can feel through the steering wheel.

There is a structural ceiling underneath all of this that predates the LLMs and should temper the projections, and it is the same ceiling I work through in detail in the agentic-SOC reality piece: the practitioner-documented ceiling of 30 to 40 percent of investigations automated end to end, which has held across roughly three years of investment, because the long tail of security work is genuinely contextual and adversarial in ways that resist playbook-style automation. The LLMs are a step up in handling parts of that tail, but they are not, on the current evidence, a step through the ceiling, and any business case that models them as one is pricing in a reliability curve the benchmarks do not show.

Rewind · 2023

The horseless carriage.

Back up one more time, all the way to the start, to 2023 and the chat-era euphoria. GPT-4 in March. The DevDay reveals in November. The search wars, the great Bing-versus-Google moment, the sense that everything was about to be rewritten. This was the most exciting stretch of the whole story and it was, in machine terms, the most primitive. What we actually had in 2023 was the horseless carriage: a genuinely new engine, dropped into the chassis of the thing it was about to replace, and we were amazed it rolled at all.

The horseless carriage is the literal first form of the automobile, and the name tells you everything about how its makers understood it. It was a carriage, the familiar object, the buggy you already knew, with the horse removed and an engine put where the horse used to be. The body, the wheels, the seating, the entire mental model came straight from the thing it descended from. Nobody had yet asked what a vehicle should look like if it were designed around the engine instead of around the horse. They asked the smaller question, the bolt-on question: how do I put this new power source into the vehicle I already have?

That is exactly what AI-augmented 2023 was, and it is what most of what still ships under the "AI" label remains. A language model welded to a chat window. A model welded to a search box. A model welded to a dashboard as a little sparkle-icon assistant in the corner that summarizes the thing you were already looking at. New power, same vehicle. The SIEM with a chat sidebar bolted on is a horseless carriage. The dashboard with a "summarize this" button is a horseless carriage. They are genuinely useful, the way the first horseless carriages were genuinely useful, and they are limited in exactly the way a carriage is limited: the form constrains the engine, because the form was designed for a different power source and nobody has yet rebuilt around the new one. The dazzle of 2023 was the dazzle of seeing the carriage move without a horse. It was real. It was also the most rudimentary thing this technology will ever be, and we have spent the years since climbing off of it.

Back to the present

The progress is real, which is why the glamour deceives.

Run the rewind back forward and you have the whole arc in one line. Horseless carriage to Model T to a fast, glamorous car, in something like three years. That is genuine progress, and I want to be unambiguous that it is genuine, because the cynical read of this essay, that AI is overhyped and it's all slop, is wrong and I don't believe it. The machine got dramatically more capable as the bolt-on became a real purpose-built thing and the hostile cockpit got more livable, so the car is fast now in a way that the 2023 carriage could not have promised, and that is exactly why the glamour deceives. If the progress were fake, the missing seatbelts wouldn't matter, because nobody would be going fast enough to get hurt. The danger is downstream of the realness. The faster and more beautiful the car, the more the absence of restraints costs, and the more tempting it is to stop watching the road.

The era has two faces, and Gatsby is only the quieter one. Gatsby's carelessness is passive, the rich who smash things up and then retreat into their money and let other people clean up the mess, and the cost lands on a bystander the careless party never has to look at. The louder face, and the one most people actually picture when they picture reckless driving, is Cruella de Vil: the manic grin, the car up on two wheels through the hairpin, total indifference to anyone else on the road, speed as self-expression with the wreckage externalized onto whoever is in the way. Agentic AI runs in both registers at once. There is the quiet Gatsby version, a system shipped whose silent failures land on an analyst or a customer the deployer will never meet. And there is the loud Cruella version, the wire-it-into-production-and-see-what-happens energy that treats a live SOC like an empty road at midnight, and neither driver is watching the road while both of them are going fast.

The work

Road signs are written in blood.

So I'll end where the security work actually lives, with the boring infrastructure that the speed has outrun. Early automobiles got fast a long time before they got seatbelts, traffic law, driver licensing, and a liability regime that put the cost of recklessness on the reckless. The cars came in the 1900s. The three-point seatbelt that actually works is a 1959 Volvo patent. The gap was half a century, and it was paid in bodies. The translation to security operations is not subtle. The seatbelt is provenance, knowing which rows the agent read and which it wrote and being able to reconstruct the chain after the fact. The traffic law is a human kept on the adversary's tail, not on the assistant's output, with the model drafting and the analyst deciding, the 4-percent-trust behavior treated as a design constraint rather than a problem to market past. The licensing is measured outcomes, the discipline to run the NL2KQL-style evaluation on your own telemetry and know your own confidently-wrong rate before you wire the thing into a closing action. And the liability regime is accountability, a named owner for what the agent does when it acts, so the cost of the wrong action lands on the party who deployed it rather than on the bystander downstream.
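Three of those controls, provenance, the human gate, and the named owner, can be sketched together in very little code. This is an illustrative sketch under stated assumptions, not a reference design: `AgentAction`, `ProvenanceLog`, `gate`, and the `CLOSING_ACTIONS` set are all hypothetical names, and a real deployment would write to durable, tamper-evident storage rather than an in-memory list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

# Hypothetical set of actions that change state and therefore need a human decision.
CLOSING_ACTIONS = {"close_alert", "quarantine_host"}


@dataclass
class AgentAction:
    kind: str                                  # e.g. "draft_query", "close_alert"
    owner: str                                 # named human accountable for this agent
    rows_read: List[str] = field(default_factory=list)
    rows_written: List[str] = field(default_factory=list)
    approved_by: Optional[str] = None          # analyst who made the call, if any


class ProvenanceLog:
    """Append-only record of what the agent touched and whether it was allowed."""

    def __init__(self) -> None:
        self._entries: List[dict] = []

    def record(self, action: AgentAction, allowed: bool) -> None:
        self._entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "kind": action.kind,
            "owner": action.owner,
            "rows_read": list(action.rows_read),
            "rows_written": list(action.rows_written),
            "approved_by": action.approved_by,
            "allowed": allowed,
        })

    def entries(self) -> List[dict]:
        return list(self._entries)


def gate(action: AgentAction, log: ProvenanceLog) -> bool:
    """Allow drafting freely; allow closing actions only with an owner and a human approval."""
    needs_human = action.kind in CLOSING_ACTIONS
    allowed = bool(action.owner) and (not needs_human or action.approved_by is not None)
    log.record(action, allowed)  # denied attempts are logged too
    return allowed


log = ProvenanceLog()
print(gate(AgentAction("draft_query", owner="soc-lead", rows_read=["signin:123"]), log))
print(gate(AgentAction("close_alert", owner="soc-lead", rows_written=["alert:42"]), log))
```

The design choice worth noting is that the gate records denied attempts as well as allowed ones, so the chain can be reconstructed after the fact whichever way the decision went.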

There is a fifth piece, and it is the one we have not started on. Every road sign is a lesson learned in blood. The curve is marked because someone took it too fast and did not come back; the limit is a specific number because that number is where the data said people begin to die. A sign is memory, posted at the exact spot the lesson is needed, so the next driver does not pay for it again. We are not putting up signs, so team after team wiring an agent into production hits the same curves, the silently-wrong query, the confident wrong action, the eval nobody ran, and crashes in private, and the lesson dies in the postmortem instead of getting posted at the bend for the next team.

Build those before you go fast, or at least before you let the car take actions you can't reconstruct. Remember the dazzling summer. It ended with the man who threw the parties floating in his pool, the water gone pink around a few fresh perforations, shot by a stranger acting on a wrong belief about who was driving the car. He had not done the thing he died for; he took the blame for someone else's recklessness, and the someone else drove off untouched into the rest of her careless life. In a SOC, the bystander is the analyst who inherits the silently-wrong query, the on-call who chases the agent's confident dead end while the real intrusion walks, the customer in the breach notification that opens with a clean-looking empty result set. The car is real and it is fast and it is genuinely the best machine we have ever built for this. The seatbelts you can install yourself, today, on your own car. The signs are the harder work and the more important, because a sign is a lesson somebody already paid for in full, posted so nobody has to pay it twice. Road signs are written in blood. Start putting them up.

The rest of the writing works the same seam: the security-data layer underneath the AI conversation, measured rather than marketed.
