From the lab
The better the model, the quieter the wrong answer.
I had a result I half-expected and a result I didn't. The one I expected was that a language model writing its own query against security telemetry would get the hard questions wrong. The one I didn't expect was that making the model more capable made the wrong answers harder to catch rather than easier. Nor did I expect that the two reflexes you'd reach for to add a safety check around the model (letting it run its query and fix itself, or sampling it several times and taking the consensus) would both make the problem worse or leave it untouched. The only thing that never failed silently wasn't the model at all; it was the layer underneath that refuses a query it can't answer.
This is the practitioner write-up, so I'll spend less time on why it matters and more on what I measured and where it bites. Evidence tier: B, my own lab, a synthetic OCSF corpus, a single host, run as a directional pilot. The transferable claim is the order and mode of failure across the arms, not the exact magnitudes, which are corpus- and host-dependent, and I'll mark the limits where they cut.
The setup
Six ways to answer the same question.
I built one battery of questions against a single OCSF event store and answered each question six ways: plain LLM text-to-SQL over the raw tables, two flavors of GraphRAG (retrieve context, let the model compose), an LLM writing SPARQL against a curated ontology and executing it deterministically through an Ontop OWL2QL rewrite, a deterministic OBDA template, and a hand-curated metrics layer of the kind a mature data team ships. The LLM arms ran at three model tiers, Haiku, Sonnet, and Opus, with the model held constant across arms within a tier, so I'm comparing methods rather than mixing a strong model into one arm and a weak one into another. Every arm is scored by one shared function that sorts each answer into three buckets: correct, silent (a confident answer that is wrong, with no error raised), and loud (a visible failure such as an empty result, a parse error, or a refusal). Eight trials per cell, because these models are sampled rather than deterministic and I wanted the variance.
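A minimal sketch of what a shared three-bucket scorer could look like, assuming each trial yields an answer, an optional error string, and a known ground truth. The names `Outcome`, `score`, and `rates` are illustrative, not the ones in the bench.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    CORRECT = "correct"
    SILENT = "silent"  # confident answer that is wrong, no error raised
    LOUD = "loud"      # visible failure: empty result, parse error, refusal


def score(answer: Optional[object], expected: object,
          error: Optional[str] = None) -> Outcome:
    """Sort one trial into correct / silent / loud."""
    # Any error, refusal, or empty result counts as a visible failure.
    if error is not None or answer is None or answer == [] or answer == "":
        return Outcome.LOUD
    return Outcome.CORRECT if answer == expected else Outcome.SILENT


def rates(outcomes: list[Outcome]) -> dict[str, float]:
    """Share of trials in each bucket for one cell of the grid."""
    if not outcomes:
        raise ValueError("no trials to score")
    n = len(outcomes)
    return {o.value: sum(1 for x in outcomes if x is o) / n for o in Outcome}
```

The point of routing every arm through one function is that "silent" is defined the same way everywhere: an answer came back, nothing complained, and it did not match.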
The questions split into two kinds, and that split is what the rest of this turns on. Some are lookups: find the encoded PowerShell command on this host, return the no-MFA privilege-escalation event. Those are find-the-needle questions, and text-to-SQL with full table access nails them, correct on every trial at every tier. The other kind is compute-over-population: count the distinct physical assets behind a human's six identifiers once you collapse the hostname and IP and instance-id aliases, compute the dwell time between the first and last event of the chain, reconstruct the stages of the attack in the order they happened. On those three questions, none of the LLM-authored arms produced the correct answer at any of the three tiers; the only correct compute-over-population answer in the whole run came from the hand-curated metric.
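To make the alias-collapse question concrete, here is a small sketch under stated assumptions: the identifiers and alias links below are invented for illustration, and union-find is one reasonable way to do the collapse, not necessarily how the curated metric does it. It shows why a plain distinct count over identifiers and a count over resolved assets disagree.

```python
def count_assets(identifiers: set[str],
                 alias_links: list[tuple[str, str]]) -> int:
    """Count physical assets after merging identifiers linked as aliases."""
    parent = {i: i for i in identifiers}

    def find(x: str) -> str:
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in alias_links:
        parent[find(a)] = find(b)  # merge the two alias groups
    return len({find(i) for i in identifiers})


# Hypothetical data: six identifiers that resolve to three machines.
identifiers = {"web-01", "10.0.0.5", "i-0abc", "db-01", "10.0.0.9", "laptop-7"}
alias_links = [("web-01", "10.0.0.5"), ("10.0.0.5", "i-0abc"),
               ("db-01", "10.0.0.9")]

naive = len(identifiers)                          # distinct identifiers
resolved = count_assets(identifiers, alias_links)  # distinct assets
```

Both numbers come from queries that run cleanly. Only one of them answers the question that was asked.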
So at the level of "did it get it right," the comparison is dull: on the questions that require reasoning over the whole population, the model is wrong. What isn't dull is how it's wrong, because that is what decides whether you find out.
The result I didn't expect
Capability moved the failures from loud to silent.
Look at the arm where an LLM writes SPARQL and a deterministic engine executes it. On the compute tail its failures break down by model tier like this:
- Haiku: loud on every trial. It writes a COUNT(DISTINCT …) over a tangle of UNION and GROUP BY that the Ontop rewrite refuses, and the refusal is the failure, visible, exit-non-zero, a thing you can alert on.
- Sonnet: loud about 96% of the time. Same story; the engine still mostly refuses what it writes.
- Opus: silent about 96% of the time. It writes a clean, minimal, executable query, SELECT (COUNT(DISTINCT ?host) AS ?n) WHERE { ?s :host ?host }, that runs without complaint and returns the wrong number.
That is the result that stopped me. The more capable model didn't get the answer right; it got the query right enough to execute, which removed the only thing that had been catching the error. The weaker models were accidentally safe, because they wrote queries broken enough that a strict engine threw them out, and Opus's competence stripped that net away and handed back a confident wrong count with no error attached. On a path where a downstream engine is doing the refusing, raising model capability converted caught failures into uncaught ones.
This isn't really about SPARQL. The plain text-to-SQL arm has no refusing engine in front of it, and it's silently wrong on the tail at every tier, roughly half the time at Haiku, Sonnet, and Opus alike, with capability barely moving it; GraphRAG is silently wrong on every trial at every tier. What capability changes is the texture of the failure, depending on what, if anything, is positioned to refuse a bad query, and it never reliably produces a correct answer on the hard questions. So a more capable model isn't a safer one here, and on the SPARQL path it's the more dangerous one, because its competence is what strips the refusing engine's catch away.
The two reflexes
Self-correction and self-consistency both fail here.
If you've shipped anything LLM-shaped you already know the two reflexes for making a model's output trustworthy: let it check its own work by running it, and sample it a few times and take the majority. I tested both as pre-registered arms, with the predictions written down before the runs, and both fail on the compute tail in instructive ways.
The first is the agentic loop. Give the model the same question and schema as the one-shot text-to-SQL arm plus a read-only handle to run DuckDB against the store, and let it iterate up to five rounds: write a query, run it, look at the rows or the error, revise, stop when it's satisfied. The pre-registered guess was that execution feedback would clean up the loud failures, the model fixing its own empty results and syntax errors, without touching the silent ones, because a wrong aggregate runs fine and returns plausible rows, so the loop has no error to correct against. That is what happened, and it's worse than a wash. Loud failures fell to roughly zero. Silent failures went up: the tail silent rate rose from about 0.54 to 0.96 at Haiku, 0.63 to 0.88 at Sonnet, and 0.50 to 0.88 at Opus. The loop's stopping condition is that the query runs and returns something plausible, and on the compute tail that condition is satisfied by a confidently wrong answer, so the loop spends its rounds converting visible failures into invisible ones. I checked one end to end: Opus, after five rounds of building an elaborate identity-closure query, returned one distinct asset against a ground truth of nine, a query that executed, looked confident, and was wrong.
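The stopping condition is easier to see in code. This is a sketch, not the bench runner: it swaps DuckDB for Python's built-in `sqlite3` so it runs anywhere, the `propose` callable stands in for the model, and the table and scripted queries are made up for illustration.

```python
import sqlite3
from typing import Callable, Optional


def agentic_loop(conn: sqlite3.Connection,
                 propose: Callable[[Optional[str]], str],
                 max_rounds: int = 5):
    """Write, run, revise; stop as soon as a query runs and returns rows."""
    feedback = None
    for round_no in range(1, max_rounds + 1):
        sql = propose(feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            feedback = f"error: {exc}"  # loud failure: something to fix
            continue
        if not rows:
            feedback = "empty result"   # loud failure: something to fix
            continue
        return rows, round_no           # ran and returned rows: loop is satisfied
    return None, max_rounds


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (host TEXT, asset TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("web-01", "A"), ("10.0.0.5", "A"), ("db-01", "B")])

# Stand-in for the model: a typo first, then a clean but wrong aggregate.
scripted = iter(["SELEC COUNT(*) FROM events",
                 "SELECT COUNT(DISTINCT host) FROM events"])
rows, rounds = agentic_loop(conn, lambda feedback: next(scripted))
truth = conn.execute("SELECT COUNT(DISTINCT asset) FROM events").fetchone()[0]
```

The syntax error gets repaired in round two, and the loop then stops on a count of three identifiers when the store holds two assets. Nothing in the loop has access to `truth`, so it has no signal telling it to keep going.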
The second is self-consistency. I didn't need new generation for this, since I already had eight trials per question, so I clustered each arm's eight answers, took the majority answer when one cleared five of eight, and abstained otherwise, counting an abstention as a caught failure. The pre-registered prediction was that this catches silent errors only where the model is inconsistently wrong, and the GraphRAG tail is the opposite: it's silently wrong on every trial, in agreement, so the eight wrong answers clear the majority and the ensemble emits a confident silent error. That held. At the frontier, Sonnet and Opus, self-consistency caught none of the three GraphRAG compute-tail questions; all three came back silently wrong with the votes agreeing. And the catch rate decreased with capability, because the better model is more consistently wrong, so its own samples agree more and the ensemble is least useful exactly where it's most dangerous. Where self-consistency did catch things, on text-to-SQL, it was abstaining because the trials disagreed, different SQL each round and different wrong answers, none reaching a majority. That catch is real, but it's a function of disagreement rather than of knowing the answer is wrong, and it cuts both ways: it abstained on the one configuration where Opus occasionally got the tail right, throwing the correct answer out with the rest. Self-consistency produced zero correct compute-tail answers in the entire run.
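The voting rule described above fits in a few lines. A sketch, assuming answers are hashable values and using the five-of-eight threshold from the text; the sample answer lists are invented, with 9 standing in for the ground truth.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def self_consistency(answers: Sequence[Hashable],
                     threshold: int = 5) -> Optional[Hashable]:
    """Return the majority answer if it clears the threshold, else abstain."""
    if not answers:
        return None
    top, votes = Counter(answers).most_common(1)[0]
    return top if votes >= threshold else None  # None means abstain


consistently_wrong = [1] * 8                     # eight trials agree on a wrong count
scattered_wrong = [3, 4, 1, 7, 3, 2, 5, 6]       # different wrong answers
sometimes_right = [9, 9, 9, 1, 2, 3, 4, 5]       # correct answer in the minority

emitted = self_consistency(consistently_wrong)   # wrong answer passes the vote
abstained = self_consistency(scattered_wrong)    # caught, by disagreement only
discarded = self_consistency(sometimes_right)    # correct answer thrown out
```

The function never sees the ground truth. It measures agreement, so it passes a unanimous wrong answer and discards a correct answer that fails to reach five votes.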
Both reflexes optimize for the wrong target. The agentic loop optimizes for executable; self-consistency optimizes for agreed-upon. On compute-over-population questions a wrong answer is usually both, so both methods preserve the silent error and one of them amplifies it.
What held
The only arms that never went silent weren't the model.
Two arms never returned a confident wrong answer on the whole battery, and neither of them is the model. The deterministic OBDA rewrite is correct on what it can express and refuses everything else; it answers two of the eight questions, both lookups, and on the compute tail it declines, loudly, because aggregation and recursion sit outside the OWL2QL fragment it can faithfully rewrite to SQL. Its zero-silent record is bought partly by declining the hard questions rather than by out-reasoning anyone on them, and that narrowness is a real limit, stated plainly rather than hidden. The other is the hand-curated metric, the only thing in the entire run that answered a compute-over-population question correctly. It isn't magic; it's a human who understood the question writing the join once, ahead of time, and it was wrong on one of the other tail questions too. But its failures, when they came, were a metric you could inspect rather than a model you had to trust.
The common thread is that safety lived in the execution and verification layer, not in the model and not in the model's effort. The deterministic rewrite refuses what it can't guarantee; the curated metric encodes a human's understanding of the question; neither is probabilistic at the point where correctness is decided. And the arm that aimed an LLM at the curated ontology in SPARQL, the most grounded of the probabilistic paths, was at least as silent at the frontier as plain text-to-SQL, silent on nearly every Opus trial, which tells you the grounding doesn't buy safety when the query against it is still being guessed. What protects you isn't that the target is structured but that the query against it is deterministic, or was pre-verified by a person who understood the question.
What I'd do with this
Verify every compute path, and harden the part that checks.
The honest scope keeps this from turning into "never let a model near your data." For lookups (find the event, pull the command line, return the record that matches), LLM text-to-SQL was correct on every trial at every tier, and that's most of what an analyst actually asks. The failure is specific to compute-over-population: counts, distinct counts after alias resolution, durations, orderings, the questions where the answer is a number derived from the whole set rather than a row you can point at. Those are also, not coincidentally, the questions detection and incident response lean on hardest, and they are, I'd argue, the ones where a confident wrong number is most costly, because it looks like an answer and shows up on a dashboard with no flag attached.
So the rule I'd write down is to verify every compute path, and to treat "an LLM composed this query" as a reason for an external check rather than a substitute for one. For the correctness-critical tail that means a deterministic rewrite that refuses what it can't express, a curated metric a human authored and owns, or a typed query layer that rejects the wrong shape: something outside the probabilistic path that can say no. It doesn't mean a bigger model, and it doesn't mean wrapping the model in a self-correction loop or an ensemble of its own samples, because I measured both and they move the failure in the wrong direction. If you can only afford to harden one part of an LLM analytics stack, harden the part that executes and checks, not the part that generates.
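As a rough illustration of "something that can say no," here is a deliberately crude gate. It is a sketch under stated assumptions: the keyword regex is a stand-in for real query parsing and will misfire on columns named like aggregates, and `gate`, `Refused`, and the `authored_by` labels are hypothetical names.

```python
import re

# Crude signal that a query computes over a population rather than looking up rows.
AGGREGATE = re.compile(r"\b(count|sum|avg|min|max|distinct|group\s+by)\b",
                       re.IGNORECASE)


class Refused(Exception):
    """Raised instead of returning a number nobody verified."""


def gate(sql: str, authored_by: str) -> str:
    """Pass lookups and curated metrics; refuse model-authored aggregates."""
    if authored_by == "llm" and AGGREGATE.search(sql):
        raise Refused("compute-over-population query from an unverified author")
    return sql


lookup = gate("SELECT cmd FROM events WHERE host = 'web-01'", authored_by="llm")
metric = gate("SELECT COUNT(DISTINCT asset) FROM events", authored_by="curated")

try:
    gate("SELECT COUNT(DISTINCT host) FROM events", authored_by="llm")
    refused = False
except Refused:
    refused = True  # a loud failure you can alert on
```

The design choice is that the gate keys on who authored the query and what shape it has, not on whether the result looks plausible. A refusal is a loud failure by construction.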
Limits
One corpus, one host, and where I'd want to be proven wrong.
This is one synthetic OCSF corpus on one host, run as a directional pilot, with the frontier approximated by three Claude tiers rather than a productized GraphRAG or text-to-SQL service. The magnitudes will move with the corpus and the schema; what I'm putting weight on is the order and mode of failure, which reproduced across all three tiers and both verification arms. The lookup class carried a retrieval confound I haven't yet isolated, so I'm not making a clean claim there. The strongest counter I can imagine is a query path with a different kind of gatekeeper, a typed semantic layer say, that catches the silent compute error the way a refusing engine catches a malformed one; if that exists and holds up, the claim narrows to an unverified LLM query path being unsafe on the compute tail, which is still the part that matters. The benches are public, the three-tier head-to-head and the two verification arms, with the scorer and the runners, at ocsf-semantic-query in the lab, so you can point them at your own questions and see where your stack goes quiet.
Competence and safety are different axes, and on a guessed query path they can point in opposite directions: the wrong answer a weak model gives is loud, and the wrong answer a strong model gives is quiet, and quiet is the one that ships to a dashboard and sits there. Verify every compute path.