The method · Research

Hunt for contradictions before confirmation.

Most architecture evaluations form a hypothesis, find evidence that supports it, and declare victory. This practice flips that pattern. The supporting evidence is the easy part; the contradictions are where the position actually gets stress-tested. Across 25 hypotheses, that flip changed roughly 60% of the central claims.

The flip

A hypothesis is only worth something if you tried to break it.

The default research pattern in this industry runs in one direction. You form a position ("ClickHouse will give us sub-second security analytics"), then gather supporting evidence from vendor benchmarks, case studies, and analyst reports. Within a week you have a stack of corroboration and a recommendation that feels rigorous but isn't.

Confirmation bias makes supporting evidence cheap and contradictory evidence expensive, because the contradictory evidence sits in places the marketing surface won't show you: GitHub issues where practitioners describe what actually broke in production, academic papers that characterize the failure modes, and framework developers who acknowledge the limitations their vendors don't mention. Finding it requires looking for it specifically.

The flip this practice runs:

Form the hypothesis. Specific, falsifiable, with a number where possible. "ClickHouse runs security workloads 100× faster than the dominant schema-on-read SIEM" is testable; "ClickHouse is good for security analytics" is not.
Gather supporting evidence. The easy step. Time-box it.
Hunt for contradictions. The hard step. Search GitHub issues for the technology in question, read the academic literature on its failure modes, and find independent analysts whose work isn't vendor-funded.
Engage practitioners and developers directly. Production deployments are the only environment where vendor claims meet real workloads, so the people running them are the highest-quality source. Framework developers matter too: the people who built the thing have less reason to oversell it than the people selling it.
Synthesize a balanced assessment. Both perspectives in the same place, with the central claim and the conditions under which it fails.
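The five steps can be sketched as a small data model. This is a minimal, hypothetical illustration in Python, not the tooling behind the research: the `Evidence` and `Hypothesis` names, the status labels, and the contradiction source in the usage lines are invented for the example. The design choice worth noting is that a hypothesis with no contradiction search on record reports as "untested" rather than "confirmed", however much supporting evidence it has.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    source: str          # where it came from: vendor benchmark, GitHub issue, paper, practitioner
    supports: bool       # True corroborates the claim; False contradicts it
    condition: str = ""  # for contradictions: the condition under which the claim fails


@dataclass
class Hypothesis:
    claim: str                             # specific and falsifiable, with a number where possible
    evidence: list[Evidence] = field(default_factory=list)
    contradictions_searched: bool = False  # set once the deliberate contradiction hunt is done

    def synthesize(self) -> dict:
        """Balanced assessment: the claim, a status, and the conditions under which it fails."""
        supporting = [e for e in self.evidence if e.supports]
        contradicting = [e for e in self.evidence if not e.supports]
        if not self.contradictions_searched:
            status = "untested"      # supporting evidence alone proves nothing
        elif not contradicting:
            status = "held"          # survived a deliberate attempt to break it
        elif supporting:
            status = "qualified"     # true only under some conditions
        else:
            status = "contradicted"
        return {
            "claim": self.claim,
            "status": status,
            "fails_when": [e.condition for e in contradicting],
        }


# Usage with the ClickHouse example; the contradiction source label is hypothetical.
h = Hypothesis("ClickHouse answers security queries on billion-row datasets in under one second")
h.evidence.append(Evidence("Cloudflare production deployment", supports=True))
h.contradictions_searched = True
h.evidence.append(Evidence("hypothetical practitioner report", supports=False,
                           condition="row-level security enabled (18+ s)"))
result = h.synthesize()
```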

Across 25 hypotheses, this process produced a public, versioned literature review (144 catalogued sources, each carrying an evidence tier and a verification verdict) and documented contradictions for roughly 80% of vendor technology claims. Around 60% of vendor performance claims required significant contextual qualification or outright revision; the other 40% held up. That 40% matters as much as the 60%. The lesson isn't that vendors always lie; it's that central claims have to be tested individually, and most of the testable ones move once you actually test them.
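A catalogue like this needs little more than a record per source and a tally per claim. The sketch below assumes a numeric tier scale (1 strongest), a three-value verification verdict, and three claim outcomes; all of those names and scales are hypothetical, not the review's actual schema. With ten illustrative claims of which six moved, `share_moved` returns 0.6, the same proportion reported above.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Verification(Enum):   # hypothetical verdict scale for a catalogued source
    VERIFIED = "verified"
    PARTIAL = "partially verified"
    UNVERIFIED = "unverified"


class ClaimOutcome(Enum):   # what happened to a vendor claim after the contradiction hunt
    HELD = "held"
    QUALIFIED = "qualified"
    REVISED = "revised"


@dataclass(frozen=True)
class CatalogueEntry:
    source_id: str
    tier: int               # hypothetical scale: 1 = production data ... 4 = vendor marketing
    verification: Verification


def share_moved(outcomes: list[ClaimOutcome]) -> float:
    """Fraction of claims that needed significant qualification or outright revision."""
    if not outcomes:
        return 0.0
    counts = Counter(outcomes)
    return (counts[ClaimOutcome.QUALIFIED] + counts[ClaimOutcome.REVISED]) / len(outcomes)


def strongest_first(entries: list[CatalogueEntry]) -> list[CatalogueEntry]:
    """Order sources by tier, then by verification, so the strongest evidence is read first."""
    rank = {Verification.VERIFIED: 0, Verification.PARTIAL: 1, Verification.UNVERIFIED: 2}
    return sorted(entries, key=lambda e: (e.tier, rank[e.verification]))


outcomes = [ClaimOutcome.QUALIFIED] * 4 + [ClaimOutcome.REVISED] * 2 + [ClaimOutcome.HELD] * 4
print(share_moved(outcomes))  # 0.6
```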

Five patterns

What 25 hypotheses revealed about how vendor claims actually work.

1. Performance claims are selectively true.

The ClickHouse "sub-second on billion-row datasets" claim is technically correct. Cloudflare's production deployment shows 96% of queries completing in under one second, and independent benchmarks confirm 100–1000× advantages over legacy SIEM platforms. Then row-level security gets enabled (the access-control feature most multi-tenant enterprise environments require), and the same queries degrade to 18+ seconds: a 10–20× performance penalty for exactly the controls SOCs need.

The same shape recurs with MinIO. The "93% faster than HDFS" claim applies to specific benchmark phases (GET operations at specific object sizes), while overall workflow improvement is closer to 15%. The 15% is real value; the 93% is peak performance under ideal conditions presented as average improvement.

The pattern: vendor benchmarks highlight peak performance under benchmark-friendly conditions, while real-world value comes from sustained average performance across mixed workloads, carrying the security controls and operational requirements that real environments impose. Both numbers can be defensible; the question is which one the decision rests on.
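One way to see how a phase-level gain shrinks end to end is Amdahl's law, used here as an explanatory model rather than as the research's own measurement. The sketch assumes "93% faster" means 1.93× throughput on the accelerated phases, and that those phases account for a hypothetical 27% of workflow time. Under those assumptions the overall speedup comes out near 1.15×, in line with the roughly 15% workflow improvement.

```python
def overall_speedup(phase_speedup: float, phase_fraction: float) -> float:
    """Amdahl's law: end-to-end speedup when only one phase of a workflow gets faster.

    phase_speedup:  speedup of the accelerated phase (1.93 means 1.93x throughput)
    phase_fraction: share of the original end-to-end time spent in that phase (0..1)
    """
    if not 0.0 <= phase_fraction <= 1.0:
        raise ValueError("phase_fraction must be between 0 and 1")
    # Untouched work keeps its time; the accelerated share shrinks by phase_speedup.
    new_time = (1.0 - phase_fraction) + phase_fraction / phase_speedup
    return 1.0 / new_time


# Hypothetical: GET-heavy phases are 27% of the workflow's wall-clock time.
print(f"{overall_speedup(1.93, 0.27):.2f}x")  # prints 1.15x
```

The peak number only becomes the average number when `phase_fraction` approaches 1, which is exactly the benchmark-friendly condition a mixed workload doesn't meet.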

2. Implementation timelines are universally underestimated.

Across every technology category in the portfolio, vendor timeline estimates came in 2–5× shorter than what practitioners actually experienced, and the "AI agent for defensive security" category is the worst.

Category	Vendor-quoted	Practitioner-actual
Security data pipeline platforms	2–3 months	6–8 months
CAASM (cyber asset attack surface management)	3–4 months	8–12 months
EPSS (exploit prediction scoring system) integrations	1–2 months	6–12 months
"AI agent for defensive security"	"immediate"	6–9 months, once human-in-the-loop oversight requirements are accounted for

Roughly 67% of organizations need external consulting to land a security data pipeline platform (SDPP) deployment successfully. Hidden costs average 40% above initial vendor estimates, and specialized expertise requirements consistently exceed vendor projections. None of this shows up in the demo.

The mechanism is structural. Vendors optimize their demos for time-to-first-value: sanitized data, pre-configured environments, zero integration complexity. Real deployments carry legacy integrations (absent from the demo), custom security requirements (absent from the benchmark), organizational change management (absent from the SOW), and skills development (absent from the TCO model). A TCO model that omits the work that takes the most calendar time will get the calendar wrong.

3. Some claims hold up under scrutiny.

The 65–75% reduction in downstream SIEM licensing costs from a security data pipeline platform is real and reproducible; multiple independent case studies confirm it. The market has voted too: Cribl carries a $3.5B valuation, with 400+ enterprise customers and Fortune 500 penetration of around 67%. Implementation costs still run 40% over estimates, but the cost-reduction claim itself is defensible.
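The cost claim and the overrun can be combined into a simple payback estimate. The 65–75% reduction and the 40% overrun come from the findings above; the $1.2M licensing spend, the $300k implementation quote, and the optional platform subscription cost are hypothetical inputs. The model deliberately ignores discounting, training, and migration costs, so treat it as a sketch rather than a TCO.

```python
import math


def payback_months(annual_siem_license: float, reduction: float, quoted_implementation: float,
                   overrun: float = 0.40, annual_platform_cost: float = 0.0) -> float:
    """Months until net licensing savings cover the implementation cost, overrun included.

    reduction:            fraction of downstream SIEM licensing removed (0.65-0.75 per the findings)
    overrun:              implementation cost above the vendor estimate (40% per the findings)
    annual_platform_cost: hypothetical yearly cost of the pipeline platform itself
    """
    monthly_net_savings = (annual_siem_license * reduction - annual_platform_cost) / 12
    if monthly_net_savings <= 0:
        return math.inf  # the platform never pays for itself
    return quoted_implementation * (1 + overrun) / monthly_net_savings


# Hypothetical inputs: $1.2M/yr SIEM licensing, $300k quoted implementation.
for reduction in (0.65, 0.75):
    months = payback_months(1_200_000, reduction, 300_000)
    print(f"{reduction:.0%} reduction: payback in {months:.1f} months")
```

With those inputs payback lands between about 5.6 and 6.5 months: the 40% overrun delays the return without erasing it. A nonzero `annual_platform_cost` lengthens the payback and is the first input worth replacing with real numbers.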

What distinguishes durable claims from inflated ones is independent corroboration across multiple production deployments, transparent acknowledgment of implementation challenges, and TCO models that include migration, training, and ongoing operational costs. Mature vendors and durable categories can show all three, so their numbers hold up outside the vendor's own environment; the inflated ones can't.

4. Framework developers offer the most balanced perspective.

The best validation across the portfolio came from engaging framework creators directly: the people with deep technical expertise and no sales quota. Jay Jacobs at the FIRST EPSS SIG put the EPSS effort-reduction claim at 70–80% (slightly under the 85% marketing number), with caveats. EPSS can't predict zero-days, requires 6–12 months of organizational calibration, and needs professional services for threshold tuning in the large majority of organizations. The effort reduction is real, but it isn't "plug and play."
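Threshold tuning is where much of that calibration effort goes: every EPSS cut-off trades the share of the backlog you can skip against how many later-exploited vulnerabilities you still catch. The sketch below measures that trade-off as efficiency (precision) and coverage (recall), the framing the EPSS documentation uses. The vulnerability IDs, scores, and exploitation labels are made up; in practice the scores would come from FIRST's published EPSS data and the labels from your own exploitation history.

```python
def threshold_report(scores: dict[str, float], exploited: set[str], threshold: float) -> dict:
    """Effort reduction, efficiency, and coverage of remediating everything at or above a threshold.

    scores:    vulnerability id -> EPSS probability (0..1)
    exploited: ids later observed exploited (the ground truth used for calibration)
    """
    if not scores:
        raise ValueError("no scores to evaluate")
    selected = {vid for vid, p in scores.items() if p >= threshold}
    hits = selected & exploited
    return {
        "effort_reduction": 1 - len(selected) / len(scores),          # share of the backlog skipped
        "efficiency": len(hits) / len(selected) if selected else 0.0,  # precision of the fix list
        "coverage": len(hits) / len(exploited) if exploited else 0.0,  # recall of exploited vulns
    }


# Hypothetical backlog: ten made-up vulnerabilities, three of which were later exploited.
scores = {
    "vuln-01": 0.92, "vuln-02": 0.41, "vuln-03": 0.08, "vuln-04": 0.03, "vuln-05": 0.02,
    "vuln-06": 0.01, "vuln-07": 0.01, "vuln-08": 0.005, "vuln-09": 0.004, "vuln-10": 0.002,
}
exploited = {"vuln-01", "vuln-02", "vuln-04"}

for threshold in (0.05, 0.01):
    print(threshold, threshold_report(scores, exploited, threshold))
```

On this toy backlog, a 0.05 threshold skips 70% of the work but misses one exploited item, while 0.01 catches all three and skips only 30%. Where to sit on that curve is an organizational decision, which is part of why calibration takes time.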

Peter Kaloroumakis at MITRE D3FEND clarified that "Wall Theory" (a derived interpretation that shows up in some practitioner literature) isn't MITRE-endorsed methodology, although the 45–75% defensive-improvement figure is achievable through custom implementation with 3–5 dedicated staff over 18–30 months. The framework developer separated what was possible from what was being asserted as official; the marketing literature blurred the two.

The pattern: framework developers acknowledge both capabilities and limitations, while vendors emphasize capabilities and downplay limitations. The two perspectives lead to different architecture decisions, and the framework developer's is usually the better guide to what the framework can actually do.

5. AI claims need the most skepticism.

The hypothesis that started the AI agent thread of the research was "90% success rate in administrative security tasks, 60% talent reduction." The contradictions surfaced quickly. Large language models routinely generate plausible-but-incorrect security advice (the hallucination risk), and advanced models exhibit deceptive behavior under adversarial pressure (the goal-misalignment risk). Human oversight on security-critical decisions is therefore structural, not a nice-to-have.

The revised position: 70–85% success rates with mandatory human-in-the-loop oversight, and 40–50% talent reduction once the oversight workload is included, with audit trails, approval workflows, and verification systems treated as non-negotiable. The AI agent capability is real; the "fire-and-forget autonomous SOC analyst" framing is dangerous. The right posture treats AI as augmentation that requires oversight, not as replacement.
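Those non-negotiables translate into a simple gate: security-critical actions wait for a human decision, and every decision lands in an audit trail. This is a hypothetical sketch of the pattern, not any product's API. `ProposedAction`, `ApprovalGate`, and the example actions are invented, and a real deployment would persist the log and authenticate the reviewer.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class ProposedAction:
    description: str
    security_critical: bool  # e.g. isolating a host or disabling an account


@dataclass
class ApprovalGate:
    approve: Callable[[ProposedAction], bool]  # the human reviewer's decision
    audit_log: list[dict] = field(default_factory=list)

    def submit(self, action: ProposedAction) -> bool:
        """Auto-approve low-risk actions; route security-critical ones to a human reviewer."""
        if action.security_critical:
            approved, decided_by = self.approve(action), "human"
        else:
            approved, decided_by = True, "policy"
        # Record every decision, approved or rejected, so it can be verified later.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": action.description,
            "security_critical": action.security_critical,
            "approved": approved,
            "decided_by": decided_by,
        })
        return approved


# Usage: a reviewer who rejects everything; the agent's proposals pass through the gate.
gate = ApprovalGate(approve=lambda action: False)
gate.submit(ProposedAction("enrich alert with threat intel", security_critical=False))
gate.submit(ProposedAction("disable user account", security_critical=True))
```

Logging rejected proposals as well as approved ones is the point: the verification step needs to see what the agent wanted to do, not only what it was allowed to do.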

The category-level pattern: AI marketing is moving faster than AI capability, so the gap between promise and production is wider here than anywhere else in the portfolio. The contradictions surface fastest in adversarial-research literature, not in vendor case studies.

What this means for the people buying this work

Four practices that raise the floor on architecture decisions.

Stop accepting vendor benchmarks unverified.

Every vendor benchmark is a central claim, so test it against four sources: GitHub issues for the technology you're evaluating, independent analyst work that wasn't vendor-funded, practitioners who have run it in production at scale, and framework developers for the canonical reading of what the framework can and can't do. The specific question to ask is: under what workload shape does this number stop being true? If the vendor can't answer it, the number isn't doing the work the proposal needs it to do.
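The workload-shape question can be made mechanical: restate the vendor number as a testable claim ("96% of queries finish under one second"), measure it under each condition that matters, and list the conditions where it breaks. The latency samples below are hypothetical, shaped after the row-level-security result described earlier, and the function names are illustrative.

```python
def share_under(samples: list[float], limit_s: float) -> float:
    """Fraction of query latencies (seconds) strictly below limit_s."""
    return sum(1 for s in samples if s < limit_s) / len(samples)


def conditions_where_claim_fails(latencies: dict[str, list[float]],
                                 limit_s: float, claimed_share: float) -> list[str]:
    """Conditions under which 'claimed_share of queries finish under limit_s' does not hold."""
    return [
        condition
        for condition, samples in latencies.items()
        if samples and share_under(samples, limit_s) < claimed_share  # skip conditions with no data
    ]


# Hypothetical measurements of the same query set under two configurations.
latencies = {
    "baseline, no row-level security": [0.21, 0.38, 0.45, 0.62, 0.80],
    "row-level security enabled": [18.1, 18.6, 19.4, 20.2, 22.3],
}

print(conditions_where_claim_fails(latencies, limit_s=1.0, claimed_share=0.96))
```

An empty result means the claim survived every condition you thought to test, which is only as strong as the list of conditions.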

Plan for 2–3× vendor timelines.

When the vendor says 3 months, budget 6–9. When the vendor says no consulting is required, budget for outside expertise. When the vendor says "plug and play," expect significant custom development. The pattern held across every category in the portfolio, so treating it as the default saves months of replanning later.
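As a planning default, the rule can be written down once. The 2–3× timeline factor comes from this section and the 40% hidden-cost rate from the security data pipeline findings; applying that rate to other categories, and the $250k quote in the example, are assumptions.

```python
def planning_budget(vendor_months: float, vendor_cost: float,
                    timeline_factor: tuple[float, float] = (2.0, 3.0),
                    hidden_cost_rate: float = 0.40) -> dict:
    """Convert a vendor quote into a planning range.

    timeline_factor:  low/high multipliers on the vendor timeline (2-3x per this section)
    hidden_cost_rate: markup for costs absent from the quote (40%, from the SDPP findings)
    """
    low, high = timeline_factor
    return {
        "months": (vendor_months * low, vendor_months * high),
        "cost": vendor_cost * (1 + hidden_cost_rate),
    }


# Hypothetical quote: 3 months and $250k.
plan = planning_budget(3, 250_000)
print(plan["months"])  # (6.0, 9.0)
```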

Demand balanced evidence in the procurement conversation.

The four questions that separate transparent vendors from ones to walk away from:

What are the top three reasons customers fail to achieve these results?
Can I speak with a customer who experienced implementation challenges?
What are the documented limitations of this technology in production?
What organizational maturity prerequisites must be met?

Vendors who can answer these are the ones whose numbers are worth more than the marketing they're attached to. Vendors who deflect are telling you something about the gap between the claim and the reality.

Build contradiction discovery into the standard process.

Make it a structured part of every architecture evaluation rather than a stretch goal:

Week one: gather supporting evidence (the traditional approach).
Week two: hunt contradictions deliberately (GitHub issues, academic literature, independent analysts).
Week three: synthesize both perspectives into a balanced position.
Week four: validate with experts, including framework developers and production practitioners.

The result is evidence-based decisions that survive contact with deployment, instead of confirmation-biased ones whose problems show up once contracts are signed.

The method is what makes the research portable.

The ten anchor hypotheses, the contradictions log, and the running update history are all on the research page. The thesis page connects them to the program POV.
