Security Data Works

The method · Research

Hunt for contradictions before confirmation.

Most architecture evaluations form a hypothesis, find evidence that supports it, declare victory. This practice flips that order — the supporting evidence is the easy part; the contradictions are where the position actually gets stress-tested. Across 25 hypotheses, that flip changed roughly 60% of the load-bearing claims.

The flip

A hypothesis is only worth something if you tried to break it.

The default research pattern in this industry runs in one direction. You form a position — "ClickHouse will give us sub-second security analytics" — and you go gather supporting evidence. Vendor benchmarks. Case studies. Analyst reports. Within a week you have a stack of corroboration and a recommendation. The pattern feels rigorous. It isn't.

Confirmation bias makes the supporting evidence cheap and the contradictory evidence expensive. The contradictory evidence lives in places vendor marketing never surfaces — GitHub issues where practitioners describe what actually broke in production, academic papers that characterize the failure modes, framework developers who acknowledge the limitations the sales material leaves out. Finding it requires looking specifically for it.

The flip this practice runs (a code sketch follows the list):

  • Form the hypothesis. Specific, falsifiable, with a number where possible. "ClickHouse runs security workloads 100× faster than the dominant schema-on-read SIEM" is testable; "ClickHouse is good for security analytics" is not.
  • Gather supporting evidence. The easy step. Time-box it.
  • Hunt for contradictions. The hard step. Search GitHub issues for the technology in question. Read the academic literature on its failure modes. Find independent analysts whose work isn't vendor-funded. Engage framework developers — the people who built the thing have less reason to oversell it than the people selling it.
  • Engage practitioners and developers directly. Production deployments are the only environment where vendor claims meet real workloads. The people running them are the highest-quality source.
  • Synthesize a balanced assessment. Both perspectives in the same place, with the load-bearing claim and the conditions under which it fails.
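A minimal sketch of what the resulting ledger can look like in code, using only the standard library; the field names and the verdict heuristic are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """One falsifiable claim plus the evidence gathered on both sides."""
    claim: str                                          # specific, with a number where possible
    supporting: list = field(default_factory=list)      # vendor benchmarks, case studies
    contradicting: list = field(default_factory=list)   # GitHub issues, papers, practitioners
    failure_conditions: list = field(default_factory=list)

    def verdict(self) -> str:
        # No contradiction hunt means the claim is untested, not confirmed.
        if not self.contradicting and not self.failure_conditions:
            return "untested: contradiction hunt not run yet"
        if self.failure_conditions:
            return "qualified: holds only under documented conditions"
        return "held up: survived a deliberate contradiction hunt"

h = Hypothesis(claim="ClickHouse runs security workloads 100x faster than schema-on-read SIEM")
h.supporting.append("Cloudflare production deployment: 96% of queries under 1s")
h.contradicting.append("GitHub issues: 10-20x degradation with row-level security enabled")
h.failure_conditions.append("multi-tenant row policies active")
print(h.verdict())  # qualified: holds only under documented conditions
```

The point of the structure is the empty-list check: a hypothesis with nothing in the contradicting column hasn't passed review, it just hasn't been tested yet.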

Across 25 hypotheses, this process produced 150+ primary sources and documented contradictions for roughly 80% of vendor technology claims. Around 60% of vendor performance claims required significant contextual qualification or outright revision. The other 40% held up — which is the part that matters: not "vendors always lie," but "the load-bearing claims have to be tested individually, and most of the testable ones move once you actually test them."

Five patterns

What 25 hypotheses revealed about how vendor claims actually work.

1. Performance claims are selectively true.

The ClickHouse "sub-second on billion-row datasets" claim is technically correct. Cloudflare's production deployment bears it out: 96% of queries complete in under one second. Independent benchmarks confirm 100–1000× advantages over legacy SIEM platforms. Then row-level security gets enabled — the access-control feature that any multi-tenant enterprise environment requires — and the same queries degrade to 18+ seconds. A 10–20× performance penalty for the controls real SOCs need.
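One way to test that load-bearing claim against your own workload is to time the same query before and after the access control is enabled. A minimal harness sketch, assuming a local ClickHouse instance and the clickhouse-driver package; the security_events table, its tenant_id column, and the policy name are hypothetical:

```python
import time
from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client(host="localhost")  # assumes a local ClickHouse test instance

# Hypothetical analytics query against a hypothetical security_events table.
QUERY = "SELECT count() FROM security_events WHERE event_time > now() - INTERVAL 1 DAY"

def one_run() -> float:
    start = time.perf_counter()
    client.execute(QUERY)
    return time.perf_counter() - start

def timed(label: str, runs: int = 5) -> float:
    """Best-of-N timing to dampen cache and warm-up noise."""
    best = min(one_run() for _ in range(runs))
    print(f"{label}: {best:.2f}s")
    return best

baseline = timed("no row policy")

# Enable the multi-tenant control the vendor benchmark leaves out.
# CREATE ROW POLICY is standard ClickHouse DDL; the names are illustrative.
client.execute("""
    CREATE ROW POLICY IF NOT EXISTS tenant_isolation ON security_events
    USING tenant_id = currentUser() TO ALL
""")

with_rls = timed("row policy enabled")
print(f"penalty: {with_rls / baseline:.1f}x")
```

Run it against a realistically sized dataset and a realistic tenant mix; the penalty only shows up at the scale the benchmark was bragging about.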

The same shape recurs with MinIO. The "93% faster than HDFS" claim applies to specific benchmark phases (GET operations at particular object sizes); overall workflow improvement is closer to 15%. The 15% is real value. The 93% is peak performance under ideal conditions, presented as average improvement.
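The gap between the two numbers is ordinary Amdahl's-law arithmetic. A back-of-envelope check; the GET share of workflow time below is back-derived so the result matches the ~15% practitioner figure, not an independently measured value:

```python
# Amdahl's law: time_saved_overall = p * (1 - 1/s), where p is the fraction
# of workflow time in the accelerated phase and s is that phase's speedup.

per_op_improvement = 0.93             # "93% faster" on GETs (vendor figure)
s = 1 / (1 - per_op_improvement)      # ~14.3x speedup on the GET phase

p = 0.16  # ASSUMED share of workflow time spent in GETs (back-derived)

overall = p * (1 - 1 / s)             # fraction of total runtime saved
print(f"overall improvement: {overall:.0%}")  # ~15%
```

If the accelerated phase is a sixth of the workflow, even a 14× phase speedup moves the whole workflow by roughly 15%. Both numbers are true; only one describes your calendar.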

The pattern: vendor benchmarks highlight peak performance under benchmark-friendly conditions. Real-world value lives in sustained average performance across mixed workloads, with the security controls and operational requirements real environments carry. Both numbers can be defensible; the question is which one is being put to work.

2. Implementation timelines are universally underestimated.

Across every technology category in the portfolio, vendor timeline estimates ran 2–5× short of practitioner reality. Security data pipeline platforms quoted at 2–3 months land closer to 6–8. CAASM (cyber asset attack surface management) deployments quoted at 3–4 months take 8–12. EPSS (exploit prediction scoring system) integrations quoted at 1–2 months take 6–12. The "AI agent for defensive security" category is the worst — the "immediate" timeline lands at 6–9 months once human-in-the-loop oversight requirements are accounted for.

Roughly 67% of organizations need external consulting to land an SDPP deployment successfully. Hidden costs average 40% above initial vendor estimates. Specialized expertise requirements consistently exceed vendor projections. None of this is in the demo.

The mechanism is structural. Vendors optimize their demos for time-to-first-value with sanitized data, pre-configured environments, and zero integration complexity. Real deployments carry legacy integrations (absent from the demo), custom security requirements (absent from the benchmark), organizational change management (absent from the SOW), and skills development (absent from the TCO model). The TCO that doesn't include the things that take the most calendar time arrives at the wrong calendar.

3. Some claims hold up under scrutiny.

The 65–75% reduction in downstream SIEM licensing costs from a security data pipeline platform is real and reproducible. Multiple independent case studies confirm it. The market has voted — Cribl carries a $3.5B valuation with 400+ enterprise customers; Fortune 500 penetration is around 67%. Implementation costs still run 40% over estimates, but the cost-reduction claim itself is defensible.

The pattern that distinguishes the durable claims from the inflated ones: independent corroboration across multiple production deployments, transparent acknowledgment of implementation challenges, and TCO models that include migration costs, training, and ongoing operational requirements. Mature vendors and durable categories let those numbers travel; the inflated ones can't.

4. Framework developers offer the most balanced perspective.

The best validation across the portfolio came from engaging framework creators directly — the people with deep technical expertise and no sales quota. Jay Jacobs at the FIRST EPSS SIG put the EPSS effort-reduction claim at 70–80% (slightly under the 85% marketing number), with the caveats that EPSS can't predict zero-days, requires 6–12 months of organizational calibration, and needs professional services for threshold tuning in the large majority of organizations. Real, but not "plug and play."
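The scores themselves are free to pull from the public FIRST API; the calibration work Jacobs describes lives in the threshold, not the plumbing. A minimal sketch; the 0.1 cutoff is a placeholder for exactly that tuning, not a recommended value:

```python
import requests  # pip install requests

EPSS_API = "https://api.first.org/data/v1/epss"  # public FIRST endpoint

def epss_score(cve_id: str) -> float:
    """Fetch the current EPSS exploitation probability for one CVE."""
    resp = requests.get(EPSS_API, params={"cve": cve_id}, timeout=10)
    resp.raise_for_status()
    data = resp.json()["data"]
    return float(data[0]["epss"]) if data else 0.0

# Placeholder threshold: the 6-12 months of organizational calibration is
# precisely the work of finding where this cutoff belongs for your estate.
THRESHOLD = 0.1

for cve in ["CVE-2021-44228", "CVE-2019-0708"]:
    score = epss_score(cve)
    action = "prioritize" if score >= THRESHOLD else "defer"
    print(f"{cve}: epss={score:.3f} -> {action}")
```

Ten lines of integration, months of threshold tuning: that ratio is the honest version of the 85% marketing number.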

Peter Kaloroumakis at MITRE D3FEND clarified that "Wall Theory" — a derived interpretation showing up in some practitioner literature — isn't actually MITRE-endorsed methodology, even if the 45–75% defensive improvement number is achievable through custom implementation with 3–5 dedicated staff over 18–30 months. The framework developer acknowledged what was possible and what was being asserted as official; the marketing literature blurred the two.

The pattern: framework developers acknowledge both capabilities and limitations. Vendors emphasize capabilities and downplay limitations. The two perspectives lead to different architecture decisions. The framework developer is the better source on what the framework can do.

5. AI claims need the most skepticism.

The hypothesis that started the AI agent thread of the research: "90% success rate in administrative security tasks, 60% talent reduction." The contradictions surfaced quickly. Hallucination risk: large language models routinely generate plausible-but-incorrect security advice. Goal misalignment: advanced models exhibit deceptive behavior under adversarial pressure. Human oversight is not a nice-to-have for security-critical decisions; it's structural.

The revised position: 70–85% success rates with mandatory human-in-the-loop oversight, 40–50% talent reduction once oversight workload is included. Audit trails, approval workflows, verification systems — non-negotiable. AI agent capability is real; the "fire-and-forget autonomous SOC analyst" framing is dangerous. Treat AI as augmentation requiring oversight, not replacement.
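Mechanically, "mandatory human-in-the-loop" means a default-deny gate between what the agent proposes and what actually executes, with the audit trail written on every path. A minimal sketch; the action names, risk tiers, and log file are illustrative:

```python
import json
import time

AUDIT_LOG = "agent_audit.jsonl"  # hypothetical append-only audit trail

# Illustrative risk tiers; a real deployment derives these from policy.
AUTO_APPROVED = {"enrich_alert", "fetch_threat_intel"}
NEEDS_HUMAN = {"isolate_host", "disable_account", "block_ip"}

def record(entry: dict) -> None:
    """Write to the audit trail whether or not the action ran."""
    entry["ts"] = time.time()
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")

def gate(action: str, target: str) -> bool:
    """Default-deny gate between agent proposal and execution."""
    if action in AUTO_APPROVED:
        record({"action": action, "target": target, "decision": "auto"})
        return True
    if action not in NEEDS_HUMAN:
        record({"action": action, "target": target, "decision": "denied: unknown action"})
        return False
    # Stand-in for a real approval workflow (ticketing, chat approval, etc.).
    answer = input(f"Agent proposes {action} on {target}. Approve? [y/N] ")
    decision = "approved" if answer.strip().lower() == "y" else "denied"
    record({"action": action, "target": target, "decision": decision})
    return decision == "approved"

if gate("isolate_host", "workstation-042"):
    print("executing isolate_host")  # the agent acts only after the gate
```

The design choice that matters is the second branch: anything the policy hasn't classified is denied and logged, not escalated to a human by default and certainly not executed.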

The category-level pattern: AI marketing is moving faster than AI capability, the gap between promise and production is wider here than anywhere else in the portfolio, and the contradictions surface fastest in adversarial-research literature rather than vendor case studies.

What this means for the people buying this work

Four practices that move the floor on architecture decisions.

Stop accepting vendor benchmarks unverified.

Every vendor benchmark is a load-bearing claim. Test it. GitHub issues for the technology you're evaluating. Independent analyst work that wasn't vendor-funded. Practitioners who hit production at scale. Framework developers for the canonical reading on what the framework can and can't do. The specific question to ask: what's the workload shape under which this number stops being true? If the vendor can't answer it, the number isn't doing the work the proposal needs it to do.

Plan for 2–3× vendor timelines.

Vendor says 3 months: budget 6–9. Vendor says no consulting required: budget for outside expertise. Vendor says "plug and play": expect significant custom development. The pattern is universal across categories; treating it as universal saves months of replanning later.
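The correction is mechanical enough to write down. A sketch applying the multipliers this portfolio observed; the example inputs are hypothetical, and your categories will carry their own factors:

```python
def adjusted_plan(vendor_months: float, vendor_cost: float) -> dict:
    """Apply the portfolio-wide correction factors to a vendor estimate:
    2-3x on timelines, ~40% hidden costs, and budget for the outside
    expertise that roughly two in three deployments end up needing."""
    return {
        "months_low": vendor_months * 2,
        "months_high": vendor_months * 3,
        "cost_with_hidden": round(vendor_cost * 1.40),
        "budget_external_consulting": True,
    }

# Hypothetical quote: 3 months, $250k.
print(adjusted_plan(vendor_months=3, vendor_cost=250_000))
```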

Demand balanced evidence in the procurement conversation.

The four questions that separate the transparent vendors from the ones to walk away from:

  • What are the top three reasons customers fail to achieve these results?
  • Can I speak with a customer who experienced implementation challenges?
  • What are the documented limitations of this technology in production?
  • What organizational maturity prerequisites must be met?

Vendors who can answer these are the ones whose numbers are worth more than the marketing they're attached to. Vendors who deflect are telling you something about the gap between the claim and the reality.

Build contradiction discovery into the standard process.

Make it a structured part of every architecture evaluation, not a stretch goal. Week one gathers supporting evidence (the traditional approach). Week two hunts contradictions deliberately (GitHub issues, academic literature, independent analysts). Week three synthesizes both perspectives into a balanced position. Week four validates with experts — framework developers, production practitioners. The result is evidence-based decisions that survive contact with deployment, not confirmation-biased ones that surface their problems once contracts are signed.

The method is what makes the research portable.

The eight anchor hypotheses, the contradictions log, and the running update history are all on the research page. The thesis page connects them to the program POV.