Security Data Works

H1-PLATFORM-01 · Tier B-C · 4/5 (extension)

Policy without pipes doesn't ship.

The DSPM (Data Security Posture Management), DLP (Data Loss Prevention), and AI Security categories converged hard in 2024–2025. Netskope IPO'd at $7.3B. Cyera reached unicorn status. Cyera, Securiti, Palo Alto, and a dozen others promise sub-100ms policy enforcement and AI-driven classification. The promises assume a data platform most organizations don't have. The data platform has to come first.

The observation

"Everyone's building the policy layer. Nobody's talking about the pipes."

That's Matthias Vallentin, CEO of Tenzir, in November 2025 — calling out an architectural gap that becomes obvious once it's named. DSPM, DLP, and AI Security platforms assume the customer has high-performance security data infrastructure already in place: streaming telemetry, schema-on-read flexibility, real-time enrichment, bidirectional lineage tracking, columnar storage optimized for analytics. Most organizations have none of that. They have batch ETL jobs, Logstash configs that haven't been touched in three years, Kafka clusters held together with duct tape and good intentions. The policy layer gets deployed on top of that, and the performance claims evaporate.

This is the same evidence pattern visible in the broader anchor hypothesis on the platform baseline: decisions made at the foundation layer cascade for years. The DSPM/DLP/AI Security category just makes the cascade more visible in procurement because the policy-layer marketing language is so specific (sub-100ms, real-time, automated remediation) that the gap with the underlying infrastructure shows up immediately.

The claims, against the infrastructure most shops actually run

Sub-100ms latency on a 60-second-batch pipeline doesn't end well.

"Sub-100ms inline DLP latency."

The claim requires telemetry flowing at wire speed (not batch-collected every five minutes), normalized schema available at query time (not raw JSON dumps), enrichment happening in-stream (not in a separate overnight ETL job), and a policy engine with sub-100ms decision time (not polling a database). If the pipeline is "Logstash → S3 → nightly Spark job → Parquet," the latency floor is 24 hours, not 100 milliseconds. The vendor isn't lying; they're describing performance on infrastructure the customer doesn't have.
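
The arithmetic behind that floor is worth making explicit. A minimal sketch, using placeholder stage delays rather than measurements from any real deployment, of why a nightly-batch pipeline bottoms out in hours while a streaming pipeline can plausibly sit under 100 milliseconds:

```python
# Illustrative latency budget. The stage delays are placeholders chosen to match
# the two architectures described above, not measurements from a specific deployment.

BATCH_PIPELINE_SECONDS = {
    "collection_interval": 5 * 60,       # Logstash pulls batched every five minutes
    "s3_landing": 60,                    # object written and visible for listing
    "nightly_spark_job": 24 * 3600,      # worst case: event lands just after the job ran
    "policy_engine_poll": 60,            # policy engine polling a results table
}

STREAMING_PIPELINE_SECONDS = {
    "produce_to_consume": 0.04,          # event on the stream, picked up by a consumer
    "in_stream_enrichment": 0.02,        # normalization and enrichment in the pipeline
    "policy_decision": 0.03,             # in-memory policy evaluation
}

def worst_case(stages: dict[str, float]) -> float:
    """Worst-case end-to-end latency is the sum of the per-stage delays."""
    return sum(stages.values())

print(f"batch floor:     {worst_case(BATCH_PIPELINE_SECONDS) / 3600:.1f} hours")
print(f"streaming floor: {worst_case(STREAMING_PIPELINE_SECONDS) * 1000:.0f} ms")
```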

"Real-time AI classification and remediation."

Requires continuous discovery scanning (not scheduled weekly scans), real-time classification at ingest (not background processing), streaming policy evaluation (not batch rule checks), and immediate remediation triggers (not ticketing systems). If the discovery model is "agentless scan across S3, Snowflake, and on-premises Postgres once a week," the deployment is doing weekly detection with manual remediation. Real, but not real-time.

"Comprehensive data lineage."

Requires metadata captured at every transformation step, bidirectional lineage tracking (upstream sources and downstream consumers), schema change history, and a queryable lineage graph. If the lineage artifact is "we have a Confluence page documenting our data flow," the lineage doesn't exist; it's aspirational documentation, six months out of date by the time anyone needs to query it.

What the pipe layer actually needs

Four things, in roughly this order, before any policy platform earns its claims.

Streaming-first architecture (not batch-first).

Telemetry flows continuously rather than on scheduled pulls. Processing happens in-stream rather than in batch jobs. Latency is measured in seconds, not hours or days. Backpressure is handled rather than papered over. The component categories: Apache Kafka or equivalent for event streaming; Apache Flink or Spark Streaming for stream processing; Tenzir, Cribl, or comparable for the security-specific streaming pipelines. The anti-pattern: Logstash → S3 → nightly Spark job, dressed up as a streaming architecture. It isn't.
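
A minimal sketch of what streaming-first means in practice, using kafka-python; the topic name, broker address, and handler are illustrative assumptions, not a reference design:

```python
# Streaming-first consumer sketch: continuous pull, processing in-stream, commit
# only after an event is handled. Topic and broker are hypothetical.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "security-telemetry",                  # hypothetical topic
    bootstrap_servers=["localhost:9092"],  # hypothetical broker
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    enable_auto_commit=False,              # acknowledge only after processing
    max_poll_records=500,                  # bound each poll so a slow consumer degrades gracefully
)

def handle(event: dict) -> None:
    # Normalization, enrichment, and policy checks happen here, in-stream,
    # rather than in a scheduled batch job downstream.
    ...

for message in consumer:                   # continuous consumption: seconds of latency, not hours
    handle(message.value)
    consumer.commit()                      # commit the offset once the event is processed
```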

Schema normalization at ingest (not post-processing).

OCSF (Open Cybersecurity Schema Framework) or ECS (Elastic Common Schema) mapping happens during collection. Schema-on-read flexibility for multiple incoming formats. Field validation at ingest to catch errors at the source. Backward compatibility when schemas evolve. The anti-pattern: dump raw vendor logs to S3 and figure out the schema downstream, which produces technical debt that compounds across every consumer of that data.
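
A minimal sketch of normalization at ingest; the vendor field names are invented for illustration, and a real deployment would map to the published OCSF or ECS field definitions. The point is where the mapping runs: at collection, before storage, so every downstream consumer sees one schema.

```python
# Normalize-at-ingest sketch. Vendor field names and the target mapping are
# illustrative; a real pipeline maps to the published OCSF or ECS definitions.

RAW_VENDOR_EVENT = {
    "srcip": "10.0.0.12",
    "dstip": "93.184.216.34",
    "act": "allowed",
    "ts": "2025-11-03T14:22:07Z",
}

def normalize(raw: dict) -> dict:
    """Map vendor-specific fields to a common schema before the event hits storage."""
    required = ("srcip", "dstip", "act", "ts")
    missing = [f for f in required if f not in raw]
    if missing:
        # Field validation at ingest: reject or quarantine malformed events at the source.
        raise ValueError(f"event missing required fields: {missing}")
    return {
        "@timestamp": raw["ts"],      # ECS-style timestamp field
        "source.ip": raw["srcip"],
        "destination.ip": raw["dstip"],
        "event.action": raw["act"],
        "event.original": raw,        # keep the raw record for schema-on-read needs
    }

print(normalize(RAW_VENDOR_EVENT))
```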

Metadata layer with lineage tracking.

Every transformation step tracked automatically. Bidirectional lineage — upstream sources and downstream consumers — queryable rather than documented. Schema change history preserved. The component categories: Apache Iceberg with its built-in metadata tracking; Unity Catalog or AWS Glue Data Catalog for governance and metadata management. The anti-pattern: lineage as Confluence pages, manually maintained, perpetually stale.
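
A minimal sketch of lineage as a queryable structure rather than a document; the dataset names are hypothetical, and in practice these records come from Iceberg snapshots and the catalog rather than hand-written code:

```python
# Lineage-record sketch: one record per transformation step, queryable in both
# directions. Dataset names and transforms are hypothetical.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageEdge:
    inputs: list[str]          # upstream datasets
    output: str                # downstream dataset
    transform: str             # what happened at this step
    schema_version: str        # preserved so schema history stays queryable
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

EDGES = [
    LineageEdge(["kafka://security-telemetry"], "s3://lake/bronze/events", "normalize_to_ecs", "v3"),
    LineageEdge(["s3://lake/bronze/events"], "s3://lake/silver/events", "enrich_with_asset_context", "v3"),
]

def upstream(dataset: str) -> list[str]:
    """Bidirectional lineage makes this a query, not a Confluence page."""
    return [src for e in EDGES if e.output == dataset for src in e.inputs]

def downstream(dataset: str) -> list[str]:
    return [e.output for e in EDGES if dataset in e.inputs]

print(upstream("s3://lake/silver/events"))
print(downstream("s3://lake/bronze/events"))
```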

Columnar storage for analytics.

Parquet or Apache Arrow for columnar analytical queries. Column-level compression delivering 10–20× reductions in storage cost. Scan efficiency from reading only the columns the query needs. Zero-copy in-memory transformations where the analytics engine supports them. The component categories: Parquet on the storage side, Arrow for in-memory representation, columnar engines like DuckDB or ClickHouse on the query side. The anti-pattern: compressed JSON in S3, slow and expensive to query, and the foundation many DSPM-backed deployments are quietly running on.
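
A minimal sketch of the columnar path, assuming pyarrow and DuckDB and an invented events table; the file path and columns are placeholders:

```python
# Columnar-storage sketch: build an Arrow table, write compressed Parquet, and let
# a columnar engine scan only the columns the query needs. Data is illustrative.
import pyarrow as pa
import pyarrow.parquet as pq
import duckdb

events = pa.table({
    "timestamp":      ["2025-11-03T14:22:07Z", "2025-11-03T14:22:09Z"],
    "source_ip":      ["10.0.0.12", "10.0.0.57"],
    "destination_ip": ["93.184.216.34", "93.184.216.34"],
    "action":         ["allowed", "blocked"],
})

# Column-level compression and encoding are where the storage savings come from.
pq.write_table(events, "events.parquet", compression="zstd")

# The engine reads only the referenced columns; no full-row scans over JSON blobs.
print(duckdb.sql("""
    SELECT destination_ip, count(*) AS hits
    FROM 'events.parquet'
    GROUP BY destination_ip
    ORDER BY hits DESC
""").fetchall())
```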

The build order

Pipes before policy. Then the policy claims start working.

The deployment sequence that survives contact with the policy-layer marketing claims:

  1. Streaming infrastructure — Kafka, Flink, Cribl, Tenzir — to get telemetry flowing in real-time rather than on schedule.
  2. Schema normalization — OCSF, ECS, or equivalent — at ingest, mapping vendor formats to a common standard before they hit storage.
  3. Metadata layer — Iceberg, Unity Catalog, Glue, or comparable — for lineage tracking and governance, with the metadata captured automatically rather than written by hand.
  4. Columnar storage — Parquet, Arrow, ClickHouse, DuckDB — optimized for the analytical queries the policy platform will issue against it.
  5. Then the DSPM, DLP, or AI Security platform on top, with the underlying data platform able to support the latency and discovery patterns the platform actually requires (a minimal readiness check for this step is sketched below).
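
One way to gate step 5 is a readiness check against the first four layers: sample events flowing through the pipeline, measure ingest-to-queryable latency, and compare the p95 against the policy platform's headline number. A minimal sketch, with placeholder measurements standing in for sampled live events:

```python
# Readiness check before deploying the policy layer. The latency samples below are
# placeholders; a real check tags sampled events at produce time and measures when
# they become queryable at the layer the policy platform reads from.
import random
import statistics

CLAIMED_LATENCY_MS = 100.0                       # the vendor's headline number

# Placeholder measurements: ingest-to-queryable latency per sampled event, in ms.
samples_ms = [random.uniform(20, 80) for _ in range(200)]

p95_ms = statistics.quantiles(samples_ms, n=20)[-1]
print(f"p95 ingest-to-queryable latency: {p95_ms:.1f} ms")

if p95_ms > CLAIMED_LATENCY_MS:
    print("pipe layer cannot support the claimed latency yet; fix the pipes before step 5")
else:
    print("pipe layer can plausibly support the claim; proceed to the policy platform")
```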

The infrastructure investment pays dividends across more than one use case. The same data platform that lets DSPM hit its sub-100ms claims also supports threat hunting, detection engineering, and compliance retention cleanly. The "boring" infrastructure investment is what makes the "exciting" policy investment work.

Two questions to ask DSPM and DLP vendors during evaluation, before signing anything: what specific streaming and schema-normalization assumptions do your performance claims depend on, and how do those map against my current infrastructure? Vendors who can answer them concretely are the ones whose deployments survive the first six months in production. Vendors who deflect are deferring the conversation to the post-purchase implementation review, where the gap can no longer be papered over.

What this extends

H1-PLATFORM-01, applied to the DSPM/DLP procurement surface.

The anchor hypothesis on the research page reads: an open table format (Iceberg), a neutral catalog, and a swappable query engine form the platform baseline that survives vendor consolidation and supports analytical workloads at scale. The DSPM/DLP/AI Security category is the most visible procurement surface where the platform question gets ducked — the policy-layer marketing focuses on the visible capability while the underlying infrastructure assumption is left implicit.

Two adjacent observations strengthen the broader pattern. First, the market is bifurcating into policy-first vendors (DSPM, DLP, AI Security) and pipe-first vendors (Cribl, Tenzir, streaming platforms), and that split is currently resolving through acquisition-driven consolidation rather than honest customer-facing positioning. CrowdStrike acquired Onum. SentinelOne acquired Observo AI. Palo Alto Networks acquired Chronosphere. The vertical integration is happening through M&A. Second, the infrastructure-cost-allocation pattern emerging from production engagements: roughly 30–50% of total program cost on the pipe layer, 50–70% on the policy layer. Programs that allocated only 10–15% to the pipe layer reliably underperformed on the policy platform's headline claims.

What would change the answer. DSPM vendors explicitly listing infrastructure prerequisites in product documentation rather than burying them in implementation guides — early signs in late 2025 but not yet standard. Production case studies that document the underlying data platform alongside the policy platform — the deployments where this is documented openly are the ones that survive the procurement round-trip without retroactive cost surprises. A unifying open standard at the policy-platform layer analogous to OCSF at the schema layer — currently absent, and structurally what would let pipes-first infrastructure be usefully decoupled from any single policy vendor.

The pipes are where the program either earns its claims or doesn't.

The full anchor hypothesis on the platform baseline is on the research page. The matrix offering applies the pipes-before-policy lens to the specific platform decisions in your environment.