Security Data Works

H1-PLATFORM-01 · Tier B-C · 4/5 (extension)

Policy without pipes doesn't ship.

The DSPM (Data Security Posture Management), DLP (Data Loss Prevention), and AI Security categories converged hard in 2024–2025. Netskope IPO'd at $7.3B, Cyera reached unicorn status, and Cyera, Securiti, Palo Alto, and a dozen others now promise sub-100ms policy enforcement and AI-driven classification. Those promises assume a data platform that many organizations don't have, which is why the data platform has to come first.

The observation

"Everyone's building the policy layer. Nobody's talking about the pipes."

That's Matthias Vallentin, CEO of Tenzir, in November 2025, calling out an architectural gap that becomes obvious once it's named. DSPM, DLP, and AI Security platforms assume the customer already has high-performance security data infrastructure in place: streaming telemetry, schema-on-read flexibility, real-time enrichment, bidirectional lineage tracking, and columnar storage optimized for analytics. Most of the estates I've seen have little of that. What they actually run is batch ETL jobs, Logstash configs that haven't been touched in three years, and Kafka clusters held together with duct tape and good intentions. When the policy layer gets deployed on top of that, the performance claims evaporate.

This is the same evidence pattern visible in the broader anchor hypothesis on the platform baseline: decisions made at the foundation layer cascade for years. The DSPM/DLP/AI Security category just makes the cascade more visible in procurement because the policy-layer marketing language is so specific (sub-100ms, real-time, automated remediation) that the gap with the underlying infrastructure shows up immediately.

The claims, against the infrastructure most shops actually run

Sub-100ms latency on a 60-second-batch pipeline doesn't end well.

"Sub-100ms inline DLP latency."

The claim requires telemetry flowing at wire speed (not batch-collected every five minutes), normalized schema available at query time (not raw JSON dumps), enrichment happening in-stream (not in a separate overnight ETL job), and a policy engine with sub-100ms decision time (not polling a database). If the pipeline is "Logstash to S3 to nightly Spark job to Parquet," the latency floor lands somewhere around 24 hours rather than 100 milliseconds. The vendor isn't lying so much as describing performance on infrastructure the customer doesn't have.
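The arithmetic behind that latency floor can be sketched in a few lines. This is a minimal illustration rather than a measurement: the stage names and per-stage delays are hypothetical numbers chosen to mirror the batch pipeline described above and a notional streaming equivalent, and the model simply assumes an event can wait the full worst case at every stage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    worst_case_seconds: float  # longest an event can wait at this stage


def latency_floor_seconds(stages):
    """Worst-case end-to-end delay: the event waits at every stage in turn."""
    return sum(stage.worst_case_seconds for stage in stages)


# Hypothetical delays for a "Logstash to S3 to nightly Spark job to Parquet" pipeline.
batch_pipeline = [
    Stage("Logstash to S3 flush", 300.0),          # five-minute batch collection
    Stage("nightly Spark job", 24 * 3600.0),       # may wait a full day for the next run
    Stage("Parquet write and policy lookup", 60.0),
]

# Hypothetical delays for a streaming pipeline.
streaming_pipeline = [
    Stage("stream ingest", 0.020),
    Stage("in-stream enrichment", 0.030),
    Stage("policy decision", 0.040),
]

CLAIM_SECONDS = 0.100  # the vendor's sub-100ms claim

for label, pipeline in (("batch", batch_pipeline), ("streaming", streaming_pipeline)):
    floor = latency_floor_seconds(pipeline)
    verdict = "within" if floor < CLAIM_SECONDS else "outside"
    print(f"{label}: {floor:.3f}s worst case, {verdict} the sub-100ms claim")
```

Swapping in measured per-stage delays from a real estate turns this into a quick sanity check to run before a vendor demo.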

"Real-time AI classification and remediation."

The claim requires continuous discovery scanning (not scheduled weekly scans), real-time classification at ingest (not background processing), streaming policy evaluation (not batch rule checks), and immediate remediation triggers (not ticketing systems). If the discovery model is "agentless scan across S3, Snowflake, and on-premises Postgres once a week," the deployment is doing weekly detection with manual remediation, which is real enough but not real-time.

"End-to-end data lineage."

The claim requires metadata captured at every transformation step, bidirectional lineage tracking (upstream sources and downstream consumers), schema change history, and a queryable lineage graph. If the lineage artifact is "we have a Confluence page documenting our data flow," the lineage doesn't exist; it's aspirational documentation, six months out of date by the time anyone needs to query it.
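What "queryable" means here can be shown with a toy lineage graph. This is a sketch under stated assumptions, not how any particular catalog implements lineage: the `LineageGraph` class and the dataset names are hypothetical, and a real metadata layer would record these edges automatically as transformations run.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy bidirectional lineage store (hypothetical, for illustration only)."""

    def __init__(self):
        self._downstream = defaultdict(set)  # source -> datasets derived from it
        self._upstream = defaultdict(set)    # target -> datasets it was derived from

    def record(self, source, target):
        """Record that `target` was produced from `source`."""
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    def _walk(self, start, edges):
        # Breadth-first traversal collecting every reachable dataset.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

    def upstream_of(self, dataset):
        return self._walk(dataset, self._upstream)

    def downstream_of(self, dataset):
        return self._walk(dataset, self._downstream)


graph = LineageGraph()
graph.record("raw.firewall_logs", "normalized.network_activity")
graph.record("normalized.network_activity", "enriched.network_activity")
graph.record("asset_inventory", "enriched.network_activity")
graph.record("enriched.network_activity", "dlp.policy_input")

print(sorted(graph.upstream_of("dlp.policy_input")))
print(sorted(graph.downstream_of("raw.firewall_logs")))
```

The two queries are the ones a Confluence page cannot answer on demand: what feeds this policy input, and what breaks downstream if this raw source changes.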

What the pipe layer actually needs

Four things, in roughly this order, before any policy platform earns its claims.

Streaming-first architecture (not batch-first).

Telemetry flows continuously rather than on scheduled pulls. Processing happens in-stream rather than in batch jobs. Latency is measured in seconds, not hours or days. Backpressure is handled rather than papered over. The component categories: Apache Kafka or equivalent for event streaming; Apache Flink or Spark Streaming for stream processing; Tenzir, Cribl, or comparable for the security-specific streaming pipelines. The anti-pattern is a Logstash-to-S3-to-nightly-Spark pipeline dressed up as a streaming architecture, which it isn't.
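"Backpressure is handled" has a concrete meaning: when the consumer falls behind, the producer is slowed down rather than events being dropped or buffered without limit. A minimal sketch using Python's standard `asyncio.Queue` with a bounded size follows; the function names and the `max_in_flight` parameter are illustrative assumptions, and a production pipeline would get the same property from its streaming platform rather than from hand-written code.

```python
import asyncio


async def producer(queue, events):
    for event in events:
        # put() suspends while the queue is full, so a slow consumer
        # throttles the producer instead of letting the buffer grow unbounded.
        await queue.put(event)
    await queue.put(None)  # sentinel: no more events


async def consumer(queue, processed, depths):
    while True:
        depths.append(queue.qsize())  # record buffer depth for inspection
        event = await queue.get()
        if event is None:
            break
        await asyncio.sleep(0)  # stand-in for enrichment or policy evaluation
        processed.append(event)


async def run_pipeline(events, max_in_flight=4):
    queue = asyncio.Queue(maxsize=max_in_flight)
    processed, depths = [], []
    await asyncio.gather(
        producer(queue, events),
        consumer(queue, processed, depths),
    )
    return processed, max(depths)


processed, peak_depth = asyncio.run(run_pipeline(list(range(100))))
print(f"processed {len(processed)} events, peak buffer depth {peak_depth}")
```

The bounded queue is the whole point: every event is processed in order, and the buffer never grows past the configured limit regardless of how fast the producer could otherwise run.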

Schema normalization at ingest (not post-processing).

OCSF (Open Cybersecurity Schema Framework) or ECS (Elastic Common Schema) mapping happens during collection. Schema-on-read flexibility for multiple incoming formats. Field validation at ingest to catch errors at the source. Backward compatibility when schemas evolve. The anti-pattern is dumping raw vendor logs to S3 and figuring out the schema downstream, which produces technical debt that compounds across every consumer of that data.
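Mapping and validating at ingest can be sketched as follows. Everything vendor-specific here is an assumption made for illustration: the two vendor formats, their field names, and the flat target record are hypothetical, and the target is a simplified common shape rather than a validated OCSF or ECS event.

```python
import ipaddress
from datetime import datetime, timezone


class IngestError(ValueError):
    """Raised when a record fails validation at the point of collection."""


def _epoch_ms(value):
    """Accept epoch seconds or an ISO 8601 string; return epoch milliseconds."""
    if isinstance(value, (int, float)):
        return int(value * 1000)
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return int(parsed.timestamp() * 1000)


# Hypothetical mappings: common field name -> vendor field name.
FIELD_MAPS = {
    "vendor_a": {"time": "ts", "src_ip": "source", "dst_ip": "dest", "action": "verdict"},
    "vendor_b": {"time": "eventTime", "src_ip": "srcAddr", "dst_ip": "dstAddr", "action": "act"},
}


def normalize(vendor, raw):
    """Map one raw vendor record to the common shape, or raise IngestError."""
    mapping = FIELD_MAPS.get(vendor)
    if mapping is None:
        raise IngestError(f"unknown vendor format: {vendor}")
    missing = [field for field in mapping.values() if field not in raw]
    if missing:
        raise IngestError(f"{vendor}: missing fields {missing}")
    try:
        return {
            "time": _epoch_ms(raw[mapping["time"]]),
            "src_ip": str(ipaddress.ip_address(raw[mapping["src_ip"]])),
            "dst_ip": str(ipaddress.ip_address(raw[mapping["dst_ip"]])),
            "action": str(raw[mapping["action"]]).lower(),
            "source_format": vendor,
        }
    except (ValueError, TypeError) as exc:
        raise IngestError(f"{vendor}: invalid field value: {exc}") from exc


record_a = normalize(
    "vendor_a",
    {"ts": 1700000000, "source": "10.0.0.5", "dest": "192.0.2.10", "verdict": "BLOCK"},
)
record_b = normalize(
    "vendor_b",
    {"eventTime": "1970-01-01T00:00:01+00:00", "srcAddr": "10.0.0.6",
     "dstAddr": "192.0.2.11", "act": "Allow"},
)
print(record_a)
print(record_b)
```

Both vendors end up with identical keys and types, and a malformed record fails at the source with a specific error instead of surfacing weeks later in a downstream query.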

Metadata layer with lineage tracking.

Every transformation step tracked automatically. Bidirectional lineage (upstream sources and downstream consumers) queryable rather than documented. Schema change history preserved. The component categories: Apache Iceberg with its built-in metadata tracking; Unity Catalog or AWS Glue Data Catalog for governance and metadata management. The anti-pattern is lineage carried in Confluence pages that are manually maintained and so perpetually stale.

Columnar storage for analytics.

Parquet or Apache Arrow for columnar analytical queries. Column-level compression delivering 10–20× reductions in storage cost. Scan efficiency from reading only the columns the query needs. Zero-copy in-memory transformations where the analytics engine supports them. The component categories: Parquet on the storage side, Arrow for in-memory representation, columnar engines like DuckDB or ClickHouse on the query side. The anti-pattern is compressed JSON in S3, slow and expensive to query, which is the foundation many DSPM-backed deployments end up running on.
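The scan-efficiency point can be illustrated without any storage engine. The sketch below is a deliberately simplified model, not a benchmark: it uses plain Python lists in place of Parquet or Arrow, the event fields are made up, and "values touched" stands in as a rough proxy for the I/O a real engine would perform.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists (assumes uniform keys)."""
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            columns[key].append(value)
    return columns


def sum_bytes_row_layout(rows):
    """Row layout: every field of every row is materialized to read one of them."""
    total, touched = 0, 0
    for row in rows:
        touched += len(row)
        total += row["bytes_out"]
    return total, touched


def sum_bytes_column_layout(columns):
    """Column layout: only the one column the query needs is read."""
    column = columns["bytes_out"]
    return sum(column), len(column)


# Hypothetical events with five fields each.
rows = [
    {
        "time": 1700000000 + i,
        "src_ip": f"10.0.0.{i % 250}",
        "dst_ip": "192.0.2.10",
        "action": "allow" if i % 3 else "block",
        "bytes_out": i * 10,
    }
    for i in range(1000)
]
columns = to_columns(rows)

row_total, row_touched = sum_bytes_row_layout(rows)
col_total, col_touched = sum_bytes_column_layout(columns)
print(f"row layout:    total={row_total}, values touched={row_touched}")
print(f"column layout: total={col_total}, values touched={col_touched}")
```

Both layouts return the same answer; the difference is that the row layout touches every field to get it, and that gap widens with every column the query does not need.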

The build order

Pipes before policy. Then the policy claims start working.

The deployment sequence that survives contact with the policy-layer marketing claims:

  1. Streaming infrastructure (Kafka, Flink, Cribl, Tenzir) to get telemetry flowing in real time rather than on a schedule.
  2. Schema normalization (OCSF, ECS, or equivalent) at ingest, mapping vendor formats to a common standard before they hit storage.
  3. Metadata layer (Iceberg, Unity Catalog, Glue, or comparable) for lineage tracking and governance, with the metadata captured automatically rather than written by hand.
  4. Columnar storage (Parquet, Arrow, ClickHouse, DuckDB) optimized for the analytical workloads the policy platform is going to issue against it.
  5. Then the DSPM, DLP, or AI Security platform on top, with the underlying data platform able to support the latency and discovery patterns the platform actually requires.
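The sequence above can be turned into a simple readiness check that reports the first missing layer. This is a sketch of the ordering logic only: the layer keys, descriptions, and the `capabilities` dictionary are hypothetical names introduced for illustration, and deciding whether an estate genuinely has a capability is an assessment the code cannot make.

```python
# Layers in the order they should exist before the policy platform goes on top.
BUILD_ORDER = [
    ("streaming", "telemetry flows continuously, not on scheduled pulls"),
    ("schema_normalization", "vendor formats mapped to a common schema at ingest"),
    ("metadata_lineage", "lineage captured automatically and queryable"),
    ("columnar_storage", "analytical data stored in a columnar format"),
]


def first_gap(capabilities):
    """Return the earliest missing layer as (name, description), or None."""
    for layer, description in BUILD_ORDER:
        if not capabilities.get(layer, False):
            return layer, description
    return None


def ready_for_policy_platform(capabilities):
    return first_gap(capabilities) is None


# Hypothetical estate: streaming is in place, nothing else is.
estate = {"streaming": True, "schema_normalization": False, "columnar_storage": True}

gap = first_gap(estate)
if gap is None:
    print("pipe layer in place; policy platform claims have a chance")
else:
    print(f"build next: {gap[0]} ({gap[1]})")
```

Reporting only the first gap is deliberate: it keeps the conversation on the next layer to build rather than on everything at once.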

The infrastructure investment pays dividends across more than one use case. The same data platform that lets DSPM hit its sub-100ms claims also supports threat hunting, detection engineering, and compliance retention, so the "boring" infrastructure spend is what carries the "exciting" policy spend in the first place.

Two questions to ask DSPM and DLP vendors during evaluation, before signing anything: what specific streaming and schema-normalization assumptions do your performance claims depend on, and how do those map against my current infrastructure? Vendors who can answer both concretely are the ones whose deployments survive the first six months in production. Vendors who deflect are deferring the conversation to the post-purchase implementation review, where the gap can no longer be papered over.

What this extends

H1-PLATFORM-01, applied to the DSPM/DLP procurement surface.

The anchor hypothesis on the research page reads: an open table format (Iceberg), a neutral catalog, and a swappable query engine form the platform baseline that survives vendor consolidation and supports analytical workloads at scale. The DSPM/DLP/AI Security category is the most visible procurement surface where the platform question gets ducked, because the policy-layer marketing focuses on the visible capability while the underlying infrastructure assumption is left implicit.

Two adjacent observations strengthen the broader pattern. First, the market bifurcation between policy-first vendors (DSPM, DLP, AI Security) and pipe-first vendors (Cribl, Tenzir, streaming platforms) is currently producing acquisition-driven consolidation rather than honest customer-facing positioning: CrowdStrike acquired Onum, SentinelOne acquired Observo AI, and Palo Alto Networks acquired Chronosphere, so the vertical integration is happening through M&A. Second, in the estates I've worked, a rough ballpark is that the pipe layer absorbs somewhere around 30–50% of total program cost while the policy layer takes the remaining 50–70%. That's an estimate from experience rather than a measured figure across a controlled sample, but the programs that allocated only 10–15% to the pipe layer tended to underperform on the policy platform's headline claims.

What would change the answer. The answer shifts if DSPM vendors start listing infrastructure prerequisites in product documentation rather than burying them in implementation guides; there are early signs of that in late 2025, though it isn't yet standard. It shifts further as production case studies document the data platform alongside the policy platform, because the deployments where this is documented openly are the ones that survive the procurement round-trip without retroactive cost surprises. And it would shift most of all with a unifying open standard at the policy-platform layer analogous to OCSF at the schema layer, which is currently absent and is structurally what would let pipes-first infrastructure be usefully decoupled from any single policy vendor.

The pipes are where the program either earns its claims or doesn't.

The full anchor hypothesis on the platform baseline is on the research page. The matrix offering applies the pipes-before-policy lens to the specific platform decisions in your environment.