Security Data Works

Foundation · Data health

The gate before everything.

MOAR, DetectFlow, MLOps-hunting. Every downstream project assumes the data platform underneath produces data that analysts and engineers can trust. Most platforms don't. The foundation engagement is the work that makes the trust claim demonstrable rather than asserted. Per-source health reports, cross-tool gap analysis, the entity graph rendered for analysts. Not a side project; the prerequisite that determines whether everything built on top of it survives contact with the SOC.

The four layers

Source health. Flow health. Data quality. Cross-tool gap analysis.

The foundation engagement measures four distinct layers, in order, from upstream to downstream. Most security programs jump straight to layer three (data quality) and skip the two above it, which means they're measuring the properties of data that may already be incomplete or stale by the time it reaches the lake. The order matters because each layer's measurements are only as honest as the layer above it.

Layer one. Source health.

The producer's own operational health, measured at the sensor or agent, upstream of any data the source emits. If a Zeek sensor is dropping packets at the wire, no downstream measurement can reconstruct what was lost; the lake just sees a flow that looks complete because everything that arrived was processed correctly. Source health catches the failure at the source rather than at the symptom.

Signals measured continuously, per source:

Signal | What it catches
Uptime and process health | Sensor or agent running, expected number of workers, no crash loops.
Capture rate vs. baseline | Packets captured (NDR), events captured (EDR), bytes ingested (log shipper) against the source's documented baseline.
Drop rate | Packets dropped at the kernel or capture buffer, events rate-limited or buffer-overflow dropped, log lines truncated.
Production volume vs. baseline | Sudden silence is the dangerous signal; a sensor producing zero events looks identical to a healthy quiet day until you correlate against the baseline.
Time-sync drift | NTP / PTP offset; without time accuracy, every downstream cross-source join becomes unreliable.
Producer resource headroom | CPU, memory, disk IO at the sensor or agent; saturation upstream causes the drops downstream.

In the Corelight context these signals were surfaced natively for Zeek sensors; the methodology generalizes to EDR (CrowdStrike, SentinelOne), identity providers (Okta, Entra), pipeline shippers (Cribl, DataBahn), and any agent or appliance with self-reported telemetry.
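As an illustration of the volume-vs-baseline and drop-rate signals, here is a minimal sketch. It assumes each source reports per-window event counts plus captured and dropped counters. The function names, the `min_ratio` and `z_threshold` parameters, and their default values are hypothetical choices for the example, not part of any vendor's API.

```python
from statistics import mean, stdev


def volume_status(history, current, min_ratio=0.2, z_threshold=3.0):
    """Classify a source's current event count against its own baseline.

    history: event counts from comparable past windows (assumed input,
             e.g. the same hour on previous days).
    current: event count for the window being checked.
    """
    baseline = mean(history)
    spread = stdev(history) if len(history) > 1 else 0.0
    # Zero events from a source that normally produces data is the
    # "sudden silence" case: it only stands out against the baseline.
    if baseline > 0 and current == 0:
        return "silent"
    if baseline > 0 and current < min_ratio * baseline:
        return "low"
    if spread > 0 and abs(current - baseline) / spread > z_threshold:
        return "anomalous"
    return "ok"


def drop_rate(captured, dropped):
    """Fraction of offered packets or events lost at the producer."""
    total = captured + dropped
    return dropped / total if total else 0.0


if __name__ == "__main__":
    hourly = [1000, 1100, 950, 1050]  # made-up baseline counts
    print(volume_status(hourly, 0))
    print(volume_status(hourly, 1030))
    print(drop_rate(captured=990, dropped=10))
```

The sketch compares each source against its own history rather than a fixed threshold, so a quiet source and a busy one can share the same check.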

Layer two. Flow health (the SRE golden signals).

Between the source and the lake sits a pipeline. Ingest, parse, normalize, enrich, route, land. Each stage has the same operational properties production services have, and the same framework production SRE teams use applies cleanly. The Google SRE golden signals (latency, traffic, errors, saturation) are the four signals worth watching at each pipeline stage. Most security shops don't measure pipelines this way; the cost shows up as silent drops, freshness drift, and the kind of "the data was there yesterday and isn't today" failure that nobody catches until an analyst complains.

  • Latency. Event-time → queryable-time. Measured per pipeline stage and end-to-end, with a target band and an alert when the band is exceeded.
  • Traffic. Events per second per stage. Baseline plus expected variance; sudden changes are leading indicators of upstream incidents or silent drops.
  • Errors. Parse failures, schema violations, validation drops, dead-letter-queue rates. Events fail silently unless errors are first-class measurements.
  • Saturation. Queue depth, consumer lag, buffer utilization, backpressure. Pipelines work right up until they don't; saturation is the leading indicator.
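The four signals above can be sketched per pipeline stage as follows. This is an illustrative example under stated assumptions: the `StageWindow` record, its field names, and the target values in the usage section are hypothetical, and a real pipeline would pull these numbers from its own metrics system.

```python
import math
from dataclasses import dataclass


@dataclass
class StageWindow:
    """One measurement window for one pipeline stage (hypothetical shape)."""
    stage: str
    latencies_s: list      # queryable-time minus event-time, per event
    events_in: int
    events_failed: int     # parse failures, schema violations, DLQ writes
    queue_depth: int
    queue_capacity: int
    window_s: float


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("no latency samples in this window")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(w):
    """Compute latency, traffic, errors, and saturation for one window."""
    return {
        "latency_p95_s": percentile(w.latencies_s, 95),
        "traffic_eps": w.events_in / w.window_s,
        "error_rate": w.events_failed / w.events_in if w.events_in else 0.0,
        "saturation": w.queue_depth / w.queue_capacity,
    }


def breaches(signals, targets):
    """Names of the signals that exceed their upper target."""
    return [name for name, limit in targets.items() if signals[name] > limit]


if __name__ == "__main__":
    window = StageWindow("parse", [1, 2, 3, 4, 50], 6000, 30, 800, 1000, 60.0)
    signals = golden_signals(window)
    # Target values are made up for the example.
    targets = {"latency_p95_s": 30, "error_rate": 0.01, "saturation": 0.9}
    print(signals)
    print(breaches(signals, targets))
```

Traffic is left out of the upper-bound targets on purpose: for traffic, a drop below baseline matters at least as much as a spike, so it is better checked against a baseline band than a single limit.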

Naming the framework matters. Treating security data pipelines the way production engineering treats production services, with the same instrumentation discipline, is the rhetorical move that converts "we measure things" into "we apply the operational standard production owes itself." Few security shops do this; the ones that do are visibly more reliable.

Layer three. Data quality (six dimensions, plus retention length).

The properties of the data after it lands. This is the layer analysts experience directly when they query the lake; this is also the layer most "data quality" work fixates on without ever instrumenting the two layers above it that determine whether layer three is measuring anything honest.

Dimension | What it verifies
Timeliness | Fresh enough for the downstream consumer's latency requirement. Near-real-time detection has a different bar than 90-day threat hunting.
Accuracy | Values reflect ground truth. Distinct from validity, since values can be valid in shape but inaccurate in content.
Completeness | Every event the source promised landed in storage. Holes are named and quantified, not papered over.
Consistency | The same entity is represented consistently across rows and tables. A user identity flipping between three formats is a consistency failure even when each format is individually valid.
Validity | Values within expected ranges, types, and shapes. Schema-conformance work lives here as the deliverable; where the customer has adopted a standard, mapping accuracy against that standard (OCSF for cross-vendor evidence chains, Splunk CIM for SPL-native shops, Elastic ECS, or vendor-bespoke contracts) is the specific instance of validity work.
Uniqueness | Each event captured exactly once. Retry storms, dual-path ingest, and replay-after-failure all produce duplicates that quietly inflate counts unless the dedup posture is measured.
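Three of these dimensions (completeness, uniqueness, consistency) reduce to simple counts once the inputs exist. The sketch below assumes the source exposes a count of emitted events and that landed events carry a stable event id. The function names and the two identity patterns are hypothetical and would need tuning for a real environment.

```python
import re


def quality_counts(source_emitted, landed_event_ids):
    """Completeness and uniqueness for one source over one window.

    source_emitted: events the source says it produced (assumed available).
    landed_event_ids: ids of the events found in storage, duplicates included.
    """
    ids = list(landed_event_ids)
    distinct = len(set(ids))
    duplicates = len(ids) - distinct
    return {
        "completeness": distinct / source_emitted if source_emitted else 1.0,
        "missing_events": max(source_emitted - distinct, 0),
        "duplicate_rate": duplicates / len(ids) if ids else 0.0,
    }


# Illustrative identity shapes only: user@domain and DOMAIN\user.
IDENTITY_FORMATS = {
    "upn": re.compile(r"^[^@\\]+@[^@\\]+$"),
    "down_level": re.compile(r"^[^@\\]+\\[^@\\]+$"),
}


def identity_formats(values):
    """Set of identity formats seen; more than one signals inconsistency."""
    seen = set()
    for value in values:
        label = next(
            (name for name, rx in IDENTITY_FORMATS.items() if rx.match(value)),
            "bare",
        )
        seen.add(label)
    return seen


if __name__ == "__main__":
    landed = ["a", "b", "c", "d", "e", "f", "g", "h", "h", "h"]
    print(quality_counts(10, landed))
    print(identity_formats(["jdoe@corp.example", "CORP\\jdoe", "jdoe"]))
```

Counting distinct ids before computing completeness keeps duplicates from masking missing events, which is the "quietly inflate counts" problem named under uniqueness.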

Plus a seventh signal that's often missed:

  • Retention length. The queryable window the data actually supports. Two-sided: a regulatory floor (Reg SCI 5 years for SCI entities, SEC Rule 17a-4 6 years for broker-dealer recordkeeping, HIPAA 6 years, PCI 1 year) and an operational ceiling (how long can the data platform actually serve queries against this data before the cost curve breaks?). When the ceiling sits below the floor, the gap between them is where compliance failures happen quietly.
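The two-sided retention check can be expressed as a comparison per dataset. This is a minimal sketch: the `RetentionPolicy` record and the dataset names are hypothetical, and the floor for each dataset has to come from the program's own compliance mapping rather than from this example.

```python
from dataclasses import dataclass


@dataclass
class RetentionPolicy:
    dataset: str
    floor_days: int       # regulatory minimum that applies to this dataset
    queryable_days: int   # window the platform can actually serve queries over


def retention_gaps(policies):
    """Datasets whose queryable window falls short of the floor, in days."""
    return {
        p.dataset: p.floor_days - p.queryable_days
        for p in policies
        if p.queryable_days < p.floor_days
    }


if __name__ == "__main__":
    # Dataset names and queryable windows are made up for illustration.
    policies = [
        RetentionPolicy("cardholder_env_logs", floor_days=365, queryable_days=90),
        RetentionPolicy("proxy_logs", floor_days=365, queryable_days=400),
    ]
    print(retention_gaps(policies))
```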

Layer four. Cross-tool gap analysis.

Layers one through three answer "is each source doing what it claimed?", measured one source at a time. Layer four asks the different question. Do the sources, in combination, actually cover what we said they cover? The CMDB says 50,000 assets. The EDR sees 47,000. The vulnerability scanner sees 52,000. Three sources, three answers, none of them obviously right. Most security programs paper over the delta. The asset count varies by tool, everybody knows it, the team quietly picks one source as the unofficial truth and moves on. Layer four refuses the paper-over.

The delta gets named. The authoritative source per attribute is determined explicitly, with confidence and freshness scoring. The coverage holes (at least 3,000 assets the EDR doesn't see, at least 2,000 the scanner sees that the CMDB doesn't know about) get documented as the gap they are. The trustworthy and well-connected pillars from the thesis are demonstrated properties only after layer four; before then they're assertions.
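Naming the delta means joining the inventories per asset, because headline counts only bound the gap from below: two tools can report similar totals while seeing different assets. A minimal sketch, assuming each tool's inventory has already been normalized to a shared asset identifier (the hard part in practice); the function names and toy inventories are hypothetical.

```python
def coverage_gaps(inventories):
    """For each source, the assets seen by any other source but not by it.

    inventories: dict mapping source name -> set of normalized asset ids.
    """
    universe = set().union(*inventories.values())
    return {source: universe - seen for source, seen in inventories.items()}


def pairwise_delta(inventories, left, right):
    """Assets only in `left`, and assets only in `right`."""
    return inventories[left] - inventories[right], inventories[right] - inventories[left]


if __name__ == "__main__":
    # Toy inventories; real ones would hold tens of thousands of ids.
    inventories = {
        "cmdb": {"a", "b", "c"},
        "edr": {"a", "b"},
        "scanner": {"a", "b", "c", "d"},
    }
    for source, missing in coverage_gaps(inventories).items():
        print(source, sorted(missing))
    print(pairwise_delta(inventories, "cmdb", "scanner"))
```

The per-source sets, not the counts, are what get handed to owners: each missing id is a specific asset someone has to explain.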

Why this is a gate, not a phase

Programs that skip the gate produce alerts and models analysts rationally distrust.

The same dynamic plays out in SOC after SOC. The detection-engineering team ships content against telemetry whose completeness and freshness are unmeasured. Analysts notice the rules don't fire when they should fire and do fire when they shouldn't, conclude the rules are bad, and start writing their own SPL queries against what they consider the "real" version of the data. The detection-engineering pipeline gets routed around. The official tools get distrusted. Parallel tooling proliferates. The cost shows up as alert fatigue, analyst turnover, and shadow infrastructure, none of which appear as line items on the SIEM invoice.

The same pattern hits machine-learning models harder. A model trained on telemetry of unknown completeness and freshness produces output of unknown reliability. The model's evaluation metrics in training look fine; in production the output is noisier than the training metrics suggested. The MLOps-hunting page walks through the failure mode in detail. The model wasn't bad; the data platform underneath it was.

The mechanism is the same in both cases. Engineers can't be productive on data they don't trust either; they just express the distrust differently. Parallel pipelines, shadow ETL, "let me re-extract that from source" as the default. The same data-health gate that earns analyst trust also produces engineer velocity. Programs that try to ship detection content or ML models without the gate are paying the cost of the gate's absence in slower velocity, lower reliability, and persistent rework. They just don't recognize where the cost is coming from. The foundation engagement is the work that makes the cost visible and produces the data platform that earns the trust back.

The force multiplier

One verified data platform. Two parallel multiplier effects.

The technical pillars earn the budget. What justifies the budget being spent on data infrastructure instead of more headcount or another point product is the force multiplier on the people who use the data platform. Analysts on the operational side, engineers on the build side. The two effects compound, and both are downstream of the same foundation work.

Analyst trust. The operational payoff.

When the data platform is verifiably trustworthy, well-connected, and performant, analysts stop maintaining the shadow tools and personal SPL searches they built because they didn't trust the official ones. They stop second-guessing every alert for want of knowing whether the data is fresh or correct. They stop routing around the detection-engineering pipeline once the data behind it is no longer suspect. They stop dismissing model output on the suspicion that it just reflects bad context.

And they start hunting longer windows confidently because the data is verifiably complete. They start treating detection content as production software rather than personal craft. They start trusting model output enough to act on it, or rejecting it for a defensible, evidence-grounded reason. They start closing tickets faster because the context graph answers "what is this asset, who owns it, what else has it touched" without a five-tab investigation. A SOC of ten analysts on a verified data platform can outperform a SOC of fifty on a data platform everyone privately distrusts.

Engineer productivity. The build-velocity payoff.

The same data platform makes four engineering roles materially more productive. Detection engineers get the DetectFlow-as-code regime working against trustworthy telemetry rather than fighting it. Data engineers work with schema-as-code (OCSF plus transforms in version control), source-to-sink contracts, and lineage as a first-class artifact, so less firefighting and less ticket-driven work from analysts whose complaints turn out to be data-quality issues at the source. ML engineers get reproducible training data with lineage-traceable features, replacing the one-off-notebook pattern that doesn't survive production deployment. Platform engineers work on a vendor-neutral architecture that doesn't lock them into one vendor's whole tooling stack.

The context graph

The four layers produce data. Visualization makes it legible.

The output of the cross-tool gap analysis is a best-context report per entity (asset, user, application, configuration) by attribute (owner, criticality, last-seen, IP address, MAC address, OS version), with the authoritative source for each combination scored for confidence and freshness. The report is dense and analytically powerful, and it doesn't surface its value until analysts can navigate it visually.
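One way to picture the per-attribute resolution is a score that combines a confidence weight with a freshness decay. The sketch below is an assumption-laden illustration, not the engagement's actual scoring method: the `Observation` record, the exponential half-life decay, and the confidence values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Observation:
    source: str
    value: str
    observed_at: datetime
    confidence: float  # 0..1, hypothetical per-source, per-attribute weight


def freshness(observed_at, now, half_life):
    """Exponential decay: the score halves every `half_life` of age."""
    age_s = max((now - observed_at).total_seconds(), 0)
    return 0.5 ** (age_s / half_life.total_seconds())


def best_context(observations, now, half_life=timedelta(days=7)):
    """Pick the authoritative value for one entity attribute.

    Conflicting values from other sources are returned alongside the
    winner so they are surfaced rather than hidden.
    """
    if not observations:
        raise ValueError("no source has visibility for this attribute")
    scored = sorted(
        ((o.confidence * freshness(o.observed_at, now, half_life), o)
         for o in observations),
        key=lambda pair: pair[0],
        reverse=True,
    )
    score, winner = scored[0]
    conflicts = sorted({o.value for _, o in scored if o.value != winner.value})
    return {"value": winner.value, "source": winner.source,
            "score": score, "conflicts": conflicts}


if __name__ == "__main__":
    now = datetime(2024, 1, 29, tzinfo=timezone.utc)
    owner = [
        Observation("cmdb", "team-a", now - timedelta(days=28), 0.9),
        Observation("edr", "team-b", now, 0.6),
    ]
    print(best_context(owner, now))
```

In this toy case the stale CMDB record loses to the fresh EDR record despite its higher confidence weight, and the CMDB's value is kept as a named conflict for the analyst to see.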

The context-graph visualization layer renders entity relationships, source conflicts, and coverage gaps in a form analysts can use during incident response. When a ticket arrives referencing an asset, the analyst sees the asset's resolved identity (with the conflicts between sources surfaced rather than hidden), its ownership chain, the tools currently observing it, the recent activity across all those tools, and the gaps where no source has visibility. The five-tab investigation collapses into a single view; the time-to-context drops from minutes to seconds.

The visualization is the artifact that turns the data-health work into something analysts feel rather than read about. Engagement deliverables include both the reports (for the audit trail and the engineering team) and the visualization (for the SOC's daily operation). Skipping the visualization produces beautiful PDFs that nobody opens after the engagement closes.

When to pick it

Before any of the others. Specifically, in five concrete patterns.

Any program that hasn't validated its data platform should run this engagement before committing to MOAR, DetectFlow, or MLOps-hunting. Five concrete patterns where the conversation usually starts:

  • Pre-migration validation. Before a platform migration, the foundation engagement establishes a trusted baseline for the current platform: what's flowing in, what's missing, what's in conflict. Migrating without that baseline produces post-migration surprises that get attributed to the new platform when the gaps were already present.
  • Post-migration verification. After a platform migration, the foundation engagement verifies that the new platform is producing data of equivalent or better quality. Nothing got lost in the move, schema mappings translated correctly, the federation surface holds together.
  • Audit preparation. Compliance frameworks (SOC 2, ISO 27001, HIPAA, PCI-DSS) set logging, monitoring, and retention requirements whose evidence depends on the quality of the underlying security telemetry. The foundation engagement produces the documented evidence auditors look for, with an audit trail their reviewers can reproduce.
  • Continuous monitoring deployment. Programs that want to shift from periodic assessment to continuous monitoring need the framework, the dashboards, and the alerting that the foundation engagement produces as standing artifacts. The handoff is the dashboards and the runbooks, not just the report.
  • Detection or ML program restart. Programs whose detection or ML work has stalled because of unreliable data quality benefit from the foundation engagement specifically as the unblock. Skipping ahead to "let's add more detections" or "let's train a model" without the foundation is the failure pattern that produced the stall in the first place.

The engagement is also the right shape for environments where the team senses that the data platform is unreliable but can't articulate where, the "we don't trust the data but can't say why" pattern that every mature SOC eventually hits. The foundation engagement converts the vague distrust into a documented, prioritized list of specific issues with specific owners.

The engagement shape

Data Quality & Flow Health Validation — productized form.

Data Quality & Flow Health Validation ($25K–$60K, 2–4 weeks). Pricing scales with data volume and log-source count. The methodology was originally productized at Corelight as the "Data Health Check" offering — one of two net-new professional-services offerings shipped there before the practice spun out. The deliverables:

The engagement runs as a standalone deliverable when the program isn't ready to commit to MOAR yet, and runs in parallel with MOAR migration assessments when the prospect already knows the data platform has to move. Either way, the foundation work is the gate. Every subsequent project's success depends on the data-platform trust this engagement produces.

Trust earned by measurement, not by assertion.

MOAR is the data platform. DetectFlow is the discipline. MLOps-hunting is the next horizon. The foundation is what makes any of them survive contact with the SOC.