Foundation · Data health
The gate before everything.
MOAR, DetectFlow, MLOps-hunting — every downstream project assumes the data platform underneath produces data that analysts and engineers can trust. Most don't. The foundation engagement is the work that makes the trust claim demonstrable rather than asserted: per-source health reports, cross-tool gap analysis, the entity graph rendered for analysts. Not a side project; the prerequisite that determines whether everything built on top of it survives contact with the SOC.
The two stages
Per-source quality. Then cross-tool reconciliation. The order matters.
Stage one — per-source data health.
For each source feeding the lakehouse, four properties get measured continuously:
- Completeness. The events the source is supposed to produce are actually arriving.
- Freshness. The latency between event occurrence and the event landing in queryable storage stays inside its target band.
- Schema conformance. The fields arrive in the shape downstream consumers expect.
- OCSF conformance, where applicable. The source's events map onto the multi-vendor schema standard for security data, rather than the vendor's bespoke shape being papered over without acknowledgment.
The four properties are simple individually and load-bearing collectively. A source delivering 99% of its expected events, fresh and schema-conformant, is doing what it claimed. A source delivering 60% of its expected events, hours stale, with a quarter of the OCSF fields populated, is failing silently, and the failure surfaces only when an analyst's query returns wrong results that nobody traces back to the source. Stage one is the work that catches the failure at the source rather than at the symptom.
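A minimal sketch of what a stage-one check can look like in practice, assuming a lakehouse table security_events (with event_time and ingest_time columns) and a source_baselines table holding each source's expected 24-hour volume and freshness target; the table names, thresholds, and status labels are illustrative, not the engagement's actual schema.

```sql
-- Per-source completeness and freshness over the last 24 hours.
-- security_events and source_baselines are illustrative names.
WITH observed AS (
    SELECT
        source_name,
        COUNT(*)                                             AS events_received,
        AVG(EXTRACT(EPOCH FROM (ingest_time - event_time)))  AS avg_lag_seconds
    FROM security_events
    WHERE ingest_time >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    GROUP BY source_name
)
SELECT
    b.source_name,
    o.events_received,
    b.expected_events_24h,
    ROUND(100.0 * o.events_received / NULLIF(b.expected_events_24h, 0), 1) AS completeness_pct,
    o.avg_lag_seconds,
    b.freshness_target_seconds,
    CASE
        WHEN o.events_received IS NULL                        THEN 'silent'      -- nothing arrived at all
        WHEN o.events_received < 0.9 * b.expected_events_24h  THEN 'incomplete'  -- volume below baseline
        WHEN o.avg_lag_seconds > b.freshness_target_seconds   THEN 'stale'       -- latency outside target band
        ELSE 'healthy'
    END AS health_status
FROM source_baselines b
LEFT JOIN observed o USING (source_name);
```

Scheduled and trended over the assessment window, a query of this shape is what the per-source report summarizes.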
Stage two — cross-tool gap analysis.
The CMDB says the environment has 50,000 assets. The EDR sees 47,000. The vulnerability scanner sees 52,000. Three sources, three answers, none of them obviously right. Most security programs paper over the delta — the asset count varies by tool, everybody knows it, the team quietly picks one source as the unofficial truth and moves on. Stage two refuses the paper-over. The delta gets named. The authoritative source per attribute gets determined explicitly with confidence and freshness scoring. The coverage holes — the 3,000 assets the EDR doesn't see, the 5,000 the CMDB doesn't know about — get documented as the gap they actually are.
Cross-tool gap analysis is the assurance mechanism. Per-source reports answer "is this source doing what it claimed?" Cross-tool analysis answers "do the sources, in combination, actually cover what we said they cover?" The two questions are the same posture applied at different scopes — and the cross-tool view is where assurance lives. A program that produces clean per-source reports but never reconciles across tools cannot defend the claim that its connected dataset is complete. The trustworthy and well-connected pillars from the thesis are demonstrated properties only after stage two; before then they're assertions.
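A hedged sketch of the reconciliation itself, assuming each tool's asset inventory has already been normalized to one row per resolved asset identity (cmdb_assets, edr_assets, and vuln_assets are illustrative names). Every row the query returns is a coverage gap with a name rather than a shrugged-off delta.

```sql
-- Which assets does each tool see? Coverage gaps fall out of the joins.
-- Inventories are assumed to be pre-normalized to one row per resolved asset.
WITH all_assets AS (
    SELECT asset_id FROM cmdb_assets
    UNION
    SELECT asset_id FROM edr_assets
    UNION
    SELECT asset_id FROM vuln_assets
)
SELECT
    a.asset_id,
    (c.asset_id IS NOT NULL) AS in_cmdb,
    (e.asset_id IS NOT NULL) AS seen_by_edr,
    (v.asset_id IS NOT NULL) AS seen_by_scanner
FROM all_assets a
LEFT JOIN cmdb_assets c USING (asset_id)
LEFT JOIN edr_assets  e USING (asset_id)
LEFT JOIN vuln_assets v USING (asset_id)
WHERE c.asset_id IS NULL
   OR e.asset_id IS NULL
   OR v.asset_id IS NULL;   -- every returned row is a documented coverage hole
```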
Why this is a gate, not a phase
Programs that skip the gate produce alerts and models analysts rationally distrust.
The same dynamic plays out in SOC after SOC. The detection-engineering team ships content against telemetry whose completeness and freshness are unmeasured. Analysts notice the rules don't fire when they should fire and do fire when they shouldn't, conclude the rules are bad, and start writing their own SPL queries against what they consider the "real" version of the data. The detection-engineering pipeline gets routed around. The official tools get distrusted. Parallel tooling proliferates. The cost shows up as alert fatigue, analyst turnover, and shadow infrastructure — none of which appear as line items on the SIEM invoice.
The same pattern hits machine-learning models harder. A model trained on telemetry of unknown completeness and freshness produces output of unknown reliability. The model's evaluation metrics look fine in training; in production the output is noisier than those metrics suggested. The MLOps-hunting page walks through the failure mode in detail: the model wasn't bad, the data platform underneath it was.
The mechanism is the same in both cases. Engineers can't be productive on data they don't trust either; they just express the distrust differently — parallel pipelines, shadow ETL, "let me re-extract that from source" as the default. The same data-health gate that earns analyst trust also unlocks engineer velocity. Programs that try to ship detection content or ML models without the gate are paying the cost of the gate's absence in slower velocity, lower reliability, and persistent rework — they just don't recognize where the cost is coming from. The foundation engagement is the work that makes the cost visible and produces the data platform that earns the trust back.
The force multiplier
One verified data platform. Two parallel multiplier effects.
The technical pillars earn the budget. What justifies spending it on data infrastructure instead of more headcount or another point product is the force multiplier on the people who use the data platform: analysts on the operational side, engineers on the build side. The two effects compound, and both are downstream of the same foundation work.
Analyst trust — the operational payoff.
When the data platform is verifiably trustworthy, well-connected, and performant, analysts stop maintaining shadow tools and personal SPL searches because they don't trust the official ones. They stop second-guessing every alert because they can't tell if the data is fresh or correct. They stop routing around the detection-engineering pipeline because the data behind it is suspect. They stop dismissing model output because it might just be reflecting bad context.
And they start hunting longer windows confidently because the data is verifiably complete. They start treating detection content as production software rather than personal craft. They start trusting model output enough to act on it — or rejecting it for a defensible, evidence-grounded reason. They start closing tickets faster because the context graph answers "what is this asset, who owns it, what else has it touched" without a five-tab investigation. A SOC of ten analysts on a verified data platform outperforms a SOC of fifty on a data platform everyone privately distrusts.
Engineer productivity — the build-velocity payoff.
The same data platform makes four engineering roles materially more productive. Detection engineers get the DetectFlow-as-code regime working against trustworthy telemetry rather than fighting it. Data engineers work with schema-as-code (OCSF plus transforms in version control), source-to-sink contracts, and lineage as a first-class artifact — less firefighting, less ticket-driven work from analysts whose complaints turn out to be data-quality issues at the source. ML engineers get reproducible training data with lineage-traceable features, replacing the one-off-notebook pattern that doesn't survive production deployment. Platform engineers work on a vendor-neutral architecture that doesn't lock them into one vendor's whole tooling stack.
The context graph
Stages one and two produce data. Visualization makes it legible.
The output of the cross-tool gap analysis is a best-context report per entity (asset, user, application, configuration) by attribute (owner, criticality, last-seen, IP address, MAC address, OS version), with the authoritative source for each combination scored for confidence and freshness. The report is dense and analytically powerful, and it doesn't surface its value until analysts can navigate it visually.
The context-graph visualization layer renders entity relationships, source conflicts, and coverage gaps in a form analysts can use during incident response. When a ticket lands referencing an asset, the analyst sees the asset's resolved identity (with the conflicts between sources surfaced rather than hidden), its ownership chain, the tools currently observing it, the recent activity across all those tools, and the gaps where no source has visibility. The five-tab investigation collapses into a single surface; the time-to-context drops from minutes to seconds.
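Underneath both the report and the graph is the same ranking problem: for each entity and attribute, pick the source claim whose confidence and freshness combine best. A minimal sketch of one way to score it, assuming an attribute_observations table with one row per source claim; the scoring function and the one-day decay constant are illustrative assumptions, not the engagement's actual method.

```sql
-- One row per (entity, attribute): the authoritative source is the claim
-- whose confidence, decayed by staleness, scores highest.
SELECT entity_id, attribute, source_name, claimed_value, confidence, observed_at
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY entity_id, attribute
            ORDER BY confidence
                     * EXP(-EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - observed_at)) / 86400.0) DESC
        ) AS source_rank   -- decay roughly one e-fold per day of staleness
    FROM attribute_observations
) ranked
WHERE source_rank = 1;
```

The winning rows become the resolved attributes on the graph's nodes; the losing rows are kept and surfaced as the conflicts between sources rather than discarded.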
The visualization is the artifact that turns the data-health work into something analysts feel rather than read about. Engagement deliverables include both the reports (for the audit trail and the engineering team) and the visualization (for the SOC's daily operation). Skipping the visualization produces beautiful PDFs that nobody opens after the engagement closes.
When to pick it
Universally, before the others. Specifically, in five concrete patterns.
Every program that hasn't validated its data platform needs the foundation engagement before it commits to MOAR or DetectFlow or MLOps-hunting. Five concrete patterns where the conversation usually starts:
- Pre-migration validation. Before a platform migration, the foundation engagement establishes a trusted baseline for the current platform — what's actually flowing in, what's missing, what's in conflict. Migrating without that baseline produces post-migration surprises that get attributed to the new platform when the gaps were already present.
- Post-migration verification. After a platform migration, the foundation engagement verifies that the new platform is producing data of equivalent or better quality — that nothing got lost in the move, that schema mappings translated correctly, that the federation surface holds together.
- Audit preparation. Compliance frameworks (SOC 2, ISO 27001, HIPAA, PCI-DSS) increasingly expect demonstrable quality in the security telemetry used to evidence their controls. The foundation engagement produces the documented evidence auditors look for, with the audit trail their reviewers can reproduce.
- Continuous monitoring deployment. Programs that want to shift from periodic assessment to continuous monitoring need the framework, the dashboards, and the alerting that the foundation engagement produces as standing artifacts. The handoff is the dashboards and the runbooks, not just the report.
- Detection or ML program restart. Programs whose detection or ML work has stalled because of unreliable data quality benefit from the foundation engagement specifically as the unblock. Skipping ahead to "let's add more detections" or "let's train a model" without the foundation is the failure pattern that produced the stall in the first place.
The engagement is also the right shape for environments where the team senses that the data platform is unreliable but can't articulate where — the "we don't trust the data but can't say why" pattern that every mature SOC eventually surfaces. The foundation engagement converts the vague distrust into a documented, prioritized list of specific issues with specific owners.
The engagement shape
Data Quality & Flow Health Validation — productized form.
Data Quality & Flow Health Validation ($25K–$60K, 2–4 weeks). Pricing scales with data volume and log-source count. The methodology was originally productized at Corelight as the "Data Health Check" offering — one of two net-new professional-services offerings shipped there before the practice spun out. The deliverables:
- Pipeline health assessment. Per-source completeness, freshness, schema conformance, and OCSF conformance measured against documented baselines. The output: a per-source report with the four properties scored, the trend over the assessment window, and the flagged anomalies with proposed owners.
- Cross-tool gap analysis. Best-context determination per entity (asset, user, application, configuration) by attribute, scored across sources for confidence and freshness. The assurance mechanism — without it, "trustworthy" and "well-connected" are assertions, not demonstrated properties.
- Context graph visualization. Entity relationships, source conflicts, and coverage gaps rendered for analysts. The artifact that converts data-health work into something the SOC feels in daily operation.
- OCSF schema mapping validation. Field-level semantic alignment against the OCSF standard. Identifies the vendor-specific extensions, the misclassified event categories, and the mapping accuracy gaps that downstream analytics inherit.
- Automated validation framework. dbt tests, SQL monitors, dashboards. The standing artifacts that turn the engagement's findings into continuous monitoring rather than a one-time snapshot; a minimal sketch of one such test follows this list.
- Remediation roadmap. Prioritized list of issues with proposed owners, estimated effort, and expected impact on the trust pillars. Drives the next quarter's work for the data engineering and detection engineering teams.
- Ongoing-monitoring playbook. The runbook the team uses after the engagement closes — what to watch, how to respond when the dashboards alert, when to escalate.
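One shape the standing checks can take: a dbt singular test that fails whenever a source breaches its freshness target or goes silent, wired into the same dashboards and alerting the playbook describes. The table names and ref() targets are illustrative assumptions, not the engagement's actual project layout.

```sql
-- dbt singular test (lives under tests/); it fails if it returns any rows.
-- Each returned row is a source whose newest event has breached its
-- freshness target, or a source that has gone silent entirely.
SELECT
    b.source_name,
    MAX(e.ingest_time)          AS last_ingest,
    b.freshness_target_seconds
FROM {{ ref('source_baselines') }} AS b
LEFT JOIN {{ ref('security_events') }} AS e USING (source_name)
GROUP BY b.source_name, b.freshness_target_seconds
HAVING MAX(e.ingest_time) IS NULL
    OR MAX(e.ingest_time) < CURRENT_TIMESTAMP - b.freshness_target_seconds * INTERVAL '1 second';
```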
The engagement runs as a standalone deliverable when the program isn't ready to commit to MOAR yet, and runs in parallel with MOAR migration assessments when the prospect already knows the data platform has to move. Either way, the foundation work is the gate — every subsequent project's success depends on the data-platform trust this engagement produces.
Trust earned by measurement, not by assertion.
MOAR is the data platform. DetectFlow is the discipline. MLOps-hunting is the next horizon. The foundation is what makes any of them survive contact with the SOC.