Security Data Works

Data-quality deep-dive

The parsing layer nobody owns.

Security data quality is the part of the stack everything else depends on and no one is paid to own. The clearest example is time: a firewall log with no timezone at all, or a sensor log stamped when it was written instead of when the event happened. The same failure shows up as valuable context buried in a field nobody parsed. I've fixed problems like these at the vendor level and watched the fixes go nowhere, and the reason is incentives rather than competence. This is the case for a fair broker, and why I'm building one.

Reading time: about 12 minutes. Evidence tier: B — first-hand practitioner experience plus a public, citable pull request (PaloAltoNetworks/Splunk-Apps #294); the vendor-incentive reading is labeled as opinion where it is inference rather than fact. The regulated-utility details are anonymized.

The eighteen-hour day

Time is the foundation, and it was eighteen hours wide.

A few years ago I was reconstructing the order of events during an incident at a regulated utility, and the timeline refused to make sense. Connections that should have followed each other arrived hours apart; a session looked like it had torn down before it was built. I assumed I'd made a query mistake, then a parsing mistake, and only after I'd exhausted the things that are usually my fault did I check the one thing nobody checks, which is whether the clocks agreed. They did not. The firewall fleet had been configured over years by different people in different regions, and the appliances were set to a spread of timezones, UTC on some, US/Eastern, Central, Mountain, and Pacific on others, so that across the estate the same instant could be stamped as much as eighteen hours apart depending on which box you asked. The timeline I'd been staring at wasn't wrong because of a bad query. It was wrong because the ground underneath it wasn't level.

Time is the foundation you're standing on in security. Almost everything we do to data after it lands assumes the timestamps are true: correlation across sources, impossible-travel math, the simple human act of saying this happened, and then that happened. A detection is a statement about sequence and rate, and sequence and rate are functions of time. When the clocks disagree by eighteen hours you don't have a small error you can absorb with a tolerance window; you have a fabricated history, and the dangerous part is that it still looks like a history. Nothing throws an exception. The dashboard renders, the query returns rows, and you are quietly, confidently wrong about when things happened, which during an investigation is the one thing you cannot afford to be.

The missing field

PAN-OS ships without a timezone.

What made this particular hole hard to see is that PAN-OS syslog carries no timezone value at all. The TRAFFIC log gives you a generated time and a receive time, but nothing in the record states which zone those numbers are in, so the SIEM has to assume, and the tribal knowledge (the thing the experienced engineers know and the documentation mostly doesn't say out loud) is that every Palo Alto firewall has to be set to UTC so the assumption holds and the times stay comparable. That works right up until someone stands up a new pair of firewalls for a new site and leaves them on local time, and now a slice of your estate is silently offset from the rest, with no field anywhere in the data that would tell you, because the field doesn't exist. The fix is operational discipline that lives entirely in people's heads, which means it degrades the way undocumented discipline always degrades: slowly, invisibly, and right before you need it.
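To make the ambiguity concrete, here is a minimal Python sketch. The timestamp layout, the sample value, and the helper name as_utc are assumptions for illustration, not PAN-OS specifics. Fixed offsets stand in for real zone rules, which would also need daylight-saving handling.

```python
from datetime import datetime, timedelta, timezone

# Assumed layout for a zone-less log timestamp (illustrative, not a vendor spec).
FORMAT = "%Y/%m/%d %H:%M:%S"

def as_utc(raw: str, utc_offset_hours: float) -> datetime:
    """Interpret a naive timestamp under an assumed UTC offset, then convert to UTC."""
    naive = datetime.strptime(raw, FORMAT)
    assumed_zone = timezone(timedelta(hours=utc_offset_hours))
    return naive.replace(tzinfo=assumed_zone).astimezone(timezone.utc)

raw = "2023/01/10 09:15:00"     # made-up sample value
if_utc = as_utc(raw, 0)         # the box was set to UTC
if_pacific = as_utc(raw, -8)    # the box was left on US/Pacific standard time
gap = if_pacific - if_utc

print(if_utc.isoformat())       # 2023-01-10T09:15:00+00:00
print(if_pacific.isoformat())   # 2023-01-10T17:15:00+00:00
print(gap)                      # 8:00:00
```

The string is identical in both cases. Only the assumption changes, and nothing in the record says which assumption is right.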

The tell

Fine in the vendor's tool.

The part of this that taught me the most wasn't the misconfiguration; it was what happened every time I raised it. I'd take the discrepancy to the firewall subject-matter experts, and they would open their Palo Alto console, the vendor's own tool, and show me that the time was fine. And in their tool it was fine. My read (I can't see the internals, so I'll hold this as inference rather than fact) is that the appliance keeps richer, better-timestamped telemetry for its own management plane than it ships out over the syslog feed to the SIEM, so the vendor console always had enough fidelity to render a coherent picture while the SIEM was working from a thinner, zone-ambiguous copy. The practical effect was a lesson the organization kept relearning without ever quite saying it: if you want the real data, the high-quality data, you go to the vendor's tool. The SIEM, which is supposed to be the place where you correlate across every vendor you own, was quietly getting the second-class feed, and the people closest to the device had been trained by the tooling to believe the problem was on my end. Palo Alto and Splunk were not, in this respect, symbiotic partners. The data crossed the boundary between them and lost something on the way, and no one whose job it was to care about that boundary actually owned it.

Fixing it at the source

What a thousand-file pull request bought.

So I tried to fix it at the source. In 2023, while I was a customer and not anyone's vendor, I submitted a pull request to Palo Alto's own Splunk app, the parsing logic that turns the syslog feed into the fields a SIEM can search, because the field extractions were broken in ways that went well past timezones. The mechanism matters, so here is what actually broke. The TRAFFIC and CONFIG logs are positional, comma-separated records, which means a field is identified by its place in the line, and a single wrong assignment cascades: every field after the mistake shifts one position and parses into the wrong column. In the user-id extraction the src_user field had been omitted, so everything downstream of it slid over and landed somewhere it didn't belong. In the config extraction the logic was pulling devicegroup_level3 and devicegroup_level4 fields that don't exist in the data at all, which threw off everything after them the same way. There was a name collision where a serial_number field was overwriting the reporting device's serial with the asset's serial, so two different identities were being confused in a log where knowing which device you're looking at is the whole point. The change touched on the order of a thousand files and was tested against large-scale existing pan:* data. It was never merged. The repository has since been archived.
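The cascade is easy to reproduce. The sketch below uses a hypothetical, shortened field layout and a made-up log line; it is not the real PAN-OS field order or the app's actual extraction logic, only the same failure mechanism in miniature.

```python
import csv
from io import StringIO

# Hypothetical, shortened layout for illustration; NOT the real PAN-OS field order.
CORRECT_FIELDS = ["receive_time", "serial_number", "type",
                  "src_ip", "src_user", "dest_ip", "action"]
# The same layout with src_user left out of the extraction.
BROKEN_FIELDS = [name for name in CORRECT_FIELDS if name != "src_user"]

# Made-up record with one value per field in CORRECT_FIELDS.
LINE = "2023/01/10 09:15:00,0123456789,USERID,10.0.0.5,alice,10.0.9.9,allow"

def parse(line: str, fields: list[str]) -> dict[str, str]:
    """Assign comma-separated values to field names purely by position."""
    values = next(csv.reader(StringIO(line)))
    return dict(zip(fields, values))

good = parse(LINE, CORRECT_FIELDS)
bad = parse(LINE, BROKEN_FIELDS)

# Show which fields in the broken parse no longer hold the right value.
for name in BROKEN_FIELDS:
    marker = "" if good[name] == bad[name] else "  <-- shifted"
    print(f"{name:14} {bad[name]}{marker}")
```

Everything before the omission parses correctly, which is part of why the bug survives a casual look. Everything after it lands one column over: the username ends up in dest_ip, the destination address ends up in action, and the real action value is dropped without an error.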

I want to be careful about how I read that, because reaching for malice when incentives are the better explanation is its own kind of sloppiness. I don't think anyone at the vendor looked at the pull request and decided to keep their customers' data broken. What I think is that nobody was paid to merge it. A community Splunk app is a cost center, maintained on the margins, and the data being a little wrong in the SIEM is not a problem the vendor experiences, because the vendor experiences its own console, where the data is fine. If anything the gradient runs the other way: every hour a customer spends fighting the SIEM feed is a soft argument for spending more time in the vendor's tool, which is where the vendor differentiates and retains. None of that requires a conspiracy. It only requires that the people who could fix it have no incentive to, and the people who need it fixed have no way to force the issue. An unmerged pull request sitting in an archived repository is what that equilibrium looks like from the outside.

There is a layer beneath even that. The app wasn't freelancing; it was following Palo Alto's published syslog field reference, the document every consumer of that feed parses against, not just Splunk. For the config log, that reference and the data my firewalls actually emitted didn't agree, and the reference stayed that way. A canonical spec that's wrong about field order, and stays wrong, silently misaligns every parser built to it, across every downstream vendor, for as long as it goes uncorrected, which is a larger and quieter failure than any one app's bugs. The timezone gap is the cleanest illustration, because a resolution for it was already sitting in the data: a high-resolution timestamp carrying a timezone had been appended to the tail of the syslog format, the only safe place to add a field to a positional record. But the documentation doesn't connect that field to the problem it solves, and the Splunk app was never updated to read it, so the set-everything-to-UTC convention outlived a fix that was already shipping. The pull request would have surfaced the field and let teams retire the workaround. None of that needs bad faith read into it; it's what an unowned layer produces, where a fix can land in the data and still never reach the people relying on the workaround, because connecting the two is the work nobody is paid to finish.

Not just one vendor

The same shape, three more times.

If this were only a Palo Alto story I'd file it under one vendor's neglect and move on, but I've now seen the same shape enough times to think it's structural. Take Zeek, the open-source network monitor that Corelight productizes. I found and contributed parsing fixes to Corelight's Splunk app while I was their customer (I'd later go to work there, which I should disclose, though everything in this account predates it). Splunk, by default, timestamps an event using the first time value it encounters in the record, which is a reasonable general heuristic and the wrong one for Zeek specifically, because the first time value in a Zeek log is frequently the write timestamp, the moment the log was flushed on the sensor, and not the event start time, which is the timestamp security actually needs. The standard is that you stamp an event when it began, because that's the moment you're reasoning about, so stamping it when the sensor happened to write the record offsets your network telemetry by buffering and batching latency that has nothing to do with the event. It's a smaller error than eighteen hours, but it belongs to the same category: the time is wrong in a way that's invisible until you go looking, and it's wrong in the SIEM specifically, in the cross-vendor layer, while the vendor's own tool looks correct. I contributed that fix at the vendor level rather than just patching it locally for one employer, because the local patch is the easy move that quietly abandons everyone standing in the same hole. Fixing your own copy is real work and real relief, and it also lets the systemic problem persist for every other customer who hasn't figured it out yet.
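A rough sketch of the difference between the two rules follows. The record is made up, the field names _write_ts and ts are assumptions about the feed that you should verify against your own sensor's output, and first_time_seen is a simplified stand-in for a generic first-match heuristic, not Splunk's actual timestamp recognition.

```python
import json
from datetime import datetime, timezone

# Illustrative record. The field names (_write_ts, ts) are assumptions
# about the feed; check them against what your sensor really emits.
RAW = ('{"_path": "conn", "_write_ts": "2023-01-10T09:15:42.000000Z", '
       '"ts": "2023-01-10T09:15:07.500000Z", "uid": "CExample1"}')

def parse_time(value):
    """Return a UTC datetime if the value is an ISO-8601 Zulu string, else None."""
    if not isinstance(value, str):
        return None
    try:
        parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    except ValueError:
        return None
    return parsed.replace(tzinfo=timezone.utc)

def first_time_seen(record):
    """Generic heuristic: take the first field, in record order, that parses as a time."""
    for value in record.values():
        parsed = parse_time(value)
        if parsed is not None:
            return parsed
    return None

def event_start(record):
    """Per-source rule: read the named event-start field explicitly."""
    return parse_time(record["ts"])

record = json.loads(RAW)
skew = first_time_seen(record) - event_start(record)
print(skew.total_seconds())  # 34.5
```

In this made-up record the generic rule stamps the event 34.5 seconds late. The per-source rule is a single line, but someone has to know the source well enough to write it.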

Or take Tenable, where the failure was shaped differently but rhymes. The vulnerability data was reaching the SIEM, technically, except the genuinely valuable part, the structured CVE context you'd actually use to evaluate exposure across your infrastructure, arrived nested inside a field and unparsed, so until you happened to discover where the value was hiding you could not query your own CVEs in any way that crossed your environment. I ended up building a local SIEM app to extract that nested data into fields a human could reason about. The data was present and useless at the same time, which is its own kind of data-quality failure, and again it lived precisely at the handoff into the correlation layer rather than inside the vendor's own product.
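The unpacking itself is not complicated once you know where to look. The sketch below uses an invented event shape, invented field names, and placeholder CVE identifiers; it does not reproduce Tenable's actual export format or the local app I built, only the general move of exploding a nested blob into one flat row per CVE.

```python
import json
from collections import defaultdict

# Hypothetical events: the shape, field names, and CVE identifiers are
# placeholders for illustration, not any vendor's actual export format.
EVENTS = [
    {"host": "web-01", "severity": "critical",
     "plugin": '{"name": "Example check A", "cve": ["CVE-2099-0001", "CVE-2099-0002"]}'},
    {"host": "db-02", "severity": "high",
     "plugin": '{"name": "Example check B", "cve": ["CVE-2099-0001"]}'},
    {"host": "app-03", "severity": "low",
     "plugin": '{"name": "Example check C"}'},
]

def explode_cves(event):
    """Yield one flat row per CVE found inside the nested, unparsed field."""
    nested = json.loads(event["plugin"])
    for cve in nested.get("cve", []):
        yield {"host": event["host"], "severity": event["severity"], "cve": cve}

# Build the cross-environment view: which hosts carry which CVE.
hosts_by_cve = defaultdict(set)
for event in EVENTS:
    for row in explode_cves(event):
        hosts_by_cve[row["cve"]].add(row["host"])

print(sorted(hosts_by_cve["CVE-2099-0001"]))  # ['db-02', 'web-01']
```

Before the explode step, the question "which hosts have this CVE?" has no queryable answer, even though every byte needed to answer it has already been ingested.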

Put those next to each other and a pattern shows through. In each case the data technically arrived, and a naive integration would report success: bytes moved, events ingested, a green check on the source-health page. And in each case the data was wrong in a way that mattered and was hard to see, the timezone that didn't exist, the field that slid one position, the write-time wearing the event-time's clothes, the context folded into an unparsed blob. The common thread is that the corruption happens at the boundary between systems, in the parsing and normalization layer, which is exactly the layer no single party in the chain is paid to get right.

I've since measured how common this is across real logs. I took twelve thousand real production log lines (public samples across six systems, parsed by code against each format's published grammar, with no model in the loop) and the gap between what the standard documents and what the systems actually emit ran from zero to forty-three percent depending on the source. The case I keep coming back to is the one a detection engineer would never think to check: the same sshd daemon, on two different real deployments, writes its program field two different ways. One emits sshd, exactly as the syslog standard describes. The other emits sshd(pam_unix), the PAM module that did the logging folded into the program token, on every one of its six-hundred-odd authentication events. A detection or a normalization rule keyed on the documented program name sshd silently matches none of them; the authentication failures and the attempted break-ins are all sitting in the data under a token the rule was never told to expect, and nothing fires. The standard never sanctioned that form, the deployment emits it anyway, and the only way to know is to measure your own feed against the spec instead of trusting that the two agree. These are operating-system and application logs rather than a commercial security vendor's feed, and the rate tracks how a given deployment is configured rather than being a fixed property of the source, but that is the unowned layer again one level down, where whether your parser is right depends on a convention nobody wrote down.
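Here is a small sketch of that miss. The two log lines are made up in the two shapes described above, the regular expression is a simplified reading of the traditional syslog line layout, and base_program is one possible normalization, offered as an assumption and not a recommendation for every feed.

```python
import re
from collections import Counter

# Two made-up lines in the two program-token shapes described above.
LINES = [
    "Jan 10 09:15:07 host1 sshd[1234]: Failed password for invalid user admin "
    "from 203.0.113.7 port 50022 ssh2",
    "Jan 10 09:15:09 host2 sshd(pam_unix)[5678]: authentication failure; "
    "logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=203.0.113.7",
]

# Simplified pattern for "timestamp host program[pid]: message".
SYSLOG = re.compile(
    r"^\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s"
    r"(?P<host>\S+)\s"
    r"(?P<program>[^\[:\s]+)(?:\[(?P<pid>\d+)\])?:\s"
    r"(?P<msg>.*)$"
)

def program_of(line):
    """Extract the program token, or None if the line does not match."""
    match = SYSLOG.match(line)
    return match.group("program") if match else None

def base_program(token):
    """Tolerant normalization: drop a parenthesized suffix such as '(pam_unix)'."""
    return token.split("(", 1)[0]

programs = [program_of(line) for line in LINES]

# A rule keyed on the documented name sees only the first line.
strict_hits = [p for p in programs if p == "sshd"]
# A rule keyed on the normalized name sees both.
tolerant_hits = [p for p in programs if p and base_program(p) == "sshd"]
# Measuring the feed: which tokens deviate from the documented form, and how often.
undocumented = Counter(p for p in programs if p and p != base_program(p))

print(len(strict_hits), len(tolerant_hits), dict(undocumented))
```

The last counter is the useful part. It doesn't assume the feed matches the spec; it reports where the feed departs from it, which is the measurement a detection engineer needs before trusting a rule keyed on the program name.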

The incentives

Why no one in the chain fixes it.

The failure is over-determined, and it helps to walk the incentives one actor at a time, because several rational parties each decline to fix it for reasons that make sense from where they sit. The source vendor is paid for its own product, and a high-fidelity console next to a lossy syslog feed tilts the incentive toward the console, so polishing the community SIEM app is unpaid work that weakens a soft reason to stay in the vendor's tool. The SIEM is paid for ingest and search, and a per-source-correct default is more expensive to build and maintain than a general one, so you get the general one, first timestamp seen, take the line as it comes, and the per-source correctness becomes the customer's problem. The customer, meanwhile, is paid to keep the lights on, not to reverse-engineer a vendor's feed format on a Tuesday, and most security teams don't have a data engineer who thinks in field positions and timezones to begin with. Each of those is a defensible local decision. Their sum is that the parsing layer, which everything downstream depends on, is owned by no one.

There's a fourth actor, and they're the one who turned this from a series of bad afternoons into a thesis for me. Integrators and the more mature managed-service shops know all of this already. They have the timezone discipline, the corrected field extractions, the nested-field unpacking, the per-source timestamp overrides, packaged as a set of configurations they bring with them, and that package is a meaningful part of what they sell when they walk into a customer who's drowning in their own data. I don't begrudge them the charge; if you've done the unglamorous work of learning where every vendor hides the bodies, getting paid for that knowledge is fair. But the structure it creates is perverse, because the rational move for an integrator is to keep the knowledge tribal. Publishing the fix would help the whole market and give away the advantage, so the fix stays private, gets re-sold to the next customer, and the systemic problem never resolves, because resolving it across the market runs against the interest of the people best equipped to resolve it. Everyone in the chain is behaving sensibly, and the sum of all that sensible behavior is that practitioners across the industry keep paying, separately and repeatedly, to rediscover the same dozen parsing problems.

The gap

What a fair broker is for.

What's missing is anyone whose incentive is aligned with the practitioner across the whole market rather than with a single product, a single contract, or a single retained client. Standards help with part of this, and I don't want to undersell them. The Open Cybersecurity Schema Framework gives you a defined place to put a normalized timestamp and a typed home for fields that used to land in a free-text blob, which is genuine progress, and a lot of my own benchmarking work measures how completely real vendor schemas map into it. But a schema can only describe where a correct value should go; it cannot manufacture a correct value out of a feed that never carried a timezone, and it cannot decide for Splunk that the Zeek event-start time matters more than the write time. Field-mapping fidelity asks whether there's a correct place to put the value. Data quality asks whether the value that lands there is true. You can have a flawless schema mapping carrying a confidently wrong timestamp, and most of the time nobody will notice, because the pipeline is green and the dashboard renders.

That gap is the reason I'm building Security Data Works the way I am. The idea is to act as a fair broker: to benchmark the things vendors and integrators have an incentive to keep quiet, and to say them out loud, on the record, with reproducible evidence and disclosed positions. Palo Alto's syslog feed ships without a timezone, and the community Splunk app carried well over a hundred broken and missing field extractions for years. Splunk's default timestamp logic picks the wrong field for Zeek. Tenable's most useful context arrives nested and unparsed. None of those is a secret to the people who've hit them; all of them are effectively secret to the practitioner who hasn't yet, because the knowledge is locked inside vendor consoles, unmerged pull requests, and integrator playbooks. The evidence I produce is meant to make that knowledge public and checkable: not vendor assurances, not integrator promises, but tests and teardowns you can run yourself and disagree with line by line. Some of those checks already run as code. The public demonstration stack ships a data-health gate (./moar healthcheck in security-data-that-works) that tests exactly this class of failure: timestamps actually in epoch UTC, no NULL hiding inside a filter set, row counts that agree across engines. On clean data it reports healthy, and the same checks caught every fault injected into the demonstrator.
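For a sense of what checks in that class can look like, here is a minimal sketch. It is not the code behind ./moar healthcheck; the function names, the plausibility window, and the sample values are all assumptions made for illustration. The NULL check reflects one common reason such a test exists, namely that in SQL a NOT IN filter against a list containing NULL matches no rows; whether that is the exact failure the gate targets is also an assumption.

```python
from datetime import datetime, timezone

def epoch_utc_ok(values, earliest, latest):
    """True if every value is a number inside a plausible epoch-seconds window."""
    lo, hi = earliest.timestamp(), latest.timestamp()
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool) and lo <= v <= hi
        for v in values
    )

def filter_set_has_null(filter_values):
    """True if a NULL (None) is hiding inside a set of filter values."""
    return any(v is None for v in filter_values)

def row_counts_agree(counts_by_engine):
    """True if every engine reports the same row count for the same data."""
    return len(set(counts_by_engine.values())) <= 1

# Assumed plausibility window: calendar year 2023, in UTC.
window = (datetime(2023, 1, 1, tzinfo=timezone.utc),
          datetime(2024, 1, 1, tzinfo=timezone.utc))

clean = [1673342107.5, 1673342142.0]            # epoch seconds, 2023-01-10 UTC
faulty = [1673342107500, "2023/01/10 09:15:00"]  # milliseconds and a raw string

print(epoch_utc_ok(clean, *window))                          # True
print(epoch_utc_ok(faulty, *window))                         # False
print(filter_set_has_null(["allow", None, "deny"]))          # True
print(row_counts_agree({"engine_a": 1200, "engine_b": 1199}))  # False
```

None of these checks is clever. Their value is that they run on every load and fail loudly, where the underlying faults fail silently.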

I should be honest about the limits, because a fair broker who oversells is just another interested party. I'm one person with a point of view, not an institution; the evidence I produce is reproducible practitioner-grade work, which is a real tier but not peer review; and I hold positions I have to disclose rather than pretend away. This isn't a solved problem with a product bolted on. It's a structural gap in how the security-data market is organized, where the cost of bad data at the boundaries is paid by practitioners and the durable incentive to fix it sits with no one. I don't think naming that fixes it overnight. I do think someone has to stand in the gap and name it plainly, with evidence, without something else to sell you, and that the absence of anyone doing so is most of why the problem has lasted as long as it has.

Closing

What time is it on your data?

The test I keep coming back to is the one from that incident, and it's a question every defender should be able to answer and most cannot: what time is it, really, on the data you're defending with? Not what time the dashboard shows, and not what time the vendor's console shows when the subject-matter expert pulls it up to prove you wrong. What time the data actually carries, across every source, when it lands in the place you go to correlate it. If the honest answer is that you'd have to open the vendor's own tool to find out, then the layer your detections rest on is one nobody owns, and no amount of detection engineering built on top of it will make the answer any truer.