
Parquet doesn't hash the way your chain-of-custody thinks it does.

A lot of security and compliance tooling proves that evidence hasn't changed by hashing the file and comparing the digest later, which works beautifully for a PDF or a packet capture because the same bytes always produce the same SHA-256. I went to apply that same instinct to a Parquet table of security telemetry and found that it doesn't hold: the same logical rows, written to Parquet twice, came back as two different files with two different hashes, and nothing about the data had changed. The cause is narrow and fixable, but the consequence for any workflow that keys integrity on a file hash isn't small, because a perfectly faithful re-export reads as tampering.

The assumption underneath the hash

Same data, same bytes, same digest?

The reason a file hash works as an integrity control is that it's deterministic in the only direction that matters: change one byte and the SHA-256 changes completely, leave the bytes alone and the digest is identical every time you compute it. Chain-of-custody, WORM retention, and content-addressed dedup all rest on the second half of that, the assumption that the same evidence will always hash to the same value, so a later digest that matches the original is proof the file wasn't touched and a digest that differs is proof it was. For files written once and copied verbatim that assumption is sound, which is why nobody thinks about it: you hash the artifact, store the digest, and a year later re-hash and compare.
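The byte-level control described here fits in a few lines. This is a minimal sketch using Python's standard hashlib; the names sha256_file, verify, and demo are hypothetical, chosen for illustration rather than taken from any particular chain-of-custody tool.

```python
import hashlib
import os
import tempfile


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, recorded_digest):
    """True when the file's current bytes match the digest recorded earlier."""
    return sha256_file(path) == recorded_digest


def demo():
    """Hash an artifact, then re-check it untouched and after a one-byte append."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "capture.bin")
        with open(path, "wb") as fh:
            fh.write(b"evidence bytes")
        recorded = sha256_file(path)
        untouched = verify(path, recorded)
        with open(path, "ab") as fh:
            fh.write(b"!")
        after_edit = verify(path, recorded)
    return untouched, after_edit
```

The check only ever compares bytes, which is exactly the right thing for an artifact that is never re-encoded.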

Parquet quietly breaks the assumption, because a Parquet file isn't a verbatim copy of anything. It's the output of an encoder that takes logical rows and lays them out as compressed column chunks grouped into row groups, and that layout is a choice the writer makes, not a property of the data. Two writers, or the same writer run twice, can produce byte-different files that decode to exactly the same set of rows, and to a hash function those two files are as different as two completely unrelated documents. So the moment your evidence lives in Parquet and your process ever rewrites it, the file hash stops answering the question you think you're asking.

I want to be careful about why this happens, because the easy explanation is wrong and the wrong explanation leads to the wrong fix. It is tempting to assume the writer is nondeterministic in its content, that it's making different encoding decisions run to run or salting something, and if that were true you'd be stuck. It isn't true. The content the writer produces is stable; what varies is the order in which it's emitted, and that distinction is the whole story.

What's actually moving

It's the order the threads finish in.

When a modern writer encodes a large table it doesn't do it on one thread. It splits the work and hands each worker a slice of rows, each worker encodes its slice into one or more row groups, and those row groups are written into the file as the workers complete. The encoding of any individual row group is deterministic, so the bytes inside a row group are the same every run. What isn't fixed is the sequence in which the workers hand their finished row groups back, because that depends on scheduling: which thread got which slice, and how the OS interleaved them on a given run. Run the write again and the same row groups land in the file in a different order, the file's byte layout shifts, and the SHA-256 changes even though every logical row is present and identical.

That's worth restating plainly because it inverts the natural suspicion. The data didn't change, the encoder didn't make a different decision, and there's no randomness in the column values, so the only thing that varied is the physical ordering of fixed, identical blocks. A reader doesn't care, because Parquet carries its own metadata and the engine reassembles the logical table correctly regardless of row-group order, which is exactly why this stays invisible until something downstream is comparing bytes rather than rows.
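A toy model makes the mechanism concrete. The sketch below is not a Parquet writer: it assumes a made-up container format (length-prefixed, zlib-compressed JSON blocks standing in for row groups), and the helper names encode_row_group, write_file, and read_rows are hypothetical. What it illustrates is only the property described above: each block's bytes are fixed, and swapping the order in which the blocks are emitted changes the file digest while the decoded rows stay the same. Rows are compared after sorting on event_id, an assumed unique key.

```python
import hashlib
import json
import zlib


def encode_row_group(rows):
    """Deterministically encode one slice of rows (stand-in for a real row group)."""
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return zlib.compress(payload, 6)


def write_file(row_groups):
    """Concatenate length-prefixed row groups in the order they were handed back."""
    out = bytearray()
    for block in row_groups:
        out += len(block).to_bytes(4, "big") + block
    return bytes(out)


def read_rows(data):
    """Decode every row group and return all rows, in file order."""
    rows, pos = [], 0
    while pos < len(data):
        size = int.from_bytes(data[pos:pos + 4], "big")
        block = data[pos + 4:pos + 4 + size]
        rows.extend(json.loads(zlib.decompress(block)))
        pos += 4 + size
    return rows


def by_id(row):
    return row["event_id"]


rows = [{"event_id": i, "action": "login" if i % 2 else "logout"} for i in range(8)]
slices = [rows[0:4], rows[4:8]]                 # one slice per simulated worker
groups = [encode_row_group(s) for s in slices]  # same bytes on every run

run_a = write_file([groups[0], groups[1]])      # worker 0 finished first
run_b = write_file([groups[1], groups[0]])      # worker 1 finished first

digest_a = hashlib.sha256(run_a).hexdigest()
digest_b = hashlib.sha256(run_b).hexdigest()
same_rows = sorted(read_rows(run_a), key=by_id) == sorted(read_rows(run_b), key=by_id)
```

Both files have the same length and contain the same two blocks; only their position differs, and that is enough to make digest_a and digest_b disagree.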

The fix follows directly from the cause, and it's cheap. You force determinism on the one thing that's varying, in one of two ways. You can write single-threaded, so there's only one possible emission order (in the engine I was using that's threads=1). Or you can impose an explicit ORDER BY before the write, so the rows, and therefore the row groups, come out in a defined sequence no matter how many threads encode them. Once you pin either of those, the same logical data produces the same bytes every time and the SHA-256 is stable again. The ordered version had a side effect I didn't expect: the sorted layout was about 20% smaller on disk, because sorted data puts similar values next to each other and Parquet's encodings compress runs of similar values better than scattered ones. The thing you do for reproducibility also buys you storage.
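The effect of pinning the order can be sketched with a simulation. Everything here is an assumption made for illustration: the column-by-column zlib encoding only loosely imitates a columnar layout, the column names and ROW_GROUP_SIZE are invented, and shuffled arrival order stands in for whatever order parallel workers deliver rows in. The sort key ends in event_id, assumed unique, so the ordering is total; a sort key with ties would leave some of the order undefined. The sketch shows the direction of the size effect, not the 20% figure.

```python
import hashlib
import json
import random
import zlib

COLUMNS = ("src_host", "action", "event_id")
ACTIONS = ("login", "logout", "read", "write")
ROW_GROUP_SIZE = 500  # hypothetical rows per row group


def encode_columns(chunk):
    """Encode one row group column by column, loosely mimicking a columnar layout."""
    out = bytearray()
    for name in COLUMNS:
        values = json.dumps([row[name] for row in chunk]).encode("utf-8")
        block = zlib.compress(values, 6)
        out += len(block).to_bytes(4, "big") + block
    return bytes(out)


def write_table(rows, order_by=None):
    """Write rows as row groups, optionally sorting on a full key first."""
    if order_by is not None:
        rows = sorted(rows, key=lambda r: tuple(r[c] for c in order_by))
    groups = [
        encode_columns(rows[i:i + ROW_GROUP_SIZE])
        for i in range(0, len(rows), ROW_GROUP_SIZE)
    ]
    return b"".join(groups)


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


base = [
    {"src_host": f"host-{i % 16:02d}", "action": ACTIONS[(i // 16) % 4], "event_id": i}
    for i in range(2000)
]

# Two runs deliver the same rows in different arrival orders.
run_a, run_b = list(base), list(base)
random.Random(1).shuffle(run_a)
random.Random(2).shuffle(run_b)

unordered_a, unordered_b = write_table(run_a), write_table(run_b)
key = ("src_host", "action", "event_id")  # event_id is unique, so the order is total
ordered_a, ordered_b = write_table(run_a, key), write_table(run_b, key)
```

Without a sort the two runs produce different bytes; with the sort they produce identical bytes, and the sorted output is smaller because each low-cardinality column becomes long runs of repeated values.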

Where this bites in security

A faithful re-export reads as tampering.

Think about what a lakehouse full of security telemetry actually does to its files over their lifetime, because it's a lot more than write-once-read-never. A tier migration moves cold data to cheap storage and rewrites it on the way, a compaction job merges many small files into fewer large ones, and a re-partition reshapes the layout when the access pattern changes. Every one of those operations reads the logical rows and writes them back out faithfully without altering a single value, yet every one produces new bytes with a new hash because the physical layout, row-group order included, won't match the original. If your evidence-integrity process recorded the SHA-256 of the original file and checks it after one of these housekeeping jobs runs, the check fails, and the failure looks exactly like the thing the check exists to catch.

That's the dangerous direction, the false positive that cries tampering on benign maintenance, because it trains people to distrust a control that's firing correctly and then to route around it. There's a quieter direction too. A byte-hash proves the bytes are identical, but the thing you care about in an investigation is that the rows are identical, and those aren't the same statement once the layout is free to vary. A genuinely altered file could in principle be re-laid-out so that it satisfies whatever your process happened to expect, and more practically, a dedup or WORM system that keys on byte-hash will treat two faithful copies of the same evidence as two distinct objects, storing both and retaining both, while believing it has deduplicated. The control isn't lying about bytes. It's answering a question about bytes when you needed an answer about content.

So the correction is to hash the logical content rather than the file, which means deciding on a canonical ordering of the rows and hashing the data in that order, or computing a row-level digest that doesn't depend on physical layout at all, so two files decoding to the same rows produce the same integrity value no matter how the writer arranged them. It's more work than calling sha256sum on a file, and it forces you to define what "the same evidence" means at the row level, which is uncomfortable but correct, because that definition was always what your chain-of-custody was really trying to assert and the file hash was only ever a convenient proxy for it.
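One way to sketch a layout-independent integrity value: hash each row from a canonical serialization, sort the per-row digests, and hash that sorted sequence. This is an illustration under stated assumptions, not a vetted scheme. The names row_digest and content_digest are hypothetical, rows are assumed to be already decoded into plain dictionaries, and JSON with sorted keys stands in for a real canonical encoding, which would also have to pin down how timestamps, floats, and nulls are represented.

```python
import hashlib
import json


def row_digest(row):
    """Hash one row from a canonical serialization (sorted keys, fixed separators)."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).digest()


def content_digest(rows):
    """Layout-independent digest: sort the per-row digests, then hash the sequence."""
    row_hashes = sorted(row_digest(r) for r in rows)
    outer = hashlib.sha256()
    outer.update(len(row_hashes).to_bytes(8, "big"))  # bind the row count
    for h in row_hashes:
        outer.update(h)
    return outer.hexdigest()


original = [
    {"event_id": 1, "action": "login", "user": "alice"},
    {"event_id": 2, "action": "read", "user": "bob"},
    {"event_id": 3, "action": "logout", "user": "alice"},
]

# Same rows, different physical order and different key order within a row.
relaid = [
    {"user": "alice", "action": "logout", "event_id": 3},
    {"user": "alice", "action": "login", "event_id": 1},
    {"user": "bob", "action": "read", "event_id": 2},
]

# One value altered, and separately one row duplicated.
tampered = [dict(r) for r in original]
tampered[1]["user"] = "mallory"
duplicated = original + [original[0]]
```

Sorting the digests rather than the rows avoids having to pick a sort key, and duplicate rows still count because each contributes its own entry. The sketch holds every row digest in memory, so for large tables the same idea would more plausibly run inside the query engine or over an external sort.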

What this is and isn't evidence of

One machine, a property of the writer.

I reproduced this on a single machine in the lab, in the ocsf-parquet-determinism probe, and I'd rather label it honestly than oversell it. This is Tier B evidence: first-party and reproducible, but observed on one host with one toolchain, so I'm reporting a behavior I can demonstrate rather than a universal law of every Parquet writer ever shipped. What makes me fairly confident it generalizes is that the cause isn't an accident of my setup, it's a property of writing in parallel, and parallel encoding is the default in essentially every engine that writes Parquet at scale because that's how large writes get fast. The non-determinism comes from the same place the speed does, so any writer that hands row groups to multiple threads has the same freedom to emit them in whatever order they finish unless you've told it not to.

The honest open questions are about edges, since I haven't characterized whether every writer's single-threaded path is bit-for-bit stable across versions or how this interacts with the file-level statistics some catalogs compute, and those are worth a small cross-writer probe rather than an assumption. What I'm not claiming is that Parquet is broken or that you shouldn't store evidence in it, because the format faithfully preserves the rows and the readers reconstruct them correctly, which is the job. The claim is narrower and, I think, durable: byte-level reproducibility is not something Parquet gives you for free, you have to ask for it, and a security control that assumed you already had it was resting on an assumption that doesn't survive contact with a multi-threaded writer.

The good news is inside the bad news: the failure is localized. Force the order and determinism comes back cleanly, the digest is stable run to run, and you get exactly the integrity guarantee you wanted with one extra constraint on the write. This is a findable, fixable property with a known cause and a one-line remedy, not a fog of general unreliability, and once you know to pin the order the problem stops being mysterious.

Why this keeps happening

The same failure as the silent wrong count.

This has the same shape as a finding I wrote up separately, where one engine returned a filtered count a few rows short over a byte-identical Parquet file and raised no error doing it. In both cases the defect lives in the gap between what the tooling checks and what you actually care about. There the schema validation confirmed the result was a valid integer and never checked whether it was the right integer; here the hash confirms the bytes are identical and never checks whether the rows are. Both are silent because everything in the normal path is satisfied, the query is valid SQL and the file is valid Parquet, while the property you were trusting was quietly never the property being verified.

The common cause is that an open lakehouse moves you from one closed engine that controlled the whole path to a set of independent components, each making its own implementation choices, and the controls you carried over from the closed world made assumptions that the open world doesn't honor. A SIEM that owned its storage could promise a stable on-disk artifact because nothing else ever touched it, but the moment your telemetry sits in Parquet under an open catalog, multiple writers and engines touch the same data and each is free to lay it out as it sees fit, so an integrity check written for the single-owner world starts catching its own infrastructure doing routine work. The portability is worth having, and it relocates where you have to verify, from trusting a vendor's closed guarantees to pinning the guarantees you actually need yourself.

Why the check is the deliverable

Hash the content, not the bytes.

The reason I keep circling back to this is that the same rule shows up everywhere I run an honest measurement: you verify the thing you actually care about rather than a convenient proxy for it. The rule that does the most work in keeping a benchmark from misleading you is to compare logical answers rather than physical artifacts, because the artifact can differ for reasons that have nothing to do with correctness while the answer is what you're really asserting. Integrity is the same problem wearing different clothes, since you don't want to know that the bytes match, you want to know that the evidence is unchanged, and on a columnar format under a multi-writer catalog those two questions have come apart. I wrote about the measurement side of this separately, and the integrity side is the same discipline applied to chain-of-custody.

It would be easy to read this as a knock on the open lakehouse, and it isn't. The closed SIEM gave you a stable artifact by owning everything and charging you for the privilege, and the open architecture takes that ownership back, which is most of why you chose it. What changes is that the guarantees you used to get bundled now have to be specified and checked. That work isn't a tax on the architecture; it's what makes the architecture trustworthy enough to put evidence on. A byte-hash on a Parquet file isn't wrong so much as it's answering a question the open world quietly stopped asking, and the fix is to ask the right question, which was always whether the rows are the same.

So the takeaway I'd hand to anyone running security or compliance workloads on Parquet is concrete. If a hash anywhere in your pipeline is standing in for "this evidence hasn't changed," confirm whether it's hashing bytes or content, and if it's bytes, either pin a deterministic write (single-threaded or an explicit ORDER BY) so the bytes are stable or move the integrity value down to the row level so layout can't disturb it, then re-derive that value after every compaction and tier migration. The first time a maintenance job trips your tamper alarm on data nobody touched, you'll be glad you knew why before the auditor asked.

Evidence: Tier B (first-party, reproduced; single machine). The non-determinism was observed in the SDW Lab ocsf-parquet-determinism probe: the same logical rows written to Parquet twice produced byte-different files with different SHA-256 digests, the cause traced to row-group emission order under a multi-threaded writer rather than to nondeterministic content, and determinism restored by forcing a single-threaded write (threads=1) or an explicit ORDER BY, with the ordered layout also about 20% smaller. Generality across other Parquet writers and versions is an open follow-up. See also the sibling silent-failure case (the wrong-count finding) and the lab's broader method (the lab page).

The bytes can change while the evidence doesn't.

A file hash proves the bytes are identical, but on a Parquet table under an open catalog that isn't the same as proving the rows are. The integrity control worth building checks the content, not the layout, and the lab work that finds this class of gap is the same discipline that keeps a benchmark honest.
