# CEP 19 - Computing the hash of the contents in a directory

| | |
|---|---|
| Title | Computing the hash of the contents in a directory |
| Status | Approved |
| Author(s) | Jaime Rodríguez-Guerra <jaime.rogue@gmail.com> |
| Created | Nov 19, 2024 |
| Updated | Dec 19, 2024 |
| Discussion | https://github.com/conda/ceps/pull/100 |
| Implementation | https://github.com/conda/conda-build/pull/5277 |
## Abstract
Given a directory, propose an algorithm to compute the aggregated hash of its contents in a cross-platform way. This is useful to check the integrity of remote sources regardless of the compression method used.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119](https://datatracker.ietf.org/doc/html/rfc2119) when, and only when, they appear in all capitals, as shown here.
## Specification
Given a directory, recursively scan all its contents (without following symlinks) and sort them by their full path as a Unicode string. More specifically, it MUST follow an ascending lexicographical comparison using the numerical Unicode code points (i.e. the result of Python's built-in function `ord()`) of their characters.[^1]
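The required ordering compares full paths as plain Unicode strings, code point by code point. Note that this can differ from other natural-seeming orders, such as sorting `pathlib.Path` objects, which compare by path components. A small illustration:

```python
# "-" is code point 45 and "/" is code point 47, so as a plain string
# "a-b" sorts before "a/a", even though a component-wise sort would
# place "a/a" and "a/b" (children of "a") before "a-b".
paths = ["a/b", "a-b", "a/a"]
print(sorted(paths))  # ['a-b', 'a/a', 'a/b']
```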
For each entry in the contents table, compute the hash for the concatenation of:

- UTF-8 encoded bytes of the path, relative to the input directory. Backslashes MUST be normalized to forward slashes before encoding.
- Then, depending on the type:
  - For regular files:
    - If text, the UTF-8 encoded bytes of an `F` separator, followed by the UTF-8 encoded bytes of its line-ending-normalized contents (`\r\n` replaced with `\n`). A file is considered a text file if all of its contents can be UTF-8 decoded. Otherwise, it's considered binary.
    - If binary, the UTF-8 encoded bytes of an `F` separator, followed by the bytes of its contents.
    - If it can't be read, error out.
  - For a directory, the UTF-8 encoded bytes of a `D` separator, and nothing else.
  - For a symlink, the UTF-8 encoded bytes of an `L` separator, followed by the UTF-8 encoded bytes of the path it points to. Backslashes MUST be normalized to forward slashes before encoding.
  - For any other types, error out.
- UTF-8 encoded bytes of the string `-`.
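As an illustration (not part of the specification), consider a hypothetical tree containing a symlink `link` pointing to `sub/a.txt`, a directory `sub`, and a text file `sub/a.txt` with contents `"hi\r\nthere\n"`. The byte stream fed to the hash function would be:

```python
import hashlib

# Entries sorted by relative path as Unicode strings: "link" < "sub" < "sub/a.txt".
stream = b"".join([
    b"link" + b"L" + b"sub/a.txt" + b"-",         # symlink: path, "L", target, "-"
    b"sub" + b"D" + b"-",                         # directory: path, "D", "-"
    b"sub/a.txt" + b"F" + b"hi\nthere\n" + b"-",  # text file: CRLF normalized to LF
])
print(stream)  # b'linkLsub/a.txt-subD-sub/a.txtFhi\nthere\n-'

# The contents hash is then the digest of this stream:
digest = hashlib.sha256(stream).hexdigest()
```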
Note that the algorithm MUST error out on unreadable files and unknown file types because we can't verify their contents. An attacker could hide malicious content in those paths known to be "unhashable" and later reveal them again in the build script (e.g. by `chmod`ing them as readable).
Example implementation in Python:

```python
import hashlib
from functools import partial
from pathlib import Path


def contents_hash(directory: str, algorithm: str) -> str:
    hasher = hashlib.new(algorithm)
    for path in sorted(Path(directory).rglob("*")):
        hasher.update(str(path.relative_to(directory)).replace("\\", "/").encode("utf-8"))
        if path.is_symlink():
            hasher.update(b"L")
            hasher.update(str(path.readlink()).replace("\\", "/").encode("utf-8"))
        elif path.is_dir():
            hasher.update(b"D")
        elif path.is_file():
            hasher.update(b"F")
            try:
                # assume it's text; collect all lines first so a decoding
                # error halfway through doesn't leave a partial update
                lines = []
                with open(path) as fh:
                    for line in fh:
                        lines.append(line.replace("\r\n", "\n"))
                for line in lines:
                    hasher.update(line.encode("utf-8"))
            except UnicodeDecodeError:
                # file must be binary; hash it in chunks
                with open(path, "rb") as fh:
                    for chunk in iter(partial(fh.read, 8192), b""):
                        hasher.update(chunk)
        else:
            raise RuntimeError(f"Unknown file type: {path}")
        hasher.update(b"-")
    return hasher.hexdigest()
```
## Motivation
Build tools like `conda-build` and `rattler-build` need to fetch the source of the project being packaged. The integrity of the download is checked by comparing its known hash (usually SHA256) against the hash of the obtained file. If they don't match, an error is raised.
However, the hash of the compressed archive is sensitive to superfluous changes like which compression method was used, the version of the archiving tool, and other details that do not affect the contents of the archive, which is what a build tool actually cares about. This happens often with archives fetched live from GitHub repository references, for example.
It is also useful to verify the integrity of a `git clone` operation on a dynamic reference like a branch name.
With this proposal, build tools could add a new family of hash checks that are more robust for content reproducibility.
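The archive-metadata problem can be demonstrated with a small sketch (illustrative only, using a made-up file): two in-memory `.tar.gz` archives of identical contents hash differently when only archive-level metadata changes.

```python
import hashlib
import io
import tarfile


def targz_digest(mtime: int) -> str:
    # Build an in-memory .tar.gz holding the same single file, varying
    # only the archived mtime (archive metadata, not file contents).
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        data = b"hello world\n"
        info = tarfile.TarInfo("hello.txt")
        info.size = len(data)
        info.mtime = mtime
        tar.addfile(info, io.BytesIO(data))
    return hashlib.sha256(buf.getvalue()).hexdigest()


# Identical contents, different archive metadata -> different SHA256.
print(targz_digest(0) == targz_digest(1))  # False
```

A contents hash as specified above would be identical in both cases, since it never sees the archive's metadata.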
## Rationale
The proposed algorithm could simply concatenate all the bytes together, once the directory contents have been sorted. Instead, it also encodes relative paths and separators to prevent preimage attacks.
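A toy example of the ambiguity that encoding paths and separators avoids (illustrative, with made-up entries): under naive concatenation, two distinct trees can produce identical byte streams.

```python
# A file named "ab" containing b"c" vs. a file named "a" containing b"bc":
naive_1 = b"ab" + b"c"
naive_2 = b"a" + b"bc"
print(naive_1 == naive_2)  # True: the naive streams collide

# With the CEP's path + "F" + contents + "-" framing, the streams differ:
framed_1 = b"ab" + b"F" + b"c" + b"-"
framed_2 = b"a" + b"F" + b"bc" + b"-"
print(framed_1 == framed_2)  # False
```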
Merkle trees were not used for simplicity, since it's not necessary to update the hash often or to point out which file is responsible for the hash change.
The implementation of this algorithm as specific options in build tools is a non-goal of this CEP. That goal is deferred to further CEPs, which could simply say something like:

> The `source` section is a list of objects, with keys [...] `contents_sha256` and `contents_md5` (which implement CEP 19 for SHA256 and MD5, respectively).
## References
- The original issue suggesting this idea is conda-build#4762.
- The Nix ecosystem has a similar feature called `fetchzip`.
- There are several Rust crates and Python projects implementing similar strategies using Merkle trees. Some of the details here were inspired by `dasher`.
## Copyright
All CEPs are explicitly CC0 1.0 Universal.
## Footnotes

[^1]: This is what Python does. See "strings" in [Value comparisons](https://docs.python.org/3/reference/expressions.html#value-comparisons).