CEP 21 - Run-exports in sharded Repodata
Title | CEP 21 - Run-exports in sharded Repodata |
Status | Approved |
Author(s) | Bas Zalmstra <bas@prefix.dev> |
Created | Jan 16, 2025 |
Updated | March 20, 2025 |
Discussion | https://github.com/conda/ceps/pull/108 |
Implementation | NA |
Abstract
We propose to add run-export information to the sharded repodata shards.
Motivation
When building conda packages the build infrastructure needs to extract run-export information from conda packages in the host- and build environments.
Run-export information is stored in a package and can be extracted by downloading the package and extracting the run_exports.json
file.
Even with the possibility to stream parts of .conda
files this is a relatively resource-intensive operation.
CEP 12 formalized a run_exports.json
file that is stored next repodata.json
file.
However, not all channels on the default server (conda.anaconda.org) provide this information which means falling back to downloading and extracting this information from the packages. It is possible to extract the data by only sparsly reading the file but the overhead is still relatively large.
Having two separate files also poses some problems as extra mechanisms have to be introduced in the build infrastructure to manage and sync both files on the build machines.
Specification
CEP 12 mentions the following reasons for splitting the information into two files:
- It would require extending the repodata schema, currently not formally standardized.
- It would increase the size of the already heavy repodata.json files.
- (Typed) repodata parsers would need to be updated to handle the new field. We propose that these reasons no longer hold with sharded repodata.
It would require extending the repodata schema, currently not formally standardized.
We propose to add a run_export
field to each record that mimics the specification from CEP 12.
If the run_export
field is not present in the record it means no run_export
information is stored with the record, and a fallback mechanism should be used to acquire the run-export information.
Since adding a field will not break existing parsers we feel this is safe and does not require a schema change.
It would increase the size of the already heavy repodata.json files.
The size of the shards would grow if run-exports are added.
Let's take a look at the current sizes of run_exports.json and repodata.json files.
channel + subdir | size of repo_data.json | size of run_exports.json | repodata.json.zst | run_exports.json.zst |
---|---|---|---|---|
conda-forge + linux-64 | 254 MB | 34.7 MB (11%) | 38.4 MB | 2.2MB (5%) |
conda-forge + noarch | 107 MB | 13.6 MB (11%) | 16.7 MB | 0.9 MB (5%) |
conda-forge + osx-arm64 | 99.8 MB | 11.8 MB (11%) | 12.6 MB | 0.8 MB (6%) |
conda-forge + win-64 | 185 MB | 22.1 MB (11%) | 24.7 MB | 1.4 MB (5%) |
Since the repodata shards are also compressed we can conclude that in practice adding run exports information would increase the size of the repodata shards by roughly 5-6%.
With the introduction of sharded repodata in CEP 16 the issues with size (and scale) have been effectively mitigated. Adding 5-6% to the total size of the shards will not pose a risk since all advantages of sharded repodata mentioned in the CEP still hold.
(Typed) repodata parsers would need to be updated to handle the new field.
Unless parsers are very strict about unknown fields (which was not a requirement for sharded repodata shards) this will not pose an issue. Since the absence of the field means that the state of the run-exports is unknown adding the field does not require a schema change.
Patching
With the run-exports part of the repodata we propose to also allow repodata patching these fields. The original run-exports can still be extracted from the packages. We do not see a reason how the run-export information is different from other patchable information stored in the repodata (like the dependencies).
To facilitate these patches, the patch_instructions_version
in patch_instructions.json
files is incremented to 2
. A patch_instructions_version: 2
file MAY contain a run_exports: ...
field in the patch instructions that MUST be used to patch generated run_exports.json
files.
{
"packages": {
"_libarchive_static_for_cph-3.3.3-h0921cf1_1.tar.bz2": {
...,
"run_exports": {
"weak": ["libarchive >=3,<4"],
}
},
...
}
}
For a channel to provide run-export patches, it MUST contain a run_exports.json
file.
For build tools to support patched run exports it MUST always query the sharded repodata, or use a run_exports.json
file as the source of truth for the run exports.
If neither sharded repodata or a run_exports.json
file is present build tools can assume no run export patches exist.