Skip to main content

CEP 21 - Run-exports in sharded Repodata

Title CEP 21 - Run-exports in sharded Repodata
Status Approved
Author(s) Bas Zalmstra <bas@prefix.dev>
Created Jan 16, 2025
Updated March 20, 2025
Discussion https://github.com/conda/ceps/pull/108
Implementation NA

Abstract

We propose to add run-export information to the sharded repodata shards.

Motivation

When building conda packages the build infrastructure needs to extract run-export information from conda packages in the host- and build environments. Run-export information is stored in a package and can be extracted by downloading the package and extracting the run_exports.json file. Even with the possibility to stream parts of .conda files this is a relatively resource-intensive operation.

CEP 12 formalized a run_exports.json file that is stored next repodata.json file. However, not all channels on the default server (conda.anaconda.org) provide this information which means falling back to downloading and extracting this information from the packages. It is possible to extract the data by only sparsly reading the file but the overhead is still relatively large.

Having two separate files also poses some problems as extra mechanisms have to be introduced in the build infrastructure to manage and sync both files on the build machines.

Specification

CEP 12 mentions the following reasons for splitting the information into two files:

  • It would require extending the repodata schema, currently not formally standardized.
  • It would increase the size of the already heavy repodata.json files.
  • (Typed) repodata parsers would need to be updated to handle the new field. We propose that these reasons no longer hold with sharded repodata.

It would require extending the repodata schema, currently not formally standardized.

We propose to add a run_export field to each record that mimics the specification from CEP 12.

If the run_export field is not present in the record it means no run_export information is stored with the record, and a fallback mechanism should be used to acquire the run-export information.

Since adding a field will not break existing parsers we feel this is safe and does not require a schema change.

It would increase the size of the already heavy repodata.json files.

The size of the shards would grow if run-exports are added.

Let's take a look at the current sizes of run_exports.json and repodata.json files.

channel + subdirsize of repo_data.jsonsize of run_exports.jsonrepodata.json.zstrun_exports.json.zst
conda-forge + linux-64254 MB34.7 MB (11%)38.4 MB2.2MB (5%)
conda-forge + noarch107 MB13.6 MB (11%)16.7 MB0.9 MB (5%)
conda-forge + osx-arm6499.8 MB11.8 MB (11%)12.6 MB0.8 MB (6%)
conda-forge + win-64185 MB22.1 MB (11%)24.7 MB1.4 MB (5%)

Since the repodata shards are also compressed we can conclude that in practice adding run exports information would increase the size of the repodata shards by roughly 5-6%.

With the introduction of sharded repodata in CEP 16 the issues with size (and scale) have been effectively mitigated. Adding 5-6% to the total size of the shards will not pose a risk since all advantages of sharded repodata mentioned in the CEP still hold.

(Typed) repodata parsers would need to be updated to handle the new field.

Unless parsers are very strict about unknown fields (which was not a requirement for sharded repodata shards) this will not pose an issue. Since the absence of the field means that the state of the run-exports is unknown adding the field does not require a schema change.

Patching

With the run-exports part of the repodata we propose to also allow repodata patching these fields. The original run-exports can still be extracted from the packages. We do not see a reason how the run-export information is different from other patchable information stored in the repodata (like the dependencies).

To facilitate these patches, the patch_instructions_version in patch_instructions.json files is incremented to 2. A patch_instructions_version: 2 file MAY contain a run_exports: ... field in the patch instructions that MUST be used to patch generated run_exports.json files.

{
"packages": {
"_libarchive_static_for_cph-3.3.3-h0921cf1_1.tar.bz2": {
...,
"run_exports": {
"weak": ["libarchive >=3,<4"],
}
},
...
}
}

For a channel to provide run-export patches, it MUST contain a run_exports.json file. For build tools to support patched run exports it MUST always query the sharded repodata, or use a run_exports.json file as the source of truth for the run exports. If neither sharded repodata or a run_exports.json file is present build tools can assume no run export patches exist.