# CEP 16 - Sharded Repodata
| Title | Sharded Repodata |
| --- | --- |
| Status | Accepted |
| Author(s) | Bas Zalmstra <bas@prefix.dev> |
| Created | April 30, 2024 |
| Updated | July 22, 2024 |
| Discussion | https://github.com/conda-incubator/ceps/pull/75 |
| Implementation | |
## Sharded Repodata
We propose a new "repodata" format that can be sparsely fetched. That means, generally, smaller fetches (only fetch what you need) and faster updates of existing repodata (only fetch what has changed).
## Motivation
The current repodata format is a JSON file that contains all the packages in a given channel. Unfortunately, that means it grows with the number of packages in the channel. This is a problem for large channels like conda-forge, which has over 150,000 packages: fetching, parsing, and updating the repodata becomes very slow.
## Design goals
- **Speed**: Fetching repodata MUST be very fast, both in the hot- and cold-cache case.
- **Easy to update**: The channel MUST be very easy to update when new packages become available.
- **CDN friendly**: A CDN MUST be usable to cache the majority of the data. This reduces the operating cost of a channel.
- **Support authN and authZ**: It MUST be possible to implement authentication and authorization with little extra overhead.
- **Easy to implement**: It MUST be relatively straightforward to implement to ease adoption in different tools.
- **Client-side cacheable**: If a user has a hot cache, only small incremental changes SHOULD have to be downloaded. Preferably, as little communication with the server as possible should be required to check the freshness of the data.
- **Bandwidth optimized**: Any data that is transferred SHOULD be as small as possible.
## Previous work
### JLAP
In a previously proposed CEP, JLAP was introduced. With JLAP, only the changes to an initially downloaded `repodata.json` file have to be downloaded, which drastically reduces bandwidth usage and in turn makes fetching repodata much faster.
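To make the mechanism concrete, the sketch below shows the patch-application step a client would perform against its cached `repodata.json`. This is a conceptual illustration rather than the actual JLAP wire format; it assumes the third-party `jsonpatch` Python package and a hypothetical `patches` list that has already been downloaded from the server.

```python
# Conceptual sketch only: the real JLAP format and bookkeeping differ.
import json

import jsonpatch  # third-party package implementing RFC 6902 JSON patches


def apply_incremental_patches(cached_repodata_path: str, patches: list) -> dict:
    """Apply a sequence of downloaded JSON patch documents to cached repodata."""
    with open(cached_repodata_path) as f:
        repodata = json.load(f)

    # Only the patches have to be downloaded, so the transfer is small.
    # Applying them, however, requires holding and rewriting the entire
    # (potentially very large) repodata document in memory.
    for patch in patches:
        repodata = jsonpatch.apply_patch(repodata, patch)

    return repodata
```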
However, in practice, patching the original repodata can be a very expensive operation, both in terms of memory and compute, because of the sheer amount of data involved.
JLAP also does not save anything with a cold cache, which is often the case for CI runners, because the initial repodata still has to be downloaded.
Finally, the implementation of JLAP is quite complex, which makes it hard for implementers to adopt.
### ZSTD compression
A notable improvement is compressing the `repodata.json` with `zstd` and serving that file. In practice this yields a file that is roughly 20% of the original size (20-30 MB for large channels). Although this is still quite a big file, it is substantially smaller.
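As a rough illustration, the snippet below downloads and decompresses the compressed file in one go. It assumes the channel serves `repodata.json.zst` next to `repodata.json` and that the `requests` and `zstandard` Python packages are available; real clients add caching, retries, and authentication.

```python
# Sketch only: error handling and HTTP caching are omitted.
import json

import requests
import zstandard


def fetch_repodata_zst(channel_url: str, subdir: str) -> dict:
    """Download and decompress `repodata.json.zst` for one channel subdir."""
    response = requests.get(f"{channel_url}/{subdir}/repodata.json.zst", timeout=60)
    response.raise_for_status()

    # The whole compressed file is downloaded and decompressed in memory;
    # even compressed it is tens of megabytes for large channels.
    decompressed = zstandard.ZstdDecompressor().decompressobj().decompress(response.content)
    return json.loads(decompressed)
```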
However, the file still contains all repodata in the channel. This means the file needs to be redownloaded every time anyone adds a single package (even if a user doesn't need that package).
Because the file is relatively big, a large `max-age` is often used for caching, which means it takes more time for new packages to propagate through the ecosystem.