
Sharded repodata in conda (beta): an order of magnitude faster

9 min read
Travis Hathaway
Conda maintainer 👷🔧


We're excited to announce a new beta feature in conda called sharded repodata. This optimized repodata format makes environment solves faster by reducing the time spent fetching package metadata. Conda-forge is already serving sharded repodata, so you can try it immediately when using conda with conda-forge. In this post, we'll show you how to enable it, explain how the work came together across the ecosystem, and share the performance improvements you can expect in everyday use.

How do I try it out?

If you're using conda-forge with conda and would like to try out this new feature, first update conda-libmamba-solver in your base environment and then opt in to the feature by setting plugins.use_sharded_repodata = true:

conda install --name base 'conda-libmamba-solver>=25.11.0'
conda config --set plugins.use_sharded_repodata true
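
As a quick sanity check, you can inspect your configuration sources afterwards; the plugins section should list use_sharded_repodata: true:

conda config --show-sources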

Background and context

In this section, we provide some context on why we decided to develop this feature. We think it will help you better understand the performance metrics, but if you already know the ins and outs of how conda stores and uses package metadata, feel free to skip ahead to the Performance improvements section.

What's repodata?

Repodata is a conda-specific term for the package index that all conda clients must download in order to find and install available packages. The best way to think about it is as a database containing every package file in a channel, along with its dependencies and other metadata.
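
To make that concrete, here is a heavily trimmed, illustrative excerpt of what a channel's repodata.json can look like (the package entry and its fields are abbreviated; the real conda-forge file contains entries for hundreds of thousands of package files):

{
  "info": {"subdir": "linux-64"},
  "packages": { ... },
  "packages.conda": {
    "numpy-2.1.0-py313_0.conda": {
      "name": "numpy",
      "version": "2.1.0",
      "build": "py313_0",
      "depends": ["python >=3.13,<3.14.0a0", "..."],
      "sha256": "..."
    }
  }
}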

Repodata challenge at scale

Channels distribute repodata as a single file. As channels add more packages, that file grows. For channels with hundreds of thousands of packages, like conda-forge, that single repodata file becomes a bottleneck: it takes longer to download and requires more memory to parse. Whenever anything in the channel changes, the entire cache is invalidated and conda re-downloads the complete file just to get the latest metadata. All of this adds up to a slow experience for users.

Addressing this challenge

There have been several attempts to address this problem over the past six years, including reducing the size of repodata.json and incrementally updating repodata by patching. Both made repodata fetching more efficient, but it still wasn't as fast as it could be. This led Bas Zalmstra and Wolf Vollprecht at prefix.dev to design and implement a new approach. Bas authored CEP 16, with input from the entire conda community, defining a new mechanism for fetching repodata using a sharded approach.

This approach works by splitting repodata into many "shards". Each package has its own shard, which is much smaller than the complete repodata, so when a package is installed conda only needs to fetch the shards for that package and its dependencies. This results in a much smaller download and a faster overall experience.
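
To make the mechanism more concrete, here is a rough Python sketch of the fetch flow that CEP 16 describes. This is an illustration, not how conda-libmamba-solver is implemented; the channel URL, the numpy example, and the "shards/" path and field names are assumptions based on the CEP and may differ in practice. It assumes the requests, zstandard, and msgpack packages are installed.

import io

import msgpack
import requests
import zstandard

SUBDIR_URL = "https://conda.anaconda.org/conda-forge/linux-64"

def fetch_msgpack(url):
    # Download a zstd-compressed msgpack document and decode it.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    decompressed = zstandard.ZstdDecompressor().stream_reader(io.BytesIO(response.content)).read()
    return msgpack.unpackb(decompressed)

# 1. Fetch the small shard index instead of the full repodata.json.
index = fetch_msgpack(f"{SUBDIR_URL}/repodata_shards.msgpack.zst")

# 2. The index maps each package name to the content hash of its shard.
shard_hash = index["shards"]["numpy"].hex()

# 3. Fetch only that shard; it contains the metadata for the numpy
#    package files in the channel and nothing else.
shard = fetch_msgpack(f"{SUBDIR_URL}/shards/{shard_hash}.msgpack.zst")
print(len(shard.get("packages.conda", {})), "numpy .conda packages in the shard")

In practice, the solver repeats steps 2 and 3 for each dependency it encounters, so only the shards relevant to the requested packages are ever downloaded.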

Sharded repodata in the wider ecosystem

CEP 16 was authored by Bas Zalmstra at prefix.dev, reviewed by the conda community, and approved by the conda Steering Council in July 2024. For a deep dive into the technical design and motivation, see the prefix.dev blog post on sharded repodata.

Since the CEP was approved, Pixi, rattler, and rattler-build have all provided production-ready implementations of sharded repodata. The prefix.dev channels have supported CEP 16 from day one, giving Pixi and rattler users the benefits of faster metadata fetching for over a year.

Now that anaconda.org also serves sharded repodata for conda-forge, even more users across the ecosystem will benefit. This is also great news for conda-forge maintainers using rattler-build: faster repodata fetching means reduced build times in CI.

This collaborative effort across multiple organizations (prefix.dev, Anaconda, Quansight, and the conda-forge community) demonstrates how the conda ecosystem can work together to deliver meaningful improvements for everyone.

Bringing sharded repodata to conda

With CEP 16 already proven in production by Pixi and rattler, we've been doing the work needed to bring the same benefits to conda users.

  • Earlier this year, we updated conda-index so channels can generate sharded repodata.
  • Most recently, we added support in conda-libmamba-solver, now in beta, so the conda CLI can consume the new repodata format.
  • The anaconda.org team at Anaconda, with contributions from Quansight, worked with the conda-forge community to enable hosting of sharded repodata. We plan to work with other channels to add support as the beta progresses.

With all this in place, conda-forge is now publishing sharded repodata, and conda will automatically find it if the feature is enabled.

In the next section, we share the performance improvements we've seen so far.

Performance improvements

To compare performance between the sharded and non-sharded approaches, we used two different environment creation scenarios: Python and Data Science. The Python scenario fetches the package python and all of its dependencies, and the Data Science scenario fetches the packages pandas, plotly, and scipy and their dependencies. We benchmarked just the repodata fetching itself (check out the script we used here).

Info: We used a temporary conda-forge-sharded channel because, at the time of profiling, sharded repodata was not available on conda-forge via anaconda.org.

The benchmarks were run inside a linux/arm64 Docker container running on an Apple M1 Pro.

Furthermore, we ran our comparison with both a cold cache, where nothing was present in conda's cache and everything was fetched via network requests, and a warm cache, where few, if any, network requests were necessary.

To get a complete picture of how these changes affect conda, we measured not only total time but also total network bandwidth usage and maximum memory usage.

Total time

Below are the comparisons between non-sharded and sharded fetching measuring total time.

Total time with cold cache (seconds)

[Chart: total time with a cold cache, non-sharded vs. sharded, for both scenarios]

Total time with warm cache (seconds)

[Chart: total time with a warm cache, non-sharded vs. sharded, for both scenarios]

Key takeaways

  • Under both scenarios, we see a ten times speed-up in fetching and parsing repodata
  • This happens because the sharded repodata is significantly smaller

Info: The way shards are stored also means that the cache itself is invalidated less often (see CEP 16 for more information). As a result, conda-forge users will see times closer to the faster "warm" cache scenario more often and will spend less time downloading repodata to install the packages they want.

Network bandwidth

To further illustrate the improvements, we show the total number of megabytes downloaded with the sharded versus the non-sharded approach.

Total network bandwidth (MB received)

[Chart: total megabytes received, non-sharded vs. sharded, for both scenarios]

Key takeaways

  • The sharded approach reduces the amount downloaded by a factor of 35!
  • Non-sharded fetching always has to download the same full-size repodata, while sharded fetching only downloads what it needs, so the amount varies with the requested packages, as seen here.

Max memory usage

The last metric we examine is the maximum memory usage of repodata fetching. Here, we just show the "cold" cache scenario.

Max memory usage (MB)

[Chart: maximum memory usage with a cold cache, non-sharded vs. sharded, for both scenarios]

Key takeaways

Both package scenarios see significant decreases in memory usage, with fifteen and seventeen times reductions in maximum memory usage.

Conclusion

We're excited to see these numbers and think this will translate to a better overall experience for conda users! If you've read this far, we hope you're convinced to give the beta release a try and we welcome any feedback you may have. Please file an issue at the conda-libmamba-solver repository on GitHub to reach out to us.

Finally, we want to give a big shout out to the conda-maintainers team and specifically Daniel Holth for making the addition of this feature possible!

Info: If you're interested in how we generated the profiling data presented here, please check out perfpy-conda and the perfpy tool.