Vendoring Projects Through Git (Part 1)

What is it and its problems?

I’ve noticed more chatter about vendoring dependencies lately. Between the recent interview with Matthew Miller, the discussion of how to handle Rust dependencies in the Linux kernel, and other threads I’ve seen, I figured I’d write down my thoughts on it. This will be a multi-part series that I’ll index here:

Personally, I find vendoring to be a necessary evil. I don’t really like it, but if it’s going to be done, I’d like it to be done for the right reasons and with as many of the problems it can cause avoided as possible. In future parts, I’ll cover why one might want to vendor in spite of the problems that end up needing to be solved, and some strategies for avoiding those problems once you’ve decided to vendor something.

What is vendoring?

“Vendoring” is the process of embedding some dependency of your project in your project’s own codebase. It is not uncommon in projects that target Windows or in projects that are “products”, such as games. It can be done either by using pre-compiled binaries directly or by importing the source code and building your own copy for use within the project.

There’s a similar effect with static linking: the projects may be separate at build time, but once distributed, the result is mostly indistinguishable from source vendoring at runtime, since changes to the statically linked library are not available to the consuming binaries without further work.

Additionally, once a process is running, it can be considered to have vendored any code it has loaded, since replacing that code is largely not possible without restarting the program. This is typically not considered a large issue unless some serious flaw is found (e.g., some libc mechanism has a denial-of-service lying in wait for any program to trip over or such).

Things to consider

There are a number of things to consider when deciding whether to vendor or not. If you’re shipping an end product, there are far fewer things to worry about, as you can help ensure that you’re not conflicting with the ambient environment the product runs in. Mechanisms such as rpath or other runtime loader settings (usually set through the environment) are available to ensure that your copy of some dependency is what gets used. There are a few points in a project’s lifecycle that need to be considered. I’ll use zlib as an example because it has a history of being vendored and is a small library that is also “everywhere” already. However, this is not limited to embedding a C library; similar situations exist for other languages such as Python, JavaScript, and others.

Building

The first place where conflicts may occur is at build time of your own project. Many platforms will inevitably have zlib.h and libz.so around somewhere on development machines. If the build environment is not set up properly, your project may pick up that copy instead of your vendored copy. This can usually be avoided by having pristine build environments that are relied upon for official builds, but it is still something to consider for developer machines as well. Usually, there is enough control over things like include paths, linker flags, or Python’s sys.path to ensure that everything is used as intended.
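As a concrete sketch of the ambiguity (the vendored path mentioned below is hypothetical), the consuming source looks identical either way; only the include search order decides which copy is found:

```cpp
// consumer.cpp — whether this picks up the vendored zlib or the system's
// /usr/include copy depends entirely on the include paths the build passes
// (e.g., -Ithird-party/zlib before any system directories; path hypothetical).
#include <zlib.h>

#include <cstdio>

int main() {
    // zlibVersion() reports the library resolved at runtime, while
    // ZLIB_VERSION is baked in from whichever header was found; comparing
    // the two is a cheap sanity check that the build wired things up as
    // intended.
    std::printf("runtime zlib %s, compiled against %s\n",
                zlibVersion(), ZLIB_VERSION);
    return 0;
}
```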

One additional thing to note here is that vendoring is “viral”. If your project vendors zlib, any other dependency that uses zlib internally usually needs to agree on the same version. This is absolutely the case if its APIs use zlib types in their interfaces, but even if not, loading and runtime conflicts still need to be considered. When this happens, the middle dependency typically needs to be vendored as well so that it can use the vendored copy.
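For a sketch of why it’s viral, imagine a middle dependency (the header below is hypothetical) that exposes a zlib type in its public API; every consumer must now agree with it on a single zlib definition:

```cpp
// middle.h — a hypothetical dependency that also uses zlib and leaks that
// fact through its interface.
#ifndef MIDDLE_H
#define MIDDLE_H

#include <zlib.h>  // which zlib.h? The consumer must resolve to the same one.

// Because z_stream appears here, a consumer compiled against a *different*
// zlib (vendored vs. system, or a version with a changed struct layout)
// would hand this function a struct the library doesn't expect.
int middle_compress_begin(z_stream* strm, int level);

#endif
```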

Developer use

The next place conflicts can appear, if your project has a life as an SDK, is once it is installed and other projects build against it. This can happen when a consuming project wants to use zlib for its own purposes as well. If you ship your own copy’s SDK, the consumer needs to be pointed at that version. If an SDK is not provided (e.g., your use of zlib is an implementation detail and not present in your own APIs), it still needs to not conflict at runtime, but SDK conflicts can at least be ignored.

This can usually be mitigated by installing your zlib’s headers and library into a subdirectory that is discoverable when explicitly asked for, but not picked up by default.
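One sketch of such a layout (all paths hypothetical): nothing lands in the default header or library search paths, and consumers opt in explicitly:

```cpp
// Hypothetical install layout for a vendored zlib SDK:
//
//   <prefix>/include/myproject/vendor/zlib.h    — not on the default path
//   <prefix>/lib/myproject/libmyproject_zlib.so — renamed (see "Loading")
//
// A consumer pointed at the subdirectory opts in explicitly, rather than
// accidentally shadowing the platform's own <zlib.h>:
#include <myproject/vendor/zlib.h>
```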

Loading

After everything is built, the next conflicts can occur when loading the program from disk and preparing it to actually run. How runtime loaders work varies from platform to platform in more ways than anyone wants. They all have knobs to tweak how they work, though these are generally process-wide changes. These knobs also usually only control details adjacent to your goals and therefore offer a fun mix of unintended consequences while still not doing exactly what you need. Anything beyond the simple PATH-like mechanisms of environment variables and rpath entries is out of scope here, since they tend to be far more platform-specific than generally useful. Languages such as Python have similar issues, as there’s no reliable way to influence sys.path that doesn’t also have unintended consequences. Of course, the easiest answer for this specific section is to just statically link the vendored dependency and avoid having to search for a library at all.

Another problem to consider here is when your zlib needs to be loaded in a context where another copy of it may exist. There are ways of ensuring that your copy is only loaded when needed (e.g., by using an rpath entry relative to $ORIGIN or @executable_path), but there are some side effects of these that should be noted:

  • Loading a library later may expose it to your zlib due to inherited search paths. This can happen on macOS with @rpath/ or on ELF platforms if the now-deprecated DT_RPATH is used rather than DT_RUNPATH.
  • If using LD_LIBRARY_PATH or similar mechanisms, spawning new processes may also expose your zlib to other tools.

The easiest thing to do here is to change the name of the library during the build so that your project looks for libmyproject_zlib.so instead of libz.so. This ensures that you (and anything that may have used it through your SDK) only look for your copy, and that anything unaware isn’t blindsided by your copy either.
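When experimenting with these knobs, it helps to be able to ask the loader which copy actually won. A minimal diagnostic sketch (assumes a dlfcn-capable platform; dladdr is a common extension rather than POSIX, and older glibc needs -ldl at link time):

```cpp
// which_zlib.cpp — ask the dynamic loader where a zlib symbol was resolved
// from, which is handy when rpath/LD_LIBRARY_PATH experiments leave you
// unsure which copy "won".
#include <dlfcn.h>
#include <zlib.h>

#include <cstdio>

int main() {
    Dl_info info = {};
    // zlibVersion is a convenient, always-exported symbol to locate.
    if (dladdr(reinterpret_cast<void*>(&zlibVersion), &info) && info.dli_fname)
        std::printf("zlibVersion() resolved from: %s\n", info.dli_fname);
    else
        std::printf("could not resolve the symbol's origin\n");
    return 0;
}
```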

Runtime

Alright, so your project has been built, built against, and loaded without issues. But does it actually work?

On macOS and Windows, this is usually not much of an issue, as symbols are, by default, looked up relative to the library they’re expected to live in. However, if global symbols are wanted (e.g., the GIL in CPython or any kind of global type registry), these can be handled by using -flat_namespace on macOS or by expecting the symbol to come from the loading library instead (at the expense of having undefined symbols at link time). On that last note, I have a 100% untested patch for the macOS linker to support saying “any symbols you get from library X will come from, waves hand, somewhere” and leaving them undefined and in the flat namespace.

However, in ELF land (Linux, FreeBSD, etc.), there’s generally just one global symbol table. This is good for the global-registry uses, but harmful when you end up with two copies of zlib loaded at once: one library may end up calling into some other library’s vendored zlib. That might be fine, but if it isn’t (and rarely can you hope to be so lucky), the best result is usually memory corruption that causes immediate failure due to ABI expectations not being met.
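zlib itself tries to soften this particular blow: its init macros pass the header’s version and structure size down into the library, so a call that crossed over into a mismatched copy at least has a chance to fail cleanly. A minimal sketch:

```cpp
#include <zlib.h>

#include <cstdio>
#include <cstring>

int main() {
    z_stream strm;
    std::memset(&strm, 0, sizeof(strm));  // zalloc/zfree/opaque = Z_NULL

    // inflateInit expands to inflateInit_(&strm, ZLIB_VERSION,
    // (int)sizeof(z_stream)): the *header's* version and struct size are
    // checked by whichever library the symbol table actually bound us to,
    // so a mismatched copy can return Z_VERSION_ERROR here instead of
    // corrupting memory later.
    int rc = inflateInit(&strm);
    if (rc != Z_OK) {
        std::printf("mismatch? rc=%d (header %s, runtime %s)\n",
                    rc, ZLIB_VERSION, zlibVersion());
        return 1;
    }
    inflateEnd(&strm);
    return 0;
}
```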

The way to solve this is to “mangle” the symbols as well. This usually involves a header full of #define symname myproject_symname lines that is included in every relevant TU when building the project; C libraries usually end up with a gigantic table of such defines. In C++, redefining a namespace name via the preprocessor can usually mangle an entire library’s symbols at once. However, this has a fun side effect in the preprocessor, depending on how the library includes its own headers.
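A minimal sketch of such a table (the myproject_ prefix is hypothetical; zlib itself ships a similar opt-in scheme behind Z_PREFIX in zconf.h):

```cpp
// myproject_zlib_mangle.h — force-included (e.g., via -include) into every
// TU of the vendored zlib and of anything in the project that calls it.
#ifndef MYPROJECT_ZLIB_MANGLE_H
#define MYPROJECT_ZLIB_MANGLE_H

#define zlibVersion   myproject_zlibVersion
#define deflateInit_  myproject_deflateInit_
#define deflate       myproject_deflate
#define deflateEnd    myproject_deflateEnd
#define inflateInit_  myproject_inflateInit_
#define inflate       myproject_inflate
#define inflateEnd    myproject_inflateEnd
// ...and so on: the real table covers every exported symbol.

#endif
```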

Since the preprocessor also performs replacements on the tokens after #include, care must be taken. Let’s use Boost as an example. If #define boost myboost is used to mangle symbols, #include <boost/config.hpp> will end up looking for a myboost directory. This does not occur if #include "boost/config.hpp" is used, because the preprocessor doesn’t perform replacements inside the quoted form. Just renaming the directory, or using symlinks (if available), is usually sufficient.

This does cause issues if the use of the vendored library is optional. Since the name used to include headers depends on whether a vendored copy is in use, one ends up with patterns like #include BOOST_HEADER(config.hpp) that generate the right include directive. Note that this confuses formatters, which like to “see” division operators and reformat things into BOOST_HEADER(directory / file.hpp), which…doesn’t work. In this case, deploy // clang-format off comments to tell the formatter to ignore your preprocessor abuses.
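A sketch of the pattern (the macro and configuration names here are hypothetical):

```cpp
// include_dispatch.h — choose between the system Boost tree and a vendored,
// directory-renamed ("myboost") tree with a single include spelling.
// clang-format off
#ifdef MYPROJECT_USE_VENDORED_BOOST
#  define BOOST_HEADER(hdr) <myboost/hdr>
#else
#  define BOOST_HEADER(hdr) <boost/hdr>
#endif
// clang-format on

// Usage: the macro expands into a computed #include target. Without the
// guards above, a formatter may insert spaces around the "/".
#include BOOST_HEADER(config.hpp)
```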

In Python (and likely in other similar languages such as Ruby), since there is caching involved in package lookups, it is hard to just “move” a package to somewhere else in the package hierarchy because, without patching, it will look itself up at the root and possibly find a non-vendored copy of itself. Multiply this by every library that ends up being vendored and communicates via its API, and reliably vendoring libraries in these languages becomes a chore. Duck typing saves it somewhat, but there are still problems with global state possibly becoming confused. It’s also why things like virtualenv are so useful in such environments.

Workflows

The next post in the series will cover how you can manage vendored projects in your source tree should you decide to do so. Hopefully that one won’t take 3 months to finally finish up…