Linux Problems

This page was prepared for the OSDL meeting in December 2005. It describes many of the problems inherent to Linux we've encountered whilst distributing complex software in binary form to end users. It also offers a few suggestions for improvements.

This page is probably the most comprehensive document on Linux binary portability out there right now.


NB: This page is only about BINARY COMPATIBILITY



# Unused deps

We need to automatically strip unused DT_NEEDED entries, as many *-config scripts, pkg-config files and libtool versions add unnecessary -l options to the link line. These make binaries far more brittle than they need to be by increasing their exposure to library instability.

Some systems have a broken libpng that does not link against libz and libm. This interferes with automatic dependency stripping.

Recommendations:

We already fix this in autopackage using apbuild.
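For illustration, modern GNU ld can strip unused NEEDED entries itself with the --as-needed flag; here is a minimal sketch of the effect (the file name is invented for this example):

```c
/* deps.c -- never actually calls into libpng, but a lazy *-config
 * script might add -lpng to the link line anyway. */
#include <stdio.h>

int main(void) {
    puts("no png in sight");
    return 0;
}

/* gcc deps.c -lpng
 *   readelf -d a.out | grep NEEDED   -> lists libpng.so (unused!)
 * gcc deps.c -Wl,--as-needed -lpng
 *   readelf -d a.out | grep NEEDED   -> the libpng entry is gone */
```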


# Python

Python is unfortunately problematic for third party developers who wish to:

* integrate the Python interpreter into their application
* develop Python modules (written in C/C++)

Pure Python apps are OK, but experience indicates most desktop Python apps aren't "pure"; that is, they use C modules included in the source tree.

The libpython ABI is very unstable (every minor release changes it), and it also varies between distributions, because the Unicode ABI changes according to how the configure script was run: Python upstream defaults to UCS2 but Red Hat builds with UCS4.
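To make the embedding case concrete, here is a minimal sketch of hosting the Python 2.x interpreter from C (the library version in the build command is illustrative):

```c
/* embed.c -- build with something like: gcc embed.c -lpython2.4 */
#include <Python.h>

int main(void) {
    Py_Initialize();   /* start the embedded interpreter */
    PyRun_SimpleString("print 'hello from embedded Python'");
    Py_Finalize();     /* shut it down again */
    return 0;
}
```

The catch: the Python headers rename the Unicode API behind your back, so a call to PyUnicode_FromUnicode compiles into a reference to either PyUnicodeUCS2_FromUnicode or PyUnicodeUCS4_FromUnicode depending on how the copy of Python you built against was configured, and the resulting binary only loads against a libpython with the matching Unicode width.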

Apps that can be extended using Python (apps that link to the Python interpreter) must have that support compiled out to be distributable using an autopackage. This is not a good thing, as many applications have Python support.

It may be possible to hack around this by developing compatibility shims, but nobody has shown an interest in doing so currently.

"Recommendations": * Include Python in the LSB and force a particular set of exported APIs * Application developers should avoid Python if they wish their apps to be easily distributed across Linux distributions for now


# Exception Handling

Exception handling "optimizations" probably save a few hundred cycles; given that they can break binary distribution in some cases, they weren't worth it.

Recommendations:


# Window Management

Not strictly binary compatibility related, but differences between WMs and theme engines can break apps in some cases. Billy Biggs of the Eclipse project provides this list: [WWW] http://vektor.ca/osdl-meeting3.txt


# C++

C++ has serious issues, of course. Not all of them are of the form "can't link A to B when A imports a C++ API from B and they're built with different compilers", which hits any Qt/KDE app. Even programs written in C can crash thanks to the interaction between C++ and ELF.

The GCC developers' attempts at versioning these ABI changes in parallel to reduce the pain have failed: the -fabi-version switch has completely defied our attempts at making it work, and the GCC developers themselves admit that it's probably not accurate anyway. The libstdc++ symbol versioning turned out to be a total waste of time, but we discovered this only after autopackage 1.0 was released with preliminary C++ support.

The problem is that the symbol versioning as applied doesn't version all the symbols: parts of the STL are inlined into applications, effectively placing STL symbols into each application. The same symbols are also present in libstdc++ itself, meaning that loading two versions of libstdc++ into the same process crashes and burns - which is exactly what the versioning was meant to avoid. The only conclusion possible is that this feature was implemented, documented and advertised without ever being tested on a real world application.

This is GCC bug 21405 ([WWW] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21405) and is a variant of the ELF scoping problems; more on that below.

There is now a solution in the works for this, but nobody answered mails on the GCC list asking for more information on how the new scheme works.

This problem manifests itself as apparently random heap or stack corruption, usually triggering crashes either immediately on startup or some time during execution of the program. Observed examples include:

and just for fun, a non C++ app:

In summary:

C++ is currently unsupportable on Linux; that is, I wouldn't want to deal with the tech support issues it causes. The failure to allow even unrelated C++ objects compiled with different compilers to exist in the same process can trigger fatal memory corruption in any application, at any time. ISVs cannot control or predict when this will happen even if they don't use C++, as many common libraries support plugins and dynamically load code in the background, which may - depending on the user's configuration - result in the failure case occurring. When it does, the problem is nearly impossible to debug, as heap arena corruption usually manifests itself as a crash some time after it actually occurred, making backtraces and other common debugging aids useless.

Recommendations:


# ELF

ELF is, to put it mildly, "not excellent".

To be fair, it has a few redeeming features. It's portable. It offers a decent implementation of PIC code. It's extensible. When you look at the hash Microsoft and Apple made of PE and Mach-O respectively, we have to be grateful for small mercies.

Unfortunately ELF suffers from many other flaws: it's incredibly complex, lacking in features, has pathological performance problems and, worst of all, goes to great lengths to provide the same semantics that static linking provided.

Developers usually visualise dynamic linking as a tree. You have the executable at the top, which loads the libraries it needs, and they load the libraries they need, recursively. You end up with a tree (really a graph, but showing it in the GUI as a tree is traditional) of all the libraries a program needs to run. Each object is connected on the fly, by the dynamic linker, to the objects it links against.

Oh. Wait a moment. That's what you'd expect to happen. In reality, the ELF designers decided that this "tree" thing was a bit complicated for the UNIX developers of yesteryear, so instead they decided that a symbol would be linked against .... whatever happened to be loaded first. Instead of a tree of libraries with each node connected to its children, objects can quite literally be linked against any random library that was floating around at the time - just like static linking.

Now all this was a great wheeze back in the day when a program might use three or four shared libraries and "versioning" was something other people would have to deal with. Unfortunately in a world where apps regularly use 40 or 50 libraries to do their job, many of which have evolved over the years into several incompatible versions, this is a recipe for total chaos.

One very common way it manifests itself is when two unrelated parts of a program link against two different versions of the same library. For instance, libpng has this problem, as image loading is a pretty common thing for people to want to do.
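Thus you can end up with a dependency graph like this (a reconstruction of the scenario; the library names are the ones used in the next paragraphs):

```
yourapp
 |-- gdk-pixbuf   (linked against libpng.so.2)
 `-- libWhatever  (linked against libpng.so.3)
```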

In a sensible linkage model, when libWhatever called png_info_init() it would jump to libpng.so.3 - after all, that's what its headers say it needs. And when gdk-pixbuf uses the same function, it'd call into libpng.so.2, which may have a different function prototype, semantics, struct sizes etc.

Unfortunately ELF flies in the face of common sense here. Both gdk-pixbuf and libWhatever will be bound to libpng.so.2, and libpng.so.3 will not be used. If libWhatever uses a new API introduced in v3 of the library, that API will have a new name, so it'll be OK and call into libpng.so.3 - but libpng probably calls its own functions at some point, and THOSE will be cross-linked against v2. So you have a total cross-wiring of libraries and library internals, which almost inevitably leads to data corruption and a crash.
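Here is a minimal sketch of the underlying rule - every reference resolves to the first definition in load order, even a library's references to its own exported functions (all file and function names are invented for the demonstration):

```c
/* ---- libone.c: gcc -shared -fPIC libone.c -o libone.so ---- */
const char *which(void) { return "libone"; }

/* ---- libtwo.c: gcc -shared -fPIC libtwo.c -o libtwo.so ---- */
#include <stdio.h>
const char *which(void) { return "libtwo"; }
void describe(void) { printf("libtwo thinks which() is %s\n", which()); }

/* ---- main.c: gcc main.c -L. -lone -ltwo -Wl,-rpath,. ---- */
void describe(void);
int main(void) { describe(); return 0; }

/* Running ./a.out prints "libtwo thinks which() is libone": even
 * libtwo's call to its *own* which() is preempted by the definition
 * that was loaded first, exactly like libpng's internal calls above. */
```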

This is the underlying cause of the libstdc++.so mixing issues discussed above.

Note that the dependencies of a library are theoretically an implementation detail, and as such, subject to change at any time. Therefore, you cannot defend yourself against this type of problem. Because of this IMHO ELF is not enterprise-supportable in its native form - a minor change in one part of the system can cause unrelated areas to corrupt data.

Recommendations:


# Weak Linking

Continuing in the same vein, ELF doesn't support DT_USEFUL or any equivalent. In other words, there's no toolchain support for saying "I can run without this library, but if it's there I'd like to use it". You can do that - sort of - on a symbol level using weak symbols, but weak symbols aren't very well documented and they seem primarily meant for the internals of the C++ ABI.

I wrote a program called relaytool to add this ability to ELF. It's handy, especially for programs like Gaim, which often fail to install due to a missing gtkspell. Obviously, spelling checkers are nice to have, but if it comes to the crunch Gaim should still run without one. Many people don't even try to spell things correctly on IM anyway ;)

Relaytool is implemented using dlopen and dlsym under the hood, but it lets you write code in the natural way, using the standard header files and such. You don't have to define function pointers. You can also use it to solve the problem of "I am compatible with .so.4 and .so.5, even though they are theoretically incompatible, because I use a part of the API that didn't change"
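Under the hood it amounts to the classic pattern below; relaytool just generates this boilerplate for you (the gtkspell soname and the simplified prototype here are assumptions, not the exact gtkspell API):

```c
/* optional_spell.c -- build with: gcc optional_spell.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    /* Load the library at runtime instead of listing it in DT_NEEDED,
     * so a missing gtkspell is not fatal. */
    void *handle = dlopen("libgtkspell.so.0", RTLD_NOW);
    if (!handle) {
        puts("gtkspell not found: running without spell checking");
        return 0;
    }

    /* Look up the entry point by hand -- this is what relaytool saves
     * you from writing. The prototype is a simplified stand-in. */
    void *(*attach)(void *, const char *, void *) =
        (void *(*)(void *, const char *, void *))
            dlsym(handle, "gtkspell_new_attach");
    if (attach)
        puts("gtkspell found: spell checking enabled");

    dlclose(handle);
    return 0;
}
```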

Debian gets in the way here. Attempts at getting relaytool or dlopen support for gtkspell into Gaim were rejected because the Debian packagers said it would defeat Debian's automatic dependency scanning. Correct. However:

Recommendations:


# glibc

The GNU C library makes it extremely awkward to link against older versions of itself, as it uses a GNU-proprietary symbol versioning scheme. It's possible to override this using injected GAS pseudo-ops - indeed, this is what gave birth to the apbuild tool - but:

The GNU symbol versioning scheme is of dubious correctness anyway, because it assumes developers never re-compile their software, which is obviously wrong (unless a project is dead). There's no way, except for this badly documented and barely-supported assembler trick, to choose which symbol versions your app gets linked against; the compile-time linker always chooses the latest one.
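For reference, here is the trick in question - a sketch assuming an i386 glibc where realpath carries both a GLIBC_2.0 and a GLIBC_2.3 version (check the tags on your system with objdump -T /lib/libc.so.6 | grep realpath):

```c
/* oldsym.c -- pin our reference to the old incarnation of realpath. */
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* The injected GAS pseudo-op; without it, the compile-time linker
 * silently binds us to the newest version of the symbol. */
__asm__(".symver realpath, realpath@GLIBC_2.0");

int main(void) {
    char resolved[PATH_MAX];
    if (realpath(".", resolved))
        printf("%s\n", resolved);
    return 0;
}
```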

That leads to the following problem. A developer writes AwesomeApp, which uses the API foo(). The glibc developers make a change in foo() which might break backwards compatibility, e.g. by making the call stricter about input validation. Old AwesomeApp binaries continue to work correctly, as intended. However, in the meantime the developer upgrades his own computer and writes a new version, AwesomeApp 2005, which he then compiles and distributes to the masses. What happens when code he never touched for the new version suddenly stops working, for no clear reason, because foo() is now silently failing?

There's a right way to do API versioning, and this isn't it. A better way would be to modify the glibc headers such that developers must opt in to new functionality and symbol versions. For instance,

    #define _GLIBC_VERSION 2005
    #include <foo.h>

would mean your app is linked against the new version of foo(), but if this macro is defined to be less than 2005 you get the old version - always. Each time a new major version of glibc was released, a single document would be posted on their web page covering the interesting changes for app developers. As time goes on, developers can upgrade their software at their own pace, taking into account the possibly breaking changes.
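The header side of such a scheme might look something like this - entirely hypothetical, since glibc implements nothing of the sort, but it needs no machinery beyond the .symver mechanism shown earlier:

```c
/* foo.h (hypothetical) */
extern int foo(const char *input);

#if defined(_GLIBC_VERSION) && _GLIBC_VERSION >= 2005
__asm__(".symver foo, foo@GLIBC_2005");  /* new, stricter foo() */
#else
__asm__(".symver foo, foo@GLIBC_2000");  /* original behaviour, forever */
#endif
```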

Recommendations:


# Headers

Headers are a frustrating problem. Some projects, notably glibc and GTK+, like to silently modify the dependencies of your application behind your back. This is usually done by using the macro preprocessor to rewrite your code, e.g. by redefining existing macros, or by changing functions into macros that call other functions.
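A sketch of the pattern, reusing the hypothetical FooLib that appears below (every name in it is invented):

```c
/* foolib.h, as shipped in FooLib 2.8 (all names invented). */
#define FOO_TITLE_DEFAULTS 0
int foo_set_title_full(void *widget, const char *title, int flags);

/* The app author writes a call to foo_set_title(), an API that existed
 * back in FooLib 2.2, but this macro silently rewrites it into a
 * reference to foo_set_title_full() -- a symbol only present in
 * FooLib >= 2.8. The binary then refuses to load on a FooLib 2.2
 * system even though the programmer never knowingly used a 2.8 API. */
#define foo_set_title(widget, title) \
    foo_set_title_full((widget), (title), FOO_TITLE_DEFAULTS)
```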

This works OK in a hypothetical universe where a developer compiling software against FooLib 2.8 means they must have installed it from their distro, therefore it must be available to everyone, therefore it's OK to helpfully "upgrade" the app.

In the real world, users don't upgrade their operating system for entertainment every few months. A disturbingly high number of users around the net are still running Red Hat 9 with no security updates. In the real world we have multiple distros, and just because FooLib 2.8 is available on Ubuntu doesn't mean it's available on Fedora. In the real world we expect that if we avoid APIs introduced after FooLib 2.2 was released, our app will work on a system that only has FooLib 2.2 - yet Linux defies this basic logic.

Concrete examples of this problem:

Very few developers realise this is a problem.

A variant of this problem, not related to headers, occurs when statically linking bindings such as gtkmm for C++ apps: even if you only depend on GTK+ 2.4 features, the 2.8 bindings will introduce silent dependencies on 2.8, so you need to ensure you always use the bindings matching the version of the library you depend on. I don't think that can be fixed easily and, besides, it's not so hard to use the right bindings version if you statically link them.

Recommendations:

In autopackage we somewhat automate the second recommendation, but obviously you can't get them all ....


# Usual crap

Files are in different places. There is inconsistent support for installing software to your home directory, and little to no support for software installed outside of /usr. Libraries are out of date - or, alternatively, bleeding edge with backported patches.


# Libraries to avoid

Libraries to avoid (is this list still up to date?):

last edited 2006-04-28 13:18:03 by MikeHearn