Linux Problems - Autopackage Wiki

This page was prepared for the OSDL meeting in December 2005. It describes many of the problems inherent to Linux we've encountered whilst distributing complex software in binary form to end users. It also offers a few suggestions for improvements.

This page is probably the most comprehensive document on Linux binary portability out there right now.

NB: This page is only about BINARY COMPATIBILITY

Unused dependencies
Python
Exception handling
Window Management
C++
ELF
Weak Linking
glibc
Headers
Usual Crap
Libraries to avoid

# Unused deps

We need to automatically strip unused DT_NEEDED entries, as many foo-config scripts, pkg-config files and libtool versions add unnecessary -l options to the link line. These make binaries far more brittle than need be by increasing their exposure to library instability.

Some systems have a broken libpng that does not link against libz and libm. This interferes with automatic dependency stripping.

Recommendations:

Invert logic in GNU ld, making --as-needed the default. Software that relies upon ELF being weird (see below) can use the --no-as-needed switch to get back the old behaviour. This is binary compatible as it's a build time change controlled by developers. If well publicised, it should cause minimal developer pain, and greatly improve the robustness of distributions internally.

We already fix this in autopackage using apbuild.

# Python

Python is unfortunately problematic for third party developers who wish to:

* ...integrate the Python interpreter into their application.
* ...develop Python modules (written in C/C++).

Pure Python apps are OK but experience indicates most desktop Python apps aren't "pure", that is, they use C modules included in the source tree.

The libpython ABI is very unstable (every minor release changes it), and it also varies between distributions, because the Unicode ABI changes according to how the configure script was run. Python upstream uses UCS2 but Red Hat uses UCS4:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=173990

Apps that can be extended using Python (apps that link to the Python interpreter) must have that support compiled out to be distributable using an autopackage. This is not a good thing, as many applications have Python support.

It may be possible to hack around this by developing compatibility shims, but nobody has shown an interest in doing so currently.

Recommendations:

* Include Python in the LSB and force a particular set of exported APIs
* Application developers should avoid Python if they wish their apps to be easily distributed across Linux distributions for now

# Exception Handling

Exception handling "optimizations" are another source of breakage. They probably save a few hundred cycles; given that they can break binary distribution in some cases, that wasn't worth it:

http://autopackage.org/docs/devguide/ch07s05.html

Recommendations:

Ignore it - most users have upgraded by now. Be warned: this sort of pain will happen again in future unless the underlying culture of the Linux community changes.

# Window Management

Not strictly binary compatibility related, but differences between WMs and theme engines can break apps in some cases. Billy Biggs of the Eclipse project writes:

the "splash screen" type hint isn't useful because of the differences in its implementation across WMs, neither is the "tool" type hint
it's hard to predict WM trim -- Tom Fitzsimmons did a nice system for metacity but kwin doesn't yet support it
metacity has too many bugs with its focus-stealing prevention code and nobody is spending the time to fix these (mike: this code is also not really backwards compatible)
metacity doesn't really support multiple levels of modal dialogs
kwin centers dialogs on the parent and you can't stop it
GTK+ window focus messages are unreliable, firefox also has problems here, some of this is fixed post GTK+ 2.6.8

He provides this list: http://vektor.ca/osdl-meeting3.txt

# C++

C++ has serious issues, of course. Not all are of the form "can't link A to B when A imports a C++ API from B and they're built with different compilers" which hits any Qt/KDE app. Even programs written in C can crash thanks to the interaction between C++ and ELF.

The GCC developers' attempts at parallel versioning these ABI changes to reduce the pain failed: the -fabi-version switch has completely defied our attempts at making it work, and the GCC developers themselves admit that it's probably not accurate anyway. The libstdc++ symbol versioning turned out to be a total waste of time, but we discovered this only after autopackage 1.0 was released with preliminary C++ support. The problem is that the symbol versioning as applied doesn't version all the symbols: parts of the STL are inlined into applications, effectively placing STL symbols into each application. These are also placed into libstdc++ itself, meaning that loading two versions of libstdc++ into the same process crashes and burns, which is exactly what the versioning was meant to avoid. The only conclusion possible is that this feature was implemented, documented and advertised without ever being tested on a real world application.

This is GCC bug 21405 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21405) and is a variant of the ELF scoping problems, more on that below.

There is now a solution in the works for this, but nobody answered mails on the GCC list asking for more information on how the new scheme works.

This problem manifests itself as apparently random heap or stack corruption, usually triggering crashes either immediately on startup or some time during execution of the program. Observed examples include:

Game written in C++ (libstdc++.so.5) uses SDL (written in C) loads libarts written in C++ (libstdc++.so.6) as an audio plugin -> crash
Inkscape written in C++ (libstdc++.so.6) uses libgtkspell (written in C) uses enchant (written in C) uses aspell written in C++ (libstdc++.so.5) -> crash
Mozilla.org official builds of Firefox written in C++ (libstdc++.so.5) uses GTK+ (written in C) loads the Simple Chinese Input Method written in C++ (libstdc++.so.6) -> crash

and just for fun, a non C++ app:

FooApp written in C loads gtkspell written in C loads aspell written in C++ and also loads a C++ library that somehow wasn't upgraded in lockstep with the rest of the user's system or was shipped privately -> crash

In summary:

C++ is currently unsupportable on Linux, that is, I wouldn't want to deal with tech support issues due to it. The failure to allow for even unrelated C++ objects compiled with different compilers to exist in the same process can trigger fatal memory corruption in any application, at any time. ISVs cannot control or predict when this will happen, even if they don't use C++ themselves, as many common libraries support plugins and dynamically load code in the background, which may, depending on the user's configuration, result in the failure case occurring. When it does, the problem is nearly impossible to debug, as heap arena corruption usually manifests itself as a crash some time after it actually occurred, making backtraces and other common debugging aids useless.

Recommendations:

Use autopackage 1.2 (when it's out). The apbuild tool works together with the autopackage APIs to ensure the user always has a binary of the correct ABI (by dual-compiling every binary with two versions of the compiler at build time).
Fix ELF

# ELF

ELF is, to put it mildly, "not excellent".

To be fair, it has a few redeeming features. It's portable. It offers a decent implementation of position-independent code (PIC). It's extensible. When you look at the hash Microsoft and Apple made of PE and Mach-O respectively, you have to be grateful for small mercies.

Unfortunately ELF suffers from many other flaws: it's incredibly complex, lacking in features, has pathological performance problems and worst of all goes to great lengths to provide the same semantics static linking provided.

Developers usually visualise dynamic linking as a tree. You have the executable at the top, which loads the libraries it needs, and they load the libraries they need recursively. You end up with a tree (really a graph but showing it in the GUI as a tree is traditional) of all the libraries a program needs to run. Each object is connected to the objects it links against on the fly by the dynamic linker.

Oh. Wait a moment. That's what you'd expect to happen. In reality, the ELF designers decided that this "tree" thing was a bit complicated for the UNIX developers of yesteryear, so instead they decided that a symbol would be linked against .... whatever happened to be loaded first. Instead of a tree of libraries with each node being connected to its children, they could quite literally be linked against any random library that was floating around at the time - just like static linking.

Now all this was a great wheeze back in the day when a program might use three or four shared libraries and "versioning" was something other people would have to deal with. Unfortunately in a world where apps regularly use 40 or 50 libraries to do their job, many of which have evolved over the years into several incompatible versions, this is a recipe for total chaos.

One very common way it manifests itself is two unrelated parts of a program link against two different versions of the same library. For instance, libpng has this problem as image loading is a pretty common thing for people to want to do. Thus:

AwesomeApp
- GTK+
  - gdk-pixbuf
    - libpng.so.2
- libWhatever
  - libpng.so.3

In a sensible linkage model, when libWhatever called png_info_init() it would jump to libpng.so.3 - after all, that's what its headers say it needs. And when gdk-pixbuf uses the same function, it'd call into libpng.so.2, which may have a different function prototype, semantics, struct sizes etc.

Unfortunately ELF flies in the face of common sense here. Both gdk-pixbuf and libWhatever will be bound to whichever libpng comes first in the dynamic linker's search order; say that is libpng.so.2, in which case libpng.so.3's copy of png_info_init() is never called. If libWhatever uses a new API introduced in v3 of the library then that'll have a new name, so it'll be OK and call into libpng.so.3, but unfortunately libpng probably uses its own functions at some point and THEY will be cross-linked against v2. So you have this total cross-wiring of libraries and library internals, which almost inevitably leads to data corruption and a crash.

This is the underlying cause of the libstdc++.so mixing issues discussed above.
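The libpng scenario can be made concrete with a toy model in C. This is only a sketch of the lookup rule, not how the real dynamic linker is implemented: the struct, the function names and the symbol png_new_api are made up for illustration, and the search order (libpng.so.2 first) is assumed to match the scenario described above.

```c
#include <stddef.h>
#include <string.h>

/* Toy model: a "library" is a name plus the symbols it defines. */
struct lib {
    const char *name;
    const char *symbols[4];   /* NULL-terminated list of defined symbols */
};

/* png_new_api is a hypothetical symbol standing in for "an API added in v3". */
static const struct lib png2 = { "libpng.so.2", { "png_info_init", NULL } };
static const struct lib png3 = { "libpng.so.3", { "png_info_init", "png_new_api", NULL } };

/* Assumed process-wide search order: libpng.so.2 happens to come first. */
static const struct lib *const global_scope[] = { &png2, &png3 };

/* What libWhatever actually declared as its dependency. */
static const struct lib *const libwhatever_deps[] = { &png3 };

/* Returns the name of the first library in the list that defines sym. */
static const char *first_definer(const struct lib *const *list, size_t n,
                                 const char *sym)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < 4 && list[i]->symbols[j] != NULL; j++)
            if (strcmp(list[i]->symbols[j], sym) == 0)
                return list[i]->name;
    return NULL;
}

/* ELF-style lookup: one flat scope, no matter which library is asking. */
const char *flat_lookup(const char *sym)
{
    return first_definer(global_scope, 2, sym);
}

/* "Tree" lookup: only the caller's own dependencies are searched. */
const char *direct_lookup_from_libwhatever(const char *sym)
{
    return first_definer(libwhatever_deps, 1, sym);
}
```

In this model flat_lookup("png_info_init") names libpng.so.2 even when the caller is libWhatever, while flat_lookup("png_new_api") names libpng.so.3, which is exactly the cross-wiring described above. The direct lookup is the behaviour developers intuitively expect.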

Note that the dependencies of a library are theoretically an implementation detail, and as such, subject to change at any time. Therefore, you cannot defend yourself against this type of problem. Because of this IMHO ELF is not enterprise-supportable in its native form - a minor change in one part of the system can cause unrelated areas to corrupt data.

Recommendations:

Require distributions to ship glibc with Michael Meeks' -Bdirect patches, and then require -Bdirect to be used by default for all libraries. This doesn't really fix the problem, but does move it around a bit.
Widespread use of symbol hiding and -Bsymbolic would help.
Alternatively, patch the dynamic linker to do the Right Thing by default. This modifies the semantics of ELF and breaks with the standard, so it'd need to be some kind of extension.
Applications that require robustness can ship their own dynamic linker patched as above.
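As a rough illustration of the symbol hiding recommendation, a library can mark its internal helpers as hidden using GCC's visibility attribute. The library name, macros and functions below are hypothetical; this is a minimal sketch that assumes GCC or a compatible compiler.

```c
/* Hypothetical library "libwhatever": only whatever_verify() is meant
 * to be part of the public ABI. */
#define WHATEVER_API   __attribute__((visibility("default")))
#define WHATEVER_LOCAL __attribute__((visibility("hidden")))

/* Internal helper. Hidden symbols are not exported from the shared
 * object, so no other library can override this one, and this one
 * cannot clash with a same-named symbol elsewhere in the process. */
WHATEVER_LOCAL int whatever_checksum(const unsigned char *data, unsigned long len)
{
    int sum = 0;
    for (unsigned long i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

/* Public entry point. Its call to whatever_checksum() is bound inside
 * the library at link time rather than looked up globally at run time. */
WHATEVER_API int whatever_verify(const unsigned char *data, unsigned long len,
                                 int expected)
{
    return whatever_checksum(data, len) == expected;
}
```

Building with -fvisibility=hidden makes hidden the default, so that only functions explicitly marked WHATEVER_API are exported. This shrinks the set of symbols that can be cross-wired, but does nothing for the public symbols themselves.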

# Weak Linking

Continuing in the same vein, ELF doesn't support DT_USEFUL or any equivalent. In other words, there's no toolchain support for saying "I can run without this library, but if it's there I'd like to use it". You can do that - sort of - on a symbol level using weak symbols, but weak symbols aren't very well documented and they seem primarily meant for the internals of the C++ ABI.
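For reference, the weak symbol approach looks roughly like this. It is a sketch assuming GCC's weak attribute; spell_check_word is a hypothetical function standing in for something an optional library would provide.

```c
#include <stddef.h>

/* Hypothetical function from an optional library. Declaring it weak
 * means a missing definition is not a link error; the symbol's
 * address is simply null at run time. */
extern int spell_check_word(const char *word) __attribute__((weak));

/* Returns the library's answer, or -1 if the function is unavailable. */
int check_word_if_possible(const char *word)
{
    if (spell_check_word == NULL)
        return -1;               /* nothing defined it: carry on without */
    return spell_check_word(word);
}
```

The "sort of" matters: if the binary was linked with -l against the optional library, it still carries a DT_NEEDED entry for it and will fail to start when the library is missing. Weak symbols make individual functions optional, not the library itself.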

I wrote a program called relaytool to add this ability to ELF. It's handy, especially for programs like Gaim which often have failed installs due to a missing gtkspell. Obviously, spelling checkers are nice to have, but if it comes to the crunch Gaim should still run without one. Many people don't even try to spell things correctly on IM anyway.

Relaytool is implemented using dlopen and dlsym under the hood, but it lets you write code in the natural way, using the standard header files and such. You don't have to define function pointers. You can also use it to solve the problem of "I am compatible with .so.4 and .so.5, even though they are theoretically incompatible, because I use a part of the API that didn't change".

Debian gets in the way here. Attempts at getting relaytool or dlopen support for gtkspell into Gaim were rejected because the Debian packagers said it would defeat Debian's automatic dependency scanning. Correct. However:
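To show what relaytool saves you from writing, here is a hand-rolled version of the same idea using dlopen and dlsym directly. This is not relaytool's generated code, just a sketch of the manual approach; the struct and function names are hypothetical, and on older glibc versions the program must be linked with -ldl.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Signature we expect the optional function to have (an assumption). */
typedef int (*check_fn)(const char *word);

struct optional_speller {
    void    *handle;   /* dlopen handle, or NULL if unavailable */
    check_fn check;    /* resolved function, or NULL if unavailable */
};

/* Tries to load soname and resolve symbol from it.
 * Returns 1 on success, 0 if the library or the symbol is missing. */
int speller_load(struct optional_speller *s, const char *soname,
                 const char *symbol)
{
    s->handle = NULL;
    s->check = NULL;

    void *h = dlopen(soname, RTLD_NOW | RTLD_LOCAL);
    if (h == NULL)
        return 0;                 /* library not installed: not fatal */

    void *sym = dlsym(h, symbol);
    if (sym == NULL) {
        dlclose(h);
        return 0;                 /* library present but lacks the symbol */
    }

    s->handle = h;
    s->check = (check_fn)sym;
    return 1;
}

/* Calls through the pointer if it was resolved, else returns -1. */
int speller_check(const struct optional_speller *s, const char *word)
{
    return s->check != NULL ? s->check(word) : -1;
}
```

Every optional function needs its own typedef, pointer and null check when done by hand, which is the boilerplate the paragraph above says relaytool lets you avoid. Trying several sonames in turn with speller_load() is one way to handle the ".so.4 or .so.5" case.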

Who cares? You can add dependencies to the specfile manually in about 20 seconds.
You can't determine dependencies via binary analysis in the general case anyway, that's equivalent to solving the halting problem. So you may as well ask packagers to fill out the dependencies manually.
Why should the convenience of the Gaim Debian packagers, who number in the low 2s, outweigh the convenience of people who download autopackaged binaries from the website, who number in the high 200s/day?

Recommendations:

Add a GNU extension, DT_GNU_USEFUL or DT_GNU_MULTI, which gives abilities similar to relaytool's
Fix the header versioning problem so people can actually control their dependencies

# glibc

The GNU C library makes it extremely awkward to link against older versions of itself, as it uses a GNU-proprietary symbol versioning scheme. It's possible to override this using injected GAS pseudo-ops, and indeed this is what gave birth to the apbuild tool, but:

Nobody except us seems to know this technique. The LSB has a solution to do the same thing but it works by constructing fake "stub" libraries from relational databases. It's quite hard to separate from the other LSB policies like "you must statically link anything not in the LSB", making their tool worthless for open source projects who can't ship the entirety of GNOME/Python with their binaries. GAS pseudo-op injection is a lightweight and modular way to solve the same problem.
It doesn't have any native toolchain support. You need GCC wrapper programs to pull it off (which is what apbuild is).
The differences between versions of glibc symbols are not documented.
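The pseudo-op in question is presumably GAS's .symver directive. A minimal sketch of the technique follows; it is not what apbuild actually emits. The version names are an assumption that holds for x86-64 glibc, where memcpy is exported both as GLIBC_2.2.5 and under a newer default version, so the directive is guarded to keep the file compiling elsewhere.

```c
#include <string.h>

/* Assumption: on x86-64 glibc the old memcpy is versioned GLIBC_2.2.5.
 * Other architectures use different version names, hence the guard. */
#if defined(__x86_64__) && !defined(__ILP32__) && defined(__GLIBC__)
/* Ask the assembler to bind references to memcpy in this file to the
 * old version instead of whatever the linker considers the latest. */
__asm__(".symver memcpy, memcpy@GLIBC_2.2.5");
#endif

/* Ordinary code: nothing at the call site reveals the version pin. */
void copy_bytes(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}
```

A directive like this has to be visible in every translation unit that references the symbol, which is why a compiler wrapper that injects it automatically is so much more practical than doing it by hand.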

The GNU symbol versioning scheme is of dubious correctness anyway because it assumes developers never re-compile their software, which is obviously wrong (unless a project is dead). There's no way except for this badly documented and barely-supported assembler trick to choose which symbol versions your app gets linked against; the compile-time linker always chooses the latest one.

That leads to the following problem. Developer writes AwesomeApp which uses API foo(). The glibc developers make a change in foo() which might break backwards compatibility, eg by making the call stricter about input validation. Old AwesomeApp binaries continue to work correctly, as intended. However, in the meantime Developer upgrades his own computer and writes a new version, AwesomeApp 2005 which he then compiles and distributes to the masses. What happens when code he never touched for the new version suddenly stops working, for no clear reason, because foo() is now silently failing?

There's a right way to do API versioning, and this isn't it. A better way would be to modify the glibc headers such that developers must opt-in to new functionality and call versions. For instance,

    #define _GLIBC_VERSION 2005
    #include <foo.h>

would mean your app is linked against the new version of foo(), but if this macro is defined to be less than 2005 you get the old version - always. Each time a new major version of glibc was released, a single document would be posted on their web page covering the interesting changes for app developers. As time goes on, developers can upgrade their software at their own pace, taking into account the possibly breaking changes.
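A header implementing this opt-in scheme could look something like the sketch below. Everything here is hypothetical: FOO_API_VERSION stands in for the version macro, and foo_v2004() and foo_v2005() stand in for the old and new behaviour of foo(), with the stricter input validation from the example above.

```c
/* ---- contents of a hypothetical <foo.h> ---- */

/* If the application says nothing, it gets the old behaviour. */
#ifndef FOO_API_VERSION
#define FOO_API_VERSION 2004
#endif

/* The library keeps exporting both implementations under fixed names. */
int foo_v2004(int x) { return x < 0 ? 0 : x; }    /* lenient: clamps bad input */
int foo_v2005(int x) { return x < 0 ? -1 : x; }   /* strict: rejects bad input */

/* The header picks one based on what the application opted in to. */
#if FOO_API_VERSION >= 2005
#define foo(x) foo_v2005(x)
#else
#define foo(x) foo_v2004(x)
#endif

/* ---- application code, compiled without defining FOO_API_VERSION ---- */

int awesome_app_step(int input)
{
    return foo(input);   /* still resolves to foo_v2004 after an upgrade */
}
```

Recompiling on a newer system changes nothing until the developer defines FOO_API_VERSION as 2005 or higher before including the header, at which point they have presumably read the release notes.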

Recommendations:

Improve glibc documentation. Most of it is undocumented. Some APIs that are exported as public symbols in the headers have no documentation, for instance dlmopen, and the internals are poorly documented too. It turns out there IS some documentation of the source tree layout so a previous version of this page was wrong about that, but it's oriented towards people who wish to port glibc to other architectures rather than those who wish to work on glibc itself. The ELF dynamic linker has good code comments but is (apparently) undocumented at a high level. The differences between symbol versions aren't documented except in the changelogs.
Implement source-level symbol versioning as described above in addition to the binary level versioning. This can be done entirely by changing the headers, no patches to the linker are needed.

# Headers

Headers are a frustrating problem. Some projects, notably glibc and GTK+, like to silently modify the dependencies of your application behind your back. This is usually done by using the macro preprocessor to rewrite your code, eg by redefining existing macros, or changing functions to be macros that call other functions.

This works OK in a hypothetical universe where a developer compiling software against FooLib 2.8 means they must have installed it from their distro, therefore it must be available to everyone, therefore it's OK to helpfully "upgrade" the app.

In the real world, users don't upgrade their operating system for entertainment every few months. There are a disturbingly high number of users running around the net who are still using Red Hat 9 with no security updates. In the real world we have multiple distros and just because FooLib 2.8 is available on Ubuntu doesn't mean it's available on Fedora. In the real world we expect that if we avoid APIs introduced after FooLib 2.2 was released, our app will work on a system that only has FooLib 2.2 - yet Linux defies this basic logic.

Concrete examples of this problem:

glibc redefines the ctype APIs like isalpha(), isalnum() etc to use symbols named like __ctype_b_loc - this is part of the glibc "thread local locales" interface. But it's an extremely esoteric and non-portable feature that outside of web servers isn't that useful. Desktop apps only work in one locale at a time, and this "upgrading" causes users pain (eg, http://www.linuxquestions.org/questions/history/342881)
GTK+ likes to redefine macros, often for thread safety reasons

Very few developers realise this is a problem.

A variant of this problem, not related to headers, shows up when statically linking bindings like gtkmm into C++ apps. Even if you only depend on GTK+ 2.4 features, the 2.8 bindings will introduce silent dependencies on 2.8, so you need to ensure you always use the bindings matching the version of the library you depend on. I don't think that can be fixed easily and besides, it's not so hard to use the right bindings version if you static link them.

Recommendations:

Implement header versioning as described for glibc above, and only introduce silent dependencies on new symbols when a magic macro has been defined saying "OK, I want to depend on Foo 2.4 or higher now"
Application developers should ensure they compile against the right headers for their dependencies.

In autopackage we somewhat automate the 2nd recommendation, but obviously you can't get them all ....

# Usual crap

Files in different places. Inconsistent support for installing software to your home directory and/or little to no support for software installed outside of /usr. Out of date libraries, or alternatively, bleeding edge ones with backported patches.

# Libraries to avoid

Libraries to avoid (is this list still up to date?):

freetype - unstable API. cf http://lists.debian.org/debian-devel-announce/2005/11/msg00016.html
wxWidgets - C++/unstable ABI. Though we have heard that wxWidgets 2.6 has a stable ABI throughout the entire 2.6 series.
libPython - unstable C ABI
libtiff - libtiff has broken ABI compatibility as of version 3.6.1. Link "tiff".
lcms - LCMS has a versioning system that's hard to check. Link "lcms".