Universal Feed Parser 3.0 beta 18 is out. It uses libxml2 if available. Libxml2 uses libiconv to convert between a large number of character encodings, including Chinese, Japanese, and Korean encodings. The upshot of this is that the Universal Feed Parser just got a lot more universal.

Programming with libxml2 is like the thrilling embrace of an exotic stranger. It seems to have the potential to fulfill your wildest dreams, but there’s a nagging voice somewhere in your head warning you that you’re about to get screwed in the worst way.

Libxml2 is fast. I mean insanely fast. Nothing else even comes close. It is insanely fast and insanely compliant with all the specifications that it claims to support, and it is getting faster while gaining more features. So you just know that somewhere, someone is selling their soul to somebody, and you just hope it isn’t you.

Here is a true story:

I have a large and growing collection of unit tests for the Universal Feed Parser, which I run under a variety of platforms and conditions: Windows, OS X, Debian GNU/Linux, Python 2.1, Python 2.2, Python 2.3, Python with PyXML, Python with libxml2, Python with its built-in XML parser. The testing framework is already too complex. One of the things it now has to do is instantiate a local HTTP server in order to test interactions between the HTTP level and the XML level (such as character encoding). The HTTP server introspects into the test feed itself to pull out a list of custom HTTP headers to include in the response. It is hairier than I would like.
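To give a rough idea of what "introspecting into the test feed" can look like, here is a minimal sketch in modern Python 3 (not the Python 2.x versions listed above). It assumes a hypothetical convention in which each test feed starts with a comment containing lines of the form `Header: Name: value`. The regular expression and the function name `extract_headers` are illustrative assumptions, not the framework's actual code.

```python
import re

# Assumed convention (hypothetical): a test feed declares the HTTP headers
# it wants served with it on lines like "Header: Content-type: text/xml".
_HEADER_RE = re.compile(r"^Header:\s+([^:\s]+):\s*(.*?)\s*$", re.MULTILINE)


def extract_headers(feed_text):
    """Return a list of (name, value) pairs declared inside the feed."""
    return _HEADER_RE.findall(feed_text)


sample_feed = (
    "<!--\n"
    "Header:   Content-type: text/xml; charset=windows-1252\n"
    "Description: an illustrative test feed\n"
    "-->\n"
    "<rss version=\"2.0\"/>\n"
)

for name, value in extract_headers(sample_feed):
    print("%s: %s" % (name, value))
```

A test server built along these lines would send each extracted pair as a response header before sending the feed body.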

Earlier this evening, I integrated libxml2 support. When I ran the complete set of unit tests for beta 18 with libxml2 under Python 2.3 on Windows, one test failed. It was one of the ones that gets routed through the testing framework’s HTTP server, which makes it more difficult to test and dramatically increases the number of things that could have caused the problem other than the code being tested.

Take out libxml2, all tests pass. Put back libxml2, same test fails. Strange.

Turn on my _debug flag and re-run just that one test. It prints a bunch of stuff to stderr and then… passes. Turn off libxml2, and back on, twice, in disbelief. Passes. Turn off debugging and re-run just that one test; it still passes. The test passes consistently when run by itself.

OK. The test in question is somewhere around #1100, out of about 2000 tests. Run the first half of the tests. They pass, just as before. Run the second half of the tests. They all pass, including the one that used to fail. Run all the tests together; one fails. Turn off libxml2; all tests pass. Exotic.

Time to dig deeper. Where exactly is it failing? It doesn’t simply fail; it actually hangs for a while and then fails. All tests normally pass or fail in sub-second time; on my 4-year-old laptop, I can run 2000 tests in 23 seconds. Except this one, which hangs for 10 seconds and then fails. Hmm. 10 seconds. The feed parser specifies a 10-second network timeout. Increase the timeout value. Lo and behold, now it hangs for 20 seconds and then fails. My custom HTTP server is failing to respond. But only on one test. And only when I use libxml2, adds the nagging voice.
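The reasoning here is that if the hang always lasts exactly as long as the configured timeout, the client is waiting on a server that never answers. Here is a small sketch of that effect in modern Python 3, using a listening socket that accepts connections into its backlog but never sends a byte. The function name `time_until_timeout` and the short timeout values are assumptions chosen for illustration; the feed parser's own networking code is not shown here.

```python
import socket
import time


def time_until_timeout(timeout):
    """Connect to a silent local server and report how long recv() blocks."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    server.listen(1)                # connections queue up, nothing is sent
    client = socket.create_connection(server.getsockname(), timeout=timeout)
    started = time.monotonic()
    try:
        client.recv(1)              # blocks until data arrives or timeout
        timed_out = False
    except socket.timeout:
        timed_out = True
    elapsed = time.monotonic() - started
    client.close()
    server.close()
    return timed_out, elapsed


for seconds in (0.2, 0.4):
    timed_out, elapsed = time_until_timeout(seconds)
    print("timeout=%.1f timed_out=%s elapsed=%.2f" % (seconds, timed_out, elapsed))
```

Doubling the timeout roughly doubles the wait, which is the same signature as the 10-second and 20-second hangs described above.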

Keep in mind that my HTTP server doesn’t call libxml2. It just reads files from disk, does one or two regular expression matches, and prints out raw data. libxml2 comes in later, much later.

Time to turn on that _debug flag again. To avoid losing myself in an avalanche of crap, I add code to only set _debug = 1 if the filename matches the test in question. Re-run all tests with libxml2 off, as a control. All tests pass, and I get debugging information on one test. Excellent. Turn libxml2 on, re-run, and… all tests pass.

Double-check the code; nothing ever happens with _debug == 1 except print statements. The test fails under normal circumstances, but passes when you try to debug it. Lovely. Exotic. Strange.

I decide, guiltily, to remove the test in question and release beta 18 anyway. rm, cvs remove, cvs commit, hope no one notices. Hey, it’s beta. Remove the debugging code, re-run the complete test suite, and all tests pass… except the test that is now in the same position (#1100) as the one I just removed.

OK. This must be an obscure bug lurking in the testing framework. (But if so, why does libxml2 make a difference? asks the nagging voice.) The HTTP server is set up to serve a certain number of requests and self-terminate; maybe I have an off-by-1 error and it’s terminating before serving the last test. Or maybe it’s not initializing properly and is failing the first HTTP-related test. Poke through the list of HTTP-related tests; neither of these hypotheses is true. The test that used to pass when it was #1101, now fails when it is #1100. And it fails in exactly the same way that the previous test failed, before I deleted it (and to be clear, I felt very, very guilty about doing that). Somehow, somewhere, in the middle of several hundred requests, libxml2 is causing my HTTP server to fail exactly once.
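For readers who have not built one, a server that "serves a certain number of requests and self-terminates" can be sketched as follows in modern Python 3. This is an assumption-laden illustration, not the testing framework's real server: the `Handler` class, `serve_n`, and `run_demo` are hypothetical names, and the response body is a placeholder.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def serve_n(server, count):
    """Handle exactly `count` requests, then shut the server down."""
    for _ in range(count):
        server.handle_request()
    server.server_close()


def run_demo(count=3):
    server = HTTPServer(("127.0.0.1", 0), Handler)
    port = server.server_address[1]
    worker = threading.Thread(target=serve_n, args=(server, count))
    worker.start()
    bodies = []
    for _ in range(count):
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/")
        bodies.append(conn.getresponse().read())
        conn.close()
    worker.join()
    return bodies


print(run_demo(3))
```

With a design like this, an off-by-one error in `count` would show up as a failure on the very last HTTP test, and an initialization problem would show up on the very first one. A failure in the middle of the run fits neither pattern.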

Turn the _debug = 1 code back on and re-run. Comment out the debugging code and re-run. Suddenly everything passes. Double-check; yes, libxml2 is on. Yes, libxml2 is being used. No, I’m not hallucinating. Re-run again, everything passes. Reboot. Everything passes. What did I do differently? I hunt and scour. Finally I see it: I left a global _debug declaration uncommented. Comment it out; test fails. Uncomment; test passes.

Note that simply saying global _debug within a function should have no effect at all, if you don’t ever set _debug within that function. It’s like a threat that you never carry out. It’s just code flotsam. Yet there it is: comment it out, and the test fails. Declare it, and the test passes. Turn on libxml2 and it fails. Turn off libxml2 and it passes.

The moral of the story: libxml2 causes unrelated subsystems to fail, unless you threaten to debug it.

Postscript: I decided to release it anyway, with a caveat, but then I discovered an unrelated (and quite understandable) bug under Python 2.1. After fixing that and re-running the tests, I am now unable to reproduce the bug I just described. I have fiddled with the global _debug and _debug = 1 and turned libxml2 on and off several times in an attempt to bring back the maddeningly buggy behavior, but to no avail.

I swear that this happened, that I wasn’t just hallucinating. But I admit that I have no way of proving that; you’ll just have to take my word for it. Meanwhile, all my tests pass, so I’m going to stop coding for tonight and release this beast before something else happens. Use at your own risk, and beware of strangers.





© 2001–present Mark Pilgrim