Older blog entries for Bram (starting at number 43)

Joel on API Apology

Joel Spolsky has an article in which he states

All non-trivial abstractions, to some degree, are leaky.

This is overly dogmatic - for example, bignum classes are exactly the same regardless of the native integer multiplication. Ignoring that, this statement is essentially true, but rather inane and missing the point. Without abstractions, all our code would be completely interdependent and unmaintainable, and abstractions do a remarkable job of cleaning that up. It is a testament to the power of abstraction and how much we take it for granted that such a statement can be made at all, as if we always expected to be able to write large pieces of software in a maintainable manner.

There are really two separate statements here, (what does 'leaky' mean, anyway?) The first one is -

All APIs contain artifacts of what they're built on top of.

As I said before, this is overly dogmatic, but mostly true. The example given, however, is extremely curious. TCP requires quite a tangent to explain ('look how smart I am, I know about TCP') and is one of the most robust, useful, and widely deployed abstractions in existence. I've spent a huge amount of time using TCP in very novel ways and haven't set a socket option in my life. Furthermore, Joel cites TCP's failure mode as an artifact of IP, when it is in fact an artifact of wires and electricity, and an extremely well done and clean one at that. Whether the connection drops due to network outage or the counterparty machine going down, you still get a single well-defined failure, with clear semantics as to what might have arrived on the other end.

TCP actually has plenty of real artifacts, such as slow start and high latency during bulk transfers, but these would clarify the distinction between the statement about artifacts and the one Joel is really trying to prove, which is -

All APIs are Broken

As I explained, the TCP example doesn't prove this at all - if a failure mode is defined in the API, then that doesn't mean that the API breaks when it gets invoked. But this statement does apply to most of the other examples Joel goes on to cite - NFS is a mess, C++ strings aren't real strings, COM internals aren't properly hidden, ASP.NET doesn't have proper string mangling code to make GET parameters properly, etc.

You may notice that all the Microsoft tools are left to the end, right before the claim that the law of leaky abstractions is dragging us down. This essay isn't about API design, it's about software apology, an attempt to claim that all software is inherently bad hence it's okay that Microsoft's is bad. Like any good microsoftie, Joel starts with the assumption that the (very flawed) tools he uses are completely acceptable, and works backwards from there.

12 Nov 2002 (updated 12 Nov 2002 at 11:18 UTC) »
Protocol Stuff

The single sign-on protocol I gave earlier can be made secure when re-using challenge/response pairs by using HMAC instead of raw sha1 (HMAC is built using sha1). That greatly simplifies implementation on third party web sites and gets rid of a nasty denial of service attack.


I tried playing a few games of Bohemia and was a bit disappointed. There are a few specific things I'm going for with the rules -

  • The game should be even on a medium size board, ideally in the ten to twenty on a side range. The rules as I presented them lead to an about even game on a 5 by 5 board, way too small.

  • The beginning of the game should be very positional, gradually leading to a very tactical endgame. The rules I gave are a lot more like Renju, with lots of forcing lines.

  • There should be a complicated interplay between pieces of either color, with many double-edged moves. With all the forcing moves, the square winds up emphasizing one color and the bohemian mostly plays the other, and the bohemian's color's pieces don't generally make squares.

To try and fix those problems, I've come up with the following rule change - the square's new goal is to form a square whose corners are black on the lower left and upper right, and white on the upper left and lower right. This straightforwardly enforces a complicated interplay between both colored pieces, and since forming a clump of one color is very counterproductive, the square no longer has much benefit in playing lots of forcing moves. Also, the reduction in number of winning arrangements and greater complexity of having to play both colors probably makes the evenly matched board size larger.

Furthur experimentation is once again necessary.

If anyone can figure out a proof that with a sufficiently large board under this new rule the square can force a win, I'd like to hear it. I don't see one immediately.

tripp: Yes, the challenge/response pairs are for the purposes of (drastically) reducing the amount of communication between blog.example.com and signon.example.net

I avoided using public keys for several reasons. The biggest one is that if signon.example.net relied on knowing blog.example.com's public key, then if that public key were lost or compromised some very serious human intervention would be required at signon.example.net, while with the technique I gave blog.example.com would simply issue a new query to get a new list of challenge/response pairs. Getting a phone call from someone demanding that you manually change a key in a database, and having to somehow guess if they're who they claim they are or someone trying to commit fraud, is a problem you want to deal with never.

Also, the technique I gave is simpler. There's no key setup and management queries necessary at signon.example.net, and the only libraries blog.example.com needs are string mangling and sha1, which are already available on almost all web scripting systems. Also, any public key operations at signon.example.net can result in a significant amount of CPU load at not very high levels of scaling, doing the operations for https is bad enough.

I can't think of any disadvantages to the method I gave over using public key trickery - even the extra bandwidth it uses is paltry.


I figured out a cleaner proof of my earlier theorem about monochromatic squares, based on the Hales-Jewett theorem. I looked up the section on theorems proven using Hales-Jewett in a book I have and, sure enough, I re-invented what's known as Gallai's theorem.

The lower bound on lattice edge size is two to the power of Hales-Jewett with two colors and edge length four. This is still a huge value. At some point I'll get around to throwing blackbox at the problem to try to get a reasonable guess of the actual value.

single sign-on protocol

Raph mentioned that we discussed single sign-on protocols, and I came up with some good ideas. I'll summarize now.

There is a single monolithic site which users maintain their login information and identity on. This is very convenient from an end user perspective, since they only have to remember one password and maintain their information in one place. It's also very appealing for web sites, since they no longer have to manage any login information or do tech support for lost passwords and the like.

Say the site which does the signon is signon.com and the site using those logins is blog.com. When a user is doing something which may require login to blog.com, they'll get a hyperlink saying 'log in', which will link to signon.com. When the user clicks on that, they'll either be prompted to set up an account on signon.com, or if they already have one and have an active session, they'll get redirected back to blog.com, where they're now logged in.

That's the end user experience, now for the technical details.

blog.com makes an http or https query in advance to get a bunch of challenge/response pairs, which are generated dynamically by signon.com. A huge number of these can be stockpiled, so the query only has to be done rarely, the only real restriction on their use is that they can only be used once each. The login hyperlink goes to (or redirects to) a url on signon.com, and includes a challenge code and callback url as GET parameters. For example, it might be http://signon.com/signon?challenge=a63f34b34d&callback=http://blog.com/login.cgi signon.com then gets the users login from their cookie, verifies their identity (either by an active session cookie or making them type in their password again) and redirects back to the callback, giving it GET parameters of the user's login identity and sha1(login + response). For example, it might be http://blog.com/login.cgi?user=bram&verify=ssonuoocrdua blog.com can then easily check that verify is the correct value, and give the user a session cookie if so.

Raph suggested that for developer convenience it should be possible to send signon.com a nonce url which gets echoed back. I think this is more cleanly implemented by making signon.com notice if there are already GET parameters in the callback url, and preserving them if so. There is a special place in hell for whoever decided that the first GET parameter should have a different delimiter than the rest of them.

signon.com can avoid having to store all challenge/response pairs by keeping a secret symmetric key and generating challenges at random and setting responses to be the encryption of the challenge.

I'm very happy with this protocol. It's easy to implement on all sides, and has a smooth implementation path - web sites are immediately motivated to use it because it removes the burden of account management from them. Also, blog.com has complete control over whether all requests are done by http or https, allowing them to determine their level of security pragmatically and migrate over time. signon.com doesn't even have to configure anything new to support new sites - it's simply left running. Best of all, sites using signon.com can't compromise each other no matter how insecure they are.

The main problem with this whole scheme, of course, is that it's centralized, has a single point of failure, and generally isn't very cypherpunkly. That's explicitly the model Raph asked for, out of overwhelming engineering expediency, and I agree with that sentiment - this solution is far from perfect, but it's easy, and noone's demonstrated that decentralized single signon is workable.


Bohemia's board size problem can be fixed with an interesting version of the swap rule - one player decides the board size, and the other decides whether they want to be the square or the bohemian. Handicaps can be given with a spread on the board size. It's possible that an even chances board size is overly large or small, experimentation is necessary.

One dimensional Bohemia is possible as well, by having the square go for arithmetic progressions. This can be done with any number of colors and length of arithmetic progression. The small van der Waerden numbers indicate that length three and two or three colors might be a bit small, but more is probably very interesting. This has the distinction of being one of the few games on a one dimensional board which isn't completely lame.


BitTorrent (which had a new release today) has the necessary files to make debs, but isn't in debian. If any debian maintainer would like to volunteer, I'd very much appreciate it. Maintenance should be easy - the dependencies are straightforward and it's written in pure Python.

3 Nov 2002 (updated 4 Nov 2002 at 07:17 UTC) »
garym: Your description of the current state of the software industry is, unfortunately, quite accurate, and it won't get much better any time soon. The problem is that only programmers can really tell whether other programmers are any good, and we aren't going to be put in charge of the industry any time soon.

I've achieved a modest degree of success as an open source developer, but my advice for those considering it as a career is, only do it if you have to out of some immovable internal need. We are only just beginning to see the phenomenon of starving artist programmers, and it will only get more pronounced in the future, possibly even becoming standard for the industry.

I've seriously considered attempting a career as a screen writer. Although I have no experience with it, I have a good work ethic, natural talent, and possibly most important, a willingness to write high concept work with happy endings. Unfortunately those attributes don't even vaguely gaurantee a decent living in that industry, so I've shelved any such plans until I either have enough money to not be worried about failure or have come to hate the computer industry so much that I just want out.

My current long term career plan is to get good enough at financial stuff that I can start a hedge fund or set strategies for one. Unfortunately there's a lot of luck and who you know involved in that industry, but at least it consistently pays well and, unlike screenwriting, the industry has a desperate need for real talent.

For now, I'm doing okay working on networking software, which unlike any of the above I have oodles of experience in. My other career option which involves something I'm already good at is juggling, but I find that a less appealing career path than any of the above.

I haven't even dreamt of getting payed for all the random weirdness I like posting to my advogato diary. That's only possible with a tenured university position, and the more time passes the farther away I seem to be from academia.


Here are the rules to an abstract board game I came up with called Bohemia -

The two players are the square and the bohemian, the square moves first.

Players alternate playing either white or black pieces on an n by n grid. Either player may place either color piece. The bohemian is allowed to pass.

If four pieces of all the same color get placed such that they form the corners of a square with horizontal and vertical sides, the square player wins. If the board gets filled up without that happening, the bohemian wins.

I'm not sure what size board gives even chances to both players. That should be determined through experimentation.

A conjecture strongly related to this game is that for a sufficiently large board size the square must always win regardless of how the pieces are placed. This can be generalized to more than two colors. Also to lattices of side k, for which a square is a special example with k=2. This is a two-dimensional version of van der Waerden's theorem, which can of course be further generalized to higher numbers of dimensions.

I can't even prove this for two colors, two dimensions, and lattices of size two. It seems likely that such a straightforward problem in ramsey theory has already been studied though.

Update: I figured out the following very nice proof of the simplest case of my conjecture.

First, we will demonstrate that for any number of colors, a sufficiently large lattice will contain three pointts of the form (a, b), (a + c, b) and (a, b + c) which are all the same color. I'll call these formations Ls.

By van der Waerden's theorem, a sufficiently large 2-colored lattice will contain an arithmetic progression of length three somewhere along its bottom row. Let us say that those points are at (a, 0), (a + c, 0) and (a + 2*c, 0). The point (a, c) must be the opposite color to avoid forming an L with (a, 0) and (a + c, 0). Likewise the point (a + c, c) must be the opposite color from (a, c) because of (a + c, 0) and (a + 2*c, 0). Now we can show by contradiction that there most be some monochromatic L - the point (a, 2*c) must either form an L with (a, c) and (a + c, c) or with (a, 0) and (a + 2*c, 0). Therefore for a sufficiently large 2-colored lattice there exist points (a, b), (a + c, b) and (a, b + c) which are all the same color.

We will now demonstrate that this is true for any number of colors by induction. Say we have shown for k colors that a lattice of size f suffices. By van der Waerden's theorem, a sufficiently large lattice colored with k+1 colors will have a monochromatic arithmetic progression of length 2*f along its bottom row, say that this is (a + x * c, 0) for 0 <= x < 2*f. Now consider the lattice formed by (a + p * c, c * (q + 1)) for 0 <= p < f and 0 <= q < f. Either some node in this lattice will be the same color as our arithmetic progression on the bottom row, in which case it forms an L with that, or it contains only k colors and hence contains an L by induction.

Now we will use the above lemma to prove our theorem. Say the size of a 2-colored lattice so that it must contain an L is g. Now make a sufficiently large lattice that if it's colored with 2 ** (g ** 2) colors, there must be an L, and substitute a square lattice with side length g for each of these, with colors corresponding to 2-color assignments within the sub-lattices. Call the offset between the L of squares which all contain the same pattern d. The lattice of offset d in both directions from the lower left must contain the exact opposite of every corresponding color or by construction form a square, so we will consider the case where it's the opposite. Since it's of size g, there must be an L in the lower left one of these, whose coordinates we will call (a, b), (a + c, b) and (a, b + c). It is now straightforward to show that by construction the points (a, b), (a + d + c, b), (a, b + d + c) and (a + d + c, b + d + c) are all the same color, hence there must be a monochromatic square.

This proof is so straightforward that I'd be extremely surprised if it isn't already known, but it's pretty and non-trivial enough that I'm happy to have come up with it.

Unfortunately the lower bound it gives is a little bit too big to be practical.

chalst: The online sources say differently. Perhaps you're thinking of subgraph isomorphism?

ciphergoth: what's ISTR?

Linear Algebra as a Reasoning Technique

Consider a problem statement with some number of boolean variables, and restraints along the lines of 'out of this assignment, at least x must be true'. For example, in a problem with five variables, you might have a constraint stating 'of a, b, not c, not d, and e, at least three must be true'.

We can construct a linear relaxation of this problem by replacing boolean variables with numbers in the range [0, 1] and replacing their constrains with linear constraints, for example the above clause would change to a + b + (1 - c) + (1 - d) + e >= 3.

This allows us to do a very interesting trick. Pick a random assignment to all the variables, and calculate the minimum number of them which can be true (their sum) in the linear relaxation. If the values which generate that result round off to a satisfying assignment, we're done. If not, we can take the minimum value, say it's 2.5, and deduce that since each of the values is 0 or 1, the minimum must be the next higher integer, in this case 3. We can then express this as a new linear constraint and add it to the list of constraints. If we do this repeatedly, we may nail down the constraints enough to force the approximation we find to yield an actual solution or show that none exists.

I should emphasize that this is a reasoning method and not an approximation method. If this technique keeps adding constraints until there is no solution to the linear relaxation, that constitutes a formal proof that the original boolean problem has no solution. My motivation in coming up with this technique is that all exhaustive techniques for NP-complete problems I've heard of are backtracking algorithms, and I'm trying to find an exception to that rule.

If the minimum values given in the constraints are less than half the number of variables in them, then the linear relaxation can always be solved by setting all values to 1/2, and the round-up trick won't be able to get past that point. This is why this technique doesn't work for basic SAT problems. Yeah that's obvious, but it took me a few years to realize, and I haven't had access to a decent linear programming package since then, so I don't know for sure that this technique is effective on the more appropriate problems I've suggested, although it seems promising.

Hard instances of these problems can probably be generated simply by adding new random constraints until it gets to the point where there's a 50% chance of there being a solution, at which point both finding any solution and determining that there isn't one by backtracking will be hard.

23 Oct 2002 (updated 23 Oct 2002 at 08:50 UTC) »
baruch: To verify signatures you check that the pre-hashes hash to what they're supposed to, there's not much to it.

I didn't realize until after my earlier post how much hotornot resembles animal. There truly is nothing new under the sun.

Graph Isomorphism Hard. Yeah, Right.

Complexity theorists talk about graph isomorphism as if it's a hard problem, since there's no known polynomial time algorithm. I'm extremely skeptical, since although I can't prove a particular algorithm is polynomial, there are ones which seem extremely hard to fool.

Feeding my skepticism is that the existing packages don't give any instances of isomorphism which they claim to be hard. They give examples of isomorphic graphs which are tricky to find mappings for, but no non-isomorphic graphs which are supposedly hard to identify as such. Since polynomial algorithms for determining isomorphism can generally be turned into polynomial ones for finding a mapping, (by being used as a method of pruning an exhaustive search), this seems tantamount to an admission that there are no hard instances.

The obvious algorithms for checking graph isomorphism all amount to finding some value for each node, then sorting those values, and comparing the sorted lists. Methods for coming up with these values inclued the following (all of these assume there are no separate islands in the graph, the case where there are islands is easy to deal with) -

  1. Calculate the minimum number of hops to each other node, and sort that list.

  2. Calculate the fraction of the time you'd spend at the given node if you did a neverending random walk around the graph along edges.

  3. Assume that you'll drop all other nodes with some probability. Calculate the value for that probability which makes the expected number of remaining reachable nodes equal to half the total. This can be computed via sampling, which requires some randomness and uncertainty, but is still polynomial.

All of these, of course, can be used jointly. Very symmetrical graphs can sometimes look homogenous to the above, but that can be fixed by for each node calculating the values for all other nodes if that node were removed and sorting the results. This can of course be done for multiple iterations, although each iteration multiplies the runtime by the number of nodes. Different iteration levels can, of course, be used jointly.

Given the above extremely powerful isomorphism checking techniques, I find it dubious that there are any hard instances of graph isomorphism. It looks rather like finding an odd perfect number - we have no proof that it doesn't exist, but the known constraints on it are quite ludicrous.

Note that for all of the above I'm talking about the problem of determining whether a mapping exists, or finding a single one, not the problem of finding all mappings. This is the problem which the interesting zero knowledge proof applies to, I'm distinctly uninterested in automorphisms.

Digital Signatures

It's possible to make digital signatures using a secure hash as the only primitive. This is very reassuring, since the sheer number of moving parts in modular group based signatures is huge.

The basic trick is to publish n pairs of hashes, where n is the number of bits in the secure hash. To sign something, you publish the pre-hashes of the one of each pair, picking which one based on the bits of the secure hash to be signed.

This technique can only sign one value, but can be improved considerably.

A straightforward improvement is to publish n + log(n) pre-hashes, and sign using a subset of exactly half of those, which there are about 2^n of. This doesn't change the number of things which can be hashed any, but does make the hashes smaller.

More things can be signed by making the public key be the root of a hash tree. This is fairly computationally intensive on the part of the signer - the entire tree has to be generated up front, but it does allow multiple things to be signed, and doesn't increase the size of the signature much. A new restriction is that with each signature there is a number, whose value isn't significant other than that the same number can't be used twice the system becomes insecure.

Even more things can be signed by making the first value signed be another public key. This can be repeated for any number of iterations, and results in a total number of signatures which can be done equal to the number which can be signed at each iteration to the power of the number of iterations. The algorithm for computing the later public keys must be deterministic based on the private key, to keep different public keys from getting signed by a single intermediate key.

Put together, those tricks result in performance which is quite practical for many applications. A completely excessive safety margin results in 20 kilobyte signatures. I have a functioning implementation, which unfortunately has not to my knowledge been vetted by anyone, but that should be straightforward to do.

Unfortunately the best known public key encryption algorithms still are along the lines of merkle puzzles, and require bandwidth equal to the square root of the computational ability they're trying to defend against. Bloom filters can reduce the value by a constant factor of about three, but it's still very far from practical. It would be nice to have a complete crypto toolset which didn't use any of that modular exponentiation nobody really trusts.

21 Oct 2002 (updated 21 Oct 2002 at 07:15 UTC) »
Game Interfaces

I've been playing online games some over the last few days. A frustrating amount of time is spent trying to find games rather than playing them. A hotornot-style interface, in which the server guesses which people you'd like to play against and asks a yes/no as to whether you'd like to play them would be a vast improvement. There are several algorithms which could be used to determine selections and decide when a match happens. It's an interesting problem.

Also lacking are personalized thumbnail images like are used on livejournal and some chat sites. The more serious online communities tend to look down on images, due to the 'asl' banter flavor of them, which is unfortunate, since context greatly improves social interaction.

None of the online game sites I've used have an option for the players to replay a game and comment on it after it's done. This is one of the more social aspects of playing over the board gaming; it would be a nice addition.

Another missing feature is an integrated opening book which automatically displays your comments on the current position as play unfolds. Opening books are often viewed as cheating, but if everyone had them it would be fair, and rote memorization isn't a fundamental aspect of most games.

I'm interested in getting into hedge funds, so I'm reading up on finance. Currently I'm reading 'Options, Futures, and Other Derivatives' by John Hull. I haven't finished it yet, but I'm going to post a review anyway.

This is the classic book for beginning traders to learn the basics from. Its structure is what you would expect - very simple, clear prose, and comprehensive coverage of all the basics.

My problem with reading this book is that it keeps making me fall asleep. After a few pages I have to take a nap. After enough naps I can't sleep any more and go into a trance-like dissociated state with the words passing through my brain with no comprehension. All the formula derivation in this book is spelled out in painstaking detail, with whole sections devoted to formulas anyone with a decent mathematical background could re-derive in a few minutes. Also, key bits of math are dropped out, presumably because they're too difficult. According to the reviews on amazon those derivations are actually misrepresented and the formulas given are approximate, even assuming that the underlying model is perfect. I think all readers would be better served with a treatment which simply gave the formulas, without burying them at the end of tedious derivations which are either too hard or too boring for any given reader.

I'll post a review of another, much more promising book I'm reading in the future.

34 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!