Older blog entries for raph (starting at number 362)

chema

It's hard to believe that Chema Celorio has indeed passed away. We spent quite a bit of time together at the first Linux Printing Summit. I was really impressed by his passion, and have enjoyed working with him since. He will be sorely missed.

Multidimensional interpolation

The Argyll color management system uses a very sophisticated technique based on thin-plate splines for its profile creation. The core operation is to take the measured colors for a few hundred patches and fit them to a smooth multidimensional function, typically from R^4 to R^3 (CMYK tint values to a 3-dimensional colorspace such as Lab).

I'm seeing some wiggliness in the profiles I'm generating. I'm not sure why; it might be measurement error on my part. But I'm starting to suspect it may be the interpolation algorithm itself. I've read a recent survey, but it doesn't provide clear guidance on which technique provides the smoothest results.

My own personal inclination would be to create the interpolation function by solving the heat-diffusion equation with a source at each patch point, so as to fit the measurements. I'm convinced that this approach would give an optimum level of smoothness, but I see problems at the boundaries, even when using the free boundary assumption.

To simplify the discussion, assume that we have to extrapolate beyond the range of colors measured. Even in the simple case of two points, one at (0, 0) and another at (1, 1), the only solution I'd consider optimally smooth is the straight line f(x)=x. Yet, a straightforward solution of the heat diffusion equation with sources at 0 and 1 is going to have a slope of zero at the asymptotes.

I'm thinking of two possibilities: first, there might be a different boundary assumption that converges to a slope of 1 at both asymptotes. It's not too hard to imagine that somebody else might have come across such a thing; you'd need to minimize the global curvature (the magnitude of the second derivative) for the solution.
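
To make that concrete, here's a toy 1-D sketch (Python with numpy; it has nothing to do with the actual Argyll code): a cubic radial kernel plus a linear polynomial tail is the 1-D analogue of a thin-plate spline, and it is exactly the interpolant that minimizes the integrated squared second derivative. For the two-point example above, the kernel weights come out to zero and the fit collapses to the straight line f(x)=x, slope 1 at both asymptotes included.

import numpy as np

def fit_spline_1d(xs, ys):
    # 1-D polyharmonic (cubic) spline: f(x) = a0 + a1*x + sum_j w_j |x - x_j|^3,
    # with side conditions sum(w) = 0 and sum(w*x_j) = 0 pushing the far-field
    # behavior entirely into the linear term.
    n = len(xs)
    K = np.abs(xs[:, None] - xs[None, :]) ** 3   # radial kernel |r|^3
    P = np.column_stack([np.ones(n), xs])        # constant and linear terms
    A = np.block([[K, P], [P.T, np.zeros((2, 2))]])
    rhs = np.concatenate([ys, np.zeros(2)])
    coef = np.linalg.solve(A, rhs)
    w, a = coef[:n], coef[n:]
    def f(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        k = np.abs(x[:, None] - xs[None, :]) ** 3
        return k @ w + a[0] + a[1] * x
    return f

f = fit_spline_1d(np.array([0.0, 1.0]), np.array([0.0, 1.0]))
print(f([-2.0, 0.5, 3.0]))   # [-2.   0.5  3. ] -- the straight line, even far outside the data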

Another possibility is to solve the heat diffusion equation for the derivative of the desired solution. In the 1-D case, it's pretty easy to see how it would work. For your sources, you use a derivative-of-Gaussian shape. For the example above, the derivative would obviously converge to a constant 1 (the free boundary assumption for the derivative seems like the right thing here). However, it's not immediately obvious to me how to generalize this to higher dimensions (in fairness, rillian tried to explain the usual finite-difference multidimensional integration technique, but I didn't quite grasp it at the time; I'll ask again).

But it's possible I'm going over well-known territory, and a reader better versed in diffusion problems can just tell me the answer. I'll find out soon, and even if not I'll report again on what I find.

Politics

My new favorite blog is Talking Points Memo. That's where I found a link to this video clip in which Wesley Clark kicks the ass of some lame Fox News reporter. One of the most satisfying bits of video I've seen in a while.

8 Nov 2003 (updated 8 Nov 2003 at 08:26 UTC) »
How much resolution is enough?

Faithful readers of this blog will have already noted my fascination with high resolution displays. Desktops seem stuck at 100dpi dot matrix quality for the moment, but the portable space is heating up nicely. For one, it looks like the recent release of the Toshiba e805 is a harbinger of the next major rev of the Pocket PC platform. And if 200dpi isn't good enough, Sharp has started showing a prototype of a 300 dpi model.

Just how much resolution is enough? Are pixels going to keep getting smaller, or is there some natural resolution beyond which there isn't a noticeable improvement? I've just submitted the manuscript for my paper on this topic, which I'll present at the EI04 conference in San Jose next January. The quick answer is that, for desktops and laptops, this natural resolution is around 170 dpi, and about 240 for handhelds. Devices within striking range of these targets are available now, but not yet mainstream.

Attacks on the GPL

I'm not sure people realize how seriously the GPL is under attack these days. Yeah, SCO is trash-talking it, but I think this could ultimately be good for us, as they're on the verge of becoming thoroughly discredited, even to the point where shill reporters won't be able to repeat their baloney with a straight face.

I find the official policy of the BSA more disturbing. Of course one expects the BSA to lobby for the benefit of the proprietary software industry, but don't forget that IBM and Novell (the proud new owner of SuSE) are also members in good standing. (Thanks to this Groklaw article for the heads-up)

Take a look at this paragraph in particular:

2. Ensure that government funded research is available to all. Government plays an important role in financing basic software research. When public funds are used to support software research and development, such funding should be made equally available to all developers, regardless of the development model chosen to do the research. The innovations that result from this work should be licensed in ways that take into account both the desirability of building a commons of knowledge and the desirability of applying basic research advances to commercialized products. The dissemination of results broadly in this manner has contributed to a virtuous cycle of innovation in which government funding for basic research advances the set of knowledge available to the public by spurring advances in commercial products. These commercial products in turn generate growth and tax revenue necessary to fund future rounds of public research.

Unless I'm totally misreading it, it's saying that government-funded computer science research projects should not release GPL software, preferring BSD-like licensing if it's going to be released freely. I'm not sure how close this kind of trade-association policy is to becoming the law of the land (not that such things ever happen here in America), but I consider it highly alarming. At the very least, I consider it plausible that this lobbying is having a chilling effect on American government funding of GPL'ed projects. At least other countries seem to be starting to see the light.

Is ZF set theory a hack?

It's hard to believe, but apparently some offhand comments made here have inspired a serious paper comparing various axiomatizations for the foundations of mathematics, primarily in terms of the complexity needed to represent these axioms. Check out "Is ZF a hack?" on Freek Wiedijk's notes page.

I don't think there is a One True Axiomatization for mathematics. In fact, I think the lesson we should take away from Gödel's incompleteness theorems is that, in a very real formal sense, such a thing is impossible.

But, if we can't have that, what can we use? As I've written before, I think the answer is a variant of Postel's Law: be conservative in what axioms you require, but liberal in what axioms you're compatible with.

In that case, you'll use ZFC if you want to prove things about hairy infinite objects, but probably a weaker axiomatization such as Z2 or HOL for most everyday work.

Actually, what I think is needed is a system of polymorphic axioms that express basic concepts such as finite data structures and functions, that can be linked with any of the popular primitive axiom-sets. Finite data structures are fairly straightforward (although I did manage to make an embarrassing typo in the formulation for pair in Z2 - it should be (a + b) * (a + b) + b). Functions are going to be harder. Among other things, I'll need to deal with partiality in some reasonable way. But I think that good engineering solutions are possible.

Such an axiom system wouldn't be any better justified from a philosophical perspective, but I do think it would be a serious improvement in one practical sense: making formal theorems as portable as possible. I think ordinary mathematicians don't worry about this too much, because it usually goes without saying that ordinary, everyday mathematics can be ported (at least in principle) to any appropriate formal framework, so it isn't actually necessary to do it. Things are different, though, if you're trying to stitch together a world-wide repository of formal proofs.

Paul Krugman

Oh, and berend, if I had to pick an entry for the Krugman Gotcha Contest, it would probably be this piece saying that Microsoft's monopoly is not so dangerous after all. Does it surprise you that I don't agree with everything the man has ever written? And yes, I've also been disappointed at the level of sloppiness in some of his columns, if for no other reason than the ammunition it gives to his detractors.

But I find that he has a lot to say. Even when I don't agree with him, I find that he writes thoughtfully, and in such a way as to provoke thoughtful discussion. I'm not sure I see the point of the more vicious Krugman detractors (of whom Luskin seems to be staking out a leadership role). It seems to me that their central point is that we should disagree with everything that Krugman has ever written because he's been wrong a few times. That's one big giant ad hominem attack, and I have little patience for such nonsense.

26 Oct 2003 (updated 26 Oct 2003 at 00:03 UTC) »
Groklaw

I'm finding that the SCO case is a far more compelling drama than any of the law TV shows I've indulged in watching over the years. It doesn't hurt, of course, that I find the outcome important.

And what better way to watch this drama than by reading newspapers and articles in the technology press? I'm very impressed by their probing investigative reporting, and very reassured that their professional objectivity is giving me a clear, unbiased view of the story.

Not!

If you want to really follow the SCO case, the only game in town is Groklaw. Investigative journalism may well be dead in the corporate media, but it is alive and well, nay thriving, at Groklaw. Over there, they actually do the "shoe-leather journalism" that the traditional press claims as their domain but has largely forsaken. When a document is filed at the courthouse in Utah or Delaware, somebody (for example, Frank Sorenson in Utah) goes and scans it. A deep analysis, with plenty of links and citations, is posted by the next day. Any flaws or weak spots in the analysis are pointed out by thorough readers. Pamela Jones, the editor of Groklaw, picks through these reader comments, and includes the best of them in the next posting. I don't know if judges are supposed to read outside sources, but if so they'll certainly find Groklaw to be a treasure trove.

By contrast, coverage in respected newspapers is quite lame. John Markoff ran a piece in the NYT a couple of weeks ago, in which he makes numerous important (as well as trivial) errors. And the scoop he got, while real (a company formerly affiliated with SCO settled a GPL infringement lawsuit, claiming that they shouldn't pay damages because the infringement was unintentional), is thin gruel compared with the hearty meals served up almost daily at Groklaw.

And the so-called "trade press" is barely worthy of our derision. They're just as likely to cite a scumbag shill such as Rob Enderle as an authoritative source such as Groklaw, and they invariably get their facts wrong, sometimes painfully so. But I need hardly bad-mouth the trade press in this community. We get our news from good outlets such as LWN.

Groklaw shows off what a distributed, open-source news gathering and reporting community, with a strong editor, can do. There are many, many more such news stories, possibly too many for an amateur community to be able to handle. SCO is surely a special case. They have provoked the collective anger of the free software community; a unique accomplishment, with a unique result. Whether that result can generalize up to news-gathering on a wider scale, I have no idea, but if it can, that would be hugely exciting.

Epson 2200

I've been writing a high quality inkjet driver for Ghostscript, and I've been doing my testing on an Epson Stylus Photo 2200. I'm passionate about printing, but this is the first time I've really fallen in love with a device in a long time. As far as I'm concerned, this device has solved essentially all of the problems of making high quality digital prints except for the cost of the materials.

First, the photo quality is stunning, truly comparable to real photographic prints. It accomplishes this by using ridiculously tiny (4 picoliter) dots, real 1440x720 resolution, and 7 colors: the usual CMYK plus light cyan, light magenta, and light black. I find the light black ink to be an especially significant improvement - on 6-color devices, whenever the black starts phasing in, you really see those big, ugly black dots against the colored-dot background. The light black ink makes a nice gray, blending beautifully with the other colors. You simply cannot see the dither patterns with your naked eye. Under a magnifier, you do see them, but they look more like film grain than the kind of halftoning usually associated with inkjets.

The second problem they've solved is the colorfastness of the inks. The 2200 is pigment-based, while the consumer devices (such as the 870) use dyes. Dye-based prints are simply not suitable if you plan on keeping them around for any period of time. Further, if you expose them to any kind of sunlight, the fading and color shifts become noticeable within days.

The specs say that the new pigment inks are a lot more stable than the dyes, but there were a few reasons to be skeptical. The previous generation printer, the 2000, also had pigment inks, but they were far from archival, exhibiting a serious orange shift. Also, they sell an "Archival Matte" paper that, as far as I can tell, has the equivalent of sunscreen on its surface, so that the paper will absorb as much of the UV as possible before it gets to the inks. Are they engaging in heroic measures to make up for weak longevity of the inks themselves?

I'm never really satisfied with such questions until I do an experiment myself. So, I printed a test pattern on a piece of plain paper, and taped it to the outside of my south-facing window, so it would be subject to the full measure of California sun and the elements. A month later, the test pattern is still there. In particular, the cyan+magenta+yellow patches show no sign of color shift. Believe me, neither commercial offset printing nor ordinary color prints would hold up nearly so well under these conditions. In fact, the paper is starting to show some signs of degrading, including a slight yellowing and a more brittle-feeling texture.

So, it's not exactly a scientific test, but I think you can make prints on the 2200 with confidence that your grandchildren will still be able to enjoy them. Highly recommended.

[Full disclosure: Epson generously donated the 2200 I used, but has not provided any additional funding. Hey, I'm a programmer and blogger, not a journalist, and I love getting toys! Let this be a notice that sending me toys can result in excellent publicity. I could use, hmmm, let's see: a high-res handheld with WiFi, a mini-ITX box for media playback at the house, and of course a digital camera.]

GE optical mini-mouse

Another piece of gear is the $15 GE HO98094 USB optical mini-mouse I bought at Target last week. It tracks well, works flawlessly under RH 9, and feels good in my hand (the full size Mitsumi on my desktop, which I've always been happy with, now feels clunky). Recommended, and kudos to the people responsible for making the multiplexing between built-in and USB mouse so seamless.

By the way, about a year ago I made the transition from focus-follows-mouse to click-to-focus, based on what appeared to me to be a consensus that the latter was becoming the default. It's ok, but I do miss being able to change focus just by moving the mouse. On the other hand, I wasn't using Alt-Tab much, and obviously that's a huge convenience with click-to-focus. So I was somewhat pleasantly surprised to find that the scroll wheel has (mostly) focus-follows-mouse semantics even when the window manager is set to click-to-focus. The one annoying exception seems to be emacs, which always scrolls the buffer with the keyboard focus.

I'm still not totally happy with the way focus works overall. By far my biggest annoyance is the way that Mozilla tends to arbitrarily grab focus when I'm in the middle of typing, based on

Correspondence

I have a number of fascinating email correspondences going on right now. For one, I decided to follow up by email to Berend De Boer's recommendation of a piece criticizing a recent op-ed by Krugman, and I'm glad of it. I think there's a good chance we'll both learn something from the discourse -- something that seems to happen more and more rarely as the public debate between "liberals" and "conservatives" grows increasingly polarized.

Also, I'm corresponding with both Freek Wiedijk and Norman Megill (of Metamath fame) about proof systems, in particular my nutty idea to design a proof format much like Metamath, but with sound definitions and good support for modular, portable proofs. Freek is urging me to take a closer look at Mizar. I don't care for their closed-source approach (including the centralization of the theorem library), and there doesn't seem to be much good documentation for beginners, but I agree; it seems to have a lot going for it. With luck, I can rip off, uhm, I mean adapt their best ideas. :)

24 Oct 2003 (updated 24 Oct 2003 at 06:20 UTC) »
Letter quality handheld

The first mass-market computer with a "letter quality" display (which I define as 192 dpi or higher) was released today. Not surprisingly, it's a handheld. Too bad it ships with such a lousy OS.

AltiVec and SSE2

One of the projects I've been working on lately is SIMD-optimized versions of my inkjet error diffusion code. It'll be released soon, so all you guys can take a look, but in the meantime here are my impressions.

AltiVec and SSE2 are very, very similar. Both operate on 128-bit words, which can be segmented into four single-precision floats or 32-bit ints, eight shorts, or sixteen chars (plus a couple more formats, depending on the chip). In both cases, the raw computational bandwidth available is absolutely stunning - well-tuned code will run at two instructions per clock cycle. That's something like 24 GFLOPS theoretical peak for a 3 GHz P4. Wow.

Not surprisingly, it's tricky to get that peak performance. A lot depends on how parallel the problem is. A lot of 2D graphics code is embarassingly parallel, which makes it pretty easy to get good SIMD performance. Not so error diffusion algorithms, which have some very tight serialization constraints. I get around this by running four planes at a time. This is a very reasonable approach for printers such as my Epson Photo 2200 which use 7 inks (so only 1/8 of the compute bandwidth is wasted), but does make things a bit trickier.
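
To make the plane-vector idea concrete, here's a toy Python/numpy sketch of Floyd-Steinberg error diffusion in which each pixel step thresholds a whole vector of planes at once; the threshold is a plain comparison rather than an if/else, which is roughly the role the bitwise/select operations play in the real vectorized code. This is only an illustration of the data layout, not the Ghostscript driver code.

import numpy as np

def diffuse_planes(img):
    # img: floats in [0, 1], shape (height, width, planes)
    h, w, planes = img.shape
    buf = img.astype(np.float64)
    out = np.zeros((h, w, planes), dtype=np.uint8)
    for y in range(h):
        for x in range(w):
            old = buf[y, x]                        # one vector covering every plane
            new = (old >= 0.5).astype(np.float64)  # branch-free threshold
            out[y, x] = new.astype(np.uint8)
            err = old - new
            if x + 1 < w:
                buf[y, x + 1] += err * (7 / 16)
            if y + 1 < h:
                if x > 0:
                    buf[y + 1, x - 1] += err * (3 / 16)
                buf[y + 1, x] += err * (5 / 16)
                if x + 1 < w:
                    buf[y + 1, x + 1] += err * (1 / 16)
    return out

halftone = diffuse_planes(np.random.rand(32, 32, 8))   # e.g. 7 ink planes padded to 8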

You really feel the constraint of pipeline latencies. On the P4, you can't use the result of a load until 6 cycles after the load instruction (and that's assuming it hits in L1 cache). Considering that ideally you're issuing two instructions per cycle, that means you can't use the results of a load until twelve instructions down. That's a lot, especially when you've only got 8 registers. On the flip side, the bandwidth is incredible. You can issue one load operation per cycle, for an awe-inspiring peak L1 bandwidth of 48 GB/s.

Bottom line, running 4 planes in AltiVec is about the same speed as running one plane in the old C code. SSE2 is about 1.5 times that speed. So, at least for this problem, the potential for speedup is greater on the P4 architecture than on the G4. I haven't analyzed the code closely, but I suspect that the main culprit is the conditional branches in the C version of the code, which have all been replaced by bitwise logical operations in the vectorized version. Mispredicted conditional branches are performance death on the P4.

As I've said, I found the two SIMD approaches more alike than different. Here are some differences I noted, though:

  • AltiVec feels cleaner, richer, and better designed, even though SSE2 has more instructions. I'm sure a big part of the problem is that it took several generations for SSE2 to evolve - I consider MMX seriously underpowered, and SSE (Pentium III) lacks packed integer operations, critically important for image processing.

  • AltiVec has 32 vector registers; SSE2 has only 8.

  • The tools for AltiVec are more mature. I really appreciated having the C compiler schedule the instructions for me (using "intrinsics"). Intrinsics are available for SSE2 in bleeding-edge versions of GCC, but don't ship with RH9.

  • It's easier to understand AltiVec performance; it's better documented, and tools like CHUD really help (I used amber and simg4).

  • All that said, with currently available chips, the raw bandwidth of the P4 outstrips the G4/G5. While the G4's implementation of AltiVec is excellent, the clock speed is pitifully slow by today's standards. The G5 runs with faster clock, but takes a step backward in how much gets done per cycle (in many more cases, only one instruction can be issued per clock).

I think both architectures are becoming reasonably stable, but it's still easy to find computers that don't support SIMD well, especially laptops and the cool-running Via chips. My desktop is a dual Athlon, which is sadly SSE-only. I also hear that AMD64's implementation of SSE2 is lackluster. So, the performance win still depends a lot on the particulars of the system you're using. I suspect that'll improve with time, as newer models phase in.

SIMD and graphics hardware represent two significantly different approaches for 2D graphics optimization, with different strengths and weaknesses. I feel that SIMD is ultimately the biggest win for printing applications, largely because it easily accommodates sophisticated color transformations and so on. Even so, its raw bandwidth will lead to great performance on interactive display applications. I'd be happy to stake out the SIMD-optimized rendering territory, while largely leaving optimization through offloading to the video card to Xrender/Cairo. In any case, it looks like some fun times are ahead!

On Grokking BitTorrent

Tim Bray recently tried out BitTorrent and didn't see what all the fuss is about. Maybe I can help.

At heart, BitTorrent is nothing more or less than a way of distributing largish (by today's standards, anyway) files around the Internet efficiently, reliably, and without the need for the original distributor to subsidize the bandwidth costs. It is truly legality-neutral, in that there are compelling applications in each of the domains of legal, illegal, and ethical but technically illegal. It is not spyware or anything nasty like most of the commercial P2P clients. Rather, it's distributed under a very liberal license, which makes it possible to integrate it deeply with other applications that need to distribute large files. You haven't seen any of these applications yet because BitTorrent is still young.

Combined with three other technological developments, BitTorrent makes digital video practical for ordinary people. (The other developments are, of course, spiffy video codecs such as MPEG-4, GHz computers, and Mbit/s Internet.) From a technological point of view, people can watch what they want, when they want. I think that's a game-changer.

Of course, most of the movies and TV shows that people want to watch are illegal, but that's just because Hollywood hasn't figured out a business model that takes into account the fact that such a thing is now, just this year, practical on a wide scale. Actually, I feel it's a shame that the MPAA isn't successful in their crusade to reeducate Internet movie traders; if they were, it would greatly increase interest in public domain materials, as well as artists who want to release their work under licenses such as Creative Commons. Ultimately, of course, I hope a good business model does emerge, preferably one that rewards the actual creators.

In the meantime, there's plenty of perfectly legal content to be had, including Linux distros, freely available audio such as etree, game demos, and more. I'm not sure whether Fanimatrix counts as perfectly legal or not, but in any case I'm sure they appreciate not having to cough up a dime for bandwidth every time someone clicks on the download link.

And anime, one of the earliest adopters of BitTorrent, is also very interesting. While releasing Japanese shows with English subtitles is technically a copyright violation, most in the scene abide by an ethical code which looks out for the interests of the creators - as soon as a show is licensed in the US, it is removed. Before then, I'm sure the exposure to hardcore anime fans helps shows find a market. It also helps out the US licensees, because they get to see which shows are popular. It's hard to point to anyone who's really getting hurt.

As it happens, our family watches the majority of its TV off BitTorrent now. Heather and I are both more than a bit interested in Japanese culture (she spent two years there and has written a novel based on her experiences), and the kids really love the shows. I like reading the subtitles to them; it feels like a nice middle ground between slackjawed TV watching and reading from books (of course, we do more than our share of the latter, as well). Just this evening, Max was asking when the next episode of Machine Robo Rescue was coming out, and yesterday he was absolutely thrilled to get a Hyper Fire Robo toy, imported to the anime/manga store at our local Asian market. Commerce!

There's another impact BT is poised to make. Now, all of a sudden, digital video is on the computer technology curve, not the video technology curve. HDTV has been the video technology of the future for about 20 years now. I still don't have any HDTV gear, but I've watched some movie trailers in 720p, downloaded using BitTorrent, of course. HD isn't widespread yet, but all you need is 3 GHz computers instead of 1 GHz, nice monitors, and fast DSL like they have in Korea. Does anyone seriously doubt that this will happen, and soon?

Lastly, the BT protocol itself is exceedingly robust (it uses cryptographically sound hashes to ensure the integrity of the files it transfers), and has a simple interface to the rest of the world. That's maybe not so easy to see on a PC where flakiness and needless complexity rule, but quite important when integrating BitTorrent into an appliance. I'm hoping to buy one of those mini-ITX boxes for the family this Christmas, and stick some simple Python scripts on it so the kids can select their own shows to download.
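
For the curious, the integrity mechanism is simple enough to sketch in a few lines of Python: the .torrent metainfo carries a SHA-1 digest for every fixed-size piece of the payload, and a client simply re-hashes what it received and re-fetches anything that doesn't match. (Parsing the bencoded metainfo is omitted here; piece_hashes and the piece length are assumed to come from it.)

import hashlib

def verify_pieces(path, piece_hashes, piece_len=262144):
    # Returns the indices of pieces whose SHA-1 digest doesn't match the metainfo.
    bad = []
    with open(path, "rb") as f:
        for i, expected in enumerate(piece_hashes):
            piece = f.read(piece_len)
            if hashlib.sha1(piece).digest() != expected:
                bad.append(i)   # corrupted or incomplete; download this piece again
    return bad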

P.S. Tim, the reason why your player didn't have the right codec is essentially political, and easily fixed by, say, a full MPlayer install. Maybe you haven't seen it with your own eyes yet, but hopefully I've helped you see why your Editor-of-a-very-well-known-publication friend is right. BitTorrent is a game-changer.

Voting

The Independent has a story reinforcing my recent comments about shady dealings in electronic voting. It's nice to see this story getting more widespread attention.

I have one nutty idea about how to maybe make a difference: take out an ad in the IACREOT newsletter. Unlike the NYT full-page ad idea that's floating around in blogspace, it would only cost a few hundred bucks, and it would give a good opportunity to roundly criticize election officials for rolling over for corrupt corporate merchants of tamper-friendly voting machines, and encourage them to clearly take sides. I imagine that very few election officials actively want to preside over one of the darkest eras in democracy facing this country.

Google failure

A couple of followups on my piece on Google's vulnerability to DNS abuse. Charles Cox points out a WebmasterWorld post from a Google person this March which makes it clear that they're aware of the problem and are trying to fix it. I'm not sure exactly why they're still so vulnerable, then.

Also, Andrew Clausen sent me his draft paper analyzing the attack-resistance of Google and its PageRank algorithm. I haven't had a chance to study it in detail yet, but from a skim it looks very good. If you're at all interested in trust issues, it's a must-read.

Proofs

Thanks, chalst, for your link to the Girard book. I will indeed study it, which should help clear up my ignorance of the finer points of consequence relations and consistency proofs.

I haven't been writing much about my thoughts on proof systems, but I have been continuing to play with them quite a bit. I've been chipping away at a prototype of the ideas I wrote about here. It's all based very solidly on Metamath, but adds a sound definitional framework (so that axioms and definitions are clearly separated), and the outline of a modular system so that individual proofs can import and export "interfaces", and then you can just wire up a big mesh (acyclic network) of these proofs by making the interfaces match up. I think that my current design can do this, but I'm never certain about things until I implement them.

And perhaps my goal here is a bit overly ambitious for my current level of ability. I'm trying to cook up not just one set of axioms but three (based on Z2, HOL, and ZFC, in order of increasing strength), so that you can write individual proofs using the least powerful axiomatization necessary, but then use those proofs in any system at least as strong. Not only that, but I want the theorems proved to be polymorphic. For example, some of the theorems most interesting to me are over the structures induced by pairs and "options" (typically defined as either None or Some x). These include sequences, dictionaries, trees, graphs, and so on, in other words the usual space of "data structures".

All the axiom systems listed above can easily support these data structures, but just proving your theorems in Z2 is not fully satisfying. If you prove a general theorem about trees, there's no reason it shouldn't work for, say, trees of transfinite ordinals, even though there's no way to represent transfinite ordinals in Z2. Most people designing proof systems just pick an axiomatization. That's part of the reason why it's traditionally difficult to port proofs from one system to another - there's no guarantee you can do it when the underlying axiomatizations differ. If I'm successful, then proof authors and proof users will be subject to a form of Postel's Law: be conservative in what axioms you need, but liberal in what axioms you can work with.

Pairs and options in three axiomatizations

I've never seen these all together like this, and I consider them pretty. They're particularly simple constructions that pair up primitive values and set up option types. The specification of correctness is:

pair(a, b) = pair(a', b') <-> a = a' ^ b = b'
some(a) = some(a') <-> a = a'
some(a) != none().

Z2:
pair(a, b) = (a + b) * (a + b) + b
some(a) = a + 1
none() = 0

HOL:
pair(a, b) = \lambda x \lambda y. x = a ^ y = b
some(a) = \lambda x. x = a
none() = \lambda x. false

ZFC:
pair(a, b) = {{a}, {a, b}}
some(a) = {a}
none() = {}
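
And, for what it's worth, a quick brute-force sanity check (a throwaway Python script, not something any of these proof systems would accept) that the Z2 encoding above really does satisfy the pairing specification, at least on a small range:

def pair(a, b):
    return (a + b) * (a + b) + b

seen = {}
for a in range(200):
    for b in range(200):
        p = pair(a, b)
        # the spec: pair(a, b) = pair(a', b') <-> a = a' and b = b'
        assert seen.setdefault(p, (a, b)) == (a, b), (p, seen[p], (a, b))
print("no collisions among 0..199 x 0..199")
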
How to defeat Google's PageRank

I've been noticing a lot more evil spam results from Google searches. The easiest way to see them is to try a somewhat dodgy search query, such as "snes pokemon rom". Obviously, I find this interesting, because PageRank is supposed to be an attack-resistant trust metric, just like here at Advogato. If someone has succeeded in attacking it, it would be interesting to find out why.

As far as I was able to figure out, these spam sites use a handful of techniques to achieve high Google ranking. Some are related to PageRank, and then there's the generation of random, Markov-chain text to fake out the relevance scores. For example, the top hit, www.jrcrush.com/pc_pokemon_game/pokemon/pokemon_snes_rom.asp, shows up with this context:

... true to life heaviness. Another toughest pokemon snes rom passionately downloads the evolution for a battle. When you see an avariciousness ...

But this isn't the result you get when you actually visit the page; it seems to be custom-generated just for search engines. I've seen other pages that seem to be dynamically generated based on the query in the referer URL. Serving search engines different content than visitors actually see has many problems, not the least of which is that it's the best way to get around Google's otherwise solid policy of not returning porn pages for non-adult searches. I'm no prude, but I don't think the average person searching for "pokemon snes roms" ought to be served porn ads.

But this is just relevance. To get to the top of a search, a site has to have good relevance and a high PageRank score. How did such an obvious spam site achieve such a good score? The answer, not surprisingly, is abuse of DNS. In the case of jrcrush.com, it used to be the web site for the Columbus Crush, a junior hockey league based in Ohio. Then, the domain lapsed and got parked at Go Daddy. Within a few months, a scammer took it over. In the meantime, plenty of pages still link to it, even though the link has rotted. There's also evidence that it was listed directly in the Yahoo directory until recently.

Even though Google is showing itself to be vulnerable, the theory of attack resistance is holding up well. According to my analysis, in an attack-resistant system, there should be a near-linear relationship between the "cost" of the attack and the amount of damage done. Quantifying the cost is tricky, of course, because no abstract model will precisely capture real-world cost. The way I do it is to divide up all nodes (in the case of PageRank, a node is roughly equal to a webpage) into bad and otherwise. The latter category is further divided into "good" and "confused". A confused node is one that has a link to a bad node, for whatever reason. My quantification of attack cost is simply the number of confused nodes.
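
Here's a tiny sketch of that bookkeeping (Python; the link graph and node names are made up for illustration): given the outbound links and the set of bad nodes, the attack cost is just the count of non-bad nodes that link into the bad set.

def attack_cost(links, bad):
    # links: {node: [nodes it links to]}; a "confused" node is a non-bad node
    # with at least one link into the bad set.
    confused = {n for n, outs in links.items()
                if n not in bad and any(t in bad for t in outs)}
    return len(confused)

links = {
    "hockeyfans": ["jrcrush.com"],          # rotted link to a taken-over domain
    "goodblog": ["jrcrush.com", "news"],
    "directory": ["parked.example", "news"],
    "news": [],
    "jrcrush.com": ["spam1", "spam2"],      # the bad neighborhood
    "parked.example": ["spam1"],
}
print(attack_cost(links, bad={"jrcrush.com", "parked.example", "spam1", "spam2"}))   # 3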

And now we see that by subverting DNS, an attacker can, in one fell swoop, exploit a potentially large number of "confused" nodes. In any situation involving security, the attacker will always go after the most vulnerable link. DNS has many great strengths (without it, URLs, and thus the Web, would have been infinitely more painful), but it sits in a position where all Internet users are forced to trust it, and it has not earned that trust.

There are any number of ways to fix the attack outlined above (and I'm sure Google is working on it), but, long term, the best way is to fix DNS itself. It's clearly broken, but it's not obvious how best to fix it. To me, it's obvious that people need to be building research prototypes for a better DNS-like service. Obviously, I think that trust needs to be baked in, but others may have even better ideas.

Another letter quality display

As I've pointed out before, the real movement in high-resolution displays these days is in very small devices. Fujitsu is developing a 250 dpi 4" display, and recently showed a prototype at a Japanese trade show. Still a while before it'll be at your local Fry's, but you can get 216 dpi in Japan now.

Arnold

We, the voters of California, have apparently just lost our friggin' collective minds. I just hope Arnold doesn't do too much damage.

The Great Unraveling

I just read Paul Krugman's book, "The Great Unraveling". It's a great, great book. I'm especially impressed at the way Krugman was able to gain so much insight about the Bush administration simply by looking critically at their economic policy. Highly recommended.

Chris Lydon

Also highly recommended: Chris Lydon's series of audio interviews posted to the Net. These are a great demonstration of how the Net can deliver content completely beyond the reach of the corporate media. I'll be listening, and hope to see more things like it.

QXGA

170 dpi laptop displays haven't reached commodity status, but they're getting closer. Eurocom has a "mobile workstation" with one, priced $600 above the next-lowest-resolution option, which is actually fairly affordable.

My guess is that we'll see more of these. It will be very hard to pass one up when it comes time to replace my laptop.

The Final Solution

You might be an anti-spam kook if you have discovered the final, ultimate solution to the spam problem (FUSSP). I scored shockingly high on the test. Of course, I realize that using a trust metric to defeat spam, while probably effective, won't be easy.

Electronic voting

Something is seriously rotten in the land of electronic voting. Consider:

  • Rebecca Mercuri was thrown out of a meeting of the IACREOT (International Association of Clerks, Recorders, Election Officials, and Treasurers) a couple of months ago for voicing criticism of the electronic voting machines being sold.
  • A group of researchers published a searing criticism of Diebold's touchscreen voting machines. These machines are a total joke in terms of security - they're based on Microsoft Access, so everything, even the audit logs, can easily be tampered with. Further, their use of crypto is spotty and contains amateurish mistakes such as reusing IVs in CBC mode. Diebold's response is lame, simply ignoring many of the points raised in the original paper.
  • The State of Maryland, on the verge of buying lots of Diebold machines, commissioned an "independent" study of the machines from SAIC (another cog in the military-industrial machine), which identified "several high-risk vulnerabilities" and concludes that the system is not compliant with Maryland's standards. The somewhat unbelievable response from the president of Diebold is: "The thorough system assessment conducted by SAIC verifies that the Diebold voting station provides an unprecedented level of election security."
  • The chief executive of Diebold is also working for the Bush campaign, and, in a recent fund raising letter, wrote that he is "committed to helping Ohio deliver its electoral votes to the president next year."
  • Even though Diebold is emerging as the owner of the most smoking gun, the other election machine vendors aren't coming across as being much better.
  • Diebold successfully takes down blackboxvoting.org by issuing a DMCA notice to their ISP, based solely on links posted at the site.
  • Leaked memos clearly indicate that Diebold routinely violates election guidelines, among other things by using versions of their software other than those certified.
  • In spite of all this, the state of Maryland is going forward with the Diebold contract.

This is a big story, I think. Even the mainstream press is starting to cover it. If there are any people reading this in Maryland who are good with smart cards, just put in a million votes for the Green candidates. That ought to wake up the powers that be, and maybe the winner can do some good in the meantime.

It's also clear that we can do some good by raising a stink. The IEEE was all set to approve incredibly weak standards for electronic voting machines, but in response to the EFF's action alert, they actually sent it back to the drawing board.

Diary backlog

It's been a while since my last entry, and there are a lot of things I want to write about. Fortunately, most of them will keep.

Much of the past two weeks was taken up with our Ghostscript staff meeting, our exhibit at the Seybold show, and the various preparation and followup.

One of the nicest things about that time was having fellow hackers stay with us, first Ralph Giles, then Tor Andersson. The kids got to spend time with both, and I think it was very enriching for them, and I hope enjoyable for our guests.

In fact, this was my first time meeting Tor in person. I'm really enjoying working with him - it's clear we have many goals in common. We had a few days of very intense discussion, covering the good and bad in TeX, the Fitz design (of course), and many meditations on the way software should be built.

Silliness

I've enjoyed drinking from a Sun Java mug since I first became its proud owner, back when Sun was making the original push. A few days ago, Sun announced the "Java Desktop", with all kinds of neat indemnification for its buyers. A few days ago, the handle broke off of my Java mug. I have never dropped it and didn't abuse it in any way, just mostly used it for nuking water for tea. Is there a causal relation between these two events?

I come up with all kinds of funny things in my dreams, but rarely such a good pun as this one: it's clear that discount merchandising needs its own XML language for describing the various flavors of clearance sales, everyday discounts, and so on. Thus, I propose "Markdown Markup language". It's a shame I didn't hook up with Tim Bray at Seybold; we were both there, and I'm sure he would have gotten a chuckle out of it.

