June 07, 2003

Local blogs?

I see that pb's updating and FAQing his ORblogs.com list of Oregon weblogs. I can see where he'd be interested in local blogs with local info, being new(ish) to Oregon, and after living all my life here, including five years of grade school and five years of college in Corvallis, where he lives, plus most of the rest of my life within forty miles of there, I ought to be able to tell him a thing or two, about coming home from the coast by going up the Siletz clear up through Valsetz when you have far too much time, about dozens of secret waterfalls and lakes and pocket valleys with old abandoned orchards, about speed traps and glass floats and mosquitos and trails laid out by sadists on mules, about avoiding Jeff Park and not missing the Eagle Cap, about Indian salmon and Indian summer and Indian paintbrush on the Iron Mountain trail.

But, I'm not sure I want to. I'm still in love with the idea of reading the thoughts of people in New Zealand and Australia, and being read by people in Norway. I'm not at all sure I want to go back to being just another web-footed, moss-backed redneck who happens to know how to type. There are plenty of lighthouses around here, but I'm not sure they are my lighthouses.

Doing browsers the right way

According to nearly every Mac user I read, the latest beta of OmniWeb uses the open source WebCore and JavaScriptCore from Apple that Safari uses. That's doing it the right way.

What we call a browser is really two (or three, depending on how you count) pieces: you are seeing these words in a particular font, in an invisible box of a particular size in a particular place, thanks to a layout engine/renderer. That (and the JavaScript interpreter, which may or may not be part of it, depending on how you feel about such things) is what needs to be standards compliant, and needs to be developed with as much feedback from as many people as possible. Just finding all the little weirdnesses in the intersections of XHTML, CSS, DOM, and friends, is hard enough, much less getting them all fixed up. Around and over the top of the rendered page, there are menus, toolbars, context menus, and all sorts of other widgets and bells and whistles. That stuff is just normal program UI, like any other program.

What makes perfect sense to me is to develop the renderer in the open, as open source, where anyone can report bugs, suggest improvements, or hack them in if they've got the chops, so that you end up with a single target renderer for your platform, with as many browsers built around it as there are people who think they can do better context menus and bookmarks. That's what Apple and The Omni Group are doing in OS X: whether you use Safari or OmniWeb, web pages that were checked against anything running WebCore should work right.

What makes no sense at all to me is tying your browser, your file manager, and the HTML renderer control you want people to embed in applications all together as one thing. The other day I was doing my usual three things at once poorly, quoting some dialog from Windows Explorer, my desktop file manager, in an email, while reading RSS feeds in SharpReader and browsing the web in Mozilla Firebird. I was more than a little puzzled to find SharpReader frozen when I finally got back to it. If I had some dialog open in Internet Explorer, I wouldn't be too surprised to find other IE windows frozen, insisting that I deal with the dialog first, because I've gotten used to that (a situation that's often described to me by novice users as "the mouse stopped working," since the window that opened the dialog is generally buried under a dozen other windows and popups - thanks for that helpful bit of behavior, Microsoft), but for a while I was baffled by having a third-party internet application frozen by something I'd been doing in my file manager half an hour before. I can see how that sort of behavior might help Microsoft's lawyers, or maybe their stockholders, but it sure doesn't seem to do a damn thing for their users.

May 29, 2003

Still not here

I'm still on vacation (did I mention I was going on vacation?), but meanwhile, a couple of pictures from last weekend, courtesy of my brother-in-law's new camera (and his internet connection), crabbing:
Driving the boat while crabbing, because it beats having to pull crab traps

and with the Japanese glass fishing float that foolishly decided to wander inside the bay where I could find it:
Posing with a Japanese glass fishing float, though nobody seems to be able to look past the beard

Back, um, next week I guess. Do I really have to go back to work?

May 22, 2003

Best email virus delivery yet

I've seen quite a few very nice bits of social engineering in email viruses lately, including things like a faked From: support@microsoft.com, but the very best I've ever seen just arrived:


From: MAILER-DAEMON
Subject: Undelivered Mail Returned to Sender

There were errors processing you mail. Please, read detailed information in the attachment

with an attachment named error.hta. Very nicely done. Of course, that's MAILER-DAEMON@yahoo.com, the grammatical errors in the body grated on my one remaining nerve, and you and I are probably in a tiny minority in actually knowing what an .hta really is, and why we're not about to look at one. Still, a very cunning bit of work, and if they've got a good enough payload, I predict a nice run for whatever it is (after all the Klez-related hits I got during its first outbreak, even if I had bothered to search for a name for it, I wouldn't be posting it).

May 19, 2003

Missing the blog-clog point

Eric misses the point of Google indexing weblogs, and sometimes ranking them rather high for some searches:

most people would consider google to be a better service if i, and a relatively small number of other people, didn't get in the way of the information they really want.

He's referring to the way he tops the list for galaxie 500 window crank by first talking about trying to fix his, and then talking about finding his entry while searching for information. What he's missing, though, are two things: that's an okay first-shot search phrase, but as soon as he saw that it didn't work, he should have revised his search rather than expecting Google to guess right every time; and then searching for something like galaxie 500 "window crank" (note that many of the original results are talking about the window on one car, and the crank(shaft) on another) would point out that he's number one because Google doesn't have a damn thing useful for that search.

However, someone else (if they are at least half-bright), searching for that phrase because they have a broken window crank on their Galaxie, could go to Eric's weblog, track down the entry, see the magic words "i'm fortunate to have the original shop manual", track down his email address, and ask him if he'd be willing to trade copies of a few pages for some extra parts. That's quite a bit more useful than the other results, offering to sell one car with a missing window, a Galaxie, and yet another car with a particular crankshaft.

I feel bad about the fact that at this moment I'm Google's top result for http error 500, not because I'm "clogging up" the results and keeping people from finding useful information, but because I've been shown that there isn't any useful information, and I'm still too lazy to fix the situation. There are a very few half-decent pages listing a bit of information about every HTTP response code, a million or so pages just repeating the explanations from the spec ("The server encountered an unexpected condition which prevented it from fulfilling the request." - gee, thanks), and none that really understand how the web works today, and provide useful information in a useful way.

In the old days of directories, you would dig through Yahoo to find a link to a single page listing HTTP error codes, and find the one you were looking for, and maybe get lucky and get a helpful explanation as well. That's not how it goes anymore: people search for something far too specific, and then give up. What the web needs for this particular class of query are separate pages, with the error code and name in the URL and in an <h1>, with an explanation of what it means, why you are seeing it, and what to do next ("something's screwed up in your script, or your .htaccess, or your server: rename .htaccess and see if it works, then look in your error.log or ask your host to look, or try to remember what you just did to your server").
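The shape of such a page is mechanical enough to sketch. Here's a made-up generator, assuming a hypothetical errors table and URL scheme (the advice text is just a placeholder, not anything any real site serves):

```javascript
// Sketch of generating the kind of per-error-code page described above:
// the code and name go in the URL and in an <h1>, followed by a
// plain-English explanation of what to actually do next.
// The table, URL scheme, and advice text are all made up.
var errors = {
  500: {
    name: 'Internal Server Error',
    advice: "Something's screwed up in your script, your .htaccess, " +
            'or your server: rename .htaccess and see if it works, ' +
            'then look in your error.log or ask your host to look.'
  },
  404: {
    name: 'Not Found',
    advice: 'The URL is wrong, or the page moved: check for typos, ' +
            'then try the site search or the front page.'
  }
};

function errorPage(code) {
  var err = errors[code];
  // turn "500" + "Internal Server Error" into a slug for the URL
  var slug = code + '-' + err.name.toLowerCase().replace(/ /g, '-');
  return {
    url: '/http-error/' + slug + '.html',
    html: '<h1>HTTP ' + code + ': ' + err.name + '</h1>\n' +
          '<p>' + err.advice + '</p>'
  };
}

console.log(errorPage(500).url); // /http-error/500-internal-server-error.html
```

A searcher typing the over-specific query then lands on a small, well-titled page that actually answers it, instead of on a spec quotation.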

Google isn't saying "sorry, this sucks for what you searched on, but I'm so confused by all this incestuous linking that I just have to give it to you anyway," what it's saying is "I don't have anything useful for that query, but people seem to think this guy knows his way around, and he's at least used the words in your search, so maybe he linked to something that will prove useful." So far, I haven't, but that's my failure, not Google's. If you actually look at the sorts of things where Google ranks you "too high", you're likely to find that either nobody has anything useful to say, or there's no way for Google to tell what's authoritative yet. If three hundred pages include the term Googlewash, but Google hasn't done a monthly reindex yet, all it can do is show you those, and hope that some highly ranked one linked to the right place. Once it reindexes, it can see that a couple hundred used the term in a link to one page, and that's probably an authoritative source about it, though to cut down on Googlebombs, a page that also includes the term in the page title will probably come first, even with only fifty inbound "Googlewash" links. Your weblog entry isn't ranked high for some searches because Google's confused, it's ranked high because Google needs your help, and expects you to link to things it hasn't had a chance to sort out and fully index, or to link to the most useful thing about the keywords that your entry is highlighting. So get to it; it's damn sure not going to be "content/6/30195.html" providing searchers with useful information and links.

May 18, 2003

The hazards of really large numbers

The new version of Blogger uses 18-digit numbers for post ID numbers, rather than the old 8-digit numbers. Until just now, I thought that was just a handy way to tell new.blogger blogs from old, and an amusing statement on just how big Blogger plans to grow, rather like small companies that use account numbers large enough to cover twice the population of the world (and then inevitably require you to write your account number on your check). Now, I'm not quite so amused, thanks to a question about my remote dotcomments hack.

Though it really should be updated to a more modern and accessible link style, dotcomments and remotedotcomments both link to a comment popup by calling a javascript function with javascript:viewComments(<$BlogItemNumber$>), and then the function sticks the item number into a URL and opens it. Unfortunately, javascript being dynamically typed, and there being no quotes around <$BlogItemNumber$> to force it to be a string, that's an actual number up until it gets concatenated to a string in the URL, when it gets cast to a string. Oopsie, 18 digits is exactly one more than the precision of a javascript number, so your item ID of 123456789012345678 turns into 1.2345678901234568 times 10 to the whatever, with the 78 rounded up to 80, so the number that dotcomments uses for comments on an entry isn't actually the number of the entry. The fix is simple enough, add quotes around the number so it's a string: javascript:viewComments('<$BlogItemNumber$>'), but now I'm trying to quickly think of every bit of javascript I've ever written and released that might treat post numbers as anything but strings, because according to the unlinkable, unarchived, newest post on the Blogger home page, the changeover for everyone is a-comin' soon.
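The rounding is easy to demonstrate; a minimal sketch (the 18-digit ID here is made up, not a real Blogger post number):

```javascript
// An 18-digit Blogger-style item ID, once as a bare number and once quoted.
// JavaScript numbers are IEEE 754 doubles, good for only about 16-17
// significant decimal digits, so the trailing digits of an unquoted
// 18-digit literal are rounded away before you ever touch it.
var asNumber = 123456789012345678;
var asString = '123456789012345678';

// Concatenating the number into a URL casts it to a string too late:
// the precision is already gone.
var numberUrl = 'comments.php?id=' + asNumber;
var stringUrl = 'comments.php?id=' + asString;

console.log(numberUrl); // comments.php?id=123456789012345680 -- wrong ID
console.log(stringUrl); // comments.php?id=123456789012345678 -- right ID
```

Which is exactly why quoting <$BlogItemNumber$> in the function call fixes it: the value never exists as a number, so it never gets rounded.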

Make that, a problem for javascript and for PHP.

May 16, 2003

RSS validation from a textarea

It looks like I didn't ever get around to blogging this at the time, but using it myself last night reminded me how handy it can be: if you don't want to make a dozen changes in your published RSS just to see what the validator thinks about them, you can just paste a copy in my RSS Sandbox and twiddle there. It uses the validator's SOAP interface, which is mostly why I wrote it in the first place. It's ugly as sin, rough as a cob, and if you want to know "would it be valid if I did this to that?", it'll tell you.

May 13, 2003

MT rebuild type mod

A lovely (but very much not for the faint of heart) MT hack from Sean Willson: mt rebuild type mod changes the options for rebuilding index templates from just "do" or "don't" to All, Never, Entry, Comment, TrackBack, Entry and Comment, Entry and TrackBack, Comment and TrackBack.

The biggest gain is setting your Master Archive Index template to only rebuild when you add an entry, because it tends to be a bear to build, and since it probably doesn't have any comment or TrackBack counts it doesn't need to be rebuilt for every ping (with the risk of it taking too long, and having the ping time out even though you really got it, so that you get repinged, and repinged). I saw someone's timing test results the other day (of course, I can't find them again), saying that most index templates took on the order of a second to build, and the archive index took what I remember as 15 seconds, which is just asking for TrackBack pings to look like they didn't go through from the pinger's perspective.

May 12, 2003

Gawker to leave weekend field open

Gawker announces no weekend publishing [via Hylton].

Works for me: you might be surprised just how much of my unearned attention in the world of weblogs is due to the fact that what little publishing I do tends to be on weekends, or very late Pacific time (works nicely for Australia and Europe), or on holidays. Weblogs.com may not list very many sites on US holidays, and lots of people actually hold in their posts until Monday "when more people will be reading," but the people who are obsessed enough to read blogs at "off" times can be quite interesting, and influential. So one more site saving it for Monday (when I'm least likely to read a post, as I try to tear through a couple hundred at a time in my aggregator) is one less site competing with me for a few minutes of your attention. Go Gawker! Take Friday off, too!

A call from Google! Oh, the legal division

Someone should be expecting a call from the lawyers in charge of Google's logo, methinks.

Interesting service, though, and with posts like The AOLing of CPAN? I probably should be reading LaughingMeme. And one more subscription won't break me: it still takes less than an hour to suck them down through my dialup straw.

Not long at all

While I was signing up for a Technorati API key, blo.gs IMed me to say that Mark had updated. I figured I might as well see what he had to say, before I started thinking about what I could do with Technorati, and in what language. Well, PyTechnorati probably answers the language question, at least.

Update: just to muddy the waters: Technorati.py from Phil Pearson, technorati.py from Aaron Swartz, which uses his xmltramp, which is a bit like his TRAMP, which makes dealing with RDF in Python dead simple, MTTechnorati, a Movable Type plugin from Adam Kalsey, phpTechnorati from RevJim, and Radio and Frontier glue from Dave Winer.

Even paranoids get chased by agents

Suppose you were a little bit paranoid about Google, due to the way what's probably the anti-Googlebombing filter works, dropping a weblog post out of the results if a few people mention where it is in the results. Suppose that all the Matrix talk lately had you in a Matrix frame of mind. Then suppose that a comment left by Eadz got you wondering whether or not clicking on the "Vote for|against this page" smilies on the Internet Explorer Google Toolbar would actually cause Google to instantly index a page.

Suppose that "allinurl:what_google_could_do_with_weblogs" showed that the entry hadn't yet been indexed. Suppose you started IE and voted for that page. Suppose that a half-hour later, you said grep googlebot access.log. Suppose that the only things Googlebot had accessed were robots.txt, and an old, unlinked entry where you made fun of Google's inability to count to three.

Would you be expecting Agent Smith to knock on your door? I am.

What Google could do with weblogs

While I don't know what Google will do about weblogs in search results (which puts me ahead of Andrew "I see the Googlebots walking among us" Orlowski, since I at least know what I don't know), I do know one thing that they could do.

If you remember back to our last flurry of talking about Google and weblogs, when the purchase of Blogger was announced, lots of people were talking about Google using the changes.xml file from weblogs.com as a way to find out what weblogs they ought to index immediately. Anyone whose job depends on having Google index them well would have fallen off their chair laughing at the thought of Google letting websites choose to be instantly indexed: weblogs.com would have been completely overwhelmed as every single commercial site on the entire web began to ping it, and ping it, and ping it. Being indexed quickly by Google is valuable. Until quite recently, practically speaking you needed to allow two months for new content to get into Google's index. But, despite the fact that a self-selected list of things to crawl wouldn't work for Google, Blogger recently released its own changes.xml file.

Even though Andrew "I Can't Remember How Search Engines Work, Because I Hate Blogs So Much" Orlowski doesn't think so, weblogs are useful to search engines. Google News is useful for fresh content about the boring things that big media likes to cover, but people use Google for more than just the latest Britney Spears gossip. If someone creates a "Flog Britney Spears" Flash game, and someone who deleted their forty times forwarded email wants to find it, they won't be pleased to know that Google will add it to their index during the next monthly crawl. They probably don't want to read three hundred weblog posts about how fun it is to flog Britney, they want to flog Britney now! Although the actual weblog posts themselves may not be as useful to Joe Searcher as their current ranking in Google results makes it seem, the links within them, especially in aggregate, are. If three hundred blogs link to the same page, two hundred of them including "Flash game" in the link text, then that page might be a good candidate for a jump in the "Flash game" results, at least for a while (though that sort of thing reopens the whole Googlebombing issue). At the very least, that page needs to get in the index, pronto, if it isn't already there.
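The aggregate-link idea is simple enough to sketch. Assuming a made-up list of (target URL, link text) pairs scraped from recent weblog posts, you just count which page a phrase most often points at (all names and URLs here are invented for illustration):

```javascript
// Rough sketch of using aggregated weblog links to pick a fresh result:
// given links scraped from recent posts, count how many posts pointed
// a given phrase at each URL, and surface the biggest vote-getter.
function bestTargetFor(phrase, links) {
  var counts = {};
  for (var i = 0; i < links.length; i++) {
    var link = links[i];
    // only count links whose anchor text contains the search phrase
    if (link.text.toLowerCase().indexOf(phrase.toLowerCase()) !== -1) {
      counts[link.href] = (counts[link.href] || 0) + 1;
    }
  }
  var best = null, bestCount = 0;
  for (var href in counts) {
    if (counts[href] > bestCount) {
      best = href;
      bestCount = counts[href];
    }
  }
  return { href: best, votes: bestCount };
}

// Invented sample data: three posts, two linking the game itself.
var links = [
  { href: 'http://example.com/flog', text: 'a Flash game about Britney' },
  { href: 'http://example.com/flog', text: 'silly Flash game' },
  { href: 'http://example.org/news', text: 'Britney news roundup' }
];
console.log(bestTargetFor('flash game', links));
```

Real ranking would obviously need Googlebomb defenses on top of this, but the raw signal - many independent posts agreeing on one target - is right there in the weblog data.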

Google can't just remove every weblog post from the main index without reducing the quality of their search results: there are things that only appear in weblogs, or where weblogs are the best results. However, there's no need to give a weblog's front page prime results, when it's just a temporary view of the posts that permanently appear elsewhere. By treating any URL that appears in a changes.xml file as the front page of a weblog, Google could more intelligently handle weblogs, returning any other page from the same site with the same keywords instead of returning the front page, so that they would no longer deliver frustrated searchers looking for that post that was on your front page two weeks ago.

With a little cunning to determine what's a part of your weblog, and what's an unrelated part of your site, they could also damp down the extremely high rank they give to things like Movable Type comment and TrackBack popups, which are so well marked up semantically that The Register's Chief Foamer-At-The-Mouth is actually right (for all the wrong reasons) when he fingers TrackBack (and comments, though he doesn't mention them because Joi doesn't use comment popups, so one doesn't come up first when he ego surfs for andrew orlowski googlewash) as being a source of weblog noise in search results. It really has nothing to do with TrackBack the spec, and everything to do with good markup. Google loves small pages with good HTML, so if your TrackBack listing popup has your entry title as the HTML <title>, and especially if it then repeats it in an <h2> in the body, it's going to rank high for keywords in the title. And if you bury the title for your actual entry, either by having date-based archives so there isn't a single page with the entry title as the HTML title, or, as Joi does, by not using semantic markup to let Google know that the entry title is an important part of the entry, then your TrackBack popup may well outrank your entry itself. Knowing that something is a weblog, based on it having pinged, would let Google damp down that sort of thing with a weblog-specific filter, without having to completely ghettoize us on a tab that nobody would ever search (I pass through Google twenty or thirty times a day, but I would have been hard pressed to name all five tabs, because I use Google to search everything for me, not to have me tell it how to search and where to look).

I don't know in detail how to use weblog links to keep Google fresh while avoiding Googlebombing, or how to keep beautifully marked up but meaningless pages from outranking more confused but richer pages, but then, that's why I don't work at Google. Judging by the more than a thousand posts in various threads in the WebmasterWorld Google forum about Google updates in just the last week or so, I'd say the folks at Google haven't quite run out of ideas for how to change their index around.

Preposting update: while looking around at just how many posts there really were over there, I ran across GoogleGuy (an actual Google employee) saying rather diplomatically that Orlowski's full of shit. Ev saying it was one thing, but they might not tell him everything about the search side of the business, just yet. GG, other than his Googlish way of saying as little as possible, has always seemed quite authoritative. And as an aside, what's up with Ev's archives, with the link in the RSS feed pointing to a weekly archive page that includes his original "Orlowski is full of crap. Again." post, while the front page and its permalink point to a monthly archive with the post snipped? Also, I noticed that the template for the weekly archive was much easier to read, and the blogroll is, um, er, very nice company to keep. I'd roll that version of the template back onto the main page, if it were me (calculating how much an Evhead link is worth in Blogshares money, thinking about a stock split).

May 10, 2003

Perhaps comment modding?

Thinking about how the humorless trolls in Luke's comments sound rather like Slashdot comments with a score of 1 got me thinking that maybe weblog comments ought to adopt a similar moderation scheme.

I really don't want to completely silence people who leave stupid troll comments or turn on shotgun-TrackBack and ping even when they are just linking without comment, but I'd like to not see them most of the time, and not have them indexed.

I doubt that I'll have the attention-span to actually do the heavy lifting, but it shouldn't be all that tough to add a moderation field to the MT comment database, some UI in the "edit this comment" page, a link to that page in the notification email, and a tag plus some CSS to set unwanted comments to display:none with a link to turn them back on. For extra credit, integrate it with MTThreadedComments so that any reply to a modded-down comment defaults to modded-down, with the option to promote it.
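The display half of that scheme fits in a few lines; as a sketch (all names and the threshold are made up, not part of any actual MT plugin):

```javascript
// Sketch of the moderation display logic: each comment carries a score,
// a reply with no score of its own inherits its parent's (so replies to
// a modded-down comment default to modded-down), and anything below the
// threshold gets display:none, with a link in the page to reveal it.
function modScore(comment, parent) {
  if (comment.score !== undefined) return comment.score;
  // no explicit score: default to the parent's score, or 1 for top-level
  return parent ? modScore(parent) : 1;
}

function displayStyle(comment, parent, threshold) {
  return modScore(comment, parent) < threshold ? 'none' : 'block';
}

var troll  = { score: 0 };  // modded down by hand
var reply  = {};            // a reply to the troll, not yet moderated
var normal = { score: 1 };  // the default, untouched comment

console.log(displayStyle(troll, null, 1));   // none
console.log(displayStyle(reply, troll, 1));  // none - inherits the parent's 0
console.log(displayStyle(normal, null, 1));  // block
```

The MT template tag would just emit the score as a class or inline style, and a "show hidden comments" link flips display back on, so nothing is ever actually silenced, just tucked away (and kept out of the index with a noindex popup, if you wanted to go that far).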

May 09, 2003

Yet another CSS slam

I guess linking to CSS rants is last week's news. After all, plenty of people seemed to be linking to Dave The Dinosaur, going back to his font tags. And lots of people linked to old-schooler jwz, unwilling to give up his tables. But I just haven't seen any links to this guy bad mouthing CSS. I mean, come on, he had so much trouble making a simple CSS layout that he, er, wrote XUL instead.

When it works right in a reasonable number of browsers, CSS is a great thing. Even when it doesn't, it can be fun to play with, and use bugs to make fun of the other guy's browser. But CSS is not a religion. It's not all-or-nothing. If what you do is play with CSS and then write about it, great! Please, for my sake, bang your head against it, with a half-dozen browsers on two or three operating systems all going at once, and then tell us what you learned. But, if you want other people, who don't do that for fun, to get into CSS as well, then you need to stop acting as though anyone who ever uses a single table or font tag is evil. If someone whose non-tech writing I love spends all their free time for three days running trying to replace their last remaining table instead of writing something beautiful, you've harmed us both.

Up next on the wheel-o-rants: people who fail to grasp the fact that XHTML+CSS is not a single word synonymous with valid markup.