  Sunday, February 01, 2004 

RSS self-defense

Now that I'm accumulating my inbound feeds as XHTML, in order to database and search them, I find myself in the aggregator business, where I never planned to be. The tools I'm using to XHTML-ize my feeds are Mark Pilgrim's incredibly useful ultra-liberal feed parser and the equally useful HTML Tidy, invented by Dave Raggett, and maintained by folks like Charlie Reitzel, one of CMS Watch's Twenty Leaders to Watch in 2004 (along with yours truly).

Today I finally got around to using the ETag and conditional GET (If-Modified-Since) features of Mark Pilgrim's feed parser. (Apologies to my subscribees who, until now, have been treated impolitely by my indexer.) Of the 200+ feeds to which I subscribe, fifty seem not to support either of these two bandwidth-saving techniques, which means they're probably getting battered unnecessarily by feedreaders. The victims are:

http://altis.pycs.net/rss.xml
http://anopinion.net/Rss.aspx
http://blog.fivesight.com/prb/exec/rss
http://blogs.atlassian.com/rebelutionary/index.rdf
http://fieldmethods.net/backend.php
http://groups.yahoo.com/group/syndication/messages?rss=1&viscount=15
http://inessential.com/xml/rss.xml
http://inessential.com/xml/rss.xml?comments=1&postid=2792
http://matt.griffith.com/weblog/rss.xml
http://nhpr.org/view_rss
http://safari.oreilly.com/NewOnSafari.asp
http://seanmcgrath.blogspot.com/rss/seanmcgrath.xml
http://sqljunkies.com/weblog/mrys/Rss.aspx
http://today.java.net/pub/q/29?cs_rid=47
http://today.java.net/pub/q/weblogs_rss?x-ver=1.0
http://usefulinc.com/edd/blog/rss
http://w3future.com/weblog/rss.xml
http://w3future.com/weblog/staplerFeeds/dubinko.xml
http://webvoice.blogspot.com/rss/webvoice.xml
http://www.burtongroup.com/weblogs/jamielewis/rss.xml
http://www.davidgalbraith.org/index.xml
http://www.eighty-twenty.net/blog?flav=rss
http://www.eod.com/devil/rss10.xml
http://www.fuzzyblog.com/rss.php?version=2.0
http://www.g2bgroup.com/blog/rss.xml
http://www.gonze.com/index.cgi?flav=rss
http://www.gotdotnet.com/team/dbox/rssex.aspx
http://www.gotdotnet.com/team/tewald/rss.aspx?version=0.91
http://www.intertwingly.net/wiki/pie/RecentChanges?action=rss_rc
http://www.lucidus.net/blog/rss.cfm
http://www.markbaker.ca/2002/09/Blog/index.rss
http://www.mobilewhack.com/index.rss
http://www.nelson.monkey.org/~nelson/weblog/index.rss091
http://www.neward.net/ted/weblog/rss.jsp
http://www.newsisfree.com/HPE/xml/newchannels.xml
http://www.openlinksw.com/blog/~kidehen/gems/rss.xml
http://www.oreillynet.com/cs/xml/query/q/295?x-ver=1.0
http://www.pepysdiary.com/syndication/rss.php
http://www.photo-mark.com/cgi-bin/rss2.cgi?set_id=16
http://www.pipetree.com/qmacro/xml
http://www.rassoc.com/gregr/weblog/rss.aspx
http://www.sellsbrothers.com/news/rss.aspx
http://www.simplegeek.com/blogxbrowsing.asmx/GetRss?
http://www.testing.com/cgi-bin/blog/index.rss
http://www.voidstar.com/module.php?mod=blog&op=feed&name=jbond
http://www.webcrimson.com/rss/many.rss
http://www.xmldatabases.org/WK/blog?t=rss20
http://www.xmlhack.com/rss.php
http://www.zope.org/SiteIndex/news.rss
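
For anyone fixing this on the consuming side, the conditional-fetch pattern with Mark Pilgrim's feed parser looks roughly like the sketch below. It's a minimal illustration, assuming you keep the previous ETag and Last-Modified values between runs (here, just a dictionary): the parser sends If-None-Match and If-Modified-Since for you, and reports a 304 status when nothing has changed.

  import feedparser  # Mark Pilgrim's ultra-liberal feed parser

  def polite_fetch(url, cache):
      """Fetch a feed, reusing the ETag and Last-Modified values saved from the
      previous fetch. 'cache' is an in-memory dict here; a real aggregator would
      persist it between runs."""
      prev = cache.get(url, {})
      d = feedparser.parse(url, etag=prev.get('etag'), modified=prev.get('modified'))
      if d.get('status') == 304:
          return None  # the server sent no body: nothing new since last time
      cache[url] = {'etag': d.get('etag'), 'modified': d.get('modified')}
      return d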

 

Paul Venezia's masterful Linux 2.6 review

Hats off to Paul Venezia for his exhaustive analysis of the Linux 2.6 kernel in this week's InfoWorld:

Will the new Linux really perform in the same league as the big boys? To find out, I put the v2.6.0 kernel through several real-world performance tests, comparing its file server, database server, and Web server performance with a recent v2.4 series kernel, v2.4.23. [InfoWorld: Linux v2.6 scales the enterprise, Paul Venezia]

Paul's not kidding; he went to the mat on this one. In a sidebar on the kernel development process, Paul notes that he twice went to the Linux Kernel Mailing List with what seemed to be -- and in fact were -- bugs. Here's the first LKML thread, and here's the second. Nice going!

 


  Saturday, January 31, 2004 

Analyzing blog content

Suppose that we bloggers, collectively, wanted to migrate toward HTML coding and CSS styling conventions that would make our content more interoperable. Since none of us is starting from a clean slate, we'd need to analyze current practice. Well, now we can. Here, for example, is a concordance of use cases for HTML elements with class attributes, drawn from the database I'm building:

<a class="Troll">

  1. OLDaily: Theory in Chaos

<a class="listLinkLrg">

  1. Kingsley Idehen's Blog: Enterprise Databases get a grip on XML

<a class="nodelink">

  1. Erik Benson: Pat Coa

<a class="offlink">

  1. Erik Benson: Pat Coa

<a class="regularArticleU">

  1. Jeroen Bekkers' Groove Weblog: Groove and Weblogs
  2. Kingsley Idehen's Blog: Enterprise Databases get a grip on XML

<a class="weblogItemTitle">

  1. Seb's Open Research: Mario dans Le Devoir

<blockquote class="posts">

  1. McGee's Musings: Russell Ackoff resources on systems thinking

<div class="Section1">

  1. Clemens Vasters: Indigo'ed: Back to Business

<div class="active1">

  1. s l a m: Countering The Bush Doctrine

<div class="blogtitle">

  1. McGee's Musings: Russell Ackoff resources on systems thinking

<div class="caption">

  1. Joi Ito's Web: With bloggers inside, Davos secrets are out - IHT article
  2. Windley's Enterprise Computing Weblog: Toysight

<div class="comment">

  1. Organic BPEL: Avalon is NOT representing the convergence between the Web and GUI!

<div class="date">

  1. Comments for Jon's Radio: None

<div class="inlineimage">

  1. Joi Ito's Web: With bloggers inside, Davos secrets are out - IHT article
  2. Windley's Enterprise Computing Weblog: Toysight

<div class="node">

  1. s l a m: Countering The Bush Doctrine

<div class="personquote">

  1. Joi Ito's Web: With bloggers inside, Davos secrets are out - IHT article

<div class="posts">

  1. McGee's Musings: Russell Ackoff resources on systems thinking

<li class="MsoNormal">

  1. Hillel Cooperman: None
  2. Rob Howard's Blog: Continued...
  3. cbrumme's WebLog: Memory Model

<p class="ArticleBody">

  1. Telematique, water and fire.: Server vendors launch management initiative

<p class="MsoNormal">

  1. Luann Udell / Durable Goods: Myth #3 about Artists
  2. Clemens Vasters: Indigo'ed: Back to Business
  3. Rob Howard's Blog: Last post on the topic -- at least for now!
  4. cbrumme's WebLog: Memory Model

<p class="blogtitle">

  1. McGee's Musings: Russell Ackoff resources on systems thinking

<p class="code">

  1. Duncan Wilcox's weblog: Tag Soup

<p class="editorial">

  1. MobileWhack: Z600 Accessories, Accessories, Accessories

<p class="imagelink">

  1. Kevin Lynch: Intel Centrino

<p class="posts">

  1. McGee's Musings: Russell Ackoff resources on systems thinking

<p class="q">

  1. Duncan Wilcox's weblog: Trusting Corporations

<p class="text">

  1. Hillel Cooperman: None

<p class="times">

  1. Telematique, water and fire.: Metro AG and their RFID Plan

<span class="artText">

  1. Kingsley Idehen's Blog: Enterprise Databases get a grip on XML

<span class="bodytext">

  1. Seb's Open Research: Kottke: Guidelines for learning

<span class="byline">

  1. McGee's Musings: Russell Ackoff resources on systems thinking

<span class="closed">

  1. s l a m: Countering The Bush Doctrine

<span class="imagelink">

  1. Kevin Lynch: Adam Bosworth on Service Architecture

<span class="nxml-attribute-local-name">

  1. darcusblog: Names (again)

<span class="nxml-attribute-value">

  1. darcusblog: Names (again)

<span class="nxml-attribute-value-delimiter">

  1. darcusblog: Names (again)

<span class="nxml-element-local-name">

  1. darcusblog: Names (again)

<span class="nxml-tag-delimiter">

  1. darcusblog: Names (again)

<span class="nxml-tag-slash">

  1. darcusblog: Names (again)

<span class="nxml-text">

  1. darcusblog: Names (again)

<span class="o">

  1. ongoing: Genx

<span class="ofp">

  1. Seb's Open Research: None

<span class="rss:item">

  1. Blogging Alone: None

<span class="storyHead">

  1. Jeroen Bekkers' Groove Weblog: Disruptive in no small measure

<span class="text">

  1. s l a m: Countering The Bush Doctrine

<span class="title">

  1. Blogging Alone: None

<span class="topstoryhead">

  1. Dive into BC4J: BC4J Mentioned in the Latest Article in the OTN Architecture Series

<ul class="noindent">

  1. Corante: Social Software: Friendster notes
  2. Web Voice: And now for something different
  3. Dan Gillmor's eJournal: Electronic Voting: An Insecure Mess, but Full Speed Ahead

With only a few days' worth of accumulated content, I wouldn't dare to venture any recommendations about these use cases. But as the picture develops over time, we might start to see opportunities for convergence.
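
For the curious, a concordance like the one above can be computed with a few lines of XPath once the items are well-formed. The sketch below is an approximation, not the code behind this listing: it assumes the tidied items are available as (title, xhtml) pairs and uses lxml as a stand-in for the Berkeley DB XML store I'm actually querying.

  from collections import defaultdict
  from lxml import html

  def class_concordance(items):
      """Map '<tag class="...">' to the set of post titles that use it.
      'items' is an iterable of (title, xhtml_fragment) pairs."""
      index = defaultdict(set)
      for title, fragment in items:
          tree = html.fromstring(fragment)
          for el in tree.xpath('//*[@class]'):  # every element carrying a class attribute
              index['<%s class="%s">' % (el.tag, el.get('class'))].add(title)
      return index

  # hypothetical sample, printed in the format shown above
  sample = [('Example post', '<div class="posts"><p class="MsoNormal">hi</p></div>')]
  for key, titles in sorted(class_concordance(sample).items()):
      print(key)
      for n, title in enumerate(sorted(titles), 1):
          print('  %d. %s' % (n, title))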

Update: I've been hoping for some external validation of this approach, and Giulio Piancastelli provides it today. As part of a much longer posting with lots of detailed technical analysis of RDF-oriented techniques, he writes:

What Jon is searching for, I think, is a good balance between the cost of providing metadata and the benefits gained by working on the provided metadata, while trying not to entirely move away from the web world as we know it. In fact, this is probably the most important characteristic of Jon's experiment: he is working with what he is able to find right now, that is lots of HTML documents, which can be converted to well-formed XML quite easily, and then searched by means of XPath. While these are ubiquitous technologies, it's difficult to find RDF files spread around as such: proving that the RDF world is query-enabled, and stating that the right place to put metadata is RDF files because you can probably get higher quality and more complete results, is useless if there is little or no data to query.

From my personal perspective, I see those two worlds, one working with XML and XPath, the other messing around with RDF and RDQL, still very far from each other. Jon's experiment is helping us become conscious of the fact that we are already on a metadata path as far as web content is concerned: XML and XPath are probably the first steps in this journey, leading us to a more semantic web augmented with technologies which nowadays seem not to be successful, but that will hopefully prove to be useful when more complex needs arise. We can only hope the virtuous cycle will start to spin soon.

[Through the blogging-glass]

Amen. Thanks, Giulio!

 


  Friday, January 30, 2004 

More fun with queries

I should probably get a life, but instead I can't stop myself from writing more new queries against my growing database of well-formed blog content. Here are some queries that find the following things in the last few days' worth of my inbound RSS feeds:

paragraphs containing links to apple.com

paragraphs that contain links to apple.com and mention 'XSLT'

paragraphs in items posted today that mention 'Orkut'

January items, posted by Joi Ito or David Weinberger, that mention 'Orkut'

items containing tables with cells that mention 'zipcode'

links to amazon.com that also contain images from amazon.com

Either I am crazy, or this is way cool. Or both.
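
For flavor, the first few of those searches look roughly like the XPath expressions below when run against an XHTML-ized item. The real queries run against the Berkeley DB XML store; this sketch uses lxml on a single hypothetical file just to show the shape of the expressions.

  from lxml import html

  doc = html.parse('some-item.xhtml')  # hypothetical: one XHTML-ized feed item

  # paragraphs containing links to apple.com
  apple_paras = doc.xpath("//p[.//a[contains(@href, 'apple.com')]]")

  # paragraphs that contain links to apple.com and mention 'XSLT'
  apple_xslt_paras = doc.xpath(
      "//p[.//a[contains(@href, 'apple.com')] and contains(., 'XSLT')]")

  # tables with cells that mention 'zipcode' (the item-level wrapping is elided here)
  zipcode_tables = doc.xpath("//table[.//td[contains(., 'zipcode')]]")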

 


  Thursday, January 29, 2004 

Structured search, phase two

The next phase of my structured search project is coming to life. For the new version I'm parsing all 200+ of the RSS feeds to which I subscribe, XHTML-izing the content, storing it in Berkeley DB XML, and exposing it to the same kinds of searches I've been applying to my own content. Here's a taste of the kinds of queries that are now possible:

links from Tim Bray

links from Brent Simmons to InfoWorld.com

books mentioned by AKMA

books, with XQuery in the title, mentioned by Michael Rys

The paint's not dry on this thing yet. I have yet to normalize the dates, and I'm still getting the hang of DB XML, but here are some things that become immediately obvious:

  • Feeds that deliver only partial content are at a disadvantage.

  • HTML Tidy is able to coerce a surprisingly large number of my inbound feeds from HTML to XHTML.

  • Once coerced, they're addressable in terms of the elements you find in HTML: links, images, tables, quotes.
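
The second and third of those observations rest on a parse-then-tidy step that's simple enough to sketch. What follows is an approximation of the idea, not the actual indexer: it assumes the pytidylib bindings as a stand-in for whichever Tidy wrapper the real pipeline uses, and it stops short of loading the results into Berkeley DB XML.

  import feedparser
  from tidylib import tidy_document  # assumed binding; any HTML Tidy wrapper will do

  def xhtmlize_feed(url):
      """Parse a feed and coerce each item's HTML to well-formed XHTML,
      returning (title, xhtml_fragment) pairs ready for an XML database."""
      d = feedparser.parse(url)
      items = []
      for entry in d.entries:
          raw = entry.get('description', '') or entry.get('summary', '')
          xhtml, errors = tidy_document(raw, options={
              'output-xhtml': 1,      # emit XHTML rather than HTML
              'numeric-entities': 1,  # named entities would trip up the XML parser
              'show-body-only': 1,    # keep the fragment, not a whole document
          })
          items.append((entry.get('title', ''), xhtml))
      return items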

Until now, I've thought the major roadblock standing in the way of more richly structured content was the lack of easy-to-use XML writing tools. But maybe I've been wrong about that. If it's going to be practical to XHTML-ize what current HTML writing tools produce, maybe we can make a whole lot more progress than I thought by working toward CSS styling conventions that will also provide hooks for more powerful searching.

At the very least, this will be a nice laboratory in which to experiment with a growing pool of XML content, using a variety of XML-capable databases. My hope, of course, is to offer a service that's as useful to you -- the writers of the blogs I'm reading, aggregating and searching -- as it is to me. And ideally, useful to you in ways that invite you to think about how to make what you write even more useful to all of us. We'll see how it goes.

 


  Tuesday, January 27, 2004 

.NET reality check

There's been some pushback recently, in the .NET blogging community, about Microsoft's habit of living in the future. For example:

It is abundantly frustrating to be keeping up with you guys right now. We out here in the real world do not use Longhorn, do not have access to Longhorn (not in a way we can trust for production), and we cannot even begin to test out these great new technologies until version 1.0 (or 2.0 for those that wish to stay sane). I know there's probably not a whole lot you can do, but this is a plea to you from someone "in the field". My job is to work on the architecture team as well as implement solutions for a large-scale commercial website using .NET. I use this stuff all day every day, but I use the 1.1 release bits.

Here's my point, enough with the "this Whidbey, Longhorn, XAML is so cool you should stop whatever it is you are doing and use it". Small problem, we can't. Please help us by remembering that we're still using the release bits, not the latest technology. [Michael Earls]

In the spirit of Michael's plea, I'm working on an upcoming article in which I'll compare what was promised for the .NET platform (er, framework) two and three years ago with the reality as it exists today. Examples of the kinds of issues I want to consider:

  1. Easier deployment. The "end of DLL hell" was one of the early .NET battle cries. CLR metadata, enabling side-by-side execution, was going to make that problem go away. Well, has it? I hear a lot about ClickOnce deployment in Longhorn, but does the existing stuff work as advertised?

  2. Unified programming model. It was obvious that wrapping years of crufty Win32 and COM APIs into clean and shiny .NET Framework classes, and then transitioning apps and services to that framework, wasn't going to happen overnight. But how much progress has been made to date?

  3. Programming language neutrality. Here's a statement, from an early Jeff Richter article about .NET, that provoked oohs and ahhs at the time: "It is possible to create a class in C++ that derives from a class implemented in Visual Basic." Well, does anybody do this now? Is it useful? Meanwhile, the dynamic language support we were going to get, for the likes of Perl and Python, hasn't arrived. Why not?

  4. Security. As security bulletin MS02-026 ("Unchecked buffer in ASP.NET Worker Process") made clear, not everything labeled ".NET" is managed. Still, there is a lot of .NET-based server code running now. Can we articulate the real benefits of .NET's evidence-based approach to code access security? And what have been the tradeoffs? For example, I've noticed that while .NET's machine.config adds a new layer of complexity to an environment, nothing is subtracted. You've still got Active Directory issues, NTFS issues, IIS metabase issues. How do we consolidate and simplify all this stuff?

  5. XML web services. I'd say many of the original goals were met here. Of course the goalposts moved too. .NET Web Services, circa 2000, looked more like CORBA-with-angle-brackets than like service-oriented architecture. But while Longhorn's Indigo aims for the latter target, it's worth considering how well the deployed bits are succeeding on their original terms.

  6. XML universal canvas. I hoped the XML features of Office 2003 were going to deliver on this promise. But here, the jury's still out.

  7. WebForms/WinForms. This is a tricky one. The original .NET roadmap charted two parallel courses for client-side developers, one for the rich client and one for the thin client. Or as we say lately: "rich versus reach." There wasn't a write-once strategy for combining the two -- and indeed, in Longhorn, there still isn't -- but it's probably useful to consider how the side-by-side strategy has played out.

  8. Software as a service. Not much progress there, as Bill Gates acknowledged in a July 2002 speech in which he also lamented the failure of "building block services" -- what was envisioned as Hailstorm -- to emerge. What are the roadblocks here? Plenty of business and technical issues to consider.

  9. Device neutrality. The Tablet PC has turned out to be a good platform for .NET apps. Phones and PDAs, less so, for reasons that will be interesting to explore.

  10. User interface / personal information management. A bunch of important themes were sounded in the 2000 .NET rollout speech. Pub/sub notification. Attention management. Smart tags. Today, I'd argue, I'm getting a lot of these effects from blog culture and RSS. Going forward, Longhorn is the focus of the UI/PIM vision articulated for .NET. But living here in the present, as we do, it's worth considering which aspects of current .NET technology are making a difference on this front.

Over the next week or so, I'd like to have conversations with people on all sides of these (and perhaps other, related) issues. I'll be speaking with various folks privately, but here's a comment link (rss) for those who want to register opinions and/or provide feedback.

 


  Monday, January 26, 2004 

Mindreef's SOAPscope 3.0

Here's a four-minute Flash movie containing three segments from an online demo of the latest version of Mindreef's SOAPscope. The presenter is Frank Grossman; a few others (including me) chime in occasionally. The segments are:

  1. How SOAPscope integrates with the WS-I (Web Services Interoperability Organization) test tools.

  2. How to invoke a WSDL service -- in this case, Microsoft's TerraService -- using SOAPscope to visualize inputs and outputs as pseudocode, and optionally modify and replay messages. You can try this yourself at XMethods.net, but the earlier version 2.0 of SOAPscope that's running there isn't as clever about converting enumerated types in the schema into picklists on the invocation form. (A sketch of what this kind of invocation looks like in code follows the list.)

  3. How SOAPscope 3.0 integrates with Visual Studio.NET.
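
For comparison with the pseudocode view in the demo, here is what a bare-bones programmatic WSDL invocation looks like from Python, using SOAPpy's WSDL proxy. The endpoint and operation below are placeholders rather than TerraService specifics; the point is that SOAPscope captures and renders the request and response traffic a call like this generates.

  from SOAPpy import WSDL

  # hypothetical WSDL URL and operation name; substitute a real service
  proxy = WSDL.Proxy('http://example.com/service.asmx?WSDL')
  result = proxy.LookupPlace(placeName='Keene, NH')  # placeholder operation
  print(result)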

Thanks to the Mindreef guys for playing along with this experiment, and to TechSmith for letting me test-drive Camtasia Studio. If folks think these off-the-cuff videos are useful, I'll try to do more of them. I'm involved in a lot of online demos, and showcasing them in this way is probably win/win both for the companies who present to me and for the readers of this blog.

Update: Just as I was noticing a playback problem, Frank Grossman wrote to report the same thing. Camtasia uses a secondary .SWF file, launched from this HTML, to control playback. Evidently, the idea is to make sure the movie plays at the correct screen size. But what I found, as did Frank, is that progressive playback of the video doesn't work after the first time through. So now I'm pointing directly at the primary .SWF file, which, if you're running at greater than 1024x768 (the resolution of the demo), should work fine. If you're running at 1024x768, though, you'll want to use F11 to maximize the Flash player.

 

The art and science of software testing

Test-driven development does require a lot of time and effort, which means something's got to give. One Java developer, Sue Spielman, sent a Dear John letter to her debugger by way of her Weblog. "It seems over the last year or two we are spending less and less time with each other," she wrote. "How should I tell you this? My time is now spent with my test cases."

Clearly that's a better use of time, but when up to half of the output of a full-blown TDD-style project can be test code, we're going to want to find ways to automate and streamline the effort. Agitar Software's forthcoming Java analyzer, Agitator, which was demonstrated to me recently and is due out this quarter, takes on that challenge. [Full story at InfoWorld.com]
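
To make the test-code-is-half-the-output point concrete, here's the rhythm in miniature, using Python's unittest. This is a generic illustration, nothing to do with the Agitator demo, and slugify is a made-up function: the tests are written first, fail, and then the smallest implementation that satisfies them follows.

  import re
  import unittest

  class TestSlugify(unittest.TestCase):
      # written before slugify existed; these define the behavior we want
      def test_lowercases_and_hyphenates(self):
          self.assertEqual(slugify("Hello World"), "hello-world")

      def test_strips_punctuation(self):
          self.assertEqual(slugify("C'mon, really?"), "cmon-really")

  def slugify(text):
      # the minimal implementation that makes the tests above pass
      text = re.sub(r"[^\w\s]", "", text.lower())
      return re.sub(r"\s+", "-", text.strip())

  if __name__ == '__main__':
      unittest.main()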

 

Next-generation e-forms

E-forms, a technology that's been around for a long time, is now a hotbed of activity. Microsoft's XML-oriented InfoPath, which shipped with Office 2003 in October, is now deployed and in use. Adobe plans to ship a beta version of its PDF-and-XML-oriented forms designer in the first quarter of this year. And e-forms veterans such as PureEdge and Cardiff, whose offerings are built on an XML core, are lining up behind XForms, the e-forms standard that became an official W3C recommendation in October 2003. [Full story at InfoWorld.com]

 


  Sunday, January 25, 2004 

The forest and the trees

The genius of Jon Udell's work is not sheer technical innovation (not that TransQuery amounted to anything like that either) but rather the ability to make sense of how such technologies can be used in simple but powerful ways over compelling content.

And not getting lost in the trees.

[Evan Lenz]

I greatly appreciate Evan's kind words. Ironically, I've been asking myself the same questions about my current project that Evan asks himself, in his posting, about his earlier (and masterfully done) TransQuery project: why doesn't it provoke the reaction I think it should? Not because my stuff is technically innovative, which it isn't, but rather because it shows how ubiquitous but underexploited technologies (XPath, XSLT, XHTML) can make our everyday information more useful.

Coincidentally, I'm now reading XQuery from the Experts, and am having a curiously mixed reaction to the book. The geek in me is irresistibly drawn to this Swiss-army-knife query language that so ambitiously straddles the realms of typed and untyped, hierarchical and relational, declarative and procedural. And I can't wait to use the corpus of XHTML blog content that I'm assembling to explore XQuery implementations, along with the XPath/XSLT techniques I've used so far.

On the other hand: so what? If I can't paint a picture of the forest that people can relate to, then planting a few more trees won't help. The notion of dynamic categories comes closest to answering the "so what?" question. But not close enough. When you work publicly, in blogspace, as I have been doing, reaction to your work is exquisitely measurable. And when I take the pulse of that reaction it's clear that I'm miles away from proving three points:

  1. Ordinary Web content is already full of metadata,

  2. which can enable powerful queries,

  3. which, in turn, can motivate us to enrich the metadata.

As I begin to explore XQuery, I'll try to keep these guiding principles front and center. And if I wander off into the weeds, please feel free to administer a virtual dope slap.

 




     