O'Reilly ONLamp Blog

December 2002 Archives

data: scheme URIs (RFC2397)

listen

Saturday December 28, 2002 9:19PM
by Uche Ogbuji

Related link: http://www.ietf.org/rfc/rfc2397.txt

data: scheme URIs seem to be a very good idea hamstrung by Microsoft’s contempt. This has been remarked several times in my observations when the topics of self-contained Web pages come up. See this note by Benjamin Franz, for example.

Have you ever used data URLs in any way?

Sony's 'copy-proof' CD fails to silence hackers

listen

Tuesday December 24, 2002 7:19AM
by Uche Ogbuji

An oldie, but one to remember. What more demonstration could we have of the gall of Big Media than this kooky scheme for crashing people’s computers in order to prevent them from even playing CDs they have purchased. Yes the RIAA dinosaur is on its way to extinction, but in the meanwhile it’s very annoying to have to dodge its staggering feet.

Where do Intellectual Property and Personal Information Intersect?

listen

Monday December 23, 2002 3:40PM
by Kevin Bedell

Now, to begin with you have to understand that I know nothing about copyright or trademark laws. But I have this idea that’s nagging at me. Maybe I’m crazy.

The idea is this: Can I copyright or trademark my personal information? You know, my personal combination of “Name, Address, telephone, e-mail, etc”. The basic personal information that everyone asks for.

Then, if I can do that, can I gain control over who uses it? Can I control who buys it, sells it or gives it away? Could I gain the same rights over my personal information that publishers have over the lyrics on a soundtrack?

Why not?

If major corporations can keep people from using their name, why can’t I?

Sure there’s likely to be more than one Kevin Bedell, but not more than one with my Social Security number. If I could get copyright control over that combination then why wouldn’t I have the same rights as other intellectual property holders?

It just seems one-sided that a film publisher can send the FBI after me if I make a single copy of a DVD, but they can track all my purchases and visits to their website and sell my personal information to anyone, anywhere for as much as they want for as long as they want. Why don’t I have laws protecting my information the way they have laws protecting theirs?

How can I get that kind of leverage?

Even if I can’t copyright my personal information, why can’t I copyright a name - say, “Kevin Bedell #1″ or something - and use that information whenever I fill out forms? I could say that was my “trade name” and copyright it. Why not?

Some creative mind with the right knowledge is somewhere I know it. Someone can come up with the right spin on this. Maybe I have to incorporate. Maybe I have to publish my personal information in a book or something to get rights to it.

Imagine that. I could publish a book with only one page - containing only my personal information. (I could do it electronically.) I could include a copyright designation. Then, no one could use that information without my consent. Right?

Then we could build a web site with a form to fill out and it would publish/copyright your information. We could get 10,000 people to do this - and then have them all fill out Warranty cards for the same product and bring a class action suit against the manufacturer if they resell the information. That would turn heads - and change behavior if we were right.

Maybe I need a version of the information I use only when conducting business (something like appending the “#1″ as in the example above). Then I could restrict use of that information for business transactions. My ‘personal’ identity would be free to use - but only if you knew it. If you want to use the info I put on the form I filled out (my ‘business’ identity) then it’s under copyright.

I know this seems crazy, but why not? The laws should protect all of us - there has to be some twist I haven’t thought of yet that would make it possible…

It’s time we tried to stand up and take back some of the control we’ve lost or given away. How can we use existing laws to do it?

Am I crazy?

What Do Intellectual Property Owners Want?

listen

Monday December 23, 2002 11:48AM
by Andy Oram

This is a pointer to an article I published last Friday in The American Reporter. Excerpt follows.

Why copyright? Why did this obscure branch of “intellectual
property,” this private concern of entertainment and
software firms, become the most pressing public policy area
of the computer field?

[The Sklyarov and Jonansen cases] make us suspect that the
multiple tentacles of the “intellectual property” leviathan
bear barbed hooks on each end–and that some of the
critical issues in modern democracy and discourse may be
snagged by them.

Open Source Christmas?

listen

Monday December 23, 2002 12:51AM
by Matthew Langham

T’was the night before Christmas, when suddenly Rudolf stormed through the door of Santa’s office to find him buried in his new iBook. “Damn WIFI is down again”, Santa mumbled, “I can’t post anything to my new weblog.”

Rudolf used the break in Santa’s surfing to launch into his well prepared speech. “You know Santa, I’ve been thinking about this whole Christmas project thing. Maybe we should be looking at open sourcing the project � what do you think?”
“Open sourcing Christmas?”, Santa nearly leapt out of his chair, “you must be joking. Why would I want to do that?”

Rudolf had prepared his arguments well. “First, if we open source Christmas then we can start building a real community around it. You could be project lead and I would be a committer. You know, there are lots of people who don’t like Christmas or can’t seem to get in the spirit. If we make the whole thing open source then people could get involved. And we can get people to write new documentation for the project. The documentation we have seems to be a little out of date.”

“What would people want an open source Christmas for?” Santa was trying hard to follow.
“Well, they could change bits they don’t like. Like for instance perhaps people would prefer Christmas to be in say Summer, instead of Winter. Or have a flexible Christmas depending on which hemisphere you live in. Or perhaps they would prefer you to come through the front-door instead of down the chimney. You know that sort of thing. If Christmas was open source then the community could vote on things they would like to change. Maybe we can even get other festivity projects to merge with the Christmas project � wouldn’t that be great.”

Santa gazed out of the window of his office for some time. “You know Rudolf”, he said after a while, “this open source idea is not bad. Really. But there might be a couple of drawbacks. What if the community decides it doesn’t need a Santa to bring the presents. What if they decide that some Brad Pitt lookalike would be better? And as to Rudolf and the other reindeers pulling my sleigh, what if they decide that Brad should arrive driving say the new BMW?”

Rudolf paused for a moment. “Ah, yes, I hadn’t thought about that. Ok. So, when do we leave?”. “Soon, Rudolf, soon.”, smiling to himself, Santa went back to his weblog.

Happy Holidays.

The Social Aspects of Internationalization

listen

Friday December 20, 2002 1:10PM
by Adam Trachtenberg

In my article Internationalization and Localization with PHP, I outline an all PHP method of adding multi-language support to web sites. What I didn’t discuss, however, is that the technical aspect of translation support is only part of the process. There’s also the human aspect — going back and forth with your team of translators to get the translated text and double-checking to make sure it accurately conveys the original message.

A translator doesn’t want to wade through pages of code just to translate your phrases. It’s a pain in the ass; plus, it’s possible they’ll accidently insert a stray character into a line and break the site. In the article, in order to keep the logic behind the code clear, all the classes appear in the same page as the document.

But, unless your site is only one page, this isn’t a good idea. Instead, you should break each class out to a separate file and include them at the top of the page (maybe even using the auto_prepend configuration directive). This allows you to pass back-and-forth an individual translation file without fear. (Of course, you’re using a version control system, like CVS, so it’s easy to compare file revisions and back out breakages, right?)

To use the examples from the article, put the base class, pc_MC_Base, in a file with other common classes. Then, the US English and US Spanish classes go in their own files: pc_MC_en_US.php and pc_MC_es_US.php. Here’s what belongs in the US Spanish file:


class pc_MC_es_US extends pc_MC_Base {
    function pc_MC_es_US() {
        $this->lang ='es_US';
        $this->messages = array(
            'chicken' => 'pollo',
            'cow' => 'vaca',
            'horse' => 'caballo'
        );
    }

    function i_am_X_years_old($age) {
        return "Tengo $age años";
    }
}

From this, it’s pretty easy for a person to go through the pc_MS_en_US::messages array typing the translated words in as the array keys. This also applies to the methods at the bottom of the class, like i_am_X_years_old().

But, even this can be asking for trouble. On the projects where I used this code, developers would only send the translator the portion of the file with the text and then manually integrate the returned document into the class. This not only simplifies the process, but also allows you to verify everything works as needed, like pluralized words and HTML entities.

While we covered the difficult topic of pluralization in PHP Cookbook, I omitted it from the article. On the face of it, it’s easy to think “How hard can it be to pluralize a word, you just add an ’s’ to the end?” But, what about “fish” or “person”? You need to carve out special cases for those words. And, as if the exceptions in English weren’t bad enough, different languages pluralize words using a whole host of rules and exceptions. (And, in the case of languages like Chinese, the single and the pluralized character are one and the same!)

So, instead of forcing all this on your poor translator, it can be easier to have her alert you to this situation and then write the code yourself to make everything work out programmatically. Additionally, don’t make the translator type ñ instead of ñ. Do this yourself.

Another alternative is to use a specially formatted plain text file and write code to convert the document into PHP code. This technique is similar to how GNU gettext utility operates. But, if you’re going to go that route, I advise actually using gettext itself. PHP supports gettext, so while it adds another dependency to your project, it’s worthwhile if you’re doing many translations with non-technically savvy people. gettext still exposes translators to problems of escaping out quotation marks and printf() style place holders, but it already supports a method for handling pluralized words.

Share your comments on managing the i18n process:

Data strips people of their humanity

listen

Friday December 20, 2002 5:12AM
by Andy Oram

Information–don’t we all want more of it? Our government sure does. But a piece of information written down or entered in a database becomes abstract and loses its original meaning. This is fine if you have strictly limited and well-defined goals for collecting the information. But when your dragnet is open-ended, information cheapens humanity. Combined with arrogance and racism, it leads to incidents like what happened in Los Angeles this week.

The United States government recently picked out a dozen Middle Eastern countries and required boys and men from those countries as young as 16 to report to INS offices and register. When they obeyed the law, hundreds were arrested and abused in the frightening conditions described by the Reuters article. The incident inevitably raised the specter of the Nazis, who would order the Jews of a city to meet at a certain time and place, load them into cattle cars, and take them away.

The INS has since reported that most of the men and boys checked out fine and were released within 24 hours. As if that made the arrests OK!

One can rant on for hours about the political meaning of this information screening, but what concerns us as information processing professionals is the light it casts on data gathering and data mining.

I recently found out that my company made some mistakes on my 401K plan. It was routinely corrected, but the results on my account might look strange when taken out of context. Another time, I set off an alarm someplace and drew the police because somebody had mistakenly removed my account from the alarm system. It was for details no greater than these that Middle Eastern men are going to jail.

We do not like to share details about ourselves, because intuitively we sense that people will judge us wrongly. The situation is rarely as dramatic as it was in Los Angeles this week. We often don’t tell our friends about medical conditions we have. Perhaps we say, “I don’t want to be considered a cancer patient (or a diabetic, or an HIV-positive person, or whatever); I want to be seen for myself.”

Many people even express the same restraint through religious doctrine. They say, “Only God can judge.” Abstracted from the religious setting, what they’re saying is that we cannot treat people fairly when judging them by information that is necessarily limited.

Our government feels no such sense of restraint. It is willing to throw all chances of winning cooperation from the people whose cooperation it needs the most in its current anti-terrorist endeavor–Middle Eastern immigrants. It is determinedly putting in place policies that will violate the civil liberties of all of us, immigrant and native alike. It assumes it can get away with its current violation of human decency, because it assumes that no one will protest except the compatriots of the victims. We must prove it wrong.

Where is the information search taking our country?

Perl Success Story: Perl Helps Heal Swiss Hospital's Database Migration Pain

listen

Thursday December 19, 2002 3:05PM
by Todd Mezzulo

After using an Ingres ARUNA database for ten years, the pathology laboratory at the University Hospital of Lausanne, Switzerland decided it was time to replace their software with a more open Oracle DIAMIC package. No easy task for a facility which handles over 50,000 lab samples per year and is charged with managing the medical records of 500,000 patients. No easy task that is, unless you know Perl.

Here’s Marc-Henri Poget’s story about how Perl enabled him to work around some messy stuff and simplify the migration from an Ingres to an Oracle Database.

Using Perl to Migrate Medical Data From an Ingres to an Oracle Database

The University Hospital of Lausanne, Switzerland, is an 850 bed hospital which serves the needs of the 600,000 people living in the French speaking part of Switzerland. The pathology laboratory information system, used by 80 people, handles 50,000 samples per year.

By 2000, the ARUNA software package had been in use in the laboratory for a decade. The need arose to replace it with a package that was more open and had the ability to evolve. In September 2001, the project aimed at replacing ARUNA with the new DIAMIC software package began. As pathologists rely heavily on previous results for their diagnosis, the ARUNA database held more than 20 years of medical data on-line, some data even coming from systems older than ARUNA. This posed a unique challenge of migrating the Ingres ARUNA database into the Oracle DIAMIC database.

The vendor of DIAMIC proposed the following migration strategy. First, an Oracle database with the same structure as the original Ingres ARUNA database is created. Then, flat files are extracted from the Ingres database and fed into the Oracle ARUNA database. Finally, the vendor develops an application to migrate the data from the Oracle ARUNA to the Oracle DIAMIC database.

To support this strategy, the vendor proposed extracting the flat files from the Ingres ARUNA database using Excel and ODBC. This method facilitated processing the lines to modify their format (for instance to remove the non-significant blanks). The last step was to import these files in the Oracle ARUNA, using a tool written by the vendor, which provides bulk load capabilities. The major drawbacks of this approach are the following:

- Both Ingres and Oracle databases are on Unix machines; it doesn’t make sense to use Windows apps as a gateway.

- Excel doesn’t allow the extraction and handling of very large files from the Ingres database (they contains about half a million of patients).

- In our computer department, we already have experience using SQL Loader from Oracle for bulk loads, so it seemed more natural to use that rather than the vendor’s tool.

- An additional issue that the vendor’s solution doesn’t address is how to ensure that the Oracle ARUNA databases� tables have exactly the same structure as the Ingres ARUNA databases.

To overcome the above-mentioned problems, I chose to work entirely on the Unix platforms and to develop Perl scripts for the necessary tasks.

I used an Ingres utility to generate the DDL (Data Definition Language) SQL statements that were used to create the Ingres tables. I then wrote the Perl “DDL Generation Script” to produce the SQL statements for generating the Oracle ARUNA database using the Oracle SQL Plus utility. This script performs mapping between some data types only found in Ingres into their Oracle equivalents. In order to enhance consistency when using SQL Loader, the script also produces the control files that direct SQL Loader’s operations.

Ingres only provides a way to export a whole table as a flat file, but I prefer being able to extract the data in several steps. For this reason, I chose to install DBI and the Ingres DBD, so that I could write the Perl “Extract script”. The script receives the extraction criteria from its command line and it performs the required data formatting on the output files.

Using both the above-mentioned Perl scripts and the DBI interface, I’ve been able to carbon copy an Ingres database into an Oracle database having the same structure. This approach has allowed me to transfer the 500,000 patients and their associated medical records between the two databases in a few hours.

***

Marc-Henri Poget is a project manager in the computer department of the University Hospital of Lausanne, Switzerland. He holds a degree in software engineering from the Swiss Institute of Technology (www.epfl.ch) as well as an MBA. He was instrumental in the deployment of DIAMIC, a software package used in the pathology laboratory. His interests include project management, troubleshooting and open source software. He can be reached at Marc-Henri.Poget@hospvd.ch.

To learn how large and small companies are using Perl to meet their goals, check out Perl Success Stories.

If you have a Perl success story of your own that you’d like to share, please let me know. You can reach me at: todd@oreilly.com.

PyCon DC 2003 (March 26-28, 2003, Washington DC)

listen

Thursday December 19, 2002 1:15AM
by Uche Ogbuji

Related link: http://python.org/pycon/

“PyCon is a community-oriented Python conference emphasizing accessibility and low cost. It is designed to complement the International Python Conference (IPC), which will be a track at the O’Reilly Open Source Convention (OSCON) starting in 2003.” They seek presenters. Submissions must be received by January 15th. Registration fees are being kept minimal to encourage community turnout.

Browser-based WSYWIG editing

listen

Thursday December 19, 2002 12:49AM
by Uche Ogbuji

This is just a listing of tools that allow developers to add browser-based editing facilities to applications. They use various technologies and have different strengths and weaknesses. They’re also offered with varying licensing and cost terms.

I only recently heard of contentEditable, which as Evan Lenz descreibes it is ‘A proprietary attribute starting with MSIE5.5 that allows the user to do in-browser WYSIWYG editing. It is inherited by descendants and can be overridden with contentEditable=”false” for read-only sections’. You can also see Evan’s little contentEditable demo. Evan also says that folks are trying to implement this attribute for Mozilla.

For a Fourthought project we’re about to start we’ve chosen Arbortext Epic Extend, which is basically a terminal applet that uses Winframe to proxy a session of the Epic editor running on the server. They claim multi-platform multi-browser support. It’s fairly dear, and it’s not very easy to get a demo copy. Since the code base is actually the full Epic editor, it does support most typical Word Processing features.

Xopus is “a browser based in-place wysiwyg XML editor. Xopus allows users to edit their XML data in an intuitive word processor alike way. Xopus allows common users to edit complex XML documents without knowing anything about XML without even realising they are editing XML.” Xopus appears to use XSLT in a very clever way. They also recently implemented a contentEditable implementation for Mozilla 1.3. Xopus is open source. They’re also working on a commercial version.


Xopus runs in a browser without the use of plugins. Browsers supported are:
  -Internet Explorer 5.5 and more on Windows
  -Mozilla 1.0 or more on all platforms"

“Ektron eWebEditPro+XML makes it easy (and transparent) for business users to work with XML-based content. The WYSIWYG (What You See Is What You Get) interface means the editing environment looks identical to the finished output.” Supports XHTML, it seems.

This one is an ActiveX control and Windows only, though they claim IE, Netscape and Mozilla support on Windows. It is commercial yet pretty inexpensive. A demo is available on-line.

edit-on is a Java applet serving as WYSIWYG HTML / XHTML editor. “It easily replaces HTML <TEXTAREA> fields and complements existing form-based Content Management Systems (CMS). In addition edit-on Pro provides comprehensive word processor-like features to websites and web applications. So, web authors and content providers can simply create and publish their content online.”

It looks cross-platform tested. “In contrast to any available DHTML/ActiveX (also known as Microsoft IE DHTML editing control) based editor solution, edit-on Pro enables WYSIWYG rich text editing on almost every computer platform.” It is commercial and very inexpensive. A trial license is available.

“EditLive! for Java is an online XHTML authoring tool that empowers business users with an intuitive, easy-to-use interface for creating and publishing web content.”

“EditLive! for Java supports cross-platform authoring on Windows (Netscape and Internet Explorer), Mac OS X (Internet Explorer) and Sun Solaris (Netscape).” Features spell checking, table editing. It’s commercial, but I can’t figure out how much, and there seems to be no easily available demo.

Evan Lenz commented:

The java demos are definitely slower
than the ActiveX one, but I suppose that is to be expected.
The Ephox one does appear to be a bit snappier than the
RealObjects one.
But I haven’t even begun to explore these from the developer’s standpoint
like how to edit custom XML formats.
I know that eWebEditPro+XML can do this, using various XML- and JavaScript-based configuration files.
Interestingly enough, it looks like RealObjects edit-on
Pro will have an ActiveX add-on, presumably to get the best of both worlds (interoperability and better performance when on Windows…)

Bitflux Editor is a browser based Wysiwyg XML Editor ” for any Operating System”. “The Bitflux Editor is Open Source since September 10, 2002. Bitflux open-sourced the fully functional Editor (with tables, lists, picture upload etc.) under the Apache License.” Looks like it’s Mozilla only.

Here Microsoft discusses code for WYSIWYG editing in IE6.

The doczilla Web site, www.doczilla.com, appears to be down. Does anyone know what’s happened to the project?

As I come across other such projects I shall try to keep this document updated.

2002-12-19: Composite. “ComposIte is a chrome overlay which enables a streamlined Mozilla Editor for html composition in textareas. To use the editor, hit ctrl-e in a textarea. Alternately, you can turn on an ‘Edit with Composite’ button in the Composite prefs (v0.0.5 and higher).”

2003-02-04: The Jackpot: TTW WYSIWYG Editors, a list maintained by University of Bristol. I’ll only post further updates if they’re not on this page.

Do you have any pointers to or comments on such tools?

RooDolF: Google in RDF

listen

Wednesday December 18, 2002 4:19PM
by Uche Ogbuji

” RooDolF allows you to query the Google search engine via the Google Api and have your results returned in standard RDF format.” Also see RooDolF 2, with a few new features.

Registering a new URI scheme

listen

Wednesday December 18, 2002 3:43PM
by Uche Ogbuji

For many applications it may be useful to define private URI schemes or at least schemes that have not been blessed by the full consideration of the IETF. Recently we have had occasion to consider a special scheme for resources in a 4Suite repository instance. Mike Brown and I set out to figure out how best to do this. Here are some notes from that effort that might be helpful to others with the same need.

For those interested in the 4Suite design issues that motivated this quest, see threads starting at [1], [2],
and [3]
. Mike Brown summarized results of discussion in [4]. Note the folloing quotes:

> I too am leaning heavily towards ftss://user:pass@host:ftrpcport/.
> Yes it is strictly bogus, but unless we are willing to register a
> protocol, we have a choice between bogosity and file:.
>
> I think all the options of using file: are too confusing and/or
> limiting. Ergo, I'd rather be somewhat bogus with the ftss: scheme.
>
> As I said, I don't forsee any interop problems, because that scheme
> would never be used outside repo context.

Why is it bogus?

> Well, I might be wrong. I thought you can't just invent your own
> top-level scheme.

...

Research into scheme name selection followed, leading me to post
http://lists.w3.org/Archives/Public/uri/2002Dec/0006.html. So far, we got a
response from Dan Connolly, but no definitive advice as to choosing a good
scheme name. It seems the process of giving the IANA authority over
vendor-specific scheme names has stalled indefinitely, so we can probably
squat on whatever we want as long as we publish an Internet Draft about it for
the IETF. If they pick it up again, they'd probably want us to use vnd.ft.ftss
or vnd.fourthought.ftss rather than just ftss. So do we play nice or use the
short name?

We quickly found RFC 2717, and in particular section 3.3 which covers “Alternative Trees”. From this section:

   While public exposure and review of a URL scheme created in an
   alternative tree is not required, using the IETF Internet-Draft
   mechanism for peer review is strongly encouraged to improve the
   quality of the specification.  RFC publication of alternative tree
   URL schemes is encouraged but not required.  Material may be
   published as an Informational RFC by sending it to the RFC Editor
   (please follow the instructions to RFC authors, RFC 2223 [3]).

   The defining document for an alternative tree may require public
   exposure and/or review for schemes defined in that tree via a
   mechanism other than the IETF Internet-Draft mechanism.

   URL schemes created in an alternative tree must conform to the
   generic URL syntax, RFC 2396.  The tree's defining document may set
   forth additional syntax and semantics requirements above and beyond
   those specified in RFC 2396.

   All new URL schemes SHOULD follow the Guidelines for URL Schemes, set
   forth in RFC 2718 [2].

We also found URIs, URLs, and URNs: Clarifications and Recommendations 1.0, which introduces a dichotomy I hadn’t heard of before: between a “Classical” and a “Contemporary” view of URI space partitioning.

We found Dan Connolly’s informal index of URI addressing schemes. In particular, see the section on Registration of naming schemes.

Mike Brown posted a query to uri@w3.org and got this response from from Dan Connolly. From Connolly’s response:

I try to maintain an informal index of schemes...
  http://www.w3.org/Addressing/schemes
but I find it disheartening. It seems that I find a handful
of unregistered schemes every day. I feel obliged to note
the schemes here in uri@w3.org and invite the developers
to register their schemes, but I don't often get around to it.
Help with that sort of thing is much appreciated.

Hmm... here's a bit of advice: whatever you end up doing, please
write a one-page internet draft about it.
http://www.ietf.org/ietf/1id-guidelines.txt

Optimising Python on multi-processor machines

listen

Wednesday December 18, 2002 10:44AM
by Uche Ogbuji

A collation of discussion on optimizing Python for multiprocessing machines, and the problems that pertain to it.

Smart Websites with CherryPy

listen

Wednesday December 18, 2002 10:33AM
by Uche Ogbuji

This is a very good overview of CherryPy, a Web application system for Python. I really like the cleanliness and straightforward purpose of this project. I think there are some areas where syntactic additions are added where they may have been omitted, but I then again, CherryPy does seem to make developers have to adapt to the framework a lot less than many other such systems do.

Have you used CherryPy? What do you think of it?

What's New in Python 2.3

listen

Wednesday December 18, 2002 10:28AM
by Uche Ogbuji

Andrew Kuchling demonstrates over again why he won the Frank Willison award with his comprehensive discussion of changes in Python 2.3. He usually updates this document all the way up to release, as things change even further, so keep an eye on it to be ready for all the goodies coming to good Python programmers.

So are you looking forward to the 2.3 release?

Should Scientific Knowledge be Free?

listen

Wednesday December 18, 2002 6:36AM
by Kevin Bedell

If you perform scientific research or publish related material then you should consider taking steps to ensure your work remains free and open for others to build on.

Isaac Newton is quoted as saying, “If I have seen further it is by standing on the shoulders of giants”. This is the fundamental idea behind all scientific research - build on what others have learned and create something new.

Yesterday, the Public Library of Science (PLoS) announced that Gordon Moore (co-founder of Intel and author of Moore’s Law”) donated $9 Million to fund open access publication of scientific research. These funds, donated through the Gordon and Betty Moore Foundation, will be put to use to publish open access scientific journals and an on-line, searchable archive of scientific research.

The PLoS has published an open letter to the publishers of scientific information around the world appealing to them to release all material they publish to the PLoS within 6 months of initial publication. This letter has now been signed by 32046 scientists in 182 countries (yes, this number is different the one in the opening to this entry - 6 more scientists signed while I was writing!).

Here are some questions:

Should work you publish by released through the PLoS? If so, read their letter and sign it!
Are you involved in an industry with similar dynamics? Would sharing information in an open way benefit everyone in your industry? Why not consider a similar venture?

For more info, check out the Public Library of Science.

Last-minute business RSS

listen

Saturday December 14, 2002 6:20AM
by Matthew Langham

If you’re an RSS junkie like I am, then your second most active program (after the browser) on the desktop is probably some form of RSS aggregator.

When I’m pressed for time - as seems to be most days now, a quick glance over the subscribed feeds in the morning prepares me for the day. My reading is a collection of weblogs and news feeds - much the same as I am sure your RSS aggregator feeds off.

But there is something missing.

Business Data.

This is something I’ve been thinking about for a while now and the advent of the holiday season seems to be a good idea to ask the following questions:

Where is the Travelocity last minute travel feed for my New Years vacation, split into channels per location? See a vacation you like in your aggregator? Click on the link and “buy now”. Yes it’s advertising - but it’s also information I happen to be interested in.

Where is the TOYS’R'US special offer channel for toys (with perhaps a separate channel per interest)?

Even O’Reilly has yet to catch on to what is possible. Why can’t I get this as an RSS feed also? Or monitor new career opportunities from here using my aggregator?

Imagine how up to date the data could be that arrives into your aggregator - and as YOU subscribe, YOU can control what you read and for how long you want to read the feed.

What easier format could there be for businesses to publish business data in? The client side of things is slowly becoming ubiquitous and frameworks like Apache Cocoon make publishing XML easy.Of course there are already lots of XML formats for publishing business data. But RSS is a no-brainer and the infrastructure is already in place.

Aggregating business data into RSS is the next wave. If this takes off (and why shouldn’t it) what’s to stop business software from say SAP from producing RSS feeds out of the box?

Just imagine.

What other RSS-business data would you like to see?

Mac Addicts rise up and catch e-bay thief

listen

Friday December 13, 2002 7:52AM
by Kevin Bedell

Related link: http://www.remodern.com/caught.html

When Jason Smith lost a beautiful new Mac to a thief with a counterfeit cashier check, he didn’t just get mad - he appealed to the Mac community and they responded.

Pretty soon his thief had been tracked down, had his personal information figured out, had pictures of his house and car taken and - the best part - got caught red-handed by a policeman in a FEDEX outfit.

Follow this Hollywood-style detective story here.

Why Create/Convert Documents to XML?

listen

Wednesday December 11, 2002 10:24AM
by Uche Ogbuji

Related link: http://www.cambridgedocs.com/id35.htm

This white paper by CambridgeDocs gives some reasons for managing documents in XML. It is not really a technical discussion, but rather one aimed at information managers. For those of us who often have to communicate with such an audience, I think it quite useful.

Is Santa an Apple Switcher?

listen

Wednesday December 11, 2002 9:26AM
by Adam Trachtenberg

Related link: http://www.apple.com/switch/ads/will/

Just in from the North Pole: Santa has an iPod. Check out these QuickTime Santa switcher commercials with Will Ferrell.

Internet content filters may be bad for one's health

listen

Wednesday December 11, 2002 8:54AM
by Andy Oram

Computer programmers and administrators may not think that censorship and filtering are health issues, but apparently the prestigious Journal of the American Medical Association does. It’s publishing a study that says what every other study of content filters says: they block lots of useful sites. This problem can be made better or worse by careful configuration made by school and library administrator, but the filters remain blunt instruments.

We have to think of this when considering John McCain’s law requiring filters in government-funded schools and libraries. But given Congress’s neglect of health issues, even this information is unlikely to make them regret passing the bill.

Streaming XPath

listen

Tuesday December 10, 2002 1:36PM
by Uche Ogbuji

There has been a lot of discussion of a subset of XPath that can be computed on the fly while parsing a document. A good way of thinking about this is a subset of XPath that could be implemented in SAX with relatively little fuss. In the XPath NG project, we have been discussing such a subset of XPath as a possible goal of XPath NG. This is a summary of the various streaming XPath proposals that have been brought up so far.

Desai, Arpan presented Introduction to Sequential XPath at XML 2001. His introduction says it all:

This paper will provide an explanation of and the subset of XPath which we will tentatively dub: Sequential XPath, or SXPath for ease of use. SXPath allows a event-based XML parser, such as a typical SAX-compliant XML parser, to execute XPath-like expressions without the need of more memory consumption than is normally used within a sequential parser.

Robin Berjon then mentioned another project:

There is another stab at [streaming XPath] (with an implementation):

http://search.cpan.org/author/RBS/XML-Filter-Dispatcher-0.31/lib/XML/Filter/Dispatcher.pm

The beginning describes the module more, but the end focusses on the subset of
XPath which Barrie calls "EventPath".

PS: don't worry if you don't quite understand the code in some of the examples,
it often uses XML::SAX::Machines which is a level of abstraction above SAX filters.

Berend de Boer also brought up Rules for Efficient XPath Evaluation, another such proposal, constructed with mathematical fastidiousness.

Do you know of any other streamable XPath proposals? Do you have any particular preference?

AT&T, Intel, IBM launch Wi-Fi firm

listen

Thursday December 5, 2002 11:45AM
by Andy Oram

Related link: http://news.com.com/2100-1033-976225.html

The presence of AT&T gives me some confidence that this consortium might actually accomplish this pressing need: to blanket the country with hot-spots for 802.11 access. Wide-area networking is a big job, and I don’t think Intel or IBM are quite up to it themselves. AT&T is clearly the grandmaster of it. But they haven’t done so well with their other new ventures since the divestiture–can they do this one right?

Does this look like a winning combination?

Face to Face with (a) Gnu

listen

Thursday December 5, 2002 8:03AM
by David Sklar

I just got back from a weeklong trip to South Africa. My only brush with technology that’s even remotely relevant to the O’Reilly Network was the personal video on demand system on the plane. The system of the woman sitting next to me crashed and had to be rebooted, so I was treated to the blue screen of death peering out from the seatback above her tray table.

I don’t mean to say that there isn’t open source software in South Africa — I just wasn’t looking for it (that’s what vacation is for.) I did see, however, a lot of computer training facilities, especially in smaller towns and in Soweto. Technology skills are definitely on the radar there as a path for economic improvement.

Less technologically relevant but more amusing was the wildlife in Kruger National Park. Among other marvelous creatures, I saw plenty of wildebeest, or as they’re also known, Gnu. They seemed more interested in grazing than intellectual property, but that may be because they already have plenty of freedom.

How to Validate an E-mail Address

listen

Tuesday December 3, 2002 11:10AM
by Kevin Bedell

Anyone who writes web applications has likely built forms for users to enter their e-mail address. It’s considered a part of our basic identity now - like our name and address. It’s expected we have them.

But then, inevitably, once we’ve collected e-mails from all our customers and we go to use them - we get bounces. Which leaves us to clean up the mess - and there usually is a mess to clean up if we can’t tell the customer their order is ready or that we need information from them. Worse yet, we might just assume the information got through correctly and go on our merry way - leaving our customer dangling.

And while we’re quietly cursing our chubby-fingered (yet all important) customers, we wonder (or scream!) to ourselves why we don’t just write some code to validate the e-mail addresses when they’re entered?

But anyone who’s really looked into doing this will tell you that it gets deep quickly. What seems like it should be straightforward ends up arcane and impossible.

To begin with, e-mail address formats are covered by RFC 822 - which is filled with impenetrable discussions on “sequences of lexical symbols” such as “atoms”, “special characters”, “domain-literals” and “comments”.

“comments”? Yes, e-mail addresses can contain comments. I tested them too - and they work. A comment is (to the best of my knowledge) any text placed in parentheses anywhere in the email address. For example, my e-mail can be:

kevin@kbedell.com, or
kev(you da man!)in@kbedell.com, or
kevin@k(evin)bedell.com

All these work - I tried them. Try validating that. I dare you.

Another bit of a twist is that you can also specify an IP address instead of a domain name. For example, I’m not only “kevin@kbedell.com”, I’m also kevin@216.80.243.82.

To make matters worse - as it should be expected to get - many mail servers won’t accept emails even if they are valid. For example, my mail server won’t accept kevin@216.80.243.82 - the anti-spam controls bounce it.

Imagine - all that work to validate it, and it still won’t work. Makes you want to spend your days surfing these pages…

I even ran across one brave soul that came up with a regular expression that he was sure could validate an e-mail address. Here it is:

Wow. That’s a mouthful. Of course, I’m so jaded by now that I’m sure he must’ve missed something. Or that the emails will just get bounced anyway.

So is validating an email address impossible? Here’s the answer: It’s easy!

You don’t have to be a genius to validate email addresses. All you have to do is send a test e-mail to the customer! Really - this is the only way. If it gets through, the address is valid. If it bounces, then it’s not.

Now let’s just hope no one ever changes their email address once we validate it…

Are my points “valid”?

PHP Weekly Updates and RSS Feeds

listen

Monday December 2, 2002 9:26PM
by Adam Trachtenberg

Need to keep track of PHP development, but don’t have the time to follow all the mailing lists? Don’t worry. Your friendly PHP development team publishes a variety of RSS feeds and weekly newsletters to suit your needs:

Weekly Newsletters

PHP Weekly Summary: http://www.zend.com/zend/week/

PEAR Weekly News: http://pear.php.net/weeklynews.php

RSS

PHP News: http://www.php.net/news.rss

PEAR Releases: http://pear.php.net/rss.php

PHP Newsgroups: http://news.php.net/
(Individual RSS and RDF feeds available for each mailing list.)

Enjoy!