September 2004

Regular Expressions: As Time Goes By

by Cameron Laird and Kathryn Soraiz

Have the time? For developers, having the time is often difficult. We don't mean just that schedules are hard to meet (though that's true) but that temporal calculations such as "give up on this lookup if it takes more than ten seconds" or "set a deadline based on a date as entered by the end user" or "tell the customer that his documents will be shipped after three business days" turn out to be far harder to program than an outsider might expect. Why the complexity? What can we do about it?

Time is hard

There's no mystery about the complexity of time calculations: people feel time deeply, so it has accumulated all the complications typical of human affairs. Don Porter, an electrical engineer with the National Institute of Standards and Technology, notes that, "like natural language, people expect it to be easy. They deal with it all the time, and its weird inconsistencies just seem natural. [P]eople are not aware of how many exceptions of exceptions of exceptions they deal with ... as second nature" when thinking about time.

A concrete example helps illustrate this. Your boss tells you that users should be able to set a deadline for completion of a task, and a program should start sending out warnings 24 hours before the deadline elapses. The highest level of this algorithm is simple: is the current time within a day of the deadline? However straightforward that sounds in English, an executable implementation turns out to require a great deal of care.
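The top level of that algorithm can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `should_warn` and `WARNING_WINDOW` are ours, and we assume both times arrive with an explicit UTC offset attached, which sidesteps (for now) the questions raised in the rest of this section.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical name: how far ahead of the deadline warnings begin.
WARNING_WINDOW = timedelta(hours=24)


def should_warn(now: datetime, deadline: datetime) -> bool:
    """Return True when `now` falls within the 24 hours before `deadline`."""
    # Assumption: refuse "naive" times, so the subtraction below compares
    # absolute instants rather than ambiguous wall-clock readings.
    if now.tzinfo is None or deadline.tzinfo is None:
        raise ValueError("both times must carry an explicit UTC offset")
    remaining = deadline - now
    return timedelta(0) <= remaining <= WARNING_WINDOW


deadline = datetime(2004, 10, 9, 12, 0, tzinfo=timezone.utc)
print(should_warn(datetime(2004, 10, 8, 13, 0, tzinfo=timezone.utc), deadline))  # True
print(should_warn(datetime(2004, 10, 7, 13, 0, tzinfo=timezone.utc), deadline))  # False
```

The comparison itself is trivial; all the difficulty lies in producing two trustworthy values for `now` and `deadline` in the first place.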

Suppose for the moment our computing system knows, as most modern operating systems do (at least in principle), what now is. Time calculations are usually based on an internal representation; in Unix this has traditionally been the number of seconds since the Epoch, that is, the beginning of the year 1970 in UTC. 1970 seemed a handy milestone to the first generation of Unix designers, and Arthur Olson wrote the time-handling library for 4.3BSD, which has served as a reference implementation ever since. We'll also ignore such minor complications as local timezones that don't conform to POSIX.1 in defining a leap-second table.
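To make the seconds-since-the-Epoch representation concrete, here is a small Python sketch that converts between a count of seconds and a calendar date. The helper names `from_epoch` and `to_epoch` are ours, chosen only for illustration, and like POSIX itself the conversion ignores leap seconds.

```python
from datetime import datetime, timezone


def from_epoch(seconds: int) -> datetime:
    """Interpret a Unix seconds-since-the-Epoch count as a UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def to_epoch(moment: datetime) -> int:
    """Convert an offset-aware datetime back to whole seconds since the Epoch."""
    return int(moment.timestamp())


print(from_epoch(0).isoformat())              # 1970-01-01T00:00:00+00:00
print(from_epoch(1_096_588_800).isoformat())  # 2004-10-01T00:00:00+00:00
print(to_epoch(datetime(2004, 10, 1, tzinfo=timezone.utc)))  # 1096588800
```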

What if the user writes the deadline as "09/10/04"? There needs to be a convention for assuming "04" should be understood as the year "2004". We also need a specification for resolving the conflict between the United States, which reads this date as "September 10th, 2004" (already in the past), and nearly all other Gregorian-calendar-using countries, where it is understood as "9 October 2004" (after the date most of you will read this column).
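The ambiguity is easy to reproduce. In the Python sketch below, the same string yields two different dates depending on which convention the program is told to apply. The function name `parse_short_date` and its `day_first` flag are hypothetical; the two-digit-year handling is whatever `strptime`'s `%y` directive does, which in Python maps 00 through 68 to 2000 through 2068.

```python
from datetime import date, datetime


def parse_short_date(text: str, day_first: bool) -> date:
    """Parse an NN/NN/YY date under an explicitly chosen convention."""
    # The caller must say which convention applies; the string cannot.
    layout = "%d/%m/%y" if day_first else "%m/%d/%y"
    return datetime.strptime(text, layout).date()


raw = "09/10/04"
print(parse_short_date(raw, day_first=False))  # 2004-09-10
print(parse_short_date(raw, day_first=True))   # 2004-10-09
```

Nothing in the input says which reading is right, so the convention has to come from somewhere else: a user preference, a locale, or a requirement that dates be entered in an unambiguous form.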

It doesn't stop there. What time of day does the deadline represent? Midnight? Noon? Is that noon in the Greenwich time zone (or, roughly, the "Coordinated Universal Time", abbreviated as UTC), or noon where the computer is, or noon on the computer where the program was originally compiled (rarely the right answer, but sometimes one chosen accidentally) or noon where the end user is, or noon where the task is to be completed?

What if the deadline is near a "time change", that is, a switch between standard time and "daylight saving" time? In a case like that, is it okay that "24 hours" might mean the difference between noon one day, and 11:00 am or 1:00 pm (or 13:00) the next?
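A worked example shows the one-hour discrepancy. In 2004 the United States returned to standard time on October 31. The Python sketch below hand-codes the two US Eastern offsets as fixed values, an assumption made only to keep the example self-contained; a real program would take them from a timezone database rather than hard-wire them.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets standing in for US Eastern time in late 2004.
EDT = timezone(timedelta(hours=-4), "EDT")  # daylight time, through October 30
EST = timezone(timedelta(hours=-5), "EST")  # standard time, from October 31

start = datetime(2004, 10, 30, 12, 0, tzinfo=EDT)  # noon, the day before the change

# Reading 1: exactly 24 elapsed hours later, shown on the local clock.
elapsed = (start + timedelta(hours=24)).astimezone(EST)
print(elapsed.strftime("%Y-%m-%d %H:%M %Z"))  # 2004-10-31 11:00 EST

# Reading 2: the same wall-clock time on the next day.
wall = datetime(2004, 10, 31, 12, 0, tzinfo=EST)
print(wall - start)  # 1 day, 1:00:00
```

Both answers are defensible; which one the boss meant by "24 hours" is a requirements question, not a programming one.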

For many applications, leap seconds and daylight-saving changes are irrelevant; it's more than enough if the program gets things right to within a few hours. Other programs, of course, must respect sub-millisecond tolerances. As Edsger Dijkstra famously noted, a major part of computing's challenge is that we regularly deal with quantities that span unparalleled ranges; our clock-savvy programs rely on the same POSIX.1 application programming interface (API) when timing laser pulses or glacial ablation.

There's more — even with unambiguous requirements and systems specified, how do our computing systems communicate with humans? It's no longer enough to assume everyone online uses the same calendar or alphabet, let alone the same words for days of the week or months.

Recent improvements

The news isn't all bad, though. Run-time libraries are learning more and more about time, with the result that we developers can concentrate on requirements unique to our assignments. That's a lot more productive than trying to recreate on our own what experts like Kevin Kenny of General Electric or Marc-Andre Lemburg of eGenix.com Software have put into Tcl or Python, respectively.

Understand there's real value in their open source contributions. Run-time libraries that haven't been materially updated since Olson's original coding don't know about Unicode, can't use 64-bit arithmetic to accommodate times far into the past or future, and rely on out-of-date timezone assignments. Among systems administrators, it's common to advise shell scripters to invoke Perl or Python one-liners when doing calculations such as "a time three weeks from now." While these languages definitely build in more potent temporal arithmetic than sh, their standard time interfaces are really just easier-to-use wrappers for libc.
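For reference, the "three weeks from now" calculation looks something like this in Python. The helper name `weeks_from` is ours, and the sketch assumes elapsed time in UTC is what's wanted, which dodges the daylight-saving question entirely.

```python
from datetime import datetime, timedelta, timezone


def weeks_from(start: datetime, weeks: int) -> datetime:
    """Return the moment `weeks` whole weeks after `start`."""
    return start + timedelta(weeks=weeks)


# From a fixed starting point, so the result is reproducible:
print(weeks_from(datetime(2004, 9, 10, tzinfo=timezone.utc), 3).date())  # 2004-10-01

# From the current moment, as a shell script would want it:
print(weeks_from(datetime.now(timezone.utc), 3).isoformat())
```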

For the latest with clocks and calendars, look to the Tcl "head", or Lemburg's mxDateTime module for Python. These make it possible, for example, to ask for the first Friday after the second day of the approaching month with:

    # On our main development host, this returned
    # "Fri Oct 08 00:00:00 UTC 2004".
    clock format [clock scan Friday -base [clock scan "October 02, 2004"]]

With a sufficiently recent version, and the right locale and related settings, it's even possible to recognize languages other than English in time specifications. Part of the reason only a few languages have such capabilities is that internationalization and localization remain poorly understood. Tcl's clock command, for instance, had not only to be re-implemented, to parse in a more "interoperable" way, but also to be re-architected. The source for clock largely resides in a file called clock.tcl, to be read early in startup of a Tcl process. An input-output operation such as reading, though, requires correct initialization of the "encoding tables" that map characters of different human languages. Correct generation of executables that respect all these dependencies demands the precision that only experience brings.
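For comparison, the first-Friday calculation from the Tcl example can be approximated with nothing but Python's standard datetime module. This is a rough sketch rather than a substitute for clock scan or mxDateTime: it understands no free-form text, and the helper name `next_weekday` and its "on or after" behavior are our own assumptions.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday() numbers Monday as 0 and Sunday as 6


def next_weekday(base: date, weekday: int) -> date:
    """Return the first date on or after `base` that falls on `weekday`."""
    # Assumption: if `base` already is that weekday, return `base` itself.
    days_ahead = (weekday - base.weekday()) % 7
    return base + timedelta(days=days_ahead)


print(next_weekday(date(2004, 10, 2), FRIDAY))  # 2004-10-08
```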

Study the reference documentation for the time packages of the high-level languages of your choice. You might be surprised to learn how much they have to offer, so take advantage of what's there.

When not puzzling over the calendrical calculations involved in putting out "Regular Expressions" each month, Kathryn and Cameron run their own consultancy, Phaseit, Inc., specializing in high-reliability and high-performance applications managed by high-level languages.
