Tcl Scores High in RE Performance
by Cameron Laird and Kathryn Soraiz
Many of the things programmers know about regular expressions are false. While we've written about this before, it's time now to tell more.
That's much of the problem, in fact: there are a lot of aspects to regular expressions (REs), and there's simply a long distance between the casual use that thousands of programmers make of them and a deep understanding of all the context REs involve. Even a 500-page book on the subject can't cover everything there is to say. Let's see how much of the distance this month's column can bridge, though.
Perl's special place
First, despite our column's title, we're not specialists in REs, nor even zealots for their use. Regular-Expressions.info is the Web place to go for tutorials and ongoing enthusiasm on the subject of REs, and the Wikipedia article on the subject does a good job of explaining how programmers see REs.
We've run into many programmers who identify REs with Perl. This is, at best, naive.
REs were widely used on computers before Perl -- although there weren't nearly as many computers or programmers as today in the '70s and early '80s when REs first appeared in grep. Many programmers first encountered the power of REs while programming in Perl. Perl, moreover, is nearly unique among popular languages for making REs a part of its syntax; most languages expose REs through their run-time libraries. While those two ideals -- privileged syntax versus library evaluation -- turn out not to be as easy to distinguish as might first appear, there's no doubt that REs are deeply embedded in Perl parsing and practice.
A couple of mistakes we often encounter, though, are to think that parsing necessarily involves REs, and/or Perl. There are parsing tasks that simply can't be done by REs, notably XML; others that can be written as REs, but only at the cost of extreme complexity (email addresses are an example); and still others where REs work quite well, but more slowly or more clumsily than such alternatives as scanf, procedural string parsing, glob-style patterns, the pattern matching such languages as Erlang or Icon build in, and even interesting special-purpose parsers other than REs, such as Paul McGuire's Pyparsing or Damian Conway's Parse-RecDescent. Christopher Frenz' book hints at the range of techniques available for practical use.
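To make a few of those alternatives concrete, here is a minimal sketch in Python (our choice for illustration only; the same points hold in Tcl or Perl). The function names, the "app-*.log" pattern, and the sample inputs are hypothetical. It shows a glob-style match next to its RE equivalent, a procedural split of a key=value line, and a depth counter for nested parentheses, the kind of arbitrary nesting that puts formats such as XML beyond classical REs.

```python
import fnmatch
import re


def matches_log_name(filename: str) -> bool:
    """Glob-style test: does the name look like 'app-<anything>.log'?"""
    return fnmatch.fnmatchcase(filename, "app-*.log")


def matches_log_name_re(filename: str) -> bool:
    """The same test as an RE, for ordinary single-line names."""
    return re.fullmatch(r"app-.*\.log", filename) is not None


def parse_setting(line: str):
    """Procedural parsing: split 'key = value' on the first '='."""
    key, sep, value = line.partition("=")
    if not sep:
        return None  # no '=' present, so this is not a setting line
    return key.strip(), value.strip()


def is_balanced(text: str) -> bool:
    """Check parenthesis nesting with a counter instead of an RE."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # a ')' arrived before its '('
    return depth == 0
```

None of these is "better" than an RE in general; the glob and the partition call are simply shorter to read for these narrow jobs, and the counter handles nesting of any depth.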
All this still doesn't exhaust what there is to know about Perl and its REs: Perl REs are actually "extended" REs, and there's controversy about the costs of that extension; Perl 6 is changing Perl REs again; much of Perl's RE functionality is now readily available in "mainstream" languages such as C, Java, and C#; and so on. There's a relative abundance of commenters on Perl, though, so we leave these subjects to others for now.
One final misconception about Perl's REs deserves mention, however: that its REs dominate all others in convenience, correctness, performance, and so on. As a recent discussion in the Lambda the Ultimate forum remarked, Perl's performance on a well-known RE benchmark is less than a quarter of that of the equivalent Tcl. This is not unusual; Tcl's RE engine is quite mature and fast.
The final point to make for this month about REs, though, is that many of these facts simply don't matter! We've seeded this column with an abundance of hyperlinks to outside discussions and explanations; read for yourself about the different implementations and uses of REs. What's remarkable, though, is how little consequence most of these details appear to have to working programmers.
The benchmark just mentioned, for example, and others that demonstrate Tcl can be hundreds of times faster than other languages in RE processing, suggest that Tcl might be a favorite of programmers with big parsing jobs. It's just not so. Objectively, Tcl's RE performance is a strength, particularly in light of the fact that it correctly handles Unicode.
Almost no one cares. Cisco, Oracle, IBM, Siemens, Daimler, Motorola, and many other companies and government agencies use Tcl in "mission-critical" applications. We've interviewed managers from dozens of such development staffs, and never has a decision-maker volunteered that the speed of Tcl's RE engine affected choice of the language. Even the most passionate pieces of Tcl advocacy don't mention its RE engine.
Amazingly, Tcl insiders don't regard its RE engine as "wrung out". In fact, it's one of the more conservative parts of the Tcl implementation, and core maintainer Donal Fellows has abundant ideas for its improvement.
This is far from the only paradox having to do with REs. Consider this: Zope is a large, very capable, mature "content management system" (CMS) based on Python. In its standard configuration, Zope hides REs from developers; even though Python nicely implements REs, and Zope gives almost the full power of Python to its programmers, REs are specifically excluded. Why? The designers decided that REs are such a security and performance risk that, for development of complex Web sites, programmers are best off without them (more precisely, programmers need to override defaults at an administrative level to use REs).
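We don't know exactly which hazards Zope's designers had in mind, but one widely known performance risk in backtracking engines such as Python's re module is easy to sketch. The patterns and the compare helper below are our own illustration, not anything taken from Zope. Both patterns accept the same strings, yet on an input that almost matches, the nested form retries a rapidly growing number of ways to divide the run of a's before it gives up.

```python
import re
import time

# Two patterns that accept exactly the same strings: one or more 'a's.
NESTED = re.compile(r"(a+)+$")  # nested quantifiers invite heavy backtracking
FLAT = re.compile(r"a+$")       # equivalent pattern without the nesting


def time_match(pattern, text):
    """Return (match result, elapsed seconds) for one match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


def compare(n: int = 18):
    """Time both patterns on n 'a's followed by a 'b' that spoils the match."""
    text = "a" * n + "b"
    nested_result, nested_secs = time_match(NESTED, text)
    flat_result, flat_secs = time_match(FLAT, text)
    return nested_result, nested_secs, flat_result, flat_secs
```

Calling compare with slightly larger values of n (say 20, then 22) should show the nested timing climbing steeply while the flat one barely moves. Actual numbers depend on the machine and the Python version, so measure before drawing conclusions. A pattern like this, reachable by untrusted users, is one plausible reason a site might restrict REs.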
The conclusion? REs are powerful, and can, at their best, enormously simplify and accelerate programs. If you need accurate information, though, on whether a particular approach is "best", water-cooler-level folklore is an unreliable guide. REs are such a subtle subject that you'll need to call in the help of an expert, and/or do your own research, any time you need to be certain you're right. Make sure REs work for you, rather than the other way around.
Our thanks to Andreas Kupries, Larry Virden, and Kevin Kenny, who provided special help with this column. Also, thanks to Matthew Persico, for pointing out that last month's column on sprints gave "the short shrift" to the position of Perl Hackathons in the world of sprints. Our response: yes and no. As we already mentioned above, Perl news generally gets its share of attention from others, so we concentrate on quieter corners of the development world. Moreover, Perl Hackathons did not, as best we can tell, influence the early history of sprinting. On the other hand, Chicago and European Perl Hackathons have been great successes in the last few years, and we hope for more of the same in the future.
Kathryn and Cameron run their own consultancy, Phaseit, Inc., specializing in high-reliability and high-performance applications managed by high-level languages. Each month, they write about high-level languages and related topics in their "Regular Expressions" columns.