UnixReview.com
Welcome to UnixReview.com


Main Menu
  Home
  Archives
  Reviews
  Books
  Geek Links
  Contact Us

Sections
  Open Source
  Certification
  Shell Corner
  lost+found

Newsletter
Get the Newsletter
Get the Newsletter

  Regular expressions are endemic to UNIX. It is possible to use and administer UNIX without knowing what they are, but you aren't using the full power of UNIX if you don't understand how to use them.

Daemons And Dragons Even experienced UNIX users can benefit from a regular-expression review. There are some aspects of the expressions, as implemented under UNIX, which are used infrequently and rarely discussed. This month we will review the basics and discuss some of the more-powerful and less-used notations.

Regular expressions date back to the 1950s, when a mathematician named Stephen Kleene came up with a notation to describe a mathematical construct called finite automata. Computer scientists saw the usefulness of this notation for describing lexical analyzers in compilers. The UNIX utility lex relies heavily on regular expressions for its input language, and the expressions are used by grep, ed, sed, vi, Perl, and many other utilities. These expressions were considered so useful that UNIX developers included a set of functions in the standard C library to support them (regexp).

A regular expression describes a set of possible input strings. Computers use regular expressions to answer a basic question: is a given string a member of the set described by the expression? Where other operating systems can only search files for specific strings, UNIX has mechanisms to search files for strings that match a regular expression.

In order to make the use of regular expressions practical, the traditional notation was expanded without sacrificing the basic idea of a regular expression. Although the UNIX C library contains a regular-expression compiler and scanner (regexp), it doesn't implement everything you might want. Some commonly used programs reimplemented regular-expression matching and, as you might expect, not everyone uses the same notation. As a consequence, the exact notation for an expression can vary between commands and applications. Fortunately the differences are minor. For convenience, we will limit our discussion to the expressions accepted by the command egrep.

A list of possible characters can be specified in brackets, [0123456789] for example. This is just alternation in a more compact form (see "Traditional Expressions"). The brackets can also be used to specify ranges of characters, allowing us to write the previous expression as [0-9]. This is assumed to be a range over the ASCII encoding sequence and will match exactly one character in that range. Similarly, a range including upper and lower case letters is represented with [A-Za-z]. We can write an expression to match an integer this way: [0-9][0-9]*, which is far more compact than the traditional notation. Another symbol exists to make this easier: the + is a closure operator similar to * but requires at least one occurance of the preceding expression. So our integer expression can be abbreviated as [0-9]+.

Another addition to the notation is the dot (.), which matches one character regardless of what the character actually is (well, almost: we will discuss the exception later). A very common expression is .*, which matches zero or more of anything. For example, you can write an expression to match a string that starts and ends with a digit, but can have anything in between: [0-9].*[0-9].

In traditional usage, a regular expression has to match the entire string. In other words, if you think of a regular expression as describing a set of possible strings, then the entire string you are checking must be a member of that set. Practical implementations relax this a bit to make expressions easier to use for searching. Consider this example: you want to look for "Unix" or "unix" in a file. In a strict sense, the expression
[Uu]nix should only match if the string is the entire content of a file. But this is unreasonable, so anything in UNIX that uses regular expressions will search for their presence at any point within a file or text block. It is as if we wrote the expression as .*[Uu]nix.*.

Most people tend to think of searching as a line-oriented operation. Even though pure regular expressions don't care about line boundaries, conventions have been adopted to make it easy to perform per-line searches. The prime example of this is the dot. If the dot really matched any character, it would match the line separator, and the expression potentially could match the entire file, which is probably not what we want. In reality, the dot will match any character except the line separator (newline for UNIX). So the expression .* can only match an entire line and not the whole file.

Our augmented regular-expression syntax also lets us use "anchors." These operators don't match characters but are used to indicate position within a line. The carat (^) is used at the beginning of an expression to anchor it at the beginning of a line. The dollar sign anchors the expression at the end of a line. If used in the middle of an expression, these characters (^ and $) don't have any special meaning, but just indicate a match with the actual characters ^ and $. Thus, we can search for a number starting at the beginning of a line with this expression: ^[0-9]+. To avoid matching strings of the form we can put a space at the end of the string ^[0-9]+. If we are looking for lines that only contain numbers, we would use ^[0-9]+$. In this case, we have anchored the string at both ends, requiring that it match an entire line.

One of the disadvantages of regular expressions is the use of ASCII characters for operators. How do you search for an asterisk? Enter the venerable backslash. Any character with a special meaning (*, ^, $, [, +) can be preceded with the backslash to escape the meaning and indicate you want to exactly match the character -- so the string \* matches an asterisk, and \++ matches one or more plus signs.

At this point we should note that many of the characters discussed here also have special meaning to the shell. In order to make sure the characters you type are processed by the command (such as egrep) and not the shell, you should always surround your regular expressions with single quotes when entering them on a command line or in a shell file. This ensures that the shell itself leaves them alone. Here's a pop quiz. What will the command egrep '\**' do?

Advanced Expressions

Now that we've gone over the basics, let's look at some of the fancier notations. A set of characters specified with square brackets can be inverted by using a carat as the first character inside the brackets. For example, to match a character that is not a letter, use [^A-Za-z]. The carat only has this special meaning when it appears immediately after an open bracket. If you want to match any of the characters ^&*$@ you could use the expression [@$^&*]. This specifies what you want, where [^&*@#] means something totally different. Note that an asterisk has no special meaning inside the brackets. One interesting expression is [^ -~], which matches any character that is not printable ASCII. This won't match newlines (as searches are always done per-line) but will match a tab.

Some UNIX regular-expression notation allows for the specification of a repeat count. This indicates that a given expression should appear a specified number of times. The repeat number is given after the expression inside curly braces, but the braces must be escaped with the backslash (contrary to every other known use of the backslash). The expression 'X\{5\}' will match five Xs. If a line contains eight Xs, this expression will match the first five and consider the comparison a success. If you want to look for five Xs followed by something else, you have to use 'X\{5\}[^X]', which won't work when the Xs are placed at the end of the line. This notation can also be used to specify a range of repeats. An expression to match five to nine Xs followed by five to nine Ys is: X\{5,9\}Y\{5,9\}. This notation doesn't work with egrep, but does with anything that uses regexp (such as grep, ed, sed, and vi).

The strangest expression notation is also one that violates the notion of a "regular expression." This is an expression that checks for part of the string that was already matched. Once again, we resort to the use of backslashes to introduce operators. The grouping construct formed with the pair \( and \) indicates parts of an expression that can be rematched later. A backslash followed by a number will match one of these expressions: \1 for the first parenthesized expression, \2 for the second, and so on. The following expression matches any line with a repeated string: \(..* \)\1. The part inside parentheses, ..* matches anything followed by a space, and the notation \1 matches the first parenthesized expression. This is powerful and can be incredibly useful, but is only supported in regexp.

Oddities

As already mentioned, egrep does not accept the backslashed operators of sed, ed, and vi. These operators are part of the regexp package in the C library, so any application that uses it will understand that notation. Oddly enough, the regexp notation does not have a general grouping operator. Regular parentheses, as in (abc) do not work, and the closure operator will not work after \). With egrep it is possible to use an expression such as abc(abc)*, but regexp will not interpret that correctly, nor will it understand abc\(abc\)*. There is also no support for general alternation in regexp. Where egrep permits (word)|(phrase), there is no way to write an equivalent expression in the regexp notation. In general, egrep notation is powerful enough to write any regular expression, but regexp notation has limitations while allowing constructs that are not considered "regular."

Challenges

Pop quiz answer: the expression in the command egrep '\**' file will match any line that contains zero or more occurrences of an asterisk. But wouldn't that be every line in the file? In fact, the command will end up writing every line in "file to the standard output. Here are more mind benders for you. See how many you can get right. I'll post the answers on my Web site at http://www.groupsys.com and will include them in next month's column.

Write an egrep expression to match the following:

1. A line with exactly three space-separated fields.
2. A negative integer.
3. An invalid hostname (valid characters for hostnames are the numbers, letters of both cases, and the hyphen, but a hyphen cannot begin or end the name).
4. Any decimal number (positive or negative) surrounded by spaces.
5. An integer number followed by that number of letters, for example: '6abcdef'.

References

  • Friedl, J. Mastering Regular Expressions, Sebastapol, CA: O'Reilly and Associates Inc., 1997.

  • Johnson, W. L. "Automatic generation of efficient lexical processors using finite state techniques," Communications of the ACM, 11:12, 1968.

  • Lesk, M. E. Lex -- lexical analyzer generator, Computer Science Technical Report 39, AT&T Bell Laboratories, Murray Hill, NJ, 1975.

William LeFebvre has been banging on UNIX systems for 15 years and has been studying Internet technology for almost as long. He is the editor for the SAGE series "Short Topics in System Administration," and he currently operates Group Sys Consulting (Alpharetta, GA). Reach him at wnl@groupsys.com or via the Web at http://www.groupsys.com.


Traditional Expressions

The original notation for regular expressions included only four operations: alternation, concatenation, grouping, and closure. A conventional character would match itself. The simple expression a would only match the letter a. To match an a followed by a b would require the expression ab -- that's concatenation. To match either a or b one would use alternation: a|b. Grouping is denoted by parentheses, so the expression a|b and (a|b) match the same thing.

Finally, closure is represented with * and indicates "zero or more." The expression ab* matches a followed by any number of b characters. Specifically, it would match any of the following strings: a, ab, abb, abbb, and so on, but would not match ac. Closure always operates on the immediately previous expression. Combining closure with grouping lets you write expressions like (a|b)* which would match any string that is a mixture of the characters a and b.

This is all incredibly useful if your files only contain a and b. As it stands, this syntax is cumbersome for real-world applications. Consider the expression that would be needed to match an integer number: (0|1|2|3|4|5|6|7|8|9)*. Even this isn't entirely correct, as it also matches an empty string. To make sure we only matched one or more digits, we would have to repeat that list of digits, making the expression almost twice as long.-W.L.


   
Home | Top


Copyright © 2001 UnixReview.com, UnixReview.com's Privacy Policy
Comments about the website: rreames@cmp.com
SDMG Web Sites: C/C++ Users Journal, Dr. Dobb's Journal, MSDN Magazine, Sys Admin, SD Expo, SD Magazine, UnixReview.com, Windows Developer's Journal, TPJ, BYTE.com