Regular expressions are endemic to UNIX. It is possible to use and
administer UNIX without knowing what they are, but you aren't using
the full power of UNIX if you don't understand how to use them.

Regular expressions date back to the 1950s, when a mathematician
named Stephen Kleene came up with a notation to describe a mathematical
construct called finite automata. Computer scientists saw the usefulness
of this notation for describing lexical analyzers in compilers. The
UNIX utility lex relies heavily on regular expressions for its input
language, and the expressions are used by grep, ed, sed, vi, Perl,
and many other utilities. These expressions were considered so useful
that UNIX developers included a set of functions in the standard
C library to support them (regexp).

A regular expression describes a set of possible input strings.
Computers use regular expressions to answer a basic question: is
a given string a member of the set described by the expression?
Where other operating systems can only search files for specific
strings, UNIX has mechanisms to search files for strings that match
a regular expression. In order to
make the use of regular expressions practical, the traditional notation
was expanded without sacrificing the basic idea of a regular expression.
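
The two questions raised above (is a string a member of the set, and which lines of a file contain a match) can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's re module rather than any of the UNIX commands named in the article, and the pattern and the helper names is_member and matching_lines are hypothetical choices made for the example.

```python
import re

# Illustrative pattern (an assumption, not one of the article's own
# examples): the word "unix" with either an upper- or lowercase "U".
PATTERN = re.compile(r"[Uu]nix")


def is_member(candidate):
    """Strict question: is the whole string in the set the pattern describes?"""
    return PATTERN.fullmatch(candidate) is not None


def matching_lines(text):
    """Search question: which lines contain a match anywhere within them?"""
    return [line for line in text.splitlines() if PATTERN.search(line)]


if __name__ == "__main__":
    sample = "Unix is old\nnothing to see\nI like unix"
    print(is_member("unix"))
    print(is_member("I like unix"))
    print(matching_lines(sample))
```

The strict test rejects "I like unix" because the whole string is not in the set, while the line search accepts it because a match occurs somewhere in the line.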
Although the UNIX C library contains a regular-expression compiler
and scanner (regexp), it doesn't implement everything you might
want. Some commonly used programs reimplemented regular-expression
matching and, as you might expect, not everyone uses the same notation.
As a consequence, the exact notation for an expression can vary
between commands and applications. Fortunately the differences are
minor. For convenience, we will limit our discussion to the expressions
accepted by the command A list of
possible characters can be specified in brackets, Another addition
to the notation is the dot In traditional
usage, a regular expression has to match the entire string. In other
words, if you think of a regular expression as describing a set
of possible strings, then the entire string you are checking must
be a member of that set. Practical implementations relax this a
bit to make expressions easier to use for searching. Consider this
example: you want to look for "Unix" or "unix" in a file. In a strict
sense, the expression Most people
tend to think of searching as a line-oriented operation. Even though
pure regular expressions don't care about line boundaries, conventions
have been adopted to make it easy to perform per-line searches.
The prime example of this is the dot. If the dot really matched
any character, it would match the line separator, and the expression
potentially could match the entire file, which is probably not what
we want. In reality, the dot will match any character except the
line separator (newline for UNIX). So the expression Our augmented
regular-expression syntax also lets us use "anchors." These operators
don't match characters but are used to indicate position within
a line. The caret One of the
disadvantages of regular expressions is the use of ASCII characters
for operators. How do you search for an asterisk? Enter the venerable
backslash. Any character with a special meaning ( At this point
we should note that many of the characters discussed here also have
special meaning to the shell. In order to make sure the characters
you type are processed by the command (such as egrep) and not the
shell, you should always surround your regular expressions with
single quotes when entering them on a command line or in a shell
file. This ensures that the shell itself leaves them alone. Here's
a pop quiz. What will the command
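
As a rough illustration of the basics covered so far (the dot, the line anchors, and the backslash escape), here is a small sketch. It assumes Python's re module, whose notation for these three features happens to agree with the description above; the helper names and sample strings are hypothetical and are not the article's own examples. Shell quoting does not arise here because the patterns never pass through a shell.

```python
import re


def dot_matches(text):
    """Return every run of the form: a, any one character, b."""
    # The dot matches any single character except the newline.
    return re.findall(r"a.b", text)


def lines_starting_with(word, lines):
    """Keep the lines that begin with the literal word (caret anchor)."""
    return [ln for ln in lines if re.search("^" + re.escape(word), ln)]


def lines_ending_with(word, lines):
    """Keep the lines that end with the literal word (dollar anchor)."""
    return [ln for ln in lines if re.search(re.escape(word) + "$", ln)]


def count_asterisks(text):
    """Count literal asterisks."""
    # The backslash removes the asterisk's special meaning.
    return len(re.findall(r"\*", text))


if __name__ == "__main__":
    lines = ["here at the start", "ends with here", "neither one"]
    print(dot_matches("a*b a\nb axb"))
    print(lines_starting_with("here", lines))
    print(lines_ending_with("here", lines))
    print(count_asterisks("2 * 3 * 4"))
```

Note that the "a", newline, "b" sequence in the first sample is skipped, since the dot stops at the line separator.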

Advanced Expressions

Now that we've
gone over the basics, let's look at some of the fancier notations.
A set of characters specified with square brackets can be inverted
by using a caret as the first character inside the brackets. For
example, to match a character that is not a letter, use Some UNIX
regular-expression notation allows for the specification of a repeat
count. This indicates that a given expression should appear a specified
number of times. The repeat number is given after the expression
inside curly braces, but the braces must be escaped with the backslash
(contrary to every other known use of the backslash). The expression
The strangest
expression notation is also one that violates the notion of a "regular
expression." This is an expression that checks for part of the string
that was already matched. Once again, we resort to the use of backslashes
to introduce operators. The grouping construct formed with the pair
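
The three notations in this section (inverted sets, repeat counts, and references back to an earlier match) can be tried out with the following sketch. It is written for Python's re module, which is an assumption of this example: Python writes the repeat braces and the grouping parentheses without backslashes, unlike the backslashed forms described above. The helper names and test strings are hypothetical.

```python
import re


def first_non_letter(text):
    """Return the first character that is not a letter, or None."""
    # A caret as the first character inside the brackets inverts the set.
    m = re.search(r"[^a-zA-Z]", text)
    return m.group() if m else None


def is_three_digits(text):
    """True if the whole string is exactly three digits."""
    # {3} is a repeat count: the preceding expression must appear three times.
    return re.fullmatch(r"[0-9]{3}", text) is not None


def doubled_word(line):
    """Return a lowercase word that appears twice in a row, or None."""
    # \1 refers back to whatever the first parenthesized group matched.
    m = re.search(r"\b([a-z]+) \1\b", line)
    return m.group(1) if m else None


if __name__ == "__main__":
    print(first_non_letter("abc7def"))
    print(is_three_digits("123"), is_three_digits("1234"))
    print(doubled_word("this is is a test"))
```

The word-boundary markers in doubled_word are a design choice for this sketch: without them, the "is" at the end of "this" followed by the word "is" would count as a repeat.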

Oddities

As already
mentioned, egrep does not accept the backslashed operators of sed,
ed, and vi. These operators are part of the regexp package in the
C library, so any application that uses it will understand that
notation. Oddly enough, the regexp notation does not have a general
grouping operator. Regular parentheses, as in

Challenges

Pop quiz answer: the expression in the command

Write an egrep expression to match the following:

1. A line with exactly three space-separated fields.
William LeFebvre has been banging on UNIX systems for 15 years
and has been studying Internet technology for almost as long. He
is the editor for the SAGE series "Short Topics in System Administration,"
and he currently operates Group Sys Consulting (Alpharetta, GA).
Reach him at wnl@groupsys.com
or via the Web at http://www.groupsys.com.
The original
notation for regular expressions included only four operations:
alternation, concatenation, grouping, and closure. A conventional
character would match itself. The simple expression a would only
match the letter Finally, closure
is represented with This is all
incredibly useful if your files only contain