A regular expression is a specially formatted pattern that can be used to find instances of one string in another. Several programming languages including Visual Basic, Perl, JavaScript and PHP support regular expressions, and hopefully by the end of this primer, you’ll be implementing some basic regular expression functionality into your PHP pages.
Regular expressions are one of those quirky features that pop up in a variety of programming languages, but because they seem to be a difficult concept to grasp, many developers push them away into the corner, forgetting that they even exist.
Let's start by taking a look at what regular expressions are, and why you'd want to use them in your PHP pages.
What is a Regular Expression?
What do you think it is that separates programs like BBEdit and Notepad from the good old console-based text editors? Both support text input and let you save that text into a file; however, modern text editors also support other functionality, including find & replace tools, which make editing a text file so much easier.
Regular expressions are similar to this functionality, only better. Think of a regular expression as an extremely advanced find & replace tool that saves us the pain of having to write custom data validation routines to check e-mail addresses, make sure phone numbers are in the correct format, etc.
One of the most common functions of any program is data validation, and PHP comes bundled with several text validation functions that allow us to match a string using regular expressions, making sure there's a space in a particular spot, a question mark in another spot, etc.
What you may not know, however, is that regular expressions are simple to implement. A regular expression is a specially formatted string that tells the regular expression engine which portion of a string we want to match, and once you've mastered a few of them you'll be asking yourself why you left regular expressions in the corner for so long.
PHP has two sets of functions for dealing with the two types of regular expression patterns: Perl 5 compatible patterns, and Posix standard compatible patterns.
In this article we will be looking at the ereg function and working with search expressions that conform to the Posix standard. Although they don't offer as much power as Perl 5 patterns, they're a great way to start learning regular expressions. If you're interested in PHP's support for Perl 5 compatible regular expressions, then see the PHP.net site for details on the preg set of PHP functions.
PHP has six functions that work with regular expressions. They all take a regular expression string as their first argument, and are shown below:
ereg: The most common regular expression function, ereg allows us to search a string for matches of a regular expression.
ereg_replace: Allows us to search a string for a regular expression and replace any occurrence of that expression with a new string.
eregi: Performs exactly the same matching as ereg, but is case insensitive.
eregi_replace: Performs exactly the same search-replace functionality as ereg_replace, but is case insensitive.
split: Allows us to split a string into an array of substrings, using a regular expression as the delimiter.
spliti: Case insensitive version of the split function.
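The ereg family was removed in PHP 7.0, so on a current PHP install the same three operations (search, replace, split) are usually written with the preg functions mentioned above. Here's a rough sketch of the counterparts, assuming "/" as the pattern delimiter and the "i" flag in place of the case-insensitive variants; the sample strings are made up for illustration:

```php
<?php
// Search: preg_match returns 1 on a match and 0 on no match.
$found = preg_match('/world/', 'hello world');

// Case-insensitive search, the rough counterpart of eregi.
$foundAnyCase = preg_match('/WORLD/i', 'hello world');

// Replace every run of digits with '#', the rough counterpart of ereg_replace.
$masked = preg_replace('/[0-9]+/', '#', 'room 101, floor 7');

// Split on commas with optional white space around them, the rough counterpart of split.
$parts = preg_split('/[[:space:]]*,[[:space:]]*/', 'red , green,blue');

var_dump($found, $foundAnyCase, $masked, $parts);
```

Note that the pattern is the first argument in every call, just as it is with the ereg functions.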
Why Use Regular Expressions?
If you're constantly creating functions to validate or manipulate portions of a string, then you might be able to scrap all of these functions and use regular expressions instead. If you answer yes to any of the questions shown below, then you should definitely consider using regular expressions:
Are you writing custom functions to make sure form data contains valid information (such as an @ and a dot in an e-mail address)?
Do you write custom functions to loop through each character in a string and replace it if it matches certain criteria (such as if it's upper case, or if it's a space)?
Besides being unfavored methods for string validation and manipulation, the two points shown above can also slow your program down if coded inefficiently. Would you rather use this code to validate an e-mail address:
<?php
function validateEmail($email)
{
    // strpos returns false when the character is missing and 0 when it is
    // the first character, so compare against false explicitly
    $hasAtSymbol = strpos($email, "@") !== false;
    $hasDot = strpos($email, ".") !== false;
    return $hasAtSymbol && $hasDot;
}
echo validateEmail("mitchell@interspire.com");
?>
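A one-line regular expression version of the same check could look like the sketch below. The function name validateEmailRegex and the pattern itself are assumptions for illustration only (the pattern is far looser than real e-mail address rules), and preg_match stands in for ereg, which is unavailable on PHP 7 and later:

```php
<?php
// Hypothetical one-line counterpart of validateEmail(): some non-space,
// non-@ characters, an @, more such characters, a dot, then letters.
function validateEmailRegex($email)
{
    return preg_match('/^[^@[:space:]]+@[^@[:space:]]+\.[a-zA-Z]+$/', $email) === 1;
}

var_dump(validateEmailRegex("mitchell@interspire.com"));
var_dump(validateEmailRegex("mitchell.interspire.com"));
```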
Sure, the first function looks easier and seems well structured, but wouldn't it be easier if we could validate an e-mail address using the one-lined version of the validateEmail function shown above?
The second function shown above uses regular expressions only, and contains one call to the ereg function. Called with two arguments, as it is here, ereg returns 1 when its string argument matches the regular expression and false when it does not, so it can be used directly as a true or false test.
Many programmers steer clear of regular expressions because they are (in some circumstances) slower than other text manipulation methods. Regular expressions may be slower because they involve the copying and pasting of strings in memory as each new part of a regular expression is matched against a string. However, from my experience with regular expressions, the performance hit isn't noticeable unless you're running a complex regular expression against several hundred lines of text, which is rarely the case when used as an input data validation tool.
The Regular Expression Syntax
Before you can match a string to a regular expression, you have to create the regular expression. The syntax of regular expressions is a little quirky at first, and each phrase in the expression represents some sort of search criteria. Here's a list of some of the most common regular expressions, as well as an example of how to use each one:
Beginning of string:
To search from the beginning of a string, use ^. For example,
<?php echo ereg("^hello", "hello world!"); ?>
would return true, however
<?php echo ereg("^hello", "i say hello world"); ?>
would return false, because hello wasn't at the beginning of the string.
End of string:
To search at the end of a string, use $. For example,
<?php echo ereg("bye$", "goodbye"); ?>
would return true, however
<?php echo ereg("bye$", "goodbye my friend"); ?>
would return false, because bye wasn't at the very end of the string.
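If you're on PHP 7 or later, where ereg no longer exists, the two anchors can be tried out with preg_match instead. This is only a sketch: the variable names are mine, and preg_match reports 1 for a match and 0 for no match rather than true and false:

```php
<?php
// ^ anchors the pattern to the beginning of the string.
$startsWithHello = preg_match('/^hello/', 'hello world!');
$helloInMiddle   = preg_match('/^hello/', 'i say hello world');

// $ anchors the pattern to the end of the string.
$endsWithBye = preg_match('/bye$/', 'goodbye');
$byeInMiddle = preg_match('/bye$/', 'goodbye my friend');

// Using both anchors means the whole string must match the pattern.
$exactlyHello = preg_match('/^hello$/', 'hello world!');

var_dump($startsWithHello, $helloInMiddle, $endsWithBye, $byeInMiddle, $exactlyHello);
```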
Any single character:
To search for any character, use the dot. For example,
<?php echo ereg(".", "cat"); ?>
would return true, however
<?php echo ereg(".", ""); ?>
would return false, because our search string contains no characters. You can optionally tell the regular expression engine how many single characters it should match using curly braces. If I wanted to match five characters at the end of a string, then I would use ereg like this:
<?php echo ereg(".{5}$", "12345"); ?>
The code above tells the regular expression engine to return true if and only if the string ends with five successive characters, which means that any string of five or more characters will match. To match exactly five characters and no more, anchor the beginning of the string as well: ^.{5}$. We can also limit the number of characters that can appear in successive order:
<?php echo ereg("a{1,3}$", "aaa"); ?>
In the example above, we have told the regular expression engine that in order for our search string to match the expression, it should have between one and three 'a' characters at the end.
<?php echo ereg("a{1,3}$", "aaab"); ?>
The example above wouldn't return true, because there are three 'a' characters in the search string, however they are not at the end of the string. If we took the end-of-string match $ out of the regular expression, then the string would match.
We can also tell the regular expression engine to match at least a certain amount of characters in a row, and more if they exist. We can do so like this:
<?php echo ereg("a{3,}$", "aaaa"); ?>
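To see the curly brace forms side by side, here's a small sketch using preg_match (an assumption on my part, since ereg is gone from PHP 7 and later; the brace syntax itself is the same). Each result is 1 for a match and 0 for no match, and the array keys are just labels:

```php
<?php
$results = [
    // Ends with five characters, so any longer string matches too.
    'five at end'        => preg_match('/.{5}$/', '123456'),
    // Anchored at both ends: exactly five characters only.
    'exactly five'       => preg_match('/^.{5}$/', '123456'),
    // Between one and three 'a' characters at the end.
    'one to three at end' => preg_match('/a{1,3}$/', 'aaa'),
    // The 'a' characters are not at the end here.
    'followed by b'      => preg_match('/a{1,3}$/', 'aaab'),
    // Without the $ the same string matches.
    'no end anchor'      => preg_match('/a{1,3}/', 'aaab'),
    // At least three 'a' characters at the end.
    'three or more'      => preg_match('/a{3,}$/', 'aaaa'),
    'only two'           => preg_match('/a{3,}$/', 'aa'),
];

print_r($results);
```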
Repeat character zero or more times
To tell the regular expression engine that a character may exist, and can be repeated, we use the * character. Here are two examples that would return true:
Even though the second example doesn't contain the 't' character, it still returns true because the * indicates that the character may appear, and that it doesn't have to. In fact, any string at all would cause the second call to ereg above to return true, because the 't' character is optional.
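Here's a sketch of the * behaviour using preg_match, which I'm using in place of ereg so that it runs on PHP 7 and later. The sample strings are my own, not necessarily the ones the text above refers to. The last three lines show that * becomes more useful once the pattern is anchored:

```php
<?php
// 't*' means zero or more 't' characters, so both strings match.
$withT    = preg_match('/t*/', 'tom');
$withoutT = preg_match('/t*/', 'bob');

// Anchored: "bo" followed by zero or more 't' characters and nothing else.
$noT      = preg_match('/^bot*$/', 'bo');
$twoT     = preg_match('/^bot*$/', 'bott');
$otherEnd = preg_match('/^bot*$/', 'boat');

var_dump($withT, $withoutT, $noT, $twoT, $otherEnd);
```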
Repeat character one or more times
To tell the regular expression engine that a character must exist and that it can be repeated more than once, we use the + character, like this:
<?php echo ereg("z+", "i like the zoo"); ?>
The following example would also return true:
<?php echo ereg("z+", "i like the zzzzzzoo!"); ?>
Repeat character zero or one times
We can also tell the regular expression engine that a character must either exist just once, or not at all. We use the ? character to do so, like this:
<?php echo ereg("c?", "cats are fuzzy"); ?>
If we wanted to, we could even entirely remove the 'c' from the search string shown above, and this expression would still return true. The '?' means that a 'c' may appear anywhere in the search string, but doesn't have to.
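The difference between + and ? is easiest to see next to each other. The sketch below uses preg_match as a stand-in for ereg (which PHP 7 and later no longer ship); the "colou?r" pattern is my own addition, included to show ? doing useful work inside an anchored pattern:

```php
<?php
// + requires at least one 'z'.
$oneZ  = preg_match('/z+/', 'i like the zoo');
$noZ   = preg_match('/z+/', 'i like the park');

// ? allows zero or one 'c', so both strings match.
$withC    = preg_match('/c?/', 'cats are fuzzy');
$withoutC = preg_match('/c?/', 'ats are fuzzy');

// Anchored: the 'u' may appear once or not at all, but not twice.
$color   = preg_match('/^colou?r$/', 'color');
$colour  = preg_match('/^colou?r$/', 'colour');
$colouur = preg_match('/^colou?r$/', 'colouur');

var_dump($oneZ, $noZ, $withC, $withoutC, $color, $colour, $colouur);
```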
The space character
To match the space character in a search string, we use the predefined Posix class, [[:space:]]. The outer square brackets mark a set of characters to match, and ":space:" names the class inside that set (which, in this case, is any white space character). White space includes the tab character, the new line character, and the space character. Alternatively, you could use one space character (" ") if the search string must contain just one space and not a tab or new line character. In most circumstances I prefer to use [[:space:]] because it signifies my intentions a bit better than a single space character, which can easily be overlooked. There are several Posix-standard predefined classes that we can match as part of a regular expression, including [:alnum:], [:digit:], [:lower:], etc. The Posix standard defines the complete list.
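As a quick illustration of the difference between [[:space:]] and a literal space, here's a sketch using preg_match, which accepts the same Posix class names inside square brackets and runs on PHP versions that no longer include ereg. The sample strings and variable names are assumptions of mine:

```php
<?php
// [[:space:]] accepts a space or a tab between the two words.
$withSpace = preg_match('/^[a-z]+[[:space:]][a-z]+$/', "john doe");
$withTab   = preg_match('/^[a-z]+[[:space:]][a-z]+$/', "john\tdoe");

// A literal space in the pattern does not accept a tab.
$literalSpaceVsTab = preg_match('/^[a-z]+ [a-z]+$/', "john\tdoe");

// Other predefined classes work the same way.
$digitsOnly = preg_match('/^[[:digit:]]+$/', '2005');
$alnumOnly  = preg_match('/^[[:alnum:]]+$/', 'abc 123');

var_dump($withSpace, $withTab, $literalSpaceVsTab, $digitsOnly, $alnumOnly);
```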
Grouping patterns
Related patterns can be grouped together between square brackets. It's really easy to specify that a lower case only or upper case only sequence of characters should exist as part of the search string using [a-z] and [A-Z], like this:
<?php
// Require all lower case characters from first to last
echo ereg("^[a-z]+$", "johndoe"); // Returns true
?>
or like this:
<?php
// Require all upper case characters from first to last
echo ereg("^[A-Z]+$", "JOHNDOE"); // Returns true
?>
We can also tell the regular expression engine that we expect either lower case or upper case characters. We do this by joining the [a-z] and [A-Z] patterns:
<?php echo ereg("^[a-zA-Z]+$", "JohnDoe"); ?>
In the example above, it would make sense if we could match "John Doe," and not "JohnDoe." We can use the following regular expression to do so:
^[a-zA-Z]+[[:space:]]{1}[a-zA-Z]+$
It's just as easy to search for a numerical string of characters:
<?php echo ereg("^[0-9]+$", "12345"); ?>
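Here's a sketch that puts the "John Doe" pattern and the numeric pattern to work. I'm using preg_match rather than ereg so that the code runs on PHP 7 and later; the patterns are the ones shown above, and the variable names are made up:

```php
<?php
$namePattern = '/^[a-zA-Z]+[[:space:]]{1}[a-zA-Z]+$/';

// Two runs of letters separated by exactly one white space character.
$twoWords = preg_match($namePattern, 'John Doe');
$oneWord  = preg_match($namePattern, 'JohnDoe');

// Digits only, from first character to last.
$allDigits   = preg_match('/^[0-9]+$/', '12345');
$mixedDigits = preg_match('/^[0-9]+$/', '123a5');

var_dump($twoWords, $oneWord, $allDigits, $mixedDigits);
```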
Grouping terms
It's not only search patterns that can be grouped together. We can also group related search terms together using parentheses:
In the example above, we have a beginning of string character, followed by "John" or "Jane", at least one other character, and then the end of string character. So ...
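Going by that description, a pattern along the lines of ^(John|Jane).+$ would fit. The sketch below assumes that pattern (it is my reading of the description, not a confirmed original) and uses preg_match in place of ereg so that it runs on PHP 7 and later. The optional third argument collects what each group matched:

```php
<?php
$pattern = '/^(John|Jane).+$/';

$johnDoe  = preg_match($pattern, 'John Doe');
// "Jane" alone fails because .+ needs at least one more character.
$janeOnly = preg_match($pattern, 'Jane');
$jackDoe  = preg_match($pattern, 'Jack Doe');

// $groups[0] is the whole match, $groups[1] is what the parentheses matched.
preg_match($pattern, 'Jane Doe', $groups);

var_dump($johnDoe, $janeOnly, $jackDoe, $groups);
```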
Special character circumstances
Because several characters are used to actually specify the grouping or syntax of a search pattern, such as the parentheses in (John|Jane), we need a way to tell the regular expression engine to ignore these characters and to process them as if they were part of the string being searched and not part of the search expression. The method we use to do this is called "character escaping" and involves prepending any "special symbols" with a backslash. So, for example, if I wanted to include the or symbol '|' in my search, then I could do so like this:
There are only a handful of symbols that you have to escape. You must escape ^, $, (, ), ., [, |, *, ?, +, \ and {.
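A short sketch of escaping in practice follows. It uses preg_match instead of ereg (so it runs on PHP 7 and later), and it also shows preg_quote, a helper from the preg family that adds the backslashes for you; the sample strings are assumptions for illustration:

```php
<?php
// Unescaped, | means "or": the pattern matches "a" or "b".
$orMatchesA    = preg_match('/^(a|b)$/', 'a');
$orMatchesPipe = preg_match('/^(a|b)$/', 'a|b');

// Escaped, \| matches a literal pipe character.
$literalPipe  = preg_match('/^a\|b$/', 'a|b');
$literalOnlyA = preg_match('/^a\|b$/', 'a');

// preg_quote escapes every special symbol in a string.
$quoted       = preg_quote('(John|Jane)', '/');
$literalGroup = preg_match('/^' . $quoted . '$/', '(John|Jane)');

var_dump($orMatchesA, $orMatchesPipe, $literalPipe, $literalOnlyA, $quoted, $literalGroup);
```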
Hopefully you've now gotten a bit of a feel for just how powerful regular expressions actually are. Let's now take a look at two examples of using regular expressions to validate a string of data.
Regular Expression Examples
Example One
Let's keep the first example fairly simple, and validate a standard URL. A standard URL (with no port number) consists of three parts, namely the protocol, the "://" separator and the domain name:
[protocol]://[domain name]
Let's start by matching the protocol part of the URL. Let's make it so that only http or ftp can be used. We would use the following regular expression to do so:
^(http|ftp)
The ^ character specifies the beginning of the string, and by enclosing http and ftp in parentheses and separating them with the or character (|), we are telling the regular expression engine that either the characters http or ftp must be at the beginning of the string.
A domain name usually consists of www.somesite.com, but can optionally be specified without the www part. To keep our example simple, we will only allow .com, .net, and .org domain names to be considered valid. We'd represent the search term for the domain name part of our regular expression like this:
(www\.)?.+\.(com|net|org)$
Putting everything together, our regular expression could be used to validate a domain name, like this:
<?php
function isValidDomain($domainName)
{
    return ereg("^(http|ftp)://(www\.)?.+\.(com|net|org)$", $domainName);
}
?>
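On PHP 7 and later, where ereg is no longer available, the same check can be sketched with preg_match. The function name isValidDomainPcre is hypothetical, and I've picked # as the pattern delimiter so that the slashes in "://" don't need escaping:

```php
<?php
// Hypothetical preg_match counterpart of isValidDomain().
function isValidDomainPcre($domainName)
{
    return preg_match('#^(http|ftp)://(www\.)?.+\.(com|net|org)$#', $domainName) === 1;
}

var_dump(isValidDomainPcre("http://www.interspire.com"));    // http, www, .com
var_dump(isValidDomainPcre("ftp://example.org"));            // ftp, no www, .org
var_dump(isValidDomainPcre("https://www.interspire.com"));   // https is not allowed
var_dump(isValidDomainPcre("http://www.interspire.com.au")); // .au is not allowed
```

Because the pattern only allows http or ftp followed directly by "://", an https address is rejected, as is any domain that doesn't end in .com, .net or .org.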
Example Two
Because I reside in Sydney Australia, let's validate a typical international Australian phone number. The format of an international Australian phone number looks like this:
+61x xxxx-xxxx
The first 'x' is the area code, and the rest is the phone number. To validate that the start of the phone number is '+61' and that it is followed by an area code between two and nine, we use the following regular expression:
^\+61[2-9][[:space:]]
Notice in the search pattern above that the '\' escapes the '+' symbol so that it is included in the search and not interpreted as a regular expression operator. The [2-9] tells the regular expression engine that we require just one digit between two and nine inclusively. The [[:space:]] class tells the regular expression engine to expect one white space character in this spot.
Here's the search pattern for the rest of the phone number:
[0-9]{4}-[0-9]{4}$
Nothing out of the ordinary here, we're just telling the regular expression engine that for the phone number to be valid, it must have a group of four digits, followed by a hyphen, followed by another group of four digits and then an end of string character.
Putting the entire regular expression together into a function, we can use the code to validate some international Australian phone numbers:
<?php
function isValidPhone($phoneNum)
{
// Return the result so that the calls below have something to echo
return ereg("^\+61[2-9][[:space:]][0-9]{4}-[0-9]{4}$", $phoneNum);
}
// true
echo isValidPhone("+619 0000-0000");
// false
echo isValidPhone("+61 00000000");
// false
echo isValidPhone("+611 00000000");
?>
Conclusion
Regular expressions take a lot of the hassle out of writing lines and lines of repetitive code to validate a string. Over the last few pages we've covered all of the basics of Posix standard regular expression patterns including symbols, grouping and the PHP ereg function.
We've also seen how to use the regular expression engine to validate some simple strings in PHP.
About the Author
Mitchell is the Interspire development manager, content editor and a resident PHP guru. He focuses on making sure the development of Interspire's products occurs smoothly