Getting Comfortable with Regular Expressions - PHP and jQuery

Regular expressions are often perceived as intimidating, difficult tools. In fact, regexes have such a bad reputation among programmers that discussions about them are often peppered with this quote:

Some people, when confronted with a problem, think, “I know, I’ll use regular expressions.” Now they have two problems. (Jamie Zawinski)

This sentiment is not entirely unfounded because regular expressions come with a complex syntax and little margin for error. However, once you overcome the initial learning curve, regexes become an incredibly powerful tool with myriad applications in day-to-day programming.

Understanding Basic Regular Expression Syntax

In this book, you’ll learn Perl-Compatible Regular Expression (PCRE) syntax. PHP’s preg_ functions use PCRE directly, and JavaScript and most other programming languages support a largely compatible syntax.

Setting up a Test File

To learn how to use regexes, you’ll need a file to use for testing. In the public folder, create a new file called regex.php and place the following code inside it:

<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type"
content="text/html;charset=utf-8" />
<title>Regular Expression Demo</title>
<style type="text/css">
em {
background-color: #FF0;
border-top: 1px solid #000;
border-bottom: 1px solid #000;
}
</style>
</head>
<body>
<?php
/*
* Store the sample set of text to use for the examples of regex
*/
$string = <<<TEST_DATA
<h2>Regular Expression Testing</h2>
<p>
In this document, there is a lot of text that can be matched
using regex. The benefit of using a regular expression is much
more flexible &mdash; albeit complex &mdash; syntax for text
pattern matching.
</p>
<p>
After you get the hang of regular expressions, also called
regexes, they will become a powerful tool for pattern matching.
</p>
<hr />
TEST_DATA;
/*
* Start by simply outputting the data
*/
echo $string;
?>
</body>
</html>

Save the file and load it in your browser to view the sample script.

The sample file for testing regular expressions

Replacing Text with Regexes

To test regular expressions, you’ll wrap matched patterns with <em> tags, which are styled in the test document to have top and bottom borders, as well as a yellow background.

Accomplishing this with regexes works much like using str_replace() in PHP, except that you call the preg_replace() function instead. A pattern to match is passed, followed by a string (or pattern) to replace the matched pattern with. Finally, the string within which the search is to be performed is passed:

preg_replace($pattern, $replacement, $string);

The only difference between str_replace() and preg_replace() on a basic level is that the element passed to preg_replace() for the pattern must use delimiters, which let the function know which part of the regex is the pattern and which part consists of modifiers, or flags that affect how the pattern matches. You’ll learn more about modifiers a little later in this section.

The delimiters for regex patterns in preg_replace() can be any non-alphanumeric, non-backslash, and non-whitespace characters placed at the beginning and end of the pattern. Most commonly, forward slashes (/) or hash signs (#) are used. For instance, if you want to search for the letters cat in a string, the pattern would be /cat/ (or #cat#, %cat%, @cat@, and so on).

Choosing Regexes vs. Regular String Replacement

To explore the differences between str_replace() and preg_replace(), try using both functions to wrap any occurrence of the word regular with <em> tags. Make the following modifications to regex.php:

<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type"
content="text/html;charset=utf-8" />
<title>Regular Expression Demo</title>
<style type="text/css">
em {
background-color: #FF0;
border-top: 1px solid #000;
border-bottom: 1px solid #000;
}
</style>
</head>
<body>
<?php
/*
* Store the sample set of text to use for the examples of regex
*/
$string = <<<TEST_DATA
<h2>Regular Expression Testing</h2>
<p>
In this document, there is a lot of text that can be matched
using regex. The benefit of using a regular expression is much
more flexible &mdash; albeit complex &mdash; syntax for text
pattern matching.
</p>
<p>
After you get the hang of regular expressions, also called
regexes, they will become a powerful tool for pattern matching.
</p>
<hr />
TEST_DATA;
/*
* Use str_replace() to highlight any occurrence of the word
* "regular"
*/
echo str_replace("regular", "<em>regular</em>", $string);
/*
* Use preg_replace() to highlight any occurrence of the word
* "regular"
*/
echo preg_replace("/regular/", "<em>regular</em>", $string);
?>
</body>
</html>

Executing this script in your browser outputs the test information twice, with identical results.

The word regular highlighted with both regexes and regular string replacement
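The delimiter rule and the equivalence of the two functions for a literal search can be checked with a short script. This is only a sketch: the sample sentence and the variable names are made up for illustration and are not part of regex.php.

```php
<?php
// Hypothetical sample text, not the tutorial's $string.
$text = "A regular cat is a regular pet.";

// Plain string replacement: no delimiters involved.
$plain = str_replace("regular", "<em>regular</em>", $text);

// The same regex written with three different delimiter characters.
$patterns = ["/regular/", "#regular#", "%regular%"];

foreach ($patterns as $pattern) {
    $viaRegex = preg_replace($pattern, "<em>regular</em>", $text);
    // Report whether this delimiter choice matches str_replace()'s output.
    echo $pattern, " => ", var_export($viaRegex === $plain, true), "\n";
}
```

Each line should report true, since the delimiter only marks where the pattern starts and ends. A delimiter other than the forward slash is mostly a convenience when the pattern itself contains slashes, such as a URL or file path.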

Drilling Down on the Basics of Pattern Modifiers

You may have noticed that the capitalized word Regular in the heading is not highlighted. This is because the previous example is case sensitive.

To solve this problem with simple string replacement, you can opt to use the str_ireplace() function, which is nearly identical to str_replace(), except that it is case insensitive.

With regular expressions, you will still use preg_replace(), but you’ll need a modifier to signify case insensitivity. A modifier is a letter that follows the pattern delimiter, providing additional information to the regex about how it should handle patterns. For case insensitivity, the modifier i should be applied.
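As a small illustration of the i modifier, the sketch below counts matches with and without it. The sample sentence is made up, and preg_match_all() (a standard PHP function that returns the number of matches) is used here only for counting; the tutorial itself sticks to preg_replace().

```php
<?php
// Hypothetical sample sentence with one capitalized and one lowercase hit.
$text = "Regular habits make regular progress.";

// Without a modifier, only the lowercase occurrence matches.
$sensitive = preg_match_all("/regular/", $text);

// The i modifier, placed after the closing delimiter, ignores case.
$insensitive = preg_match_all("/regular/i", $text);

echo "case-sensitive: $sensitive, case-insensitive: $insensitive\n";
// prints "case-sensitive: 1, case-insensitive: 2"
```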

Modify regex.php to use case-insensitive replacement functions by making the following modifications:

<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type"
content="text/html;charset=utf-8" />
<title>Regular Expression Demo</title>
<style type="text/css">
em {
background-color: #FF0;
border-top: 1px solid #000;
border-bottom: 1px solid #000;
}
</style>
</head>
<body>
<?php
/*
* Store the sample set of text to use for the examples of regex
*/
$string = <<<TEST_DATA
<h2>Regular Expression Testing</h2>
<p>
In this document, there is a lot of text that can be matched
using regex. The benefit of using a regular expression is much
more flexible &mdash; albeit complex &mdash; syntax for text
pattern matching.
</p>
<p>
After you get the hang of regular expressions, also called
regexes, they will become a powerful tool for pattern matching.
</p>
<hr />
TEST_DATA;
/*
* Use str_ireplace() to highlight any occurrence of the word
* "regular"
*/
echo str_ireplace("regular", "<em>regular</em>", $string);
/*
* Use preg_replace() to highlight any occurrence of the word
* "regular"
*/
echo preg_replace("/regular/i", "<em>regular</em>", $string);
?>
</body>
</html>

Now loading the file in your browser will highlight all occurrences of the word regular, regardless of case.

A case-insensitive search of the sample data

As you can see, this approach has a drawback: the capitalized regular in the title is changed to lowercase when it is replaced. In the next section, you’ll learn how to avoid this issue by using groups in regexes.

Getting Fancy with Backreferences

The power of regexes starts to appear when you apply one of their most useful features: grouping and backreferences. A group is any part of a pattern that is enclosed in parentheses. A group can be used in the replacement string (or later in the pattern) with a backreference, a numbered reference to a captured group.

This all sounds confusing, but in practice it’s quite simple. Each set of parentheses from left to right in a regex is stored with a numeric backreference, which can be accessed using a backslash and the number of the backreference (\1) or by using a dollar sign and the number of the backreference ($1).

The benefit of this is that it gives regexes the ability to use the matched value in the replacement, instead of a predetermined value as in str_replace() and its ilk.

To keep the replacement contents in your previous example in the proper case, you need two calls to str_replace(); however, you can achieve the same effect by using a backreference in preg_replace() with just one function call.

Make the following modifications to regex.php to see the power of backreferences in regexes:

<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type"
content="text/html;charset=utf-8" />
<title>Regular Expression Demo</title>
<style type="text/css">
em {
background-color: #FF0;
border-top: 1px solid #000;
border-bottom: 1px solid #000;
}
</style>
</head>
<body>
<?php
/*
* Store the sample set of text to use for the examples of regex
*/
$string = <<<TEST_DATA
<h2>Regular Expression Testing</h2>
<p>
In this document, there is a lot of text that can be matched
using regex. The benefit of using a regular expression is much
more flexible &mdash; albeit complex &mdash; syntax for text
pattern matching.
</p>
<p>
After you get the hang of regular expressions, also called
regexes, they will become a powerful tool for pattern matching.
</p>
<hr />
TEST_DATA;
/*
* Use str_replace() to highlight any occurrence of the word
* "regular"
*/
$check1 = str_replace("regular", "<em>regular</em>", $string);
/*
* Use str_replace() again to highlight any capitalized occurrence
* of the word "Regular"
*/
echo str_replace("Regular", "<em>Regular</em>", $check1);
/*
* Use preg_replace() to highlight any occurrence of the word
* "regular", case-insensitive
*/
echo preg_replace("/(regular)/i", "<em>$1</em>", $string);
?>
</body>
</html>

As the preceding code illustrates, it’s already becoming cumbersome to use str_replace() for any kind of complex string matching. After saving the preceding changes and reloading your browser, you will see that both regexes and standard string replacement produce the desired outcome.

A more complex replacement
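The two backreference notations, and the use of a backreference inside the pattern itself, can be sketched as follows. The sample strings and variable names are illustrative assumptions, not part of regex.php.

```php
<?php
// Hypothetical sample sentence.
$text = "Regular users write regular expressions.";

// $1 in the replacement stands for whatever group 1 matched, case preserved.
$dollar = preg_replace("/(regular)/i", "<em>$1</em>", $text);

// \1 is the alternative notation. Single quotes keep the backslash literal;
// in a double-quoted PHP string, "\1" would be read as an octal escape.
$slash = preg_replace('/(regular)/i', '<em>\1</em>', $text);

echo $dollar, "\n";
var_dump($dollar === $slash);

// A backreference can also appear later in the pattern:
// here \1 must repeat the exact word captured by group 1.
$doubled = preg_replace('/\b(\w+) \1\b/', '$1', "this is is a test");
echo $doubled, "\n";
```

The first echo should print the sentence with both Regular and regular wrapped in <em> tags in their original case, and the last one should print "this is a test" with the duplicated word collapsed.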

Matching Character Classes

In some cases, it’s desirable to match more than just a word. For instance, sometimes you want to verify that only a certain range of characters was used (e.g., to make sure only numbers were supplied for a phone number or that no special characters were used in a username field).

Regexes allow you to specify a character class, which is a set of characters enclosed in square brackets. For instance, to match any character between the letter a and the letter c, you would use [a-c] in your pattern.

You can modify regex.php to highlight any character from A-C. Additionally, you can move the pattern into a variable and output it at the bottom of the sample data; this helps you see what pattern is being used when the script is loaded. Make the following modifications to accomplish this:

<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type"
content="text/html;charset=utf-8" />
<title>Regular Expression Demo</title>
<style type="text/css">
em {
background-color: #FF0;
border-top: 1px solid #000;
border-bottom: 1px solid #000;
}
</style>
</head>
<body>
<?php
/*
* Store the sample set of text to use for the examples of regex
*/
$string = <<<TEST_DATA
<h2>Regular Expression Testing</h2>
<p>
In this document, there is a lot of text that can be matched
using regex. The benefit of using a regular expression is much
more flexible &mdash; albeit complex &mdash; syntax for text
pattern matching.
</p>
<p>
After you get the hang of regular expressions, also called
regexes, they will become a powerful tool for pattern matching.
</p>
<hr />
TEST_DATA;
/*
* Use regex to highlight any occurrence of the letters a-c
*/
$pattern = "/([a-c])/i";
echo preg_replace($pattern, "<em>$1</em>", $string);
/*
* Output the pattern you just used
*/
echo "\n<p>Pattern used: <strong>$pattern</strong></p>";
?>
</body>
</html>

After reloading the page, you’ll see the characters highlighted. You can achieve identical results using [abc], [bac], or any other combination of the characters because the class will match any one character from the class. Also, because you’re using the case-insensitive modifier (i), you don’t need to include both uppercase and lowercase versions of the letters. Without the modifier, you would need to use [A-Ca-c] to match either case of the three letters.

Any character from A-C is highlighted
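The claim that [a-c], a reordered list such as [bac], and the explicit two-case class [A-Ca-c] behave the same way can be verified with a short sketch. The sample string and variable names are assumptions chosen for illustration.

```php
<?php
// Hypothetical sample string containing both cases of a, b, and c.
$text = "A Big Cab";

// A range with the case-insensitive modifier.
$range = preg_replace("/([a-c])/i", "<em>$1</em>", $text);

// The same three letters listed in a different order.
$listed = preg_replace("/([bac])/i", "<em>$1</em>", $text);

// No modifier, so both cases are spelled out in the class.
$noFlag = preg_replace("/([A-Ca-c])/", "<em>$1</em>", $text);

echo $range, "\n";
var_dump($range === $listed, $range === $noFlag);
```

Both comparisons should come out true, because a character class matches exactly one character from the set regardless of how the set is written.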

Matching Any Character Except...

To match any character except those in a class, prefix the character class with a caret (^). To highlight any characters except A-C, you would use the pattern /([^a-c])/i

Highlighting all characters, except letters A-C

Note: It’s important to mention that the preceding patterns enclose the character class within parentheses. Character classes do not store backreferences, so parentheses still must be used to reference the matched text later.
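A negated class can be sketched in the same way. The sample string is made up. As a side note that goes beyond the text above, preg_replace() also accepts $0 in the replacement to refer to the entire match, which is shown here only for comparison with the grouped form.

```php
<?php
// Hypothetical sample string.
$text = "abc-123";

// [^a-c] matches any single character that is NOT a, b, or c.
$negated = preg_replace("/([^a-c])/i", "<em>$1</em>", $text);
echo $negated, "\n";

// Without parentheses there is no group 1, but $0 still refers to
// the entire match, so this produces the same output.
$whole = preg_replace("/[^a-c]/i", "<em>$0</em>", $text);
var_dump($negated === $whole);
```

The letters a, b, and c should be left untouched while the hyphen and each digit are wrapped in <em> tags.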

Using Character Class Shorthand

Certain character classes have a shorthand character. For example, there is a shorthand class for every word, digit, or space character:

• Word character class shorthand (\w): Matches characters like [A-Za-z0-9_]
• Digit character class shorthand (\d): Matches characters like [0-9]
• Whitespace character class shorthand (\s): Matches characters like [ \t\r\n]

Using these three shorthand classes can improve the readability of your regexes, which is extremely convenient when you’re dealing with more complex patterns.

You can exclude a particular type of character by capitalizing the shorthand character:

• Non-word character class shorthand (\W): Matches characters like [^A-Za-z0-9_]
• Non-digit character class shorthand (\D): Matches characters like [^0-9]
• Non-whitespace character class shorthand (\S): Matches characters like [^ \t\r\n]

Note: \t, \r, and \n are special characters that represent tabs, carriage returns, and newlines; a space is represented by a regular space character ( ).
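The shorthand classes and their capitalized opposites can be compared by counting how many characters each one matches in a small string. This is a sketch under assumptions: the sample text is invented, and preg_match_all() is used only because it returns the number of matches.

```php
<?php
// Hypothetical sample; the double-quoted "\t" is a real tab character.
$text = "Room 42,\tfloor 3";

// Single-quoted patterns keep the backslashes intact for the regex engine.
$counts = [
    '\w' => preg_match_all('/\w/', $text),
    '\d' => preg_match_all('/\d/', $text),
    '\s' => preg_match_all('/\s/', $text),
    '\W' => preg_match_all('/\W/', $text),
    '\D' => preg_match_all('/\D/', $text),
    '\S' => preg_match_all('/\S/', $text),
];

foreach ($counts as $class => $count) {
    echo $class, " matches ", $count, " characters\n";
}

// Each class and its capitalized opposite together cover the whole string.
var_dump($counts['\w'] + $counts['\W'] === strlen($text));
```

Writing the patterns in single quotes is a deliberate choice here: PHP then passes the backslash through to the regex engine unchanged.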

Finding Word Boundaries

Another special symbol to be aware of is the word boundary symbol (\b). By placing this before and/or after a pattern, you can ensure that the pattern isn’t contained within another word. For instance, if you want to match the word stat, but not thermostat, statistic, or ecstatic, you would use this pattern: /\bstat\b/.
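The stat example can be tried directly. The helper function name below is hypothetical and exists only to keep the sketch readable; the extra phrase "a stat sheet" is an added test case.

```php
<?php
// Hypothetical helper: true when "stat" appears as a standalone word.
function containsWordStat(string $subject): bool
{
    // \b asserts a word boundary on each side without consuming characters.
    return preg_match('/\bstat\b/', $subject) === 1;
}

foreach (["stat", "thermostat", "statistic", "ecstatic", "a stat sheet"] as $sample) {
    echo $sample, ": ", containsWordStat($sample) ? "match" : "no match", "\n";
}
```

Only stat and "a stat sheet" should report a match; in the other three words, stat is attached to neighboring letters on at least one side.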

Using Repetition Operators

When you use character classes, only one character out of the set is matched, unless the pattern specifies a different number of characters. Regular expressions give you several ways to specify a number of characters to match:

• The star operator (*) matches zero or more occurrences of a character.
• The plus operator (+) matches one or more occurrences of a character.
• The special repetition operator ({min,max}) allows you to specify a range of character matches.

Matching zero or more characters is useful when using a string that may or may not have a certain piece of a pattern in it. For example, if you want to match all occurrences of either John or John Doe, you can use this pattern to match both instances: /John( Doe)*/.

Matching one or more characters is good for verifying that at least one character was entered. For instance, if you want to verify that a user enters at least one character into a form input and that the character is a valid word character, you can use this pattern to validate the input: /\w+/.

Finally, matching a specific range of characters is especially useful when matching numeric ranges. For instance, you can use this pattern to ensure a value is between 0 and 99: /\b\d{1,2}\b/.

In your example file, you can use this regex pattern to find any words consisting of exactly four letters: /(\b\w{4}\b)/

Matching only words that consist of exactly four letters
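The repetition operators discussed in this section can be exercised one at a time. The sample strings and variable names in this sketch are assumptions made for illustration.

```php
<?php
// * : the group " Doe" may appear zero or more times.
preg_match_all('/John( Doe)*/', "John and John Doe", $names);
print_r($names[0]); // "John" and "John Doe"

// + : at least one word character is required.
$emptyOk = preg_match('/\w+/', "") === 1;
$oneOk   = preg_match('/\w+/', "a") === 1;
var_dump($emptyOk, $oneOk); // false, then true

// {min,max} : one or two digits standing alone as a word.
preg_match_all('/\b\d{1,2}\b/', "7 42 100", $numbers);
print_r($numbers[0]); // "7" and "42"

// {n} : exactly four word characters between word boundaries.
preg_match_all('/\b\w{4}\b/', "they will become a powerful tool", $four);
print_r($four[0]); // "they", "will", "tool"
```

Note that 100 is skipped by the third pattern because no word boundary falls after its first one or two digits.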

Detecting the Beginning or End of a String

Additionally, you can force the pattern to match from the beginning or end of the string (or both). If the pattern starts with a caret (^), the regex will match only if the string starts with a matching character. If it ends with a dollar sign ($), the regex will match only if the string ends with the preceding matching character.

You can combine these different symbols to make sure an entire string matches a pattern. This is useful when validating input because you can verify that the user only submitted valid information. For instance, you can use this regex pattern to verify that a username contains only the letters A-Z, the numbers 0-9, and the underscore character: /^\w+$/.
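A minimal sketch of the username check follows. The function name isValidUsername() and the sample inputs are hypothetical; only the pattern comes from the text.

```php
<?php
// Hypothetical helper wrapping the anchored pattern from the text.
function isValidUsername(string $name): bool
{
    // ^ and $ force the whole string to consist of word characters.
    return preg_match('/^\w+$/', $name) === 1;
}

var_dump(isValidUsername("jane_doe42")); // bool(true)
var_dump(isValidUsername("jane doe"));   // bool(false): contains a space
var_dump(isValidUsername(""));           // bool(false): + needs one character

// Without the anchors, one run of word characters anywhere is enough.
$unanchored = preg_match('/\w+/', "jane doe!") === 1;
var_dump($unanchored);                   // bool(true)
```

One caveat worth knowing: in PCRE, $ also matches just before a single trailing newline, so strict validation code often adds the D modifier or uses \z instead.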

Using Alternation

In some cases, it’s desirable to use either one pattern or another. This is called alternation, and it’s accomplished using a pipe character (|). This approach allows you to define two or more possibilities for a match. For instance, you can use this pattern to match either three-, six-, or seven-letter words in regex.php: /\b(\w{3}|\w{6,7})\b/

Using alternation to match only three-, six-, and seven-letter words
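The alternation pattern can be tried on a short phrase. The sample sentence below is invented for this sketch, and preg_match_all() is used to collect the matches instead of highlighting them.

```php
<?php
// Hypothetical sample phrase with words of several lengths.
$text = "the hang of regular expressions became clearer";

// Either exactly three word characters, or six to seven of them,
// in both cases bounded by word boundaries.
preg_match_all('/\b(\w{3}|\w{6,7})\b/', $text, $matches);

print_r($matches[0]); // "the", "regular", "became", "clearer"
```

Words such as hang (four letters) and expressions (eleven letters) are skipped because neither alternative can reach a word boundary at their end.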

Using Optional Items

In some cases, it becomes necessary to allow certain items to be optional. For instance, to match both single and plural forms of a word like expression, you need to make the s optional.

To do this, place a question mark (?) after the optional item. If the optional part of the pattern is longer than one character, it needs to be captured in a group (you’ll use this technique in the next section).

For now, use this pattern to highlight all occurrences of the word expression or expressions: /(expressions?)/i

Matching a pattern with an optional s at the end
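Both forms of optional item, a single character and a group, can be sketched as follows. The sample strings and variable names are assumptions for illustration.

```php
<?php
// Hypothetical sample sentence.
$text = "One expression, two Expressions.";

// s? makes the single character s optional.
$highlighted = preg_replace('/(expressions?)/i', '<em>$1</em>', $text);
echo $highlighted, "\n";

// A multi-character optional part goes in a group: (es)? is all-or-nothing.
preg_match_all('/regex(es)?/i', "regex and regexes", $found);
print_r($found[0]); // "regex" and "regexes"
```

The echo should show both expression and Expressions wrapped in <em> tags, with the trailing s included when it is present.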

Putting It All Together

Now that you’ve got a general understanding of regular expressions, it’s time to use your new knowledge to write a regex pattern that will match any occurrence of the phrases regular expression or regex, including the plural forms.

To start, look for the phrase regex: /(regex)/i

Matching the word regex

Next, add the ability for the phrase to be plural by inserting an optional es at the end: /(regex(es)?)/i

Adding the optional match for the plural form of regex

Next, you will add to the pattern so that it also matches the word regular with a space after it; you will also make the match optional: /(reg(ular\s)?ex(es)?)/i

Adding an optional check for the word regular

Now expand the pattern to match the word expression as an alternative to es: /(reg(ular\s)?ex(pression|es)?)/i
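To see what the pattern matches at this stage, it can be run against a sentence that contains each variant. The sentence and variable names are made up for this sketch.

```php
<?php
// The pattern as built so far, single-quoted so \s reaches the regex engine.
$pattern = '/(reg(ular\s)?ex(pression|es)?)/i';

// Hypothetical sample sentence containing each variant.
$text = "Regular expressions, also called regexes; one regex or a regular expression.";

preg_match_all($pattern, $text, $matches);
print_r($matches[0]);
// "Regular expression", "regexes", "regex", "regular expression"
```

Note that the first match stops before the final s of expressions: at this stage the pattern covers the plural regexes but not yet the plural of expression.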