Simple Patterns - JavaScript

The patterns used in this chapter so far have all been simple, constructed of string literals. However, a regular expression has many more parts than just matching specific characters. Metacharacters, character classes, and quantifiers are all important parts of regular expression syntax and can be used to achieve some powerful results.

Metacharacters

In the previous section a comma was escaped (preceded with a backslash) before it was matched. Strictly speaking, a comma has a special meaning only inside a quantifier such as {1,3}; the characters that do have to be escaped in order to be matched literally are the metacharacters, which are characters that are part of regular expression syntax. Here are all the regular expression metacharacters:

( [ { \ ^ $ | ) ? * + .

Any time you want to match one of these characters literally inside a regular expression, it must be escaped.

So, to match a question mark, the regular expression looks like this:

var reQMark = /\?/;

Or like this:

var reQMark = new RegExp("\\?");

Did you notice the two backslashes in the second line? This is an important concept to grasp: when a regular expression is represented in this (non-literal) form, every backslash must be replaced with two backslashes because the JavaScript string parser tries to interpret \? the same way it tries to interpret \n. To ensure that this doesn’t happen, place two backslashes (called double escaping) in front of the metacharacter in question. This little gotcha is why many developers prefer to use the literal syntax.
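The effect of double escaping can be checked directly. The following is a minimal sketch; the variable names sSingle, sDouble, reConstructed, and bRejected are illustrative and not part of the original example. It compares what the string parser leaves behind with one backslash and with two:

```javascript
// With a single backslash the string parser drops the backslash,
// so the regular expression would only ever see a bare "?".
var sSingle = "\?";
var sDouble = "\\?";
console.log(sSingle.length); // 1
console.log(sDouble.length); // 2

// The double-escaped string produces a working pattern.
var reConstructed = new RegExp(sDouble);
console.log(reConstructed.test("Is this right?")); // true

// The single-escaped string is just "?", which is not a valid pattern.
var bRejected = false;
try {
  new RegExp(sSingle);
} catch (e) {
  bRejected = e instanceof SyntaxError;
}
console.log(bRejected); // true
```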

Using special characters

You can represent characters by using their literals or by specifying a character code, either as an ASCII code or as a Unicode code. To represent a character using ASCII, you must specify a two-digit hexadecimal code preceded by \x. For example, the character b has an ASCII code of 98, which is equal to hex 62; therefore, to represent the letter b you could use \x62:
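A minimal sketch of such a pattern, assuming the sample string is held in an illustrative variable named sColor:

```javascript
var sColor = "blue";
var reHex = /\x62/; // \x62 is the hexadecimal escape for "b"
console.log(reHex.test(sColor)); // true
```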

This code matches the letter b in “blue”.

Alternatively, you can specify the character code using octal instead of hex by including the octal digits after a backslash. For example, b is equal to octal 142, so this will work:
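A minimal sketch, again assuming an illustrative variable named sColor. Octal escapes are a legacy form, so a pattern like this one is rejected when it is compiled with the u (Unicode) flag:

```javascript
var sColor = "blue";
var reOctal = /\142/; // octal 142 is decimal 98, the code for "b"
console.log(reOctal.test(sColor)); // true
```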

To represent a character using Unicode, you must specify a four-digit hexadecimal representation of the character code. So b becomes \u0062:
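A minimal sketch, with sColor and reUnicode as illustrative names:

```javascript
var sColor = "blue";
var reUnicode = /\u0062/; // \u0062 is the Unicode escape for "b"
console.log(reUnicode.test(sColor)); // true
```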

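Note that to use this method of representing characters with the RegExp constructor, you should still include a second backslash, so that the escape sequence reaches the regular expression instead of being resolved by the string parser: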
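A minimal sketch, with reUnicodeFromString as an illustrative name. Printing the source property shows that the backslash survived the trip through the string:

```javascript
var sColor = "blue";
var reUnicodeFromString = new RegExp("\\u0062");
console.log(reUnicodeFromString.source); // \u0062
console.log(reUnicodeFromString.test(sColor)); // true
```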

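Additionally, there are a number of predefined special characters, which are listed in the following table:

Character   Description
\t          Tab
\n          New line
\r          Carriage return
\f          Form feed
\v          Vertical tab
\0          Null character
\cX         The control character corresponding to X
\b          Word boundary (backspace when used inside a character class)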

All of these characters must also be double-escaped in order to use them with the RegExp constructor.

Suppose you want to remove all new line characters from a string (a common task when dealing with user-input text). You can do so like this:
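A minimal sketch, assuming the text is held in an illustrative variable named sStringWithNewLines. The g flag makes replace() act on every new line rather than only the first:

```javascript
var sStringWithNewLines = "line one\nline two\nline three";
var sNewString = sStringWithNewLines.replace(/\n/g, "");
console.log(sNewString); // line oneline twoline three
```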

Character classes

Character classes are groups of characters to test for. By enclosing characters inside of square brackets, you are effectively telling the regular expression to match the first character, the second character, the third character, or so on. For example, to match any one of the characters a, b, or c, the character class is [abc]. This is called a simple class, because it specifies the exact characters to look for.

Simple classes

Suppose you want to match “bat”, “cat”, and “fat”. It is very easy to use a simple character class for this purpose:
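A minimal sketch of such a pattern. The sample string and the names sToMatch and reBatCatFat are assumptions, chosen so that the result lines up with the values described next; the g and i flags make the search global and case-insensitive:

```javascript
var sToMatch = "a bat, a Cat, a fAt baT, a faT cat";
var reBatCatFat = /[bcf]at/gi; // b, c, or f followed by "at"
var arrMatches = sToMatch.match(reBatCatFat);
console.log(arrMatches);
```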

The arrMatches array is now filled with these values: “bat”, “Cat”, “fAt”, “baT”, “faT”, and “cat”. You can also include special characters inside simple classes (and any other type of character class as well). Suppose you replace the b character with its Unicode equivalent:
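A minimal sketch, reusing the same assumed sample string and illustrative variable names:

```javascript
var sToMatch = "a bat, a Cat, a fAt baT, a faT cat";
var reBatCatFat = /[\u0062cf]at/gi; // \u0062 stands in for the letter b
var arrMatches = sToMatch.match(reBatCatFat);
console.log(arrMatches);
```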

This code behaves the same as it did in the previous example.

Negation classes

At times you may want to match all characters except for a select few. In this case, you can use a negation class, which specifies characters to exclude. For example, to match all characters except a and b, the character class is [^ab]. The caret (^) tells the regular expression that the character must not match the characters to follow.

Going back to the previous example, what if you only wanted to get words containing at but not beginning with b or c?
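A minimal sketch, assuming the same illustrative sample string as before and a hypothetical pattern name reNotBOrC. Because of the i flag, the negation class excludes uppercase B and C as well:

```javascript
var sToMatch = "a bat, a Cat, a fAt baT, a faT cat";
var reNotBOrC = /[^bc]at/gi; // any character except b or c, followed by "at"
var arrMatches = sToMatch.match(reNotBOrC);
console.log(arrMatches);
```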

In this case, arrMatches contains “fAt” and “faT”, because these strings match the pattern of a sequence ending with at but not beginning with b or c.

Range classes

Up until this point, the character classes required you to type all the characters to include or exclude. Suppose that you want to match any letter of the alphabet, but you really don’t want to type every letter. Instead, you can use a range class to specify a range between a and z: [a-z]. The key here is the dash (-), which should be read as through instead of minus (so the class is read as a through z, not a minus z).

Range classes work whenever the characters you want to test are in order by character code. Consider the following example:
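A minimal sketch. The sample string and the names sToMatch and reOneToFour are assumptions made for illustration:

```javascript
var sToMatch = "num1, num2, num3, num4, num5, num6, num7, num8, num9";
var reOneToFour = /num[1-4]/gi; // "num" followed by a digit from 1 through 4
var arrMatches = sToMatch.match(reOneToFour);
console.log(arrMatches);
```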

After execution, arrMatches contains four items: “num1”, “num2”, “num3”, and “num4” because they all match num and are followed by a character in the range 1 through 4.

Combination classes

A combination class is a character class that is made up of several other character classes. For instance, suppose you want to match all letters a through m, numbers 1 through 4, and the new line character.

The class looks like this:

[a-m1-4\n]

Note that there are no spaces between the different internal classes.

Predefined classes

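Because some patterns are used over and over again, a set of predefined character classes is available to make it easy for you to specify some complex classes. The following table lists the predefined classes:

Code   Description
.      Any character except a line terminator such as new line or carriage return
\d     A digit, equivalent to [0-9]
\D     A non-digit, equivalent to [^0-9]
\s     A white-space character (space, tab, new line, and so on)
\S     A non-white-space character
\w     A word character, equivalent to [a-zA-Z0-9_]
\W     A non-word character, equivalent to [^a-zA-Z0-9_]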

Using predefined classes can make pattern matching significantly easier. Suppose you want to match three digits in a row, without using \d. Your code looks like this:
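A minimal sketch, assuming an illustrative sample string and the hypothetical names sToMatch and reThreeNums:

```javascript
var sToMatch = "567 9838 abc";
var reThreeNums = /[0-9][0-9][0-9]/; // three range classes in a row
console.log(reThreeNums.test(sToMatch)); // true
```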

Using \d, the regular expression becomes much cleaner:
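A minimal sketch with the same assumed sample string and illustrative names:

```javascript
var sToMatch = "567 9838 abc";
var reThreeNums = /\d\d\d/; // \d is shorthand for [0-9]
console.log(reThreeNums.test(sToMatch)); // true
```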

Quantifiers

Quantifiers enable you to specify how many times a particular pattern should occur. You can specify both hard values (for example, this character should appear three times) and soft values (for example, this character should appear at least once but can repeat any number of times) when setting how many times a pattern should occur.

Simple quantifiers

The following table lists the various ways to quantify a particular pattern.

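Code     Description
?        Either zero or one occurrence
*        Zero or more occurrences
+        One or more occurrences
{n}      Exactly n occurrences
{n,m}    At least n but no more than m occurrences
{n,}     At least n occurrences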

For example, suppose you want to match the words bread, read, or red. Using the question mark quantifier, you can create just one regular expression to match all three:

var reBreadReadOrRed = /b?rea?d/;

You can read this regular expression as “zero or one occurrence of b, followed by r, followed by e, followed by zero or one occurrence of a, followed by d.” The preceding regular expression is the same as this one:

var reBreadReadOrRed = /b{0,1}rea{0,1}d/;

In this regular expression, the question mark has been replaced with curly braces. Inside the curly braces are the numbers 0, which is the minimum number of occurrences, and 1, which is the maximum. This expression reads the same way as the previous one; it’s just represented differently. Both expressions are considered correct.

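To illustrate the other quantifiers, suppose you had to create a regular expression to match the strings “bd”, “bad”, “baad”, and “baaad”. The following table illustrates some possible solutions and which words each matches.

Regular Expression   Matches
ba?d                 “bd”, “bad”
ba*d                 “bd”, “bad”, “baad”, “baaad”
ba+d                 “bad”, “baad”, “baaad”
ba{0,1}d             “bd”, “bad”
ba{0,}d              “bd”, “bad”, “baad”, “baaad”
ba{1,}d              “bad”, “baad”, “baaad”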

As you can see, only two of the six expressions adequately solve the problem: ba*d and ba{0,}d. Notice that these two are exactly equal because the asterisk means 0 or more, just as {0,} does. Likewise, the first and fourth expressions are equal, and the third and sixth expressions are equal.
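These results can be reproduced with a short loop. The sketch below assumes the six expressions are the ?, *, and + forms followed by their curly-brace equivalents; the ^ and $ anchors are an addition made here so that each word has to match as a whole, and the names arrWords, arrPatterns, and oResults are illustrative:

```javascript
var arrWords = ["bd", "bad", "baad", "baaad"];
var arrPatterns = ["ba?d", "ba*d", "ba+d", "ba{0,1}d", "ba{0,}d", "ba{1,}d"];
var oResults = {};

for (var i = 0; i < arrPatterns.length; i++) {
  // Anchor the pattern so that the entire word must match.
  var reWord = new RegExp("^" + arrPatterns[i] + "$");
  oResults[arrPatterns[i]] = arrWords.filter(function (sWord) {
    return reWord.test(sWord);
  });
  console.log(arrPatterns[i] + " matches " + oResults[arrPatterns[i]].join(", "));
}
```

Only the ba*d and ba{0,}d entries end up containing all four words.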

Quantifiers can also be used with character classes, so if you wanted to match the strings “bead”, “baed”, “beed”, “baad”, “bad”, and “bed”, the following regular expression would do so:

var reBeadBaedBeedBaadBedBad = /b[ae]{1,2}d/;

This expression says that the character class [ae] can appear a minimum of one time and a maximum of two times.
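As a quick check, the expression can be run against the six words plus two that should fail. The list arrWords and the extra words “bd” and “baaed” are assumptions added for illustration:

```javascript
var reBeadBaedBeedBaadBedBad = /b[ae]{1,2}d/;
var arrWords = ["bead", "baed", "beed", "baad", "bad", "bed", "bd", "baaed"];

for (var i = 0; i < arrWords.length; i++) {
  // The first six words print true; "bd" has too few of [ae] and "baaed" too many.
  console.log(arrWords[i] + ": " + reBeadBaedBeedBaadBedBad.test(arrWords[i]));
}
```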

Greedy, reluctant, and possessive quantifiers

The three kinds of regular expression quantifiers are greedy, reluctant, and possessive. A greedy quantifier starts by looking at the entire string for a match. If no match is found, it eliminates the last character in the string and tries again. If a match is still not found, the last character is again discarded and the process repeats until a match is found or the string is left with no characters. All the quantifiers discussed to this point have been greedy.

A reluctant quantifier starts by looking at the first character in the string for a match. If that character alone isn’t enough, it reads in the next character, forming a string of two characters. If still no match is found, a reluctant quantifier continues to add characters from the string until either a match is found or the entire string is checked without a match. Reluctant quantifiers work in reverse of greedy quantifiers.

A possessive quantifier only tries to match against the entire string. If the entire string doesn’t produce a match, no further attempt is made. Possessive quantifiers are, in a manner of speaking, a one-shot deal. Note that possessive quantifiers belong to other regular expression flavors, such as Java’s; JavaScript engines implement only the greedy and reluctant forms and reject a possessive pattern as a syntax error.

What makes a quantifier greedy, reluctant, or possessive? It’s really all in the use of the asterisk, question mark, and plus symbols. For example, the question mark alone (?) is greedy, but a question mark followed by another question mark (??) is reluctant. To make the question mark possessive, append a plus sign (?+). The following table shows all the greedy, reluctant, and possessive versions of the quantifiers you’ve already learned.

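Greedy   Reluctant   Possessive   Description
?        ??          ?+           Zero or one occurrence
*        *?          *+           Zero or more occurrences
+        +?          ++           One or more occurrences
{n}      {n}?        {n}+         Exactly n occurrences
{n,m}    {n,m}?      {n,m}+       At least n but no more than m occurrences
{n,}     {n,}?       {n,}+        At least n occurrences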

To illustrate the differences among the three kinds of quantifiers, consider the following example:
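A sketch of the setup, using the names sToMatch, re1, re2, and re3 that the discussion refers to. The possessive pattern is built with the RegExp constructor inside a try block, on the assumption that the script runs in a standard JavaScript engine: such engines reject .*+ as a syntax error, and writing it as a literal would stop the whole script from parsing.

```javascript
var sToMatch = "abbbaabbbaaabbb1234";
var re1 = /.*bbb/g;  // greedy
var re2 = /.*?bbb/g; // reluctant
var re3 = null;      // possessive, if the engine accepts it

try {
  re3 = new RegExp(".*+bbb", "g");
} catch (e) {
  // Standard JavaScript engines end up here with a SyntaxError.
  console.log("re3 could not be created: " + e.name);
}
```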

You want to match any number of letters followed by bbb. Ultimately, you’d like to get back as matches “abbb”, “aabbb”, and “aaabbb”. However, only one of the three regular expressions returns this result. Can you guess which one?

If you guessed re2, congratulations! You now understand the difference between greedy, reluctant, and possessive quantifiers. The first regular expression, re1, is greedy and so it starts by looking at the whole string. Behind the scenes, this is what happens:
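The steps can be sketched with a small helper. traceGreedy is a hypothetical function written for this illustration; it imitates the shrinking process described for greedy quantifiers by testing an anchored copy of the pattern against ever shorter candidates, which is a simplification of what a real engine does:

```javascript
// Hypothetical helper: start with the whole string and drop the last
// character until the anchored pattern matches the candidate.
function traceGreedy(sText) {
  var reWhole = /^.*bbb$/; // anchored so the whole candidate must match
  for (var iEnd = sText.length; iEnd > 0; iEnd--) {
    var sCandidate = sText.substring(0, iEnd);
    var bMatch = reWhole.test(sCandidate);
    console.log(sCandidate + " -> " + (bMatch ? "match" : "no match"));
    if (bMatch) {
      return sCandidate;
    }
  }
  return null; // ran out of characters without a match
}

var sGreedyResult = traceGreedy("abbbaabbbaaabbb1234");
```

The trace discards “4”, “3”, “2”, and “1” in turn before the candidate finally ends in bbb.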

So the only result that re1 returns is “abbbaabbbaaabbb”. Remember, the dot represents any character, b included; therefore, “abbbaabbbaaa” matches the .* part of the expression and “bbb” matches the bbb part.

For the second regular expression, re2, the following takes place behind the scenes:
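The growing process can be sketched the same way. traceReluctant is a hypothetical helper written for this illustration, not the engine's actual algorithm; it adds one character at a time to the candidate and starts a fresh candidate after each match:

```javascript
// Hypothetical helper: grow the candidate one character at a time and
// record it as soon as the anchored pattern matches.
function traceReluctant(sText) {
  var reWhole = /^.*?bbb$/; // anchored so the whole candidate must match
  var arrFound = [];
  var iStart = 0;
  for (var iEnd = 1; iEnd <= sText.length; iEnd++) {
    var sCandidate = sText.substring(iStart, iEnd);
    if (reWhole.test(sCandidate)) {
      arrFound.push(sCandidate);
      iStart = iEnd; // the next candidate begins after this match
    }
  }
  return arrFound;
}

var arrReluctantResult = traceReluctant("abbbaabbbaaabbb1234");
console.log(arrReluctantResult);
```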

Since re2 contains a reluctant quantifier, it returns “abbb”, “aabbb”, and “aaabbb”, just as you’d expect. The final regular expression, re3, actually has no result because it’s possessive. Here’s what it does behind the scenes:

re3.test("abbbaabbbaaabbb1234"); //false - no match

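Because possessive quantifiers only do one test, if that test fails, you get no result. In this case, the “1234” at the end of the string causes the expression not to match. Be aware that in a flavor that really implements possessive quantifiers, such as Java’s, re3 would find nothing even if the string were simply “abbbaabbbaaabbb”: the .*+ part keeps every character it has consumed, so nothing is left for the trailing bbb to match.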


All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd