Extended Regular Expressions - Shell Scripting

The POSIX ERE patterns include a few additional symbols that are used by some Linux applications and utilities. The gawk program recognizes the ERE patterns, but the sed editor, in its default mode, doesn’t.

Caution: It’s important to remember that there is a difference between the regular expression engines in the sed editor and the gawk program. The gawk program can use most of the extended regular expression pattern symbols, and it can provide some additional filtering capabilities that the sed editor doesn’t have. However, because of this, it is often slower in processing data streams.
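
As a quick illustration of that difference, the sketch below feeds the same pattern to sed in its default mode and again with the -E option. The -E option is an assumption here: GNU sed and BSD sed accept it to switch on extended patterns, but an older or more minimal sed may not.

```shell
# Default sed mode: the question mark is an ordinary character,
# so be?t only matches the literal text "be?t".
echo "bet"  | sed -n '/be?t/p'       # no output
echo "be?t" | sed -n '/be?t/p'       # prints: be?t

# With -E (assumed to be available), ? means "zero or one",
# which is how gawk reads it.
echo "bet"  | sed -E -n '/be?t/p'    # prints: bet
```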

This section describes the more commonly found ERE pattern symbols that you can use in your gawk program scripts.

The question mark

The question mark is similar to the asterisk, but with a slight twist. The question mark indicates that the preceding character can appear zero or one time, but that’s all. It doesn’t match repeating occurrences of the character:

$ echo "bt" | gawk '/be?t/{print $0}'
bt
$ echo "bet" | gawk '/be?t/{print $0}'
bet
$ echo "beet" | gawk '/be?t/{print $0}'
$
$ echo "beeet" | gawk '/be?t/{print $0}'
$

If the e character doesn’t appear in the text, or as long as it appears only once in the text, the pattern matches.

Just as with the asterisk, you can use the question mark symbol along with a character class:

$ echo "bt" | gawk '/b[ae]?t/{print $0}'
bt
$ echo "bat" | gawk '/b[ae]?t/{print $0}'
bat
$ echo "bot" | gawk '/b[ae]?t/{print $0}'
$
$ echo "bet" | gawk '/b[ae]?t/{print $0}'
bet
$ echo "baet" | gawk '/b[ae]?t/{print $0}'
$
$ echo "beat" | gawk '/b[ae]?t/{print $0}'
$
$ echo "beet" | gawk '/b[ae]?t/{print $0}'
$

If zero or one character from the character class appears, the pattern match passes. However, if both characters appear, or if one of the characters appears twice, the pattern match fails.
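
To see how the question mark differs from the asterisk, the following sketch runs both patterns over the same words. The AWK variable and the compare_star_qmark function are illustrative names, not part of gawk; the script falls back to the system awk if gawk isn’t installed.

```shell
# Prefer gawk; fall back to whatever awk is installed (an assumption for portability).
AWK=$(command -v gawk || command -v awk)

# Hypothetical helper: report whether each word matches be*t and be?t.
compare_star_qmark() {
    printf '%s\n' "$@" | "$AWK" '{
        star  = ($0 ~ /be*t/) ? "yes" : "no"
        qmark = ($0 ~ /be?t/) ? "yes" : "no"
        print $0, "star=" star, "qmark=" qmark
    }'
}

compare_star_qmark bt bet beet
```

Only beet separates the two patterns: the asterisk accepts the repeated e, while the question mark does not.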

The plus sign

The plus sign is another pattern symbol that’s similar to the asterisk, but with a different twist than the question mark. The plus sign indicates that the preceding character can appear one or more times, but must be present at least once. The pattern doesn’t match if the character is not present:

$ echo "beeet" | gawk '/be+t/{print $0}'
beeet
$ echo "beet" | gawk '/be+t/{print $0}'
beet
$ echo "bet" | gawk '/be+t/{print $0}'
bet
$ echo "bt" | gawk '/be+t/{print $0}'
$
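
Another way to read the plus sign is as one required copy of the character followed by an asterisk, so be+t and bee*t accept the same text. The sketch below checks that on a few words; AWK is an illustrative variable that falls back to the system awk when gawk isn’t installed.

```shell
# Prefer gawk; fall back to whatever awk is installed (an assumption for portability).
AWK=$(command -v gawk || command -v awk)

# Both commands select the same lines: every word with at least one e between b and t.
printf '%s\n' bt bet beet | "$AWK" '/be+t/{print "plus: " $0}'
printf '%s\n' bt bet beet | "$AWK" '/bee*t/{print "star: " $0}'
```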

If the e character is not present, the pattern match fails. The plus sign also works with character classes, the same way as the asterisk and question mark do:

$ echo "bt" | gawk '/b[ae]+t/{print $0}'
$
$ echo "bat" | gawk '/b[ae]+t/{print $0}'
bat
$ echo "bet" | gawk '/b[ae]+t/{print $0}'
bet
$ echo "beat" | gawk '/b[ae]+t/{print $0}'
beat
$ echo "beet" | gawk '/b[ae]+t/{print $0}'
beet
$ echo "beeat" | gawk '/b[ae]+t/{print $0}'
beeat
$

This time, if either character defined in the character class appears, the text matches the specified pattern.

Using braces

Curly braces are available in ERE to allow you to specify a limit on a repeatable regular expression. This is often referred to as an interval. You can express the interval in two formats:

  • m: The regular expression appears exactly m times.
  • m,n: The regular expression appears at least m times, but no more than n times.

This feature allows you to fine-tune exactly how many times you allow a character (or character class) to appear in a pattern.

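Caution: Older versions of the gawk program (before version 4.0) don’t recognize regular expression intervals by default. For those versions, you must specify the --re-interval command line option for the gawk program to recognize regular expression intervals. Newer versions recognize intervals by default and still accept the option, so the examples here include it.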
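
Whether the option is needed depends on the awk you have installed. One way to find out, sketched here, is to test a pattern that can only match when the braces are read as an interval. The AWK variable and the message text are illustrative.

```shell
# Prefer gawk; fall back to whatever awk is installed (an assumption for portability).
AWK=$(command -v gawk || command -v awk)

# beet matches ^be{2}t$ only if {2} is treated as an interval;
# otherwise the braces are taken literally and nothing matches.
if echo "beet" | "$AWK" '/^be{2}t$/{found=1} END{exit !found}'; then
    echo "intervals are recognized without extra options"
else
    echo "intervals are not recognized by default"
fi
```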

Here’s an example of using a simple interval of one value:

$ echo "bt" | gawk --re-interval '/be{1}t/{print $0}'
$
$ echo "bet" | gawk --re-interval '/be{1}t/{print $0}'
bet
$ echo "beet" | gawk --re-interval '/be{1}t/{print $0}'
$

By specifying an interval of one, you restrict the number of times the character can be present for the string to match the pattern. If the character appears more times, the pattern match fails.

Specifying both a lower and an upper limit often comes in handy:

$ echo "bt" | gawk --re-interval '/be{1,2}t/{print $0}'
$
$ echo "bet" | gawk --re-interval '/be{1,2}t/{print $0}'
bet
$ echo "beet" | gawk --re-interval '/be{1,2}t/{print $0}'
beet
$ echo "beeet" | gawk --re-interval '/be{1,2}t/{print $0}'
$

In this example, the e character can appear once or twice for the pattern match to pass; otherwise, the pattern match fails.

The interval pattern match also applies to character classes:

$ echo "bt" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
$
$ echo "bat" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
bat
$ echo "bet" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
bet
$ echo "beat" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
beat
$ echo "beet" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
beet
$ echo "beeat" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
$
$ echo "baeet" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
$
$ echo "baeaet" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
$

This regular expression pattern will match if there are exactly one or two a’s or e’s between the b and the t, but it will fail if there are any more in any combination.

The pipe symbol

The pipe symbol allows you to specify two or more patterns that the regular expression engine uses in a logical OR formula when examining the data stream. If any of the patterns match the data stream text, the text passes. If none of the patterns match, the data stream text fails.

The format for using the pipe symbol is:

expr1|expr2|...

Here’s an example of this:

$ echo "The cat is asleep" | gawk '/cat|dog/{print $0}'
The cat is asleep
$ echo "The dog is asleep" | gawk '/cat|dog/{print $0}'
The dog is asleep
$ echo "The sheep is asleep" | gawk '/cat|dog/{print $0}'
$

This example looks for the regular expression cat or dog in the data stream. Don’t place spaces between the regular expressions and the pipe symbol; any space you add becomes part of the regular expression pattern.
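
The effect of a stray space is easy to demonstrate. In this sketch the second pattern contains spaces around the pipe symbol, so it really asks for "cat " or " dog"; AWK is an illustrative variable that falls back to the system awk when gawk isn’t installed.

```shell
# Prefer gawk; fall back to whatever awk is installed (an assumption for portability).
AWK=$(command -v gawk || command -v awk)

# No spaces: the line "cat" matches the first alternative.
echo "cat" | "$AWK" '/cat|dog/{print "matched: " $0}'

# Spaces around the pipe: the alternatives are now "cat " and " dog",
# so the same line produces no output.
echo "cat" | "$AWK" '/cat | dog/{print "matched: " $0}'
```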

The regular expressions on either side of the pipe symbol can use any regular expression pattern, including character classes, to define the text:

$ echo "He has a hat." | gawk '/[ch]at|dog/{print $0}'
He has a hat.
$

This example would match cat, hat, or dog in the data stream text.

Grouping expressions

Regular expression patterns can also be grouped by using parentheses. When you group a regular expression pattern, the group is treated like a standard character. You can apply a special character to the group just as you would to a regular character. For example:

$ echo "Sat" | gawk '/Sat(urday)?/{print $0}'
Sat
$ echo "Saturday" | gawk '/Sat(urday)?/{print $0}'
Saturday
$

The grouping of the day ending along with the question mark allows the pattern to match either the full day name or the abbreviated name.

It’s common to use grouping along with the pipe symbol to create groups of possible pattern matches:

$ echo "cat" | gawk '/(c|b)a(b|t)/{print $0}'
cat
$ echo "cab" | gawk '/(c|b)a(b|t)/{print $0}'
cab
$ echo "bat" | gawk '/(c|b)a(b|t)/{print $0}'
bat
$ echo "bab" | gawk '/(c|b)a(b|t)/{print $0}'
bab
$ echo "tab" | gawk '/(c|b)a(b|t)/{print $0}'
$
$ echo "tac" | gawk '/(c|b)a(b|t)/{print $0}'
$

The pattern (c|b)a(b|t) matches any one letter from the first group, followed by the letter a, followed by any one letter from the second group.
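
Note that the grouped pattern is not anchored, so it matches anywhere in a line. The sketch below compares it with an anchored version; the ^ and $ anchors are assumed to be familiar from basic regular expressions, and AWK and check_groups are illustrative names.

```shell
# Prefer gawk; fall back to whatever awk is installed (an assumption for portability).
AWK=$(command -v gawk || command -v awk)

# Hypothetical helper: test each word against the unanchored and anchored pattern.
check_groups() {
    printf '%s\n' "$@" | "$AWK" '{
        loose  = ($0 ~ /(c|b)a(b|t)/)   ? "yes" : "no"
        strict = ($0 ~ /^(c|b)a(b|t)$/) ? "yes" : "no"
        print $0, "unanchored=" loose, "anchored=" strict
    }'
}

check_groups cat tab scab
```

The word scab passes the unanchored pattern because it contains cab, but fails the anchored one, which accepts only the four three-letter words.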

