Regular Expressions in Action - Shell Scripting

Now that you’ve seen the rules and a few simple demonstrations of using regular expression patterns, it’s time to put that knowledge into action. The following sections demonstrate some common regular expression examples within shell scripts.

Counting directory files

To start things out, let’s look at a shell script that counts the executable files present in the directories defined in your PATH environment variable. To do that, you’ll need to parse the PATH variable into separate directory names. You can display the PATH environment variable with the echo command:

$ echo $PATH
/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/games:/usr/java/j2sdk1.4.1_01/bin
$

Your PATH environment variable will differ, depending on where the applications are located on your Linux system. The key is to recognize that each directory in the PATH is separated by a colon. To get a listing of directories that you can use in a script, you’ll have to replace each colon with a space. You now recognize that the sed editor can do just that using a simple regular expression:

$ echo $PATH | sed 's/:/ /g'
/usr/local/bin /bin /usr/bin /usr/X11R6/bin /usr/games /usr/java/j2sdk1.4.1_01/bin
$

Once you’ve got the directories separated out, you can use them in a standard for statement to iterate through each directory:

mypath=`echo $PATH | sed 's/:/ /g'`
for directory in $mypath
do
...
done
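
The sed substitution relies on the shell splitting the result at spaces, so a directory whose name contains a space would be split in two. As an alternative sketch (not the approach this section uses), bash can split on the colon directly by setting IFS for a single read command. The function name split_path and the sample value are made up for illustration.

```shell
#!/bin/bash
# split_path is a hypothetical helper: it prints one PATH directory per line.
split_path() {
    local dirs dir
    # IFS=':' applies only to this read; -a stores the fields in an array.
    IFS=':' read -r -a dirs <<< "$1"
    for dir in "${dirs[@]}"
    do
        printf '%s\n' "$dir"
    done
}

# Made-up value containing a directory with a space in its name.
split_path "/usr/local/bin:/opt/my tools/bin:/bin"
```

Because each directory comes out on its own line, the middle entry stays in one piece instead of becoming two words.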

Once you have each directory, you can use the ls command to list each file in each directory, and use another for statement to iterate through each file, incrementing a counter for each file. The final version of the script looks like this:

$ cat countfiles
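#!/bin/bash
# count number of files in each directory of your PATH
# turn the colon-separated PATH into a space-separated list
mypath=`echo $PATH | sed 's/:/ /g'`
count=0
for directory in $mypath
do
    check=`ls $directory`
    for item in $check
    do
        count=$[ $count + 1 ]
    done
    echo "$directory - $count"
    # reset the counter for the next directory
    count=0
done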
$ ./countfiles
/usr/local/bin - 79
/bin - 86
/usr/bin - 1502
/usr/X11R6/bin - 175
/usr/games - 2
/usr/java/j2sdk1.4.1_01/bin - 27
$
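
The script counts every name that ls prints, whether or not the file is executable, and it splits names that contain spaces. One possible refinement is sketched here, under the assumption that "executable" means a regular file the current user is allowed to execute. It tests each entry with -f and -x. The function name count_executables is hypothetical.

```shell
#!/bin/bash
# count_executables is a hypothetical helper: prints how many regular,
# executable files the given directory contains.
count_executables() {
    local dir=$1 count=0 item
    # The glob expands to one word per entry, so names with spaces stay intact.
    for item in "$dir"/*
    do
        if [ -f "$item" ] && [ -x "$item" ]
        then
            count=$(( count + 1 ))
        fi
    done
    echo "$count"
}

# Walk PATH the same way the countfiles script does.
for directory in $(echo "$PATH" | sed 's/:/ /g')
do
    echo "$directory - $(count_executables "$directory")"
done
```

An empty or missing directory simply reports 0, because the unexpanded glob fails the -f test.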

Validating a phone number

The previous example showed how to use a simple regular expression with sed to replace characters in a data stream. Regular expressions are also often used to validate data, ensuring that it is in the correct format for a script. A common data validation application is checking phone numbers. Data entry forms often request phone numbers, and customers often fail to enter a properly formatted one. In the United States, there are several common ways to display a phone number:

(123)456-7890
(123) 456-7890
123-456-7890
123.456.7890

This leaves four possibilities for how customers can enter their phone number in a form. The regular expression must be robust enough to handle any of them. When building a regular expression, it’s best to start on the left-hand side, and build your pattern to match the possible characters you’ll run into. In this example, the first thing is that there may or may not be a left parenthesis in the phone number. This can be matched by using the pattern:

^\(?

The caret is used to indicate the beginning of the data. Since the left parenthesis is a special character, you must escape it with a backslash to use it as a normal character. The question mark indicates that the left parenthesis may or may not appear in the data to match. Next comes the three-digit area code. In the United States, area codes start with a digit from 2 to 9 (no area code begins with 0 or 1). To match the area code, you’d use the pattern:

[2-9][0-9]{2}

This requires that the first character be a digit between 2 and 9, followed by any two digits. After the area code, the ending right parenthesis may or may not be there:

\)?

After the area code there can be a space, no space, a dash, or a dot. You can group those using a character group along with the pipe symbol:

(| |-|\.)

The very first pipe symbol appears immediately after the left parenthesis to match the no space condition. You must use the escape character for the dot; otherwise, it’ll take on its special meaning of matching any character. Next comes the three-digit phone exchange number. Nothing special is required here:

[0-9]{3}

After the phone exchange number, you must match either a space, a dash, or a dot (this time you don’t have to worry about matching no space, since there must be at least a space between the phone exchange number and the rest of the number):

( |-|\.)

Then to finish things off, you must match the four-digit local phone extension at the end of the string:

[0-9]{4}$

Putting the entire pattern together results in this:

^\(?[2-9][0-9]{2}\)?(| |-|\.)[0-9]{3}( |-|\.)[0-9]{4}$
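
As a quick way to try the pattern before wiring it into a script, the sketch below checks a few strings with the bash =~ operator, which uses extended regular expressions. Two assumptions are worth flagging. The function name is_phone is made up. The separator groups are written as the bracket expressions [ .-]? and [ .-], which match the same strings as the two pipe groups but avoid the empty alternative that some regular expression libraries reject.

```shell
#!/bin/bash
# is_phone is a hypothetical helper: succeeds if $1 looks like a US phone number.
is_phone() {
    # Keeping the pattern in a variable avoids shell quoting problems with =~.
    local pattern='^\(?[2-9][0-9]{2}\)?[ .-]?[0-9]{3}[ .-][0-9]{4}$'
    [[ $1 =~ $pattern ]]
}

for number in "(317)555-1234" "(202) 555-9876" "212-555-1234" "000-555-1234"
do
    if is_phone "$number"
    then
        echo "valid:   $number"
    else
        echo "invalid: $number"
    fi
done
```

Only the last sample is reported as invalid, since its area code starts with 0.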

You can use this regular expression pattern in the gawk program to filter out bad phone numbers. All you need to do now is create a simple script using the regular expression in a gawk program, then filter your phone list through the script. Remember, when you use regular expression intervals in the gawk program, older versions of gawk require the --re-interval command line option or you won’t get the correct results. (gawk 4.0 and later enable intervals by default and still accept the option.)

Here’s the script:

$ cat isphone
#!/bin/bash
# script to filter out bad phone numbers
gawk --re-interval '/^\(?[2-9][0-9]{2}\)?(| |-|\.)[0-9]{3}( |-|\.)[0-9]{4}$/{print $0}'
$

The gawk command must be on a single line in the shell script. You can then redirect phone numbers to the script for processing:

$ echo "317-555-1234" | ./isphone
317-555-1234
$ echo "000-555-1234" | ./isphone
$

Or you can redirect an entire file of phone numbers to filter out the invalid ones:

$ cat phonelist
000-000-0000
123-456-7890
212-555-1234
(317)555-1234
(202) 555-9876
33523
1234567890
234.123.4567
$ cat phonelist | ./isphone
212-555-1234
(317)555-1234
(202) 555-9876
234.123.4567
$

Only the valid phone numbers that match the regular expression pattern appear.

Parsing an e-mail address

In this day and age e-mail addresses have become a crucial form of communication. Trying to validate e-mail addresses has become quite a challenge for script builders, due to the myriad of ways to create an e-mail address. The basic form of an e-mail address is:

username@hostname

The username value can use any alphanumeric character, along with several special characters:

  • Dot
  • Dash
  • Plus sign
  • Underscore

These characters can appear in any combination in a valid e-mail userid. The hostname portion of the e-mail address consists of one or more domain names and a server name. The server and domain names must also follow strict naming rules, allowing only alphanumeric characters, along with the special characters:

  • Dot
  • Underscore

The server and domain names are each separated by a dot, with the server name specified first, any subdomain names specified next, and finally, the top-level domain name without a trailing dot. At one time there was a fairly limited number of top-level domains, and regular expression pattern builders attempted to list them all in patterns for validation. Unfortunately, as the Internet grew, so did the number of possible top-level domains, and this technique is no longer a viable solution.

Let’s start building the regular expression pattern from the left side. We know that there can be multiple valid characters in the username. This should be fairly easy:

^([a-zA-Z0-9_\-\.\+]+)@

This grouping specifies the allowable characters in the username, and the plus sign indicates that there must be at least one character present. The next character is obviously going to be the @ symbol, no surprises there. The hostname pattern uses the same technique to match the server name and the subdomain names:

([a-zA-Z0-9_\-\.]+)

This pattern matches the text:

server
server.subdomain
server.subdomain.subdomain

There are special rules for the top-level domain. This pattern assumes that top-level domains contain only alphabetic characters, and that they are no fewer than two characters (used in country codes) and no more than five characters in length. Longer top-level domains such as .museum do exist, so raise the upper bound if you need to accept them. The regular expression pattern for the top-level domain is:

\.([a-zA-Z]{2,5})$

Putting the entire pattern together results in:

^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
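
Because the pattern wraps each part in parentheses, a regular expression engine that supports capture groups can also pull the address apart, not just validate it. The sketch below does this with the bash =~ operator and the BASH_REMATCH array. Two assumptions: the function name parse_email is made up, and the bracket expressions are written with the hyphen last and no backslashes, since POSIX-style engines treat a backslash inside brackets as a literal character rather than an escape.

```shell
#!/bin/bash
# parse_email is a hypothetical helper: prints the three captured parts of a
# well-formed address, or returns 1 if the address does not match.
parse_email() {
    local pattern='^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$'
    if [[ $1 =~ $pattern ]]
    then
        # BASH_REMATCH[1..3] hold the text matched by each parenthesized group.
        echo "user=${BASH_REMATCH[1]} host=${BASH_REMATCH[2]} tld=${BASH_REMATCH[3]}"
    else
        return 1
    fi
}

parse_email "rich.blum@here.now"
parse_email "rich@here-now" || echo "no match: rich@here-now"
```

The second call fails because the hostname has no dot followed by a top-level domain.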

This pattern will filter out poorly formatted e-mail addresses from a data list. Now you can create your script to implement the regular expression:

$ echo "rich@here.now" | ./isemail
rich@here.now
$ echo "rich@here.now." | ./isemail
$
$ echo "rich@here.n" | ./isemail
$
$ echo "rich@here-now" | ./isemail
$
$ echo "rich.blum@here.now" | ./isemail
rich.blum@here.now
$ echo "rich_blum@here.now" | ./isemail
rich_blum@here.now
$ echo "rich/blum@here.now" | ./isemail
$
$ echo "rich#blum@here.now" | ./isemail
$
$ echo "rich*blum@here.now" | ./isemail
$
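
The isemail script used in these tests is not listed here. A minimal sketch of how it might look follows, assuming it has the same shape as isphone: read lines from standard input and print only those that match. This version uses grep -E rather than gawk (a deliberate substitution, so it runs where gawk is not installed) and writes the bracket expressions without backslashes, with the hyphen last. It is shown as a shell function named isemail so it can be tried at the prompt; saving the grep line in an executable file would give the ./isemail form.

```shell
#!/bin/bash
# Hypothetical stand-in for the isemail script: filters stdin, printing only
# the lines that match the e-mail pattern built in this section.
isemail() {
    grep -E '^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$'
}

# Three sample addresses; the one containing a slash is filtered out.
printf '%s\n' "rich@here.now" "rich/blum@here.now" "rich_blum@here.now" | isemail
```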
