Creating sed Utilities - Shell Scripting

As you’ve seen in the short examples presented so far in this chapter, there are lots of cool data-formatting things you can do with the sed editor. This section shows a few handy, well-known sed editor scripts for performing common data-handling functions.

Double spacing lines

To start things off, take a look at a simple sed script to insert a blank line between lines in a text file:

$ sed 'G' data2
This is the header line.

This is the first data line.

This is the second data line.

This is the last line.

$

That was pretty simple! The key to this trick is the default value of the hold space. Remember, the G command simply appends the contents of the hold space to the current pattern space contents. When you start the sed editor, the hold space contains an empty line. By appending that to an existing line, you create a blank line after the existing line.

You may have noticed that this script also adds a blank line to the last line in the data stream, producing a blank line at the end of the file. If you want to get rid of this, you can use the negate symbol and the last line symbol to ensure that the script doesn’t add the blank line to the last line of the data stream:

$ sed '$!G' data2
This is the header line.

This is the first data line.

This is the second data line.

This is the last line.
$

Now that looks a little better. As long as the line isn’t the last line, the G command appends the contents of the hold space. When the sed editor gets to the last line, it skips the G command.

Double spacing files that may have blanks

To take double spacing one step further, what if the text file already has a few blank lines, but you want to double space all of the lines? If you use the previous script, you’ll get some areas that have too many blank lines, as each existing blank line gets doubled:

$ cat data6
This is line one.
This is line two.

This is line three.
This is line four.
$ sed '$!G' data6
This is line one.

This is line two.



This is line three.

This is line four.
$

Now you have three blank lines where the original blank line was located. The solution to this problem is to first delete any blank lines from the data stream and then use the G command to insert new blank lines after all of the lines. To delete existing blank lines, you just need to use the d command with a pattern that matches a blank line:

/^$/d

This pattern uses the start line tag (the caret) and the end line tag (the dollar sign). Adding this to the script produces the desired results:

$ sed '/^$/d;$!G' data6
This is line one.

This is line two.

This is line three.

This is line four.
$

Perfect!
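
If you’d like to try this without the data6 file, the following is a minimal sketch that feeds a made-up five-line sample through both scripts. The sample text and the variable names naive and clean are assumptions for illustration only.

```shell
# Hypothetical sample: four data lines with one existing blank line.
sample() {
    printf 'line one\nline two\n\nline three\nline four\n'
}

# Plain double spacing: the existing blank line gets a blank line of its own.
naive=$(sample | sed '$!G')

# Delete existing blank lines first, then append one after every line but the last.
clean=$(sample | sed '/^$/d;$!G')

printf '%s\n' "$clean"
```

One caveat with this combination: if the file ends with a blank line, that line is deleted, and the data line before it still gets a blank line appended, so the output ends with a blank line.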

Numbering lines in a file

You can use the equal sign command to display the line numbers of lines in the data stream:

$ sed '=' data2
1
This is the header line.
2
This is the first data line.
3
This is the second data line.
4
This is the last line.
$

This can be a little awkward to read, as the line number is on a line above the actual line in the data stream. A better solution would be to place the line number on the same line as the text. Now that you’ve seen how to combine lines using the N command, it shouldn’t be too hard to utilize that information in the sed editor script. The trick to this utility though is that you can’t combine the two commands in the same script.

Once you have the output for the equal sign command, you can pipe the output to another sed editor script that uses the N command to combine the two lines. You also need to use the substitution command to replace the newline character with either a space or a tab character. Here’s what the final solution looks like:

$ sed '=' data2 | sed 'N; s/\n/ /'
1 This is the header line.
2 This is the first data line.
3 This is the second data line.
4 This is the last line.
$

Now that looks much better. This is a great little utility to have around when working on programs where you need to see the line numbers used in error messages.
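
Here is the same pipeline as a small self-contained sketch, using a made-up three-line input in place of data2. The variable name numbered is just for illustration.

```shell
# Hypothetical three-line input piped through the two sed commands.
# The first sed prints each line number on its own line; the second
# joins each number with the line that follows it.
numbered=$(printf 'alpha\nbeta\ngamma\n' | sed '=' | sed 'N; s/\n/ /')

printf '%s\n' "$numbered"
```

If everything works as described above, each input line comes out prefixed with its line number and a single space.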

Printing last lines

So far you’ve seen how to use the p command to print all of the lines in a data stream or just lines that match a specific pattern. What if you just need to work with the last few lines of a long listing, such as a log file?

The dollar sign represents the last line of a data stream, so it’s easy to display just the last line:

$ sed -n '$p' data2
This is the last line.
$

Now how can you use the dollar sign symbol to display a set number of lines at the end of the data stream? The answer is to create a rolling window.

A rolling window is a common way to examine blocks of text lines in the pattern space by combining them using the N command. The N command appends the next line of text to the text already in the pattern space. Once you have a block of 10 text lines in the pattern space, you can check if you’re at the end of the data stream using the dollar sign. If you’re not at the end, continue adding more lines to the pattern space, but removing the original lines (remember the D command, which deletes the first line in the pattern space).

By looping through the N and D commands, you add new lines to the block of lines in the pattern space, while removing old lines. The branch command is the perfect fit for the loop. To end the loop, just identify the last line and use the q command to quit.

Here’s what the final sed editor script looks like:

$ sed '{
> :start
> $q
> N
> 11,$D
> b start
> }' /etc/passwd
mysql:x:415:416:MySQL server:/var/lib/mysql:/bin/bash
rich:x:501:501:Rich:/home/rich:/bin/bash
katie:x:502:506:Katie:/home/katie:/bin/bash
jessica:x:503:507:Jessica:/home/jessica:/bin/bash
testy:x:504:504:Test account:/home/testy:/bin/csh
barbara:x:416:417:Barbara:/home/barbara/:/bin/bash
ian:x:505:508:Ian:/home/ian:/bin/bash
emma:x:506:509:Emma:/home/emma:/bin/bash
bryce:x:507:510:Bryce:/home/bryce:/bin/bash
test:x:508:511::/home/test:/bin/bash
$

The script first checks if the line is the last line in the data stream. If it is, the quit command prints the contents of the pattern space and exits, which ends the loop. The N command appends the next line to the current line in the pattern space. The 11,$D command deletes the first line in the pattern space if the current line is after line 10. This creates the sliding window effect in the pattern space.
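
To watch the rolling window work on predictable input, here is a minimal sketch that runs the same script over the numbers 1 through 15 instead of /etc/passwd. It assumes the seq command is available; the variable name last_ten is made up for this example.

```shell
# Feed 15 numbered lines through the rolling-window script.
# Each command sits on its own line so the label is parsed the same
# way by different sed implementations.
last_ten=$(seq 1 15 | sed '
:start
$q
N
11,$D
b start
')

printf '%s\n' "$last_ten"
```

The result should be the lines 6 through 15, the same thing tail -n 10 produces. For everyday use tail is the simpler tool, but the sed version shows how N and D cooperate.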

Deleting lines

Another useful utility for the sed editor is removing unwanted blank lines in a data stream. It’s easy to remove all the blank lines from a data stream, but it takes a little ingenuity to selectively remove blank lines. This section shows a couple of quick sed editor scripts that you can use to help remove unwanted blank lines from your data.

Deleting consecutive blank lines

One nuisance is when extra blank lines crop up in data files. Often you have a data file that contains blank lines, but sometimes a data line is missing and produces too many blank lines (as you saw in the double spacing example earlier).

The easiest way to remove consecutive blank lines is to check the data stream using a range address. Chapter 16 showed how to use ranges in addresses, including how to incorporate patterns in the address range. The sed editor executes the command for all lines that match within the specified address range.

The key to removing consecutive blank lines is creating an address range that includes a nonblank line and a blank line. If the sed editor comes across this range, it shouldn’t delete the line. However, for lines that don’t match that range (two or more blank lines in a row), it should delete the lines.

Here’s the script to do this:

/./,/^$/!d

The range is /./ to /^$/. The start address in the range matches any line that contains at least one character. The end address in the range matches a blank line. Lines within this range aren’t deleted. Here’s the script in action:

$ cat data6
This is the first line.
This is the second line.
This is the third line.
This is the fourth line.
$ sed '/./,/^$/!d' data6
This is the first line.
This is the second line.
This is the third line.
This is the fourth line.
$

No matter how many blank lines appear between lines of data in the file, the output only places one blank line between the lines.
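
As a rough check of that claim, the sketch below runs the script over a made-up input with a run of two blank lines and a run of three. The input text and the variable name squeezed are assumptions for illustration.

```shell
# Hypothetical input: two blank lines after "first", three after "second".
squeezed=$(printf 'first\n\n\nsecond\n\n\n\nthird\n' | sed '/./,/^$/!d')

printf '%s\n' "$squeezed"
```

Each run of blank lines should collapse to a single blank line, because only the first blank line after a data line falls inside the /./,/^$/ range.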

Deleting leading blank lines

Another nuisance is data files containing multiple blank lines at the start of the file. Often when trying to import data from a text file into a database, the blank lines create null entries, throwing off any calculations using the data.

Removing blank lines from the top of a data stream is not too difficult a task. Here’s the script that accomplishes that function:

/./,$!d

The script uses an address range to determine what lines are deleted. The range starts with a line that contains a character and continues to the end of the data stream. Any line within this range is not deleted from the output. This means that any lines before the first line that contains a character are deleted.

Take a look at this simple script in action:

$ cat data7


This is the first line.

This is the second line.
$ sed '/./,$!d' data7
This is the first line.

This is the second line.
$

The test file contains two blank lines before the data lines. The script successfully removes both of the leading blank lines, while keeping the blank line within the data intact.

Deleting trailing blank lines

Unfortunately, deleting trailing blank lines is not as simple as deleting leading blank lines. Just like printing the end of a data stream, deleting blank lines at the end of a data stream requires a little ingenuity and looping.

Before I start the discussion, let me show you what the script looks like:

sed '{
:start
/^\n*$/{$d; N; b start }
}'

This may look a little odd to you at first. Notice that there are braces within the normal script braces. This allows you to group commands together within the overall command script. The group of commands applies to the specified address pattern. The address pattern matches when the pattern space contains nothing but newline characters, that is, one or more blank lines. When that happens, if the current line is the last line, the delete command deletes the pattern space. If it’s not the last line, the N command appends the next line to it, and the branch command loops to the beginning to start over. Here’s the script in action:

$ cat data8
This is the first line.
This is the second line.
$ sed '{
> :start
> /^\n*$/{$d; N; b start }
> }' data8
This is the first line.
This is the second line.
$

The script successfully removed the blank lines from the end of the text file.
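
The trailing-line script pairs naturally with the leading-line script from the previous section. The following is a minimal sketch that chains the two over a made-up input; the function names make_input and trim_blank_edges are hypothetical helpers, not part of sed.

```shell
# Hypothetical input: two leading blank lines, one interior blank line,
# and three trailing blank lines.
make_input() {
    printf '\n\nfirst\n\nsecond\n\n\n\n'
}

# Remove the leading blank lines, then the trailing ones.
# Each sed command sits on its own line to keep label parsing portable.
trim_blank_edges() {
    sed '/./,$!d' | sed '
:start
/^\n*$/{
$d
N
b start
}
'
}

make_input | trim_blank_edges

# Command substitution strips trailing newlines, so count lines with wc
# to confirm that no blank lines are left at the end.
line_count=$(make_input | trim_blank_edges | wc -l | tr -d ' ')
echo "lines kept: $line_count"
```

Only three lines should survive: the two data lines and the blank line between them.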

Removing HTML tags

In this day and age it’s not uncommon to download text from a Web site to save or use as data in an application. Sometimes, though, when you download text from the Web site, you also get the HTML tags used to format the data. This can be a problem when all you want to see is the data.

A standard HTML Web page contains several different types of HTML tags, identifying formatting features required to properly display the page information. Here’s a sample of what an HTML file looks like:

$ cat data9
<html>
<head>
<title>This is the page title</title>
</head>
<body>
<p>
This is the <b>first</b> line in the Web page. This should provide
some <i>useful</i> information for us to use in our shell script.
</body>
</html>
$

HTML tags are identified by the less-than and greater-than symbols. Most HTML tags come in pairs. One tag starts the formatting process (for example, <b> for bolding), and another tag stops the formatting process (for example, </b> to turn off bolding).

Removing HTML tags creates a problem though if you’re not careful. At first glance, you’d think that the way to remove HTML tags would be to just look for text that starts with the less-than symbol and ends with a greater-than symbol, with any data in between:

s/<.*>//g

Unfortunately, this command has some unintended consequences:

$ sed 's/<.*>//g' data9






This is the  line in the Web page. This should provide
some  information for us to use in our shell script.


$

Notice that the title text is missing, along with the text that was bolded and italicized. The sed editor literally interpreted the script to mean any text between the less-than and greater-than sign, including other less-than and greater-than signs! Every place where text was enclosed in HTML tags (such as <b>first</b>), the sed script removed the entire text.

The solution to this problem is to have the sed editor ignore any embedded greater-than signs between the original tags. To do that, you can create a character class that negates the greater-than sign. This changes the script to:

s/<[^>]*>//g

This script now works properly, displaying the data you need to see from the Web page HTML code:

$ sed 's/<[^>]*>//g' data9


This is the page title



This is the first line in the Web page. This should provide
some useful information for us to use in our shell script.


$

That’s a little better. To clean things up some, you can add a delete command to get rid of those pesky blank lines:

$ sed 's/<[^>]*>//g;/^$/d' data9
This is the page title
This is the first line in the Web page. This should provide
some useful information for us to use in our shell script.
$

Now that’s much more compact; there’s only the data you need to see.
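
To see the difference between the two patterns side by side, here is a minimal sketch using a made-up one-line HTML fragment. The variable names html, greedy, and careful are assumptions for illustration.

```shell
# Hypothetical one-line HTML fragment with nested tags.
html='<p>This is the <b>first</b> line.</p>'

# The .* pattern runs from the first < to the last > on the line,
# so it removes the text along with the tags.
greedy=$(printf '%s\n' "$html" | sed 's/<.*>//g')

# The [^>]* pattern stops at the first >, so only the tags are removed.
careful=$(printf '%s\n' "$html" | sed 's/<[^>]*>//g')

printf 'greedy:  [%s]\n' "$greedy"
printf 'careful: [%s]\n' "$careful"
```

Keep in mind that the sed editor works one line at a time here, so a tag that is split across two lines would not be matched by either pattern.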