Variables in your gawk programs - Shell Scripting

One important feature of any programming language is the ability to store and recall values using variables. The gawk programming language supports two different types of variables:

  • Built-in variables
  • User-defined variables

There are several built-in variables available for you to use in gawk. The built-in variables contain information used in handling the data fields and records in the data file. You can also create your own variables in your gawk programs. The following sections walk you through how to use variables in your gawk programs.

Built-in variables

The gawk program uses built-in variables to reference specific features within the program data. This section describes the built-in variables available for you to use in your gawk programs and demonstrates how to use them.

The field and record separator variables

The data field variables allow you to reference individual data fields within a data record using a dollar sign and the numerical position of the data field in the record. Thus, to reference the first data field in the record, you use the $1 variable. To reference the second data field, you use the $2 variable, and so on.

Data fields are delineated by a field separator character. By default the field separator character is a whitespace character, such as a space or a tab. Chapter 16 showed how to change the field separator character either on the command line by using the -F command line parameter or within the gawk program by using the special FS built-in variable.

The FS built-in variable belongs to a group of built-in variables that control how gawk handles fields and records in both input data and output data. The table below lists the built-in variables contained in this group.

The FS and OFS variables define how your gawk program handles data fields in the data stream. You’ve already seen how to use the FS variable to define what character separates data fields in a record. The OFS variable performs the same function for the output produced by the print command.

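Table: The gawk Data Field and Record Variables

Variable      Description
FIELDWIDTHS   A space-separated list of numbers defining the exact width of each data field
FS            Input field separator character
RS            Input record separator character
OFS           Output field separator character
ORS           Output record separator character

By default, gawk sets the OFS variable to a space, so when you use the command:

print $1,$2,$3

you’ll see the output as:

field1 field2 field3

You can see this in the following example:
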
$ cat data1
$ gawk 'BEGIN{FS=","} {print $1,$2,$3}' data1
data11 data12 data13
data21 data22 data23
data31 data32 data33

The print command automatically places the value of the OFS variable between each data field in the output. By setting the OFS variable, you can use any string to separate data fields in the output:

$ gawk 'BEGIN{FS=","; OFS="-"} {print $1,$2,$3}' data1
$ gawk 'BEGIN{FS=","; OFS="--"} {print $1,$2,$3}' data1
$ gawk 'BEGIN{FS=","; OFS="<-->"} {print $1,$2,$3}' data1

The FIELDWIDTHS variable allows you to read records without using a field separator character. In some applications, instead of using a field separator character, data is placed in specific columns within the record. In these instances, you must set the FIELDWIDTHS variable to match the layout of the data in the records.

Once you set the FIELDWIDTHS variable, gawk ignores the FS and calculates data fields based on the provided field width sizes. Here’s an example using field widths instead of field separator characters:

$ cat data1b
$ gawk 'BEGIN{FIELDWIDTHS="3 5 2 5"}{print $1,$2,$3,$4}' data1b
100 5.324 75 96.37
058 10.12 98 100.1

The FIELDWIDTHS variable defines four data fields, and gawk parses the data record accordingly. The string of numbers in each record is split based on the defined field width values.

Caution: It’s important to remember that once you set the FIELDWIDTHS variable, those values must remain constant. This method can’t accommodate variable-length data fields.

The RS and ORS variables define how your gawk program handles records in the data stream. By default, gawk sets the RS and ORS variables to the newline character. The default RS variable value indicates that each new line of text in the input data stream is a new record.

Sometimes you run into situations where data fields are spread across multiple lines in the data stream. A classic example of this is data that includes an address and phone number, each on a separate line:

Riley Mullen
123 Main Street
Chicago, IL 60601
(312)555-1234

If you try to read this data using the default FS and RS variable values, gawk will read each line as a separate record, and interpret each space in the record as a field separator. This isn’t what you intended.

To solve this problem, you need to set the FS variable to the newline character. This indicates that each line in the data stream is a separate field and all of the data on a line belongs to the data field. However, now you have the problem of not knowing where a new record starts.

To solve this problem, set the RS variable to an empty string, then leave a blank line between data records in the data stream. The gawk program will interpret each blank line as a record separator. Here’s an example of using this technique:

$ cat data2
Riley Mullen
123 Main Street
Chicago, IL 60601
(312)555-1234

Frank Williams
456 Oak Street
Indianapolis, IN 46201
(317)555-9876

Haley Snell
4231 Elm Street
Detroit, MI 48201
(313)555-4938
$ gawk 'BEGIN{FS="\n"; RS=""} {print $1,$4}' data2
Riley Mullen (312)555-1234
Frank Williams (317)555-9876
Haley Snell (313)555-4938

Perfect! The gawk program interpreted each line in the file as a data field and the blank lines as record separators.

Data variables

Besides the field and record separator variables, gawk provides some other built-in variables to help you know what’s going on with your data and extract information from the shell environment. The table below shows the other built-in variables in gawk.

Table: More gawk Built-in Variables

You should recognize a few of these variables from your shell script programming. The ARGC and ARGV variables allow you to retrieve the number of command line parameters and their values from the shell. This can be a little tricky though, as gawk doesn’t count the program script as part of the command line parameters:

$ gawk 'BEGIN{print ARGC,ARGV[1]}' data1
2 data1

The ARGC variable indicates that there are two parameters on the command line. This includes the gawk command and the data1 parameter (remember, the program script doesn’t count as a parameter). The ARGV array starts with an index of 0, which represents the command. The first array value is the first command line parameter after the gawk command.

Note that unlike shell variables, when you reference a gawk variable in the script, you don’t add a dollar sign before the variable name.

The ENVIRON variable may seem a little odd to you. It uses an associative array to retrieve shell environment variables. An associative array uses text for the array index values instead of numeric values.

The text in the array index is the shell environment variable. The value of the array is the value of the shell environment variable. Here’s an example of this:

$ gawk '
> BEGIN{
> print ENVIRON["HOME"]
> print ENVIRON["PATH"]
> }'

The ENVIRON["HOME"] variable retrieves the HOME environment variable value from the shell. Likewise, the ENVIRON["PATH"] variable retrieves the PATH environment variable value. You can use this technique to retrieve any environment variable value from the shell to use in your gawk programs.

The FNR, NF, and NR variables come in handy when you’re trying to keep track of data fields and records in your gawk program. Sometimes you’re in a situation where you don’t know exactly how many data fields are in a record. The NF variable allows you to specify the last data field in the record without having to know its position:

$ gawk 'BEGIN{FS=":"; OFS=":"} {print $1,$NF}' /etc/passwd

The NF variable contains the number of data fields in the current record, which is also the position of the last data field. You can then use it as a data field variable by placing a dollar sign in front of it.

The FNR and NR variables are similar to each other, but slightly different. The FNR variable contains the number of records processed in the current data file. The NR variable contains the total number of records processed. Let’s look at a couple of examples to see this difference:

$ gawk 'BEGIN{FS=","}{print $1,"FNR="FNR}' data1 data1
data11 FNR=1
data21 FNR=2
data31 FNR=3
data11 FNR=1
data21 FNR=2
data31 FNR=3

In this example, the gawk program command line defines two input files. (It specifies the same input file twice.) The script prints the first data field value and the current value of the FNR variable. Notice that the FNR value reset back to 1 when the gawk program processed the second data file.

Now, let’s add the NR variable and see what that produces:

$ gawk '
> BEGIN {FS=","}
> {print $1,"FNR="FNR,"NR="NR}
> END{print "There were",NR,"records processed"}' data1 data1
data11 FNR=1 NR=1
data21 FNR=2 NR=2
data31 FNR=3 NR=3
data11 FNR=1 NR=4
data21 FNR=2 NR=5
data31 FNR=3 NR=6
There were 6 records processed

The FNR variable value reset when gawk processed the second data file, but the NR variable maintained its count into the second data file. The bottom line is that if you’re only using one data file for input, the FNR and NR values will be the same. If you’re using multiple data files for input, the FNR value will reset for each data file, and the NR value will keep count throughout all the data files.

Note: You’ll notice when using gawk that often the gawk script can become larger than the rest of your shell script. In the examples in this chapter, for simplicity I just run the gawk scripts directly from the command line. When you use gawk in a shell script, you should place different gawk commands on separate lines. This makes the script much easier to read and follow than trying to cram it all onto one line in the shell script.

User-defined variables

Just like any other self-respecting programming language, gawk allows you to define your own variables for use within the program code. A gawk user-defined variable name can be any number of letters, digits, and underscores, but it can’t begin with a digit. It’s also important to remember that gawk variable names are case sensitive.

Assigning variables in scripts

Assigning values to variables in gawk programs is similar to doing so in a shell script, using an assignment statement:

$ gawk '
> BEGIN{
> testing="This is a test"
> print testing
> }'
This is a test

The output of the print statement is the current value of the testing variable. Like shell script variables, gawk variables can hold either numeric or text values:

$ gawk '
> BEGIN{
> testing="This is a test"
> print testing
> testing=45
> print testing
> }'
This is a test
45

In this example, the value of the testing variable is changed from a text value to a numeric value. Assignment statements can also include mathematical expressions to handle numeric values:

$ gawk 'BEGIN{x=4; x= x * 2 + 3; print x}'
11

As you can see from this example, the gawk programming language includes the standard mathematical operators for processing numerical values. These can include the remainder symbol (%) and the exponentiation symbol (using either ^ or **).

Assigning variables in the command line

You can also use the gawk command line to assign values to variables for the gawk program. This allows you to set values outside of the normal code, changing values on the fly. Here’s an example of using a command line variable to display a specific data field in the file:

$ cat script1
BEGIN{FS=","}
{print $n}
$ gawk -f script1 n=2 data1
$ gawk -f script1 n=3 data1

This feature allows you to change the behavior of the script without having to change the actual script code. The first example displays the second data field in the file, while the second example displays the third data field, just by setting the value of the n variable in the command line.

There’s one problem with using command line parameters to define variable values. When you set the variable, the value isn’t available in the BEGIN section of the code:

$ cat script2
BEGIN{print "The starting value is",n; FS=","}
{print $n}
$ gawk -f script2 n=3 data1
The starting value is
data13
data23
data33

You can solve this using the -v command line parameter. This allows you to specify variables that are set before the BEGIN section of code. The -v command line parameter must be placed before the script code in the command line:

$ gawk -v n=3 -f script2 data1
The starting value is 3
data13
data23
data33

Now the n variable contains the value set in the command line during the BEGIN section of code.

All rights reserved © 2020 Wisdom IT Services India Pvt. Ltd