Sas Programming

Sas Programming

This course contains the basics of Sas Programming

Course introduction
Test Your Caliber
Interview Questions
Pragnya Meter Exam

Sas Programming

PROC FORMAT

Very simply put,formats allow you to associate descriptive text with internal values of variables in a SAS data set. You can instruct the SAS System to print "MALE" or "FEMALE" instead of the actual values of "M" or "F". You can also use formats to create new variables as translated versions of other variables.

 The SAS System provides a plethora of ready-to-use formats and informats, but usually they are not enough. When this is the case, you can use the FORMAT procedure to create your own formats or informats.

The FORMAT procedure is one of the most potentially useful and productive tools available in the SAS System,yet it remains one of the most underused by beginning programmers.Although this does not present an exhaustive in-depth look at this powerful tool,it does cover such topics as creating and storing your own formats and creating a user-defined format from a SAS data set.

 This latter feature can be extremely useful when you have a long list of values and descriptions stored in a raw data file or a SAS data set and you want to create a format associating the values with their descriptions.Without the automating features of CNTLIN/CNTLOUT,this could be a tedious process.

One of the most basic concepts in the SAS System,and also one of the most commonly misunderstood among beginning SAS programmers, is the notion that formats exist independently of variables and must be associated with them by the use of a FORMAT statement in a procedure or in a DATA step.Just creating formats with some PROC FORMAT code doesn't automatically link the formats you create to any variables you may use in subsequent procedures.The word "subsequent" here is quite important.It may seem obvious (but it's not) that you have to create a format before you can associate it with a variable.This chapter introduces some of the concepts necessary for a basic understanding of the format/informat creation process.

Formatting Values in a Questionnaire

In this first example, you have collected questionnaire data and you want to create a finished report showing more meaningful descriptions rather than the numeric codes that were used for data entry Your raw data file contains:

Your raw data file contains Scales which express a range of attitudes using numbers such as 1 to 5 are sometimes referred to as Likert scales,named after a psychometrician who published studies on how to measure attitudes.The five-point Likert scale you are using is: l=strongly disagree, 2=disagree, 3=no opinion, 4=agree, 5=strongly agree. You also want to create a new variable called AGEGROUP which groups age into 20 year intervals up to age 60, and includes all ages above 60 in a single group.

The formats you need are not supplied with the system, so you have to create them.This is done with PROC FORMAT, wherein you create one or more formats, each one created with a VALUE statement. Here is the code to produce the formats you need, as well as the DATA step to create the data set and the PROC PRINT to produce the finished report.

Example

PROC FORMAT;
VALUE GENDER 1 = 'Male'
2 = 'Female'
. = 'Missing'
OTHER = 'Miscoded';
VALUE $RACE 'C' = 'Caucasian'
'A' - 'African American'
'H' - 'Hispanic'
'N' = 'Native American'
OTHER = 'Other'
' ' = 'Missing';
VALUE $LIKERT '1' = 'Str dis'
'2' = 'Disagree'
'3' = 'No opinion'
'4' = 'Agree'
'5' = 'Str agree'
OTHER * ' ';
VALUE AGEGROUP LOW-<20 = '< 20'
20-<40 = '20 to <40'
40-<60 = '40 to <60'
60-HIGH = '60+';
DATA QUESTION;
INPUT ID $ 1-2
GENDER 4
RACE $ 6
AGE 8-9
SATISFY $ 11
TIME $ 13;
FORMAT GENDER GENDER.
RACE $RACE.
SATISFY TIME $LIKERT.;
AGEGROUP=PUT {AGE,AGEGROUP.);
DATALINES;
01 1 C 45 4 2
02 2 A 34 5 4
03 1 C 67 3 4
04 N 18 5 5
05 9 H 47 4 2
06 1 X 55 3 3
07 2 56 2 2
08 20 1 1
RUN;
PROC PRINT DATA=QUESTION NOOBS;
TITLE 'Data listing with formatted values';
RUN;

This example demonstrates many of the basic features of PROC FORMAT and the use of formats in general.The first format you create, GENDER, associates the text strings 'Male'and'Female'(placed in single or double quotation marks) with the numeric data values 1 and 2, respectively.

Missing values (.) are represented as 'Missing,' and all other values (OTHER) are identified as 'Miscoded.' This is a numeric format. Note that this format is not connected to any variable in any data set; it just exists waiting to be called into use. This is true of all the formats you create and is a point worth repeating.

The next format you create, $RACE, is a character format. Here you begin the format name with a dollar sign ($) to denote it as a character format (to be used with character variables) and place the data values to be formatted in single or double quotation marks. RACE codes other than 'C', 'A', 'H', or 'N' are associated with 'Other,' and missing values (' ') are associated with 'Missing.' The $LIKERT and AGEGROUP formats are created similarly.In the numeric AGEGROUP format, you assign inclusive ranges of values to specific labels.This is used in the DATA step to create a new grouped variable.

At this point in the code, the formats are created. Now you have to use them. The DATA step reads in the raw data and permanently makes the following format assignments: format GENDER.to variable GENDER, format $RACE. to variable RACE, and format $LIKERT.to variables SATISFY and TIME. Note that $LIKERT.was assigned to more than one variable — perfectly legal.

Of much more importance to note is the period (.) after the format names in the FORMAT statement. You do not define the formats with a period in the PROC FORMAT code,but you must include the period whenever you refer to the format in subsequent code.This is a very common beginning mistake, and most of you will make it. We did, and we still do from time to time.Since you link the variables to the formats in the DATA step, you do not have to do so in the PROC PRINT code.

To finish the example,you use the AGEGROUP.format with a PUT function to create a character variable (also conveniently called AGEGROUP) based on the formatted values of the numeric variable AGE (see Chapter 2, "Data Recoding," Example 3, for more information on this technique).The listing from this program follows:

Output from Example - Formatting Values in a Questionnaire

Notice that PROC FORMAT produces no output of its own, with one optional exception which is covered in the last two examples in this.

Encountering a Subtle Problem with Missing Values, Formats, and PROC FREQ

In the previous example,you used formats to either display a translated version of a variable's values, or to create a new variable as a grouped version of another variable. Formats can also be used to aggregate data values together for group processing and presentation in procedures that operate on groups of values such as PROC FREQ,a procedure that counts the number (frequency) of occurrences for each formatted value of a variable.

This process of labeling or grouping data values using a format usually has no adverse side effects; however,unexpected results can occur in PROC FREQ, specifically when formatting a variable that contains missing values. Let's look at an example that demonstrates this rather subtle problem.

Example

PROC FORMAT;
VALUE BADFMT 1='ONE'
2='TWO'
OTHER='MISCODED';
RUN;
DATA TEST;
INPUT X Y;
DATALINES;
1 1
2 2
5 5
3 .
;
PROC FREQ DATA-TEST;
TABLES X Y;
FORMAT X Y BADFMT.;
RUN;

Output from Example - Encountering a Subtle Problem with Missing Values, Formats, and PROC FREQ

Notice that Output 2 shows the category MISCODED with a frequency count of 2 for the variable X. This represents the two data values that fall into the special format category called OTHER, namely the 5 in Observation 3,and the 3 in Observation 4. Next, look at the output for the variable Y.Notice that there is no longer a row with a label of MISCODED. Why not? The two values in Observations 3 and 4,namely the 5 and the period (missing), are neither 1 nor 2, and therefore should (you would think) be included in the OTHER category.Instead of this, however, the output shows Frequency Missing = 2 at the bottom. Even though there was only one observation that contained a missing value for the variable Y(Observation 4),PROC FREQ reports two missing values.

The reason for this is a little convoluted, but worth the struggle to understand.In this example,since missing values do not have their own separate range in the BADFMT. format, they automatically fall into the OTHER category.Unfortunately,the inclusion of missing values in this OTHER category declares the entire category as missing.The reason is that when a range of values,or a group of discrete values, are grouped together in a single format, all the values in the range or series are represented by the lowest value in the range or series.

Since missing is the lowest value possible,the inclusion of missing in a range, or series, of values translates all the values into missing for formatting purposes.This means that if any observation contains a missing value,all the observations that fall into the OTHER range are treated as missing.PROC FREQ determines the number of missing values after formatting has occurred. Most other procedures that work with grouped data values, such as PROC MEANS and PROC TABULATE,check for missing values before formatting.

Resolving the Subtle Problem

To prevent the intermingling of missing and non-missing values,and the resulting incorrect designation of some non-missing values as missing,simply include a separate range for missing values as follows:

Example

«-
PROC FORMAT;
VALUE GOODFMT . = 'MISSING'
1 = 'ONE'
2 = 'TWO'
OTHER = 'MISCODED
.

The new format is able to distinguish between missing values and out-of-range values,and this solution provides you with acceptable output as shown in Output 3:

Output from Example - Resolving the Subtle Problem

You are now accurately categorizing the missing value as missing, as is noted in the Frequency Missing = 1 note.

Checking for Invalid Values: A DATA Step Approach (Setting Invalid Values to Missing)

When you have raw data that you want to turn into a SAS data set,and you want to change the format in which some of the data values are stored,you can do it in one of two ways.You can either read the data as they occur in their raw form,insert them into the SAS data set you are building,and then change the values via conditional assignment statements, or you can "prescreen" the raw data on the way into the data set and make changes to the values before they are stored as SAS variable values.

The latter method is usually used for efficiency purposes. In this example, you use both methods.First, you read raw data into a SAS data set as they appear and change the stored values.Then you use an alternative method to change the data as they are being read in.

Suppose you have data coded as 'M' for male and 'F' for female, and you want to set all values other than 'M' or 'F' to a missing value (blank). A straightforward way to do this is to simply supply some logic statements in the DATA step as follows:

Example

DATA SCREEN4 ;
INPUT ID 1-3
GENDER $ 4 ... ;
IF GENDER NOT IN ('M','F')
THEN GENDER = ' ';
*

The IF statement with the IN operator as shown is equivalent to:

IF GENDER NOT='M' AND GENDER  NOT='F' THEN GENDER=' ';

The result of this code is that all values of GENDER other than 'M' or 'F',including missing values (' '), are first read into data set SCREEN4,and then changed into a missing value (blank).Each value already coded as blank is "changed" into itself.

Checking for Invalid Values: A DATA Step Approach (Separating Invalid and Missing Values)

The previous strategy might be sufficient, but most likely you would still want to differentiate between invalid data and missing data. An alternative and slightly more sophisticated approach,but one that still has to read in data and then translate them,is to keep track of invalid codes separately from the missing responses. The following code accomplishes this:

Example

DATA SCREEN5;
INPUT ID 1-3
GENDER $ 4 ... ;
IF GENDER NOT IN ('M','F','')
THEN GENDER='X';
*

In this revised code, values for GENDER that are invalid and non-missing are coded as an 'X'

Using a User-Created Informat to Filter Input Data (Setting Invalid Values to Missing)

Now you take a different approach and use a simple and elegant method that yields, but performs the process in a more efficient manner by changing the data on the way into the data set.Here you filter the data values as they are read in and do the necessary conversion at INPUT time by first creating a user-defined informat. As before,you proceed using two alternate methods. First, you will handle all values other than 'M' or 'F' (including missing) as a single group.The next example separates missing from invalid data. Here is the code for the first method:

Example

PROC FORMAT;
INVALUE $GENDER 'M', 'F'= __SAME_
OTHER='""';
RUN;
DATA SCREEN3;
INPUT €1 ID 3.
€4 GENDER $GENDER1. ;

Here's how it works.You first define an informat in the PROC FORMAT code by using an INVALUE statement instead of a VALUE statement.Since you are defining an informat that will result in a character value, you specify an informat name that starts with a dollar sign ($).(If you wanted the result to be a numeric value, you would leave off the beginning $.)Informat names can only be seven characters in length, including the $ if it is used.

This is in contrast to format names, which can be up to eight characters in length (including the $ if it is used.) The reason is that the system needs room to add an internal tag to the front of the name declaring it to be an informat - it actually adds a @ which you can see in system generated messages.

Following the informat name, you indicate specific values and/or ranges of values on the left and their corresponding resultant informatted values on the right (of the =).If a value is read in that matches a value in the list or lies within a specified range, the informatted value gets assigned to the variable.

In this example, if you wanted to store expanded versions of the 'M' and 'F', you might have coded 'M' = 'MALE', 'F' = 'FEMALE'. Since you are satisfied with the one- character data values, you choose not to change them. You accomplish this by using the keyword _SAME_ which instructs the system to leave these values intact. Values other than an 'M' or an 'F' are, however, set equal to missing because of the OTHER = ' ' assignment.

There are also other special key words that can be used in the range specifications on the left of the = such as HIGH and LOW.These can be used when creating formats as well as informats. Although you create your informat as $GENDER, you write it as $GENDER1. in the INPUT statement. Just as you do with formats, you must include a period(.) at the end of the informat when you use it in a subsequent DATA step but not when you create it.When creating informats, or formats, you cannot end the name with a number.

You can, however, append a number to user-defined, as well as system-supplied, formats and informats when you use them in subsequent DATA or PROC steps. With formats,the number determines the display width of the formatted variable. With informats, the number specifies how many characters to read from the input record. The default length for an informat is the longest informatted value (in this case it is equal to 1).

It is good coding practice to follow all user-defined informats with a number to ensure that you read the correct number of columns from the input record. You should, however,realize that like a LENGTH statement,this can also establish the length of a variable.

Using a User-Created Informat to Filter Input Data (Separating Invalid and Missing Values)

To keep track of miscoded values separately from missing values, you can write:

Example

PROC FORMAT;
INVALUE $GENDER 'M', 'F',' '=_SAME_
OTHER ='X'; -
RUN;
DATA SCREEN7;
INPUT II IB 3.
€4 GENDER $GENDER1. ;

Here you add a blank (a missing character value) to the first list and code all other values to 'X'.Thus, the values 'M', 'F', and ''(missing) are not modified, while all other values are assigned a value of 'X'. You can use informats for numeric as well as character values.The next example demonstrates such a use.

Checking Ranges for Numeric Variables

In this example you provide two informats for two numeric variables.These informats set values outside of the normal range for the variables equal to missing values. Here is the code:

Example

PROC FORMAT;
INVALUE SBPFMT 40 - 300=_SAME_
OTHER = .;
INVALUE DBPFMT 10 - 150=SAME_
OTHER =.;
RUN;
DATA FORMAT8;
INPUT $1 ID $3.
£4 SBP SBPFMT3,
@7 DBP DBPFMT3.;
DATALINES;
001160090
002310220
003020008
004 080
005150070
;
PROC PRINT DATA=FORMAT8;
RUN;

You create two informats,one for SBP (systolic blood pressure — the larger of the two numbers in a blood pressure reading, e.g. the 120 in 120/80) and one for DBP (diastolic blood pressure — the lower of the two numbers).SBP values below 40 or above 300 are set equal to missing (.), as are DBP values below 10 or above 150. Here is the PROC PRINT output from this code.

Output from Example - Checking Ranges for Numeric Variables

Notice that the out-of-range values are all set to missing values.

Using Different Missing Values to Keep Track of High and Low Value

You can easily extend to provide additional information concerning out-ofrange values. To do this, you use two alternate missing values,namely .H and .L. The use of numeric missing values other than a single period (.) is another underused SAS System feature.Besides the single period that specifies a missing value, there are 27 additional missing value designations:._ (that's a period followed by an underscore) and .A through .Z. In this example, you use separate missing values to represent high and low out-of-range values.You are then able to produce counts of these out-of-range values, and at the same time, you are able to compute statistics without including them. Here is the program:

Example

PROC FORMAT?
INVALUE SBPFMT LQW-<40 ».L 0
40-300 =_SAME_
3Q1-BIGH*.H?
INVALUE DBPFMT LOW-<10 «.L
10-150 =_SAME_
151-HIGH=*.H;
VALUE CHECK .H»'High' 0
.L='LOW
. ='Missing'
OTHER*'Valid'',
RON;
DATA FORMATS;
INPUT fl ID $3.
S4 SBP SBPFMT3.
€7 DBF DBPFMT3.;
DATALINES;
001160090
002310220
003020008
004 080
005150070

/
PROC PRINT DATA«FORMAT9 NOOBSj ©
TITLE 'Listing from Example 9';
RUN;
PROC FREQ DATA=FORMAT9;
FORMAT SBP DBP CHECK.?
TABLES SBP DBP / MISSING NOCUM; 0
RUN;
PROC MEANS DATA-FORMAT9 N MEAN MAKDBC«1;
VAR SBP DBP;
RUN;

The output from this program follows.

Output from Example  - Checking Ranges for Numeric Variables

You conveniently choose .H and .L as the special missing values for high and low out-ofrange values.1 You alternatively could have chosen any of the 27 available missing values such as .A and .B to represent the low and high values. Values in the valid ranges are untouched because you specify _SAME_ for your informatted value.

In the PROC FORMAT code,you also use a VALUE statement © to create an output format so that you can label the appropriate missing values. You use this format to label the variables SBP and DBP in the PROC FREQ output, but not in the PROC PRINT listing 0,so that you can see the actual "internal" values stored in the SAS data set. Note that the missing SBP value for ID number 004 is truly missing, whereas the out-of-range missing values are stored as the appropriate special missing values.

In this example,you use the MISSING and NOCUM options 0 on the TABLES statement of PROC FREQ.Without the missing option, the missing values would only be noted at the bottom of the table in the form Frequency Missing = .When the MISSING option is specified, the missing value(s) are listed in the body of the frequency table, and the frequencies and percentages are calculated based on all the observations, including the ones with missing values. You use the NOCUM option to eliminate the Cumulative Frequency and the Cumulative Percent columns.

PROC MEANS is run to demonstrate that no distinction is made between different missing values when statistical calculations are made. A missing value by any other name is still missing.

Creating and Using an Enhanced Numeric Informat

Here is an elegant way of reading a mixture of numeric and character data using what is called an enhanced numeric informat. This type of format became available in Release 6.07 of the SAS System. With this type of informat, you have the ability to specify numeric or character ranges.Here is the temperature example from ,using this type of informat:

Example

PROC FORMAT;
INVALUE TEMPER 70-11Q=JSAME_
'N' =98,6
OTHER=.;
RUN;
DATA TEST;
INPUT TEMP : TEMPER. @@;
DATALINES;
99.7 N 97.9 N N 112.5
;
PROC PRINT DATA=TEST NOOBS;
TITLE 'Temperature Listing';
RUN;

Output from Example - Creating and Using an Enhanced Numeric Informat

As you can see from the listing,the N's were converted to 98.6 and the out-of-range value was set to missing.

Using a SAS Data Set to Create a Character Format

In this example,you use a set of variables in an existing SAS data set to create a format.This is most useful when you have a long list of items and expansions, such as ICD-9 codes and descriptions.For those not familiar with the term, ICD-9 stands for the International Classification of Diseases, ninth revision. Here is a very short portion of the entire list.


If you have a relatively short list such as this one, you can go ahead and write a VALUE statement list this:

Example - Using PROC FORMAT Directly

PROC FORMAT;
VALUE $ICDFMT '072'»'Mumps'
'410'='Heart Attack'
'487'='Influenza'
'493'='Asthma'
'700'='Corns';
RUN;

When the list of codes is hundreds of entries long as it is with the ICD-9, or even thousands of entries long, this gets to be tedious.There is a better way — the CNTLIN option of PROC FORMAT.By using a SAS data set which contains the codes and descriptions needed, and which meets very rigid structural and naming conventions, you can have PROC FORMAT read this data set (which is called a control data set) and automatically create a format for you. Standard sets of codes and descriptions are usually available in some electronic format which can easily be transformed into a SAS control data set as the first step in creating a very useful data formatting tool.

We demonstrate this technique with two simple examples; the current one creates a character format and the next example creates a numeric format.These two short examples serve merely to introduce this powerful facility.

The following program first converts the previous ICD-9 raw data subset to a SAS data set, then to a SAS control data set, and then to a user-defined SAS format. It then uses the format to associate labels to sample raw data read into another SAS data set and prints out the formatted values. The example is a little cumbersome for educational purposes,and it is explained after the code.

Example - Using a CNTLIN Data Set

DATA CODES; 1
INPUT $1 ICD9 $3. 2
@5 DESCRIPT $12.?
DATALINES;
072 MUMPS
410 HEART ATTACK
487 INFLUENZA
493 ASTHMA
700 CORNS
;
DATA CONTROL; 3
RETAIN FMTNAME '$ICDFMT' 4
TYPE 'C' ;
SET CODES (RENAME*(ICD9«START ©
DESCRIPT-LABEL));
RUN;
PROC FORMAT CNTLIN=CONTROL; 6
RUN;
DATA EXAMPLE; 0
INPUT ICD9 $ @@;
FORMAT ICD9 $ICDFMT.J 8
DATALINES;
072 493 700 410 072 700
;
PROC PRINT NOOBS DATA=EXAMPLE; 9>
TITLE 'Using a Control Data Set";
VAR ICD9;
RUN;

Well now you've seen the code, and if that didn't frighten you away, here is the explanation.The first step is to turn the raw data into the SAS data set CODES to contain the collection of codes and descriptions as SAS variables O.Pretty straight forward stuff.You could have made things easier on yourself by naming the variables in the CODES data set START and LABEL, the names required for the control data set and the CNTLIN feature of PROC FORMAT, which is what this example is all about.

You choose however,to give them other, more realistic names 0 and then rename them 5.This makes the coding more general,and more like the situation you may encounter if you ever need to convert an existing SAS data set with previously assigned variable names into a control data set. The next step is to create the control data set (which you cleverly call CONTROL) ©.The control data set consists of one observation for each pair of codes and  descriptions.Each observation must contain a specific set of variables with prescribed names These are as follows:

Variable Name Description of the Variable

You use a RETAIN statement 0 to assign values to the variables FMTNAME and TYPE since they are the same on every observation.You could use assignment statements instead,but the RETAIN statement is more efficient .After the control data set is created, you merely have to tell PROC FORMAT to use it to create a format.This is done with the CNTLIN option 6. You don't even have to name the format you are creating; it's all in the control data set.At this point in the example,you have a character format called $ICDFMT. created and waiting to be used.

You test out your new format by creating another SAS data set called EXAMPLE from a collection of sample ICD-9 data 0. It's important to realize that this represents a set of actually collected data whereas the data used to create data set CODES is a set of all the unique codes that are possible with their descriptions.You permanently assign the formatted values to the codes in the data set EXAMPLE with the FORMAT statement ©.The last step is to print out the data set EXAMPLE ©,with the following result:

Output from Example  - Using a SAS Data Set to Create a Character Format

One very useful option with PROC FORMAT is the FMTLIB option, which gives you a descriptive listing of your format.In the previous program, you could have added this option as follows:

PROC FORMAT CNTLIN=CONTROL  FMTLIB;

This would have generated the following output:

Note that the START value is repeated in the END column because you are not using ranges, but rather discrete values.

Using a SAS Data Set to Create a Numeric Format

The previous technique can also be used to create a numeric format. In this example, you have a long list of numeric codes and country names, and you want to create a format assigning a country name to each of the numeric codes.You create the CNTLIN data set in one step and (cleverly) use the variable names START and VALUE for the numeric codes and the country names respectively. Here is the program:

Example

DATA COUNTRY;
RETAIN FMTNAME 'COUNTRY'
TYPE 'N';
INPUT START 1-2
LABEL $ 3-15;
DATALINES?
01UNITED STATES
02FRANCE
03ENGLAND
04SPAIN
05GERMANY
PRCK: FORMAT CNTLIN=COUNTRY FMTLIB;
RUN;

Since you do not use the format you create to format actual data, the only output you have is from the FMTLIB option. Here it is:

Output from Example - Using a SAS Data Set to Create a Numeric Format

Although control data sets seem complicated at first,when the alternative is writing hundreds (or thousands) of assignment statements,we strongly recommend you consider using them.

Problems

  1. You have a SAS data set containing survey data which is coded as follows: Write the necessary statements to create SAS formats for these variables. Also, write the FORMAT statement you would include in the DATA step to assign the formats you have created to each of the variables listed.
  2. You have a collection of raw data (in external file ZIP) with the following file layout:Use this raw data file to create a character format called $ZIPCODE which will assign town names to ZIP codes.
  3. Here are some sample data:

      08822Flemington
    08903New Brunswick
    78028Kerrville

    Hint: Use the CNTLIN feature of the FORMAT procedure.

  4. You have a SAS data set INVENTRY which contains the numeric variable PART_NO (part number) and the character variable DESCRIPT (part description.) Use this existing data set to create a format called PARTS which associates the part numbers with the descriptions.
  5. Here is part of data set INVENTRY:

Searches relevant to you
Top