Subsetting a SAS Data Set: Selecting Observations That Meet Certain Conditions - SAS Programming

Suppose you have employee records in a SAS data set EMPLOY that contains the variables ID, GENDER(l=male, 2=female), DEPT(department code), SALARY (yearly salary),YEARS(number of years employed with the firm), and STATE 2 digit state of residence code).You want to create a new SAS data set, NJEMPLOY, which is a subset of EMPLOY made up of only those employees who live in New Jersey.You want NJEMPLOY to contain all the variables from the original EMPLOY file.There are several ways to accomplish this. Here's one:

Example

LIBNAME MARY 'C:EMPLOYEEJQBDATA';
DATA NJEMPLOY;
SET MARY.EMPLOY;
IF STATE EQ 'NJ';
RUN;

Reading Observations from an Existing SAS Data Set

Before we discuss the process of selecting a subset of data,we have to talk about how you access the existing SAS data set that you are going to subset. This process is handled by the SET statement.Think of the SET statement as being similar to an INPUT statement,with the difference being that a SET statement reads records from a SAS data set while an INPUT statement reads raw data from an external file (or instream data).

In this case,the data being used to create the SAS data set NJEMPLOY are being read in from the existing SAS data set EMPLOY.But what about that LIBNAME statement? In,"INPUT and INFILE,"you used a FILENAME statement to create a fileref (file reference).This enabled you to access raw data stored external to the code you were writing.

The LIBNAME statement is similar in that it creates a libref (library reference) to refer to an external SAS data library where you store SAS data sets (along with other SAS entities).Here,the LIBNAME statement creates the libref called MARY that refers to the SAS data library called
C:EMPLOYEEJOBDATA which contains a SAS data set called EMPLOY,possibly among other things. If it all seems a little confusing,it's because it all is a little confusing. Don't worry though; it all does make sense and will become second nature to you.

Subsetting IF Statement

When an IF statement is used in this context, it is called a subsetting IF statement. It is shorthand for,IF NOT condition THEN DELETE;(i.e., if this condition is true,continue processing the remaining statements in the DATA step; if not, delete this observation and return to the top of the DATA step).

In this case,it stands for IF NOT (STATE EQ 'NJ') THEN DELETE;.The resulting SAS data set NJEMPLOY contains only those observations from EMPLOY where STATE is equal to NJ.Subsetting IF statements often do the job quite adequately, but there is an alternative method called a WHERE statement.(There is also a WHERE data set option available, but this time you use the statement version.) Here is the code to accomplish the subsetting you want by using a WHERE statement:

Example

LIBNAME MARY 'C:EMPLOYEEJOBDATA';
DATA NJEMPLOY;
SET MARY.EMPLOY;
WHERE STATE EQ 'NJ'j
RUN;

fferences between IF and WHERE Statements

This program is identical to the previous example except that the sub setting IF statement is replaced with a WHERE statement.Are there important differences? Both of these programs produce identical results,but there are differences between IF and WHERE statements.

The WHERE statement may be more efficient then the subsetting IF (especially if you are taking a very small subset from a large file) because it checks on the validity of the condition before the observation is brought into a temporary holding area, whereas the subsetting IF statement brings in the entire observation and then checks the condition to see if the observation is to be kept or not.

This temporary holding area is called the program data vector (PDV). A WHERE statement can only be used with variables in the existing data set, whereas a subsetting IF statement can be used with raw data as well.

Another difference between a subsetting IF statement and a WHERE statement may surface when you use the FIRST.and LAST.logical variables when the WHERE condition is not true,the observation is not brought into the PDV,and therefore it does not affect the logical values of the FIRST.and LAST.variables.Another major difference between IF and WHERE statements is that you may include WHERE statements in SAS procedures! For example,if you have a data set called ALL (containing the variables ID, SEX, and SALARY),and you want a listing only for MALES (M), you could

code this as:
PROC PRINT DATA=ALL;
WHERE SEX = 'M';
RUN;

This saves you the work of creating a new data set just to obtain your listing

WHERE Statement Operators

Before we consider variations on this example,let's discuss some of the special WHERE statement operators.Suppose you want to select only people between the ages of 20 and 40 (inclusive).There are several ways to code this using the WHERE statement.You could write:

WHERE AGE GE 20 AND AGE LE 40;
or,
WHERE 20 LE AGE LE 40;

The BETWEEN values can be constants (numeric or character) or expressions.

Some useful WHERE operators and their actions are as follows:

Subsetting a SAS Data Set: Selecting Observations That Meet Certain Conditions

Subsetting a SAS Data Set: Selecting Observations That Meet Certain Conditions

Subsetting a SAS Data Set: Selecting Observations That Meet Certain Conditions

KEEP and DROP Statements

Now let's get back to the example at hand. Let's modify this program in several ways. First,suppose that you only require the variables ID and DEPT in the new data set.You can easily accomplish this with a KEEP statement or a KEEP= data set option.First,let's see how KEEP and DROP statements work.

When you create a new data set from an old one, you may want to keep only selected variables.You have several ways to do this.A KEEP statement placed anywhere in the DATA step causes only those variables listed to be written to the newly created SAS data set. For example, if you want to keep only ID and DEPT in the new data set (and also want to do the NJ subsetting),you can do it like this:

Example

LIBNAME MARY 'CtXEMPIiOYEEXJOBDATA';
DATA NJEMPLOY;
SET MARY.EMPLOY;
WHERE STATE EQ 'NJ';
KEEP ID DEPT;
RUN;

Data set NJEMPLOY contains only the variables ID and DEPT and only those observations for which the STATE code is equal to NJ.If you want to keep most of the variables from the old data set in the one you are creating,you may choose to use a DROP statement to exclude the variables you don't want.

There is absolutely no difference between keeping the variables you want or dropping the variables you don't want.The choice usually depends on which list is shorter to write.You can replace the previous KEEP statement with the equivalent DROP statement:

DROP GENDER SALARY YEARS STATE;

Since SAS programmers (actually, all programmers that we know) don't like to type any more than necessary,the KEEP statement is the preferred one to use here.When in doubt about which to use, give the edge to the KEEP statement since it is clearer to see which variables will be in the new data set.

KEEP= and DROP= Data Set Options

An alternative to a KEEP or DROP statement is a KEEP= or DROP= data set option.The variables to be kept or dropped are listed following the data set name,in parentheses.Following is a program that produces identical results to the previous one except that it uses a KEEP= data set option instead of a KEEP statement.In this case,the data set option is more efficient than the statement version.First we show the code and then the explanation.

Example

LIBNAME MARY ' C:EMPLOYEEJOBDATA' ;
DATA NJEMPLOY;
SET MARY.EMPLOY (KEEP=ID DEBT STATE);
WHERE STATE EQ 'NJ';
DROP STATE;
RUN;

The KEEP= data set option instructs the DATA step to read only the variables ID,DEPT, and STATE into the PDV (remember this stands for the program data vector) for processing.Only those variables listed in the KEEP= option list are available in the DATA step, and only those same variables are written out to the SAS data set you are creating.

Using the KEEP= data set option instead of the KEEP statement allows the software to deal with fewer variables in the PDV and is therefore more efficient. Now you understand the difference between the KEEP= data set option and the KEEP statement.

But why include STATE in the KEEP= data set option if you only want to keep DEPT and ID in the new data set you are building? The answer lies in the WHERE statement. In order to test the value of STATE, the SAS System has to have it included in the PDV.

If you omit STATE from the KEEP= list,the WHERE statement will not work, and the log will show you the error of your ways.Since you do not want the variable STATE in your new data set,use a DROP statement to indicate that it is not to be written to the data set NJEMPLOY.For the last subsetting example, suppose you want to choose records from the states of NY, NJ, FL, and CA. You can do this the long way as follows:

WHERE STATE EQ'NJ' OR STATE EQ'NY' OR
STATE EQ'FL' ORSTATE EQ 'CA';

or you can do it the short way like this:

WHERE STATE IN ('NJ','NY','FL','CA');

The two expressions produce identical results.The IN operator also works with numeric variables.To select observations for ID values 3, 7, and 9, you would write:

WHERE ID IN (3, 7, 9);
or
WHERE ID IN (3 7 9);

You may use either commas or spaces to separate the numeric values.


All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

SAS Programming Topics