Creating More Than One Data Set at a Time - SAS Programming

Now let's once again change the scenario a bit.Suppose the demographic data set,DEMOG2 contains data on all employees,both active and inactive,and the EMPLOYEE data set contains data on only the active employees.You want to see all the data for active employees in one listing,and only the demographic data for the inactive employees in another listing.By introducing a few more tools of the SAS System,you can accomplish your goals easily.As usual,first we show you the code and then the explanation.

Example

PROC SORT DATA=DEMOG2;
BY ID;
RUN;
PROC SORT DATA*EMPLOYEE;
BY ID;
RON;
DATA ACTIVE INACTIVE (KEEP=ID GENDER STATE);
MERGE DEMOG2 EMPLOYEE (IN=ACT);
BY ID;
IF ACT=1 THEN OUTPUT ACTIVE;
ELSE OUTPUT INACTIVE;
RUN;
PROC PRINT DATA*ACTIVE;
TITLE 'Active Employees';
RUN;
PROC PRINT DATA«INACTIVE;
TITLE 'Inactive Employees';
RUN;

Now here's what's happening in the expanded DATA step.First of all, you are creating more than one SAS data set in the same DATA step.You do this by placing both data set names in the DATA statement.You control which variables get written to each data set with KEEP= or DROP= data set options and which observations get written to each one with OUTPUT statements later in the DATA step.

ACTIVE will contain all the variables that are in DEMOG2 and EMPLOYEE,whereas INACTIVE will only contain ID,GENDER, and STATE.The IN= data set option is associated with data set EMPLOYEE and will be true (equal to 1) if a contribution is being made from the EMPLOYEE data set for a given ID.

The next feature to explain is the OUTPUT statement.By default, the end of every DATA step has an implicit OUTPUT statement in it which adds each observation to the data set(s) being built.If an actual explicit OUTPUT statement is contained anywhere in the DATA step, then each observation is added to the data set(s) being built only as explicitly directed by an OUTPUT statement.

The IF ACT=1 part of the IF statement can also be written in a shorthand manner as: IF ACT (since ACT is a logical variable and if true (1), will satisfy the condition of the IF statement.)If the value of ACT is 1, the subsetting IF statement instructs the system to output (add) the observation being built to the ACTIVE data set only,not to the INACTIVE data set.

This will only be the case when EMPLOYEE adds data to the observation.In other words,the ACTIVE data set will only contain observations for those records found in EMPLOYEE.If the condition is not true (no data came from EMPLOYEE, only from DEMOG2),then ACT will be false (equal to 0) and the observation being built will only be output to the INACTIVE data set.

There is also something of importance to note here in the PROC PRINTstatements.You must explicitly tell the system which data set to print with the DATA= option.By default (no DATA= option),the PRINT procedure operates on the last data set created in the code.Since you are creating more than one data set, you have to be explicit. Here is the output:

Output from Example-Creating More Than One Data Set at a Time

Output from Example-Creating More Than One Data Set at a Time

What If the BY Variable Has a Different Name in Each Data Set?

Let's throw another monkey wrench into the mix.Suppose you have to merge two data sets on a common identifier, but this identifier has a different variable name in each of the two data sets.Let's say you have a variable called EMP_ID in data set ONE and a variable called EMP_NUM in data set TWO.

They both represent employee identification numbers but have different variable names. Since you need to have the same variable name in both data sets to use as a BY variable, you have to change one of the names. Easy. Here's how: (let's assume that the two data sets are already sorted,ONE by EMPJD and TWO by EMP_NUM.)

Example

DATA COMBINED;
MERGE ONE TWO (RENAME=(EMP_NUM=EMP_ID));
BY EMP^ID;
RUN;

As you can see,the solution is to use the RENAME=(old_name=new_name) data set option following the data set name containing the variable to be
renamed.The renaming is temporary for data set TWO in this case since it is not being output; it is only serving as input to the merging process being used to build data set COMBINED.

Data set COMBINED contains a variable called EMPJD; it does not contain a variable called EMP_NUM.All of the data set options are placed in parentheses.In addition,some of the options such as RENAME= have the list of old and new variable names in their own set of parentheses.Be careful with the parentheses. "Table Lookup," contains examples of merging with more than one BY variable.

Merging Data Sets That Contain Variables with the Same Name,Not Used as BY Variables

Time for one more complication before getting on to the last topic in this chapter.If both data sets being merged contain a variable of the same name other than the BY variable,the rule is that the variable in question in the resulting merged data set will take on values from the last data set named in the MERGE statement.

This potentially can present a problem if there are missing values in the last named data set. There are numerous ways to code around this problem,such as using the RENAME= data set option to keep both versions of the duplicated variable by renaming one of them,or by using a series of IF-THEN/ELSEstatements to check on missing values.There is one specialized situation,
however, where the SAS System provides a method that deals with this,as well as other possible problems.

DATA COMBINED;
MERGE ONE TWO (RENAME=(EMP_NUM=EMP_ID));
BY EMP^ID;RUN;

All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

SAS Programming Topics