Performing a "Fuzzy" Merge - SAS Programming

This example is slightly more advanced than others in this book, and it covers a somewhat esoteric topic. You may want to skip it for now (or forever), and that's just fine. We use the term "fuzzy" in this example to refer to the merging of similar but not exact matches between two files.

The most common use for "fuzzy" merges is when you use a name as the matching variable. In this example, we merge data from two files, using similar-sounding names and date of birth as the matching variables. SAS System Releases 6.07 and higher support a SOUNDEX function, which follows the algorithm described in D. E. Knuth's book, "The Art of Computer Programming, Volume 3: Sorting and Searching," Reading, MA: Addison-Wesley.

This algorithm discards most vowels and substitutes numbers for groups of like-sounding consonants. The result is that like-sounding words or names will translate to the same SOUNDEX equivalent. In the following program, you use the SOUNDEX function to translate the names in both data sets to their SOUNDEX equivalents and then merge the two data sets by the SOUNDEX name and the date of birth.
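To get a feel for what the function returns, you can run it on a handful of names before attempting a merge. The following is a minimal sketch, not part of the original program: the data set name SOUNDEX_DEMO and the four names are hypothetical and chosen only for illustration. Pairs such as Friedman/Freedman and Smith/Smythe would be expected to share a code.

```sas
/* Hypothetical names, chosen only for illustration */
DATA SOUNDEX_DEMO;
   LENGTH NAME $ 20 S_NAME $ 20;
   INPUT NAME $;
   S_NAME = SOUNDEX(NAME);   /* phonetic code derived from NAME */
DATALINES;
Friedman
Freedman
Smith
Smythe
;
RUN;

/* List each name next to its SOUNDEX code */
PROC PRINT DATA=SOUNDEX_DEMO NOOBS;
   TITLE 'Names and Their SOUNDEX Codes';
RUN;
```

Comparing the S_NAME column by eye is a quick way to see which spellings the function treats as equivalent and which it does not.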

A second matching variable such as date of birth is necessary because matching on like-sounding names alone could pair too many unrelated observations, whereas it is unlikely that two people with like-sounding names in both data sets also share the same date of birth. Depending on the size of the two files, you may need to use additional variables to aid in the merge.

Data sets ONE and TWO are used to illustrate a "fuzzy" merge.
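If you want to run the program yourself, you need two input data sets that each contain NAME and DOB. A minimal sketch of two small test data sets follows. Every value is invented for illustration and is not the book's data, and GENDER and WEIGHT are hypothetical extra variables, so the matches you get will differ from the results discussed in the text.

```sas
/* Hypothetical sample data: all names, dates, and extra variables are invented */
DATA ONE;
   INPUT NAME : $20. DOB : MMDDYY10. GENDER $;
   FORMAT DOB MMDDYY10.;
DATALINES;
Friedman 10/21/1946 M
Smith 03/15/1952 F
Cody 05/01/1960 M
;
RUN;

DATA TWO;
   INPUT NAME : $20. DOB : MMDDYY10. WEIGHT;
   FORMAT DOB MMDDYY10.;
DATALINES;
Freedman 10/21/1946 155
Smythe 03/15/1952 130
Jones 07/04/1955 180
;
RUN;
```

The first two observations in each data set are spelled differently but are intended to sound alike and share a date of birth, while the third observation in each has no counterpart in the other file.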


The program is shown next:

Example

DATA ONE_TEMP;
   SET ONE (RENAME=(NAME=NAME_ONE));
   S_NAME = SOUNDEX(NAME_ONE);
RUN;
DATA TWO_TEMP;
   SET TWO (RENAME=(NAME=NAME_TWO));
   S_NAME = SOUNDEX(NAME_TWO);
RUN;
PROC SORT DATA=ONE_TEMP;
   BY S_NAME DOB;
RUN;
PROC SORT DATA=TWO_TEMP;
   BY S_NAME DOB;
RUN;
PROC PRINT DATA=ONE_TEMP NOOBS;
   TITLE 'Data Set ONE_TEMP';
RUN;
PROC PRINT DATA=TWO_TEMP NOOBS;
   TITLE 'Data Set TWO_TEMP';
RUN;
DATA BOTH;
   MERGE ONE_TEMP (IN=ONE)
         TWO_TEMP (IN=TWO);
   BY S_NAME DOB;
   IF ONE = 1 AND TWO = 1;
   FORMAT DOB MMDDYY8.;
RUN;
PROC PRINT DATA=BOTH NOOBS;
   TITLE 'Data Set BOTH';
RUN;

You start out by creating two new data sets, ONE_TEMP and TWO_TEMP, from the two data sets you want to merge, ONE and TWO. Each of the new data sets contains all the variables from the original plus a new variable, S_NAME, which is the SOUNDEX equivalent of NAME.

You rename the variable NAME in each of the data sets (to NAME_ONE and NAME_TWO) so that you keep the original names from both data sets and can thereby see which names were actually matched. If you did not rename the variable NAME, then only the value of NAME from data set TWO would remain in the merged data set. The new data sets are used for your merging process but are not saved.

Here are the two data sets ONE_TEMP and TWO_TEMP followed by the resulting merged data:


There are three matches between these two data sets where both the SOUNDEX equivalent and the date of birth are the same. Keep in mind that this program is only an example of how to match observations from multiple files on inexact criteria and is intended only to serve as a model of how "fuzzy" matching is performed. You can expect a certain percentage of incorrect matches with procedures such as this, and you must test carefully to determine the best way to perform such a matching task.
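One way to begin that testing is to list the observations that did not find a partner, so that you can review them by hand. The following is a minimal sketch, assuming ONE_TEMP and TWO_TEMP still exist and are sorted by S_NAME and DOB as in the program above. The data set name NO_MATCH and the variable SOURCE are hypothetical names introduced here for illustration.

```sas
/* Assumes ONE_TEMP and TWO_TEMP are already sorted by S_NAME and DOB. */
/* NO_MATCH and SOURCE are hypothetical names used only in this sketch. */
DATA NO_MATCH;
   MERGE ONE_TEMP (IN=ONE)
         TWO_TEMP (IN=TWO);
   BY S_NAME DOB;
   LENGTH SOURCE $ 8;
   IF ONE AND NOT TWO THEN SOURCE = 'ONE only';
   ELSE IF TWO AND NOT ONE THEN SOURCE = 'TWO only';
   ELSE DELETE;              /* matched observations are dropped here */
   FORMAT DOB MMDDYY8.;
RUN;

PROC PRINT DATA=NO_MATCH NOOBS;
   TITLE 'Observations Without a Match';
RUN;
```

Reviewing this listing alongside data set BOTH can help you judge whether the matching variables are too strict (true matches left unmatched) or too loose (unrelated observations paired).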

