Reading in Parts of Raw Data Files - SAS Programming

Until now, you have always read the entire raw data file as input to your program. That is usually what is desired, but there are cases in which only part of the raw data file needs to be read. You can read a specified number of records from the beginning of the file, from the end of the file, or from the middle of the file.

When developing a program, it is always a good idea to work with a small subset of the data until the code is working exactly as desired. The easiest thing to do in the development stage is to read only the first n records of a file by using the OBS=n option in the INFILE statement.

The value of n is actually the record number of the last record to be read, but if you are starting from the beginning of the file, then OBS=n also gives you exactly n records.

Reading the First 100 Records

Suppose you want to read only the first 100 records of a file called BIGDATA. The following INFILE statement accomplishes this:
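A minimal sketch of such a statement inside a DATA step. The data set name FIRST100 and the variables X, Y, and Z are hypothetical, since the layout of BIGDATA is not described here:

```sas
DATA FIRST100;
   INFILE 'BIGDATA' OBS=100;   /* stop after record number 100 */
   INPUT X Y Z;                /* hypothetical variables, list input */
RUN;
```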


This could also have been accomplished globally by using an OPTIONS statement of the form:
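Assuming the same limit of 100 records, the global form would look like this:

```sas
OPTIONS OBS=100;   /* applies to every subsequent step until changed */
```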


However, using the OBS= option in an INFILE statement lets you control how many records are processed from each individual raw data file when more than one is being read (we'll get to how to read multiple data files in the next example).

There are many processing options that can be set via the OPTIONS statement. The OPTIONS statement can appear before or after any DATA step or procedure. The included options take effect from that point onward and remain in effect until they are changed by a subsequent OPTIONS statement or you exit the SAS System. Be forewarned: don't forget to reset the OBS= option when you no longer want or need it. To set the observation limit back to the default value, use:
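The reset uses the keyword MAX, which is the default value of the OBS= system option:

```sas
OPTIONS OBS=MAX;   /* remove the record limit for subsequent steps */
```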


Skipping the First n Records of a File

By default, the INPUT statement begins reading from the first data line encountered in the raw data file. This does not have to be the case. You can tell the SAS System which record to read first by using the FIRSTOBS= option in the INFILE statement. From that point on, records are read sequentially.

Suppose your file were constructed in such a way that the first 100 records were preliminary and not part of the real data set. After using these first 100 records for development as above, you could then start your real analysis with the next record as follows: INFILE 'BIGFILE' FIRSTOBS=101;

Reading a Subset from the Middle of a File

You can also work with subsets of records from the middle of a raw data file. If, for example, your data set were constructed in such a way that you wanted to analyze successive subsets of 100 records, you could do so by picking out 100 records at a time.

The FIRSTOBS= option denotes the first record to read, and the OBS= option denotes the last record to read. The thing to remember is that OBS= does not refer to the number of records to read; it refers to the record number of the last record to read. So, to read the second 100 records in the raw data file, code the INFILE statement as follows:
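A sketch of the combined options, reusing the file name BIGFILE from the previous section. The data set name SECOND100 and the variables X, Y, and Z are hypothetical:

```sas
DATA SECOND100;
   /* FIRSTOBS=101 is the first record read, OBS=200 is the last,
      so exactly 200 - 101 + 1 = 100 records are processed */
   INFILE 'BIGFILE' FIRSTOBS=101 OBS=200;
   INPUT X Y Z;   /* hypothetical variables, list input */
RUN;
```

Writing OBS=100 here instead of OBS=200 would read no records at all, because the last record to read would come before the first.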


A Special Caution with Short Records in an External File

If you have missing values at the end of a record of an external file, you have to be very careful. Suppose you have a raw data file called HT_WT that contains values for an ID number, HEIGHT, and WEIGHT. Suppose further that you are missing either weights, or both heights and weights, for some of the subjects. A typical file might look something like this:

File HT_WT
001 68 155
002 64
003
004 72 220

Notice that observation 2 is missing a value for WEIGHT and observation 3 is missing both HEIGHT and WEIGHT. If these lines are not padded with blanks to mark the missing values, you need to use an INFILE option so that they are read correctly.

By the way, just looking at the file on your computer screen may not tell you whether there are blanks in these missing positions. You need an editor that shows you where the records end to see if there are blanks or not.

To be on the safe side, you should use the PAD or MISSOVER option in the INFILE statement to be sure that external files with short records are read correctly. The MISSOVER option is used with list input. It instructs the program not to go to a new record if all the variables have not been assigned values when the end of the current record is reached. Instead, variables that have not received values are set to missing. The PAD option is used with fixed record layouts and pads all short records with blanks to the length specified by the logical record length (see Example ). For example, to read data from the HT_WT file, you use the following INFILE and INPUT statements:
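A sketch of the fixed-column version, assuming the column positions implied by the sample listing (ID in columns 1-3, HEIGHT in 5-6, WEIGHT in 8-10). The data set name HTWT_PAD and the choice to read ID as character are assumptions:

```sas
DATA HTWT_PAD;
   INFILE 'HT_WT' PAD;     /* pad short records with blanks to the record length */
   INPUT ID     $ 1-3      /* read as character to keep leading zeros (assumption) */
         HEIGHT   5-6
         WEIGHT   8-10;    /* blank columns are read as missing */
RUN;
```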

or, using list input:
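A sketch of the list-input version under the same assumptions (hypothetical data set name HTWT_LIST, ID read as character):

```sas
DATA HTWT_LIST;
   INFILE 'HT_WT' MISSOVER;   /* do not move to the next record for missing values */
   INPUT ID $ HEIGHT WEIGHT;  /* variables left without values are set to missing */
RUN;
```

Without MISSOVER, list input would continue onto the next record to find a value for WEIGHT when a line ends early, which would misalign the remaining observations.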

All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd