Saturday, December 29, 2007

Tools of the Trade II: Using SAS to Extract the Data

In this post, we will be using SAS to read in the input data from different file formats.

Reading data from a text file
Reading data from a text file is the most basic and popular way to read in data for a SAS program.Usually this is how analysts first learn how to read in data into their programs. It gives you the greatest control while reading in the data, and you can read in as much data as you want. Let's assume that there is a text file named 'hsbctxns.txt' saved in a folder called hsbcdata in your computer's C\: drive. The code for reading that file is given below:

data raw; infile
"C:\hsbcdata\hsbctxns.txt" missover pad lrecl=1000;

input
@1 cardname $8.

@9 merchant $20.

@29 amount 8.
;
run;


If you want greater control over the text file you want to read in, you can utilize the following code. The double questions marks (??) allows the code to read in data even if the format type is different what is specified in your code. In the example below, your code would read in amount values, even if some of them are in character format, although the amount field has been specified as numeric.

data raw;
infile "C:\hsbcdata\hsbctxns.txt" missover pad lrecl=1000;

input
@1 cardnom1 ?? $8.
@8 cardnom2 ?? $8. @;
cardnom2 = cardnom1;
input
@19 merchant ?? $20.

@39 amount ?? 8.
;
run;


To save type typing code and directly import a spaced tab *.txt file to your program, use the following code:

PROC IMPORT OUT= WORK.INDAT
DATAFILE= "C:\REJECT_REPORT.txt"

DBMS=TAB REPLACE;

GETNAMES=YES;

DATAROW=2;

RUN;


To import a delimited *.txt file, use the following code:

PROC IMPORT OUT= WORK.Profit_Rpt
DATAFILE= "C:\Profit_Rpt.txt"

DBMS=DLM REPLACE;

DELIMITER='00'x;

GETNAMES=YES;

DATAROW=2;

RUN;


Reading data from a *.csv file

Now we'll read in data from delimited or *.csv file. The delimiter let's the program know when the current field ends and the next field begins.
This method is advantageous over the previous method because you don't have to specify the position and format for each variable you read. Also, this method lets you read in as much data as you want. Let's assume there's another file named 'deptamt.csv' in the hsbcdata folder in your C:\ drive. Here is how you would read in that data:

data rawcsv;
infile "C:\hsbcdata\deptamt.csv" dlm=',' dsd missover pad firstobs=2 lastobs=1000;

length name $8. type $4. amount 8.2;

input name $ type $ amount;

run;

One problem you might encounter using comma (',') as a delimiter is that numbers (i.e., 23,000) or certain fields (i.e., LastName, FirstName) may have commas within their values that may corrupt your dataset. A less frequently used special character, such as tilda ('~'), as a delimiter can help you avoid this problem.

To import a *.csv file, use the following code:

PROC IMPORT OUT= WORK.Employee_Records
DATAFILE= "C:\Employee_Records.csv"

DBMS=CSV REPLACE;

GETNAMES=YES;

DATAROW=2;

RUN;


Reading data from an Excel file
Now on to reading in data from a Microsoft excel file. This saves the effort of converting your Excel file into another file format and reading it in. However, Excel limits the row and columns of data you can read in. This code lets you read in the data from an Excel file named hsbcfigs.xls:

filename toSAS dde"Excel[hsbcfigs.xls]hsbcstat!r1c1:r100c100" notab;
data rawxls;

infile toSAS dlm='09'x dsd missover;

length name $8 type $4 amount 8.2;

input name$ type$ amount;
run;


To import an Excel file, use the following code:

PROC IMPORT OUT= WORK.indat
DATAFILE= "C:\valid_tests.xls"

DBMS=EXCEL2000 REPLACE;

GETNAMES=YES;

RUN;


Reading data from a DB2 Database
Utilizing the SQL pass thru code below, you can read data directly from DB2 database to your SAS program. The same pass thru code can also be used to read in data from an Oracle database.

proc sql;
connect to db2 (database=app user=appguest password=app123 dsn=core schema=appcore
);
create table testdata as

select * from connection to db2
(select
m.name, m.type, m.amount
from appcore.sales m
where m.salesdate='10JUN2007' and m.amount<=100.00);
disconnect from db2;

quit;


Often, the data for your analytical model would come to you from different sources in various forms. You can use SAS to read in the data based on how the data is made available to you.

No comments: