Saturday, December 29, 2007

Tools of the Trade II: Using SAS to Extract the Data

In this post, we will be using SAS to read in the input data from different file formats.

Reading data from a text file
Reading data from a text file is the most basic and popular way to read in data for a SAS program.Usually this is how analysts first learn how to read in data into their programs. It gives you the greatest control while reading in the data, and you can read in as much data as you want. Let's assume that there is a text file named 'hsbctxns.txt' saved in a folder called hsbcdata in your computer's C\: drive. The code for reading that file is given below:

data raw; infile
"C:\hsbcdata\hsbctxns.txt" missover pad lrecl=1000;

input
@1 cardname $8.

@9 merchant $20.

@29 amount 8.
;
run;


If you want greater control over the text file you want to read in, you can utilize the following code. The double questions marks (??) allows the code to read in data even if the format type is different what is specified in your code. In the example below, your code would read in amount values, even if some of them are in character format, although the amount field has been specified as numeric.

data raw;
infile "C:\hsbcdata\hsbctxns.txt" missover pad lrecl=1000;

input
@1 cardnom1 ?? $8.
@8 cardnom2 ?? $8. @;
cardnom2 = cardnom1;
input
@19 merchant ?? $20.

@39 amount ?? 8.
;
run;


To save type typing code and directly import a spaced tab *.txt file to your program, use the following code:

PROC IMPORT OUT= WORK.INDAT
DATAFILE= "C:\REJECT_REPORT.txt"

DBMS=TAB REPLACE;

GETNAMES=YES;

DATAROW=2;

RUN;


To import a delimited *.txt file, use the following code:

PROC IMPORT OUT= WORK.Profit_Rpt
DATAFILE= "C:\Profit_Rpt.txt"

DBMS=DLM REPLACE;

DELIMITER='00'x;

GETNAMES=YES;

DATAROW=2;

RUN;


Reading data from a *.csv file

Now we'll read in data from delimited or *.csv file. The delimiter let's the program know when the current field ends and the next field begins.
This method is advantageous over the previous method because you don't have to specify the position and format for each variable you read. Also, this method lets you read in as much data as you want. Let's assume there's another file named 'deptamt.csv' in the hsbcdata folder in your C:\ drive. Here is how you would read in that data:

data rawcsv;
infile "C:\hsbcdata\deptamt.csv" dlm=',' dsd missover pad firstobs=2 lastobs=1000;

length name $8. type $4. amount 8.2;

input name $ type $ amount;

run;

One problem you might encounter using comma (',') as a delimiter is that numbers (i.e., 23,000) or certain fields (i.e., LastName, FirstName) may have commas within their values that may corrupt your dataset. A less frequently used special character, such as tilda ('~'), as a delimiter can help you avoid this problem.

To import a *.csv file, use the following code:

PROC IMPORT OUT= WORK.Employee_Records
DATAFILE= "C:\Employee_Records.csv"

DBMS=CSV REPLACE;

GETNAMES=YES;

DATAROW=2;

RUN;


Reading data from an Excel file
Now on to reading in data from a Microsoft excel file. This saves the effort of converting your Excel file into another file format and reading it in. However, Excel limits the row and columns of data you can read in. This code lets you read in the data from an Excel file named hsbcfigs.xls:

filename toSAS dde"Excel[hsbcfigs.xls]hsbcstat!r1c1:r100c100" notab;
data rawxls;

infile toSAS dlm='09'x dsd missover;

length name $8 type $4 amount 8.2;

input name$ type$ amount;
run;


To import an Excel file, use the following code:

PROC IMPORT OUT= WORK.indat
DATAFILE= "C:\valid_tests.xls"

DBMS=EXCEL2000 REPLACE;

GETNAMES=YES;

RUN;


Reading data from a DB2 Database
Utilizing the SQL pass thru code below, you can read data directly from DB2 database to your SAS program. The same pass thru code can also be used to read in data from an Oracle database.

proc sql;
connect to db2 (database=app user=appguest password=app123 dsn=core schema=appcore
);
create table testdata as

select * from connection to db2
(select
m.name, m.type, m.amount
from appcore.sales m
where m.salesdate='10JUN2007' and m.amount<=100.00);
disconnect from db2;

quit;


Often, the data for your analytical model would come to you from different sources in various forms. You can use SAS to read in the data based on how the data is made available to you.

Thursday, December 6, 2007

Tools of the Trade I: The Reporting System Architecture

Want to know more about your customers' buying behavior? What was the impact of the $5 coupons that you handed out at the metro station? Should you hand out more? Is there really a correlation between the sales of diapers and beer in your store after midnight? What other items are these grumpy dads buying? Nachos to go with the beer? Or do they prefer pretzels? In today's blog entry, I want to discuss about the tools you will be using for your analytics strategy so that you can answer questions like these. Specifically, I want to show you how to put together an analytical reporting system using SAS and Microsoft Excel to analyze your data. This system will to pull data from various sources, scrub it clean, perform your analyses, and output the result(s) in accordance to your needs. The architecture that I have shown below should be sufficient to solve a large number of your analytics problems; basically I think that you're limited only by your imagination. Most large firms have SAS licenses and almost everyone with a PC has Microsoft Excel. Both softwares work seamlessly together and create a very powerful analytics tool; however, I have noticed most companies use only a fraction of the capabilities that these two systems offer. A reason why these two products complement each other so well is because they counterbalance their weaknesses. For instance, SAS can read in hundreds of millions of records and perform very sophisticated statistical analyses on very large datasets. SAS, however, can't produce professional looking deliverables with graphs and charts that you can hand out at a meeting. Excel, on the other hand, can produce pretty looking deliverables, but is severely restrained on the number of records it can work on (65,536 rows per worksheet). The best approach, therefore, is to have SAS do the heavy lifting, and Excel do the trimmings. A note to freshman analysts. If you can, at the very least, learn and execute the components needed to implement the system shown below, there will always be food on your table and a roof over your head.



The structure above has four main components. These are:

Extracting the Data
I'll show how to extract data from different types of files. In one example, we will read in the entire data from an Excel file. In another example, we will read in data from a specific row (In case someone provided you an Excel file with headers and titles that you want to exclude when extracting data). We will also read in data from a comma delimited file and a text file. I will also tell you what file formats are preferable in what circumstances (extracting from text files gives you the greatest level of control on what you want to read into your program, Excel gives you the least. However, Excel being a very popular file format means that you may be receiving a lot of your data in Excel format). In another example, we will extract data from a DB2 database (i.e., if you want to read data from your corporate databases) using SQL code in a pass-thru program (if your organization prefers Oracle, similar code can be used to extract data from Oracle databases too). A lot of people don't know that SAS has specific code that produces a GUI to add or delete individual records from a table (think of it as a back door approach to handle one-sies and two-sies, so you don't need to run the data extraction programs again). We will utilize that code too.

Data Validation
After reading in the data, you will want to ensure that it is not corrupted or useless for your purpose. In this stage, we will employ common sense techniques and filtering procedures to ascertain the quality and integrity of our data.

The Analytics Engine
This is the heart of your system, where all the 'magic' takes place. Whether four lines of a regression datastep or 120 pages of extremely sophisticated linear programming code to execute an optimization strategy or pick stocks, think of it as a black box where raw data enters in one end, and actionable information comes out the other end. This black box will contain all your business rules.

Reporting Your Output
SAS allows us to output data in Excel, comma delimited, or text formats. Moreover, one of the nice features about SAS is that it allows us to kick off Excel VBA macros from the program itself. We will use this feature to automatically format the Excel workbook to which SAS outputs our results, eliminating the need to manually pretty up our deliverable. The idea is to have the deliverable ready as soon as the SAS program finishes its run, as well as eliminate any unnecessary manual work or possibility for human errors. SAS also allows us to output data in HTML format to a specific directory. Therefore, we can output the results to an internet directory and have a web based report available to everyone with a web browser, preferably at the same time the spreadsheet deliverables are available for distribution. Think web based Dashboards with hard copy deliverables available instantaneously for distribution.

Over the next several weeks, I'll be elaborating further on how each of these system components work. I might also be tweaking the overall system to introduce improvements, such as introducing a control program and scheduling reports to run automatically. Finally, I will be providing several actual examples where this architecture has been implemented.