Saturday, December 29, 2007

Tools of the Trade II: Using SAS to Extract the Data

In this post, we will be using SAS to read in the input data from different file formats.

Reading data from a text file
Reading data from a text file is the most basic and popular way to read data into a SAS program. It is usually how analysts first learn to read data into their programs. It gives you the greatest control over how the data is read, and you can read in as much data as you want. Let's assume that there is a text file named 'hsbctxns.txt' saved in a folder called hsbcdata on your computer's C:\ drive. The code for reading that file is given below:

data raw;
  infile "C:\hsbcdata\hsbctxns.txt" missover pad lrecl=1000;
  input
    @1  cardname $8.
    @9  merchant $20.
    @29 amount   8.
  ;
run;


If you want greater control over how the text file is read, you can use the following code. The double question marks (??) in front of an informat suppress the invalid-data messages that SAS would otherwise write to the log when a value does not match the informat specified in your code; the offending value is simply set to missing instead of generating errors. In the example below, your code would still run even if some of the amount values are in character format, although the amount field has been specified as numeric.

data raw;
  infile "C:\hsbcdata\hsbctxns.txt" missover pad lrecl=1000;
  input
    @1 cardnom1 ?? $8.
    @8 cardnom2 ?? $8. @;  /* trailing @ holds the record for the next input statement */
  cardnom2 = cardnom1;
  input
    @19 merchant ?? $20.
    @39 amount   ?? 8.
  ;
run;


To save typing and directly import a tab-delimited *.txt file into your program, use the following code:

PROC IMPORT OUT= WORK.INDAT
     DATAFILE= "C:\REJECT_REPORT.txt"
     DBMS=TAB REPLACE;
     GETNAMES=YES;
     DATAROW=2;
RUN;


To import a delimited *.txt file, use the following code:

PROC IMPORT OUT= WORK.Profit_Rpt
     DATAFILE= "C:\Profit_Rpt.txt"
     DBMS=DLM REPLACE;
     DELIMITER='00'x;
     GETNAMES=YES;
     DATAROW=2;
RUN;


Reading data from a *.csv file

Now we'll read in data from a delimited or *.csv file. The delimiter lets the program know where the current field ends and the next field begins.
This method has an advantage over the previous one: you don't have to specify the position and format of each variable you read. Also, this method lets you read in as much data as you want. Let's assume there's another file named 'deptamt.csv' in the hsbcdata folder on your C:\ drive. Here is how you would read in that data:

data rawcsv;
  /* firstobs=2 skips the header row; obs=1000 stops reading at record 1000 */
  infile "C:\hsbcdata\deptamt.csv" dlm=',' dsd missover pad firstobs=2 obs=1000;
  length name $ 8 type $ 4 amount 8;
  input name $ type $ amount;
run;

One problem you might encounter using the comma (',') as a delimiter is that numbers (e.g., 23,000) or certain fields (e.g., 'LastName, FirstName') may contain commas within their values, which can corrupt your dataset. Using a less common special character, such as the tilde ('~'), as your delimiter can help you avoid this problem.
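For illustration, here is a minimal sketch of the same read, assuming a hypothetical tilde-delimited version of the file named 'deptamt.txt' with the same three fields:

data rawtilde;
  /* hypothetical tilde-delimited version of the same data */
  infile "C:\hsbcdata\deptamt.txt" dlm='~' dsd missover pad firstobs=2;
  length name $ 8 type $ 4 amount 8;
  input name $ type $ amount;
run;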

To import a *.csv file, use the following code:

PROC IMPORT OUT= WORK.Employee_Records
     DATAFILE= "C:\Employee_Records.csv"
     DBMS=CSV REPLACE;
     GETNAMES=YES;
     DATAROW=2;
RUN;


Reading data from an Excel file
Now on to reading in data from a Microsoft Excel file. This saves you the effort of converting your Excel file into another file format before reading it in. However, Excel limits the number of rows and columns of data you can read in. The code below reads the data from an Excel file named hsbcfigs.xls over Dynamic Data Exchange (DDE); note that the workbook must be open in Excel for DDE to work:

filename toSAS dde "Excel|[hsbcfigs.xls]hsbcstat!r1c1:r100c100" notab;

data rawxls;
  /* '09'x is the tab character Excel uses to separate cells over DDE */
  infile toSAS dlm='09'x dsd missover;
  length name $ 8 type $ 4 amount 8;
  input name $ type $ amount;
run;


To import an Excel file, use the following code:

PROC IMPORT OUT= WORK.indat
     DATAFILE= "C:\valid_tests.xls"
     DBMS=EXCEL2000 REPLACE;
     GETNAMES=YES;
RUN;


Reading data from a DB2 Database
Using the SQL pass-through code below, you can read data directly from a DB2 database into your SAS program. The same pass-through approach can also be used to read data from an Oracle database.

proc sql;
  connect to db2 (database=app user=appguest password=app123 dsn=core schema=appcore);
  create table testdata as
  select * from connection to db2
    ( select m.name, m.type, m.amount
      from appcore.sales m
      /* the inner query is passed to DB2 as-is, so the date literal uses DB2's format,
         assuming salesdate is a DB2 DATE column */
      where m.salesdate='2007-06-10' and m.amount<=100.00 );
  disconnect from db2;
quit;


Often, the data for your analytical model will come to you from different sources and in various forms. You can use SAS to read in the data based on how it is made available to you.

Thursday, December 6, 2007

Tools of the Trade I: The Reporting System Architecture

Want to know more about your customers' buying behavior? What was the impact of the $5 coupons that you handed out at the metro station? Should you hand out more? Is there really a correlation between the sales of diapers and beer in your store after midnight? What other items are these grumpy dads buying? Nachos to go with the beer? Or do they prefer pretzels? In today's blog entry, I want to discuss the tools you will be using for your analytics strategy so that you can answer questions like these. Specifically, I want to show you how to put together an analytical reporting system using SAS and Microsoft Excel to analyze your data. This system will pull data from various sources, scrub it clean, perform your analyses, and output the result(s) in accordance with your needs.

The architecture shown below should be sufficient to solve a large number of your analytics problems; you are limited mainly by your imagination. Most large firms have SAS licenses, and almost everyone with a PC has Microsoft Excel. The two products work seamlessly together to create a very powerful analytics tool; however, I have noticed that most companies use only a fraction of the capabilities they offer. One reason these two products complement each other so well is that they counterbalance each other's weaknesses. SAS can read in hundreds of millions of records and perform very sophisticated statistical analyses on very large datasets; it cannot, however, produce professional-looking deliverables with graphs and charts that you can hand out at a meeting. Excel, on the other hand, can produce pretty-looking deliverables, but is severely constrained in the number of records it can work on (65,536 rows per worksheet). The best approach, therefore, is to have SAS do the heavy lifting and Excel do the trimmings.

A note to freshman analysts: if you can, at the very least, learn and execute the components needed to implement the system shown below, there will always be food on your table and a roof over your head.



The structure above has four main components. These are:

Extracting the Data
I'll show how to extract data from different types of files. In one example, we will read in the entire contents of an Excel file. In another example, we will read in data starting from a specific row (in case someone provided you an Excel file with headers and titles that you want to exclude when extracting data). We will also read in data from a comma-delimited file and a text file. I will also tell you which file formats are preferable in which circumstances (extracting from text files gives you the greatest level of control over what you read into your program; Excel gives you the least. However, because Excel is such a popular file format, you may be receiving a lot of your data in Excel format). In another example, we will extract data from a DB2 database (i.e., if you want to read data from your corporate databases) using SQL code in a pass-through program (if your organization prefers Oracle, similar code can be used to extract data from Oracle databases too). A lot of people don't know that SAS has specific code that produces a GUI to add or delete individual records from a table (think of it as a back-door approach to handle one-sies and two-sies, so you don't need to run the data extraction programs again). We will utilize that code too.
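As a rough sketch, assuming the code in question is SAS/FSP's FSEDIT procedure (which opens an interactive editing window), it can be as simple as:

/* opens a full-screen window for browsing, adding, or deleting individual records; */
/* requires a SAS/FSP license, and the dataset name here is hypothetical            */
proc fsedit data=work.indat;
run;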

Data Validation
After reading in the data, you will want to ensure that it is not corrupted or useless for your purpose. In this stage, we will employ common sense techniques and filtering procedures to ascertain the quality and integrity of our data.

The Analytics Engine
This is the heart of your system, where all the 'magic' takes place. Whether it is four lines of regression code or 120 pages of extremely sophisticated linear programming code that executes an optimization strategy or picks stocks, think of it as a black box where raw data enters one end and actionable information comes out the other. This black box will contain all your business rules.
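To make the 'four lines' concrete, here is a minimal sketch of a regression step in SAS (the dataset and variable names are hypothetical):

/* fits amount as a function of price and units from a hypothetical input dataset */
proc reg data=work.testdata;
  model amount = price units;
run;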

Reporting Your Output
SAS allows us to output data in Excel, comma delimited, or text formats. Moreover, one of the nice features about SAS is that it allows us to kick off Excel VBA macros from the program itself. We will use this feature to automatically format the Excel workbook to which SAS outputs our results, eliminating the need to manually pretty up our deliverable. The idea is to have the deliverable ready as soon as the SAS program finishes its run, as well as eliminate any unnecessary manual work or possibility for human errors. SAS also allows us to output data in HTML format to a specific directory. Therefore, we can output the results to an internet directory and have a web based report available to everyone with a web browser, preferably at the same time the spreadsheet deliverables are available for distribution. Think web based Dashboards with hard copy deliverables available instantaneously for distribution.
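As a rough illustration of both ideas, here is a sketch that writes a report to HTML with ODS and then triggers a hypothetical Excel macro over DDE (file paths and the macro name are assumptions, and Excel must already be running with the macro workbook open for the DDE step):

/* write a simple web report */
ods html file='C:\reports\sales_report.html';
proc print data=work.testdata;
run;
ods html close;

/* ask Excel (already open) to run a formatting macro via DDE */
filename xlcmds dde 'excel|system';
data _null_;
  file xlcmds;
  put '[RUN("PERSONAL.XLS!FormatReport")]';
run;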

Over the next several weeks, I'll be elaborating further on how each of these system components work. I might also be tweaking the overall system to introduce improvements, such as introducing a control program and scheduling reports to run automatically. Finally, I will be providing several actual examples where this architecture has been implemented.

Monday, November 19, 2007

A Brief Respite From Analytics

Today I'm taking a break from writing about Analytics, and just posting the video that I took while visiting the Monterey Bay Aquarium this summer. The video is pretty shoddy, but I shot it entirely with my Treo 650 cell phone. Guess I better save some money for an iPhone.

Monday, November 12, 2007

Laying the Analytics Foundation I: Designing a Sample

On my blog, I have tried to focus on my experience with analytical methods and techniques applied in the corporate work environment, which may be different from what we were taught in school. Very often, companies are eager to obtain data as quickly and cheaply as possible, and do not apply the same rigor that would be expected when writing an academic research paper. In most circumstances, we need not worry as long as the data obtained has some decision-making value. Nevertheless, there will be times when you may be called to defend your analyses to senior management or a regulatory agency. A question that I am asked very frequently in such situations is, “How much confidence do you have in your numbers?” In these cases, the more closely you are able to follow the academic concepts and theories picked up in school, the more you reduce your exposure to criticism. After all, nobody in your audience is going to dispute Cochran's formulas.

First, every effort should be made to obtain reliable secondary data before contemplating a sampling study. However, if the necessary data is scarce or nonexistent, and the cost of studying the entire population is prohibitive, you will need to extract a sample from your population. But before you decide on the size of your sample, you need to accurately determine your primary sampling unit. Your primary sampling unit is the smallest indivisible unit of your population that you intend to sample. All elements of what you deem to be your primary sampling unit should have identical or similar characteristics. The results of your study can vary greatly if you are not careful when determining your primary sampling unit. Unfortunately, this is also one of the most neglected aspects of sampling that I've witnessed in many corporate environments. The U.S. Postal Service, for instance, has over 500 bulk mail processing plants that vary in size. Smaller plants have 1-3 AFCS sorting machines, medium-sized plants have 5-7, and larger plants have more than 10. The characteristics of a plant with one sorting machine are very different from those of a plant with 10 sorting machines. If you pull a random sample of 50 plants from a list of the 500 plants, you could end up with all small plants, and your study would not have any representation of medium and large plants. Nonsensical? I have seen it happen. The results are not pretty.

The next step is ensuring the stability of your sampling frame, another aspect of sampling that's often overlooked in the corporate environment. Your sampling frame is the population list of primary sampling units from which you are to choose your sample. I have seen good statisticians analyze the frame to ensure that the list doesn't grow or shrink in subsequent periods. So what if your sampling frame fluctuates greatly from one period to another? Easy. You don't (or rather can't) do a sample. In such circumstances, if you have no other choice, you can pull your sample from the most current frame. But you shouldn't put too much 'confidence in your numbers'.

Now you are ready to pull your sample. Most corporations that I've worked at use a simple rule of thumb to determine sample size: 10 percent of the population. For those who want to follow a more scientific method, however, the formula for determining sample size is given below:
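One commonly used version of Cochran's formulas, for estimating a mean with a finite population correction, is:

n_0 = \frac{z^2 \sigma^2}{e^2}, \qquad n = \frac{n_0}{1 + n_0/N}

where z is the z-value for your desired confidence level, \sigma is the (estimated) population standard deviation, e is the acceptable margin of error, n_0 is the initial sample size, and N is the population size.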



The assumption behind these formulas is that the more variation there is between your sample and population means, the larger your sample size should be. You can obtain your population parameters by (1) doing a pilot study, (2) using those from a previous study of a similar population, and/or (3) taking an initial sample and using the mean from that sample. There are also ways to introduce more precision into the above formulas, if needed.

Once you have determined your sample size, there are several ways to pick your sample from your sampling frame. Some of the more popular methods are:

Simple Random Sampling
Simple random sampling is, by far, the most popular method used by businesses to construct their samples. In this method, you randomly select sampling units - equal in number to your sample size - from your sampling frame without any bias or restrictions. In other words, every item in your sampling frame has an equal chance (or the same probability) of being included in your sample. One way to achieve this would be to assign every sampling unit a unique number, and then use a random number generator to select a set of numbers - equal to your sample size - from that pool of sampling units.
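If you are pulling the sample in SAS, PROC SURVEYSELECT (part of SAS/STAT) handles this directly; a minimal sketch, with a hypothetical frame dataset and sample size:

/* draw a simple random sample of 50 units from the sampling frame */
proc surveyselect data=work.frame out=work.srs_sample
                  method=srs n=50 seed=12345;
run;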

Stratified Sampling
In this method, you group your population into homogeneous groups, or strata, based on some broad characteristic shared by the units in each stratum but not by the others. You then randomly select sampling items from each stratum; the selections from all strata together should add up to your sample size. In the simple random sampling method explained previously, you risk excluding certain sampling units whose characteristics you wish to include in your analyses.
This method ensures all the characteristics of your population you wish to study are represented in your sample. In the example of the USPS bulk mail processing plants that I provided, the study was flawed because we had conducted a simple random sample. This mistake could have been averted if we had carefully evaluated our sampling methodology and conducted a stratified sample instead.
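In SAS, the same PROC SURVEYSELECT step can allocate the sample across strata; a minimal sketch, assuming the frame carries a hypothetical plant_size variable and per-stratum sizes of 30, 15, and 5:

/* sort by the stratification variable, then sample within each stratum */
proc sort data=work.frame;
  by plant_size;
run;

proc surveyselect data=work.frame out=work.strat_sample
                  method=srs n=(30 15 5) seed=12345;
  strata plant_size;
run;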

Interval Sampling
In this method, also known as systematic sampling, you divide your population size by your sample size to obtain a factor, or sampling interval. From your sampling frame, you then select your sampling items at intervals equal to that factor. For example, if your population size is 1,000 units and your sample size is 100, your factor would be 10 (1,000/100). You would then select every 10th item from your sampling frame to construct your sample.
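A quick data step sketch of this selection, assuming the frame is already a SAS dataset and the interval is 10:

/* keep every 10th record from the sampling frame */
data work.sys_sample;
  set work.frame;
  if mod(_n_, 10) = 1;  /* starts at the 1st record; vary the start for a random offset */
run;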

Judgment Sampling
This is a biased sampling method where the choice of selecting the sampling items rests exclusively on the judgment of the analyst(s) carrying out the study. If a sample of 10 students, for instance, has to be selected from a class of 100 students; the analyst chooses the 10 students that he/she thinks represents the class best. Judgment sampling is good for quick and dirty studies where the business can't afford to spend too much money.

Using the approaches discussed above, you should now be able to extract a sample from your population.

Saturday, September 22, 2007

Kick Starting Your Analytics Strategy

“Did you watch this movie ‘The Bridges of Madison County’? It’s really great,” my cousin asked me one afternoon. “Isn’t that a really old chick flick?” I said. “Nah, I don’t watch those.” “Well,” she replied, “I really like these kinds of movies. I can’t get enough.” “How do you get your movies?” I asked. “Netflix,” she replied. “Why not Blockbuster? They offer the same service, and you can even get movies from the store if you want to watch something right away. Isn’t that better? I don’t know why Netflix is still around.” “Well,” she answered, “I’m quite happy with my Netflix service. I don’t see why I should change.”

One of the reasons why Netflix prevailed over its more powerful and well entrenched competitor is because of its shrewd use of analytics. At the heart of Netflix’s analytics strategy is ‘Cinematch’, a proprietary software that analyzes customers’ choices and feedback on the movies they have previously rented, and recommends new movies in ways that optimize both the customer’s taste and Netflix’s inventory. This is also one of the primary reasons why so many of Netflix’s rentals are old movies rather than new releases (and explains why my cousin ended up renting ‘The Bridges of Madison County’). Blockbuster, on the other hand, focuses on new releases and how quickly titles move from its shelves. If a title is on Blockbuster’s shelf too long, it’s discarded. In similar circumstances, Netflix - with the help of Cinematch - is trying to find a customer for those titles. The result is greater customer satisfaction (why my cousin refuses to ditch Netflix for Blockbuster), while lowering costs (After all, old titles cost much less than new releases). No wonder Netflix has offered to pay $1 million to anyone who can improve Cinematch’s accuracy by a mere 10 percent.

Now you want to do something similar for your organization. You want to harness the awesome power of analytics to crush your competitors and leave them in the dust. But how do you get started?

To jump start your analytics strategy, here are some of the steps you need to take:

Define your business problem.

There used to be a saying at IBM: "There are no problems, only opportunities". True. The impetus to do anything in a business environment is a need to act. That need is your business problem. For example, the need to find out why sales are down would obviously qualify as a business problem. Likewise, the need to further increase sales can also be considered a business problem. Although the purpose of an analytics strategy is to provide solutions to business problems, the problem itself may not be so obvious. In our example above, a company might confuse the two problems as the same and offer heavy discounts to increase sales. If the company were in the business of making floppy disks, however, a closer look at why sales were down would have revealed that more people were switching to memory sticks because they are more convenient to use; by lowering prices, the company just hurt itself financially on its diehard floppy disk users. Successful analytical solutions depend to a great degree on how well you have analyzed and scoped your business problem. If you have not had at least one meeting with all your stakeholders, you have not yet defined your business problem. Also, keep in mind that every business problem is also an opportunity to improve.

Determine your metrics.

After you have spent significant time understanding and defining your business problem, the next step is to determine the data points or measures needed to execute and evaluate your strategy. Analysts, for instance, use anything from earnings to net income to measure the health of a company; but I knew a stock analyst who liked to use gross profit minus operating expenses because he believed that the trend in that figure gave him a raw assessment of how seriously a company was trying to boost productivity. The metrics have to be determined and agreed on before you execute your analytics strategy; it cannot be 'let's figure it out as we go along'. Think of your strategy as driving to a destination, with the data points as road signs telling you which road to take and how close you are to your destination. If you're driving from Monterey to Salinas and do not see any road signs after half an hour, that road will not take you to Salinas.

See what information you have.

Once you've determined the data points needed to execute and evaluate your strategy, you need to figure out the availability of that data. Data that already exists or was gathered for some other purpose is known as Secondary data. It is always preferable to look for secondary data first, because it makes your strategy cheaper and easier to carry out. Most large organizations suffer from an information silo effect, which means the information you need is gathered and used somewhere in the organization, but is not known to you or your group. Sometimes, you may have to purchase the required data from a vendor like Gallup or Nielsen. If data is sparse or non-existent, then you may have to gather information from Primary sources. This may involve conducting a census or a survey. Obtaining data from primary sources should be one of your last resorts, after every effort to obtain secondary data has failed, because primary data collection will quickly increase the cost of executing your analytics strategy. In dire circumstances, where data is sparse or lacking and you can't spend too much money to obtain data from primary sources, you can look for data obtained for purposes that are similar to, but do not quite meet, the current requirements, and make minor adjustments to it (e.g., a company trying to introduce a new sports drink in a town can use existing data from a survey of athletes who drink carbonated beverages). However, in these cases, you should be aware of the risks inherent in using data obtained in such a manner.

Validate your data.

After you have obtained your data, the next step is to assure its integrity. Never assume that your data is 100% reliable, no matter who the source is or how strong that source's brand name. A recent survey by Gartner found that at least 25 percent of all data for Fortune 1000 firms was bad or corrupted. There are many ways to determine whether your data is good. Some include checking whether the data is in the proper format, whether you have missing or null values where there should be none, or whether the data falls into proper ranges. Another, cruder, way is to determine the count of records being input into the system and verify that count at different points of the process to ensure that no records are getting dropped. Finally, you could pull a small random sample of 100 - 500 records from the raw dataset and simply view it to confirm that there are no obvious errors.
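As a rough illustration, here is a small SAS sketch of a few of these checks on a hypothetical transaction dataset (the variable names are assumptions):

/* count records, missing values, and the range of the amount field */
proc means data=work.raw n nmiss min max;
  var amount;
run;

/* check that a categorical field contains only expected values */
proc freq data=work.raw;
  tables type / missing;
run;

/* eyeball a small slice of the raw records for obvious problems */
proc print data=work.raw (obs=100);
run;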

Select your tools.

This depends on what tools you already have, the amount of data required to execute your strategy, and what kind of analytics you intend to perform. If your dataset is relatively small, Microsoft Excel - using Visual Basic for Applications (VBA) - can perform a surprisingly large range of sophisticated analyses. Excel is not a good tool, however, if you have to perform analytics on large datasets or transfer data to other applications. Statistical software such as SAS, on the other hand, can perform analytics on large datasets and can output data in different formats that can be fed into other applications. Personally, I have found that using SAS in conjunction with Excel (an advantage of SAS is that it allows you to automatically kick off Excel VBA macros, simplifying much of your work) provides plenty of latitude to execute a relatively wide range of analytics strategies.

Validate the strategy.

Before relying on your strategy to provide the solution to your business problem, it would be a good idea to test the reliability of your strategy. You can do this by carrying out pilot tests prior to the actual implementation. Create a small subset from your data population. Determine what the end result would be. Run the data through your solution. See whether the end result matches your expectations. Another benefit of conducting pilot tests is that it allows you to uncover and fix any unexpected quirks that you failed to anticipate.

Don't reinvent the wheel.

Finally, do not reinvent the wheel if you don't have to. Analytics primarily involves data and quantitative concepts. More often than not, the same strategy can be tweaked and applied to similar business problems across the board. The probability is also high that someone in your organization has faced a similar problem and come up with a successful strategy. If that is the case, you can save a lot of money and energy by simply borrowing from, or building on, that strategy.


Wednesday, September 19, 2007

A Working Definition for Analytics

It's Christmas season and your wife finally got you to agree to mail the Christmas gifts. Perturbed, you drag the small hill of packages to the local post office, only to see a half mile long waiting line and two very indifferent postal clerks taking their sweet time handling the customers. Two hours later you are still standing in line, very cross with the U.S. Postal Service and your wife for wasting your entire afternoon. Next year, you decide to make the trip to UPS instead.

The U.S. Postal Service has just lost a customer. Coincidentally, USPS does gather data on how much time customers spend at the window. Unfortunately, it also gathers data on 2,000+ other metrics. Critical information such as customer wait time often gets overlooked because managers are bombarded with too much information. Adding another postal clerk could have brought the wait time down from 2 hours to 45 minutes. Adding two more postal clerks would bring that wait time down to 15 minutes. Adding three more postal clerks would bring it down to 5 minutes. Customer satisfaction surveys show that postal customers start getting agitated if the wait time exceeds 18 minutes. Consequently, adding one more postal clerk would not have improved the situation enough, while adding three more would have meant incurring unnecessary costs. Therefore, there is an optimal number of postal clerks that the USPS could have assigned to the above-mentioned post office, and that number is four (the two clerks already at the window plus two more). The Postal Service could have retained its customer if it had a good analytics strategy in place. So what is "analytics"?

The simplest definition of analytics would be that it is "the science of analysis". In reality, the word "analytics" has not been precisely defined by the professional community and may mean different things to different people. A simple and practical definition, however, is this: analytics is how an entity arrives at the most optimal and realistic decision from a variety of available options, based on existing data. There are several aspects to this definition. First, the purpose or goal of the endeavor is to arrive at a decision. Second, the process should be able to identify the best option from a range of available options. Finally, the decision-making process should be based on data. Business managers may choose to make decisions based on past experience or rules of thumb, and there may be other qualitative aspects to decision making; but unless there is data involved in the process, it falls beyond the purview of analytics.

Many people think that analytics only involves the use of statistical analysis or mathematics to predict and improve business performance. It would, however, be erroneous to limit the field of analytics to only statistics and mathematics. Good analytics professionals should be well trained in business concepts and the social sciences, as well as have a good grasp of statistics and mathematics. A good analytics professional should also be willing and able to work across various fields to come up with the proper solutions. Others argue that an analytics professional should be cognizant of his data sources as well, which includes knowledge of his organization's IT infrastructure (how else would he know that his data is not being compromised or corrupted by the system?). That is why analytics is unique and much broader than the use of statistics or mathematics in business.

With computers getting more powerful along with the increasing popularity of Business Intelligence (BI) tools, the importance of analytics is growing for businesses. Analytics has been credited for helping Netflix ward off competition from Blockbuster video and helping Google overtake Yahoo! to become the most profitable portal on the internet.