Sunday, May 11, 2008

Tools of the Trade III: Simple Data Validation Procedures

The reliability of your analytical model depends to a great extent on the quality of its underlying data. Unfortunately, bad data is a prevalent problem for most corporations. A survey by Gartner found that at least 25 percent of all data at Fortune 1000 firms was bad or corrupted. Based on my experience, carrying out a few simple procedures after reading in the data will let you track down and rectify many of these data quality problems, even if you never reach 100 percent accuracy. If you are getting your data from an outside source, it is a good idea to share your findings from the data validation procedures with that source. The people providing the data may be able to tell you whether your preliminary findings make sense or need further investigation. Carry out these procedures even if you have assurances that the data is good and complete. These simple - yet effective - procedures include the following:

1. Prior to performing any analyses, pull out the first and last 15 observations of the data and look for any obvious inconsistencies. Examine columns containing blanks or null values. Check whether the values for each variable make sense (e.g., a column that should contain numeric values, such as price, should not contain character values).
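
In SAS, a minimal sketch of this check might look like the following (the dataset name indat is a placeholder, and the last-15 trick relies on looking up the record count first):

* First 15 observations;
proc print data=indat (obs=15);
run;

* Last 15 observations: find the record count, then start printing 14 rows before the end;
data _null_;
  if 0 then set indat nobs=n;
  call symputx('startobs', max(1, n - 14));
  stop;
run;

proc print data=indat (firstobs=&startobs);
run;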

2. If using SAS, run a contents procedure on the dataset to verify that all the variables are present and in the proper format. Date and time are often the trickiest variables; make sure you know whether they are stored in date, datetime (timestamp), or character format. IDs are another problematic variable. They should generally be stored in character format (even if they consist of digits), but often arrive in numeric format.
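
A minimal sketch of this check, again assuming a dataset named indat and a hypothetical numeric ID variable cust_id:

* List each variable's name, type, length, format, and informat;
proc contents data=indat;
run;

* If an ID was read as numeric, one option is to convert it to character;
data indat_fixed;
  set indat;
  cust_id_char = strip(put(cust_id, 12.));
run;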

3. Verify the number of records you read into your program. Investigate whether you have more records than anticipated (often resulting from trailing blanks) or fewer (records getting dropped by the read-in program). Trailing blanks are not harmless and may cause your program to malfunction during later stages of your analyses.
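
A quick way to check the record count, sketched here with PROC SQL (indat is again a placeholder name):

* Count the records actually read in and compare against the expected total;
proc sql;
  select count(*) as record_count
  from indat;
quit;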

Try not to eliminate any records. Save the bad records in separate datasets for further analysis. These records often provide valuable clues for future QA strategies.
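
A minimal sketch of this split, using hypothetical variables cust_id and unit_price as the validity checks:

* Route suspect records to a separate dataset instead of deleting them;
data good_recs bad_recs;
  set indat;
  if missing(cust_id) or missing(unit_price) then output bad_recs;
  else output good_recs;
run;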

4. Make sure the data falls within valid ranges. For instance, if you know the age of people in the dataset should not be less than 18 years or more than 65 years, then all the dates of birth should fall within the corresponding range. If you know that the highest and lowest unit prices in your dataset are $100 and $0.10 respectively, then separate and study records where unit prices are above $100 or below 10 cents.
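
For the unit price example, a sketch of the range check (the variable name unit_price is assumed):

* Set aside out-of-range prices for review;
* Note: missing numeric values compare as less than any number, hence the missing() guard;
data price_outliers;
  set indat;
  if unit_price > 100 or (unit_price < 0.10 and not missing(unit_price));
run;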

5. Perform a frequency on key variables. A frequency on dates could show records missing for an entire day. Also, pay attention to dates equivalent to default dates (e.g., Jan 1, 1900 for Excel and Jan 1, 1960 for SAS). Make sure there are no blanks when there should be null values, and vice versa.
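
A sketch of a date frequency check, assuming a SAS date variable named order_date:

* One row per date - gaps or suspicious spikes stand out quickly;
proc freq data=indat;
  tables order_date / missing;
  format order_date yymmdd10.;
run;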

6. Randomly pull 15 records from the entire dataset and ascertain that there are no egregious errors.
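
PROC SURVEYSELECT (part of SAS/STAT) is one way to draw the sample; this sketch assumes it is available:

* Draw a simple random sample of 15 records for a spot check;
proc surveyselect data=indat out=spot_check
  method=srs sampsize=15 seed=20080511;
run;

proc print data=spot_check;
run;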

7. Read out the records with the 10 highest and lowest values for key variables.
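
A simple sort-and-print sketch for a key variable such as unit_price:

* Ten lowest values;
proc sort data=indat out=by_price;
  by unit_price;
run;

proc print data=by_price (obs=10);
run;

* Ten highest values;
proc sort data=by_price;
  by descending unit_price;
run;

proc print data=by_price (obs=10);
run;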

8. Output the mean, median and mode for key variables.
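
PROC MEANS can report all three in one pass, provided your release supports the MODE keyword (PROC UNIVARIATE also reports the mode); unit_price is again a stand-in variable:

* Basic central-tendency check on a key variable;
proc means data=indat mean median mode maxdec=2;
  var unit_price;
run;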

Save your data validation results. Comparing the different findings over time should enable you to identify and correct data problems.

Using SAS to compare two datasets

You can use the SAS compare procedure to validate a dataset by comparing it with a baseline dataset. Use the following code to compare two datasets that should be identical:

proc compare base=indat1 compare=indat2 OUTSTATS=outdat1 noprint;
run;

This is helpful when you need to modify an existing SAS program, but do not want the modifications to change the output.
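
One additional check worth noting: PROC COMPARE sets the SYSINFO automatic macro variable to a return code, and a value of 0 indicates that no differences were found. Check it immediately after the PROC COMPARE step, before another step resets it, for example:

%put NOTE: PROC COMPARE return code = &sysinfo;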