1. Prior to performing any analyses, pull out the first and last 15 observations of the data and look for any obvious inconsistencies. Examine columns with blanks and null values. Check and see whether the values for a variable make sense (i.e., a column that should have numeric values, such as price, does not contain numeric values).
2. If using SAS, perform a contents procedure on the dataset to see if all the variables are accounted for in the proper format. Date and Time are often the trickiest variable. Make sure whether they are in time, timestamp or character format. IDs are another problematic variable. Generally, they should be in character format (even if they are numbers), but often exist in numeric format.
3. Verify the number of records you read into your program. Investigate if you have more (often resulting from trailing blanks) or less records (records getting dropped by the read in program) than anticipated. Trailing blanks are not harmless and may cause your program to malfunction during later stages of your analyses.
Try not to eliminate any records. Save the bad records in separate datasets for further analyses. These records often provide valuable clues for future QA strategies.
4. Make sure the data falls within valid ranges. For instance, if you know the age of people in the dataset should not be less than 18 years or more than 65 years, all the DOB should be within that range. If you know that the highest and lowest unit prices in your dataset are $100 and $0.10 respectively, then seperate and study records where unit prices are above $100 or less than 10 cents.
5. Perform a frequency on key variables. A frequency on dates could show records missing for an entire day. Also, pay attention to dates equavalent to default dates (i.e., Jan 1, 1900 for Excel and Jan 1, 1960 for SAS). Make sure there are no blanks when there should be null values, and vice versa.
6. Randomly pull 15 records from the entire dataset and ascertain that there are no egregious errors.
7. Read out the records with the 10 highest and lowest values for key variables.
8. Output the mean, median and mode for key variables.
Save your data validation results. Comparing the different findings over time should enable you to identify and correct data problems.
Using SAS to compare two datasets
You can use the SAS compare procedure to validate a dataset by comparing it with a baseline dataset. Use the following code to compare two identical datasets:
proc compare base=indat1 compare=indat2 OUTSTATS=outdat1 noprint;
run;