Monday, November 19, 2007

A Brief Respite From Analytics

Today I'm taking a break from writing about Analytics, and just posting the video that I took while visiting the Monterey Bay Aquarium this summer. The video is pretty shoddy, but I shot it entirely with my Treo 650 cell phone. Guess I better save some money for an iPhone.

Monday, November 12, 2007

Laying the Analytics Foundation I: Designing a Sample

On my blog, I have tried to focus on my experience with analytical methods and techniques applied in the corporate work environment, which may be different from what we were taught in school. Very often, companies are eager to obtain data as quickly and cheaply as possible, and do not apply the same rigor that would be expected when writing an academic research paper. In most circumstances, we need not worry as long as the obtained data has some decision-making value. Nevertheless, there will be times when you may be called upon to defend your analyses to senior management or a regulatory agency. A question that I am asked very frequently in such situations is, “How much confidence do you have in your numbers?” In these cases, the more closely you follow the academic concepts and theories picked up in school, the more you are able to reduce your exposure to criticism. After all, nobody in your audience is going to dispute Cochran's formulas.

First, every effort should be made to obtain reliable secondary data before contemplating a sampling study. However, if the necessary data is scarce or nonexistent, and the cost of conducting a study of the entire population is prohibitive, you will need to extract a sample from your population. But before you decide on the size of your sample, you need to accurately determine your primary sampling unit. Your primary sampling unit is the smallest indivisible unit of your population that you intend to sample. All elements of what you deem to be your primary sampling unit should have identical or similar characteristics. The results of your study could vary greatly if you are not careful when determining your primary sampling unit. Unfortunately, this is also one of the most neglected aspects of sampling that I've witnessed in many corporate environments. The U.S. Postal Service, for instance, has over 500 bulk mail processing plants that vary in size. Smaller plants have 1-3 AFCS sorting machines, medium-sized plants have 5-7, and larger plants have more than 10. The characteristics of a plant with one sorting machine are very different from those of a plant with 10 sorting machines. If you pull a random sample of 50 plants from a list of the 500 plants, you could end up with all small plants, and your study would not have any representation of medium and large plants. Nonsensical? I have seen it happen. The results are not pretty.

The next step is ensuring the stability of your sampling frame, another aspect of sampling that's often overlooked in the corporate environment. Your sampling frame is the population list of primary sampling units from which you are to choose your sample. I have seen good statisticians analyze the frame to ensure that the list doesn't grow or shrink in subsequent periods. So what if your sampling frame fluctuates greatly from one period to another? Easy. You don't (or rather can't) draw a sample. In such circumstances, if you have no other choice, you can pull your sample from the most current frame. But you shouldn't put too much 'confidence in your numbers'.
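One rough way to check frame stability is to compare the list of sampling units from one period to the next and see how many units were added or dropped. The sketch below is only an illustration: the function name `frame_turnover`, the plant IDs, and the counts are all made up, and what level of turnover counts as "too much" is a judgment call for your own study.

```python
def frame_turnover(previous_frame, current_frame):
    """Compare two versions of a sampling frame.

    Returns the percentage of units added, dropped, and the net change,
    all expressed relative to the size of the previous frame.
    """
    previous, current = set(previous_frame), set(current_frame)
    if not previous:
        raise ValueError("previous_frame must not be empty")
    added = current - previous      # units that appear only in the new frame
    dropped = previous - current    # units that disappeared from the frame
    base = len(previous)
    return {
        "added_pct": 100 * len(added) / base,
        "dropped_pct": 100 * len(dropped) / base,
        "net_change_pct": 100 * (len(current) - len(previous)) / base,
    }


# Hypothetical plant IDs for two consecutive periods (not real USPS data).
last_period = [f"plant_{i}" for i in range(1, 501)]    # plants 1..500
this_period = [f"plant_{i}" for i in range(11, 521)]   # 10 dropped, 20 added

print(frame_turnover(last_period, this_period))
```

Looking at additions and drops separately matters because a frame can keep the same total size while a large share of its units turns over.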

Now you are ready to pull your sample. Most corporations that I've worked at commonly use a simple rule of thumb to determine sample size, which is 10 percent of the population. However, for those who want to follow a more scientific method, the formula for determining sample size is given below:



The assumption behind these formulas is that the more variation there is between your sample and population means, the larger your sample size should be. You can obtain your population parameters by (1) doing a pilot study, (2) using those from a previous study of a similar population, and/or (3) taking an initial sample and using the mean from that sample. There are also ways to introduce more precision to the above formulas, if needed.
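To make the calculation concrete, here is a minimal sketch of a sample size computation. It assumes the standard textbook formula for estimating a mean, n0 = (z * s / e)^2, with an optional finite population correction n = n0 / (1 + n0 / N); this may not be the exact form of the formulas referred to above. The function name, parameter names, and input values are all hypothetical, and the standard deviation would come from a pilot study or a previous study as described.

```python
import math


def sample_size_for_mean(std_dev, margin_of_error, z=1.96, population_size=None):
    """Estimate the sample size needed to estimate a population mean.

    std_dev          -- estimated population standard deviation (s)
    margin_of_error  -- acceptable error in the units of the mean (e)
    z                -- z-score for the confidence level (1.96 is about 95%)
    population_size  -- if given, apply the finite population correction (N)
    """
    if std_dev <= 0 or margin_of_error <= 0:
        raise ValueError("std_dev and margin_of_error must be positive")
    n0 = (z * std_dev / margin_of_error) ** 2
    if population_size is not None:
        # Finite population correction: shrinks n when N is small.
        n0 = n0 / (1 + n0 / population_size)
    return math.ceil(n0)  # always round up to a whole sampling unit


# Hypothetical inputs: s = 10, e = 2, 95% confidence.
print(sample_size_for_mean(10, 2))                       # large population
print(sample_size_for_mean(10, 2, population_size=500))  # frame of 500 units
```

Note how the required sample grows with the standard deviation: doubling `std_dev` roughly quadruples the sample size, which is the variation-drives-size idea described above.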

Once you have determined your sample size, there are several ways to pick your sample from your sampling frame. Some of the more popular methods are:

Simple Random Sampling
Simple random sampling is, by far, the most popular method used by businesses to construct their samples. In this method, you randomly select the sampling units - equal to the number of your sample size - from your sampling frame without any bias or restrictions. In other words, every item in your sampling frame has an equal chance (or same probability) of being included in your sample. One way to achieve this objective would be to assign every sampling unit a unique number, and then use a random number generator to select a set of numbers - equal to your sample size - from that pool of sampling units.
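The numbering-plus-random-number-generator approach can be sketched in a few lines of Python. This is just an illustration: the function name `simple_random_sample` and the frame of 500 numbered units are assumptions, and the `seed` argument is only there so a draw can be reproduced when someone asks how the sample was picked.

```python
import random


def simple_random_sample(frame, sample_size, seed=None):
    """Draw a simple random sample without replacement.

    Every unit in the frame has the same chance of being selected.
    """
    units = list(frame)
    if not 0 < sample_size <= len(units):
        raise ValueError("sample_size must be between 1 and the frame size")
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(units, sample_size)


# Hypothetical frame: 500 sampling units numbered 1..500.
frame = range(1, 501)
sample = simple_random_sample(frame, 50, seed=2007)
print(sorted(sample))
```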

Stratified Sampling
In this method, you group your population into homogeneous groups, or strata, based on some broad characteristic shared by the units in each stratum but not by the others. You then randomly select your sampling items from each stratum, so that the selections across all strata add up to your sample size. In the simple random sampling method explained previously, you risk excluding certain sampling units whose characteristics you wish to include in your analyses. Stratified sampling ensures that all the characteristics of your population you wish to study are represented in your sample. In the example of the USPS bulk mail processing plants that I provided, the study was flawed because we had conducted a simple random sample. This mistake could have been averted if we had carefully evaluated our sampling methodology and conducted a stratified sample instead.
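Here is a minimal sketch of a stratified draw. It assumes proportional allocation (each stratum gets a share of the sample equal to its share of the frame), which is only one of several allocation schemes. The function name, the small/medium/large labels, and the plant counts are hypothetical and are not actual USPS figures.

```python
import random
from collections import defaultdict


def stratified_sample(frame, stratum_of, sample_size, seed=None):
    """Draw a stratified random sample with proportional allocation.

    frame       -- iterable of sampling units
    stratum_of  -- function that returns the stratum label for a unit
    Returns a dict mapping each stratum label to its sampled units.
    """
    units = list(frame)
    if not 0 < sample_size <= len(units):
        raise ValueError("sample_size must be between 1 and the frame size")
    rng = random.Random(seed)

    # Group the frame into strata.
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    # Proportional allocation, rounded with the largest-remainder rule
    # so the stratum sample sizes add up exactly to sample_size.
    exact = {s: sample_size * len(members) / len(units)
             for s, members in strata.items()}
    alloc = {s: int(share) for s, share in exact.items()}
    leftover = sample_size - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda s: exact[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1

    # Simple random sample within each stratum.
    return {s: rng.sample(members, alloc[s]) for s, members in strata.items()}


# Hypothetical frame of 500 plants: 300 small, 150 medium, 50 large.
plants = ([(f"S{i}", "small") for i in range(300)]
          + [(f"M{i}", "medium") for i in range(150)]
          + [(f"L{i}", "large") for i in range(50)])

sample = stratified_sample(plants, lambda plant: plant[1], 50, seed=2007)
print({size: len(chosen) for size, chosen in sample.items()})
```

One caveat with proportional allocation: a very small stratum can end up with zero units, so in practice you may want to set a minimum per stratum.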

Interval Sampling
In this method, also known as Systematic Sampling, you divide your population size by your sample size to obtain a factor. From your sampling frame, you then select your sampling items using an interval that equals the factor calculated. For example, if your population size is 1,000 units and your sample size is 100, your factor would be 10 (1,000/100). Using this method, you have to select every 10th item from your sampling frame to construct your sample.
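The every-10th-item selection can be sketched as follows. One assumption I've added that is not spelled out above: the starting point is chosen at random within the first interval, a common practice so that the first item on the list isn't always selected. The function name and the frame of 1,000 numbered units are hypothetical.

```python
import random


def systematic_sample(frame, sample_size, seed=None):
    """Draw an interval (systematic) sample from an ordered frame."""
    units = list(frame)
    if not 0 < sample_size <= len(units):
        raise ValueError("sample_size must be between 1 and the frame size")
    interval = len(units) // sample_size             # the "factor", e.g. 1,000 / 100 = 10
    start = random.Random(seed).randrange(interval)  # random start in the first interval
    # Take every interval-th unit from the start, capped at sample_size.
    return units[start::interval][:sample_size]


# Hypothetical frame: 1,000 units numbered 1..1000, sample of 100.
sample = systematic_sample(range(1, 1001), 100, seed=2007)
print(sample[:5], len(sample))
```

Keep in mind that if the frame is sorted in a way that repeats at the same interval as your factor, this method can pick up that pattern, so it's worth checking how the list is ordered.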

Judgment Sampling
This is a biased sampling method where the choice of sampling items rests exclusively on the judgment of the analyst(s) carrying out the study. If a sample of 10 students, for instance, has to be selected from a class of 100 students, the analyst chooses the 10 students that he/she thinks best represent the class. Judgment sampling is good for quick-and-dirty studies where the business can't afford to spend too much money.

Using the approaches discussed above, you should now be able to extract a sample from your population.