Demystifying statistics: Estimating sample size for a survey

Introduction

Whether you want to understand people’s preferences for a product, estimate the proportion of people preferring a political party or estimate the prevalence of a disease in a population, you will need to calculate the number of respondents sufficient for your survey objective. How can you calculate this magic number?

To obtain a completely accurate answer, you would have to ask each and every individual in your population. A population in statistics is defined as all the individuals about whom you want to obtain information. For example, if you want to understand voters’ preferences for a political party in a country, then all the voters of that country are the population. If you are interested in estimating the prevalence of obesity among teenagers in a country, then all teenagers of that country make up your population. If you are interested in understanding the proportion of students in a school preferring a particular brand of chocolates, then all the students in that school are your population.

Fortunately, it is not necessary to ask each and every individual in the population if you are willing to be a little less accurate. You can collect data from only a subset of people and still get reasonably good answers that reflect the perceptions, habits, preferences or disease status of the population. That subset of people is called a sample, and the number of people that you need to select is called the sample size.

How big should your sample size be? It depends on the following four parameters:

Population size

Paradoxically, the size of your target population doesn’t have a huge influence on sample size if the population is ‘large’, which is usually the case. Therefore, in most practical situations you can specify your population size to be ‘infinite’. The actual population size only needs to be specified if you are planning to sample a considerable proportion of the population (say >10%).
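To see why a large population barely matters, here is a minimal sketch of the standard finite population correction, n / (1 + (n - 1) / N). The formula is not given in this post, so treat it as an assumption about how such an adjustment is usually made; the function name and the starting value of 385 are purely illustrative.

```python
import math


def adjust_for_finite_population(n_infinite: int, population_size: int) -> int:
    """Shrink an 'infinite population' sample size for a population of known size.

    Uses the standard finite population correction (an assumption here,
    the post itself does not state a formula).
    """
    if n_infinite <= 0 or population_size <= 0:
        raise ValueError("both arguments must be positive")
    adjusted = n_infinite / (1 + (n_infinite - 1) / population_size)
    return math.ceil(adjusted)  # round up so precision is not lost


n_infinite = 385  # hypothetical sample size calculated for an infinite population
for population_size in (1_000, 10_000, 1_000_000):
    adjusted = adjust_for_finite_population(n_infinite, population_size)
    print(f"population {population_size:>9,}: sample size {adjusted}")
```

The correction only makes a visible difference when the sample is a sizeable share of the population; for a population of a million the adjusted value is the same as the unadjusted one.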

Expected proportion

It might sound ironic, but you do need to guesstimate the proportion you expect to obtain. You could obtain this estimate from previous surveys or from expert opinion, but if there is absolutely no information about what you are trying to measure, then a value of 50% can be used to be conservative, as it gives the largest sample size.
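Why does 50% give the largest sample size? In the usual normal-approximation formula (an assumption, since this post does not spell the formula out), the sample size is proportional to p × (1 - p), where p is the expected proportion. A quick sketch shows that this term peaks at 0.5 and is symmetric around it; the function name is illustrative only.

```python
def variance_term(p: float) -> float:
    """Return p * (1 - p), the part of the formula that depends on the expected proportion."""
    return p * (1 - p)


proportions = [0.05, 0.10, 0.30, 0.50, 0.70, 0.90, 0.95]
for p in proportions:
    print(f"expected proportion {p:.2f} -> p(1-p) = {variance_term(p):.4f}")

# The proportion that demands the largest sample is the one with the largest term.
largest = max(proportions, key=variance_term)
print("largest term at", largest)
```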

Precision or margin of error

It tells you how much error or imperfection in your result you are willing to accept. If you are willing to accept a larger error, then your sample size will be smaller, and vice versa. For example, let’s assume that 40% of the people in a sample prefer a political party. With a margin of error of 5%, you will be ‘fairly’ confident that the proportion of people in the population preferring that party is between 35% and 45% (i.e. 40 ± 5). This range is called a confidence interval in statistical jargon. Similarly, with a 2% margin of error the confidence interval will be from 38% to 42%. Thus, with a 2% margin of error you will get a more precise answer than with a 5% margin of error, but you will need a larger sample size.
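How much larger? Under the usual normal-approximation formula (assumed here, not stated in this post), the sample size is inversely proportional to the square of the margin of error. The sketch below builds the two intervals from the example and compares the sample sizes they imply; both function names are hypothetical.

```python
def interval(estimate: float, margin: float) -> tuple[float, float]:
    """Return the (lower, upper) limits for an estimate and an absolute margin of error."""
    return (estimate - margin, estimate + margin)


def sample_size_ratio(wide_margin: float, narrow_margin: float) -> float:
    """How many times larger the sample must be to narrow the margin of error.

    Assumes sample size is proportional to 1 / margin**2.
    """
    return (wide_margin / narrow_margin) ** 2


for margin in (0.05, 0.02):
    low, high = interval(0.40, margin)
    print(f"margin {margin:.0%}: interval {low:.0%} to {high:.0%}")

ratio = sample_size_ratio(0.05, 0.02)
print(f"a 2% margin needs about {ratio:.2f} times the sample of a 5% margin")
```

So tightening the margin is expensive: going from 5% to 2% multiplies the required sample by roughly six, all else being equal.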

Level of confidence

Note that we can only be fairly (not 100%) confident that 35% to 45% of people would prefer the political party in the above example. This word ‘fairly’ is quantified using the level of confidence. Usually a value of 95% is used for the level of confidence, but other levels (such as 90% or 99%) can also be used. So if you want to be 99% confident of your result that 35% to 45% of people prefer a party, you will have to select a larger sample size than if you are willing to be a little less (95% or 90%) confident. Statistically speaking, 95% confidence means that if you repeat your survey a large number of times and calculate a confidence interval each time, your intervals will include the true population proportion about 95% of the time. If you are not sure what confidence level you need, it is better to stick to the conventional value of 95%.
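The ‘repeat your survey many times’ interpretation can be checked with a small simulation. This is only a sketch under stated assumptions: a true proportion of 40%, simple random samples of 400 people, and the simple normal-approximation (Wald) interval, none of which are prescribed by this post. The function name, defaults and seed are illustrative.

```python
import math
import random
from statistics import NormalDist


def coverage(true_p: float = 0.40, n: int = 400, confidence: float = 0.95,
             repeats: int = 2000, seed: int = 1) -> float:
    """Fraction of simulated surveys whose interval contains the true proportion."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    hits = 0
    for _ in range(repeats):
        # One simulated survey: each respondent prefers the party with probability true_p.
        successes = sum(rng.random() < true_p for _ in range(n))
        p_hat = successes / n
        margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= true_p <= p_hat + margin:
            hits += 1
    return hits / repeats


for level in (0.90, 0.95, 0.99):
    print(f"nominal {level:.0%} confidence: observed coverage {coverage(confidence=level):.1%}")
```

The observed coverage should land close to the nominal level, though not exactly on it, because the simulation is itself a sample and the interval is an approximation.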

That’s it! So assuming an infinite population, you just need to specify the expected proportion (use 50% if you are not sure), the level of confidence (use 95% if not sure) and the margin of error to calculate sample size for estimating a proportion.
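For readers who like to see the arithmetic, the three inputs are usually combined with the standard normal-approximation formula below. This post does not state a formula, so take it as the conventional textbook form rather than a description of any particular calculator. Here p is the expected proportion, d is the absolute margin of error, and z is the standard normal quantile for the chosen level of confidence (about 1.96 for 95%); the result is rounded up to the next whole person.

```latex
n = \frac{z_{1-\alpha/2}^{2}\; p\,(1-p)}{d^{2}},
\qquad \text{where } 1-\alpha \text{ is the level of confidence.}
```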

Notes:

Margin of error can be a bit tricky to decide, as its interpretation depends on the expected proportion. For example, a margin of error of 5% is okay for an expected proportion of 40% (as the confidence interval will be from 35% to 45%) but not for an expected proportion of 4%, as the confidence interval would be from -1% to 9%, which does not make sense. Therefore, it is sometimes recommended to select the margin of error relative to the expected proportion. For example, if we decide on a relative margin of error of ‘10% of the expected proportion’, then the absolute margin of error will be 5% for an expected proportion of 50% (i.e. 10% × 50%) and 0.5% for an expected proportion of 5% (10% × 5%). Both of these margins make good sense.
The above approach calculates the sample size for estimating a proportion with a certain confidence. The approach to calculating the sample size for comparing proportions is a bit different. I will discuss it in a future blog.
This sample size calculation assumes simple random sampling. We will discuss calculation of sample sizes for other designs in a future blog.
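The first note, on relative versus absolute margins of error, is easy to get wrong by a factor of 100, so here is a small sketch of the conversion together with a check that the resulting interval stays between 0% and 100%. The function names are hypothetical and all proportions are written as fractions (0.05 means 5%).

```python
def absolute_margin(expected_proportion: float, relative_margin: float) -> float:
    """Convert a relative margin of error into an absolute one."""
    return expected_proportion * relative_margin


def interval_is_sensible(expected_proportion: float, abs_margin: float) -> bool:
    """True if expected proportion ± margin stays within the range 0 to 1."""
    return 0 <= expected_proportion - abs_margin and expected_proportion + abs_margin <= 1


# A fixed 5% absolute margin: fine at 40%, nonsensical at 4%.
print(interval_is_sensible(0.40, 0.05))
print(interval_is_sensible(0.04, 0.05))

# A 10% relative margin adapts to the expected proportion.
for p in (0.50, 0.05):
    margin = absolute_margin(p, 0.10)
    print(f"expected {p:.0%}: absolute margin {margin:.1%}, sensible: {interval_is_sensible(p, margin)}")
```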

Implementation

Sample size can be easily calculated using a calculator that we recently developed: http://statulator.com/SampleSize/ss1P.html. The calculator also shows a visualisation of the changes in sample size for a range of expected proportions and margins of error. You can also create a table with a range of sample sizes and download it for discussion with your colleagues or collaborators.

For example, if you expect 15% of teenagers in a large population to be obese and you want to estimate this proportion with 95% confidence and a 10% relative margin of error (i.e. 10% × 15% = 1.5% absolute margin of error), specify these values in the calculator and click calculate to obtain the required sample size (2177 individuals). The calculator also interprets the results for you, which you could adapt for your project proposal or a journal article.
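If you would like to check this figure by hand, the sketch below applies the standard normal-approximation formula for a proportion in an infinite population. This is an assumption about the method, not a description of the calculator’s internals, although it does arrive at the same number for this example. The function name is illustrative.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(expected_proportion: float, absolute_margin: float,
                               confidence: float = 0.95) -> int:
    """Sample size to estimate a proportion, assuming simple random sampling
    from an infinite population and the normal approximation."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    n = z ** 2 * expected_proportion * (1 - expected_proportion) / absolute_margin ** 2
    return math.ceil(n)  # round up to a whole respondent


expected = 0.15                       # 15% of teenagers expected to be obese
relative_margin = 0.10                # 10% of the expected proportion
absolute = expected * relative_margin  # 1.5% absolute margin of error

print(sample_size_for_proportion(expected, absolute))
```

Changing `confidence` to 0.99, or halving the margin, shows how quickly the required sample grows.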

This calculator also provides some other options to adjust the sample size for clustering, response rate, etc., which I will discuss in a future blog.