4.6 The Normal Approximation to the Binomial
The binomial formula is cumbersome when the sample size (n) is large, particularly when we consider a range of observations, as shown in the following example.
Example
Approximately 15% of the US population smokes cigarettes. A local government believed their community had a lower smoking rate and commissioned a survey of 400 randomly selected individuals. The survey found that only 42 of the 400 participants smoke cigarettes. If the true proportion of smokers in the community was really 15%, what is the probability of observing 42 or fewer smokers in a sample of 400 people?
How to solve this
We first need to verify the four conditions for the binomial model are met. The question posed is equivalent to asking, what is the probability of observing k = 0, 1, 2, …, or 42 smokers in a sample of n = 400 when p = 0.15? We can compute these 43 different probabilities and add them together.
Solution
P(k = 0 or k = 1 or · · · or k = 42)
= P(k = 0) + P(k = 1) + · · · + P(k = 42)
= 0.0054
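With technology, the sum above reduces to a short loop. The following Python sketch shows one way to carry it out using only the standard library. The function names binomial_pmf and binomial_cdf are illustrative choices, not notation from this text.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_cdf(k, n, p):
    """P(k or fewer successes), found by adding the k + 1 individual probabilities."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

# Smoking example: n = 400, p = 0.15, 42 or fewer smokers (43 terms in the sum)
prob = binomial_cdf(42, 400, 0.15)
print(round(prob, 4))
```

The printed value should be close to the 0.0054 reported above.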
The computations in the previous example are tedious, long, and nearly impossible if you do not have access to technology. In some cases, we may use the normal distribution as an easier and faster way to estimate binomial probabilities. In general, we should avoid long, tedious work if an alternative method exists that is faster, easier, and still accurate. Recall that calculating probabilities of a range of values is much easier in the normal model. We might wonder if it is reasonable to use the normal model in place of the binomial distribution. Surprisingly, yes, if certain conditions are met.
Binomial Approximation Conditions
Consider the binomial model when the probability of a success is p = 0.10. The following figure shows four hollow histograms for simulated samples from the binomial distribution using four different sample sizes: n = 10, 30, 100, 300. What happens to the shape of the distributions as the sample size increases? What distribution does the last histogram resemble?
As the sample size increases, the distribution changes from blocky and skewed to one that closely resembles the normal distribution, as seen in the last histogram.
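One way to put a number on this visual impression is to measure how far the binomial cumulative probabilities are from those of a normal distribution with the same mean and standard deviation. The sketch below is only an illustration under that assumption: the helper names normal_cdf and max_cdf_gap are hypothetical, and the "largest gap" measure is one choice among many rather than the method used to draw the figure.

```python
from math import comb, erf, sqrt

def normal_cdf(x, mu, sigma):
    """Area to the left of x under a normal curve with mean mu and sd sigma."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def max_cdf_gap(n, p):
    """Largest gap, over counts 0..n, between the binomial cumulative
    probability and the matching normal cumulative probability."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    running = 0.0  # binomial P(X <= k), built up one term at a time
    worst = 0.0
    for k in range(n + 1):
        running += comb(n, k) * p**k * (1 - p)**(n - k)
        worst = max(worst, abs(running - normal_cdf(k, mu, sigma)))
    return worst

# Same setting as the histograms: p = 0.10 with four sample sizes
for n in (10, 30, 100, 300):
    print(n, round(max_cdf_gap(n, 0.10), 3))
```

The gap should shrink as n grows, which matches the histograms becoming more bell shaped.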
The binomial distribution with probability of success p is nearly normal when the sample size n is sufficiently large that np and n(1 − p) are both at least 10. The approximate normal distribution has parameters corresponding to the mean and standard deviation of the binomial distribution:
µ = np and σ = √(np(1 − p))
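The condition check and the two parameters can be bundled into a small helper. This is a minimal sketch: the function name normal_approx_params and its threshold argument are assumptions made for illustration, with the default of 10 taken from the rule stated above and σ computed as the square root of np(1 − p).

```python
from math import sqrt

def normal_approx_params(n, p, threshold=10):
    """Return (ok, mu, sigma) for the normal approximation to a binomial.

    ok is True when both np and n(1 - p) are at least `threshold`.
    """
    expected_successes = n * p
    expected_failures = n * (1 - p)
    ok = expected_successes >= threshold and expected_failures >= threshold
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return ok, mu, sigma

print(normal_approx_params(400, 0.15))  # large sample: conditions met
print(normal_approx_params(10, 0.10))   # np = 1, so the condition fails
```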
The normal approximation may be used when computing the probability of a range of many possible successes. For instance, we may apply the normal distribution to the setting of the previous example.
Example (Continued)
Use the normal approximation to estimate the probability of observing 42 or fewer smokers in a sample of 400, if the true proportion of smokers is p = 0.15.
Having established that the binomial model applies, we next verify that both np and n(1 − p) are at least 10:
- np = 400 × 0.15 = 60
- n(1 − p) = 400 × 0.85 = 340
With these conditions met, we may use the normal approximation in place of the binomial distribution using the mean and standard deviation from the binomial model:
- µ = np = 60 and σ = √(np(1 − p)) = 7.14
We want to find the probability of observing 42 or fewer smokers using this model. Use the normal model N(µ = 60, σ = 7.14) and standardize to estimate this probability. Your answer should be approximately equal to the solution we found in the previous example, 0.0054.
Compute the z-score first.
Solution
Z = (42−60)/7.14 = −2.52.
The corresponding left tail area from the table or technology is 0.0059.
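The same calculation can be done with technology instead of a table. The following Python sketch assumes the standard library's erf function is an acceptable way to obtain normal tail areas; the helper name normal_cdf is illustrative.

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """Area to the left of x under a normal curve with mean mu and sd sigma."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

n, p = 400, 0.15
mu = n * p                      # 60
sigma = sqrt(n * p * (1 - p))   # about 7.14

z = (42 - mu) / sigma           # standardize the cutoff of 42 smokers
left_tail = normal_cdf(42, mu, sigma)
print(round(z, 2), round(left_tail, 4))
```

The results should agree with the Z-score and tail area above, up to rounding.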
The Continuity Correction
The normal approximation to the binomial distribution tends to perform poorly when estimating the probability of a small range of counts, even when the conditions are met.
Suppose we wanted to compute the probability of observing 49, 50, or 51 smokers in 400 when p = 0.15. With such a large sample, we might be tempted to apply the normal approximation and use the range 49–51. However, we would find that the binomial solution and the normal approximation notably differ:
- Binomial: 0.0649
- Normal: 0.0421
We can identify the cause of this discrepancy in the next figure, which shows the areas representing the binomial probability (outlined) and normal approximation (shaded). Notice that the width of the area under the normal distribution is 0.5 units too slim on both sides of the interval.
The normal approximation to the binomial distribution for intervals of values can usually be improved if the cutoff values are modified slightly. The cutoff value for the lower end of a shaded region should be reduced by 0.5, and the cutoff value for the upper end should be increased by 0.5. This is called the continuity correction.
The tip to add extra area when applying the normal approximation is most often useful when examining a range of observations. In the example above, the revised normal distribution estimate is 0.0633, much closer to the exact value of 0.0649. While it is possible to also apply this correction when computing a tail area, the benefit of the modification usually disappears since the total interval is typically quite wide.
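The three estimates for the range 49 to 51 can be reproduced side by side. The Python sketch below uses only the standard library; the helper names and the variables lo and hi are illustrative, and small differences from the values quoted above may arise from rounding Z-scores when a table is used.

```python
from math import comb, erf, sqrt

def normal_cdf(x, mu, sigma):
    """Area to the left of x under a normal curve with mean mu and sd sigma."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 400, 0.15
mu = n * p
sigma = sqrt(n * p * (1 - p))
lo, hi = 49, 51  # range of counts of interest

# Exact binomial probability of 49, 50, or 51 smokers
exact = sum(binomial_pmf(k, n, p) for k in range(lo, hi + 1))

# Normal approximation using the raw cutoffs 49 and 51
uncorrected = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

# Continuity correction: widen the interval by 0.5 on each side
corrected = normal_cdf(hi + 0.5, mu, sigma) - normal_cdf(lo - 0.5, mu, sigma)

print(round(exact, 4), round(uncorrected, 4), round(corrected, 4))
```

The corrected estimate should land much closer to the exact binomial value than the uncorrected one.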
Additional Resources
If you are using an offline version of this text, access the resources for this section via the QR code, or by visiting https://doi.org/10.7294/26207456.
Figure References
Figure 4.23: Kindred Grey (2020). Hollow histograms for different sample sizes. CC BY-SA 4.0.
Figure 4.24: Kindred Grey (2020). Continuity correction. CC BY-SA 4.0.
Figure Descriptions
Figure 4.23: Four hollow histograms side by side. First: represents n = 10 and has higher values towards 0-2 and lower ones to the right. Second: represents n = 30 and has higher values around 4 with lower ones to the left and right of four. Third: represents n = 100 and has x values ranging from zero to 20. Follows a bell shape. Fourth: represents n = 300 and has x values ranging from 10-50. Follows a bell shape.
Figure 4.24: A bell shaped curve with x axis ranges from 40-80 by 10. A section of the graph is highlighted on x = 50.
Binomial random variable: A random variable that counts the number of successes in a fixed number (n) of independent Bernoulli trials, each with probability of success p.
Continuity correction: Adding or subtracting 0.5 to cutoff values to improve the normal approximation.