4.5 The Normal Distribution
The normal (Gaussian) distribution is the most important of all the distributions, continuous or otherwise. Its graph is symmetric, bell-shaped, and unimodal. It is widely used and even more widely abused. You see this distribution in almost all disciplines, including psychology, business, economics, the sciences, nursing, and, of course, mathematics. Some of your instructors may use the normal distribution to help determine your grade. In both the natural world and in human society, many elements—from IQ scores to real estate prices—fit a normal distribution.
The normal distribution has two parameters (i.e., two numerical descriptive measures): the mean (μ) and the standard deviation (σ). If X is a quantity to be measured that has a normal distribution with mean (μ) and standard deviation (σ), we designate this by writing X ~ N(μ, σ).
The probability density function of this curve is as follows:
f(x) = [1 / (σ√(2π))] · e^(–(x – μ)² / (2σ²))
where:
- -∞ < x < ∞
- -∞ < μ < ∞
- σ > 0
As you can see, the normal PDF is a rather complicated function. This could be a problem since the normal distribution is so widely used. However, we will see some ways we can work around this.
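To make the formula concrete, here is a minimal Python sketch of the density, assuming the usual form f(x) = e^(–(x – μ)²/(2σ²)) / (σ√(2π)). The function name normal_pdf is illustrative and is not part of the text. The comparison with statistics.NormalDist from the Python standard library is only a sanity check.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, written out from the formula."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Height of the standard normal curve at its peak: 1 / sqrt(2*pi), about 0.3989
peak = normal_pdf(0, 0, 1)

# Cross-check the hand-written formula against the standard library
manual_value = normal_pdf(44, 50, 6)
library_value = NormalDist(mu=50, sigma=6).pdf(44)
print(round(peak, 4), manual_value, library_value)
```

Note that the density is a height on the curve, not a probability. Probabilities come from areas under the curve.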
The cumulative distribution function is P(X ≤ x). It can be calculated with calculus, with technology, or from a table (though technology has made tables almost obsolete).
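As a rough illustration of the "calculus" and "technology" routes, the sketch below approximates P(X ≤ x) by adding up thin rectangles under the density (a midpoint Riemann sum) and compares the result with the library CDF. The helper name cdf_by_integration, the step count, and the choice to start ten standard deviations below the mean are all assumptions made for this example.

```python
from statistics import NormalDist

def cdf_by_integration(x, mu, sigma, steps=20_000):
    """Approximate P(X <= x) with a midpoint Riemann sum of the density."""
    dist = NormalDist(mu, sigma)
    lower = mu - 10 * sigma  # the area below this point is negligible
    if x <= lower:
        return 0.0
    width = (x - lower) / steps
    total = sum(dist.pdf(lower + (i + 0.5) * width) for i in range(steps))
    return total * width

approx = cdf_by_integration(56, 50, 6)   # area to the left of 56 under N(50, 6)
exact = NormalDist(50, 6).cdf(56)        # the same area from the library
print(round(approx, 4), round(exact, 4))
```

Both routes give the same area to four decimal places, which is why a single library call usually replaces the integration in practice.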
The curve is symmetric about a vertical line drawn through the mean, μ. In theory, the mean is the same as the median, because the graph is symmetric about μ. As the notation indicates, the normal distribution depends only on the mean and the standard deviation. Since the area under the curve must equal one, a change in the standard deviation, σ, causes a change in the shape of the curve, which becomes fatter or skinnier depending on σ. A change in μ causes the graph to shift to the left or right. This means there are an infinite number of normal probability distributions. One of special interest is called the standard normal distribution.
The Empirical Rule
The place to start when working with the normal distribution is the empirical rule. It applies to any normal distribution or data that has a bell-shaped, symmetric curve. According to the rule, if X is a random variable and has a normal distribution with mean µ and standard deviation σ, then:
- Approximately 68% of the values of x are within one standard deviation of the mean (±σ or z-scores of ±1)
- Approximately 95% of the values of x are within two standard deviations of the mean (±2σ or z-scores of ±2)
- Approximately 99.7% of the values of x are within three standard deviations of the mean (±3σ or z-scores of ±3)
The empirical rule is also known as the 68-95-99.7 rule.
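The three percentages can be checked numerically. The sketch below uses statistics.NormalDist from the Python standard library. The helper name within_k_sd is an assumption for this example.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def within_k_sd(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {within_k_sd(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```

The rule's 68%, 95%, and 99.7% are rounded versions of these areas.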
Example
Suppose x has a normal distribution with mean 50 and standard deviation 6.
- About 68% of the x values lie within one standard deviation of the mean. One standard deviation is σ = 6, so these values lie between 50 – 6 = 44 and 50 + 6 = 56. The z-scores are –1 and +1 for 44 and 56, respectively.
- About 95% of the x values lie within two standard deviations of the mean. Two standard deviations is 2σ = (2)(6) = 12, so these values lie between 50 – 12 = 38 and 50 + 12 = 62. The z-scores are –2 and +2 for 38 and 62, respectively.
- About 99.7% of the x values lie within three standard deviations of the mean. Three standard deviations is 3σ = (3)(6) = 18, so these values lie between 50 – 18 = 32 and 50 + 18 = 68. The z-scores are –3 and +3 for 32 and 68, respectively.
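The interval arithmetic in this example follows one pattern, μ ± kσ for k = 1, 2, 3. A small sketch, with the hypothetical helper name empirical_intervals, makes the pattern explicit.

```python
def empirical_intervals(mu, sigma):
    """Return (approximate percent, lower bound, upper bound) for k = 1, 2, 3."""
    percents = (68, 95, 99.7)
    return [(pct, mu - k * sigma, mu + k * sigma)
            for k, pct in zip((1, 2, 3), percents)]

for pct, low, high in empirical_intervals(50, 6):
    print(f"about {pct}% of values lie between {low} and {high}")
# about 68% of values lie between 44 and 56
# about 95% of values lie between 38 and 62
# about 99.7% of values lie between 32 and 68
```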
Your Turn!
From 1984 to 1985, the mean height of 15- to 18-year-old males from Chile was 172.36 cm, and the standard deviation was 6.34 cm. Let Y represent the height of 15- to 18-year-old males in 1984 to 1985. Then Y ~ N(172.36, 6.34).
a. About 68% of the y values lie between what two values? These values are ________ and ________. The z-scores are ________ and ________, respectively.
b. About 95% of the y values lie between what two values? These values are ________ and ________. The z-scores are ________ and ________, respectively.
c. About 99.7% of the y values lie between what two values? These values are ________ and ________. The z-scores are ________ and ________, respectively.
Solution
a. About 68% of the values lie between 166.02 cm and 178.70 cm. The z-scores are –1 and 1.
b. About 95% of the values lie between 159.68 cm and 185.04 cm. The z-scores are –2 and 2.
c. About 99.7% of the values lie between 153.34 cm and 191.38 cm. The z-scores are –3 and 3.
Finding Normal Probabilities
The shaded area in the following graph indicates the area to the left of x. This area is represented by the probability P(X < x).
If we know the area to the left, we can then use the complement rule to find the area to the right:
P(X > x) = 1 – P(X < x)
To find the area between two numbers, we can write the equation in terms of a CDF:
P(a < X < b) = P(X < b) – P(X < a)
Also recall that for continuous distributions:
P(X < x) = P(X ≤ x) and P(X > x) = P(X ≥ x)
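These rules translate directly into code once a CDF is available. The sketch below is one possible arrangement, using statistics.NormalDist from the Python standard library. The helper names and the N(50, 6) distribution are assumptions chosen for illustration.

```python
from statistics import NormalDist

def prob_less(dist, x):
    """P(X < x): the area to the left of x, which is the CDF."""
    return dist.cdf(x)

def prob_greater(dist, x):
    """P(X > x) = 1 - P(X < x), by the complement rule."""
    return 1 - dist.cdf(x)

def prob_between(dist, a, b):
    """P(a < X < b) = P(X < b) - P(X < a)."""
    return dist.cdf(b) - dist.cdf(a)

X = NormalDist(mu=50, sigma=6)  # illustrative distribution
left = prob_less(X, 56)
right = prob_greater(X, 56)
middle = prob_between(X, 44, 56)
print(round(left, 4), round(right, 4), round(middle, 4))
# 0.8413 0.1587 0.6827
```

Because the distribution is continuous, the same functions answer the "≤" and "≥" versions of each question.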
There are three main ways we could find probabilities associated with the normal distribution:
- Complicated math
- The standardizing process
- Technology
If you recall the formula previously presented for the PDF of the normal distribution, you could imagine why it’s preferable to avoid involving complicated math if possible.
In order to work around that, there is a process called standardizing that involves z-scores, the standard normal distribution, and tables. Although this tried and true process is now somewhat antiquated, it is a great place to start.
There are many technologies (e.g., calculators and various pieces of statistical software) that let us skip the entire standardizing process and instantaneously provide us with a probability. Although we typically have these at our disposal to use in practice, it is good to understand the process going on behind the scenes to make sure we apply our technology correctly.
The Standard Normal Distribution
The standard normal distribution (SND) is the simplest form of the normal distribution. The mean for the standard normal distribution is zero, and the standard deviation is one. The transformation z = (x – μ)/σ produces the distribution Z ~ N(0, 1). The value x in the given equation comes from a normal distribution with mean μ and standard deviation σ.
Recall our previous discussion of z-scores, which express a value in units of standard deviations. If X is a normally distributed random variable and X ~ N(μ, σ), then the z-score is:
z = (x – μ)/σ
Recall a z-score tells you how many standard deviations the value x is above (to the right of) or below (to the left of) the mean, μ. Values of x that are larger than the mean have positive z-scores, and values of x that are smaller than the mean have negative z-scores. If x equals the mean, then x has a z-score of zero.
We have the z-table at our disposal with probabilities already calculated and organized. Note that most z-tables give us the left-tailed, CDF, or “less than” probability. For example, the area to the left of a z-score of -3.37 is P(Z ≤ -3.37) = 0.0004.
The SND CDF value, P(Z ≤ z), is also denoted as Φ(z). We can then use these CDF values, P(Z ≤ z), and some probability rules to find greater than [P(Z ≥ z) = 1-P(Z ≤ z)] or in-between [P(a ≤ Z ≤ b) = P(Z ≤ b) – P(Z ≤ a)] probabilities.
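A z-table is a printed list of Φ(z) values. One common way to compute the same values is through the error function, Φ(z) = ½[1 + erf(z/√2)], which is a standard identity rather than something derived in this section. The sketch below uses math.erf from the Python standard library. The name phi is chosen here for illustration.

```python
import math
from statistics import NormalDist

def phi(z):
    """Standard normal CDF, P(Z <= z), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

table_value = round(phi(-3.37), 4)   # the z-table entry quoted above, 0.0004
greater_than = 1 - phi(1)            # P(Z >= 1) by the complement rule
in_between = phi(1) - phi(-1)        # P(-1 <= Z <= 1)
print(table_value, round(greater_than, 4), round(in_between, 4))
```

Values computed this way can differ from table arithmetic in the last digit, because a table rounds each entry to four decimal places before any subtraction.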
Example
Use the z-table to find the following probabilities:
P(Z ≤ 1)
Solution
0.8413
P(Z ≥ 1)
Solution
P(Z ≥ 1) = 1 – P(Z ≤ 1) = 1 – 0.8413 = 0.1587
P(-1 ≤ Z ≤ 1)
Solution
P(–1 ≤ Z ≤ 1) = P(Z ≤ 1) – P(Z ≤ –1) = 0.8413 – 0.1587 = 0.6826
Your Turn!
Use the z-table to find the following probabilities:
P(Z ≤ -0.54)
P(Z ≥ 1.2)
P(-1.5 ≤ Z ≤ 0.84)
The Standardizing Process
So far, we have discussed converting any normal distribution with any mean and standard deviation to the standard normal distribution in units of z-scores. We also have the associated probabilities in our z-table. Essentially, the work has been done for us if we know how to standardize and look up the associated probability in the table. The general process is:
X ~ N(μ, σ) -> Z ~ N(0, 1) -> probability from z-table
Although this process is somewhat outdated now that technology is widely available, it is worth understanding as a beginner and remains useful when technology is not at hand.
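The two steps of the process can be sketched in a few lines of Python. The check below suggests that standardizing first and then using the standard normal CDF gives the same probability as asking the original distribution directly. The helper name z_score and the numbers (mean 50, standard deviation 6, x = 56) are illustrative assumptions.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Step 1: convert x from N(mu, sigma) to a z-score."""
    return (x - mu) / sigma

mu, sigma, x = 50, 6, 56                 # illustrative values
z = z_score(x, mu, sigma)                # 1.0

via_z = NormalDist().cdf(z)              # Step 2: look up P(Z <= z)
direct = NormalDist(mu, sigma).cdf(x)    # what technology does in one call
print(z, round(via_z, 4), round(direct, 4))
```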
Example
Height and weight are two measurements used to track a child’s development. The World Health Organization measures child development by comparing the weights of children who are the same height and the same gender. In 2009, weights for all 80 cm girls in the reference population had a mean µ = 10.2 kg and standard deviation σ = 0.8 kg. Weights are normally distributed.
X ~ N(10.2, 0.8)
Calculate the z-scores that correspond to the following weights, then find the associated probabilities.
The probability that a child weighs less than 11 kg
Solution
(11 – 10.2)/0.8 = 1
A child who weighs 11 kg is one standard deviation above the mean of 10.2 kg.
P(Z ≤ 1) = 0.8413
The probability that a child weighs more than 7.9 kg
Solution
(7.9 – 10.2)/0.8 = –2.875
A child who weighs 7.9 kg is 2.875 standard deviations below the mean of 10.2 kg.
Rounding z to –2.88 to match the z-table: P(Z ≥ –2.88) = 1 – P(Z ≤ –2.88) = 1 – 0.0020 = 0.9980
The probability that a child weighs between 11.2 and 12.2 kg
Solution
z1 = (11.2 – 10.2)/0.8 = 1.25 and z2 = (12.2 – 10.2)/0.8 = 2.5
A child who weighs 11.2 kg is 1.25 standard deviations above the mean of 10.2 kg, and a child who weighs 12.2 kg is 2.5 standard deviations above it.
P(1.25 ≤ Z ≤ 2.5) = P(Z ≤ 2.5) – P(Z ≤ 1.25) = 0.9938 – 0.8944 = 0.0994
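The worked answers above can be reproduced with software. The sketch below also compares a table-style calculation (z rounded to two decimals, table entry rounded to four) with the unrounded one, to show how little the rounding matters here. The variable and function names are assumptions for this example.

```python
from statistics import NormalDist

Z = NormalDist()          # standard normal
mu, sigma = 10.2, 0.8     # weights of 80 cm girls, in kg

def z_score(x):
    """Standardize a weight x against the reference population."""
    return (x - mu) / sigma

# P(X < 11)
p_less_11 = Z.cdf(z_score(11))

# P(X > 7.9): exact z versus the two-decimal z a printed table requires
z_exact = z_score(7.9)                        # about -2.875
p_more_exact = 1 - Z.cdf(z_exact)
p_more_table = 1 - round(Z.cdf(-2.88), 4)     # mimics the table lookup

# P(11.2 < X < 12.2)
p_between = Z.cdf(z_score(12.2)) - Z.cdf(z_score(11.2))

print(round(p_less_11, 4), round(p_more_exact, 4),
      round(p_more_table, 4), round(p_between, 4))
```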
Your Turn!
The golf scores for a school team were normally distributed with a mean of 68 and a standard deviation of three.
Find the probability that a randomly selected golfer scored less than 65.
Find the probability that a golfer scored between 66 and 70.
Working Backwards
Sometimes, we may be given a percentile or z-score and want to work backward through the standardizing process to find a value on the original distribution. This “un-standardizing” process of finding a normal quantile or percentile associated with the normal distribution looks like this:
Probability in z-table -> Z ~ N(0, 1) -> X ~ N(μ, σ)
For example, if the mean of a normal distribution is five and the standard deviation is two, what value is three standard deviations above (to the right of) the mean (z-score = 3)? Rearranging the z-score formula, the calculation is as follows:
x = μ + (z)(σ) = 5 + (3)(2) = 11
Often, we are given a percentile to find on the original distribution. For example, what if we want to know a value on the previous distribution that corresponds to the 90th percentile? We can look up a probability of 0.9 in the z-table and find a corresponding z-score of approximately 1.28.
x = μ + (z)(σ) = 5 + (1.28)(2) = 7.56
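In software, the reverse table lookup corresponds to an inverse CDF. A minimal sketch using statistics.NormalDist.inv_cdf from the Python standard library follows. The helper name unstandardize is an assumption for this example.

```python
from statistics import NormalDist

def unstandardize(z, mu, sigma):
    """Convert a z-score back to the original scale: x = mu + z * sigma."""
    return mu + z * sigma

x_three_sd = unstandardize(3, 5, 2)            # 11

z_90 = NormalDist().inv_cdf(0.90)              # about 1.28, the reverse lookup
x_90 = unstandardize(z_90, 5, 2)               # about 7.56
x_90_direct = NormalDist(5, 2).inv_cdf(0.90)   # same value in a single call
print(x_three_sd, round(z_90, 2), round(x_90, 2), round(x_90_direct, 2))
```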
Example
A citrus farmer who grows mandarin oranges finds that the diameters of mandarin oranges harvested on his farm follow a normal distribution with a mean diameter of 5.85 cm and a standard deviation of 0.24 cm.
Find the 90th percentile for the diameters of mandarin oranges.
Solution
6.16
The middle 20% of mandarin oranges from this farm have diameters between ________ and ________.
Solution
Between 5.79 and 5.91
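These answers can be checked with an inverse CDF. The sketch below assumes that the first solution, 6.16, is the 90th percentile, which is an inference from the numbers rather than something stated above. It also reads "the middle 20%" as the range from the 40th to the 60th percentile, leaving 40% in each tail.

```python
from statistics import NormalDist

diameters = NormalDist(mu=5.85, sigma=0.24)   # diameters in cm

p90 = diameters.inv_cdf(0.90)    # assumed target of the first solution

# Middle 20%: 10% on either side of the median
low = diameters.inv_cdf(0.40)
high = diameters.inv_cdf(0.60)

print(round(p90, 2), round(low, 2), round(high, 2))
# 6.16 5.79 5.91
```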
Your Turn!
Two thousand students took an exam. The scores on the exam have an approximate normal distribution with a mean μ = 81 points and standard deviation σ = 15 points.
- Calculate the first and third quartile scores for this exam.
- The middle 50% of the exam scores are between what two values?
Figure References
Figure 4.19: Kindred Grey (2020). Normal distribution. CC BY-SA 4.0.
Figure 4.20: Kindred Grey (2020). Empirical rule. CC BY-SA 4.0.
Figure 4.21: Kindred Grey (2020). P(X < x). CC BY-SA 4.0.
Figure 4.22: Kindred Grey (2020). Z-table. CC BY-SA 4.0.
Figure Descriptions
Figure 4.19: Bell-shaped curve diagram with the lower case Greek letter mu at the center of the x axis. It has the label Normal: uppercase X is similar to N (μ, σ)
Figure 4.20: Frequency curve that illustrates the empirical rule. The normal curve is shown over a horizontal axis. The axis is labeled with points –3σ, –2σ, –1σ, μ, 1σ, 2σ, 3σ. Vertical lines connect the axis to the curve at each labeled point. The peak of the curve aligns with the point μ.
Figure 4.21: Diagram showing a bell-shaped curve with uppercase X at the extreme right end of the X axis. The X axis also contains a lowercase x about one-quarter of the way across the X axis from the right. The area under the bell curve to the left of the lowercase x is shaded. The label states: shaded area represents probability P(X less than x).
Figure 4.22: Z score table that highlights the associated value of 0.0004 with Z value of -3.37.
Normal distribution: A commonly used symmetric, unimodal, bell-shaped, and continuous probability distribution
Empirical rule: Roughly 68% of values are within one standard deviation of the mean, roughly 95% of values are within two standard deviations of the mean, and roughly 99.7% of values are within three standard deviations of the mean
Standard normal distribution: A normal distribution with a mean of 0 and standard deviation of 1, which z-scores follow; denoted N(0, 1)
z-score: A measure of location that tells us how many standard deviations a value is above or below the mean
Quantiles (percentiles): Points in a distribution that relate to the rank order of values in that distribution