2.1 Descriptive Statistics and Frequency Distributions
Learning Objectives
By the end of this chapter, the student should be able to:
- Display and interpret categorical data
- Display and interpret quantitative data
- Recognize, describe, and calculate the measures of the center of quantitative data
- Recognize, describe, and calculate the measures of spread of quantitative data
- Recognize, describe, and calculate the measures of location of quantitative data
- Identify outliers in quantitative data
Descriptive Statistics
Once you collect data, what do you do with it? Data can be described and presented in many different formats. For example, suppose you are interested in buying a house in a particular area. If you have no clue about house prices, you might ask your real estate agent to give you a sample dataset of prices, but looking through all the prices can be overwhelming. A better way might be to look at numerical descriptions such as the average or median house price. Your agent might also provide you with a graph of the data.
In this chapter, you will study numerical and graphical ways to describe and display your data. This area of statistics is called descriptive statistics. We will look at both graphical and numerical descriptive methods. You will learn how to construct, calculate, and, most importantly, interpret these measurements and graphs.
Numerical descriptors consist of summary statistics (typically calculated from a sample) that represent important aspects such as the central tendency and variability of a distribution or the relative standing of a single observation with regard to the rest of the distribution.
Graphical descriptive methods consist of charts, tables, and graphs. These are tools that help you learn about the distribution, or shape, of a sample or a population. A graph can be a more effective way of presenting data than a mass of numbers because we can see where the data cluster and where there are only a few data values. Newspapers and internet sources use graphs to show trends and to enable readers to compare facts and figures quickly. Statisticians often graph data first to get a picture of the data before more formal tools are applied.
The type of graph you choose first depends on the type of data with which you are working. Some of the types of graphs used to display categorical data are pie charts and bar charts. Some graphs that are used to summarize and organize quantitative data are the dot plot, the histogram, the stem-and-leaf plot, the frequency polygon, the box plot, and, in special cases, the time series plot. The emphasis here will be on histograms and box plots.
We will start by looking at a graphical method that can display any type of data, the frequency table.
Frequency Tables
Frequency tables are a great starting place for summarizing and organizing your data. Once you have a set of data, you may first want to organize it to see the frequency (how often each value occurs in the set).
Frequency tables can be used to show either quantitative or categorical data. Displaying categorical data in a frequency table is fairly straightforward since you already have clearly defined categories. For example, if you polled 20 kindergarteners on their favorite colors, you could construct the following simple frequency table:
Color | Frequency |
---|---|
Red | 2 |
Orange | 2 |
Yellow | 1 |
Green | 3 |
Blue | 4 |
Purple | 3 |
Pink | 4 |
Clear with sparkles | 1 |
Total | 20 |
Figure 2.2: Frequency table of children’s favorite colors
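As a minimal sketch of how such a table can be tallied, the Python snippet below counts each category with `collections.Counter`. The list `responses` is hypothetical: the individual answers are not given in the text, so it is rebuilt here from the counts in the table above.

```python
from collections import Counter

# Hypothetical raw responses, rebuilt from the counts in the table
# (the order of the individual answers is an assumption).
responses = (
    ["Red"] * 2 + ["Orange"] * 2 + ["Yellow"] * 1 + ["Green"] * 3
    + ["Blue"] * 4 + ["Purple"] * 3 + ["Pink"] * 4
    + ["Clear with sparkles"] * 1
)

# Counter tallies how often each category occurs.
frequency = Counter(responses)

# Print one row per category, then the total number of responses.
for color, count in frequency.items():
    print(f"{color:<20} {count}")
print(f"{'Total':<20} {sum(frequency.values())}")
```

The total printed on the last row should equal the number of children polled, which is a quick check that no response was dropped.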
Some quantitative data, especially discrete data, may contain only a limited number of values, so little thought is needed to create the frequency table. Other data may have a natural grouping. For example, if you were organizing adults aged 20–69, it might make intuitive sense to group them as follows:
- 20–29
- 30–39
- 40–49
- 50–59
- 60–69
Each group is typically called a class, or bin. Consider the 30–39 class. In this case, 30 is the lower class limit, while 39 is the upper class limit. The class width is defined as the difference between consecutive lower class limits. For the class 30–39, the class width is 40 − 30 = 10. The class midpoint is found by adding the lower limit and upper limit, then dividing by 2. For the class 30–39, the class midpoint would be calculated as follows:
$$\text{class midpoint} = \frac{30 + 39}{2} = 34.5$$
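A minimal Python sketch of these definitions, using the age classes listed above, might look like the following. The variable names are illustrative only.

```python
# Lower and upper class limits of the age groups 20–29 through 60–69.
lower_limits = [20, 30, 40, 50, 60]
upper_limits = [29, 39, 49, 59, 69]

# Class width: difference between consecutive lower class limits.
class_width = lower_limits[2] - lower_limits[1]

# Class midpoint: lower limit plus upper limit, divided by 2.
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

print(class_width)  # 10
print(midpoints)    # [24.5, 34.5, 44.5, 54.5, 64.5]
```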
Depending on the format and precision of the data reported, we may have to decide how best to bin, or group, our data. Grouping data may not always be a clean or intuitive process. A convenient starting point is a lower value carried out to one more decimal place than the value with the most decimal places. For example, if the value with the most decimal places is 6.1 and this is the smallest value, a convenient starting point is 6.05 (6.1 – 0.05 = 6.05), which is more precise. If the value with the most decimal places is 2.23 and the lowest value is 1.5, a convenient starting point is 1.495 (1.5 – 0.005 = 1.495). If the value with the most decimal places is 3.234 and the lowest value is 1.0, a convenient starting point is 0.9995 (1.0 – 0.0005 = 0.9995). If the data is entirely made up of integers and the smallest value is 2, then a convenient starting point is 1.5 (2 – 0.5 = 1.5). When the starting point and other boundaries are carried to one additional decimal place, no data value will fall on a boundary. The next two examples go into detail about how to construct a histogram using continuous data and how to create a histogram using discrete data.
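The starting-point rule above can be sketched in Python as follows. This is one possible reading of the rule, not a standard library routine: the helper name `convenient_start` is hypothetical, values are passed as strings so that trailing decimal places are preserved, and any data values other than those quoted in the text are made up for illustration.

```python
from decimal import Decimal

def convenient_start(values):
    """Return the lowest value minus half a unit of the finest decimal place."""
    decimals = [Decimal(v) for v in values]
    # Most decimal places found in any value (0 for integers).
    places = max(max(0, -d.as_tuple().exponent) for d in decimals)
    # Half of one unit in that last place, e.g. 0.05 when places == 1.
    half_unit = Decimal(5) * Decimal(10) ** -(places + 1)
    return min(decimals) - half_unit

# The four cases from the text (extra values are hypothetical filler).
print(convenient_start(["6.1", "7.3", "8.0"]))  # 6.05
print(convenient_start(["2.23", "1.5"]))        # 1.495
print(convenient_start(["3.234", "1.0"]))       # 0.9995
print(convenient_start(["2", "5", "9"]))        # 1.5
```

`Decimal` is used instead of floating-point numbers so that results such as 1.495 are exact rather than approximations.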
The next question may concern how many bins to use. Generally, anywhere from 5 to 20 bins works well: too few bins hide the shape of the distribution, while too many spread the data so thinly that the pattern is hard to see. A good place to start is the square root of your number of observations (n). Some other basic guidelines are that bins should not overlap or have gaps between them, should all have the same width, and should cover the entire range of the data. The class limits and width should be “reasonable” numbers (e.g., whole numbers, or multiples of five or ten). In the end, the best choice depends on the format of your data, but following these general guidelines should help make your table useful.
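These guidelines can be sketched in a few lines of Python. Both helper names are hypothetical, and clamping the square-root value to the 5 to 20 range is simply one way of combining the two rules of thumb above.

```python
import math

def suggest_bin_count(n, low=5, high=20):
    """Square-root starting point, kept within the 5 to 20 guideline."""
    return min(high, max(low, round(math.sqrt(n))))

def equal_width_boundaries(start, width, count):
    """Boundaries of `count` adjacent bins of equal width (no gaps, no overlap)."""
    return [start + i * width for i in range(count + 1)]

print(suggest_bin_count(100))  # 10
print(suggest_bin_count(20))   # 5 (the square root is about 4.47, raised to the minimum)
print(equal_width_boundaries(20, 10, 5))  # [20, 30, 40, 50, 60, 70]
```

The suggested count is only a starting point; adjusting it so that the boundaries land on "reasonable" numbers is usually worth the small departure from the square-root rule.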
Relative Frequencies
A relative frequency is the ratio (fraction or proportion) of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes. To find the relative frequencies, divide each frequency by the total number of observations in the sample (20 in the favorite-color example above). Relative frequencies can be written as fractions, percents, or decimals. To find the relative frequency:
$$RF = \frac{f}{n}$$
in which:
- f = frequency
- n = total number of data values (or the sum of the individual frequencies)
- RF = relative frequency
For example, if three students in Mr. Ahab’s English class of 40 students received scores from 90% to 100%, then f = 3, n = 40, and $RF = \frac{f}{n} = \frac{3}{40} = 0.075$. Of the students, 7.5% received scores between 90% and 100%. In this case, 90–100% are quantitative measures.
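The same calculation can be checked with a short Python sketch. The function name `relative_frequency` is illustrative, and `fractions.Fraction` is used so the result can be shown as a fraction, a decimal, and a percent.

```python
from fractions import Fraction

def relative_frequency(f, n):
    """Relative frequency: frequency f divided by the total number of values n."""
    return Fraction(f, n)

rf = relative_frequency(3, 40)
print(rf)                  # 3/40
print(float(rf))           # 0.075
print(f"{float(rf):.1%}")  # 7.5%
```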
Cumulative relative frequency is the accumulation of the previous relative frequencies. To find the cumulative relative frequencies, add all the previous relative frequencies to the relative frequency for the current row, as shown in the figure below.
NOTES:
- The sum of all frequencies will add up to n, or your sample size.
- All relative frequencies should add up to one (allowing for rounding).
- The first entry of the cumulative relative frequency column will be the same as the first entry of the relative frequency column since there is nothing to accumulate.
- The last entry of the cumulative relative frequency column is one, indicating that 100% of the data has been accumulated.
Example
The following table represents one way of grouping the heights, in inches, of a sample of 100 male semiprofessional soccer players.
Heights (inches) | Frequency | Relative frequency | Cumulative relative frequency |
---|---|---|---|
59.95–61.95 | 5 | 5/100 = 0.05 | 0.05 |
61.95–63.95 | 3 | 3/100 = 0.03 | 0.05 + 0.03 = 0.08 |
63.95–65.95 | 15 | 15/100 = 0.15 | 0.08 + 0.15 = 0.23 |
65.95–67.95 | 40 | 40/100 = 0.40 | 0.23 + 0.40 = 0.63 |
67.95–69.95 | 17 | 17/100 = 0.17 | 0.63 + 0.17 = 0.80 |
69.95–71.95 | 12 | 12/100 = 0.12 | 0.80 + 0.12 = 0.92 |
71.95–73.95 | 7 | 7/100 = 0.07 | 0.92 + 0.07 = 0.99 |
73.95–75.95 | 1 | 1/100 = 0.01 | 0.99 + 0.01 = 1.00 |
Total | 100 | 1.00 | |
Figure 2.3: Frequency table of soccer player height
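As a rough sketch, the relative and cumulative relative frequency columns can be computed from the frequency column alone. The snippet below uses the frequencies from the soccer player table. Accumulating the integer counts first and dividing afterward is a choice made here to avoid small floating-point rounding drift.

```python
from itertools import accumulate

# Frequencies from the table, in order of increasing height class.
frequencies = [5, 3, 15, 40, 17, 12, 7, 1]
n = sum(frequencies)  # total number of players

# Relative frequency: each frequency divided by n.
relative = [f / n for f in frequencies]

# Cumulative relative frequency: running total of the counts, divided by n.
cumulative = [c / n for c in accumulate(frequencies)]

for f, rf, crf in zip(frequencies, relative, cumulative):
    print(f"{f:>3} {rf:.2f} {crf:.2f}")
```

The first cumulative entry equals the first relative frequency, and the last cumulative entry is 1.00, matching the notes above.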
In this sample, there are five players whose heights fall within the interval 59.95–61.95 inches, three players whose heights fall within the interval 61.95–63.95 inches, 15 players whose heights fall within the interval 63.95–65.95 inches, 40 players whose heights fall within the interval 65.95–67.95 inches, 17 players whose heights fall within the interval 67.95–69.95 inches, 12 players whose heights fall within the interval 69.95–71.95 inches, seven players whose heights fall within the interval 71.95–73.95 inches, and one player whose height falls within the interval 73.95–75.95 inches. All heights fall between the endpoints of an interval and not at the endpoints.
From the figure above, find the percentage of heights that are less than 65.95 inches.
Solution
If you look at the first, second, and third rows, the heights are all less than 65.95 inches. There are 5 + 3 + 15 = 23 players whose heights are less than 65.95 inches. The percentage of heights less than 65.95 inches is then 23/100 or 23%. This percentage is the cumulative relative frequency entry in the third row.
Find the percentage of heights that fall between 61.95 and 65.95 inches.
Solution
Add the relative frequencies in the second and third rows: 0.03 + 0.15 = 0.18 or 18%.
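Both percentage questions follow the same pattern: add the frequencies of the classes that lie inside the range of interest, then divide by the total. A minimal Python sketch, with a hypothetical helper `percent_between` and the class boundaries taken from the table:

```python
# Class boundaries and frequencies from the soccer player table.
boundaries = [59.95, 61.95, 63.95, 65.95, 67.95, 69.95, 71.95, 73.95, 75.95]
frequencies = [5, 3, 15, 40, 17, 12, 7, 1]

def percent_between(low, high):
    """Percent of heights in classes lying entirely within [low, high]."""
    count = sum(
        f
        for lo, hi, f in zip(boundaries, boundaries[1:], frequencies)
        if lo >= low and hi <= high
    )
    return 100 * count / sum(frequencies)

print(percent_between(59.95, 65.95))  # 23.0
print(percent_between(61.95, 65.95))  # 18.0
```

This only works when `low` and `high` coincide with class boundaries, because a frequency table does not record where values fall inside a class.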
Use the heights of the 100 male semiprofessional soccer players. Fill in the blanks, and check your answers.
What kind of data are the heights?
Solution
quantitative continuous
Describe how you could gather this data (the heights) to make it characteristic of all male semiprofessional soccer players.
Solution
Get rosters from each team and choose a simple random sample from each.
Remember, you count frequencies. To find the relative frequency, divide the frequency by the total number of data values. To find the cumulative relative frequency, add all of the previous relative frequencies to the relative frequency for the current row.
Your Turn!
Twenty students were asked how many hours they worked per day. Their responses, in hours, are as follows: 5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, and 3.
Construct an appropriate table including frequencies, relative frequencies, and cumulative relative frequencies.
Figure References
Figure 2.1: U.S. Marine Corps photo by Staff Sgt. William Greeson (2009). US Navy 090821-M-0440G-043 Voting ballots organized and arranged for counting by Afghan presidential election workers at a local school in the Nawa District. Public domain. https://commons.wikimedia.org/wiki/File:US_Navy_090821-M-0440G-043_Voting_ballots_organized_and_arranged_for_counting_by_Afghan_presidential_election_workers_at_a_local_school_in_the_Nawa_District.jpg.
Glossary
- Descriptive statistics: Methods of organizing, summarizing, and presenting data
- Graphical descriptive methods: Organizing, summarizing, or presenting data visually in graphs, figures, or charts
- Numerical descriptive methods: Numbers that summarize some aspect of a dataset, often calculated
- Distribution: The possible values a variable can take on and how often it does so
- Frequency: The number of times a value occurs in the data
- Lower class limit: The lower end of a bin or class in a frequency table or histogram
- Upper class limit: The upper end of a bin or class in a frequency table or histogram
- Class width: The difference in consecutive lower class limits
- Class midpoint: Found by adding the lower limit and upper limit, then dividing by two
- Relative frequency: The percentage, proportion, or ratio of the frequency of a value of the data to the total number of outcomes
- Cumulative relative frequency: The sum of the relative frequencies for all values that are less than or equal to the given value