By the end of this chapter, the student should be able to:
- Display and interpret categorical data
- Display and interpret quantitative data
- Recognize, describe, and calculate the measures of the center of quantitative data
- Recognize, describe, and calculate the measures of the spread of quantitative data
- Recognize, describe, and calculate the measures of location of quantitative data
- Identify outliers in quantitative data
Once you have collected data, what will you do with it? Data can be described and presented in many different formats. For example, suppose you are interested in buying a house in a particular area. You may have no clue about the house prices, so you might ask your real estate agent to give you a sample data set of prices. Looking at all the prices in the sample often is overwhelming. A better way might be to look at numerical descriptions such as the average or median house price. Your agent might also provide you with a graph of the data.
In this chapter, you will study numerical and graphical ways to describe and display your data. This area of statistics is called descriptive statistics. We will look at both graphical and numerical descriptive methods. You will learn how to construct and calculate, and even more importantly, how to interpret, these measurements and graphs.
Numerical descriptors consist of summary statistics, typically calculated from a sample, that represent important aspects of a distribution such as its central tendency and variability, or the relative standing of a single observation with regard to the rest of the distribution.
Graphical descriptive methods consist of charts, tables, and graphs. These are tools that help you learn about the distribution, or shape, of a sample or a population. A graph can be a more effective way of presenting data than a mass of numbers because we can see where data cluster and where there are only a few data values. Newspapers and the Internet use graphs to show trends and to enable readers to compare facts and figures quickly. Statisticians often graph data first to get a picture of the data. Then, more formal tools may be applied.
The type of graph you choose first depends on the type of data you are working with. Graphs used to display categorical data include pie charts and bar charts. Graphs used to summarize and organize quantitative data include the dot plot, the histogram, the stem-and-leaf plot, the frequency polygon, the box plot, and, in special cases, the time series plot. The emphasis will be on histograms and box plots.
We will start by looking at a graphical method that can display any type of data, the frequency table.
Frequency tables are a great starting place for summarizing and organizing your data. Once you have a set of data, you may first want to organize it to see the frequency, or how often each value occurs in the set.
Frequency tables can be used to show either quantitative or categorical data. Displaying categorical data in a frequency table is fairly straightforward since you already have clearly defined categories. For example, if you polled 20 kindergarteners on their favorite colors, you could construct the following simple frequency table:
|Clear with Sparkles||1|
|Total = 20|
Some quantitative data, especially discrete data, may contain only a limited number of values, so little thought is needed in creating the frequency table. Some data may have a natural grouping. For example, if you had ages of adults from 20 to 69, it might make intuitive sense to group them as follows:
- 20 – 29
- 30 – 39
- 40 – 49
- 50 – 59
- 60 – 69
Consider the 30 – 39 class. 30 is known as the lower class limit, while 39 is the upper class limit. The class width is defined as the difference between consecutive lower class limits. For the class 30 – 39, the class width = 40 – 30 = 10. The class midpoint is found by adding the lower limit and upper limit, then dividing by 2. For the class 30 – 39, the class midpoint = (30 + 39)/2 = 34.5.
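The two calculations above can be sketched in a few lines of Python. This is only an illustration using the age classes listed earlier; the function names `class_width` and `class_midpoint` are made up for this example.

```python
# Class limits for the age groups listed above: (lower limit, upper limit).
classes = [(20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]

def class_width(classes, i):
    """Difference between the lower limit of class i + 1 and that of class i."""
    return classes[i + 1][0] - classes[i][0]

def class_midpoint(lower, upper):
    """Average of the lower and upper class limits."""
    return (lower + upper) / 2

width = class_width(classes, 1)          # 40 - 30
midpoint = class_midpoint(*classes[1])   # (30 + 39) / 2
print(width, midpoint)                   # prints: 10 34.5
```

Note that the width is 10, not 39 – 30 = 9, because it is measured from one lower limit to the next.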
Depending on the format and precision of the data reported, we may have to decide how best to group our data into intervals, sometimes called bins or classes. There is not always an intuitive grouping, and the grouping may not work out cleanly. A convenient starting point is a lower value carried out to one more decimal place than the value with the most decimal places. For example, if the value with the most decimal places is 6.1 and this is the smallest value, a convenient starting point is 6.05 (6.1 – 0.05 = 6.05). We say that 6.05 has more precision. If the value with the most decimal places is 2.23 and the lowest value is 1.5, a convenient starting point is 1.495 (1.5 – 0.005 = 1.495). If the value with the most decimal places is 3.234 and the lowest value is 1.0, a convenient starting point is 0.9995 (1.0 – 0.0005 = 0.9995). If all the data happen to be integers and the smallest value is two, then a convenient starting point is 1.5 (2 – 0.5 = 1.5). Also, when the starting point and other boundaries are carried to one additional decimal place, no data value will fall on a boundary. The next two examples go into detail about how to construct a histogram using continuous data and how to create a histogram using discrete data.
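The starting-point rule can be written as "smallest value minus half a unit of the finest precision in the data." The sketch below reproduces the four examples above. The function name `starting_point` is hypothetical, and `Decimal` is used only so that values such as 6.05 are represented exactly rather than as binary floating-point approximations.

```python
from decimal import Decimal

def starting_point(smallest, most_decimals):
    """Subtract half a unit of the finest precision from the smallest value.

    smallest      -- the smallest data value, given as a string such as "6.1"
    most_decimals -- the largest number of decimal places found in the data
    """
    half_unit = Decimal(5) / (Decimal(10) ** (most_decimals + 1))
    return Decimal(smallest) - half_unit

print(starting_point("6.1", 1))   # 6.1 - 0.05
print(starting_point("1.5", 2))   # 1.5 - 0.005
print(starting_point("1.0", 3))   # 1.0 - 0.0005
print(starting_point("2", 0))     # 2 - 0.5
```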
The next question is how many bins to use. Generally, anywhere from 5 to 20 bins works: too few bins do not display the distribution well, while too many can create strange effects. A good place to start is the square root of your number of observations (n). Some other basic guidelines: bins should not overlap, should not have gaps between them, should have the same width, and should cover the entire range of the data. The class limits and width should be “reasonable” numbers such as whole numbers or multiples of 5 or 10. In the end, the choice depends on the format of your data, but following these general guidelines should help make your table useful.
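As a rough sketch, the square-root guideline can be combined with the 5 to 20 range. Rounding to the nearest whole number and forcing the result into that range are assumptions made for this illustration, not fixed rules, and the name `suggested_bins` is hypothetical.

```python
import math

def suggested_bins(n, minimum=5, maximum=20):
    """Square root of n, rounded, then kept within the 5 to 20 range (an assumption)."""
    start = round(math.sqrt(n))
    return max(minimum, min(maximum, start))

for n in (20, 100, 1000):
    print(n, suggested_bins(n))
```

Treat the result as a starting point only; the final number of bins should still give "reasonable" class limits for the data at hand.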
A relative frequency is the ratio (fraction or proportion) of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes. To find the relative frequencies, divide each frequency by the total number of students in the sample (in this case, 20). Relative frequencies can be written as fractions, percents, or decimals. The relative frequency is given by $RF = \frac{f}{n}$, where:
- f = frequency,
- n = total number of data values (or the sum of the individual frequencies), and
- RF = relative frequency.
For example, if three students in Mr. Ahab’s English class of 40 students received from 90% to 100%, then f = 3, n = 40, and $RF = \frac{f}{n} = \frac{3}{40} = 0.075$. 7.5% of the students received 90–100%. 90–100% are quantitative measures.
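The same calculation can be checked in Python. This is a minimal sketch; the function name `relative_frequency` is made up for the example, and `Fraction` is used to show the fraction, decimal, and percent forms side by side.

```python
from fractions import Fraction

def relative_frequency(f, n):
    """Relative frequency: the frequency f divided by the number of data values n."""
    return Fraction(f, n)

rf = relative_frequency(3, 40)     # three of the 40 students
print(rf)                          # as a fraction
print(float(rf))                   # as a decimal
print(f"{float(rf):.1%}")          # as a percent
```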
Cumulative relative frequency is the accumulation of the previous relative frequencies. To find the cumulative relative frequencies, add all the previous relative frequencies to the relative frequency for the current row, as shown in the table below.
- The sum of all frequencies will add up to n, or your sample size.
- All relative frequencies should add up to one (allowing for rounding).
- The first entry of the cumulative relative frequency column will be the same as the first entry of the relative frequency column since there is nothing to accumulate.
- The last entry of the cumulative relative frequency column is one, indicating that one hundred percent of the data has been accumulated.
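These properties are easy to verify with a short script. The frequencies `[2, 3, 5]` below are hypothetical values chosen only for illustration, and `frequency_columns` is a made-up helper name. Exact fractions are used so that the sums are not affected by rounding.

```python
from fractions import Fraction
from itertools import accumulate

def frequency_columns(frequencies):
    """Return n, the relative frequencies, and the cumulative relative frequencies."""
    n = sum(frequencies)
    relative = [Fraction(f, n) for f in frequencies]
    cumulative = list(accumulate(relative))   # running total of the relative column
    return n, relative, cumulative

# Hypothetical frequencies, not taken from the text.
n, relative, cumulative = frequency_columns([2, 3, 5])

print(n)                              # sum of the frequencies
print(sum(relative))                  # relative frequencies add up to one
print(cumulative[0] == relative[0])   # first entries match
print(cumulative[-1])                 # last cumulative entry is one
```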
The following table represents one way of grouping the heights, in inches, of a sample of 100 male semiprofessional soccer players.
|Heights (inches)|Frequency|Relative frequency|Cumulative relative frequency|
|---|---|---|---|
|59.95–61.95|5|5/100 = 0.05|0.05|
|61.95–63.95|3|3/100 = 0.03|0.05 + 0.03 = 0.08|
|63.95–65.95|15|15/100 = 0.15|0.08 + 0.15 = 0.23|
|65.95–67.95|40|40/100 = 0.40|0.23 + 0.40 = 0.63|
|67.95–69.95|17|17/100 = 0.17|0.63 + 0.17 = 0.80|
|69.95–71.95|12|12/100 = 0.12|0.80 + 0.12 = 0.92|
|71.95–73.95|7|7/100 = 0.07|0.92 + 0.07 = 0.99|
|73.95–75.95|1|1/100 = 0.01|0.99 + 0.01 = 1.00|
| |Total = 100|Total = 1.00| |
In this sample, there are five players whose heights fall within the interval 59.95–61.95 inches, three players whose heights fall within the interval 61.95–63.95 inches, 15 players whose heights fall within the interval 63.95–65.95 inches, 40 players whose heights fall within the interval 65.95–67.95 inches, 17 players whose heights fall within the interval 67.95–69.95 inches, 12 players whose heights fall within the interval 69.95–71.95 inches, seven players whose heights fall within the interval 71.95–73.95 inches, and one player whose height falls within the interval 73.95–75.95 inches. All heights fall between the endpoints of an interval and not at the endpoints.
a. From the table above, find the percentage of heights that are less than 65.95 inches.
b. Find the percentage of heights that fall between 61.95 and 65.95 inches.
e. Describe how you could gather this data (the heights) so that the data are characteristic of all male semiprofessional soccer players.
Remember, you count frequencies. To find the relative frequency, divide the frequency by the total number of data values. To find the cumulative relative frequency, add all of the previous relative frequencies to the relative frequency for the current row.
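As a check on parts a and b, the counts for the soccer players' heights can be added up directly. This is a sketch of one way to do it; the helper name `percent_between` is made up, and the interval boundaries and frequencies are the ones listed for the sample of 100 players.

```python
# (lower boundary, upper boundary, frequency) for each height interval, in inches.
heights = [
    (59.95, 61.95, 5),
    (61.95, 63.95, 3),
    (63.95, 65.95, 15),
    (65.95, 67.95, 40),
    (67.95, 69.95, 17),
    (69.95, 71.95, 12),
    (71.95, 73.95, 7),
    (73.95, 75.95, 1),
]

n = sum(f for _, _, f in heights)   # total number of players

def percent_between(low, high):
    """Percentage of heights in the intervals lying entirely between low and high."""
    count = sum(f for lo, hi, f in heights if lo >= low and hi <= high)
    return 100 * count / n

below = percent_between(59.95, 65.95)     # part a: 5 + 3 + 15 players
between = percent_between(61.95, 65.95)   # part b: 3 + 15 players
print(below, between)                     # prints: 23.0 18.0
```

The answer to part a matches the cumulative relative frequency of 0.23 for the 63.95–65.95 row.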
Twenty students were asked how many hours they worked per day. Their responses, in hours, are as follows: 5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3.
Construct an appropriate table including frequencies, relative frequencies, and cumulative relative frequencies.
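One way to build such a table, sketched here in Python, is to count each value with `collections.Counter` and then apply the relative and cumulative relative frequency steps described earlier. The column labels and layout are choices made for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate

# Hours worked per day, as reported by the twenty students.
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

counts = Counter(hours)          # frequency of each distinct value
values = sorted(counts)          # table rows in increasing order
n = len(hours)                   # total number of data values

relative = [Fraction(counts[v], n) for v in values]
cumulative = list(accumulate(relative))

print(f"{'Hours':>5} {'Freq':>5} {'Rel. freq':>10} {'Cum. rel. freq':>15}")
for v, rf, crf in zip(values, relative, cumulative):
    print(f"{v:>5} {counts[v]:>5} {float(rf):>10.2f} {float(crf):>15.2f}")
```

The frequencies should add up to 20, and the last cumulative relative frequency should be 1.00.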
Figure 2.1: U.S. Marine Corps photo by Staff Sgt. William Greeson (2009). “US Navy 090821-M-0440G-043 Voting ballots organized and arranged for counting by Afghan presidential election workers at a local school in the Nawa District.” Public domain. Retrieved from: https://commons.wikimedia.org/wiki/File:US_Navy_090821-M-0440G-043_Voting_ballots_organized_and_arranged_for_counting_by_Afghan_presidential_election_workers_at_a_local_school_in_the_Nawa_District.jpg
Descriptive statistics: Methods of organizing, summarizing, and presenting data
Graphical descriptive methods: Organizing, summarizing, or presenting data visually in graphs, figures, or charts
Numerical descriptive methods: Numbers that summarize some aspect of a dataset, often calculated
Distribution: The possible values a variable can take on, and how often it does so
Frequency: The number of times a value of the data occurs
Lower class limit: The lower end of a bin or class in a frequency table or histogram
Upper class limit: The upper end of a bin or class in a frequency table or histogram
Class width: The difference in consecutive lower class limits
Class midpoint: Found by adding the lower limit and upper limit, then dividing by 2
Relative frequency: The percentage, proportion, or ratio of the frequency of a value of the data to the total number of outcomes
Cumulative relative frequency: The sum of the relative frequencies for all values that are less than or equal to the given value