3.4 Modeling Linear Relationships

If you knew the length of someone’s pinky (smallest finger), do you think you could predict that person’s height? Imagine collecting data on this and constructing a scatter plot of the points. Then draw a line that appears to “fit” the data. For your line, pick two convenient points and use them to find the slope of the line. Find the y-intercept of the line by extending your line so it crosses the y-axis. Using the slope and the y-intercept, write your equation of “best fit.” According to your equation, what is the predicted height for a pinky length of 2.5 inches? You have just started the process of linear regression.

Linear Regression

Data rarely fit a straight line perfectly, but we can be satisfied with rough predictions. Typically, a dataset whose scatter plot appears to “fit” a straight line can be summarized by that line, called a line of best fit or least-squares line. The process of fitting the best-fit line is called linear regression.

The equation of the regression line is ŷ = a + bx.

The ŷ is read “y hat” and is the estimated value of y obtained using the regression line. It may or may not be equal to values of y observed from the data.

The sample means of the x values and the y values are x̄ and ȳ, respectively. The best-fit line always passes through the point (x̄, ȳ).

The slope, b, can be written as b = r(s_y / s_x), where s_y is the standard deviation of the y values and s_x is the standard deviation of the x values. Note that the slope is calculated directly from r, the correlation coefficient discussed in previous sections.

The y-intercept, a, can then be calculated from the slope and the means of x and y. Because the line passes through (x̄, ȳ), a = ȳ – b·x̄.
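As a rough illustration, the slope formula above and the intercept a = ȳ – b·x̄ (which follows from the line passing through the point of means) can be checked numerically. The sketch below uses the exam-score data from this section's example. The function name `fit_line` is an illustrative choice, not something from the text, and only Python's standard library is assumed.

```python
import math

# Third exam scores (x) and final exam scores (y) from this section's example.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]


def fit_line(xs, ys):
    """Return (a, b, r) for the line y-hat = a + b*x, using b = r * (s_y / s_x)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in xs) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in ys) / (n - 1))
    # Correlation coefficient r.
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    r = cross / ((n - 1) * s_x * s_y)
    b = r * (s_y / s_x)      # slope
    a = y_bar - b * x_bar    # y-intercept: the line passes through (x-bar, y-bar)
    return a, b, r


a, b, r = fit_line(x, y)
print(f"r = {r:.4f}, b = {b:.2f}, a = {a:.2f}")
# prints: r = 0.6631, b = 4.83, a = -173.51
```

The rounded values agree with the correlation coefficient and the line of best fit reported for this example.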

Example

Recall our previous example:

A random sample of 11 statistics students produced the following data, where x is the third exam score out of 80, and y is the final exam score out of 200. Can you predict the final exam score of a random student if you know the third exam score?

Third exam score (x) Final exam score (y)
65 175
67 133
71 185
71 163
66 126
75 198
67 153
70 163
71 159
69 151
69 159

Figure 3.16: Third and final exam scores data

 

Solution

We have found:

  • An apparent linear relationship in the scatterplot
  • The correlation coefficient is r = 0.6631
  • The coefficient of determination is r² = 0.6631² = 0.4397

The third exam score, x, is the independent variable and the final exam score, y, is the dependent variable. We will plot a regression line that best fits the data. If you were to fit a line by eye, different people might draw different lines. We most often use what is called a least-squares regression line to obtain the best-fit line. The idea behind finding the best-fit line is based on the assumption that the data are scattered about a straight line. The criterion for the best-fit line is that the sum of the squared vertical distances from the points to the line is as small as possible. This best-fit line is called the least-squares regression line.
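To make the least-squares criterion concrete, the sketch below adds up the squared vertical distances between each data point and a candidate line, then compares the least-squares line with a line drawn "by eye." The by-eye coefficients and the name `sum_squared_errors` are invented for illustration. The slope is computed with a standard formula that is algebraically equivalent to b = r(s_y / s_x).

```python
# Third exam scores (x) and final exam scores (y) from this example.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]


def sum_squared_errors(a, b, xs, ys):
    """Sum of squared vertical distances between each point and the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys))


# Least-squares slope and intercept.
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b_ls = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        / sum((xi - x_bar) ** 2 for xi in x))
a_ls = y_bar - b_ls * x_bar

# A hypothetical line drawn by eye (coefficients invented for illustration).
a_eye, b_eye = -120.0, 4.0

sse_ls = sum_squared_errors(a_ls, b_ls, x, y)
sse_eye = sum_squared_errors(a_eye, b_eye, x, y)
print(f"least-squares line: sum of squared distances = {sse_ls:.1f}")
print(f"by-eye line:        sum of squared distances = {sse_eye:.1f}")
```

Any line other than the least-squares line, including small nudges to its slope or intercept, gives a larger sum for the same data.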

Consider the following diagram. Each point of data is of the form (x, y), and each point of the line of best fit using least-squares linear regression has the form (x, ŷ).

 

Figure 3.17: Line of best fit. Figure description available at the end of the section.

 

Solution

The line of best fit is: ŷ = –173.51 + 4.83x

Your Turn!

SCUBA divers have maximum dive times they cannot exceed when going to different depths. The data in the figure below show the maximum dive time, in minutes, at each of several depths. Use your calculator to find the least-squares regression line and predict the maximum dive time for 110 feet.

Depth in feet (x) Maximum dive time in minutes (y)
50 80
60 55
70 45
80 35
90 25
100 22

Figure 3.18: SCUBA diver stats

Understanding Slope

The slope of the line, b, describes how changes in the variables are related. It is important to interpret the slope of the line in the context of the situation represented by the data. You should be able to write a sentence interpreting the slope in plain English.

Interpretation: The slope of the best-fit line tells us how much the dependent variable (y) changes, on average, for every one-unit increase in the independent variable (x).

Example

[Previous Example Continued]

The slope of the line is b = 4.83.

Interpretation: For a one-point increase in the score on the third exam, the final exam score increases by 4.83 points on average.
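A quick way to see this interpretation is to compare predictions for third exam scores that differ by one point. The sketch below assumes the rounded coefficients reported for this example; the function name `predict` is illustrative.

```python
# Rounded intercept and slope of the best-fit line reported in the text.
a, b = -173.51, 4.83


def predict(third_exam_score):
    """Predicted final exam score from the line y-hat = a + b*x."""
    return a + b * third_exam_score


# Each one-point increase in x changes the prediction by the slope, b.
for score in (66, 70, 74):
    change = predict(score + 1) - predict(score)
    print(f"{score} -> {score + 1}: predicted final score changes by {change:.2f}")
```

Wherever the comparison starts, the change in the predicted score equals the slope, 4.83.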

Understanding the y-Intercept

The y-intercept of the line, a, tells us the value of y we would predict when x is 0. This may make sense in some cases, but in many others it does not make sense for x to equal 0, so the y-intercept may not have a useful interpretation.

Example

[Previous Example Continued]

The y-intercept of the line is –173.51.

Interpretation: In this context, it does not really make sense for x to be 0 (unless a student did not take the exam or try at all). Therefore, the y-intercept has no meaningful interpretation here; a final exam score of –173.51 is not even possible.

Prediction

The next and most useful step in regression is to use the regression equation to predict values of y.

Recall our example in which we examined the scatter plot and found the correlation coefficient and coefficient of determination. We found the equation of the best-fit line for the final exam grade as a function of the grade on the third exam. We can now use the least-squares regression line for prediction.

Example

[Previous Example Continued]

Suppose you want to estimate, or predict, the mean final exam score of statistics students who received a score of 73 on the third exam. The exam scores (x values) range from 65 to 75. Since 73 is between the x values 65 and 75, substitute x = 73 into the equation. Then:

ŷ = –173.51 + 4.83(73) = 179.08

Solution

We can predict that statistics students who earn a grade of 73 on the third exam will earn a grade of 179.08 on the final exam, on average.

 

What would you predict the final exam score to be for a student who scored a 66 on the third exam?

Solution

ŷ = –173.51 + 4.83(66) = 145.27

 

What would you predict the final exam score to be for a student who scored a 90 on the third exam?

Solution

The x values in the data are between 65 and 75. Ninety is outside of the domain of the observed x values in the data (independent variable), so you cannot reliably predict the final exam score for this student. Even though it is possible to enter 90 into the equation for x and calculate a corresponding y value, the y value that you get will not be reliable.
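The three predictions above can be sketched as a small function that also checks whether the input falls inside the observed range of x values. This is a minimal illustration, assuming the rounded coefficients from the text; the names `predict_final`, `X_MIN`, and `X_MAX` are hypothetical.

```python
A, B = -173.51, 4.83      # rounded intercept and slope reported in the text
X_MIN, X_MAX = 65, 75     # range of third exam scores observed in the data


def predict_final(third_exam_score):
    """Return (prediction, reliable) for a given third exam score.

    `reliable` is False when the score lies outside the observed x range,
    so the prediction should not be trusted.
    """
    prediction = round(A + B * third_exam_score, 2)
    reliable = X_MIN <= third_exam_score <= X_MAX
    return prediction, reliable


for score in (73, 66, 90):
    value, reliable = predict_final(score)
    note = "" if reliable else " (outside 65 to 75: not reliable)"
    print(f"x = {score}: y-hat = {value}{note}")
```

For x = 90 the equation returns 261.19, which exceeds the 200 points available on the final exam. This shows why a value computed outside the observed range should not be trusted.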


Your Turn!

Data are collected on the relationship between the number of hours per week spent practicing a musical instrument and scores on a math test. The line of best fit is as follows:

ŷ = 72.5 + 2.8x

What would you predict the score on a math test would be for a student who practices a musical instrument for five hours a week?

Additional Resources

Click here for additional multimedia resources, including podcasts, videos, lecture notes, and worked examples.

If you are using an offline version of this text, access the resources for this section via the QR code, or by visiting https://doi.org/10.7294/26207456.

Figure References

Figure 3.17: Kindred Grey (2020). Line of best fit. CC BY-SA 4.0. Adaptation of Figure 12.11 from OpenStax Introductory Statistics (2013) (CC BY 4.0). Retrieved from https://openstax.org/books/introductory-statistics/pages/12-3-the-regression-equation

Figure Descriptions

Figure 3.17: Scatter plot of exam scores with a line of best fit. Weak positive correlation.

License

Significant Statistics: An Introduction to Statistics Copyright © 2024 by John Morgan Russell is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, except where otherwise noted.
