
Least Squares Line

Recall the equation of a line from algebra:

$$Y = \beta_0 + \beta_1 X$$

(You may have seen $Y = mX + b$; we are going to change notation slightly.) Above, $\beta_1$ is called the slope of the line and $\beta_0$ is the Y-intercept. The slope measures the amount Y changes when X increases by one unit. The Y-intercept is the value of Y when X = 0.

Our objective is to fit a straight line to points on a scatterplot that do not lie exactly along a straight line (see the figure above). So we want to find $\beta_0$ and $\beta_1$ such that the line $Y = \beta_0 + \beta_1 X$ fits the data as well as possible. First, we need to define what we mean by a "best" fit. We want a line that is in some sense closest to all of the data points simultaneously. In statistics, we define a residual, $e_i$, as the signed vertical distance between a point $(X_i, Y_i)$ and the line,

$$e_i = Y_i - (\beta_0 + \beta_1 X_i)$$

(see the vertical line in the figure). Since residuals can be positive or negative, we square them to remove the sign. By adding up all of the squared residuals, we get a measure of how far away from the data our line is. Thus, the "best" line is the one with the minimum sum of squared residuals, i.e., the one that minimizes $\sum_{i=1}^{n} e_i^2$. This method of finding a line is called least squares.
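To make the criterion concrete, here is a minimal Python sketch that computes the sum of squared residuals for two candidate lines. The data values and the function name `sum_squared_residuals` are assumptions invented for illustration; they do not come from the original notes.

```python
def sum_squared_residuals(xs, ys, intercept, slope):
    """Return the sum of squared vertical distances from the points to the line."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - (intercept + slope * x)  # signed vertical distance
        total += residual ** 2                  # squaring removes the sign
    return total


# Hypothetical data, made up for this example only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 6.0]

ssr_a = sum_squared_residuals(xs, ys, 0.0, 1.0)  # candidate line Y = X
ssr_b = sum_squared_residuals(xs, ys, 0.5, 1.4)  # candidate line Y = 0.5 + 1.4 X
print(ssr_a, ssr_b)
```

The second candidate gives the smaller sum, so by the least squares criterion it fits these points better than the first.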

The formulas for the slope and intercept of the least squares line are

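$$\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \beta_0 = \bar{Y} - \beta_1 \bar{X}$$

where $\bar{X}$ and $\bar{Y}$ are the sample means of X and Y.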
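As a rough sketch, the computation can be written in a few lines of Python. It assumes the formulas take their standard form: the slope is the sum of cross-deviations divided by the sum of squared X deviations, and the intercept is the mean of Y minus the slope times the mean of X. The function name `least_squares_line` and the data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Sum of cross-deviations and sum of squared deviations of X.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, made up for this example only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 6.0]
intercept, slope = least_squares_line(xs, ys)
print(intercept, slope)
```

For these four points the fitted line is approximately Y = 0.5 + 1.4 X.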

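Using algebra, we can express the slope $\beta_1$ as

$$\beta_1 = r \, \frac{s_Y}{s_X}$$

where $r$ is the correlation coefficient between X and Y, and $s_X$ and $s_Y$ are the sample standard deviations of X and Y.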
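One familiar algebraic form of the slope is the correlation coefficient times the ratio of the standard deviations, $r \, s_Y / s_X$. Assuming that is the identity intended here, the sketch below checks numerically that it agrees with the direct formula. The data and variable names are made up for illustration.

```python
import math
import statistics

# Hypothetical data, made up for this example only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 6.0]

x_bar = statistics.mean(xs)
y_bar = statistics.mean(ys)

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Slope from the direct least squares formula.
slope_direct = sxy / sxx

# Slope from the correlation coefficient and the sample standard deviations.
r = sxy / math.sqrt(sxx * syy)
slope_via_r = r * statistics.stdev(ys) / statistics.stdev(xs)

print(slope_direct, slope_via_r)
```

The two values agree up to floating-point rounding. One consequence is that the slope always has the same sign as the correlation.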



Jan Lethen
Wed Nov 13 16:20:46 CST 1996