Recall the equation of a line from algebra:
\[ Y = \beta_0 + \beta_1 X \]
(You may have seen Y=mX+b; we are going to change notation
slightly.) Above, $\beta_1$
is called the slope of the line and $\beta_0$
is the Y-intercept. The slope measures how much Y
changes when X increases by one unit. The Y-intercept is the
value of Y when X=0.
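
To make the two definitions concrete, here is a minimal sketch in Python. The function name `line` and the intercept and slope values are made up for illustration; they do not come from any data set in these notes.

```python
def line(x, intercept, slope):
    """Return the height of the line intercept + slope * x at the point x."""
    return intercept + slope * x

# Hypothetical values chosen only for illustration.
intercept, slope = 2.0, 0.5

# The Y-intercept is the value of the line at X = 0.
value_at_zero = line(0, intercept, slope)

# Increasing X by one unit changes Y by exactly the slope,
# no matter which X we start from.
change_per_unit = line(4, intercept, slope) - line(3, intercept, slope)

print(value_at_zero, change_per_unit)
```

The same one-unit change would be obtained starting from any other value of X, which is what makes the slope a property of the whole line rather than of a single point.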
Our objective is to fit a straight line to points on a scatterplot
that do not lie along a straight line (see the figure above). So we want to
find an intercept $\beta_0$
and a slope $\beta_1$
such that the line $Y = \beta_0 + \beta_1 X$
fits the data as well as
possible. First, we need to define what we mean by a ``best'' fit. We
want a line that is in some sense closest to all of the data points
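simultaneously. In statistics, we define a residual,
$e_i$, as
the signed vertical distance between a point $(X_i, Y_i)$ and the line with intercept $\beta_0$ and slope $\beta_1$,
\[ e_i = Y_i - (\beta_0 + \beta_1 X_i) \]
(see the vertical line in the figure).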
Since residuals can be positive or negative, we will square them to
remove the sign. By adding up all of the squared residuals, we get a
measure of how far away from the data our line is. Thus, the
``best'' line will be one which has the minimum sum of squared
residuals, i.e., $\min \sum_{i=1}^{n} e_i^2$, where $e_i$ is the residual of the $i$th of $n$ data points. This method of finding a line is
called least squares.
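
The criterion can be tried out numerically. The sketch below is only an illustration: the five data points and the two candidate lines are made up, and the helper names `residuals` and `sum_squared_residuals` are our own. It computes the residuals for each candidate line and compares the sums of squared residuals.

```python
# Hypothetical data, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

def residuals(xs, ys, intercept, slope):
    """Signed vertical distances from each point to the line."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def sum_squared_residuals(xs, ys, intercept, slope):
    """Add up the squared residuals for one candidate line."""
    return sum(e ** 2 for e in residuals(xs, ys, intercept, slope))

# Two candidate lines, each given as (intercept, slope).
sse_a = sum_squared_residuals(xs, ys, 2.2, 0.6)
sse_b = sum_squared_residuals(xs, ys, 1.0, 1.0)

# The candidate with the smaller sum is the better fit under this criterion.
print(sse_a, sse_b)
```

For these points the first candidate has the smaller sum of squared residuals, so it is the better of the two lines in the least squares sense.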
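The formulas for the slope and intercept of the least squares line are
\[ \hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_i - \bar X)^2}, \qquad \hat\beta_0 = \bar Y - \hat\beta_1 \bar X, \]
where $\bar X$ and $\bar Y$ are the sample means of the $X_i$ and the $Y_i$.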
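Using algebra, we can express the slope $\hat\beta_1$
as
\[ \hat\beta_1 = r \, \frac{s_Y}{s_X}, \]
where $r$ is the sample correlation between X and Y, and $s_X$ and $s_Y$ are their sample standard deviations.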
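
The least squares slope and intercept can also be computed directly. The following is a minimal sketch, assuming the standard closed-form solution (slope equal to the sum of cross-deviations divided by the sum of squared X deviations, intercept equal to the mean of Y minus the slope times the mean of X). The function name `least_squares_line` and the data are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared X deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all X values are equal; the slope is undefined")
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sse(xs, ys, intercept, slope):
    """Sum of squared residuals for a given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

intercept, slope = least_squares_line(xs, ys)
best = sse(xs, ys, intercept, slope)

# Nudging either coefficient away from the fitted value should not
# reduce the sum of squared residuals.
nudged = [sse(xs, ys, intercept + di, slope + ds)
          for di in (-0.1, 0.0, 0.1) for ds in (-0.1, 0.0, 0.1)]

print(intercept, slope, best)
```

Comparing `best` with the values in `nudged` is a quick sanity check, not a proof: it only confirms that small changes to the fitted intercept or slope do not give a smaller sum of squared residuals on this particular data set.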