Regression

Overview

Consider the faithful dataset and the relation between waiting time and eruption time of Old Faithful:

These data may be more useful if they can contribute to predictions of eruption time given assumptions of measures of waiting time. However, the points themselves in this particular sample are not useful directly. Even for virtually the same waiting time, there is a range of eruption times (look at waiting time 50min). But these data can serve as training examples for a model that we can then use for predictions.

Some models are regression models. Generally, you can think of regression as finding a smooth line/curve to fit the data. Being smooth, you can then evaluate any X values and get a corresponding Y value. A different kind of model is a classifier, which outputs a classification among a finite set (like factors in R). Classifiers are not smooth, continuous functions.

Models are (almost) always approximations and not completely accurate. As independent scientists, not in direct communication with God or some oracle (presumably), we have no way of knowing (without infinitely-many experiments) the actual cause-effect relation of waiting time and eruption time for Old Faithful. Given the data we have, for all we know, Old Faithful always erupts for exactly the times shown in the sample, and exactly for the given waiting times. Maybe there is no model but the data themselves… Or maybe Old Faithful follows a linear pattern: longer waiting time = longer eruption. We’ll examine this theory first.

Linear regression

The simplest possible regression is linear. Multiple linear regression is for modeling data with multiple inputs, say $x_1$ through $x_n$, and one output, say $y$. Simple linear regression is for modeling data with one input, say $x$, and one output, say $y$. We’ll look at simple linear regression first.

Simple linear regression

A simple line can be defined as $y = \alpha + \beta x$ for some $\alpha$ and $\beta$. We prefer to find values of $\alpha$ and $\beta$ that defines the line that best fits the data. In the case of Old Faithful, we want a line that goes up and to the right, and cuts through the cluster of points.

Of course, we should not just guess the values of $\alpha$ and $\beta$. We should somehow compute the optimal values. To do so, we need an estimate of error, so that we can then minimize the error.

We’ll use calculus below to find the optimal values for $\alpha$ and $\beta$, and in order to do so, we’ll need an error function that is differentiable. So we fall back to the “sum of squared errors” again:

$$\text{sum of squared errors} = \sum_{i=1}^n (y_i - y')^2,$$

where $y’$ is our prediction using our equation $y’ = \alpha + \beta x_i$. Rewritten, we have:

$$\text{sum of squared errors} = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2$$

We want to find the optimal $\alpha$ and $\beta$ so that this sum is smallest, given our particular data $Y=y_1, y_2, \dots, y_n$ and $X=x_1, x_2, \dots, x_n$.

Find the optimal $\alpha$

Here is the calculus to find the optimal $\alpha$. In summary, take the partial derivative of the error function, with respect to $\alpha$, and solve for this partial derivative equal to 0. Since the error function is a sum of squares, which are all concave up, we’ll be finding the minimum.

$$\begin{eqnarray} \frac{\partial}{\partial \alpha} \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2 &=& \sum_{i=1}^n 2(y_i - \alpha - \beta x_i) \frac{\partial}{\partial \alpha}(y_i - \alpha - \beta x_i) \\ &=& \sum_{i=1}^n -2(y_i - \alpha - \beta x_i) \\ \end{eqnarray}$$

Setting the partial derivative equal to 0, we get:

$$\begin{eqnarray} \sum_{i=1}^n -2(y_i - \alpha - \beta x_i) &=& 0 \\ \sum_{i=1}^n (y_i - \alpha - \beta x_i) &=& 0 \\ \sum_{i=1}^n y_i - \sum_{i=1}^n \alpha - \sum_{i=1}^n \beta x_1 &=& 0 \\ \sum_{i=1}^n \alpha &=& \sum_{i=1}^n y_i - \sum_{i=1}^n \beta x_1 \\ \alpha n &=& \sum_{i=1}^n y_i - \beta \sum_{i=1}^n x_1 \\ \alpha &=& \frac{\sum_{i=1}^n y_i}{n} - \beta \frac{\sum_{i=1}^n x_1}{n} \\ \alpha &=& \overline{y} - \beta \overline{x}, \end{eqnarray}$$

where $\overline{x}$ and $\overline{y}$ are the means of $x_i$ and $y_i$.

Find the optimal $\beta$

Here is the calculus for finding the optimal $\beta$. It works the same as above, except we’ll eventually substitute in the optimal $\alpha$ as found above.

$$\begin{eqnarray} \frac{\partial}{\partial \beta} \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2 &=& \sum_{i=1}^n 2(y_i - \alpha - \beta x_i) \frac{\partial}{\partial \beta}(y_i - \alpha - \beta x_i) \\ &=& \sum_{i=1}^n -2(y_i - \alpha - \beta x_i) x_i \\ &=& -2 \left( \sum_{i=1}^n x_i y_i - \alpha \sum_{i=1}^n x_i - \beta \sum_{i=1}^n x_i^2 \right) \end{eqnarray}$$

Setting the partial derivative equal to 0, we get:

$$\begin{eqnarray} -2 \left( \sum_{i=1}^n x_i y_i - \alpha \sum_{i=1}^n x_i - \beta \sum_{i=1}^n x_i^2 \right) &=& 0 \\ \sum_{i=1}^n x_i y_i - \alpha \sum_{i=1}^n x_i - \beta \sum_{i=1}^n x_i^2 &=& 0 \\ \sum_{i=1}^n x_i y_i - \alpha \sum_{i=1}^n x_i &=& \beta \sum_{i=1}^n x_i^2 \\ \frac{\sum_{i=1}^n x_i y_i - \alpha \sum_{i=1}^n x_i}{\sum_{i=1}^n x_i^2} &=& \beta \\ \frac{n \overline{xy} - \alpha n \overline{x}}{n \overline{x^2}} &=& \beta \\ \frac{\overline{xy} - \alpha \overline{x}}{\overline{x^2}} &=& \beta \\ \end{eqnarray}$$

Substituting $\alpha$ from above,

$$\begin{eqnarray} \frac{\overline{xy} - (\overline{y} - \beta \overline{x}) \overline{x}}{\overline{x^2}} &=& \beta \\ \frac{\overline{xy} - \overline{y}~\overline{x} + \beta \overline{x}^2}{\overline{x^2}} &=& \beta \\ \frac{\overline{xy} - \overline{y}~\overline{x}}{\overline{x^2}} &=& \beta - \frac{\beta \overline{x}^2}{\overline{x^2}} \\ \frac{\overline{xy} - \overline{y}~\overline{x}}{\overline{x^2}} &=& \frac{\beta \overline{x^2} - \beta \overline{x}^2}{\overline{x^2}} \\ \frac{\overline{xy} - \overline{y}~\overline{x}}{\overline{x^2}} &=& \beta \frac{\overline{x^2} - \overline{x}^2}{\overline{x^2}} \\ \frac{\overline{xy} - \overline{y}~\overline{x}}{\overline{x^2} - \overline{x}^2} &=& \beta \\ \end{eqnarray}$$

The answer happens to equate to $\beta = r_{xy} \frac{s_y}{s_x}$, where $r_{xy}$ is the correlation of $x$ and $y$ and $s_x$, $s_y$ are the standard deviations.

Both $\alpha$ and $\beta$ can be easily computed. They simply combine means, correlations, and standard deviations, which themselves are easily computed. Of course, we can do this in R even more easily (see below).

R code for simple linear regression

First, let’s manually set $\alpha$ and $\beta$ and draw the line.

Seems to be optimal to my eyes. Anyway, of course it’s optimal, we derived the formulas for $\alpha$ and $\beta$ using calculus. They aren’t even estimates! They are truly optimal.

The R function lm does the same for us:

Multiple linear regression

We can also find a line of best fit through a higher-dimensional space. Assuming we just want one predicted ($y’$) value, but we have multiple input ($x_1, \dots, x_n$), we can specify them all in the formula to the lm function. Consider the airquality dataset, which has measures for ozone, window, and solar radiation. We suspect these three variables affect temperature. Let’s find the line of best fit:

The best line is: y = 72.42 + Ozone*0.172 + Solar.R*0.0073 - Wind*0.323.

Estimating goodness of fit

Even with an optimal line, the line may not closely fit the data. There may still be a large error. We measure the goodness of fit with R^2.

In order to define R^2, we first define $E_{\text{avg}}$:

$$E_{\text{avg}} = \sum_{i=1}^n (y_i - \overline{y})^2$$

This $E_{\text{avg}}$ is the sum-of-squared-errors of the original $y$ values compared to their mean. This is like estimating the error of a linear regression that is simply the mean of the $y$ values. That’s almost never an optimal line. It is a horizontal line intercepting the y-axis at the mean of $y$:

Of course, any optimal linear regression is going to have less sum-of-squared-errors that that average line (unless the average line is optimal).

We can calculate the optimal line’s error also:

$$E_{\text{reg}} = \sum_{i=1}^n (y_i - y')^2$$

The plot below, from Wikipedia, shows the error from the average line and the error from the optimal line:

Let’s see the difference in these errors for Old Faithful:

Finally, we can consider the ratio of these errors:

$$R^2 = 1 - \frac{E_{\text{reg}}}{E_{\text{avg}}}$$

In R,

The R^2 value is our “coefficient of determination” (w00t), and it ranges between -1 and 1. It can be negative if you’re not comparing an optimal line but rather some crazy thing, so that the average line gives less error than that crazy thing. Anyway, usually it’s between 0 and 1, and higher is better. A value of 1.0 means $E_{\text{reg}}$ is zero, i.e., it perfectly predicts every observed $y$ value.

Naturally, R can find R^2 for us if you run summary on the computed linear regression model:

Notice Multiple R-squared down at the bottom-left. It works in multiple regression as well:

Residuals

We can examine the error of each predicted value, given the model, and the true value. These differences are called “residuals,” and they live in a vector in the output of the lm function:

Each residual corresponds to an $x$ value, in our case, the waiting column of the data frame. So we can graph the model’s error for each $x$ value:

Compare the linear regression (left) with the residuals (right):