Last time we spoke about regression models with two parameters, β0 and β1, that we use to estimate the relationship between our response variable y and our covariate x:

yi = β0 + β1 xi + εi, for i = 1, …, n
Notice now how our solution for the slope, β1, is related to the covariance of x and y and the variance of x!
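To make this concrete, here is a minimal numpy sketch (the data here are made up for illustration) checking that the ratio of the sample covariance of x and y to the sample variance of x matches the least-squares slope:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example data: a noisy linear relationship y ≈ 2 + 3x.
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)

# Slope as Cov(x, y) / Var(x); the intercept then follows from the sample means.
beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()

# Compare against the least-squares fit (np.polyfit returns slope first for deg=1).
slope_ls, intercept_ls = np.polyfit(x, y, deg=1)
print(np.allclose([beta1, beta0], [slope_ls, intercept_ls]))  # True
```

The two routes agree because the least-squares slope is exactly the sample covariance divided by the sample variance (any consistent ddof cancels in the ratio).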
Another note here is that to calculate x̄ and ȳ, we often must use the sample means (e.g. across all time points), because we do not have multiple samples for each time point.
Since the sum of squared errors is Q = ∑ᵢ₌₁ⁿ (yi − β0 − β1 xi)² = S(β0, β1), and Q appears in the log-likelihood under the normal-error model, we can take the derivative of the log-likelihood with respect to each unknown parameter β0, β1, and σ:
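For reference, assuming normally distributed errors (the setting in which the MLE derivation applies), the log-likelihood and its three score equations can be written as:

```latex
\ell(\beta_0, \beta_1, \sigma)
  = -n\log\sigma - \tfrac{n}{2}\log(2\pi) - \frac{Q}{2\sigma^2},
\qquad Q = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
```

```latex
\frac{\partial \ell}{\partial \beta_0}
  = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i) = 0,
\qquad
\frac{\partial \ell}{\partial \beta_1}
  = \frac{1}{\sigma^2}\sum_{i=1}^{n} x_i\,(y_i - \beta_0 - \beta_1 x_i) = 0,
\qquad
\frac{\partial \ell}{\partial \sigma}
  = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2 = 0.
```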
These first two equations are the same as what we derived before. The third equation is new: solving it for σ (with β0 and β1 replaced by β̂0 and β̂1, respectively) gives the MLE:

σ̂² = (1/n) ∑ᵢ₌₁ⁿ (yi − β̂0 − β̂1 xi)²
But as it turns out, this is a biased estimator (unlike the estimators for β0 and β1, which are unbiased). We often instead use a corrected, unbiased estimator of σ² in which we divide by n − p rather than n, where p is the number of coefficients estimated (here p = 2: one for β0 and one for β1). That is:

s² = (1/(n − 2)) ∑ᵢ₌₁ⁿ (yi − β̂0 − β̂1 xi)²
Let’s look at a simulation to show how this happens!
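A sketch of such a simulation (the sample size, true parameters, and number of repetitions here are arbitrary choices): we repeatedly generate small datasets from a known linear model, fit the regression, and average the divide-by-n and divide-by-(n − 2) estimates of σ².

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma_true = 10, 2.0      # small n makes the bias easy to see
n_reps = 20_000

mle_est, corrected_est = [], []
for _ in range(n_reps):
    # Simulate one dataset from a known linear model with normal errors.
    x = rng.uniform(0, 10, size=n)
    y = 1.0 + 0.5 * x + rng.normal(scale=sigma_true, size=n)

    beta1, beta0 = np.polyfit(x, y, deg=1)
    resid = y - (beta0 + beta1 * x)
    q = np.sum(resid**2)

    mle_est.append(q / n)            # MLE: divide by n (biased low)
    corrected_est.append(q / (n - 2))  # divide by n - p with p = 2 (unbiased)

print(np.mean(mle_est))        # noticeably below sigma_true**2 = 4
print(np.mean(corrected_est))  # close to 4
```

Averaged over many repetitions, the divide-by-n estimator systematically underestimates σ² (by the factor (n − 2)/n), while the corrected estimator centers on the true value.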
Next time, we will look at how this extends to multiple linear regression. Later this bias will also become important when we think about regularization.
What assumptions have we made in the example regressions we have fit so far? None! In fact, we don’t need to assume the true relationship between y and x is linear to perform linear regression.
Sometimes it is useful to estimate coefficients β̂0 and β̂1 from a sample and use them to make predictions, even if this model isn’t fully correct or the “best” model.
Often we may use linear models as they are convenient to interpret, not necessarily because they are the best or most correct model.
If we do want to perform inference, however, we require that many more assumptions are satisfied.