
Lecture 5 Notes - Simple Linear Regression

Review

Week 3 survey: go to pollev.com/stat153

Notes on covariance

To add to your materials from last week, I’ve created some visualizations.

Auto-correlation of fMRI data

This shows an animation of calculating the ACF for the fmri1 dataset from astsa for lags $-20 \leq h \leq 20$.

Cross-correlation of fMRI data

This shows an animation of calculating the CCF for the fmri1 dataset from astsa for lags $-80 \leq h \leq 80$.

Autocovariance on random walk after differencing
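The random-walk animation can be reproduced in miniature. The sketch below uses synthetic data (the lecture visualizations use the fmri1 dataset from R's astsa package, which is not assumed to be available here): it builds a random walk, differences it, and computes the sample ACF of both series directly from the definition of sample autocovariance.

```python
# Sketch: autocovariance/autocorrelation of a random walk before and after
# differencing. Synthetic data; the helper sample_acf is illustrative.
import numpy as np

rng = np.random.default_rng(0)
steps = rng.normal(size=500)
walk = np.cumsum(steps)          # random walk: x_t = x_{t-1} + w_t
diffed = np.diff(walk)           # differencing recovers the white-noise steps

def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    denom = np.sum((x - xbar) ** 2)
    return np.array([np.sum((x[:n - h] - xbar) * (x[h:] - xbar)) / denom
                     for h in range(max_lag + 1)])

acf_walk = sample_acf(walk, 20)   # decays very slowly (non-stationary)
acf_diff = sample_acf(diffed, 20) # drops to near zero after lag 0 (white noise)
```

The slow decay of `acf_walk` versus the near-zero values of `acf_diff` is the point of the visualization: differencing turns the non-stationary walk back into (approximately) white noise.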

Today

Simple linear regression

Let’s say we want to learn the relationship between two variables $y$ and $x$ - the goal is to predict $y$ given $x$. For example, we might want to predict the height of an adult man ($y$) given the height of his father ($x$). $y$ is called the response variable or dependent variable, and $x$ is the covariate or independent variable. We will start with the more general scenario and then extend this concept specifically to time series.

In simple linear regression, we are predicting $y$ given one covariate $x$. In the case of multiple covariates, say $\{x_1, \dots, x_p\}$, this is called multiple regression.

For the simple case, we are predicting $y$ from one covariate $x$. For example:

$$y = \beta_0 + \beta_1 x + \epsilon$$

$\beta_0$ and $\beta_1$ are parameters that we will estimate from the data, where $\beta_0$ is the intercept, which corresponds to the value of $y$ when $x = 0$, and $\beta_1$ represents the change in $y$ when $x$ changes by one unit.

We could observe, for example, pairs of data $(x_1, y_1), \dots, (x_n, y_n)$ where each sample is a (father, son) pair of heights. We then might write:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

We can then estimate $\hat{\beta}_0$ and $\hat{\beta}_1$ from the data, then use these to predict the value of the response variable $\hat{y}$ (the height of a new adult man) for a new covariate $x$. We will talk about how to do this using the Python library statsmodels, but we will also talk about how to derive the solution mathematically.

For time series regression, we often have some observed data $y_1, \dots, y_n$ that are observations of a single variable $y$. In this case, we may try two main ways of predicting $y$:

  1. Use time as a covariate (many of the examples we’ve seen already do this -- for example, the DJIA data, fMRI data, population data over time)

  2. Use a lagged version of $y$ as the covariate. For this case, we might take $x_i = y_{i-1}$. In this case, we are predicting a current value based on a prior time point. This is called Lagged Regression or AutoRegression.

Price of chicken and price of farm-bred Norwegian salmon over time, and corresponding fitted linear trend and 95% confidence intervals.

How do we estimate $\beta_0$ and $\beta_1$?

There are a couple of ways to estimate $\beta_0$ and $\beta_1$. Standard libraries such as statsmodels use the least squares method. In this method, we minimize the error sum of squares:

$$Q = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2$$

Ordinary Least Squares (OLS)

In ordinary least squares, we can solve for:

$$\min_{\beta_0, \beta_1} \mathbb{E}\left[(y - \beta_0 - \beta_1 x)^2\right]$$

which will be the best fitting line at the population level.

Solving this gives:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \quad\text{and}\quad \hat{\beta}_1 = \frac{\sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2}$$

where

$$\bar{y} = \frac{y_1 + \dots + y_n}{n} \quad\text{and}\quad \bar{x} = \frac{x_1 + \dots + x_n}{n}$$
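These closed-form formulas can be computed directly. The sketch below uses made-up father/son heights (illustrative numbers only, not real data) and checks nothing beyond the formulas above:

```python
# Closed-form least-squares estimates on illustrative (made-up) data.
import numpy as np

x = np.array([65.0, 67.0, 68.0, 70.0, 72.0, 74.0])  # fathers' heights (inches)
y = np.array([66.5, 67.0, 69.5, 70.0, 71.5, 73.0])  # sons' heights (inches)

xbar, ybar = x.mean(), y.mean()
beta1_hat = np.sum((y - ybar) * (x - xbar)) / np.sum((x - xbar) ** 2)
beta0_hat = ybar - beta1_hat * xbar

# Predict a son's height for a new father's height of 69 inches
y_new = beta0_hat + beta1_hat * 69.0
```

The same estimates fall out of any least-squares routine (e.g. `np.polyfit(x, y, 1)`), since they all minimize the same error sum of squares $Q$.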

Let’s look at what this looks like in statsmodels with an example.