This lab covers concepts from Lectures 5 and 6 (linear regression). This includes:
Fitting ordinary least squares (OLS) regression
Simulating many regressions, determining bias in coefficients
More examples of OLS
For this lab, you will fill in the parts of the code marked ... or with the comment # FILL IN
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import statsmodels.api as sm # This contains the model fitting libraries for linear regression
from statsmodels.graphics.tsaplots import plot_acf # autocorrelation function
#!pip install astsa
import astsa # You should have pip installed this for Lab 1; if not, uncomment the line above
# Set the random seed, this is so you will generate the same answers
# each time (for example, when generating white noise)
np.random.seed(42)
First we’ll start with a simple example where we have a linear trend plus noise. We’ll take a sample from this function and try to fit a regression line to predict y from t.
t = np.arange(1, 101)
sigma = 4 # noise standard deviation (sigma=1 would give standard normal noise)
beta0 = 2
beta1 = 0.5
y = beta0 + beta1*t + sigma*np.random.randn(len(t))
plt.plot(t,y,label='y=2+0.5t+w')
plt.xlabel('t',fontsize=18)
plt.ylabel('y',fontsize=18)
plt.legend()
We first have to add a constant term to our model to be sure that we actually get an estimate of the intercept B0 (statsmodels does not include an intercept by default).
T = sm.add_constant(t)
linreg_simple2 = sm.OLS(y, T).fit()
print(linreg_simple2.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.942
Model: OLS Adj. R-squared: 0.942
Method: Least Squares F-statistic: 1601.
Date: Thu, 05 Feb 2026 Prob (F-statistic): 1.62e-62
Time: 11:01:59 Log-Likelihood: -270.29
No. Observations: 100 AIC: 544.6
Df Residuals: 98 BIC: 549.8
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.3032 0.735 1.773 0.079 -0.155 2.762
x1 0.5056 0.013 40.010 0.000 0.480 0.531
==============================================================================
Omnibus: 0.521 Durbin-Watson: 2.042
Prob(Omnibus): 0.771 Jarque-Bera (JB): 0.518
Skew: -0.167 Prob(JB): 0.772
Kurtosis: 2.885 Cond. No. 117.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
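As a sanity check, the coefficients statsmodels reports can be reproduced with the closed-form OLS solution B_hat = (X'X)^(-1) X'y. The sketch below is numpy-only and simulates its own data (the seed and generator here are illustrative choices, so the numbers will be close to, but not identical to, the table above):

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not the lab's np.random.seed(42)
t = np.arange(1, 101)
y = 2 + 0.5 * t + 4 * rng.standard_normal(len(t))

# Design matrix with an explicit intercept column (what sm.add_constant builds)
X = np.column_stack([np.ones_like(t, dtype=float), t])

# Solve the normal equations; lstsq is the numerically stable way to do this
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # [intercept, slope], close to the true [2, 0.5]
```

This is exactly what `sm.OLS(y, T).fit().params` computes under the hood for this model.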
We can also report the confidence intervals on our estimates of B0 and B1, which can be useful for quantifying our certainty in these estimates.
ci_linreg_simple2 = linreg_simple2.conf_int(alpha=0.05)
print(ci_linreg_simple2)
print(f'95% confidence interval for B0: [{ci_linreg_simple2[0,0]:.3f}, {ci_linreg_simple2[0,1]:.3f}]')
print(f'95% confidence interval for B1: [{ci_linreg_simple2[1,0]:.3f}, {ci_linreg_simple2[1,1]:.3f}]')
[[-0.15543644  2.76178753]
[ 0.48049713 0.53064895]]
95% confidence interval for B0: [-0.155, 2.762]
95% confidence interval for B1: [0.480, 0.531]
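These intervals come from a simple formula: B1_hat plus or minus the 97.5% quantile of a t distribution with n-2 degrees of freedom times the standard error of the slope. A self-contained sketch (fresh simulated data, so the numbers are illustrative rather than matching the output above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # illustrative seed
n = 100
t = np.arange(1, n + 1)
y = 2 + 0.5 * t + 4 * rng.standard_normal(n)

# OLS slope and intercept from the textbook formulas
tbar, ybar = t.mean(), y.mean()
Sxx = np.sum((t - tbar) ** 2)
b1 = np.sum((t - tbar) * (y - ybar)) / Sxx
b0 = ybar - b1 * tbar

# Unbiased residual variance: n - 2 degrees of freedom for two fitted parameters
resid = y - (b0 + b1 * t)
s2 = np.sum(resid ** 2) / (n - 2)

# Standard error of the slope and its 95% confidence interval
se_b1 = np.sqrt(s2 / Sxx)
tcrit = stats.t.ppf(0.975, df=n - 2)
print((b1 - tcrit * se_b1, b1 + tcrit * se_b1))
```

This is the same computation `conf_int(alpha=0.05)` performs for each coefficient.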
# Now plot the predictions - we can do this a few ways, either using
# `fittedvalues` or using the parameters directly:
yhat = linreg_simple2.fittedvalues
# We could also get the coefficients and use these to get yhat instead:
betas = linreg_simple2.params
plt.plot(t,y)
plt.plot(t,yhat)
#plt.plot(t,betas[0]+t*betas[1]) # This is exactly the same as yhat
[[ 0.37185304  3.24564413]
 [ 0.89903144  3.7296118 ]
 ...
 [49.93961606 52.77019642]
 [50.42358373 53.29737482]]
         mean   mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  obs_ci_upper
0    1.808749  0.724071       0.371853       3.245644     -5.570921      9.188419
1    2.314322  0.713184       0.899031       3.729612     -5.061172      9.689815
2    2.819895  0.702355       1.426094       4.213696     -4.551506     10.191295
3    3.325468  0.691588       1.953034       4.697901     -4.041922     10.692858
4    3.831041  0.680885       2.479847       5.182234     -3.532422     11.194504
..        ...       ...            ...            ...           ...           ...
95  49.838187  0.680885      48.486993      51.189381     42.474724     57.201650
96  50.343760  0.691588      48.971327      51.716194     42.976370     57.711150
97  50.849333  0.702355      49.455532      52.243134     43.477933     58.220734
98  51.354906  0.713184      49.939616      52.770196     43.979412     58.730400
99  51.860479  0.724071      50.423584      53.297375     44.480809     59.240149

[100 rows x 6 columns]

Show the predictions and confidence intervals
If we want to show the predicted data and confidence intervals for the fitted values, we can use get_prediction. We will then produce pred_summary, which contains:
mean - fitted values
mean_ci_lower, mean_ci_upper - confidence intervals for the mean
obs_ci_lower, obs_ci_upper - prediction interval for new observations
# To show the confidence intervals on the data, we can use the `get_prediction` method
#
pred = linreg_simple2.get_prediction(T)
pred_ci = pred.conf_int(alpha=0.05)
pred_summary = pred.summary_frame(alpha=0.05)
plt.plot(t,y)
plt.plot(t, pred_summary['mean'])
plt.fill_between(t, pred_summary['mean_ci_lower'], pred_summary['mean_ci_upper'], alpha=0.3, label="95% CI (mean)")
#plt.fill_between(t, pred_summary['obs_ci_lower'], pred_summary['obs_ci_upper'], alpha=0.3, label="95% PI (new obs)")
plt.xlabel('t',fontsize=18)
plt.ylabel('y',fontsize=18)
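The gap between `mean_ci_*` and `obs_ci_*` comes from one extra variance term: the prediction interval for a new observation adds the irreducible noise variance to the variance of the fitted mean. A numpy-only sketch of this relationship at a single point x0 (fresh simulated data under the same model, so the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
n = 100
t = np.arange(1, n + 1)
y = 2 + 0.5 * t + 4 * rng.standard_normal(n)

tbar = t.mean()
Sxx = np.sum((t - tbar) ** 2)
b1 = np.sum((t - tbar) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * tbar
s2 = np.sum((y - b0 - b1 * t) ** 2) / (n - 2)  # residual variance estimate

x0 = 50.0  # point at which to form the intervals
var_mean = s2 * (1 / n + (x0 - tbar) ** 2 / Sxx)  # variance of the fitted mean
var_obs = s2 + var_mean                            # plus the noise variance itself
print(np.sqrt(var_mean), np.sqrt(var_obs))  # the obs interval is always wider
```

This is why the (commented-out) `obs_ci` band would be much wider than the `mean_ci` band in the plot above.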
For time series, one thing we want to check is that our residuals (the difference between the predicted data and our actual value) are weakly stationary, which allows us to assume that our error is weakly stationary.
We can do that by plotting the residuals over time as well as the autocorrelation function of the residuals. How does this look?
residuals = linreg_simple2.resid
plt.figure()
plt.plot(t,residuals)
plt.axhline(0, color='k', linewidth=0.5) # Horizontal line at 0
plt.xlabel('t')
plt.ylabel('Residual')
plt.figure(figsize=(10, 5))
plot_acf(residuals, lags=40, title='Autocorrelation Function of Residuals')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.show()
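The Durbin-Watson statistic reported in the summaries above is another quick check on residual autocorrelation: it is roughly 2(1 - rho1_hat), so values near 2 indicate uncorrelated residuals and values near 0 indicate strong positive autocorrelation (statsmodels also provides this as `statsmodels.stats.stattools.durbin_watson`). A numpy-only sketch on illustrative simulated series:

```python
import numpy as np

def durbin_watson(e):
    '''Durbin-Watson statistic: ~2 for uncorrelated residuals, near 0 for
    strongly positively autocorrelated residuals.'''
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(1)  # illustrative seed
white = rng.standard_normal(500)
walk = np.cumsum(white)  # a random walk is heavily autocorrelated

print(durbin_watson(white))  # close to 2
print(durbin_watson(walk))   # close to 0
```

Compare these to the Durbin-Watson values printed in the OLS summaries: about 2.04 for the trend-plus-white-noise fit, but 0.046 for the chicken fit below, signaling strongly correlated residuals.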
The chicken example
Now let’s load some data from astsa, for example, the price of chicken over time, from 2001 to 2016. As pointed out in your book, commodities (such as raw materials, basic resources, agricultural, or mining products) show specific fluctuations in their prices over time. For this example, we’ll fit a regression model with chicken prices as our response and time as our predictor.
chicken_data = astsa.load_chicken()
# Fix the time index
chicken_data['Time'] = pd.to_datetime(chicken_data['Time'])
chicken_data.set_index('Time', inplace=True)
print(chicken_data)
             Value
Time
2001-08-01 65.58
2001-09-01 66.48
2001-10-01 65.70
2001-11-01 64.33
2001-12-01 63.23
... ...
2016-03-01 111.56
2016-04-01 111.55
2016-05-01 111.98
2016-06-01 111.84
2016-07-01 111.46
[180 rows x 1 columns]
plt.figure(figsize=(8,6))
plt.plot(chicken_data.index, chicken_data['Value'], label='Value')
plt.xlabel('Year', fontsize=18)
plt.ylabel('Value', fontsize=18)
plt.title('Chicken prices (U.S. cents/pound)', fontsize=18)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16);
Chicken price regression
We will fit a linear regression model to the chicken price data using time as the covariate. Here, the model is:
y_t = B0 + B1*t + w_t
where y_t is the price of chicken at time index t, and w_t is white noise. We can interpret the coefficient B1 as the numerical change in the price of chicken from one month to the next (since each time step is one month). From just looking at the graph, we know that the linear fit is likely not the best model for these data, since there are also underlying (potentially seasonal) fluctuations in the data. Still, we may be interested in the overall trend in price increases for this commodity.
We can then use statsmodels to fit the OLS solution for the line of best fit to the data.
yvec = np.array(chicken_data['Value'])
n = len(yvec) # number of time points
xvec = np.arange(1, n+1)
X = sm.add_constant(xvec)
linreg2 = sm.OLS(yvec, X).fit()
print(linreg2.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.917
Model: OLS Adj. R-squared: 0.917
Method: Least Squares F-statistic: 1974.
Date: Thu, 05 Feb 2026 Prob (F-statistic): 2.83e-98
Time: 11:23:23 Log-Likelihood: -532.83
No. Observations: 180 AIC: 1070.
Df Residuals: 178 BIC: 1076.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 58.5832 0.703 83.331 0.000 57.196 59.971
x1 0.2993 0.007 44.434 0.000 0.286 0.313
==============================================================================
Omnibus: 9.576 Durbin-Watson: 0.046
Prob(Omnibus): 0.008 Jarque-Bera (JB): 4.204
Skew: 0.018 Prob(JB): 0.122
Kurtosis: 2.252 Cond. No. 210.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Now we have an estimate of B0 (const, about 58.58) and an estimate of B1 (x1, about 0.299). What this means is that the price of chicken has been going up by approximately 0.30 cents per pound per month, starting from a base price of about 58.6 cents per pound.
We can also calculate the confidence intervals to understand the uncertainty associated with the estimate.
ci_linreg2 = linreg2.conf_int(alpha=0.05)
print(ci_linreg2)
print(f'95% confidence interval for price fluctuation: [{ci_linreg2[1,0]:.2f}, {ci_linreg2[1,1]:.2f}]')
[[57.19590866 59.97056185]
 [ 0.28604822  0.31263657]]
95% confidence interval for price fluctuation: [0.29, 0.31]
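Since the slope is in cents per pound per month, a linear rescaling gives the annual trend, and the confidence interval scales the same way. Using the point estimate and CI printed above:

```python
# Slope and its 95% CI from the fit above (cents/pound per month)
monthly = 0.2993
ci_monthly = (0.28604822, 0.31263657)

# Multiplying a linear quantity by a constant rescales its CI by the same constant
annual = 12 * monthly
ci_annual = (12 * ci_monthly[0], 12 * ci_monthly[1])
print(f'annual increase: {annual:.2f} cents/pound '
      f'(95% CI [{ci_annual[0]:.2f}, {ci_annual[1]:.2f}])')
```

So chicken prices rose by roughly 3.6 cents per pound per year over this period.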
Now let’s plot the regression line on the original data.
plt.figure(figsize=(8,6))
plt.plot(chicken_data.index, chicken_data['Value'], label='Value')
plt.plot(chicken_data.index, linreg2.fittedvalues, label='y=B0+B1x')
plt.axhline(chicken_data['Value'].mean(), color='r', label='mean price')
plt.xlabel('Year', fontsize=18)
plt.ylabel('Value', fontsize=18)
plt.title('Chicken prices (U.S. cents/pound)', fontsize=18)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16);
plt.legend()
What are some problems with this? Do our residuals show any strong or unmodeled dependence on time?
residuals = linreg2.resid
plt.figure()
plt.plot(chicken_data.index,residuals)
plt.axhline(0, color='k', linewidth=0.5) # Horizontal line at 0
plt.xlabel('Date')
plt.ylabel('Residual')
plt.figure(figsize=(10, 5))
plot_acf(residuals, lags=80, title='Autocorrelation Function of Residuals')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.show()
<Figure size 1000x500 with 0 Axes>
What if we took a moving average of the chicken data to smooth out some of the fluctuations? Here we will smooth over a 3 year + 1 month window (we are choosing an odd-length window so it is centered on each observation).
def moving_average(w, n):
    '''
    Create an acausal moving average of the time series `w` using an `n`-point moving average
    '''
    v = np.full(len(w), np.nan)
    half_win = n // 2
    for t in np.arange(half_win, len(w) - half_win):
        v[t] = np.mean(w[t-half_win : t+half_win+1])
    return v
smoothing_win = 3*12+1 # 3 years * 12 months/year + 1 to make it an odd #
chicken_smoothed = moving_average(chicken_data['Value'], n=smoothing_win)
plt.plot(chicken_data.index,chicken_data['Value'], label='original')
plt.plot(chicken_data.index,chicken_smoothed, label='smoothed')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
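As an aside, the loop in `moving_average` can also be written with `np.convolve`, which is faster for long series. This sketch (the function name is ours, not part of the lab) produces the same centered window with the same NaN edges:

```python
import numpy as np

def moving_average_conv(w, n):
    '''Same centered n-point moving average as the loop version, via convolution.
    The first and last n//2 points are left as NaN, exactly like the loop.'''
    w = np.asarray(w, dtype=float)
    v = np.full(len(w), np.nan)
    half = n // 2
    # mode='valid' returns only the fully overlapping windows (length len(w)-n+1)
    v[half:len(w) - half] = np.convolve(w, np.ones(n) / n, mode='valid')
    return v

x = np.arange(20, dtype=float)
print(moving_average_conv(x, 5)[:6])  # two NaNs, then the window means 2., 3., 4., 5.
```

For an odd window n, this matches the loop version element for element.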

# Now let's fit a regression to this:
yvec = np.array(chicken_smoothed)
n = len(yvec) # number of time points
xvec = np.arange(1, n+1)
X = sm.add_constant(xvec)
linreg_smooth = sm.OLS(yvec, X).fit()
print(linreg_smooth.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: nan
Model: OLS Adj. R-squared: nan
Method: Least Squares F-statistic: nan
Date: Thu, 05 Feb 2026 Prob (F-statistic): nan
Time: 12:20:04 Log-Likelihood: nan
No. Observations: 180 AIC: nan
Df Residuals: 178 BIC: nan
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const nan nan nan nan nan nan
x1 nan nan nan nan nan nan
==============================================================================
Omnibus: nan Durbin-Watson: nan
Prob(Omnibus): nan Jarque-Bera (JB): nan
Skew: nan Prob(JB): nan
Kurtosis: nan Cond. No. 210.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We got NaNs (“Not a Number”/undefined)! Why did this happen? It turns out that when we defined our moving average, we only calculated the windowed average once we had enough samples, so the beginning and end of our smoothed series are now undefined.
To make this work out, we have to remove the nan values as follows:
yvec = np.array(chicken_smoothed[~np.isnan(chicken_smoothed)]) # Keep only the points that are *not* NaN (~ is elementwise logical NOT)
n = len(yvec) # number of time points
xvec = np.arange(1, n+1)
print(len(yvec), len(xvec))
X = sm.add_constant(xvec)
linreg_smooth = sm.OLS(yvec, X).fit()
print(linreg_smooth.summary())
144 144
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.949
Model: OLS Adj. R-squared: 0.949
Method: Least Squares F-statistic: 2635.
Date: Thu, 05 Feb 2026 Prob (F-statistic): 1.41e-93
Time: 12:20:05 Log-Likelihood: -353.12
No. Observations: 144 AIC: 710.2
Df Residuals: 142 BIC: 716.2
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 63.8031 0.474 134.568 0.000 62.866 64.740
x1 0.2912 0.006 51.336 0.000 0.280 0.302
==============================================================================
Omnibus: 15.433 Durbin-Watson: 0.005
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8.071
Skew: 0.395 Prob(JB): 0.0177
Kurtosis: 2.150 Cond. No. 168.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
residuals_smooth = linreg_smooth.resid
# Make sure we get the new time index as well
half_win = smoothing_win // 2
new_t = [chicken_data.index[a] for a in np.arange(half_win, len(chicken_data.index) - half_win)]
plt.figure()
plt.plot(new_t,residuals_smooth)
plt.axhline(0, color='k', linewidth=0.5) # Horizontal line at 0
plt.xlabel('Date')
plt.ylabel('Residual')
plt.figure(figsize=(10, 5))
plot_acf(residuals_smooth, lags=100, title='Autocorrelation Function of Residuals')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.show();
As it turns out, there are still fairly strong autocorrelations in these data even after applying the moving average, so we will probably want to do some differencing or other transformations first.
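First differencing (y_t - y_{t-1}) removes a linear trend entirely, which is why it helps here. A numpy-only sketch on simulated trend-plus-noise data (the slope 0.3 and intercept 58.6 are just chicken-like illustrative values, not the real series):

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
t = np.arange(200)
y = 58.6 + 0.3 * t + rng.standard_normal(200)  # linear trend + white noise

dy = np.diff(y)  # first difference: removes the deterministic linear trend
# The differenced series fluctuates around the slope (0.3), not the level
print(y.mean(), dy.mean())
```

After differencing, what remains is a (noisier) stationary series whose mean is the monthly trend; the same transformation applied to the chicken data would let us re-examine the residual ACF.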
Lecture 6 review
Now as in Lecture 6, we are going to start with a known function y = B0 + B1*x + w. We will assume that w is noise drawn from a Gaussian distribution N(0, sigma^2). Here we will choose arbitrary values for each of these parameters.
We will draw samples from this function and then compute our estimates B0_hat, B1_hat, and sigma2_hat through the MLE solutions we derived in lecture.
beta0 = 0.5
beta1 = 28.3
sigma2 = 22.2
# n is the number of time points
n = 10
x = np.arange(1,n+1)
# how many times to repeat the simulation
nsamps = 10000
y = beta0 + beta1*x + np.random.randn(nsamps,n)*np.sqrt(sigma2)
print('y is of shape', y.shape)
y is of shape (10000, 10)
Now we will use the MLE solutions for estimating each of these parameters from our sampled data.
# Initialize values for our estimates
beta0_hat = np.zeros(nsamps,)
beta1_hat = np.zeros(nsamps,)
sigma2_hat = np.zeros(nsamps,)
sigma2_hat_unbiased = np.zeros(nsamps,)
for i in np.arange(nsamps):
    beta1_hat[i] = np.sum((y[i,:]-y[i,:].mean())*(x-x.mean()))/np.sum((x-x.mean())**2)
    beta0_hat[i] = y[i,:].mean() - beta1_hat[i] * x.mean()
    sigma2_hat[i] = 1/n*np.sum((y[i,:] - beta0_hat[i] - beta1_hat[i]*x)**2)
    sigma2_hat_unbiased[i] = sigma2_hat[i]*n/(n-2)
plt.figure(figsize=(10,3))
plt.plot(y.T);
plt.figure(figsize=(10,3))
plt.subplot(1,3,1)
plt.hist(beta0_hat, bins=50, alpha=0.5);
plt.axvline(beta0, color='r')
plt.axvline(beta0_hat.mean(), color='blue', linestyle='--')
plt.title('beta0 estimate')
plt.subplot(1,3,2)
plt.hist(beta1_hat, bins=50, alpha=0.5);
plt.axvline(beta1, color='r')
plt.axvline(beta1_hat.mean(), color='blue', linestyle='--')
plt.title('beta1 estimate')
plt.subplot(1,3,3)
plt.hist(sigma2_hat, bins=50, alpha=0.5);
plt.axvline(sigma2_hat.mean(), color='blue', linestyle='--')
plt.hist(sigma2_hat_unbiased, bins=50, alpha=0.5);
plt.axvline(sigma2, color='r')
plt.axvline(sigma2_hat_unbiased.mean(), color='m', linestyle='--')
plt.title('sigma2 estimate')
print(np.mean(sigma2_hat), np.std(sigma2_hat))
print(np.mean(sigma2_hat_unbiased), np.std(sigma2_hat_unbiased))
print(sigma2)
17.726358503291888 8.794352684562693
22.157948129114857 10.992940855703367
22.2
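The loop over `nsamps` can also be fully vectorized with broadcasting, which makes the bias in the MLE variance estimate easy to see in one shot. This is an optional numpy-only rewrite (fresh seed, so the printed means will differ slightly from the loop's output above):

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
beta0, beta1, sigma2 = 0.5, 28.3, 22.2
n, nsamps = 10, 10000
x = np.arange(1, n + 1)
y = beta0 + beta1 * x + rng.standard_normal((nsamps, n)) * np.sqrt(sigma2)

# Vectorized OLS across all nsamps simulations at once
xc = x - x.mean()
b1 = (y - y.mean(axis=1, keepdims=True)) @ xc / np.sum(xc ** 2)  # (nsamps,)
b0 = y.mean(axis=1) - b1 * x.mean()
resid = y - b0[:, None] - b1[:, None] * x
s2_mle = np.sum(resid ** 2, axis=1) / n        # biased MLE of sigma^2
s2_unbiased = s2_mle * n / (n - 2)             # bias-corrected version

# The MLE underestimates sigma^2 by a factor (n-2)/n = 0.8 here
print(s2_mle.mean(), s2_unbiased.mean())
```

As with the loop, the average MLE estimate sits near 0.8 * 22.2 = 17.76, while the corrected estimator's average is near the true 22.2.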
# If you finish early -- try loading another dataset like the DJIA dataset and running the same analysis. What do you notice?