This lab covers concepts from Lectures 5 and 6 (linear regression). This includes:
Fitting ordinary least squares (OLS) regression
Simulating many regressions, determining bias in coefficients
More examples of OLS
For this lab, you will fill in the parts of the code marked ... or with the comment # FILL IN
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import statsmodels.api as sm # This contains the model fitting libraries for linear regression
from statsmodels.graphics.tsaplots import plot_acf # autocorrelation function
#!pip install astsa
import astsa # You should have pip installed this for Lab 1; if not, uncomment the line above
# Set the random seed, this is so you will generate the same answers
# each time (for example, when generating white noise)
np.random.seed(42)
First we’ll start with a simple example where we have a linear trend plus noise. We’ll take a sample from this function and try to fit a regression line to predict y from t.
t = np.arange(1, 101)
sigma = 4 # noise standard deviation (sigma=1 would give standard normal noise)
beta0 = 2
beta1 = 0.5
y = beta0 + beta1*t + sigma*np.random.randn(len(t))
plt.plot(t,y,label='y=2+0.5t+w')
plt.xlabel('t',fontsize=18)
plt.ylabel('y',fontsize=18)
plt.legend()
We first have to add a constant term to our model to be sure that we actually get an estimate of the intercept B0 (statsmodels does not include an intercept by default).
T = sm.add_constant(t)
linreg_simple2 = sm.OLS(y, T).fit()
print(linreg_simple2.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.942
Model: OLS Adj. R-squared: 0.942
Method: Least Squares F-statistic: 1601.
Date: Thu, 05 Feb 2026 Prob (F-statistic): 1.62e-62
Time: 11:01:59 Log-Likelihood: -270.29
No. Observations: 100 AIC: 544.6
Df Residuals: 98 BIC: 549.8
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.3032 0.735 1.773 0.079 -0.155 2.762
x1 0.5056 0.013 40.010 0.000 0.480 0.531
==============================================================================
Omnibus: 0.521 Durbin-Watson: 2.042
Prob(Omnibus): 0.771 Jarque-Bera (JB): 0.518
Skew: -0.167 Prob(JB): 0.772
Kurtosis: 2.885 Cond. No. 117.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
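As a sanity check, the coefficients statsmodels reports can be reproduced with the closed-form OLS solution B_hat = (X'X)^(-1) X'y. The sketch below is numpy-only and simulates its own data (the seed and generator here are illustrative choices, so the numbers will be close to, but not identical to, the table above):

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not the lab's np.random.seed(42)
t = np.arange(1, 101)
y = 2 + 0.5 * t + 4 * rng.standard_normal(len(t))

# Design matrix with an explicit intercept column (what sm.add_constant builds)
X = np.column_stack([np.ones_like(t, dtype=float), t])

# Solve the normal equations; lstsq is the numerically stable way to do this
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # [intercept, slope], close to the true [2, 0.5]
```

This is exactly what `sm.OLS(y, T).fit().params` computes under the hood for this model.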
We can also report the confidence intervals on our estimates of B0 and B1, which can be useful for quantifying our certainty in these estimates.
ci_linreg_simple2 = linreg_simple2.conf_int(alpha=0.05)
print(ci_linreg_simple2)
print(f'95% confidence interval for B0: [{ci_linreg_simple2[0,0]:.3f}, {ci_linreg_simple2[0,1]:.3f}]')
print(f'95% confidence interval for B1: [{ci_linreg_simple2[1,0]:.3f}, {ci_linreg_simple2[1,1]:.3f}]')
[[-0.15543644  2.76178753]
[ 0.48049713 0.53064895]]
95% confidence interval for B0: [-0.155, 2.762]
95% confidence interval for B1: [0.480, 0.531]
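These intervals come from a simple formula: B1_hat plus or minus the 97.5% quantile of a t distribution with n-2 degrees of freedom times the standard error of the slope. A self-contained sketch (fresh simulated data, so the numbers are illustrative rather than matching the output above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # illustrative seed
n = 100
t = np.arange(1, n + 1)
y = 2 + 0.5 * t + 4 * rng.standard_normal(n)

# OLS slope and intercept from the textbook formulas
tbar, ybar = t.mean(), y.mean()
Sxx = np.sum((t - tbar) ** 2)
b1 = np.sum((t - tbar) * (y - ybar)) / Sxx
b0 = ybar - b1 * tbar

# Unbiased residual variance: n - 2 degrees of freedom for two fitted parameters
resid = y - (b0 + b1 * t)
s2 = np.sum(resid ** 2) / (n - 2)

# Standard error of the slope and its 95% confidence interval
se_b1 = np.sqrt(s2 / Sxx)
tcrit = stats.t.ppf(0.975, df=n - 2)
print((b1 - tcrit * se_b1, b1 + tcrit * se_b1))
```

This is the same computation `conf_int(alpha=0.05)` performs for each coefficient.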
# Now plot the predictions - we can do this a few ways, either using
# `fittedvalues` or using the parameters directly:
yhat = linreg_simple2.fittedvalues
# We could also get the coefficients and use these to get yhat instead:
betas = linreg_simple2.params
plt.plot(t,y)
plt.plot(t,yhat)
#plt.plot(t,betas[0]+t*betas[1]) # This is exactly the same as yhat
[[ 0.37185304  3.24564413]
 [ 0.89903144  3.7296118 ]
 ...
 [49.93961606 52.77019642]
 [50.42358373 53.29737482]]
         mean   mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  obs_ci_upper
0    1.808749  0.724071       0.371853       3.245644     -5.570921      9.188419
1    2.314322  0.713184       0.899031       3.729612     -5.061172      9.689815
2    2.819895  0.702355       1.426094       4.213696     -4.551506     10.191295
3    3.325468  0.691588       1.953034       4.697901     -4.041922     10.692858
4    3.831041  0.680885       2.479847       5.182234     -3.532422     11.194504
..        ...       ...            ...            ...           ...           ...
95  49.838187  0.680885      48.486993      51.189381     42.474724     57.201650
96  50.343760  0.691588      48.971327      51.716194     42.976370     57.711150
97  50.849333  0.702355      49.455532      52.243134     43.477933     58.220734
98  51.354906  0.713184      49.939616      52.770196     43.979412     58.730400
99  51.860479  0.724071      50.423584      53.297375     44.480809     59.240149

[100 rows x 6 columns]

Show the predictions and confidence intervals
If we want to show the predicted data and confidence intervals for the fitted values, we can use get_prediction. We will then produce pred_summary, which contains:
mean - fitted values
mean_ci_lower, mean_ci_upper - confidence intervals for the mean
obs_ci_lower, obs_ci_upper - prediction interval for new observations
# To show the confidence intervals on the data, we can use the `get_prediction` method
#
pred = linreg_simple2.get_prediction(T)
pred_ci = pred.conf_int(alpha=0.05)
pred_summary = pred.summary_frame(alpha=0.05)
plt.plot(t,y)
plt.plot(t, pred_summary['mean'])
plt.fill_between(t, pred_summary['mean_ci_lower'], pred_summary['mean_ci_upper'], alpha=0.3, label="95% CI (mean)")
#plt.fill_between(t, pred_summary['obs_ci_lower'], pred_summary['obs_ci_upper'], alpha=0.3, label="95% PI (new obs)")
plt.xlabel('t',fontsize=18)
plt.ylabel('y',fontsize=18)
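The gap between `mean_ci_*` and `obs_ci_*` comes from one extra variance term: the prediction interval for a new observation adds the irreducible noise variance to the variance of the fitted mean. A numpy-only sketch of this relationship at a single point x0 (fresh simulated data under the same model, so the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
n = 100
t = np.arange(1, n + 1)
y = 2 + 0.5 * t + 4 * rng.standard_normal(n)

tbar = t.mean()
Sxx = np.sum((t - tbar) ** 2)
b1 = np.sum((t - tbar) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * tbar
s2 = np.sum((y - b0 - b1 * t) ** 2) / (n - 2)  # residual variance estimate

x0 = 50.0  # point at which to form the intervals
var_mean = s2 * (1 / n + (x0 - tbar) ** 2 / Sxx)  # variance of the fitted mean
var_obs = s2 + var_mean                            # plus the noise variance itself
print(np.sqrt(var_mean), np.sqrt(var_obs))  # the obs interval is always wider
```

This is why the (commented-out) `obs_ci` band would be much wider than the `mean_ci` band in the plot above.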
For time series, one thing we want to check is that our residuals (the difference between the predicted data and our actual value) are weakly stationary, which allows us to assume that our error is weakly stationary.
We can do that by plotting the residuals over time as well as the autocorrelation function of the residuals. How does this look?
residuals = linreg_simple2.resid
plt.figure()
plt.plot(t,residuals)
plt.axhline(0, color='k', linewidth=0.5) # Horizontal line at 0
plt.xlabel('t')
plt.ylabel('Residual')
plt.figure(figsize=(10, 5))
plot_acf(residuals, lags=40, title='Autocorrelation Function of Residuals')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.show()
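The Durbin-Watson statistic reported in the summaries above is another quick check on residual autocorrelation: it is roughly 2(1 - rho1_hat), so values near 2 indicate uncorrelated residuals and values near 0 indicate strong positive autocorrelation (statsmodels also provides this as `statsmodels.stats.stattools.durbin_watson`). A numpy-only sketch on illustrative simulated series:

```python
import numpy as np

def durbin_watson(e):
    '''Durbin-Watson statistic: ~2 for uncorrelated residuals, near 0 for
    strongly positively autocorrelated residuals.'''
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(1)  # illustrative seed
white = rng.standard_normal(500)
walk = np.cumsum(white)  # a random walk is heavily autocorrelated

print(durbin_watson(white))  # close to 2
print(durbin_watson(walk))   # close to 0
```

Compare these to the Durbin-Watson values printed in the OLS summaries: about 2.04 for the trend-plus-white-noise fit, but 0.046 for the chicken fit below, signaling strongly correlated residuals.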
The chicken example
Now let’s load some data from astsa, for example, the price of chicken over time, from 2001 to 2016. As pointed out in your book, commodities (such as raw materials, basic resources, agricultural, or mining products) show specific fluctuations in their prices over time. For this example, we’ll fit a regression model with chicken prices as our response and time as our predictor.
chicken_data = astsa.load_chicken()
# Fix the time index
chicken_data['Time'] = pd.to_datetime(chicken_data['Time'])
chicken_data.set_index('Time', inplace=True)
print(chicken_data)
             Value
Time
2001-08-01 65.58
2001-09-01 66.48
2001-10-01 65.70
2001-11-01 64.33
2001-12-01 63.23
... ...
2016-03-01 111.56
2016-04-01 111.55
2016-05-01 111.98
2016-06-01 111.84
2016-07-01 111.46
[180 rows x 1 columns]
plt.figure(figsize=(8,6))
plt.plot(chicken_data.index, chicken_data['Value'], label='Value')
plt.xlabel('Year', fontsize=18)
plt.ylabel('Value', fontsize=18)
plt.title('Chicken prices (U.S. cents/pound)', fontsize=18)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16);
Chicken price regression
We will fit a linear regression model to the chicken price data using time as the covariate. Here, the model is:
y_t = B0 + B1*t + w_t
where y_t is the price of chicken at time index t, and w_t is white noise. We can interpret the coefficient B1 as the numerical change in the price of chicken from one month to the next (since each time step is one month). From just looking at the graph, we know that the linear fit is likely not the best model for these data, since there are also underlying (potentially seasonal) fluctuations in the data. Still, we may be interested in the overall trend in price increases for this commodity.
We can then use statsmodels to fit the OLS solution for the line of best fit to the data.
yvec = np.array(chicken_data['Value'])
n = len(yvec) # number of time points
xvec = np.arange(1, n+1)
X = sm.add_constant(xvec)
linreg2 = sm.OLS(yvec, X).fit()
print(linreg2.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.917
Model: OLS Adj. R-squared: 0.917
Method: Least Squares F-statistic: 1974.
Date: Thu, 05 Feb 2026 Prob (F-statistic): 2.83e-98
Time: 11:23:23 Log-Likelihood: -532.83
No. Observations: 180 AIC: 1070.
Df Residuals: 178 BIC: 1076.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 58.5832 0.703 83.331 0.000 57.196 59.971
x1 0.2993 0.007 44.434 0.000 0.286 0.313
==============================================================================
Omnibus: 9.576 Durbin-Watson: 0.046
Prob(Omnibus): 0.008 Jarque-Bera (JB): 4.204
Skew: 0.018 Prob(JB): 0.122
Kurtosis: 2.252 Cond. No. 210.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Now we have an estimate of B0 (const, about 58.58) and an estimate of B1 (x1, about 0.299). What this means is that the price of chicken has been going up by approximately 0.30 cents per pound per month, starting from a base price of about 58.6 cents per pound.
We can also calculate the confidence intervals to understand the uncertainty associated with the estimate.
ci_linreg2 = linreg2.conf_int(alpha=0.05)
print(ci_linreg2)
print(f'95% confidence interval for price fluctuation: [{ci_linreg2[1,0]:.2f}, {ci_linreg2[1,1]:.2f}]')
[[57.19590866 59.97056185]
 [ 0.28604822  0.31263657]]
95% confidence interval for price fluctuation: [0.29, 0.31]
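Since the slope is in cents per pound per month, a linear rescaling gives the annual trend, and the confidence interval scales the same way. Using the point estimate and CI printed above:

```python
# Slope and its 95% CI from the fit above (cents/pound per month)
monthly = 0.2993
ci_monthly = (0.28604822, 0.31263657)

# Multiplying a linear quantity by a constant rescales its CI by the same constant
annual = 12 * monthly
ci_annual = (12 * ci_monthly[0], 12 * ci_monthly[1])
print(f'annual increase: {annual:.2f} cents/pound '
      f'(95% CI [{ci_annual[0]:.2f}, {ci_annual[1]:.2f}])')
```

So chicken prices rose by roughly 3.6 cents per pound per year over this period.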
Now let’s plot the regression line on the original data.
plt.figure(figsize=(8,6))
plt.plot(chicken_data.index, chicken_data['Value'], label='Value')
plt.plot(chicken_data.index, linreg2.fittedvalues, label='y=B0+B1x')
plt.axhline(chicken_data['Value'].mean(), color='r', label='mean price')
plt.xlabel('Year', fontsize=18)
plt.ylabel('Value', fontsize=18)
plt.title('Chicken prices (U.S. cents/pound)', fontsize=18)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16);
plt.legend()
What are some problems with this? Do our residuals show any strong or unmodeled dependence on time?
residuals = linreg2.resid
plt.figure()
plt.plot(chicken_data.index,residuals)
plt.axhline(0, color='k', linewidth=0.5) # Horizontal line at 0
plt.xlabel('Date')
plt.ylabel('Residual')
plt.figure(figsize=(10, 5))
plot_acf(residuals, lags=80, title='Autocorrelation Function of Residuals')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.show()
<Figure size 1000x500 with 0 Axes>
What if we took a moving average of the chicken data to smooth out some of the fluctuations? Here we will smooth over a 3 year + 1 month window (we are choosing an odd-length window so it is centered on each observation).
def moving_average(w, n):
    '''
    Create an acausal moving average of the time series `w` using an `n`-point moving average
    '''
    v = np.full(len(w), np.nan)
    half_win = n // 2
    for t in np.arange(half_win, len(w) - half_win):
        v[t] = np.mean(w[t-half_win : t+half_win+1])
    return v
smoothing_win = 3*12+1 # 3 years * 12 months/year + 1 to make it an odd #
chicken_smoothed = moving_average(chicken_data['Value'], n=smoothing_win)
plt.plot(chicken_data.index,chicken_data['Value'], label='original')
plt.plot(chicken_data.index,chicken_smoothed, label='smoothed')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
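As an aside, the loop in `moving_average` can also be written with `np.convolve`, which is faster for long series. This sketch (the function name is ours, not part of the lab) produces the same centered window with the same NaN edges:

```python
import numpy as np

def moving_average_conv(w, n):
    '''Same centered n-point moving average as the loop version, via convolution.
    The first and last n//2 points are left as NaN, exactly like the loop.'''
    w = np.asarray(w, dtype=float)
    v = np.full(len(w), np.nan)
    half = n // 2
    # mode='valid' returns only the fully overlapping windows (length len(w)-n+1)
    v[half:len(w) - half] = np.convolve(w, np.ones(n) / n, mode='valid')
    return v

x = np.arange(20, dtype=float)
print(moving_average_conv(x, 5)[:6])  # two NaNs, then the window means 2., 3., 4., 5.
```

For an odd window n, this matches the loop version element for element.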

# Now let's fit a regression to this:
yvec = np.array(chicken_smoothed)
n = len(yvec) # number of time points
xvec = np.arange(1, n+1)
X = sm.add_constant(xvec)
linreg_smooth = sm.OLS(yvec, X).fit()
print(linreg_smooth.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: nan
Model: OLS Adj. R-squared: nan
Method: Least Squares F-statistic: nan
Date: Thu, 05 Feb 2026 Prob (F-statistic): nan
Time: 12:20:04 Log-Likelihood: nan
No. Observations: 180 AIC: nan
Df Residuals: 178 BIC: nan
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const nan nan nan nan nan nan
x1 nan nan nan nan nan nan
==============================================================================
Omnibus: nan Durbin-Watson: nan
Prob(Omnibus): nan Jarque-Bera (JB): nan
Skew: nan Prob(JB): nan
Kurtosis: nan Cond. No. 210.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We got NaNs (“Not a Number”/undefined)! Why did this happen? It turns out that when we defined our moving average, we only calculated the windowed average once we had enough samples, so the beginning and end of our smoothed series are now undefined.
To make this work out, we have to remove the nan values as follows:
yvec = np.array(chicken_smoothed[~np.isnan(chicken_smoothed)]) # Keep only the points that are *not* NaN (~ is elementwise logical NOT)
n = len(yvec) # number of time points
xvec = np.arange(1, n+1)
print(len(yvec), len(xvec))
X = sm.add_constant(xvec)
linreg_smooth = sm.OLS(yvec, X).fit()
print(linreg_smooth.summary())
144 144
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.949
Model: OLS Adj. R-squared: 0.949
Method: Least Squares F-statistic: 2635.
Date: Thu, 05 Feb 2026 Prob (F-statistic): 1.41e-93
Time: 12:20:05 Log-Likelihood: -353.12
No. Observations: 144 AIC: 710.2
Df Residuals: 142 BIC: 716.2
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 63.8031 0.474 134.568 0.000 62.866 64.740
x1 0.2912 0.006 51.336 0.000 0.280 0.302
==============================================================================
Omnibus: 15.433 Durbin-Watson: 0.005
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8.071
Skew: 0.395 Prob(JB): 0.0177
Kurtosis: 2.150 Cond. No. 168.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
residuals_smooth = linreg_smooth.resid
# Make sure we get the new time index as well
half_win = smoothing_win // 2
new_t = [chicken_data.index[a] for a in np.arange(half_win, len(chicken_data.index) - half_win)]
plt.figure()
plt.plot(new_t,residuals_smooth)
plt.axhline(0, color='k', linewidth=0.5) # Horizontal line at 0
plt.xlabel('Date')
plt.ylabel('Residual')
plt.figure(figsize=(10, 5))
plot_acf(residuals_smooth, lags=100, title='Autocorrelation Function of Residuals')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.show();
As it turns out, there are still fairly strong autocorrelations in these data even after applying the moving average, so we will probably want to do some differencing or other transformations first.
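First differencing (y_t - y_{t-1}) removes a linear trend entirely, which is why it helps here. A numpy-only sketch on simulated trend-plus-noise data (the slope 0.3 and intercept 58.6 are just chicken-like illustrative values, not the real series):

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
t = np.arange(200)
y = 58.6 + 0.3 * t + rng.standard_normal(200)  # linear trend + white noise

dy = np.diff(y)  # first difference: removes the deterministic linear trend
# The differenced series fluctuates around the slope (0.3), not the level
print(y.mean(), dy.mean())
```

After differencing, what remains is a (noisier) stationary series whose mean is the monthly trend; the same transformation applied to the chicken data would let us re-examine the residual ACF.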
Lecture 6 review
Now as in Lecture 6, we are going to start with a known function y = B0 + B1*x + w. We will assume that w is noise drawn from a Gaussian distribution N(0, sigma^2). Here we will choose arbitrary values for each of these parameters.
We will draw samples from this function and then compute our estimates B0_hat, B1_hat, and sigma2_hat through the MLE solutions we derived in lecture.
beta0 = 0.5
beta1 = 28.3
sigma2 = 22.2
# n is the number of time points
n = 10
x = np.arange(1,n+1)
# how many times to repeat the simulation
nsamps = 10000
y = beta0 + beta1*x + np.random.randn(nsamps,n)*np.sqrt(sigma2)
print('y is of shape', y.shape)
y is of shape (10000, 10)
Now we will use the MLE solutions for estimating each of these parameters from our sampled data.
# Initialize values for our estimates
beta0_hat = np.zeros(nsamps,)
beta1_hat = np.zeros(nsamps,)
sigma2_hat = np.zeros(nsamps,)
sigma2_hat_unbiased = np.zeros(nsamps,)
for i in np.arange(nsamps):
    beta1_hat[i] = np.sum((y[i,:]-y[i,:].mean())*(x-x.mean()))/np.sum((x-x.mean())**2)
    beta0_hat[i] = y[i,:].mean() - beta1_hat[i] * x.mean()
    sigma2_hat[i] = 1/n*np.sum((y[i,:] - beta0_hat[i] - beta1_hat[i]*x)**2)
    sigma2_hat_unbiased[i] = sigma2_hat[i]*n/(n-2)
plt.figure(figsize=(10,3))
plt.plot(y.T);
plt.figure(figsize=(10,3))
plt.subplot(1,3,1)
plt.hist(beta0_hat, bins=50, alpha=0.5);
plt.axvline(beta0, color='r')
plt.axvline(beta0_hat.mean(), color='blue', linestyle='--')
plt.title('beta0 estimate')
plt.subplot(1,3,2)
plt.hist(beta1_hat, bins=50, alpha=0.5);
plt.axvline(beta1, color='r')
plt.axvline(beta1_hat.mean(), color='blue', linestyle='--')
plt.title('beta1 estimate')
plt.subplot(1,3,3)
plt.hist(sigma2_hat, bins=50, alpha=0.5);
plt.axvline(sigma2_hat.mean(), color='blue', linestyle='--')
plt.hist(sigma2_hat_unbiased, bins=50, alpha=0.5);
plt.axvline(sigma2, color='r')
plt.axvline(sigma2_hat_unbiased.mean(), color='m', linestyle='--')
plt.title('sigma2 estimate')
print(np.mean(sigma2_hat), np.std(sigma2_hat))
print(np.mean(sigma2_hat_unbiased), np.std(sigma2_hat_unbiased))
print(sigma2)
17.726358503291888 8.794352684562693
22.157948129114857 10.992940855703367
22.2
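The loop over `nsamps` can also be fully vectorized with broadcasting, which makes the bias in the MLE variance estimate easy to see in one shot. This is an optional numpy-only rewrite (fresh seed, so the printed means will differ slightly from the loop's output above):

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
beta0, beta1, sigma2 = 0.5, 28.3, 22.2
n, nsamps = 10, 10000
x = np.arange(1, n + 1)
y = beta0 + beta1 * x + rng.standard_normal((nsamps, n)) * np.sqrt(sigma2)

# Vectorized OLS across all nsamps simulations at once
xc = x - x.mean()
b1 = (y - y.mean(axis=1, keepdims=True)) @ xc / np.sum(xc ** 2)  # (nsamps,)
b0 = y.mean(axis=1) - b1 * x.mean()
resid = y - b0[:, None] - b1[:, None] * x
s2_mle = np.sum(resid ** 2, axis=1) / n        # biased MLE of sigma^2
s2_unbiased = s2_mle * n / (n - 2)             # bias-corrected version

# The MLE underestimates sigma^2 by a factor (n-2)/n = 0.8 here
print(s2_mle.mean(), s2_unbiased.mean())
```

As with the loop, the average MLE estimate sits near 0.8 * 22.2 = 17.76, while the corrected estimator's average is near the true 22.2.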
# If you finish early -- try loading another dataset like the DJIA dataset and running the same analysis. What do you notice?