import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
We fit AR($p$) models by OLS of $y$ on $X$, where $y$ is the vector consisting of $y_{p+1}, \dots, y_n$ and $X$ is the design matrix whose rows are $(1, y_{t-1}, \dots, y_{t-p})$ for $t = p+1, \dots, n$. This gives rise to estimates $\hat{\phi}_0, \hat{\phi}_1, \dots, \hat{\phi}_p$ which are also known as Conditional MLEs or Conditional Least Squares Estimates (conditional here refers to conditioning on the first $p$ observations $y_1, \dots, y_p$).
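Throughout, the AR($p$) model refers to the recursion

$$y_t = \phi_0 + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \epsilon_t,$$

where the errors $\epsilon_t$ are i.i.d. $N(0, \sigma^2)$.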
Dataset One: California Population
#California Population Dataset from FRED
capop = pd.read_csv('CAPOP_20March2025.csv')
print(capop.head())
y = capop['CAPOP'].to_numpy()
n = len(y)
print(n)
plt.figure(figsize = (12, 6))
plt.plot(y)
plt.xlabel('Year')
plt.ylabel('Thousands')
plt.title('Annual CA population')
plt.show()
observation_date CAPOP
0 1900-01-01 1490.0
1 1901-01-01 1550.0
2 1902-01-01 1623.0
3 1903-01-01 1702.0
4 1904-01-01 1792.0
125

We can fit AR($p$) models to this data using the following code.
p = 1
n = len(y)
yreg = y[p:] #these are the response values in the autoregression
Xmat = np.ones((n-p, 1)) #this will be the design matrix (X) in the autoregression
for j in range(1, p+1):
    col = y[p-j : n-j].reshape(-1, 1) #lag-j values aligned with the response
    Xmat = np.column_stack([Xmat, col])
armod = sm.OLS(yreg, Xmat).fit()
print(armod.params)
print(armod.summary())
sighat = np.sqrt(np.mean(armod.resid ** 2)) #estimate of the innovation standard deviation
print(sighat)
[236.66072866 1.0038257 ]
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 5.574e+05
Date: Sat, 22 Mar 2025 Prob (F-statistic): 4.02e-225
Time: 14:56:55 Log-Likelihood: -828.87
No. Observations: 124 AIC: 1662.
Df Residuals: 122 BIC: 1667.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 236.6607 30.009 7.886 0.000 177.255 296.066
x1 1.0038 0.001 746.589 0.000 1.001 1.006
==============================================================================
Omnibus: 5.372 Durbin-Watson: 0.342
Prob(Omnibus): 0.068 Jarque-Bera (JB): 8.010
Skew: 0.028 Prob(JB): 0.0182
Kurtosis: 4.244 Cond. No. 3.82e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.82e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
193.54270158976897
Instead of creating $y$ and $X$ ourselves and then using OLS, we can fit the AR($p$) model directly using the AutoReg function in statsmodels.
from statsmodels.tsa.ar_model import AutoReg
armod_sm = AutoReg(y, lags = p, trend = 'c').fit()
print(armod_sm.summary())
AutoReg Model Results
==============================================================================
Dep. Variable: y No. Observations: 125
Model: AutoReg(1) Log Likelihood -828.870
Method: Conditional MLE S.D. of innovations 193.543
Date: Sat, 22 Mar 2025 AIC 1663.740
Time: 14:56:56 BIC 1672.201
Sample: 1 HQIC 1667.177
125
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 236.6607 29.766 7.951 0.000 178.321 295.001
y.L1 1.0038 0.001 752.683 0.000 1.001 1.006
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 0.9962 +0.0000j 0.9962 0.0000
-----------------------------------------------------------------------------
Note the “Method: Conditional MLE” line in the above output. The estimates of $\phi_0$, $\phi_1$ and $\sigma$ are the same in both outputs above.
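We can verify this agreement numerically (a quick sanity check using the two fitted objects above):
print(np.allclose(armod.params, armod_sm.params)) #expected: True
print(sighat, np.sqrt(armod_sm.sigma2)) #both should match the "S.D. of innovations" in the summary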
The fitted model leads to predictions for future observations via the recursion $\hat{y}_{n+i} = \hat{\phi}_0 + \sum_{j=1}^{p} \hat{\phi}_j \hat{y}_{n+i-j}$ (with $\hat{y}_t = y_t$ for $t \leq n$). These can be calculated as follows.
#Generate k-step ahead forecasts:
k = 100 #increase k to see exponential behavior in AR(1) predictions
yhat = np.concatenate([y, np.full(k, -9999)]) #extend data by k placeholder values
phi_vals = armod.params
for i in range(1, k+1):
    ans = phi_vals[0]
    for j in range(1, p+1):
        ans += phi_vals[j] * yhat[n+i-j-1]
    yhat[n+i-1] = ans
predvalues = yhat[n:]
#Plotting the series with forecasts:
plt.figure(figsize=(12, 6))
time_all = np.arange(1, n + k + 1)
plt.plot(time_all, yhat, color='C0')
plt.plot(range(1, n + 1), y, label='Original Data', color='C1')
plt.plot(range(n + 1, n + k + 1), predvalues, label='Forecasts', color='blue')
plt.axvline(x=n, color='black', linestyle='--', label='Forecast Start')
#plt.axhline(y=np.mean(y), color='gray', linestyle=':', label='Mean of Original Data')
plt.xlabel('Time')
plt.ylabel('Data')
plt.title('Time Series + AR(p) Forecasts')
plt.legend()
plt.show()

In this AR(1) model, the fitted estimate $\hat{\phi}_1$ is slightly more than 1, so the predictions grow exponentially. If you increase $k$, this exponential behaviour will become visible (when $k$ is small, the predictions look approximately linear).
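For $p = 1$, the forecast recursion has a closed form: $\hat{y}_{n+k} = \hat{\phi}_0 \frac{\hat{\phi}_1^k - 1}{\hat{\phi}_1 - 1} + \hat{\phi}_1^k \, y_n$ (valid when $\hat{\phi}_1 \neq 1$), which makes the exponential growth explicit. The following sketch checks this against the loop-based forecasts (it assumes p = 1 and reuses armod and predvalues from above):
#Closed-form AR(1) forecasts should match the iterative ones
phi0_hat, phi1_hat = armod.params
ks = np.arange(1, k + 1)
closed_form = phi0_hat * (phi1_hat**ks - 1) / (phi1_hat - 1) + phi1_hat**ks * y[-1]
print(np.allclose(closed_form, predvalues)) #expected: True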
The same predictions are obtained from the AutoReg object using the predict function.
predvalues_sm = armod_sm.predict(start = n, end = n+k-1)
yhat = np.concatenate([y, predvalues_sm])
Check below that both predictions lead to the same values.
plt.figure(figsize=(12, 6))
time_all = np.arange(1, n + k + 1)
plt.plot(time_all, yhat, color='C0', label="Extended Series with Forecasts")
plt.plot(range(1, n + 1), y, label='Original Data', color='C1')
plt.plot(range(n + 1, n + k + 1), predvalues_sm, label='Forecasts (Statsmodels AutoReg)', color='blue')
plt.plot(range(n + 1, n + k + 1), predvalues, label='Forecasts (OLS)', color='green')
plt.axvline(x=n, color='black', linestyle='--', label='Forecast Start')
plt.xlabel('Time')
plt.ylabel('Data')
plt.title('Time Series + AR(p) Forecasts')
plt.legend()
plt.show()

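Beyond the visual overlay, the two forecast arrays can be compared directly (a quick numerical check):
#The hand-rolled OLS forecasts and AutoReg's predict output should coincide
print(np.allclose(predvalues, predvalues_sm)) #expected: True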
When $p$ changes, the predictions change appreciably. Run the code below for various values of $p$ to see how the predictions change.
p = 1 #large p e.g., p = 25 leads to predictions that decrease into the future.
armod_sm = AutoReg(y, lags = p, trend = 'c').fit()
print(armod_sm.summary())
predvalues_sm = armod_sm.predict(start = n, end = n+k-1)
yhat = np.concatenate([y, predvalues_sm])
plt.figure(figsize=(12, 6))
time_all = np.arange(1, n + k + 1)
plt.plot(time_all, yhat, color='C0', label="Extended Series with Forecasts")
plt.plot(range(1, n + 1), y, label='Original Data', color='C1')
plt.plot(range(n + 1, n + k + 1), predvalues_sm, label='Forecasts (Statsmodels AutoReg)', color='blue')
plt.axvline(x=n, color='black', linestyle='--', label='Forecast Start')
plt.xlabel('Time')
plt.ylabel('Data')
plt.title('Time Series + AR(p) Forecasts')
plt.legend()
plt.show()
AutoReg Model Results
==============================================================================
Dep. Variable: y No. Observations: 125
Model: AutoReg(25) Log Likelihood -601.595
Method: Conditional MLE S.D. of innovations 99.188
Date: Sat, 22 Mar 2025 AIC 1257.190
Time: 15:01:01 BIC 1327.530
Sample: 25 HQIC 1285.658
125
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 48.5749 31.228 1.555 0.120 -12.632 109.782
y.L1 1.7693 0.100 17.607 0.000 1.572 1.966
y.L2 -0.5359 0.204 -2.632 0.008 -0.935 -0.137
y.L3 -0.3854 0.207 -1.860 0.063 -0.792 0.021
y.L4 0.0123 0.207 0.060 0.952 -0.393 0.417
y.L5 0.0966 0.218 0.442 0.658 -0.331 0.525
y.L6 0.0967 0.221 0.437 0.662 -0.337 0.530
y.L7 -0.0334 0.221 -0.151 0.880 -0.466 0.399
y.L8 0.1527 0.220 0.695 0.487 -0.278 0.583
y.L9 -0.3617 0.220 -1.643 0.100 -0.793 0.070
y.L10 0.3859 0.223 1.734 0.083 -0.050 0.822
y.L11 -0.2041 0.226 -0.903 0.366 -0.647 0.239
y.L12 -0.2265 0.226 -1.000 0.317 -0.670 0.217
y.L13 0.3839 0.224 1.715 0.086 -0.055 0.823
y.L14 -0.0488 0.226 -0.216 0.829 -0.492 0.394
y.L15 -0.0288 0.225 -0.128 0.898 -0.470 0.412
y.L16 -0.2512 0.222 -1.134 0.257 -0.686 0.183
y.L17 0.0877 0.221 0.396 0.692 -0.346 0.521
y.L18 0.1523 0.222 0.687 0.492 -0.282 0.587
y.L19 0.1106 0.222 0.498 0.619 -0.325 0.546
y.L20 -0.0222 0.221 -0.100 0.920 -0.456 0.412
y.L21 -0.7083 0.221 -3.201 0.001 -1.142 -0.275
y.L22 0.9305 0.233 3.997 0.000 0.474 1.387
y.L23 -0.4332 0.251 -1.726 0.084 -0.925 0.059
y.L24 0.1580 0.243 0.650 0.516 -0.318 0.634
y.L25 -0.0995 0.119 -0.838 0.402 -0.332 0.133
Roots
==============================================================================
Real Imaginary Modulus Frequency
------------------------------------------------------------------------------
AR.1 -1.0075 -0.0000j 1.0075 -0.5000
AR.2 -0.9809 -0.2737j 1.0183 -0.4567
AR.3 -0.9809 +0.2737j 1.0183 0.4567
AR.4 -0.8328 -0.5853j 1.0179 -0.4025
AR.5 -0.8328 +0.5853j 1.0179 0.4025
AR.6 -0.6709 -0.7787j 1.0278 -0.3632
AR.7 -0.6709 +0.7787j 1.0278 0.3632
AR.8 -0.3740 -0.9839j 1.0526 -0.3078
AR.9 -0.3740 +0.9839j 1.0526 0.3078
AR.10 -0.1016 -1.0310j 1.0360 -0.2656
AR.11 -0.1016 +1.0310j 1.0360 0.2656
AR.12 0.2724 -0.9824j 1.0195 -0.2069
AR.13 0.2724 +0.9824j 1.0195 0.2069
AR.14 0.6190 -0.8776j 1.0739 -0.1522
AR.15 0.6190 +0.8776j 1.0739 0.1522
AR.16 0.7994 -0.6917j 1.0571 -0.1135
AR.17 0.7994 +0.6917j 1.0571 0.1135
AR.18 0.9170 -0.4919j 1.0406 -0.0784
AR.19 0.9170 +0.4919j 1.0406 0.0784
AR.20 1.0284 -0.2495j 1.0582 -0.0379
AR.21 1.0284 +0.2495j 1.0582 0.0379
AR.22 0.9984 -0.0240j 0.9987 -0.0038
AR.23 0.9984 +0.0240j 0.9987 0.0038
AR.24 -0.3772 -2.1016j 2.1352 -0.2783
AR.25 -0.3772 +2.1016j 2.1352 0.2783
------------------------------------------------------------------------------

When $p$ is large (e.g., $p = 25$), check that the predictions decrease in time. Given that the dataset here is small ($n = 125$), the AR(25) model overfits and its predictions are not reliable.
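One way to guard against such overfitting is to pick $p$ by an information criterion. statsmodels provides ar_select_order for this; a minimal sketch (the selected lags will depend on the criterion chosen):
from statsmodels.tsa.ar_model import ar_select_order
#Search over AR orders up to 25 and pick the one minimizing BIC
sel = ar_select_order(y, maxlag = 25, trend = 'c', ic = 'bic')
print(sel.ar_lags) #lags chosen by BIC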
Dataset Two: Simulated Dataset
AR(1) predictions behave quite differently in the three regimes $\phi_1 < 1$, $\phi_1 = 1$ and $\phi_1 > 1$. To see this, consider the simulated dataset below.
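Before focusing on the $\phi_1 = 1$ case below, here is a quick side-by-side look at the three regimes (a sketch; the values 0.9, 1 and 1.05 and the other settings are arbitrary choices for illustration):
#Simulate short AR(1) paths in the three regimes and plot them side by side
rng_demo = np.random.default_rng(0)
fig, axes = plt.subplots(1, 3, figsize = (12, 4))
for ax, phi_1 in zip(axes, [0.9, 1.0, 1.05]):
    z = np.zeros(200)
    for i in range(1, 200):
        z[i] = 0.1 + phi_1 * z[i-1] + rng_demo.normal(scale = 0.5)
    ax.plot(z)
    ax.set_title(f'$\\phi_1 = {phi_1}$')
plt.show()
Stationary paths ($\phi_1 < 1$) fluctuate around a fixed mean, the random walk ($\phi_1 = 1$) wanders with a drift, and the explosive case ($\phi_1 > 1$) blows up.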
seed = 43
rng = np.random.default_rng(seed)
n = 400
sig = 0.5
ysim = np.zeros(n) #first two values stay at zero
for i in range(2, n):
    err = rng.normal(loc = 0, scale = sig, size = 1)
    phi_0 = 0.1
    phi_1 = 1 #simulate with phi_1 = 1 (a random walk with drift phi_0)
    ysim[i] = phi_0 + phi_1 * ysim[i-1] + err[0]
plt.figure(figsize = (12, 6))
plt.plot(ysim)
plt.show()

p = 1
armod_sm = AutoReg(ysim, lags = p, trend = 'c').fit()
print(armod_sm.summary())
k = 1000
predvalues_sm = armod_sm.predict(start = n, end = n+k-1)
yhat = np.concatenate([ysim, predvalues_sm])
plt.figure(figsize=(12, 6))
time_all = np.arange(1, n + k + 1)
plt.plot(time_all, yhat, color='C0', label="Extended Series with Forecasts")
plt.plot(range(1, n + 1), ysim, label='Original Data', color='C1')
plt.plot(range(n + 1, n + k + 1), predvalues_sm, label='Forecasts (Statsmodels AutoReg)', color='blue')
plt.axvline(x=n, color='black', linestyle='--', label='Forecast Start')
plt.xlabel('Time')
plt.ylabel('Data')
plt.title('Time Series + AR(p) Forecasts')
plt.legend()
plt.show()
AutoReg Model Results
==============================================================================
Dep. Variable: y No. Observations: 400
Model: AutoReg(1) Log Likelihood -283.472
Method: Conditional MLE S.D. of innovations 0.492
Date: Sat, 22 Mar 2025 AIC 572.945
Time: 15:05:19 BIC 584.911
Sample: 1 HQIC 577.684
400
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0705 0.037 1.891 0.059 -0.003 0.144
y.L1 1.0021 0.002 555.249 0.000 0.999 1.006
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 0.9979 +0.0000j 0.9979 0.0000
-----------------------------------------------------------------------------

The estimated $\hat{\phi}_1$ is more than 1 and the predictions are exponential. If we forcibly set the $\phi_1$ parameter to 1, the predictions become linear: the recursion then gives $\hat{y}_{n+k} = y_n + k \hat{\phi}_0$. This is done below.
k = 1000
yhat = np.concatenate([ysim, np.full(k, -9999)]) #extend data by k placeholder values
phi_vals = np.array([armod_sm.params[0], 1]) #\phi_1 parameter is set to 1
for i in range(1, k+1):
    ans = phi_vals[0]
    for j in range(1, p+1):
        ans += phi_vals[j] * yhat[n+i-j-1]
    yhat[n+i-1] = ans
predvalues = yhat[n:]
#Plotting the series with forecasts:
plt.figure(figsize=(12, 6))
time_all = np.arange(1, n + k + 1)
plt.plot(time_all, yhat, color='C0')
plt.plot(range(1, n + 1), ysim, label='Original Data', color='C1')
plt.plot(range(n + 1, n + k + 1), predvalues, label='Forecasts', color='blue')
plt.axvline(x=n, color='black', linestyle='--', label='Forecast Start')
#plt.axhline(y=np.mean(y), color='gray', linestyle=':', label='Mean of Original Data')
plt.xlabel('Time')
plt.ylabel('Data')
plt.title('Time Series + AR(p) Forecasts')
plt.legend()
plt.show()

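A quick check that these forecasts are exactly linear (successive differences all equal $\hat{\phi}_0$):
#With phi_1 fixed at 1, consecutive forecasts differ by exactly phi_0
print(np.allclose(np.diff(predvalues), phi_vals[0])) #expected: True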
Now, for the following simulated dataset, the estimated $\hat{\phi}_1$ is slightly smaller than 1. This leads to predictions that decay geometrically to a constant.
seed = 123
rng = np.random.default_rng(seed)
n = 400
sig = 0.5
ysim = np.zeros(n)
for i in range(2, n):
    err = rng.normal(loc = 0, scale = sig, size = 1)
    phi_0 = 0.1
    phi_1 = 1
    ysim[i] = phi_0 + phi_1 * ysim[i-1] + err[0]
plt.figure(figsize = (12, 6))
plt.plot(ysim)
plt.show()

p = 1
armod_sm = AutoReg(ysim, lags = p, trend = 'c').fit()
print(armod_sm.summary())
k = 1000
predvalues_sm = armod_sm.predict(start = n, end = n+k-1)
yhat = np.concatenate([ysim, predvalues_sm])
plt.figure(figsize=(12, 6))
time_all = np.arange(1, n + k + 1)
plt.plot(time_all, yhat, color='C0', label="Extended Series with Forecasts")
plt.plot(range(1, n + 1), ysim, label='Original Data', color='C1')
plt.plot(range(n + 1, n + k + 1), predvalues_sm, label='Forecasts (Statsmodels AutoReg)', color='blue')
plt.axvline(x=n, color='black', linestyle='--', label='Forecast Start')
plt.xlabel('Time')
plt.ylabel('Data')
plt.title('Time Series + AR(p) Forecasts')
plt.legend()
plt.show()
AutoReg Model Results
==============================================================================
Dep. Variable: y No. Observations: 400
Model: AutoReg(1) Log Likelihood -289.700
Method: Conditional MLE S.D. of innovations 0.500
Date: Sat, 22 Mar 2025 AIC 585.400
Time: 15:06:45 BIC 597.367
Sample: 1 HQIC 590.140
400
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 0.1857 0.055 3.391 0.001 0.078 0.293
y.L1 0.9974 0.002 538.039 0.000 0.994 1.001
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 1.0026 +0.0000j 1.0026 0.0000
-----------------------------------------------------------------------------

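The constant to which the forecasts decay is the fitted stationary mean $\hat{\phi}_0/(1 - \hat{\phi}_1)$. A quick computation from the fitted parameters (reusing armod_sm and predvalues_sm from above):
#Long-run limit of AR(1) forecasts when phi_1 < 1
phi0_hat, phi1_hat = armod_sm.params
print(phi0_hat / (1 - phi1_hat)) #the fitted stationary mean
print(predvalues_sm[-1]) #the far-ahead forecasts approach this limit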
Again, if we force $\phi_1$ to equal 1, the predictions will be linear.
k = 1000
yhat = np.concatenate([ysim, np.full(k, -9999)]) #extend data by k placeholder values
phi_vals = np.array([armod_sm.params[0], 1]) #\phi_1 parameter is set to 1
for i in range(1, k+1):
    ans = phi_vals[0]
    for j in range(1, p+1):
        ans += phi_vals[j] * yhat[n+i-j-1]
    yhat[n+i-1] = ans
predvalues = yhat[n:]
#Plotting the series with forecasts:
plt.figure(figsize=(12, 6))
time_all = np.arange(1, n + k + 1)
plt.plot(time_all, yhat, color='C0')
plt.plot(range(1, n + 1), ysim, label='Original Data', color='C1')
plt.plot(range(n + 1, n + k + 1), predvalues, label='Forecasts', color='blue')
plt.axvline(x=n, color='black', linestyle='--', label='Forecast Start')
#plt.axhline(y=np.mean(y), color='gray', linestyle=':', label='Mean of Original Data')
plt.xlabel('Time')
plt.ylabel('Data')
plt.title('Time Series + AR(p) Forecasts')
plt.legend()
plt.show()

Dataset Three: Sunspots
Predictions for AR(2) have a much richer structure than those for AR(1). In some examples (such as the sunspots dataset), AR(2) results in predictions that oscillate (a behaviour never seen in AR(1) predictions).
sunspots = pd.read_csv('SN_y_tot_V2.0.csv', header = None, sep = ';')
y = sunspots.iloc[:,1].values
n = len(y)
plt.figure(figsize = (12, 6))
plt.plot(y)
plt.show()

Let us fit an AR(2) model to this data.
armod_sm = AutoReg(y, lags = 2, trend = 'c').fit()
print(armod_sm.summary())
AutoReg Model Results
==============================================================================
Dep. Variable: y No. Observations: 325
Model: AutoReg(2) Log Likelihood -1505.524
Method: Conditional MLE S.D. of innovations 25.588
Date: Sat, 22 Mar 2025 AIC 3019.048
Time: 15:10:14 BIC 3034.159
Sample: 2 HQIC 3025.080
325
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 24.4561 2.372 10.308 0.000 19.806 29.106
y.L1 1.3880 0.040 34.685 0.000 1.310 1.466
y.L2 -0.6965 0.040 -17.423 0.000 -0.775 -0.618
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 0.9965 -0.6655j 1.1983 -0.0937
AR.2 0.9965 +0.6655j 1.1983 0.0937
-----------------------------------------------------------------------------
Predictions generated by the fitted AR(2) are plotted below.
p = 2
armod_sm = AutoReg(y, lags = p, trend = 'c').fit()
print(armod_sm.summary())
k = 100
predvalues_sm = armod_sm.predict(start = n, end = n+k-1)
yhat = np.concatenate([y, predvalues_sm])
plt.figure(figsize=(12, 6))
time_all = np.arange(1, n + k + 1)
plt.plot(time_all, yhat, color='C0', label="Extended Series with Forecasts")
plt.plot(range(1, n + 1), y, label='Original Data', color='C1')
plt.plot(range(n + 1, n + k + 1), predvalues_sm, label='Forecasts (Statsmodels AutoReg)', color='blue')
plt.axvline(x=n, color='black', linestyle='--', label='Forecast Start')
plt.xlabel('Time')
plt.ylabel('Data')
plt.title('Time Series + AR(p) Forecasts')
plt.legend()
plt.show()
AutoReg Model Results
==============================================================================
Dep. Variable: y No. Observations: 325
Model: AutoReg(2) Log Likelihood -1505.524
Method: Conditional MLE S.D. of innovations 25.588
Date: Sat, 22 Mar 2025 AIC 3019.048
Time: 15:10:15 BIC 3034.159
Sample: 2 HQIC 3025.080
325
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 24.4561 2.372 10.308 0.000 19.806 29.106
y.L1 1.3880 0.040 34.685 0.000 1.310 1.466
y.L2 -0.6965 0.040 -17.423 0.000 -0.775 -0.618
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 0.9965 -0.6655j 1.1983 -0.0937
AR.2 0.9965 +0.6655j 1.1983 0.0937
-----------------------------------------------------------------------------

The oscillating nature of the predictions is clearly visible above. We shall see formulae for AR(2) predictions (next lecture) that explain this oscillating behaviour.
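The oscillation is driven by the complex roots of the AR polynomial $1 - \phi_1 z - \phi_2 z^2$ reported in the Roots table: a complex-conjugate pair with frequency 0.0937 cycles per year. We can recover the implied period directly from the fitted coefficients (a quick sketch reusing armod_sm):
#Roots of 1 - phi_1 z - phi_2 z^2; np.roots expects the highest-degree coefficient first
phi = armod_sm.params #[const, phi_1, phi_2]
roots = np.roots([-phi[2], -phi[1], 1.0])
freq = np.abs(np.angle(roots[0])) / (2 * np.pi) #cycles per year
print(1 / freq) #oscillation period, roughly 10.7 years (close to the solar cycle)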