import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
We will fit autoregressive (AR) models to some time series datasets, and obtain predictions for future values of the time series.
Dataset One: Liquor Sales Data from FRED
#The following is FRED data on retail sales (in millions of dollars) for beer, wine and liquor stores (https://fred.stlouisfed.org/series/MRTSSM4453USN)
beersales = pd.read_csv('MRTSSM4453USN_October2025.csv')
print(beersales.head())
y = beersales['MRTSSM4453USN'].to_numpy()
plt.figure(figsize = (12, 6))
plt.plot(y)
plt.xlabel('Year')
plt.ylabel('Millions of Dollars')
plt.title('Retail Sales: Beer, wine and liquor stores')
plt.show()
  observation_date  MRTSSM4453USN
0 1992-01-01 1414
1 1992-02-01 1444
2 1992-03-01 1496
3 1992-04-01 1569
4 1992-05-01 1707

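The x-axis above is the observation index rather than calendar time. A minimal variant that plots against actual dates (assuming the observation_date column parses cleanly with pd.to_datetime):
#Sketch: plot the series against parsed dates instead of the row index
dates = pd.to_datetime(beersales['observation_date'])
plt.figure(figsize = (12, 6))
plt.plot(dates, y)
plt.xlabel('Year')
plt.ylabel('Millions of Dollars')
plt.title('Retail Sales: Beer, wine and liquor stores')
plt.show()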
Autoregressive (AR) models are a simple way of obtaining forecasts (predictions) of future values of a time series.
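Concretely, an AR($p$) model regresses each value of the series on its $p$ most recent values:
$$y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \dots + \beta_p y_{t-p} + \epsilon_t,$$
where the errors $\epsilon_t$ are i.i.d. noise, so fitting it by least squares is ordinary linear regression.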
p = 12 #this is the order of the AR model
n = len(y)
print(n)
print(y)
403
[1414 1444 1496 1569 1707 1663 1792 1744 1658 1763 1715 2354 1512 1433
1573 1606 1683 1679 1828 1665 1600 1647 1670 2287 1450 1410 1607 1648
1715 1741 1858 1742 1710 1693 1731 2416 1459 1407 1619 1594 1696 1747
1769 1745 1731 1678 1762 2437 1575 1549 1723 1687 1836 1837 1908 1936
1672 1748 1871 2395 1605 1526 1745 1712 1940 1886 1976 1960 1783 1932
1962 2657 1740 1665 1783 1848 2015 1938 2079 1993 1910 2033 2026 2857
1755 1700 1867 1974 2078 2017 2200 2013 2000 2088 2140 3076 1810 1863
2066 2011 2225 2278 2326 2288 2228 2223 2409 3192 1995 1941 2201 2110
2337 2389 2349 2382 2191 2279 2495 3349 1999 1994 2236 2174 2429 2349
2427 2451 2170 2267 2456 3294 2003 1931 2145 2202 2431 2297 2466 2511
2323 2495 2514 3456 2169 2101 2270 2418 2543 2505 2716 2485 2481 2586
2619 3604 2104 2194 2398 2501 2568 2637 2771 2645 2639 2656 2782 3925
2303 2394 2616 2653 2870 2906 2966 2878 2852 2773 2987 4067 2445 2478
2862 2730 3107 3203 3163 3136 2911 2942 3142 4195 2551 2677 2852 2861
3264 3132 3359 3316 3010 3201 3235 4260 2782 2657 2888 2990 3312 3160
3397 3249 3119 3274 3209 4422 2754 2790 3076 3168 3319 3292 3503 3307
3222 3349 3374 4484 2765 2858 3111 3221 3325 3402 3584 3428 3373 3372
3499 4696 2877 3077 3413 3278 3640 3668 3589 3606 3365 3482 3714 4917
3021 3088 3465 3334 3771 3636 3848 3908 3459 3605 3863 4961 3213 3143
3466 3511 3947 3822 4014 3988 3601 3857 3865 5245 3359 3273 3625 3631
4070 3957 4252 3995 3818 4004 3995 5444 3376 3528 3786 3844 4114 4179
4436 4182 4136 4138 4359 5645 3474 3503 3957 3960 4328 4353 4466 4324
4221 4218 4506 5801 3684 3686 4314 4028 4574 4611 4632 4569 4296 4432
4797 5969 3837 3818 4331 4252 4741 4647 4876 4902 4349 4569 4933 6163
4066 4184 5081 4757 5737 5572 5905 5634 5410 5590 5631 7142 4960 4831
5566 5560 5940 5877 6044 5601 5468 5647 5893 7474 4812 4920 5441 5544
5727 5833 6003 5755 5621 5744 6017 7651 4983 4972 5535 5527 6067 6040
6096 5834 5783 5736 6166 7863 4907 5248 5765 5539 6264 6155 6256 6203
5658 5987 6453 7854 5055 5037 5613 5626 6310 5950 6225]
AR models work just like regression. The response vector and design matrix in the regression are set up as follows.
yreg = y[p:] #these are the response values in the autoregression
Xmat = np.ones((n-p, 1)) #this will be the design matrix (X) in the autoregression
for j in range(1, p+1):
    col = y[p-j : n-j]
    Xmat = np.column_stack([Xmat, col])
print(Xmat.shape)
print(Xmat)
(391, 13)
[[1.000e+00 2.354e+03 1.715e+03 ... 1.496e+03 1.444e+03 1.414e+03]
[1.000e+00 1.512e+03 2.354e+03 ... 1.569e+03 1.496e+03 1.444e+03]
[1.000e+00 1.433e+03 1.512e+03 ... 1.707e+03 1.569e+03 1.496e+03]
...
[1.000e+00 5.626e+03 5.613e+03 ... 6.256e+03 6.155e+03 6.264e+03]
[1.000e+00 6.310e+03 5.626e+03 ... 6.203e+03 6.256e+03 6.155e+03]
[1.000e+00 5.950e+03 6.310e+03 ... 5.658e+03 6.203e+03 6.256e+03]]
armod = sm.OLS(yreg, Xmat).fit()
print(armod.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.988
Model: OLS Adj. R-squared: 0.988
Method: Least Squares F-statistic: 2597.
Date: Thu, 23 Oct 2025 Prob (F-statistic): 0.00
Time: 16:14:31 Log-Likelihood: -2529.9
No. Observations: 391 AIC: 5086.
Df Residuals: 378 BIC: 5137.
Df Model: 12
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 6.7819 21.711 0.312 0.755 -35.907 49.471
x1 0.0439 0.018 2.448 0.015 0.009 0.079
x2 0.0428 0.018 2.379 0.018 0.007 0.078
x3 0.0521 0.018 2.913 0.004 0.017 0.087
x4 0.0336 0.018 1.870 0.062 -0.002 0.069
x5 0.0523 0.018 2.898 0.004 0.017 0.088
x6 0.0002 0.018 0.009 0.993 -0.036 0.036
x7 -0.0128 0.018 -0.697 0.486 -0.049 0.023
x8 -0.0304 0.019 -1.633 0.103 -0.067 0.006
x9 -0.0440 0.019 -2.368 0.018 -0.081 -0.007
x10 -0.0621 0.019 -3.352 0.001 -0.099 -0.026
x11 -0.0163 0.019 -0.875 0.382 -0.053 0.020
x12 0.9739 0.019 52.184 0.000 0.937 1.011
==============================================================================
Omnibus: 172.369 Durbin-Watson: 0.812
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1196.724
Skew: 1.726 Prob(JB): 1.36e-260
Kurtosis: 10.845 Cond. No. 3.36e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.36e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
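As a cross-check, the same coefficients can be recovered from statsmodels' built-in AR fitter; a minimal sketch, assuming a reasonably recent statsmodels version (AutoReg uses the same conditional least-squares setup, so its estimates should match the sm.OLS fit above):
#Sketch: the built-in AutoReg fitter should reproduce the OLS coefficients
from statsmodels.tsa.ar_model import AutoReg
armod_builtin = AutoReg(y, lags = p, trend = 'c').fit()
print(armod_builtin.params) #intercept followed by the lag-1 through lag-p coefficients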
arfitted = armod.fittedvalues
#Plotting the fitted values against response values
plt.xlabel('Fitted Values')
plt.ylabel('Response Values')
plt.title('AR(12) Model Fitted Values vs Response Values')
plt.plot(arfitted, yreg, 'o')
plt.show()
#Generate k-step ahead forecasts:
k = 100
yhat = np.concatenate([y.astype(float), np.full(k, np.nan)]) #extend data by k placeholders (float, so forecasts are not truncated to integers)
for i in range(1, k+1):
    ans = armod.params[0]
    for j in range(1, p+1):
        ans += armod.params[j] * yhat[n+i-j-1]
    yhat[n+i-1] = ans
predvalues = yhat[n:]
#Plotting the series with forecasts:
plt.figure(figsize=(12, 6))
time_all = np.arange(1, n + k + 1)
plt.plot(time_all, yhat, color='C0')
plt.plot(range(1, n + 1), y, label='Original Data', color='C1')
plt.plot(range(n + 1, n + k + 1), predvalues, label='Forecasts', color='blue')
plt.axvline(x=n, color='black', linestyle='--', label='Forecast Start')
#plt.axhline(y=np.mean(y), color='gray', linestyle=':', label='Mean of Original Data')
plt.xlabel('Time')
plt.ylabel('Data')
plt.title('Time Series + AR(p) Forecasts')
plt.legend()
plt.show()
The predictions depend crucially on the order $p$ of the model. If $p = 12$, we get reasonable predictions. However, for smaller values of $p$, the predictions look very unnatural. Go back and repeat the code above for other values of $p$.
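One rough way to compare orders (a sketch, not part of the analysis above; Pmax and the candidate orders are arbitrary choices) is to refit on a common sample and inspect AIC, where smaller is better:
#Sketch: compare AR orders by AIC on a common sample (all models predict y[Pmax:])
Pmax = 24
for p_try in [1, 2, 6, 12, 24]:
    Xp = np.ones((n - Pmax, 1))
    for j in range(1, p_try + 1):
        Xp = np.column_stack([Xp, y[Pmax - j : n - j]])
    fit = sm.OLS(y[Pmax:], Xp).fit()
    print(p_try, fit.aic)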
Dataset Two: House Price Data from FRED
#The following is FRED data on Average Sales Price of Houses Sold for the US (https://fred.stlouisfed.org/series/ASPUS)
hprice = pd.read_csv('ASPUS_October2025.csv')
print(hprice.head())
y = hprice['ASPUS'].to_numpy()
plt.figure(figsize = (12, 6))
plt.plot(y)
plt.xlabel('Year')
plt.ylabel('Dollars')
plt.title('Average Sales Price of Houses Sold in the US')
plt.show()
  observation_date  ASPUS
0 1963-01-01 19300
1 1963-04-01 19400
2 1963-07-01 19200
3 1963-10-01 19600
4 1964-01-01 19600

p = 30 #this is the order of the AR model
n = len(y)
print(n)
250
yreg = y[p:] #these are the response values in the autoregression
Xmat = np.ones((n-p, 1)) #this will be the design matrix (X) in the autoregression
for j in range(1, p+1):
    col = y[p-j : n-j].reshape(-1, 1)
    Xmat = np.column_stack([Xmat, col])
print(Xmat.shape)
print(y)
(220, 31)
[ 19300 19400 19200 19600 19600 20200 20500 20900 21500 21000
21600 21700 22700 23200 23000 22800 24000 24200 23900 24400
25400 26700 26600 27000 27600 28100 28100 27100 27000 27300
26000 26300 27300 28600 28300 28200 29000 29400 30300 31600
32800 35100 35900 36600 38000 38600 39000 39300 40900 42600
42200 44400 46000 47800 48100 50300 51600 54300 54000 57500
59300 61600 63500 66400 68300 72400 74200 72700 73600 74400
77500 80000 80900 84300 83800 83700 81200 85700 83900 84600
86700 89100 92500 90800 94700 99200 98500 97800 98500 100500
100500 103800 106300 112000 114400 115600 120800 126100 129900 133500
137900 134800 141500 140400 144300 146800 150200 151200 149500 151200
145500 150100 151100 148200 145400 144400 144500 145300 141700 147200
144700 148900 148000 148300 153600 154200 152800 156100 153500 158900
157700 160900 161100 166000 164000 171000 172200 177200 174700 175400
180000 178800 184300 181500 189100 191800 193000 204800 202900 202400
204100 212100 211000 211200 207800 214200 227600 227600 219100 232500
233100 241000 248100 256000 262900 265300 274000 286300 288500 287800
294600 294200 305300 302600 308100 299600 322100 310100 301200 305800
290400 304200 285100 276600 257000 273400 274100 272900 275300 268800
266000 278000 268100 267600 263000 259700 278000 282700 294500 297700
307400 320400 324400 334400 331400 340600 340400 369400 348000 339700
347400 366700 357000 357900 358800 364900 374800 376900 373200 399700
374600 378400 392900 384000 375500 376700 382700 384600 383000 371100
400600 396900 417400 428600 468000 496700 499300 525100 520300 521000
505300 503000 521900 498300 519700 502200 498700 510900 514200 512800]
armod = sm.OLS(yreg, Xmat).fit()
print(armod.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.998
Model: OLS Adj. R-squared: 0.997
Method: Least Squares F-statistic: 2885.
Date: Thu, 23 Oct 2025 Prob (F-statistic): 4.29e-235
Time: 16:22:56 Log-Likelihood: -2239.4
No. Observations: 220 AIC: 4541.
Df Residuals: 189 BIC: 4646.
Df Model: 30
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1252.3043 895.312 1.399 0.164 -513.784 3018.393
x1 0.7836 0.073 10.799 0.000 0.640 0.927
x2 0.5246 0.093 5.623 0.000 0.341 0.709
x3 -0.2076 0.100 -2.077 0.039 -0.405 -0.010
x4 0.0044 0.099 0.045 0.964 -0.191 0.200
x5 0.0292 0.099 0.296 0.768 -0.166 0.224
x6 -0.3315 0.098 -3.382 0.001 -0.525 -0.138
x7 -0.0284 0.101 -0.282 0.778 -0.227 0.170
x8 0.1678 0.102 1.646 0.101 -0.033 0.369
x9 0.1665 0.103 1.621 0.107 -0.036 0.369
x10 -0.2485 0.103 -2.413 0.017 -0.452 -0.045
x11 0.0576 0.106 0.546 0.586 -0.151 0.266
x12 0.3131 0.104 3.004 0.003 0.107 0.519
x13 -0.4220 0.108 -3.910 0.000 -0.635 -0.209
x14 -0.1454 0.110 -1.316 0.190 -0.363 0.073
x15 0.5566 0.116 4.807 0.000 0.328 0.785
x16 -0.0623 0.118 -0.527 0.599 -0.295 0.171
x17 -0.3085 0.123 -2.515 0.013 -0.550 -0.066
x18 0.1707 0.119 1.432 0.154 -0.064 0.406
x19 0.0860 0.118 0.728 0.468 -0.147 0.319
x20 -0.0764 0.118 -0.649 0.517 -0.309 0.156
x21 -0.2422 0.119 -2.043 0.042 -0.476 -0.008
x22 0.1030 0.119 0.867 0.387 -0.131 0.337
x23 0.0116 0.120 0.097 0.923 -0.225 0.248
x24 -0.2901 0.118 -2.468 0.014 -0.522 -0.058
x25 0.5416 0.117 4.641 0.000 0.311 0.772
x26 0.1503 0.123 1.219 0.224 -0.093 0.394
x27 -0.5970 0.123 -4.840 0.000 -0.840 -0.354
x28 0.2527 0.131 1.930 0.055 -0.006 0.511
x29 0.1942 0.121 1.610 0.109 -0.044 0.432
x30 -0.1401 0.103 -1.365 0.174 -0.343 0.062
==============================================================================
Omnibus: 33.340 Durbin-Watson: 1.992
Prob(Omnibus): 0.000 Jarque-Bera (JB): 114.033
Skew: 0.550 Prob(JB): 1.73e-25
Kurtosis: 6.351 Cond. No. 2.29e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.29e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
arfitted = armod.fittedvalues
#Plotting the fitted values against response values
plt.xlabel('Fitted Values')
plt.ylabel('Response Values')
plt.title('AR(30) Model Fitted Values vs Response Values')
#plotting with reduced size points for better visibility
plt.plot(arfitted, yreg, 'o', markersize=4)
plt.show()
#Generate k-step ahead forecasts:
k = 100
yhat = np.concatenate([y.astype(float), np.full(k, np.nan)]) #extend data by k placeholders (float, so forecasts are not truncated to integers)
for i in range(1, k+1):
    ans = armod.params[0]
    for j in range(1, p+1):
        ans += armod.params[j] * yhat[n+i-j-1]
    yhat[n+i-1] = ans
predvalues = yhat[n:]
#Plotting the series with forecasts:
time_all = np.arange(1, n + k + 1)
plt.plot(time_all, yhat, color='C0')
plt.plot(range(1, n + 1), y, label='Original Data', color='C1')
plt.plot(range(n + 1, n + k + 1), predvalues, label='Forecasts', color='blue')
plt.axvline(x=n, color='black', linestyle='--', label='Forecast Start')
#plt.axhline(y=np.mean(y), color='gray', linestyle=':', label='Mean of Original Data')
plt.xlabel('Time')
plt.ylabel('Data')
plt.title('Time Series + AR(p) Forecasts')
plt.legend()
plt.show()
Repeat the above code for different values of $p$ to see how the predictions change.
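A hypothetical holdout check makes this concrete (the holdout length h below is an arbitrary choice): fit the AR model on all but the last h quarters, forecast the held-out stretch with the same recursion, and compute the error.
#Sketch: holdout check of forecast accuracy for the current order p
h = 20
m = n - h
ytrain = y[:m].astype(float)
Xtr = np.ones((m - p, 1))
for j in range(1, p + 1):
    Xtr = np.column_stack([Xtr, ytrain[p - j : m - j]])
fit = sm.OLS(ytrain[p:], Xtr).fit()
yext = np.concatenate([ytrain, np.full(h, np.nan)])
for i in range(h):
    lags = yext[m + i - 1 : m + i - p - 1 : -1] #most recent p values, newest first
    yext[m + i] = fit.params[0] + np.dot(fit.params[1:], lags)
print(np.sqrt(np.mean((yext[m:] - y[m:]) ** 2))) #RMSE on the held-out quarters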
We will now apply AR models to the logarithms of $y$.
ylog = np.log(y)
plt.plot(ylog)
plt.xlabel('Year')
plt.ylabel('Log Dollars')
plt.title('Log of Average Sales Price of Houses Sold in the US')
plt.show()
p = 30 #this is the order of the AR model
n = len(ylog)
print(n)
250
yreg = ylog[p:] #these are the response values in the autoregression
Xmat = np.ones((n-p, 1)) #this will be the design matrix (X) in the autoregression
for j in range(1, p+1):
    col = ylog[p-j : n-j].reshape(-1, 1)
    Xmat = np.column_stack([Xmat, col])
armod_log = sm.OLS(yreg, Xmat).fit()
print(armod_log.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.999
Model: OLS Adj. R-squared: 0.999
Method: Least Squares F-statistic: 7003.
Date: Thu, 23 Oct 2025 Prob (F-statistic): 1.96e-271
Time: 16:22:59 Log-Likelihood: 509.83
No. Observations: 220 AIC: -957.7
Df Residuals: 189 BIC: -852.5
Df Model: 30
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.1013 0.033 3.049 0.003 0.036 0.167
x1 0.8024 0.073 11.057 0.000 0.659 0.946
x2 0.4279 0.092 4.628 0.000 0.246 0.610
x3 -0.1316 0.097 -1.356 0.177 -0.323 0.060
x4 0.0597 0.095 0.627 0.531 -0.128 0.247
x5 -0.0348 0.095 -0.366 0.715 -0.222 0.153
x6 -0.2380 0.094 -2.544 0.012 -0.423 -0.053
x7 0.0186 0.095 0.195 0.845 -0.169 0.207
x8 0.0834 0.095 0.874 0.383 -0.105 0.272
x9 0.0662 0.095 0.694 0.488 -0.122 0.254
x10 -0.1635 0.095 -1.727 0.086 -0.350 0.023
x11 0.0860 0.095 0.905 0.367 -0.101 0.273
x12 0.2016 0.094 2.135 0.034 0.015 0.388
x13 -0.2995 0.096 -3.128 0.002 -0.488 -0.111
x14 -0.0542 0.097 -0.558 0.577 -0.246 0.137
x15 0.3066 0.097 3.161 0.002 0.115 0.498
x16 -0.1182 0.097 -1.214 0.226 -0.310 0.074
x17 -0.1742 0.098 -1.781 0.077 -0.367 0.019
x18 0.0710 0.097 0.735 0.463 -0.119 0.261
x19 0.1175 0.096 1.227 0.221 -0.071 0.306
x20 0.1415 0.096 1.473 0.143 -0.048 0.331
x21 -0.2030 0.096 -2.105 0.037 -0.393 -0.013
x22 0.0006 0.097 0.007 0.995 -0.191 0.192
x23 -0.0754 0.097 -0.780 0.437 -0.266 0.115
x24 -0.1470 0.097 -1.517 0.131 -0.338 0.044
x25 0.3098 0.096 3.228 0.001 0.120 0.499
x26 0.1047 0.098 1.065 0.288 -0.089 0.299
x27 -0.3197 0.098 -3.256 0.001 -0.513 -0.126
x28 0.1514 0.101 1.503 0.135 -0.047 0.350
x29 0.0733 0.097 0.756 0.451 -0.118 0.264
x30 -0.0703 0.075 -0.938 0.349 -0.218 0.078
==============================================================================
Omnibus: 0.073 Durbin-Watson: 1.980
Prob(Omnibus): 0.964 Jarque-Bera (JB): 0.002
Skew: 0.001 Prob(JB): 0.999
Kurtosis: 3.013 Cond. No. 6.47e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.47e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
arfitted_log = armod_log.fittedvalues
#Plotting the fitted values against response values
plt.xlabel('Fitted Values')
plt.ylabel('Response Values')
plt.title('AR(30) Model Fitted Values vs Response Values')
#plotting with reduced size points for better visibility
plt.plot(arfitted_log, yreg, 'o', markersize=4)
plt.show()
#Generate k-step ahead forecasts:
k = 100
yhat = np.concatenate([ylog, np.full(k, np.nan)]) #extend data by k placeholder values
for i in range(1, k+1):
    ans = armod_log.params[0]
    for j in range(1, p+1):
        ans += armod_log.params[j] * yhat[n+i-j-1]
    yhat[n+i-1] = ans
predvalues_log = yhat[n:]
#Plotting the series with forecasts:
time_all = np.arange(1, n + k + 1)
plt.plot(time_all, yhat, color='C0')
plt.plot(range(1, n + 1), ylog, label='Original Data (log scale)', color='C1')
plt.plot(range(n + 1, n + k + 1), predvalues_log, label='Forecasts', color='blue')
plt.axvline(x=n, color='black', linestyle='--', label='Forecast Start')
#plt.axhline(y=np.mean(y), color='gray', linestyle=':', label='Mean of Original Data')
plt.xlabel('Time')
plt.ylabel('Data')
plt.title('Time Series + AR(p) Forecasts')
plt.legend()
plt.show()
We now plot the predictions on the original data.
#Plotting the series with forecasts:
time_all = np.arange(1, n + k + 1)
plt.plot(range(1, n + 1), y, label='Original Data', color='C1')
plt.plot(range(n + 1, n + k + 1), np.exp(predvalues_log), label='Forecasts (AR on log)', color='blue')
plt.plot(range(n + 1, n + k + 1), predvalues, label='Forecasts (AR original data)', color='green')
plt.axvline(x=n, color='black', linestyle='--', label='Forecast Start')
#plt.axhline(y=np.mean(y), color='gray', linestyle=':', label='Mean of Original Data')
plt.xlabel('Time')
plt.ylabel('Data')
plt.title('Time Series + AR(p) Forecasts')
plt.legend()
plt.show()
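One caveat about exponentiating log-scale forecasts: under Gaussian errors on the log scale, np.exp(predvalues_log) targets the median of the original series rather than its mean. A sketch of the standard lognormal adjustment, hedged because it is exact only for the one-step-ahead error variance (multi-step forecast variances are larger):
#Sketch: lognormal mean adjustment for forecasts produced on the log scale
sig2_log = np.mean(armod_log.resid ** 2) #residual variance of the log-scale fit
predvalues_adj = np.exp(predvalues_log + sig2_log / 2)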
Sunspots Data
Autoregressive models were invented in the context of the sunspots data by Yule in 1927.
sunspots = pd.read_csv('SN_y_tot_V2.0.csv', header = None, sep = ';')
y = sunspots.iloc[:,1].values
n = len(y)
plt.figure(figsize = (12, 6))
plt.plot(y)
plt.show()
The most basic model for this dataset is the single sinusoid model:
$$y_t = \beta_0 + \beta_1 \cos(2\pi f t) + \beta_2 \sin(2\pi f t) + \epsilon_t$$
with $\epsilon_t \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. We previously used this model. For estimating $f$, we calculate $RSS(f)$ for a bunch of $f$ values, and then minimize it over $f$.
#rss function:
def rss(f):
    n = len(y)
    x = np.arange(1, n+1)
    xcos = np.cos(2 * np.pi * f * x)
    xsin = np.sin(2 * np.pi * f * x)
    X = np.column_stack([np.ones(n), xcos, xsin])
    md = sm.OLS(y, X).fit()
    ans = np.sum(md.resid ** 2)
    return ans
ngrid = 10000
fvals = np.linspace(0, 0.5, ngrid)
rssvals = np.array([rss(f) for f in fvals])
plt.plot(fvals, rssvals) #plot rss and find f which minimizes rss
plt.show()
fhat = fvals[np.argmin(rssvals)]
print(fhat)
print(1/fhat) #period corresponding to fhat
0.09090909090909091
11.0
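The grid has spacing 0.5/ngrid, so fhat is only resolved to about 5e-05. If finer accuracy were needed, the minimizer could be refined locally; a sketch, assuming scipy is available:
#Sketch: refine the grid minimizer of rss in a small bracket around fhat
from scipy.optimize import minimize_scalar
res = minimize_scalar(rss, bounds = (fhat - 1e-3, fhat + 1e-3), method = 'bounded')
print(res.x, 1 / res.x)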
The MLE of $f$ is basically $1/11$ (corresponding to the 11-year solar cycle). After estimating $f$, the other parameters are estimated as in usual linear regression as follows.
#Estimates of beta and sigma:
x = np.arange(1, n+1)
f = fhat
xcos = np.cos(2 * np.pi * f * x)
xsin = np.sin(2 * np.pi * f * x)
X = np.column_stack([np.ones(n), xcos, xsin])
md = sm.OLS(y, X).fit()
print(md.params) #this gives estimates of beta_0, beta_1, beta_2
rss_fhat = np.sum(md.resid ** 2)
#there are two estimates for sigma (which usually give similar values)
sigma_mle = np.sqrt(rss_fhat/n)
sigma_unbiased = np.sqrt((rss_fhat)/(n-3))
print(np.array([sigma_mle, sigma_unbiased]))
[ 78.87599158 -38.28231622 -29.2786631 ]
[51.62376442 51.86369026]
Yule did not like the simple single sinusoid model for the sunspots data, mainly because data simulated from it look much noisier and more irregular than the actual sunspots series, which has a smooth appearance. We can check this by simulating synthetic data from the single sinusoid model and comparing the synthetic data with the actual sunspots data. We take the parameters to be those estimated from the sunspots dataset so that the plots are comparable.
rng = np.random.default_rng(seed = 42)
errorsamples = rng.normal(loc = 0, scale = sigma_mle, size = n)
sim_data = md.fittedvalues + errorsamples
plt.figure(figsize = (10, 6))
plt.plot(sim_data)
plt.show()
To facilitate comparison with the actual sunspots dataset, let us plot a bunch of these simulated datasets along with the real sunspots data (just to see if the sunspots dataset can be spotted as the “odd one out” from these plots).
fig, axes = plt.subplots(3, 3, figsize = (12, 4))
axes = axes.flatten()
for i in range(5):
    errorsamples = rng.normal(loc = 0, scale = sigma_mle, size = n)
    sim_data = md.fittedvalues + errorsamples
    axes[i].plot(sim_data)
axes[5].plot(y)
for idx in range(6, 9):
    errorsamples = rng.normal(loc = 0, scale = sigma_mle, size = n)
    sim_data = md.fittedvalues + errorsamples
    axes[idx].plot(sim_data)
plt.tight_layout()
plt.show()
It is clear from the above plot-grid that the sunspots dataset sticks out as the odd one out. This shows that the simulated datasets (which are too wiggly and do not have well-defined peaks) are generated from a model that ignores many aspects of the sunspots dataset.
Yule Model
The Yule model is given by:
$$y_t = \beta_0 + \beta_1 y_{t-1} - y_{t-2} + \epsilon_t.$$
Note that the coefficient $\beta_2$ of $y_{t-2}$ in the usual AR(2) model is set to $-1$ above. Also, $\beta_1$ is given by $\beta_1 = 2\cos(2\pi f)$. The above model can be rewritten as
$$y_t + y_{t-2} = \beta_0 + \beta_1 y_{t-1} + \epsilon_t,$$
so we can estimate $\beta_0$ and $\beta_1$ by regressing $y_t + y_{t-2}$ on $y_{t-1}$.
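The relation $\beta_1 = 2\cos(2\pi f)$ comes from the fact that, without noise, the recursion $y_t = \beta_1 y_{t-1} - y_{t-2}$ is solved exactly by sinusoids of frequency $f$ (via the identity $\cos(a+b) + \cos(a-b) = 2\cos a \cos b$). A quick numeric check of this claim:
#Numeric check: with beta1 = 2*cos(2*pi*f), a cosine of frequency f solves the recursion
f_check = 1/11
b1 = 2 * np.cos(2 * np.pi * f_check)
w = np.cos(2 * np.pi * f_check * np.arange(50))
print(np.max(np.abs(w[2:] - (b1 * w[1:-1] - w[:-2])))) #essentially zero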
p = 2
yreg = y[p:]
x1 = y[1:-1]
x2 = y[:-2]
Xmat = np.column_stack([np.ones(len(yreg)), x1])
print(Xmat.shape)
(323, 2)
y_adjusted = yreg + x2
yulemod = sm.OLS(y_adjusted, Xmat).fit()
print(yulemod.summary())
print(yulemod.params)
slope_est = yulemod.params[1]
                            OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.930
Model: OLS Adj. R-squared: 0.930
Method: Least Squares F-statistic: 4254.
Date: Thu, 23 Oct 2025 Prob (F-statistic): 2.90e-187
Time: 16:51:37 Log-Likelihood: -1532.1
No. Observations: 323 AIC: 3068.
Df Residuals: 321 BIC: 3076.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 28.6834 2.511 11.421 0.000 23.742 33.624
x1 1.6365 0.025 65.224 0.000 1.587 1.686
==============================================================================
Omnibus: 11.119 Durbin-Watson: 2.493
Prob(Omnibus): 0.004 Jarque-Bera (JB): 17.202
Skew: 0.225 Prob(JB): 0.000184
Kurtosis: 4.037 Cond. No. 162.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[28.6833743 1.63648935]
From the estimate for $\beta_1$, the frequency parameter can be estimated using the relation $\beta_1 = 2\cos(2\pi f)$, i.e., $\hat f = \arccos(\hat\beta_1/2)/(2\pi)$.
yulefhat = np.arccos(slope_est/2) / (2 * np.pi)
print(1/yulefhat)
10.259176143439632
The period corresponding to this frequency is 10.259, which is smaller than the period of 11 that we get by fitting the single sinusoid model.
#Generate k-step ahead forecasts:
k = 100
yhat = np.concatenate([y, np.full(k, np.nan)]) #extend data by k placeholder values
for i in range(1, k+1):
    ans = yulemod.params[0] - yhat[n+i-3]
    ans += yulemod.params[1] * yhat[n+i-2]
    yhat[n+i-1] = ans
predvalues = yhat[n:]
print(predvalues)
print(yhat)
[ 1.56348277e+02  1.29845664e+02  8.48261433e+01  3.76547907e+01
5.47889486e+00 -5.26328291e-03 2.31958661e+01 6.66484255e+01
1.14556947e+02 1.49506172e+02 1.58791685e+02 1.39038104e+02
9.74260659e+01 4.90819890e+01 1.15794607e+01 -1.44895065e+00
1.47327213e+01 5.42422665e+01 1.02717544e+02 1.42537275e+02
1.59226562e+02 1.46718673e+02 1.09560357e+02 6.12590590e+01
1.93728148e+01 -8.72279635e-01 7.88308317e+00 4.24562356e+01
9.02794685e+01 1.33968527e+02 1.57641974e+02 1.52694258e+02
1.20923928e+02 7.38798357e+01 2.86630109e+01 1.71025067e+00
2.81917042e+00 3.15866660e+01 7.75554463e+01 1.24015370e+02
1.54077760e+02 1.56814618e+02 1.31231066e+02 8.66269979e+01
3.92164679e+01 6.23370836e+00 -3.31696240e-01 2.19068486e+01
6.48653949e+01 1.12928054e+02 1.48623536e+02 1.58976155e+02
1.40222622e+02 9.91800471e+01 5.07678428e+01 1.25843613e+01
-1.49029536e+00 1.36601606e+01 5.25283769e+01 1.00985343e+02
1.41416436e+02 1.59124522e+02 1.47672524e+02 1.11223365e+02
6.30267024e+01 2.06025365e+01 -6.27496582e-01 7.05394632e+00
4.08545789e+01 8.84875112e+01 1.32637665e+02 1.57255989e+02
1.53393461e+02 1.22454150e+02 7.56848253e+01 3.00866351e+01
2.23500697e+00 2.25430426e+00 3.01375122e+01 7.57487878e+01
1.22507947e+02 1.53417536e+02 1.57241592e+02 1.32590028e+02
8.84239513e+01 4.07982008e+01 7.02524409e+00 -6.18089412e-01
2.06466335e+01 6.30894595e+01 1.11281969e+02 1.47705672e+02
1.59120165e+02 1.41376156e+02 1.00923784e+02 5.24679154e+01
1.36227750e+01 -1.49101484e+00 1.26205694e+01 5.08278165e+01]
[ 8.30000000e+00 1.83000000e+01 2.67000000e+01 3.83000000e+01
6.00000000e+01 9.67000000e+01 4.83000000e+01 3.33000000e+01
1.67000000e+01 1.33000000e+01 5.00000000e+00 0.00000000e+00
0.00000000e+00 3.30000000e+00 1.83000000e+01 4.50000000e+01
7.83000000e+01 1.05000000e+02 1.00000000e+02 6.50000000e+01
4.67000000e+01 4.33000000e+01 3.67000000e+01 1.83000000e+01
3.50000000e+01 6.67000000e+01 1.30000000e+02 2.03300000e+02
1.71700000e+02 1.21700000e+02 7.83000000e+01 5.83000000e+01
1.83000000e+01 8.30000000e+00 2.67000000e+01 5.67000000e+01
1.16700000e+02 1.35000000e+02 1.85000000e+02 1.68300000e+02
1.21700000e+02 6.67000000e+01 3.33000000e+01 2.67000000e+01
8.30000000e+00 1.83000000e+01 3.67000000e+01 6.67000000e+01
1.00000000e+02 1.34800000e+02 1.39000000e+02 7.95000000e+01
7.97000000e+01 5.12000000e+01 2.03000000e+01 1.60000000e+01
1.70000000e+01 5.40000000e+01 7.93000000e+01 9.00000000e+01
1.04800000e+02 1.43200000e+02 1.02000000e+02 7.52000000e+01
6.07000000e+01 3.48000000e+01 1.90000000e+01 6.30000000e+01
1.16300000e+02 1.76800000e+02 1.68000000e+02 1.36000000e+02
1.10800000e+02 5.80000000e+01 5.10000000e+01 1.17000000e+01
3.30000000e+01 1.54200000e+02 2.57300000e+02 2.09800000e+02
1.41300000e+02 1.13500000e+02 6.42000000e+01 3.80000000e+01
1.70000000e+01 4.02000000e+01 1.38200000e+02 2.20000000e+02
2.18200000e+02 1.96800000e+02 1.49800000e+02 1.11000000e+02
1.00000000e+02 7.82000000e+01 6.83000000e+01 3.55000000e+01
2.67000000e+01 1.07000000e+01 6.80000000e+00 1.13000000e+01
2.42000000e+01 5.67000000e+01 7.50000000e+01 7.18000000e+01
7.92000000e+01 7.03000000e+01 4.68000000e+01 1.68000000e+01
1.35000000e+01 4.20000000e+00 0.00000000e+00 2.30000000e+00
8.30000000e+00 2.03000000e+01 2.32000000e+01 5.90000000e+01
7.63000000e+01 6.83000000e+01 5.29000000e+01 3.85000000e+01
2.42000000e+01 9.20000000e+00 6.30000000e+00 2.20000000e+00
1.14000000e+01 2.82000000e+01 5.99000000e+01 8.30000000e+01
1.08500000e+02 1.15200000e+02 1.17400000e+02 8.08000000e+01
4.43000000e+01 1.34000000e+01 1.95000000e+01 8.58000000e+01
1.92700000e+02 2.27300000e+02 1.68700000e+02 1.43000000e+02
1.05500000e+02 6.33000000e+01 4.03000000e+01 1.81000000e+01
2.51000000e+01 6.58000000e+01 1.02700000e+02 1.66300000e+02
2.08300000e+02 1.82500000e+02 1.26300000e+02 1.22000000e+02
1.02700000e+02 7.41000000e+01 3.90000000e+01 1.27000000e+01
8.20000000e+00 4.34000000e+01 1.04400000e+02 1.78300000e+02
1.82200000e+02 1.46600000e+02 1.12100000e+02 8.35000000e+01
8.92000000e+01 5.78000000e+01 3.07000000e+01 1.39000000e+01
6.28000000e+01 1.23600000e+02 2.32000000e+02 1.85300000e+02
1.69200000e+02 1.10100000e+02 7.45000000e+01 2.83000000e+01
1.89000000e+01 2.07000000e+01 5.70000000e+00 1.00000000e+01
5.37000000e+01 9.05000000e+01 9.90000000e+01 1.06100000e+02
1.05800000e+02 8.63000000e+01 4.24000000e+01 2.18000000e+01
1.12000000e+01 1.04000000e+01 1.18000000e+01 5.95000000e+01
1.21700000e+02 1.42000000e+02 1.30000000e+02 1.06600000e+02
6.94000000e+01 4.38000000e+01 4.44000000e+01 2.02000000e+01
1.57000000e+01 4.60000000e+00 8.50000000e+00 4.08000000e+01
7.01000000e+01 1.05500000e+02 9.01000000e+01 1.02800000e+02
8.09000000e+01 7.32000000e+01 3.09000000e+01 9.50000000e+00
6.00000000e+00 2.40000000e+00 1.61000000e+01 7.90000000e+01
9.50000000e+01 1.73600000e+02 1.34600000e+02 1.05700000e+02
6.27000000e+01 4.35000000e+01 2.37000000e+01 9.70000000e+00
2.79000000e+01 7.40000000e+01 1.06500000e+02 1.14700000e+02
1.29700000e+02 1.08200000e+02 5.94000000e+01 3.51000000e+01
1.86000000e+01 9.20000000e+00 1.46000000e+01 6.02000000e+01
1.32800000e+02 1.90600000e+02 1.82600000e+02 1.48000000e+02
1.13000000e+02 7.92000000e+01 5.08000000e+01 2.71000000e+01
1.61000000e+01 5.53000000e+01 1.54300000e+02 2.14700000e+02
1.93000000e+02 1.90700000e+02 1.18900000e+02 9.83000000e+01
4.50000000e+01 2.01000000e+01 6.60000000e+00 5.42000000e+01
2.00700000e+02 2.69300000e+02 2.61700000e+02 2.25100000e+02
1.59000000e+02 7.64000000e+01 5.34000000e+01 3.99000000e+01
1.50000000e+01 2.20000000e+01 6.68000000e+01 1.32900000e+02
1.50000000e+02 1.49400000e+02 1.48000000e+02 9.44000000e+01
9.76000000e+01 5.41000000e+01 4.92000000e+01 2.25000000e+01
1.84000000e+01 3.93000000e+01 1.31000000e+02 2.20100000e+02
2.18900000e+02 1.98900000e+02 1.62400000e+02 9.10000000e+01
6.05000000e+01 2.06000000e+01 1.48000000e+01 3.39000000e+01
1.23000000e+02 2.11100000e+02 1.91800000e+02 2.03300000e+02
1.33000000e+02 7.61000000e+01 4.49000000e+01 2.51000000e+01
1.16000000e+01 2.89000000e+01 8.83000000e+01 1.36300000e+02
1.73900000e+02 1.70400000e+02 1.63600000e+02 9.93000000e+01
6.53000000e+01 4.58000000e+01 2.47000000e+01 1.26000000e+01
4.20000000e+00 4.80000000e+00 2.49000000e+01 8.08000000e+01
8.45000000e+01 9.40000000e+01 1.13300000e+02 6.98000000e+01
3.98000000e+01 2.17000000e+01 7.00000000e+00 3.60000000e+00
8.80000000e+00 2.96000000e+01 8.32000000e+01 1.25500000e+02
1.54700000e+02 1.56348277e+02 1.29845664e+02 8.48261433e+01
3.76547907e+01 5.47889486e+00 -5.26328291e-03 2.31958661e+01
6.66484255e+01 1.14556947e+02 1.49506172e+02 1.58791685e+02
1.39038104e+02 9.74260659e+01 4.90819890e+01 1.15794607e+01
-1.44895065e+00 1.47327213e+01 5.42422665e+01 1.02717544e+02
1.42537275e+02 1.59226562e+02 1.46718673e+02 1.09560357e+02
6.12590590e+01 1.93728148e+01 -8.72279635e-01 7.88308317e+00
4.24562356e+01 9.02794685e+01 1.33968527e+02 1.57641974e+02
1.52694258e+02 1.20923928e+02 7.38798357e+01 2.86630109e+01
1.71025067e+00 2.81917042e+00 3.15866660e+01 7.75554463e+01
1.24015370e+02 1.54077760e+02 1.56814618e+02 1.31231066e+02
8.66269979e+01 3.92164679e+01 6.23370836e+00 -3.31696240e-01
2.19068486e+01 6.48653949e+01 1.12928054e+02 1.48623536e+02
1.58976155e+02 1.40222622e+02 9.91800471e+01 5.07678428e+01
1.25843613e+01 -1.49029536e+00 1.36601606e+01 5.25283769e+01
1.00985343e+02 1.41416436e+02 1.59124522e+02 1.47672524e+02
1.11223365e+02 6.30267024e+01 2.06025365e+01 -6.27496582e-01
7.05394632e+00 4.08545789e+01 8.84875112e+01 1.32637665e+02
1.57255989e+02 1.53393461e+02 1.22454150e+02 7.56848253e+01
3.00866351e+01 2.23500697e+00 2.25430426e+00 3.01375122e+01
7.57487878e+01 1.22507947e+02 1.53417536e+02 1.57241592e+02
1.32590028e+02 8.84239513e+01 4.07982008e+01 7.02524409e+00
-6.18089412e-01 2.06466335e+01 6.30894595e+01 1.11281969e+02
1.47705672e+02 1.59120165e+02 1.41376156e+02 1.00923784e+02
5.24679154e+01 1.36227750e+01 -1.49101484e+00 1.26205694e+01
5.08278165e+01]
#Plotting the series with forecasts:
plt.figure(figsize=(12, 6))
time_all = np.arange(1, n + k + 1)
plt.plot(time_all, yhat, label='Extended Series (yhat)', color='C0')
plt.plot(range(1, n + 1), y, label='Original Data', color='C1')
plt.plot(range(n + 1, n + k + 1), predvalues, label='Forecasts', color='blue')
plt.axvline(x=n, color='black', linestyle='--', label='Forecast Start')
#plt.axhline(y=np.mean(y), color='gray', linestyle=':', label='Mean of Original Data')
plt.xlabel('Time')
plt.ylabel('Data')
plt.title('Time Series + AR(p) Forecasts')
plt.legend()
plt.show()
The predictions above are perfectly sinusoidal. Because the cycles in the sunspots data are irregular and can have different periods, there is a danger that these perfectly sinusoidal predictions will sometimes get out of phase, leading to a loss of prediction accuracy.
Below, we simulate datasets using the Yule model.
#Simulating from the Yule Model:
rng = np.random.default_rng(seed = 42)
sighat = np.sqrt(np.mean(yulemod.resid ** 2))
print(sighat)
fig, axes = plt.subplots(3, 3, figsize = (12, 6))
axes = axes.flatten()
for j in range(1):
    ysim = y.copy()
    for i in range(2, n):
        err = rng.normal(loc = 0, scale = sighat, size = 1)
        ysim[i] = yulemod.params[0] + yulemod.params[1] * ysim[i-1] - ysim[i-2] + err[0]
    axes[j].plot(ysim)
axes[1].plot(y)
for j in range(2, 9):
    ysim = y.copy()
    for i in range(2, n):
        err = rng.normal(loc = 0, scale = sighat, size = 1)
        ysim[i] = yulemod.params[0] + yulemod.params[1] * ysim[i-1] - ysim[i-2] + err[0]
    axes[j].plot(ysim)
plt.show()
27.778317980024667

The sunspots dataset still probably sticks out compared to these synthetic datasets. However, all of the synthetic datasets are now quite smooth, just like the actual sunspots data. This plot shows that the two “sinusoid + noise” models described in lecture are fundamentally different.
AR(2) Model
The Yule model is a special case of the AR(2) model. Specifically, the AR(2) model is:
$$y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \epsilon_t,$$
and the Yule model is the special case obtained by taking $\beta_2 = -1$. Below we fit the AR(2) model to the sunspots data (i.e., we estimate all the parameters $\beta_0, \beta_1, \beta_2$).
p = 2 #this is the order of the AR model
n = len(y)
print(n)
325
yreg = y[p:] #these are the response values in the autoregression
Xmat = np.ones((n-p, 1)) #this will be the design matrix (X) in the autoregression
for j in range(1, p+1):
    col = y[p-j : n-j].reshape(-1, 1)
    Xmat = np.column_stack([Xmat, col])
armod = sm.OLS(yreg, Xmat).fit()
print(armod.params)
print(armod.summary())
sighat = np.sqrt(np.mean(armod.resid ** 2))
print(sighat)
[24.45610705  1.38803272 -0.69646032]
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.829
Model: OLS Adj. R-squared: 0.828
Method: Least Squares F-statistic: 774.7
Date: Thu, 23 Oct 2025 Prob (F-statistic): 2.25e-123
Time: 16:56:04 Log-Likelihood: -1505.5
No. Observations: 323 AIC: 3017.
Df Residuals: 320 BIC: 3028.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 24.4561 2.384 10.260 0.000 19.767 29.146
x1 1.3880 0.040 34.524 0.000 1.309 1.467
x2 -0.6965 0.040 -17.342 0.000 -0.775 -0.617
==============================================================================
Omnibus: 33.173 Durbin-Watson: 2.200
Prob(Omnibus): 0.000 Jarque-Bera (JB): 50.565
Skew: 0.663 Prob(JB): 1.05e-11
Kurtosis: 4.414 Cond. No. 231.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
25.588082517109108
#Simulations from the fitted AR(2) model:
fig, axes = plt.subplots(3, 3, figsize = (12, 6))
axes = axes.flatten()
for j in range(3):
    ysim = y.copy()
    for i in range(2, n):
        err = rng.normal(loc = 0, scale = sighat, size = 1)
        ysim[i] = armod.params[0] + armod.params[1] * ysim[i-1] + armod.params[2] * ysim[i-2] + err[0]
    axes[j].plot(ysim)
axes[3].plot(y)
for j in range(4, 9):
    ysim = y.copy()
    for i in range(2, n):
        err = rng.normal(loc = 0, scale = sighat, size = 1)
        ysim[i] = armod.params[0] + armod.params[1] * ysim[i-1] + armod.params[2] * ysim[i-2] + err[0]
    axes[j].plot(ysim)
plt.show()
These datasets are still smooth (much smoother than observations simulated from the single sinusoid model).
Below, we obtain predictions using the AR(2) model.
#Generate k-step ahead forecasts:
k = 100
yhat = np.concatenate([y, np.full(k, np.nan)]) #extend data by k placeholder values
for i in range(1, k+1):
    ans = armod.params[0]
    for j in range(1, p+1):
        ans += armod.params[j] * yhat[n+i-j-1]
    yhat[n+i-1] = ans
predvalues = yhat[n:]
#Plotting the series with forecasts:
plt.figure(figsize=(12, 6))
time_all = np.arange(1, n + k + 1)
plt.plot(time_all, yhat, color='C0')
plt.plot(range(1, n + 1), y, label='Original Data', color='C1')
plt.plot(range(n + 1, n + k + 1), predvalues, label='Forecasts', color='blue')
plt.axvline(x=n, color='black', linestyle='--', label='Forecast Start')
#plt.axhline(y=np.mean(y), color='gray', linestyle=':', label='Mean of Original Data')
plt.xlabel('Time')
plt.ylabel('Data')
plt.title('Time Series + AR(p) Forecasts')
plt.legend()
plt.show()
The predictions above are quite different from those obtained from the Yule model. They are actually given by a damped sinusoid (as we shall see in the coming lectures) and become essentially flat after a few time points. When the cycles are irregular with varying periods, constant predictions might do well compared to sinusoidal predictions, which can go out of phase in places.
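The damped-sinusoid shape can be anticipated from the fitted coefficients: the noiseless recursion $y_t = \beta_1 y_{t-1} + \beta_2 y_{t-2}$ is governed by the roots of $z^2 - \beta_1 z - \beta_2 = 0$, and complex roots with modulus below one give oscillations that decay geometrically. A sketch of this calculation for the fit above:
#Sketch: damping and period implied by the fitted AR(2) coefficients
roots = np.roots([1, -armod.params[1], -armod.params[2]])
print(np.abs(roots)) #both moduli are below 1, so the oscillation damps out
theta = np.abs(np.angle(roots[0]))
print(2 * np.pi / theta) #implied period in years, close to the solar cycle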