Skip to article frontmatterSkip to article content

We evaluate the prediction accuracy of some simple models for the sunspots dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
sunspots = pd.read_csv('SN_y_tot_V2.0.csv', header = None, sep = ';')
print(sunspots.head())

y = sunspots.iloc[:, 1].values
n = len(y)

plt.figure(figsize = (12, 6))
plt.plot(y)
plt.show()
print(n)
        0     1    2  3  4
0  1700.5   8.3 -1.0 -1  1
1  1701.5  18.3 -1.0 -1  1
2  1702.5  26.7 -1.0 -1  1
3  1703.5  38.3 -1.0 -1  1
4  1704.5  60.0 -1.0 -1  1
<Figure size 1200x600 with 1 Axes>
325

Let us split the dataset into two parts (training and test). We will fit models to the training data, and evaluate their prediction accuracy on the test set.

splitnumber = 250
sunspots_train = sunspots.iloc[:splitnumber, :].copy()
sunspots_test = sunspots.iloc[splitnumber:, :].copy()

print(sunspots_train)
print(sunspots_test)

tme_train = sunspots_train.iloc[:,0]
tme_test = sunspots_test.iloc[:,0]
tme = sunspots.iloc[:,0]

y =  sunspots_train.iloc[:,1].values

plt.figure(figsize = (12, 6))
plt.xlabel('Time')
plt.ylabel('Count')
plt.plot(tme, sunspots.iloc[:,1], color = "None")
plt.plot(tme_train, y, color = 'black', label = 'Training data')
plt.plot(tme_test, sunspots_test.iloc[:,1], color = 'red', label = 'Test Data')
plt.legend()
plt.title('Sunspots Data')
plt.show()
          0      1     2    3  4
0    1700.5    8.3  -1.0   -1  1
1    1701.5   18.3  -1.0   -1  1
2    1702.5   26.7  -1.0   -1  1
3    1703.5   38.3  -1.0   -1  1
4    1704.5   60.0  -1.0   -1  1
..      ...    ...   ...  ... ..
245  1945.5   55.3   6.6  365  1
246  1946.5  154.3  11.1  365  1
247  1947.5  214.7   9.8  365  1
248  1948.5  193.0   9.3  366  1
249  1949.5  190.7   9.2  365  1

[250 rows x 5 columns]
          0      1     2      3  4
250  1950.5  118.9   7.3    365  1
251  1951.5   98.3   6.6    365  1
252  1952.5   45.0   4.5    366  1
253  1953.5   20.1   3.0    365  1
254  1954.5    6.6   1.7    365  1
..      ...    ...   ...    ... ..
320  2020.5    8.8   4.1  14440  1
321  2021.5   29.6   7.9  15233  1
322  2022.5   83.2  14.2  15258  1
323  2023.5  125.5  19.2  13286  1
324  2024.5  154.7  22.1  11952  0

[75 rows x 5 columns]
<Figure size 1200x600 with 1 Axes>

Model One: Sinusoid Model

Our first model is the simple sinusoidal model that we studied way back in Lectures 5-8:

yt=β0+β1cos(2πft)+β2sin(2πft)+ϵty_t = \beta_0 + \beta_1 \cos(2 \pi f t) + \beta_2 \sin(2 \pi f t) + \epsilon_t

with ϵti.i.dN(0,σ2)\epsilon_t \overset{\text{i.i.d}}{\sim} N(0, \sigma^2). The point estimate for ff can be obtained by the follwoing code

def rss(f):
    n = len(y)
    x = np.arange(1, n+1)
    xcos = np.cos(2 * np.pi * f * x)
    xsin = np.sin(2 * np.pi * f * x)
    X = np.column_stack([np.ones(n), xcos, xsin])

    md = sm.OLS(y, X).fit()
    rss = np.sum(md.resid ** 2)

    return rss

allfvals = np.arange(0.01, 0.5, .0001) #much finer grid
rssvals = np.array([rss(f) for f in allfvals])
fhat = allfvals[np.argmin(rssvals)]

print(fhat)
print(1/fhat)
0.08989999999999951
11.123470522803176

The predictions with this model are obtained as follows.

n = len(y)
x = np.arange(1, n+1)
xcos = np.cos(2 * np.pi * fhat * x)
xsin = np.sin(2 * np.pi * fhat * x)
X = np.column_stack([np.ones(n), xcos, xsin])

md = sm.OLS(y, X).fit()

t_future = np.arange(n+1, n+len(tme_test) + 1)
pred_test = (md.params[0] 
             + md.params[1] * (np.cos(2 * np.pi * fhat * t_future)) 
             + md.params[2] * (np.sin(2 * np.pi * fhat * t_future)))

# Prediction error:
pred_error_rms_sinusoid = np.sqrt(np.mean((pred_test - sunspots_test.iloc[:, 1]) ** 2))
print(pred_error_rms_sinusoid)
79.83531765284938
plt.figure(figsize = (12, 6))
plt.xlabel('Time')
plt.ylabel('Count')
plt.plot(tme, sunspots.iloc[:,1], color = "None")
plt.plot(tme_train, y, color = 'black', label = 'Training data')
plt.plot(tme_test, sunspots_test.iloc[:,1], color = 'red', label = 'Test Data')
plt.plot(tme_test, pred_test, color = 'blue', label = 'Predictions')
plt.legend()
plt.title('Sunspots Data')
plt.show()    
<Figure size 1200x600 with 1 Axes>

Model Two: The Yule Model

This model is motivated by the following observation (discussed with proof in Lecture 16) that

st=β0+β1cos(2πft)+β2sin(2πft)    for t=1,2,s_t = \beta_0 + \beta_1 \cos(2 \pi f t) + \beta_2 \sin(2 \pi f t) ~~~ \text{ for } t = 1, 2, \dots

is equivalent to (below ω=2πf\omega = 2 \pi f)

st=α0+α1st1st2     with α0=2β0(1cosω) and α1=2cosω.s_t = \alpha_0 + \alpha_1 s_{t-1} - s_{t-2} ~~~~ \text{ with } \alpha_0 = 2 \beta_0 (1 - \cos \omega) \text{ and } \alpha_1 = 2 \cos \omega.

This suggests that a different sinusoid plus noise model is obtained by adding noise in the above equation leading to:

yt=α0+α1yt1yt2+ϵty_t = \alpha_0 + \alpha_1 y_{t-1} - y_{t-2} + \epsilon_t

The parameters α0\alpha_0 and α1\alpha_1 can be fit by least squares by minimizing

t=3n(ytα0α1yt1+yt2)2=t=3n(yt+yt2α0α1yt1)2\sum_{t=3}^n (y_t - \alpha_0 - \alpha_1 y_{t-1} + y_{t-2})^2 = \sum_{t=3}^n (y_t + y_{t-2} - \alpha_0 - \alpha_1 y_{t-1} )^2

Note that the time index goes from 3 to nn above (because we do not have access to yt1y_{t-1} and/or yt2y_{t-2} when t2t \leq 2). This least squares estimator can be computed by creating a new response variable yt+yt2y_t + y_{t-2} and regressing it on yt1y_{t-1} (and the constant term).

p = 2
yreg = y[p:]
x1 = y[1:-1]
x2 = y[:-2]
Xmat = np.column_stack([np.ones(len(yreg)), x1])

print(Xmat.shape)
(248, 2)
y_adjusted = yreg + x2
yulemod = sm.OLS(y_adjusted, Xmat).fit()

print(yulemod.summary())
print(yulemod.params)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.926
Model:                            OLS   Adj. R-squared:                  0.926
Method:                 Least Squares   F-statistic:                     3090.
Date:                Thu, 20 Mar 2025   Prob (F-statistic):          2.78e-141
Time:                        20:49:50   Log-Likelihood:                -1168.0
No. Observations:                 248   AIC:                             2340.
Df Residuals:                     246   BIC:                             2347.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         27.3114      2.791      9.787      0.000      21.815      32.808
x1             1.6348      0.029     55.592      0.000       1.577       1.693
==============================================================================
Omnibus:                       10.450   Durbin-Watson:                   2.408
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               20.195
Skew:                           0.133   Prob(JB):                     4.12e-05
Kurtosis:                       4.372   Cond. No.                         155.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[27.31136502  1.634765  ]

Because α1=2cosω=2cos(2πf)\alpha_1 = 2 \cos \omega = 2 \cos (2 \pi f), we can obtain an estimate of ff from the least squares estimate of α1\alpha_1 obtained above.

alpha1_hat = yulemod.params[1]
yulefhat = np.arccos(alpha1_hat/2) / (2 * np.pi)

print(1/yulefhat)
10.234141128185033

It is interesting that the period corresponding to this estimate of ff is less than 11 by a nontrivial amount.

We now obtain predictions for the test times from the Yule model.

# Generate k-step ahead forecasts: 
k = len(tme_test)
yhat = np.concatenate([y, np.full(k, -9999)]) # extend data by k placeholder values

for i in range(1, k+1):
    ans = yulemod.params[0] - yhat[n+i-3]
    ans += yulemod.params[1] * yhat[n+i-2]
    yhat[n+i-1] = ans

predvalues_yulemod = yhat[n:]
print(predvalues_yulemod)
[146.06104972  75.38685636   4.49010921 -40.73521796 -43.77125261
  -3.50912861  65.34601702 137.64587487 186.98400606 195.34039803
 159.66300392  92.98245693  19.65282692 -33.5433384  -47.17693735
 -16.26850237  47.89312417 121.87387032 178.65337795 197.49378336
 171.513911   110.20251965  35.9526756  -24.11697905 -48.06690374
 -27.14974761  30.99481172 105.13034589 168.17996275 197.11573524
 181.36930636 126.69182313  53.05341637 -12.65059012 -46.42279325
 -35.92840227  14.99966389  87.76069276 155.77980967 194.21405216
 189.02588952 142.11022034  70.60228927   0.6192958  -42.27852115
 -42.42337723   0.23763408  70.12321812 141.70871333 188.84859105
 194.32571784 156.13965529  88.2372901   15.41894292 -35.71957693
 -46.50069192 -12.98676148  52.58175387 126.25693714 181.13003247
 197.15946465 168.49072395 105.59463801  31.44305903 -26.88126073
 -48.07623808 -24.4007254   35.49815135 109.74322567 171.21759752
 197.46867444 178.90864425 122.31627964  48.36109315 -15.94589238]
plt.figure(figsize = (12, 6))
plt.xlabel('Time')
plt.ylabel('Count')
plt.plot(tme, sunspots.iloc[:,1], color = "None")
plt.plot(tme_train, y, color = 'black', label = 'Training data')
plt.plot(tme_test, sunspots_test.iloc[:,1], color = 'red', label = 'Test Data')
#plt.plot(tme_test, pred_test, color = 'blue', label = 'Predictions (Model One)')
plt.plot(tme_test, predvalues_yulemod, color = 'green', label = 'Predictions (Model Two)')
plt.legend()
plt.title('Sunspots Data')
plt.show()    
<Figure size 1200x600 with 1 Axes>
pred_error_rms_yulemod = np.sqrt(np.mean((predvalues_yulemod - sunspots_test.iloc[:,1]) ** 2))
print(pred_error_rms_sinusoid, pred_error_rms_yulemod)
#basically the same prediction errors (slightly smaller for the Yule model); expect different results for different training-test splits. 
79.83531765284938 79.52702577862316

Model Three: AR(2)

The AR(2) model is a natural extension of the Yule model:

yt=ϕ0+ϕ1yt1+ϕ2yt2+ϵty_t = \phi_0 + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \epsilon_t

The only difference between the Yule model and AR(2) is that ϕ2=1\phi_2 = -1 in the Yule model while it is treated as an adjustable parameter which can provide better fits to the data in AR(2).

The AR(2) model is fit in the following way.

p = 2
yreg = y[p:] # these are the response values in the autoregression
Xmat = np.ones((n-p, 1)) # this will be the design matrix (X) in the autoregression

for j in range(1, p+1):
    col = y[p-j : n-j].reshape(-1, 1)
    Xmat = np.column_stack([Xmat, col])
armod = sm.OLS(yreg, Xmat).fit()
print(armod.params)
print(armod.summary())

sighat = np.sqrt(np.mean(armod.resid ** 2))
print(sighat)
[23.08524521  1.37838349 -0.68406411]
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.822
Model:                            OLS   Adj. R-squared:                  0.821
Method:                 Least Squares   F-statistic:                     567.0
Date:                Thu, 20 Mar 2025   Prob (F-statistic):           1.19e-92
Time:                        20:45:50   Log-Likelihood:                -1147.2
No. Observations:                 248   AIC:                             2300.
Df Residuals:                     245   BIC:                             2311.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         23.0852      2.647      8.722      0.000      17.872      28.299
x1             1.3784      0.047     29.399      0.000       1.286       1.471
x2            -0.6841      0.047    -14.506      0.000      -0.777      -0.591
==============================================================================
Omnibus:                       25.182   Durbin-Watson:                   2.117
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               37.915
Skew:                           0.631   Prob(JB):                     5.85e-09
Kurtosis:                       4.441   Cond. No.                         220.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
24.697886587075086

Note that the estimated value of ϕ2\phi_2 (which is the value of ϕ2\phi_2 which gives the best fit to the data in the least squares sense) is -0.6841 which is quite a bit smaller than the value of -1 which is hard-coded in the Yule model.

Let us obtain predictions for the future values using the fitted AR(2) model.

# Generate k-step ahead forecasts: 
k = len(tme_test)
yhat = np.concatenate([y, np.full(k, -9999)]) # extend data by k placeholder values

for i in range(1, k+1):
    ans = armod.params[0]
    for j in range(1, p+1):
        ans += armod.params[j] * yhat[n+i-j-1]

    yhat[n+i-1] = ans

predvalues_ar = yhat[n:]
plt.figure(figsize = (12, 6))
plt.xlabel('Time')
plt.ylabel('Count')
plt.plot(tme, sunspots.iloc[:,1], color = "None")
plt.plot(tme_train, y, color = 'black', label = 'Training data')
plt.plot(tme_test, sunspots_test.iloc[:,1], color = 'red', label = 'Test Data')
plt.plot(tme_test, pred_test, color = 'blue', label = 'Predictions (Model One)')
plt.plot(tme_test, predvalues_yulemod, color = 'green', label = 'Predictions (Model Two)')
plt.plot(tme_test, predvalues_ar, color = 'yellow', label = 'Predictions (Model Three)')
plt.legend()
plt.title('Sunspots Data')
plt.show()    
<Figure size 1200x600 with 1 Axes>

The predictions look quite different. For AR(2), the predictions die out to a constant value.

pred_error_rms_ar = np.sqrt(np.mean((predvalues_ar - sunspots_test.iloc[:,1]) ** 2))
print(pred_error_rms_sinusoid, pred_error_rms_yulemod, pred_error_rms_ar)
79.83531765284938 79.52702577862316 70.32365837557232

It is interesting that even though the predictions of the AR model die out to a constant value, in terms of prediction accuracy, it performs better than the previous two sinusoidal models. One reason for this is that the cycles of the sunspots data are irregular (and their periodicity changes from cycle to cycle). If we use a single sinusoid for prediction (as in Models One and Two), there is a danger that the predictions go out of phase with the actual data. The AR model, by predicting a constant, becomes more accurate than a sinusoidal model that goes out of phase. Note though that this prediction comparison will change if we change the training and test datasets.

Model Four: AR(pp) for a higher pp

We can see how the predictions change if we use AR(p)AR(p) for a higher pp. The AR(p) for p2p \geq 2 should give better fits to the data. Unless there is overfitting, this should lead to better predictions as well.

p = 12
yreg = y[p:] # these are the response values in the autoregression
Xmat = np.ones((n-p, 1)) # this will be the design matrix (X) in the autoregression

for j in range(1, p+1):
    col = y[p-j: n-j].reshape(-1, 1)
    Xmat = np.column_stack([Xmat, col])
armod = sm.OLS(yreg, Xmat).fit()
print(armod.params)
print(armod.summary())

sighat = np.sqrt(np.mean(armod.resid ** 2))
print(sighat)
[13.12906966  1.21947938 -0.50695376 -0.10670251  0.16956741 -0.15634832
  0.05883931 -0.05974739  0.10456934  0.09940318 -0.06642164  0.14368696
 -0.06191045]
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.850
Model:                            OLS   Adj. R-squared:                  0.842
Method:                 Least Squares   F-statistic:                     106.0
Date:                Thu, 20 Mar 2025   Prob (F-statistic):           2.00e-85
Time:                        20:45:50   Log-Likelihood:                -1082.3
No. Observations:                 238   AIC:                             2191.
Df Residuals:                     225   BIC:                             2236.
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         13.1291      4.789      2.742      0.007       3.693      22.565
x1             1.2195      0.067     18.239      0.000       1.088       1.351
x2            -0.5070      0.106     -4.795      0.000      -0.715      -0.299
x3            -0.1067      0.112     -0.957      0.340      -0.327       0.113
x4             0.1696      0.112      1.517      0.131      -0.051       0.390
x5            -0.1563      0.113     -1.389      0.166      -0.378       0.065
x6             0.0588      0.112      0.525      0.600      -0.162       0.280
x7            -0.0597      0.111     -0.540      0.590      -0.278       0.158
x8             0.1046      0.110      0.951      0.343      -0.112       0.321
x9             0.0994      0.109      0.908      0.365      -0.116       0.315
x10           -0.0664      0.109     -0.608      0.544      -0.282       0.149
x11            0.1437      0.105      1.372      0.171      -0.063       0.350
x12           -0.0619      0.067     -0.922      0.358      -0.194       0.070
==============================================================================
Omnibus:                       30.370   Durbin-Watson:                   1.990
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               51.328
Skew:                           0.713   Prob(JB):                     7.15e-12
Kurtosis:                       4.773   Cond. No.                         859.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
22.837160267072164
# Generate k-step ahead forecasts: 
k = len(tme_test)
yhat = np.concatenate([y, np.full(k, -9999)]) # extend data by k placeholder values

for i in range(1, k+1):
    ans = armod.params[0]
    for j in range(1, p+1):
        ans += armod.params[j] * yhat[n+i-j-1]

    yhat[n+i-1] = ans

predvalues_ar_high = yhat[n:]
pred_error_rms_ar_high = np.sqrt(np.mean((predvalues_ar_high - sunspots_test.iloc[:,1]) ** 2))
print(pred_error_rms_sinusoid, pred_error_rms_yulemod, 
      pred_error_rms_ar, pred_error_rms_ar_high)
79.83531765284938 79.52702577862316 70.32365837557232 50.162969939451386

With p=12p = 12, the prediction accuracy is much better compared to p=2p = 2 (as well as models 1 and 2).

plt.figure(figsize = (12, 6))
plt.xlabel('Time')
plt.ylabel('Count')
plt.plot(tme, sunspots.iloc[:,1], color = "None")
plt.plot(tme_train, y, color = 'black', label = 'Training data')
plt.plot(tme_test, sunspots_test.iloc[:,1], color = 'red', label = 'Test Data')
#plt.plot(tme_test, pred_test, color = 'blue', label = 'Predictions (Model One)')
#plt.plot(tme_test, predvalues_yulemod, color = 'green', label = 'Predictions (Model Two)')
plt.plot(tme_test, predvalues_ar, color = 'yellow', label = 'Predictions (AR(2))')
plt.plot(tme_test, predvalues_ar_high, color = 'pink', label = 'Predictions (AR(p) for high p)')
plt.legend()
plt.title('Sunspots Data')
plt.show()    
<Figure size 1200x600 with 1 Axes>