
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg

AR(p) Models and Stationarity

Consider fitting AR($p$) models, starting with $p = 1$, to the following GNP dataset (from https://fred.stlouisfed.org/series/GNP). This is quarterly data.

gnp = pd.read_csv("GNP_30Oct2025.csv")
print(gnp.head())
y = gnp['GNP'].to_numpy()
plt.plot(y, color = 'black')
plt.show()
  observation_date      GNP
0       1947-01-01  244.142
1       1947-04-01  247.063
2       1947-07-01  250.716
3       1947-10-01  260.981
4       1948-01-01  267.133
[Figure: plot of the quarterly GNP series]

As usual, we fit AR($p$) models using the AutoReg function.

armd = AutoReg(y, lags = 1).fit()
print(armd.summary())
                            AutoReg Model Results                             
==============================================================================
Dep. Variable:                      y   No. Observations:                  314
Model:                     AutoReg(1)   Log Likelihood               -2054.516
Method:               Conditional MLE   S.D. of innovations            171.573
Date:                Fri, 07 Nov 2025   AIC                           4115.031
Time:                        14:45:08   BIC                           4126.270
Sample:                             1   HQIC                          4119.523
                                  314                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          8.8687     13.423      0.661      0.509     -17.439      35.176
y.L1           1.0116      0.001    827.008      0.000       1.009       1.014
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            0.9886           +0.0000j            0.9886            0.0000
-----------------------------------------------------------------------------

Is the above fitted AR(1) model causal-stationary? For an AR(1), causal stationarity holds when $|\phi_1| < 1$. Here the fitted value of $\phi_1$ equals 1.0116, which exceeds 1, so this model is not causal-stationary. In fact, from the plot of the GNP dataset, it should be clear that a stationary model is not appropriate for it. Predictions based on this fitted AR(1) model will become exponentially large as the prediction horizon increases, which makes sense given the nature of the data.
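
To see this concretely, here is a minimal sketch (the 40-quarter horizon is an arbitrary choice) that plots out-of-sample predictions from the fitted model alongside the data:

h = 40  # arbitrary prediction horizon (in quarters)
preds = armd.predict(start = len(y), end = len(y) + h - 1)  # out-of-sample predictions
plt.plot(np.arange(len(y)), y, color = 'black')
plt.plot(np.arange(len(y), len(y) + h), preds, color = 'red')
plt.title('GNP with AR(1) predictions')
plt.show()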

Preprocessing using logarithms and differences

If we want to fit stationary AR models to this GNP dataset, we can first preprocess the data by taking logarithms, and then by taking differences of the logarithms. Note that differenced log data (multiplied by 100) has a percentage-change interpretation. In other words, $100 \left(\log \text{GNP}_t - \log \text{GNP}_{t-1} \right)$ is (approximately) the percent change in GNP from one quarter to the next.
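
As a quick numerical check of this interpretation (a minimal sketch), we can compare 100 times the log difference against the exact quarter-over-quarter percent change:

exact_pct = 100 * (y[1:] / y[:-1] - 1)  # exact percent change
log_pct = 100 * np.diff(np.log(y))  # log-difference approximation
print(np.max(np.abs(exact_pct - log_pct)))  # the maximum discrepancy is small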

Here is a plot of the logarithms of the data.

ylog = np.log(y)
plt.plot(ylog, color = 'black')
plt.title('Log of GNP')
plt.show()
[Figure: Log of GNP]

Next we take differences and multiply by 100.

ylogdiff = (np.diff(np.log(y))) * 100
plt.plot(ylogdiff, color = 'black')
plt.title('Diff(Log(Quarterly GNP))')
plt.ylabel('Value')
plt.xlabel('Quarter')
plt.show()
[Figure: Diff(Log(Quarterly GNP))]

Now let us fit AR(1) to this log-differenced data.

armd_logdiff = AutoReg(ylogdiff, lags = 1).fit()
print(armd_logdiff.summary())
                            AutoReg Model Results                             
==============================================================================
Dep. Variable:                      y   No. Observations:                  313
Model:                     AutoReg(1)   Log Likelihood                -512.052
Method:               Conditional MLE   S.D. of innovations              1.249
Date:                Fri, 07 Nov 2025   AIC                           1030.104
Time:                        15:11:31   BIC                           1041.333
Sample:                             1   HQIC                          1034.591
                                  313                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.1596      0.110     10.519      0.000       0.943       1.376
y.L1           0.2487      0.055      4.536      0.000       0.141       0.356
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            4.0208           +0.0000j            4.0208            0.0000
-----------------------------------------------------------------------------

This fitted AR(1) clearly has $|\hat{\phi}_1| < 1$, so it corresponds to a causal-stationary regime.

Causal Stationarity for AR($p$) with $p \geq 2$

When $p \geq 2$, causal stationarity is determined by the roots of the AR polynomial $\phi(z) = 1 - \phi_1 z - \dots - \phi_p z^p$. Specifically, the AR($p$) model is causal-stationary if every root of the AR polynomial has modulus strictly larger than 1.
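
The following small helper (not part of the original notes) packages this check: it builds the AR polynomial from the coefficients and verifies that all roots lie strictly outside the unit circle.

def is_causal_stationary(phi):
    # phi = [phi_1, ..., phi_p] for the AR polynomial 1 - phi_1 z - ... - phi_p z^p
    poly_increasing = np.r_[1.0, -np.asarray(phi)]  # coefficients of 1, z, ..., z^p
    roots = np.roots(poly_increasing[::-1])  # np.roots expects highest power first
    return np.all(np.abs(roots) > 1)

print(is_causal_stationary([0.2487]))  # the AR(1) fitted above: True
print(is_causal_stationary([1.05]))  # an explosive AR(1): False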

For our first example, consider the AR(1) model that we fitted to the data $y_t = 100 \left(\log \text{GNP}_t - \log \text{GNP}_{t-1} \right)$. This model is:

\begin{align*} y_t = 1.1596 + 0.2487 y_{t-1} + \epsilon_t. \end{align*}

This is an AR(1) with $\phi_1$ coefficient 0.2487, which has magnitude strictly smaller than 1, so it is a causal-stationary model. We can also rewrite this model as:

\begin{align*} 100 \left( \log \text{GNP}_t - \log \text{GNP}_{t-1}\right) = 1.1596 + 24.87 \left( \log \text{GNP}_{t-1} - \log \text{GNP}_{t-2} \right) + \epsilon_t. \end{align*}

If $G_t := 100 \log \text{GNP}_t$, then the above model in terms of $G_t$ becomes:

\begin{align*} G_t - G_{t-1} = 1.1596 + 0.2487 \left(G_{t-1} - G_{t-2} \right) + \epsilon_t \end{align*}

which is equivalent to

\begin{align*} G_t = 1.1596 + 1.2487 G_{t-1} - 0.2487 G_{t-2} + \epsilon_t. \end{align*}

This is an AR(2) model, and we can ask whether it is causal-stationary. For this, we need to calculate the roots of the AR(2) polynomial:

\begin{align*} 1 - 1.2487 z + 0.2487 z^2. \end{align*}

Its roots are given by

\begin{align*} \frac{1.2487 \pm \sqrt{1.2487^2 - 4 \times 0.2487}}{2 \times 0.2487} \end{align*}

These can be computed in Python as follows.

coeffs = [0.2487, -1.2487, 1]  # coefficients of z^2, z^1, z^0 (highest power first)
roots = np.roots(coeffs) #these are roots of the quadratic
print(roots)
#The magnitudes of the roots are calculated as:
print(np.abs(roots))
[4.02090873 1.        ]
[4.02090873 1.        ]

One of the roots has modulus strictly larger than 1, while the other has modulus exactly equal to 1 (a unit root). Hence this AR(2) model is not causal-stationary. This makes sense because this AR(2) model is for $G_t = 100 \log \text{GNP}_t$, which is a predominantly increasing dataset, so we cannot expect a stationary model for this data.
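
The unit root is not an accident: since this AR(2) for $G_t$ was obtained by undoing a differencing, its polynomial factors exactly as

\begin{align*} 1 - 1.2487 z + 0.2487 z^2 = (1 - z)(1 - 0.2487 z), \end{align*}

so the roots are $z = 1$ and $z = 1/0.2487 \approx 4.0209$, the latter essentially matching the root of the AR(1) fitted to the differenced data.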

When we fit an AR($p$) model using AutoReg, the summary displays the roots of the fitted AR polynomial along with their moduli. If all moduli are strictly larger than 1, then the model is causal-stationary.
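
The roots can also be read off programmatically rather than from the printed summary (a minimal sketch, using the .roots attribute of the fitted results object):

print(armd_logdiff.roots)  # roots of the fitted AR polynomial
print(np.abs(armd_logdiff.roots))  # their moduli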

Below we fit an AR($p$) model with $p \geq 2$ to the percent-change (100 times differences of logs) GNP dataset.

p = 2
armd_logdiff = AutoReg(ylogdiff, lags = p).fit()
print(armd_logdiff.summary())
                            AutoReg Model Results                             
==============================================================================
Dep. Variable:                      y   No. Observations:                  313
Model:                     AutoReg(2)   Log Likelihood                -504.061
Method:               Conditional MLE   S.D. of innovations              1.224
Date:                Fri, 07 Nov 2025   AIC                           1016.123
Time:                        15:41:21   BIC                           1031.082
Sample:                             2   HQIC                          1022.102
                                  313                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.9181      0.126      7.292      0.000       0.671       1.165
y.L1           0.1971      0.055      3.553      0.000       0.088       0.306
y.L2           0.2077      0.056      3.742      0.000       0.099       0.317
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.7704           +0.0000j            1.7704            0.0000
AR.2           -2.7192           +0.0000j            2.7192            0.5000
-----------------------------------------------------------------------------

Note that both moduli are strictly larger than 1, so this is a causal-stationary AR model. Keep in mind that this model is for the percent changes of GNP (and not for GNP or log(GNP) directly).

How to determine the order $p$ for AR($p$)

Here is the method: Calculate PACF($h$) for each $h = 1, 2, 3, \dots$. PACF($h$) is defined as the estimate of $\phi_h$ when the AR($h$) model is fit to the observed time series data. In other words, PACF(1) is the estimate of $\phi_1$ when AR(1) is fit to the data, PACF(2) is the estimate of $\phi_2$ when AR(2) is fit to the data, and so on. If PACF($h$) becomes negligible after a particular value of $h$ (say, after $h = p$), then we use $p$ as the AR order.

Here is the function for calculating the PACF.

def sample_pacf(dt, p_max):
    pautocorr = []
    for p in range(1, p_max + 1):
        armd = AutoReg(dt, lags = p).fit()  # fit AR(p) to the data
        phi_p = armd.params[-1]  # last parameter is the estimate of phi_p
        pautocorr.append(phi_p)
    return pautocorr

Below we apply it to the percent change GNP dataset.

p_max = 50 
sample_pacf_vals = sample_pacf(ylogdiff, p_max)
markerline, stemline, baseline = plt.stem(range(1, p_max + 1), sample_pacf_vals)
markerline.set_marker("None")
plt.xlabel("p")
plt.ylabel('Partial Correlation')
plt.title("Sample Partial Auto Correlation")
plt.show()
[Figure: Sample Partial Auto Correlation]

From the plot above, we see that the PACF($h$) values become somewhat negligible for $h \geq 3$. Based on this, $p = 2$ is a reasonable choice for fitting AR($p$) to the percent-change GNP data.
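
As a cross-check (this uses statsmodels' ar_select_order, which is not used elsewhere in these notes; maxlag = 12 is an arbitrary cap), we can let an information criterion pick the order:

from statsmodels.tsa.ar_model import ar_select_order
sel = ar_select_order(ylogdiff, maxlag = 12, ic = 'bic')  # BIC-based order selection
print(sel.ar_lags)  # lags selected by BIC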

Statsmodels has a built-in function for calculating the PACF. Below, we compute it for the percent-change GNP data and check whether it gives the same values as our function.

from statsmodels.tsa.stattools import pacf
pacf_values = pacf(ylogdiff, nlags=p_max, method = 'ols')
print(np.column_stack([sample_pacf_vals, pacf_values[1:]])) #check that these values match exactly 
[[ 0.24870755  0.24870755]
 [ 0.20772295  0.20772295]
 [-0.01181547 -0.01181547]
 [-0.05004494 -0.05004494]
 [-0.03595239 -0.03595239]
 [ 0.00802459  0.00802459]
 [ 0.08230295  0.08230295]
 [ 0.0316824   0.0316824 ]
 [ 0.09574666  0.09574666]
 [ 0.09594221  0.09594221]
 [ 0.03584216  0.03584216]
 [ 0.00112678  0.00112678]
 [-0.03725445 -0.03725445]
 [ 0.05155244  0.05155244]
 [ 0.07745691  0.07745691]
 [ 0.11526705  0.11526705]
 [ 0.09917843  0.09917843]
 [ 0.08106869  0.08106869]
 [ 0.04302832  0.04302832]
 [ 0.05237025  0.05237025]
 [-0.05067918 -0.05067918]
 [ 0.01267157  0.01267157]
 [-0.08404734 -0.08404734]
 [ 0.09698737  0.09698737]
 [ 0.14416186  0.14416186]
 [-0.07836845 -0.07836845]
 [-0.02209264 -0.02209264]
 [ 0.06157982  0.06157982]
 [-0.02871973 -0.02871973]
 [-0.02974936 -0.02974936]
 [-0.02387845 -0.02387845]
 [ 0.0406917   0.0406917 ]
 [ 0.03228025  0.03228025]
 [ 0.00545047  0.00545047]
 [ 0.00125922  0.00125922]
 [-0.0416868  -0.0416868 ]
 [-0.02112732 -0.02112732]
 [-0.12050451 -0.12050451]
 [-0.10229888 -0.10229888]
 [-0.01385521 -0.01385521]
 [-0.06736392 -0.06736392]
 [ 0.13019911  0.13019911]
 [ 0.0214844   0.0214844 ]
 [ 0.0473848   0.0473848 ]
 [ 0.0398027   0.0398027 ]
 [-0.04117905 -0.04117905]
 [-0.15896129 -0.15896129]
 [-0.07691234 -0.07691234]
 [ 0.01057553  0.01057553]
 [-0.08638415 -0.08638415]]

There is also a built-in function for plotting PACF values. This plot gives PACF($h$) for $h = 0, 1, 2, \dots$, with the default value PACF(0) = 1.

from statsmodels.graphics.tsaplots import plot_pacf
fig, axes = plt.subplots()
plot_pacf(ylogdiff, lags = p_max, ax = axes)
axes.set_title("Sample PACF of diff_log_GNP Series")
plt.show()
[Figure: Sample PACF of diff_log_GNP Series]

Compared to our plot of the sample_pacf values, the built-in PACF plot from statsmodels looks slightly different (even though it is plotting the same values). The differences are: (a) it also plots the value 1 at lag 0, and (b) it shades a confidence band that can be used to assess whether values are negligible or not.
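
To connect the two plots numerically, here is a rough sketch (assuming the shaded band corresponds to the usual $\pm 1.96/\sqrt{n}$ threshold at the 5% level):

n = len(ylogdiff)
band = 1.96 / np.sqrt(n)  # approximate 95% threshold for the sample PACF
print(band)
# lags at which the sample PACF sticks out of the band
print([h for h, v in enumerate(sample_pacf_vals, start = 1) if abs(v) > band])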