import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.optimize import minimize
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima_process import ArmaProcess, arma_acf, arma_pacf
from statsmodels.tsa.arima.model import ARIMA
Parameter Estimation in MA(1)¶
Estimating the parameters of ARMA models (as well as ARIMA and SARIMA models) is much harder than parameter estimation in AR models, which was handled by standard regression (ordinary least squares). We will not study this topic in any detail (and will simply rely on the ARIMA function for fitting these models to data). But today, we will discuss estimation for MA(1) to gain some insight into how estimation works in these models.
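For contrast, here is a minimal illustration of why AR(1) estimation is easy: one simply regresses $y_t$ on $y_{t-1}$ by ordinary least squares. (This cell is only a sketch; the helper name ar1_ols is ours and is not used later.)
import numpy as np
import statsmodels.api as sm

def ar1_ols(y):
    # Regress y_t on an intercept and y_{t-1}; the slope estimates the AR(1) coefficient.
    y = np.asarray(y)
    X = sm.add_constant(y[:-1])
    return sm.OLS(y[1:], X).fit().params  # [intercept, AR(1) coefficient]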
Recall that the MA(1) model is given by

$$y_t = \mu + \epsilon_t + \theta \epsilon_{t-1}$$

with $\epsilon_t \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. For parameter estimation, we need to write down the likelihood.
Likelihood¶
The joint density of $(y_1, \dots, y_n)$ is multivariate normal with mean vector $\mu \mathbf{1}$ and covariance matrix $\sigma^2 \Gamma_n$, where $\Gamma_n$ equals the $n \times n$ matrix whose $(i, j)^{th}$ entry is given by $1 + \theta^2$ when $i = j$, equals $\theta$ when $|i - j| = 1$, and equals 0 for all other $(i, j)$. The likelihood is therefore

$$L(\mu, \theta, \sigma) = \frac{1}{(2\pi \sigma^2)^{n/2} \sqrt{\det \Gamma_n}} \exp\left(-\frac{(y - \mu \mathbf{1})^T \Gamma_n^{-1} (y - \mu \mathbf{1})}{2\sigma^2}\right),$$

where $y$ is the vector with components $y_1, \dots, y_n$. This is a function of the unknown parameters $\mu, \theta, \sigma$, which can be estimated by maximizing the logarithm of the likelihood. The presence of $\Gamma_n^{-1}$ and $\det \Gamma_n$ makes this computationally expensive. Some (exact or approximate) formula should be used for $\Gamma_n^{-1}$ so that one does not need to invert an $n \times n$ matrix every time the log-likelihood is to be computed.
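To make the computational issue concrete, here is a small sketch (ours, purely illustrative; the name ma1_exact_loglik is not a statsmodels function) that evaluates this exact log-likelihood by forming $\Gamma_n$ and solving an $n \times n$ linear system:
def ma1_exact_loglik(params, y):
    # Exact Gaussian log-likelihood for MA(1), formed via the full n x n
    # covariance matrix sigma^2 * Gamma_n -- the expensive step described above.
    mu, theta, sigma = params
    y = np.asarray(y)
    n = len(y)
    Gamma = np.zeros((n, n))
    idx = np.arange(n)
    Gamma[idx, idx] = 1 + theta ** 2        # diagonal entries
    Gamma[idx[:-1], idx[:-1] + 1] = theta   # first superdiagonal
    Gamma[idx[1:], idx[1:] - 1] = theta     # first subdiagonal
    resid = y - mu
    _, logdet = np.linalg.slogdet(Gamma)
    quad = resid @ np.linalg.solve(Gamma, resid)
    return -0.5 * n * np.log(2 * np.pi * sigma ** 2) - 0.5 * logdet - quad / (2 * sigma ** 2)
Each evaluation requires solving an $n \times n$ linear system, which is precisely the cost that the conditional likelihood below avoids.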
Conditional Likelihood¶
Instead of the full likelihood, the form of the conditional likelihood when conditioned on $\epsilon_0 = 0$ is simpler. This is similar to the story for the AR(1) model, where the form of the likelihood when conditioned on $y_1$ is much simpler than the full likelihood. The difference here is that instead of conditioning on $y_1$ being the observed value, we are conditioning on $\epsilon_0 = 0$.
Below we write the conditional likelihood:

$$L_{\text{cond}}(\mu, \theta, \sigma) = f_{y_1, \dots, y_n \mid \epsilon_0 = 0}(y_1, \dots, y_n).$$

This likelihood can be broken down as

$$L_{\text{cond}}(\mu, \theta, \sigma) = f_{y_1 \mid \epsilon_0 = 0}(y_1) \times f_{y_2 \mid y_1, \epsilon_0 = 0}(y_2) \times \dots \times f_{y_n \mid y_1, \dots, y_{n-1}, \epsilon_0 = 0}(y_n).$$

Each of the terms above can be written explicitly. Let $\epsilon_1 = y_1 - \mu$ and, for $t \geq 2$,

$$\epsilon_t = y_t - \mu - \theta \epsilon_{t-1}.$$

Then

$$y_1 \mid \epsilon_0 = 0 \sim N(\mu, \sigma^2)$$

and

$$y_t \mid y_1, \dots, y_{t-1}, \epsilon_0 = 0 \sim N(\mu + \theta \epsilon_{t-1}, \sigma^2) \qquad \text{for } t \geq 2.$$

Thus the conditional likelihood given $\epsilon_0 = 0$ is given by

$$L_{\text{cond}}(\mu, \theta, \sigma) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{t=1}^{n} \epsilon_t^2\right).$$

The key term in the above likelihood is $\sum_{t=1}^{n} \epsilon_t^2$, which depends on the data and also on the parameters $\mu$ and $\theta$. So let us denote

$$S(\mu, \theta) := \sum_{t=1}^{n} \epsilon_t^2,$$

so that the likelihood and log-likelihood become

$$L_{\text{cond}}(\mu, \theta, \sigma) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{S(\mu, \theta)}{2\sigma^2}\right)$$

and

$$\log L_{\text{cond}}(\mu, \theta, \sigma) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{S(\mu, \theta)}{2\sigma^2}.$$

Thus the MLEs of $\mu$ and $\theta$ are obtained by minimizing $S(\mu, \theta)$, and the MLE of $\sigma$ is

$$\hat{\sigma} = \sqrt{\frac{S(\hat{\mu}, \hat{\theta})}{n}}.$$
def S_func(params, y): # this is the function S(\mu, \theta)
    mu, theta = params
    n = len(y)
    eps = np.zeros(n)
    eps[0] = y[0] - mu
    for t in range(1, n):
        eps[t] = y[t] - mu - theta * eps[t-1]
    S_val = np.sum(eps**2)
    return S_val
We will use the function ‘minimize’ from scipy.optimize to minimize S_func. First let us simulate some data from the MA(1) model with theta = -0.7 to evaluate the performance of our estimation strategy.
arma_process = ArmaProcess([1], [1, -0.7])  # MA lag polynomial 1 - 0.7B, i.e., theta = -0.7
dt = arma_process.generate_sample(nsample=400)
dt = dt + 5  # adding a mean of 5
The output from the ARIMA function is given below.
md = ARIMA(dt, order=(0, 0, 1)).fit()
print(md.summary())
SARIMAX Results
==============================================================================
Dep. Variable: y No. Observations: 400
Model: ARIMA(0, 0, 1) Log Likelihood -559.755
Date: Sat, 22 Nov 2025 AIC 1125.511
Time: 19:48:49 BIC 1137.485
Sample: 0 HQIC 1130.253
- 400
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 5.0092 0.017 296.502 0.000 4.976 5.042
ma.L1 -0.6566 0.037 -17.828 0.000 -0.729 -0.584
sigma2 0.9603 0.066 14.579 0.000 0.831 1.089
===================================================================================
Ljung-Box (L1) (Q): 0.53 Jarque-Bera (JB): 0.33
Prob(Q): 0.47 Prob(JB): 0.85
Heteroskedasticity (H): 0.90 Skew: 0.02
Prob(H) (two-sided): 0.54 Kurtosis: 3.14
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
Let us now minimize $S(\mu, \theta)$ and compare the resulting estimates of $\mu$ and $\theta$ with those reported by the ARIMA function.
# Initial guess: [mu_init, theta_init]
mu_init = np.mean(dt)
theta_init = 0
init_params = [mu_init, theta_init]
# Perform the optimization
result = minimize(S_func, init_params, args=(dt,))
print(result)
mu_hat, theta_hat = result.x
print("Estimated mu:", mu_hat)
print("Estimated theta:", theta_hat) message: Optimization terminated successfully.
success: True
status: 0
fun: 384.13856984891936
x: [ 5.009e+00 -6.582e-01]
nit: 10
jac: [ 0.000e+00 0.000e+00]
hess_inv: [[ 1.476e-04 -1.728e-06]
[-1.728e-06 6.930e-04]]
nfev: 51
njev: 17
Estimated mu: 5.009073896055675
Estimated theta: -0.6582279929829957
print(result.x)
print(md.params)
[ 5.0090739 -0.65822799]
[ 5.00917415 -0.65657733 0.96028995]
The estimates of $\mu$ and $\theta$ obtained by minimizing $S(\mu, \theta)$ are clearly very close to those reported by the ARIMA function. The estimate of $\sigma$ is given by $\hat{\sigma} = \sqrt{S(\hat{\mu}, \hat{\theta})/n}$.
sigma_hat = np.sqrt(S_func(result.x, dt) / len(dt))
print("Estimated sigma:", sigma_hat)
print("ARIMA reported sigma:", np.sqrt(md.params[2]))Estimated sigma: 0.9799726652424028
ARIMA reported sigma: 0.9799438524963823
Standard Errors corresponding to the parameter estimates¶
After obtaining the MLEs $\hat{\mu}$ and $\hat{\theta}$, the next step is to obtain standard errors. For this, we can use the Bayesian approach with the standard prior:

$$p(\mu, \theta, \sigma) \propto \frac{1}{\sigma}.$$

The posterior is then

$$p(\mu, \theta, \sigma \mid \text{data}) \propto \frac{1}{\sigma^{n+1}} \exp\left(-\frac{S(\mu, \theta)}{2\sigma^2}\right).$$

This is the joint posterior of $(\mu, \theta, \sigma)$. To obtain the posterior of $(\mu, \theta)$ (without $\sigma$), we integrate the above over $\sigma$ (from 0 to $\infty$). This integration is exactly the same as in Lecture 4, and we get

$$p(\mu, \theta \mid \text{data}) \propto \left(S(\mu, \theta)\right)^{-n/2}.$$

If $S(\mu, \theta)$ is a quadratic function of $(\mu, \theta)$, then this will be a $t$-density. However our $S(\mu, \theta)$ will involve higher powers of $\theta$ and is not quadratic. The posterior usually is fairly concentrated around the MLE though, so it makes sense to approximate $S(\mu, \theta)$ by a quadratic around $(\hat{\mu}, \hat{\theta})$. This is done by Taylor expansion as follows. Let $\alpha = (\mu, \theta)$ and $\hat{\alpha} = (\hat{\mu}, \hat{\theta})$. Taylor expansion for $\alpha$ near $\hat{\alpha}$ gives

$$S(\alpha) \approx S(\hat{\alpha}) + \frac{1}{2} (\alpha - \hat{\alpha})^T H (\alpha - \hat{\alpha}),$$

where we used $\nabla S(\hat{\alpha}) = 0$ because $\hat{\alpha}$ minimizes $S$. Here $H$ denotes the Hessian of $S$ at $\hat{\alpha}$.

Therefore

$$p(\alpha \mid \text{data}) \propto \left(S(\hat{\alpha}) + \frac{1}{2} (\alpha - \hat{\alpha})^T H (\alpha - \hat{\alpha})\right)^{-n/2} \propto \left(1 + \frac{(\alpha - \hat{\alpha})^T (H/2) (\alpha - \hat{\alpha})}{S(\hat{\alpha})}\right)^{-n/2}.$$

Comparing the above with the formula:

$$f(\alpha) \propto \left(1 + \frac{(\alpha - m)^T \Sigma^{-1} (\alpha - m)}{\nu}\right)^{-(\nu + p)/2}$$

for the $p$-variate $t$-density $t_{\nu}(m, \Sigma)$, we see that (ignoring the indicator function) the posterior of $\alpha = (\mu, \theta)$ is approximately the bivariate $t$-density $t_{n-2}(\hat{\alpha}, \hat{\Sigma})$ with

$$\hat{\Sigma} = \hat{\sigma}^2 \left(\frac{H}{2}\right)^{-1}, \qquad \hat{\sigma}^2 := \frac{S(\hat{\alpha})}{n - 2}.$$

So the standard errors corresponding to $\hat{\mu}$ and $\hat{\theta}$ can be obtained by taking the square roots of the diagonal entries of $\hat{\sigma}^2 (H/2)^{-1}$.
In order to compute these, we need to calculate the Hessian of $S(\mu, \theta)$ with respect to $(\mu, \theta)$. We will use numerical differentiation (specifically a function from the library numdifftools) for this.
import numdifftools as nd
# this library has functions to calculate first and second derivatives
alphaest = result.x
n = len(dt)
H = nd.Hessian(lambda alpha: S_func(alpha, dt), step = 1e-6)(alphaest)
sighat = np.sqrt(S_func(alphaest, dt) / (n - 2))
covmat = (sighat ** 2) * np.linalg.inv(0.5 * H)
stderrs = np.sqrt(np.diag(covmat))
# ---- Output ----
print("Estimated mu:", alphaest[0])
print("Estimated theta:", alphaest[1])
print("Estimated sigma:", sighat)
print("Covariance matrix:\n", covmat)
print("Standard errors:", stderrs)
# The standard errors of mu and theta reported by the ARIMA function are:
print("ARIMA standard errors:", md.bse[0:2])
Estimated mu: 5.009073896055675
Estimated theta: -0.6582279929829957
Estimated sigma: 0.9824318225976606
Covariance matrix:
[[ 2.84046626e-04 -2.35806811e-06]
[-2.35806811e-06 1.33611939e-03]]
Standard errors: [0.01685368 0.03655297]
ARIMA standard errors: [0.01689423 0.03682785]
It is easy to see that the standard errors obtained by our formula are very close to the standard errors reported by the ARIMA function.
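As an optional sanity check (our addition, not part of the original analysis): the output of minimize above already contains an approximation to $H^{-1}$ in its hess_inv field (for the default BFGS method), and since $(H/2)^{-1} = 2 H^{-1}$, the covariance matrix can also be approximated without numdifftools.
# Optional check: reuse the approximate inverse Hessian returned by BFGS.
# covmat = sighat^2 * (H/2)^{-1} = 2 * sighat^2 * H^{-1}, so this should be
# close to the numdifftools-based covmat computed above.
covmat_bfgs = 2 * (sighat ** 2) * result.hess_inv
print("Approximate standard errors (BFGS):", np.sqrt(np.diag(covmat_bfgs)))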
Varve Dataset from Lecture 21¶
Let us see if the numbers from the calculations above match those reported by ARIMA on the varve dataset (which we used in Lecture 21).
varve_data = pd.read_csv("varve.csv")
yraw = varve_data['x']
plt.plot(yraw)
plt.xlabel('Time')
plt.ylabel('Thickness')
plt.title('Glacial Varve Thickness')
plt.show()
In Lecture 21, we fit the MA(1) model to the differences of logarithms of this dataset.
# First take logarithms:
ylog = np.log(yraw)
plt.plot(ylog)
plt.xlabel('Time')
plt.ylabel('log(Thickness)')
plt.title('Logarithm of Glacial Varve Thickness')
plt.show()
# Then take differences:
ylogdiff = np.diff(ylog)
plt.plot(ylogdiff)
plt.xlabel('Time')
plt.ylabel('diff(log(Thickness))')
plt.title('Differenced Logarithm of Glacial Varve Thickness')
plt.show()
From the sample ACF and PACF plots for this data, MA(1) appears appropriate: the sample ACF essentially cuts off after lag one, while the sample PACF tails off.
L = 60
# Plot sample ACF/PACF
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 4))
plot_acf(ylogdiff, lags=L, ax=ax1, title='Sample ACF')
plot_pacf(ylogdiff, lags=L, ax=ax2, title='Sample PACF')
plt.tight_layout()
plt.show()
The MA(1) model can be fit to this data as follows.
mamod = ARIMA(ylogdiff, order=(0, 0, 1)).fit()
print(mamod.summary())
SARIMAX Results
==============================================================================
Dep. Variable: y No. Observations: 633
Model: ARIMA(0, 0, 1) Log Likelihood -440.678
Date: Sat, 22 Nov 2025 AIC 887.356
Time: 19:48:49 BIC 900.707
Sample: 0 HQIC 892.541
- 633
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0013 0.004 -0.280 0.779 -0.010 0.008
ma.L1 -0.7710 0.023 -33.056 0.000 -0.817 -0.725
sigma2 0.2353 0.012 18.881 0.000 0.211 0.260
===================================================================================
Ljung-Box (L1) (Q): 9.16 Jarque-Bera (JB): 7.58
Prob(Q): 0.00 Prob(JB): 0.02
Heteroskedasticity (H): 0.95 Skew: -0.22
Prob(H) (two-sided): 0.69 Kurtosis: 3.30
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
We now use our functions to check if they result in similar numbers for the estimates and standard errors.
def S_func(params, y): # this is the function S(\mu, \theta)
    mu, theta = params
    n = len(y)
    eps = np.zeros(n)
    eps[0] = y[0] - mu
    for t in range(1, n):
        eps[t] = y[t] - mu - theta * eps[t-1]
    S_val = np.sum(eps**2)
    return S_val
dt = ylogdiff
# Initial guess: [mu_init, theta_init]
mu_init = np.mean(dt)
theta_init = 0
init_params = [mu_init, theta_init]
# Perform the optimization
result = minimize(S_func, init_params, args=(dt,))
print(result)
mu_hat, theta_hat = result.x
print("Estimated mu:", mu_hat)
print("Estimated theta:", theta_hat) message: Optimization terminated successfully.
success: True
status: 0
fun: 149.0042362521212
x: [-1.137e-03 -7.728e-01]
nit: 12
jac: [ 0.000e+00 0.000e+00]
hess_inv: [[ 4.132e-05 4.927e-06]
[ 4.927e-06 2.550e-03]]
nfev: 63
njev: 21
Estimated mu: -0.0011366484142282629
Estimated theta: -0.7728310145141554
print(result.x)
print(mamod.params)
[-0.00113665 -0.77283101]
[-0.00125667 -0.77099236 0.23528045]
sigma_hat = np.sqrt(S_func(result.x, dt) / len(dt))
print("Estimated sigma:", sigma_hat)
print("MA reported sigma:", np.sqrt(mamod.params[2]))Estimated sigma: 0.485173925675123
MA reported sigma: 0.4850571581949342
The estimates are very close to each other. Below are the standard errors.
alphaest = result.x
n = len(dt)
H = nd.Hessian(lambda alpha: S_func(alpha, dt), step = 1e-6)(alphaest)
sighat = np.sqrt(S_func(alphaest, dt) / (n - 2))
covmat = (sighat ** 2) * np.linalg.inv(0.5 * H)
stderrs = np.sqrt(np.diag(covmat))
# ---- Output ----
print("Estimated mu:", alphaest[0])
print("Estimated theta:", alphaest[1])
print("Estimated sigma:", sighat)
print("Covariance matrix:\n", covmat)
print("Standard errors:", stderrs)
# The standard errors of mu and theta reported by the ARIMA function are:
print("MA standard errors:", mamod.bse[0:2])
Estimated mu: -0.0011366484142282629
Estimated theta: -0.7728310145141554
Estimated sigma: 0.48594221424137307
Covariance matrix:
[[1.94152441e-05 8.29322690e-07]
[8.29322690e-07 1.17034011e-03]]
Standard errors: [0.00440627 0.03421023]
The standard error corresponding to $\theta$ that we computed is a little off from the one reported by the ARIMA function (0.023 in the summary above), but the standard errors corresponding to $\mu$ are almost the same (about 0.004 in both cases).
Note that we did not calculate standard errors for $\sigma$. This can be done by calculating the posterior of $\sigma$ (which can be written in terms of the chi-squared distribution).
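As a rough sketch of what that calculation would look like (our addition; it uses the fact that, under the same quadratic approximation as above, $S(\hat{\alpha})/\sigma^2$ has approximately a $\chi^2_{n-2}$ posterior distribution):
# Hedged sketch: approximate 95% posterior interval for sigma, assuming
# S(alpha_hat)/sigma^2 | data is approximately chi-squared with n - 2
# degrees of freedom (same quadratic approximation as above).
from scipy.stats import chi2

S_min = S_func(alphaest, dt)
sig_lo, sig_hi = np.sqrt(S_min / chi2.ppf([0.975, 0.025], n - 2))
print("Approximate 95% posterior interval for sigma:", (sig_lo, sig_hi))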
The ARIMA function works with the full likelihood, unlike the analysis above which works with the conditional likelihood (conditioning on $\epsilon_0 = 0$). So the answers will be slightly different. The full likelihood is more complicated to write down.
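To get a sense of how small the difference is, one can evaluate both the exact log-likelihood (using the illustrative ma1_exact_loglik sketch from earlier) and the conditional log-likelihood at our estimates (again, our addition, not part of the lecture code):
# Compare the exact Gaussian log-likelihood (sketched near the start) with
# the conditional log-likelihood, both evaluated at our estimates for the
# differenced log varve data.
n = len(dt)
sig2_hat = S_func(result.x, dt) / n
exact_ll = ma1_exact_loglik([result.x[0], result.x[1], np.sqrt(sig2_hat)], dt)
cond_ll = -0.5 * n * np.log(2 * np.pi * sig2_hat) - S_func(result.x, dt) / (2 * sig2_hat)
print("Exact log-likelihood:", exact_ll)
print("Conditional log-likelihood:", cond_ll)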