
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.optimize import minimize
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima_process import ArmaProcess, arma_acf, arma_pacf
from statsmodels.tsa.arima.model import ARIMA

Parameter Estimation in MA(1)

Estimating the parameters of ARMA models (and of ARIMA and SARIMA models) is much harder than parameter estimation in AR models, which was handled by standard regression (ordinary least squares). We will not study this topic in any detail (and will simply rely on the ARIMA function for fitting these models to data). But today we will discuss estimation for the MA(1) model to gain some insight into how estimation works in these models.

Recall that the MA(1) model is given by

$$y_t = \mu + \epsilon_t + \theta \epsilon_{t-1}$$

with $\epsilon_t \overset{\text{i.i.d}}{\sim} N(0, \sigma^2)$. For parameter estimation, we need to write down the likelihood.
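
As a quick illustration (a sketch, not part of the lecture), the model can be simulated directly from this equation; the parameter values below are arbitrary choices made only for illustration.

rng = np.random.default_rng(0)
mu_true, theta_true, sigma_true = 5.0, -0.7, 1.0     # illustrative parameter values
eps = rng.normal(0, sigma_true, size=401)            # epsilon_0, epsilon_1, ..., epsilon_400
y_sim = mu_true + eps[1:] + theta_true * eps[:-1]    # y_t = mu + eps_t + theta * eps_{t-1}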

Likelihood

The joint density of $y_1, \dots, y_n$ is multivariate normal with mean vector $m := (\mu, \dots, \mu)^T$ and covariance matrix $\Sigma$, where $\Sigma$ is the $n \times n$ matrix whose $(i, j)^{th}$ entry equals $\sigma^2 (1 + \theta^2)$ when $i = j$, equals $\sigma^2 \theta$ when $|i - j| = 1$, and equals 0 for all other $(i, j)$. The likelihood is therefore

$$\left(\frac{1}{\sqrt{2 \pi}} \right)^n \left(\det \Sigma \right)^{-1/2} \exp \left( - \frac{1}{2} (y - m)' \Sigma^{-1} (y - m)\right)$$

where $y$ is the $n \times 1$ vector with components $y_1, \dots, y_n$. This is a function of the unknown parameters $\mu, \theta, \sigma$, which can be estimated by maximizing the logarithm of the likelihood. The presence of $\Sigma^{-1}$ makes this computationally expensive: some (exact or approximate) formula for $\Sigma^{-1}$ is needed so that one does not have to invert an $n \times n$ matrix every time the log-likelihood is computed.
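
To make the computational issue concrete, here is a minimal sketch (not part of the original analysis) that builds the $n \times n$ covariance matrix $\Sigma$ explicitly and evaluates the exact Gaussian log-likelihood. The helper name ma1_full_loglik is ours; scipy.stats.multivariate_normal factorizes $\Sigma$ internally, which is exactly the expensive step for large $n$.

from scipy.stats import multivariate_normal

def ma1_full_loglik(params, y):
    # exact MA(1) log-likelihood via the explicit n x n covariance matrix
    mu, theta, sigma = params
    n = len(y)
    Sigma = np.zeros((n, n))
    idx = np.arange(n)
    Sigma[idx, idx] = sigma**2 * (1 + theta**2)        # diagonal entries
    Sigma[idx[:-1], idx[:-1] + 1] = sigma**2 * theta   # entries with |i - j| = 1
    Sigma[idx[:-1] + 1, idx[:-1]] = sigma**2 * theta
    return multivariate_normal(mean=np.full(n, mu), cov=Sigma).logpdf(np.asarray(y))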

Conditional Likelihood

Instead of the full likelihood, we can work with the conditional likelihood given $\epsilon_0 = 0$, which has a much simpler form. This is similar to the story for the AR(1) model, where the likelihood conditioned on $y_1$ is much simpler than the full likelihood. The difference here is that instead of conditioning on $y_1$ taking its observed value, we condition on $\epsilon_0 = 0$.

Below we write the conditional likelihood:

$$f_{y_1, \dots, y_n \mid \epsilon_0 = 0}(y_1, \dots, y_n).$$

This likelihood can be broken down as

$$f_{y_1 \mid \epsilon_0 = 0}(y_1) \, f_{y_2 \mid y_1, \epsilon_0 = 0}(y_2) \, f_{y_3 \mid y_1, y_2, \epsilon_0 = 0}(y_3) \dots f_{y_n \mid y_1, \dots, y_{n-1}, \epsilon_0 = 0}(y_n)$$

Each of the terms above can be written explicitly. Let $\hat{\epsilon}_1 = y_1 - \mu$ and, for $t = 2, \dots, n$,

$$\hat{\epsilon}_t = y_t - \mu - \theta \hat{\epsilon}_{t-1}.$$

Then

$$f_{y_1 \mid \epsilon_0 = 0}(y_1) = \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(y_1 - \mu)^2}{2 \sigma^2} \right)$$

and

$$
\begin{align*}
f_{y_t \mid y_1, \dots, y_{t-1}, \epsilon_0 = 0}(y_t) &= f_{y_t \mid \epsilon_1 = \hat{\epsilon}_1, \epsilon_2 = \hat{\epsilon}_2, \dots, \epsilon_{t-1} = \hat{\epsilon}_{t-1}, \epsilon_0 = 0}(y_t) \\
&= f_{\epsilon_t}(y_t - \mu - \theta \hat{\epsilon}_{t-1}) = \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{1}{2 \sigma^2} (y_t - \mu - \theta \hat{\epsilon}_{t-1})^2 \right).
\end{align*}
$$

Thus the conditional likelihood given $\epsilon_0 = 0$ is

$$
\begin{align*}
\text{likelihood} &= \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(y_1 - \mu)^2}{2 \sigma^2} \right) \left[\prod_{t=2}^n \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{1}{2 \sigma^2} (y_t - \mu - \theta \hat{\epsilon}_{t-1})^2 \right) \right] \\
&= \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{\hat{\epsilon}_1^2}{2 \sigma^2} \right) \left[\prod_{t=2}^n \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{\hat{\epsilon}_t^2}{2 \sigma^2} \right) \right] \\
&= \left(\frac{1}{\sigma \sqrt{2 \pi}} \right)^n \exp \left(-\frac{1}{2 \sigma^2} \sum_{t=1}^n \hat{\epsilon}_t^2 \right).
\end{align*}
$$
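
As a sanity check (a sketch, not from the lecture), we can verify numerically on a short simulated series that the term-by-term conditional densities multiply to the compact form just derived; the parameter values here are arbitrary.

from scipy.stats import norm

rng = np.random.default_rng(1)
mu0, theta0, sigma0 = 5.0, -0.7, 1.0                 # illustrative parameter values
e = rng.normal(0, sigma0, size=11)
y0 = mu0 + e[1:] + theta0 * e[:-1]                   # a short MA(1) sample (n = 10)

eps_hat = np.zeros(len(y0))
eps_hat[0] = y0[0] - mu0
for t in range(1, len(y0)):
    eps_hat[t] = y0[t] - mu0 - theta0 * eps_hat[t - 1]

term_by_term = norm.logpdf(eps_hat, scale=sigma0).sum()          # sum of the conditional log-densities
compact = -len(y0) / 2 * np.log(2 * np.pi * sigma0**2) - np.sum(eps_hat**2) / (2 * sigma0**2)
print(term_by_term, compact)                                     # the two numbers agree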

The key term in the above likelihood is $\sum_{t=1}^n \hat{\epsilon}_t^2$, which depends on the data $\{y_t\}$ as well as on the parameters $\mu$ and $\theta$. So let us denote

$$S(\mu, \theta) := \sum_{t=1}^n \hat{\epsilon}_t^2$$

so that the likelihood and log-likelihood become

$$\text{likelihood} = \left(\frac{1}{\sigma \sqrt{2 \pi}} \right)^n \exp \left(-\frac{S(\mu, \theta)}{2 \sigma^2} \right)$$

and

$$\text{log-likelihood} = -\frac{n}{2} \log (\sigma^2) - \frac{S(\mu, \theta)}{2 \sigma^2}.$$

Thus the MLEs of $\mu, \theta$ are obtained by minimizing $S(\mu, \theta)$, and the MLE of $\sigma$ is

$$\hat{\sigma} = \sqrt{\frac{S(\hat{\mu}, \hat{\theta})}{n}}.$$
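
To see where the formula for $\hat{\sigma}$ comes from, fix $(\mu, \theta)$ and set the derivative of the log-likelihood with respect to $\sigma^2$ equal to zero:

$$\frac{\partial}{\partial \sigma^2} \left( -\frac{n}{2} \log(\sigma^2) - \frac{S(\mu, \theta)}{2 \sigma^2} \right) = -\frac{n}{2 \sigma^2} + \frac{S(\mu, \theta)}{2 \sigma^4} = 0 \qquad \Longrightarrow \qquad \hat{\sigma}^2 = \frac{S(\hat{\mu}, \hat{\theta})}{n}.$$

The following function computes $S(\mu, \theta)$ using the recursion for $\hat{\epsilon}_t$.
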
def S_func(params, y): # this is the function S(\mu, \theta)
    mu, theta = params
    n = len(y)
    eps = np.zeros(n)
    eps[0] = y[0] - mu
    for t in range(1, n):
        eps[t] = y[t] - mu - theta * eps[t-1]
    S_val = np.sum(eps**2)
    return S_val

We will use the function minimize from scipy.optimize to minimize S_func. First, let us simulate some data from the MA(1) model with $\theta = -0.7$ to evaluate the performance of our estimation strategy.

arma_process = ArmaProcess([1], [1, -0.7])
dt = arma_process.generate_sample(nsample=400)
dt = dt + 5 # adding a mean of 5

The output from the ARIMA function is given below.

md = ARIMA(dt, order=(0, 0, 1)).fit()
print(md.summary())
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                  400
Model:                 ARIMA(0, 0, 1)   Log Likelihood                -559.755
Date:                Sat, 22 Nov 2025   AIC                           1125.511
Time:                        19:48:49   BIC                           1137.485
Sample:                             0   HQIC                          1130.253
                                - 400                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0092      0.017    296.502      0.000       4.976       5.042
ma.L1         -0.6566      0.037    -17.828      0.000      -0.729      -0.584
sigma2         0.9603      0.066     14.579      0.000       0.831       1.089
===================================================================================
Ljung-Box (L1) (Q):                   0.53   Jarque-Bera (JB):                 0.33
Prob(Q):                              0.47   Prob(JB):                         0.85
Heteroskedasticity (H):               0.90   Skew:                             0.02
Prob(H) (two-sided):                  0.54   Kurtosis:                         3.14
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Let us now minimize $S(\mu, \theta)$ and compare the resulting estimates of $\mu$ and $\theta$ with those reported by the ARIMA function.

# Initial guess: [mu_init, theta_init]
mu_init = np.mean(dt)
theta_init = 0
init_params = [mu_init, theta_init]

# Perform the optimization
result = minimize(S_func, init_params, args=(dt,))
print(result)
mu_hat, theta_hat = result.x

print("Estimated mu:", mu_hat)
print("Estimated theta:", theta_hat)
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 384.13856984891936
        x: [ 5.009e+00 -6.582e-01]
      nit: 10
      jac: [ 0.000e+00  0.000e+00]
 hess_inv: [[ 1.476e-04 -1.728e-06]
            [-1.728e-06  6.930e-04]]
     nfev: 51
     njev: 17
Estimated mu: 5.009073896055675
Estimated theta: -0.6582279929829957
print(result.x)
print(md.params)
[ 5.0090739  -0.65822799]
[ 5.00917415 -0.65657733  0.96028995]

The estimates of $\mu$ and $\theta$ obtained by minimizing $S(\mu, \theta)$ are clearly very close to those reported by the ARIMA function. The estimate of $\sigma$ is given by $\sqrt{S(\hat{\mu}, \hat{\theta})/n}$.

sigma_hat = np.sqrt(S_func(result.x, dt) / len(dt))
print("Estimated sigma:", sigma_hat)
print("ARIMA reported sigma:", np.sqrt(md.params[2]))
Estimated sigma: 0.9799726652424028
ARIMA reported sigma: 0.9799438524963823

Standard Errors Corresponding to the Parameter Estimates

After obtaining the MLEs $\hat{\mu}$ and $\hat{\theta}$, the next step is to obtain standard errors. For this, we can use the Bayesian approach with the standard prior:

$$\mu, \theta, \log \sigma \overset{\text{i.i.d}}{\sim} \text{unif}(-C, C).$$

The posterior is then

$$
\begin{align*}
f_{\mu, \theta, \sigma \mid \text{data}}(\mu, \theta, \sigma) &\propto \left(\frac{1}{\sqrt{2 \pi}\sigma}\right)^n \exp \left(-\frac{S(\mu, \theta)}{2 \sigma^2} \right) \times \frac{1}{\sigma} I\{ -C < \mu, \theta, \log \sigma < C\} \\
&\propto \sigma^{-n-1} \exp \left(-\frac{S(\mu, \theta)}{2 \sigma^2} \right) I\{ -C < \mu, \theta, \log \sigma < C\}
\end{align*}
$$

This is the joint posterior of $\mu, \theta, \sigma$. To obtain the posterior of $\mu, \theta$ (without $\sigma$), we integrate the above over $\sigma$ (from 0 to $\infty$). This integration is exactly the same as in Lecture 4, and we get

$$f_{\mu, \theta \mid \text{data}}(\mu, \theta) \propto \left(\frac{1}{S(\mu, \theta)} \right)^{n/2} I\{ -C < \mu, \theta < C\} \propto \left(\frac{S(\hat{\mu}, \hat{\theta})}{S(\mu, \theta)} \right)^{n/2} I\{ -C < \mu, \theta < C\}$$
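
As a quick illustration (a sketch, not from the lecture), this unnormalized posterior can be evaluated on a grid of $\theta$ values, with $\mu$ fixed at $\hat{\mu}$, to see how concentrated it is around the MLE; mu_hat, theta_hat and dt are the quantities computed above for the simulated data.

theta_grid = np.linspace(theta_hat - 0.2, theta_hat + 0.2, 201)
S_hat = S_func([mu_hat, theta_hat], dt)
post = np.array([(S_hat / S_func([mu_hat, th], dt)) ** (len(dt) / 2)
                 for th in theta_grid])
plt.plot(theta_grid, post / post.max())
plt.xlabel('theta')
plt.ylabel('unnormalized posterior (rescaled)')
plt.title('Posterior profile in theta (mu fixed at mu_hat)')
plt.show()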

If $S(\mu, \theta)$ were a quadratic function of $\mu, \theta$, this would be a $t$-density. However, our $S(\mu, \theta)$ involves higher powers of $\theta$ and is not quadratic. The posterior is usually fairly concentrated around the MLE $(\hat{\mu}, \hat{\theta})$ though, so it makes sense to approximate $S(\mu, \theta)$ by a quadratic around $(\hat{\mu}, \hat{\theta})$. This is done by Taylor expansion as follows. Let $\alpha = (\mu, \theta)$ and $\hat{\alpha} = (\hat{\mu}, \hat{\theta})$. Taylor expansion for $\alpha$ near $\hat{\alpha}$ gives

$$
\begin{align*}
S(\alpha) &= S(\hat{\alpha}) + \left<\nabla S(\hat{\alpha}), \alpha - \hat{\alpha} \right> + \left(\alpha - \hat{\alpha} \right)^T \left(\frac{1}{2} HS(\hat{\alpha}) \right) \left(\alpha - \hat{\alpha} \right) \\
&= S(\hat{\alpha}) + \left(\alpha - \hat{\alpha} \right)^T \left(\frac{1}{2} HS(\hat{\alpha}) \right) \left(\alpha - \hat{\alpha} \right)
\end{align*}
$$

where we used $\nabla S(\hat{\alpha}) = 0$ because $\hat{\alpha}$ minimizes $S(\alpha)$. Here $HS(\hat{\alpha})$ denotes the Hessian of $S$ at $\hat{\alpha}$.

Therefore

$$
\begin{align*}
& f_{\mu, \theta \mid \text{data}}(\mu, \theta) \\
&\propto \left(\frac{1}{S(\mu, \theta)} \right)^{n/2} I\{ -C < \mu, \theta < C\} \\
&\propto \left(\frac{S(\hat{\alpha})}{S(\alpha)} \right)^{n/2} I\{ -C < \mu, \theta < C\} \\
&= \left(\frac{S(\hat{\alpha})}{S(\hat \alpha) + \left(\alpha - \hat{\alpha} \right)^T \left(\frac{1}{2} HS(\hat{\alpha}) \right) \left(\alpha - \hat{\alpha} \right)} \right)^{n/2} I\{-C < \mu, \theta < C\} \\
&= \left(\frac{1}{1 + \left(\alpha - \hat{\alpha} \right)^T \left(\frac{1}{2 S(\hat{\alpha})} HS(\hat{\alpha}) \right) \left(\alpha - \hat{\alpha} \right)} \right)^{n/2} I\{-C < \mu, \theta < C\} \\
&= \left(\frac{1}{1 + \frac{1}{n-2}\left(\alpha - \hat{\alpha} \right)^T \left(\frac{n-2}{2 S(\hat{\alpha})} HS(\hat{\alpha}) \right) \left(\alpha - \hat{\alpha} \right)} \right)^{\frac{n-2+2}{2}} I\{-C < \mu, \theta < C\}.
\end{align*}
$$

Comparing the above with the formula:

$$\left(\frac{1}{1 + \frac{1}{k} (x - m)^T \Sigma^{-1} (x - m)} \right)^{\frac{k + p}{2}}$$

for the $p$-variate $t$-density $t_{k, p}(m, \Sigma)$, we see that (ignoring the indicator function $I\{-C < \mu, \theta < C\}$)

$$\alpha \mid \text{data} \sim t_{n-2, 2} \left(\hat{\alpha}, \frac{S(\hat{\alpha})}{n - 2} \left(\frac{1}{2} HS(\hat{\alpha}) \right)^{-1} \right).$$

So the standard errors corresponding to $\mu$ and $\theta$ can be obtained by taking the square roots of the diagonal entries of

$$\frac{S(\hat{\alpha})}{n - 2} \left(\frac{1}{2} HS(\hat{\alpha}) \right)^{-1}.$$

In order to compute these, we need to calculate the Hessian of $S(\alpha)$ with respect to $\alpha$. We will use numerical differentiation (specifically, a function from the library numdifftools) for this.

import numdifftools as nd 
# this library has functions to calculate first and second derivatives 

alphaest = result.x
n = len(dt)
H = nd.Hessian(lambda alpha: S_func(alpha, dt), step = 1e-6)(alphaest)

sighat = np.sqrt(S_func(alphaest, dt) / (n - 2))
covmat = (sighat ** 2) * np.linalg.inv(0.5 * H)
stderrs = np.sqrt(np.diag(covmat))

# ---- Output ----
print("Estimated mu:", alphaest[0])
print("Estimated theta:", alphaest[1])
print("Estimated sigma:", sighat)
print("Covariance matrix:\n", covmat)
print("Standard errors:", stderrs)

# The standard errors of mu and theta reported by the ARIMA function are:
print("ARIMA standard errors:", md.bse[0:2])
Estimated mu: 5.009073896055675
Estimated theta: -0.6582279929829957
Estimated sigma: 0.9824318225976606
Covariance matrix:
 [[ 2.84046626e-04 -2.35806811e-06]
 [-2.35806811e-06  1.33611939e-03]]
Standard errors: [0.01685368 0.03655297]
ARIMA standard errors: [0.01689423 0.03682785]

The standard errors obtained by our formula are clearly very close to the standard errors reported by the ARIMA function.
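
The approximate $t$ posterior can also be used beyond standard errors. Below is a hedged sketch (not part of the lecture) that samples from $t_{n-2, 2}(\hat{\alpha}, \text{covmat})$, using the representation of a multivariate $t$ draw as a normal draw divided by an independent scaled chi, and reads off 95% posterior intervals for $\mu$ and $\theta$.

rng = np.random.default_rng(0)
k = n - 2                                                # degrees of freedom
z = rng.multivariate_normal(np.zeros(2), covmat, size=100000)
w = rng.chisquare(k, size=(100000, 1))
draws = alphaest + z * np.sqrt(k / w)                    # draws from t_{k, 2}(alphaest, covmat)
print("95% posterior interval for mu:   ", np.percentile(draws[:, 0], [2.5, 97.5]))
print("95% posterior interval for theta:", np.percentile(draws[:, 1], [2.5, 97.5]))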

Varve Dataset from Lecture 21

Let us see if the numbers from the calculations above match those reported by ARIMA on the varve dataset (which we used in Lecture 21).

varve_data = pd.read_csv("varve.csv")
yraw = varve_data['x']
plt.plot(yraw)
plt.xlabel('Time')
plt.ylabel('Thickness')
plt.title('Glacial Varve Thickness')
plt.show()

In Lecture 21, we fit the MA(1) model to the differences of logarithms of this dataset.

# First take logarithms: 
ylog = np.log(yraw)
plt.plot(ylog)
plt.xlabel('Time')
plt.ylabel('log(Thickness)')
plt.title('Logarithm of Glacial Varve Thickness')
plt.show()
# Then take differences:
ylogdiff = np.diff(ylog)
plt.plot(ylogdiff)
plt.xlabel('Time')
plt.ylabel('diff(log(Thickness))')
plt.title('Differenced Logarithm of Glacial Varve Thickness')
plt.show()

From the sample ACF and PACF plots for this data, it is clear that an MA(1) model is appropriate.

L = 60
# Plot sample ACF/PACF
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 4))
plot_acf(ylogdiff, lags=L, ax=ax1, title='Sample ACF')
plot_pacf(ylogdiff, lags=L, ax=ax2, title='Sample PACF')
plt.tight_layout()
plt.show()

The MA(1) model can be fit to this data as follows.

mamod = ARIMA(ylogdiff, order=(0, 0, 1)).fit()
print(mamod.summary())
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                  633
Model:                 ARIMA(0, 0, 1)   Log Likelihood                -440.678
Date:                Sat, 22 Nov 2025   AIC                            887.356
Time:                        19:48:49   BIC                            900.707
Sample:                             0   HQIC                           892.541
                                - 633                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0013      0.004     -0.280      0.779      -0.010       0.008
ma.L1         -0.7710      0.023    -33.056      0.000      -0.817      -0.725
sigma2         0.2353      0.012     18.881      0.000       0.211       0.260
===================================================================================
Ljung-Box (L1) (Q):                   9.16   Jarque-Bera (JB):                 7.58
Prob(Q):                              0.00   Prob(JB):                         0.02
Heteroskedasticity (H):               0.95   Skew:                            -0.22
Prob(H) (two-sided):                  0.69   Kurtosis:                         3.30
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

We now use our functions to check if they result in similar numbers for the estimates and standard errors.

def S_func(params, y): # this is the function S(\mu, \theta)
    mu, theta = params
    n = len(y)
    eps = np.zeros(n)
    eps[0] = y[0] - mu
    for t in range(1, n):
        eps[t] = y[t] - mu - theta * eps[t-1]
    S_val = np.sum(eps**2)
    return S_val
dt = ylogdiff

# Initial guess: [mu_init, theta_init]
mu_init = np.mean(dt)
theta_init = 0
init_params = [mu_init, theta_init]

# Perform the optimization
result = minimize(S_func, init_params, args=(dt,))
print(result)
mu_hat, theta_hat = result.x

print("Estimated mu:", mu_hat)
print("Estimated theta:", theta_hat)
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 149.0042362521212
        x: [-1.137e-03 -7.728e-01]
      nit: 12
      jac: [ 0.000e+00  0.000e+00]
 hess_inv: [[ 4.132e-05  4.927e-06]
            [ 4.927e-06  2.550e-03]]
     nfev: 63
     njev: 21
Estimated mu: -0.0011366484142282629
Estimated theta: -0.7728310145141554
print(result.x)
print(mamod.params)
[-0.00113665 -0.77283101]
[-0.00125667 -0.77099236  0.23528045]
sigma_hat = np.sqrt(S_func(result.x, dt) / len(dt))
print("Estimated sigma:", sigma_hat)
print("MA reported sigma:", np.sqrt(mamod.params[2]))
Estimated sigma: 0.485173925675123
MA reported sigma: 0.4850571581949342

The estimates are very close to each other. Below are the standard errors.

alphaest = result.x
n = len(dt)
H = nd.Hessian(lambda alpha: S_func(alpha, dt), step = 1e-6)(alphaest)

sighat = np.sqrt(S_func(alphaest, dt) / (n - 2))
covmat = (sighat ** 2) * np.linalg.inv(0.5 * H)
stderrs = np.sqrt(np.diag(covmat))

# ---- Output ----
print("Estimated mu:", alphaest[0])
print("Estimated theta:", alphaest[1])
print("Estimated sigma:", sighat)
print("Covariance matrix:\n", covmat)
print("Standard errors:", stderrs)

# The standard errors of mu and theta reported by the ARIMA function are:
print("MA standard errors:", mamod.bse[0:2])
Estimated mu: -0.0011366484142282629
Estimated theta: -0.7728310145141554
Estimated sigma: 0.48594221424137307
Covariance matrix:
 [[1.94152441e-05 8.29322690e-07]
 [8.29322690e-07 1.17034011e-03]]
Standard errors: [0.00440627 0.03421023]

Comparing with the standard errors in the ARIMA summary above (roughly 0.004 for the constant and 0.023 for ma.L1), the standard error we computed for $\mu$ is very close, while the one we computed for $\theta$ is somewhat larger. Some discrepancy is expected here because the two approaches use different likelihoods, as explained below.

Note that we did not calculate standard errors for $\sigma$. This can be done by calculating the posterior of $\sigma$ (which can be written in terms of the chi-squared distribution).
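
For completeness, here is a hedged sketch (not from the lecture) of one way to do this: under the same quadratic approximation as above, the posterior of $\sigma^2$ is approximately $S(\hat{\mu}, \hat{\theta})$ divided by a $\chi^2_{n-2}$ random variable, so we can simulate posterior draws of $\sigma$ and report their standard deviation as a standard error.

rng = np.random.default_rng(0)
chi2_draws = rng.chisquare(n - 2, size=100000)               # chi-squared draws with n - 2 degrees of freedom
sigma_draws = np.sqrt(S_func(alphaest, dt) / chi2_draws)     # posterior draws of sigma
print("posterior mean of sigma:", sigma_draws.mean())
print("posterior std of sigma (standard error):", sigma_draws.std())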

The ARIMA function works with the full likelihood, unlike the analysis above which works with the conditional likelihood (conditioning on $\epsilon_0 = 0$). So the answers will be slightly different. The full likelihood is more complicated to write down.
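
As a rough check (a sketch; ma1_full_loglik is the helper we defined near the beginning of this note, not part of statsmodels), we can evaluate the exact Gaussian log-likelihood at the parameters fitted by ARIMA on the varve differences and compare it with the log-likelihood value that ARIMA itself reports; the two numbers should be close.

mu_fit, theta_fit, sigma2_fit = mamod.params
ll_exact = ma1_full_loglik([mu_fit, theta_fit, np.sqrt(sigma2_fit)], ylogdiff)
print("exact Gaussian log-likelihood:", ll_exact)
print("log-likelihood reported by ARIMA:", mamod.llf)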