This lab covers concepts from Lectures 3 and 4 (measures of dependence), including:
Calculating mean functions
Calculating autocovariance for white noise, moving average, and random walk data
Calculating autocorrelation
Calculating cross-covariance
Calculating cross-correlation
For this lab, you will fill in the aspects of the code marked ... or with the comment # FILL IN
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
from statsmodels.tsa.stattools import acf, acovf, ccf
from statsmodels.graphics.tsaplots import plot_acf, plot_ccf
#!pip install astsa
import astsa # You should have pip installed this for Lab 1; if not, uncomment the line above
# Set the random seed, this is so you will generate the same answers
# each time (for example, when generating white noise)
np.random.seed(42)
# Let's use the `moving_average`, `white_noise`, and `random_walk_drift` functions from Lab 1.
# Here we also extend the white_noise function so it returns a matrix
# with `nt` time points and `nobs` observations.
def white_noise(nt, nobs=1, var=1):
    '''
    Generate time series of uncorrelated random variables, `w_t`,
    with mean 0 and finite variance `var`.
    If var=1 this is the standard normal distribution.
    Inputs:
        nt [int] : number of time points
        nobs [int] : number of white noise time series to generate
        var [float] : variance
    Output:
        w [np.array] : array of shape (nt, nobs)
    '''
    w = np.sqrt(var) * np.random.randn(nt, nobs)
    return w
def moving_average(w, n=3):
    '''
    Create an acausal (centered) moving average of the time series `w`
    using an `n`-point moving average (default 3).
    The first and last `n // 2` points are left as NaN.
    '''
    w = np.asarray(w)
    if w.ndim != 2:
        w = np.atleast_2d(w).T
    nt, nobs = w.shape
    v = np.full((nt, nobs), np.nan)
    half_win = n // 2
    for t in np.arange(half_win, nt - half_win):
        v[t, :] = np.mean(w[t - half_win : t + half_win + 1, :], axis=0)
    return v
def random_walk_drift(nt, drift, x0=0):
    '''
    Create a random walk with drift for `nt` time points, `drift` drift, and
    with initial value `x0`.
    Inputs:
        nt (int) : number of time points
        drift (float) : constant drift
        x0 (float) : initial value
    '''
    x = np.zeros((nt,))
    x[0] = x0
    for n in np.arange(1, nt):
        x[n] = x[n-1] + white_noise(1).item() + drift
    return x
Mean and Variance Function of a Moving Average Series¶
We showed in class that applying a moving average to our time series does not change the mean function, but that it does reduce the variance. Here, let’s define a matrix of observations of white noise time series w_matrix and apply the moving_average function to it to get w_matrix_ma.
Unlike Lab 1, we’re going to initialize both of these variables as matrices so we don’t have to loop through the individual observations. This will make our code much more efficient.
nt = 150
nobs = 1000
moving_average_win = 3
# Initialize white noise matrix `nt` time points and `nobs` observations
w_matrix = white_noise(nt, nobs)
# Initialize moving average version of w
w_matrix_ma = moving_average(w_matrix, n=moving_average_win)
# Now let's plot the mean functions of the white noise time
# series, the moving average noise series, and their variance
# functions.
# Try also changing the `moving_average_win` above
# to see how it affects the mean and variance
plt.figure()
plt.subplot(1,2,1)
plt.plot(w_matrix.mean(1))
plt.plot(w_matrix_ma.mean(1))
plt.gca().set_ylim([-1,1])
plt.xlabel('Time')
plt.ylabel('Mean')
plt.subplot(1,2,2)
plt.plot(w_matrix.var(1))
plt.plot(w_matrix_ma.var(1))
plt.xlabel('Time')
plt.ylabel('Variance');
plt.tight_layout()
As we showed in class, the mean function theoretically should not change when we apply a moving average to a white noise time series - it should remain 0.
So why, in practice, did we get means that were slightly different from 0? (Hint: try changing nobs.)
On the other hand, the variance does indeed change when we smooth the data - notice how it changes as you increase the number of points in the moving average (change the value of moving_average_win). For an $n$-point average of white noise with variance $\sigma_w^2$, the smoothed variance should hover around $\sigma_w^2/n$ (about 1/3 for the default 3-point window with unit variance).
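We can verify both claims empirically. The sample mean at each time point is an average over nobs independent draws, so it fluctuates around 0 with standard error $\sigma_w/\sqrt{\text{nobs}}$, which is why increasing nobs tightens the mean function. A minimal check, reusing w_matrix and w_matrix_ma from above:
# Empirical check of the theoretical mean and variance (a sketch;
# assumes w_matrix and w_matrix_ma from the cells above are in scope)
valid = ~np.isnan(w_matrix_ma[:, 0])  # drop the NaN edges of the moving average
print(f'mean (white noise): {w_matrix.mean():.4f} (theory: 0)')
print(f'mean (smoothed):    {w_matrix_ma[valid].mean():.4f} (theory: 0)')
print(f'var (white noise):  {w_matrix.var():.4f} (theory: 1)')
print(f'var (smoothed):     {w_matrix_ma[valid].var():.4f} '
      f'(theory: {1/moving_average_win:.4f})')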
Autocovariance and Autocorrelation¶
In our lectures we also talked about autocovariance, which measures the linear dependence between two points of the same series observed at different times $s$ and $t$. We can define the autocovariance function as:
$$\gamma_x(s, t) = \mathrm{Cov}(x_s, x_t) = E[(x_s - \mu_s)(x_t - \mu_t)]$$
where $\mu_s$ and $\mu_t$ are the expected values of $x_s$ and $x_t$.
Autocorrelation is the normalized form of autocovariance:
$$\rho_x(s, t) = \frac{\gamma_x(s, t)}{\sqrt{\gamma_x(s, s)\,\gamma_x(t, t)}}$$
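In practice we estimate these from data. For $n$ samples of a stationary series, the sample autocovariance at lag $h$ is $\hat\gamma(h) = \frac{1}{n}\sum_{t=0}^{n-h-1}(x_{t+h} - \bar{x})(x_t - \bar{x})$, and the sample autocorrelation is $\hat\rho(h) = \hat\gamma(h)/\hat\gamma(0)$. A minimal sketch of this estimator (the helper name sample_acovf is ours, not a statsmodels function):
# Sketch of the sample autocovariance estimator (divides by n, as in
# Shumway & Stoffer); `sample_acovf` is our own hypothetical helper
def sample_acovf(x, lag):
    x = np.asarray(x, dtype=float).ravel()
    n = len(x)
    xbar = x.mean()
    return np.sum((x[lag:] - xbar) * (x[:n - lag] - xbar)) / n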
Most commonly we will use built-in functions; for example, the statsmodels package provides acf (autocorrelation) and acovf (autocovariance). statsmodels also has a plotting tool, plot_acf, which computes the ACF and can plot confidence intervals.
# Autocovariance and autocorrelation of white noise
nt = 10000
w = white_noise(nt).squeeze() # squeeze to 1-D, since acf/acovf expect 1-D input
# Now let's try using the statsmodels package
acf_values = acf(w, nlags=100)
acovf_values = acovf(w, nlag=100)
# Use statsmodels plot_acf, which also calculates the acf values:
plot_acf(w, lags=100, alpha=0.05)
plt.title('Autocorrelation Function (ACF) Plot')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.show()
# The acovf doesn't have a separate plotting function, so we
# will just use a stem plot.
plt.stem(acovf_values)
plt.title('Autocovariance Function Plot')
plt.xlabel('Lags')
plt.ylabel('Autocovariance')
plt.show();

Why are the autocovariance and autocorrelation the same here? What if we multiply our white noise series w by a constant? What happens to the values at lag=0? Does this white noise time series satisfy the constraints for weak and strict stationarity?
# Make a new white noise
c = 10
v = (white_noise(nt, var=1) * c).squeeze() # New scaled white noise time series (1-D)
acf_values = acf(v, nlags=100)
acovf_values = acovf(v, nlag=100)
plot_acf(v, lags=100, alpha=0.05)
plt.title('Autocorrelation Function (ACF) Plot')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.show()
# The acovf doesn't have a separate plotting function...
plt.stem(acovf_values)
plt.title('Autocovariance Function Plot')
plt.xlabel('Lags')
plt.ylabel('Autocovariance')
plt.show();

Autocorrelation of moving average¶
Now let’s look at the autocorrelation function of a moving average with different window lengths n. Note that we have to drop the edge samples (the first and last n // 2 points), which are not defined and are stored as NaNs. You do not want to set these NaN values to 0, as this can induce artificial correlation structure in your data.
Is the moving average (weakly) stationary? If it is, the autocorrelation and autocovariance should depend only on the lag; for an n-point moving average of white noise, the theoretical ACF decays linearly with lag and is zero for lags of n and beyond.
Try changing the window length win and the number of time points nt. How do these affect what you observe?
win = 6
nt = 1000
w = white_noise(nt, var=2)
ma_data = moving_average(w, n=win)
plot_acf(ma_data[~np.isnan(ma_data)], lags=100);
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
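To see why zero-filling the edges is risky, here is a quick sketch comparing the ACF of the NaN-dropped series against the same series with its NaNs replaced by zeros. With only a few edge points the difference is subtle; it grows as you increase win relative to nt. (Both variable names below are ours.)
# Sketch: compare dropping NaNs vs. replacing them with zeros.
# Zero-filling biases the sample mean and the ACF estimates.
ma_dropped = ma_data[~np.isnan(ma_data)]
ma_zeroed = np.nan_to_num(ma_data.squeeze())  # NaNs -> 0 (don't do this!)
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
plot_acf(ma_dropped, lags=100, ax=axes[0], title='NaNs dropped')
plot_acf(ma_zeroed, lags=100, ax=axes[1], title='NaNs set to 0')
plt.tight_layout()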
Autocorrelation of random walk with drift¶
Now try plotting the autocorrelation for the random walk with drift, generating several examples of the random walk time series. What do you notice? Is it stationary? What changes if you change the scale of the drift (positive, negative, zero, large, small)?
nt = 100
rwd = random_walk_drift(nt, drift=0.01)
plot_acf(rwd, lags=len(rwd)-1);
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
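One way to generate the several examples suggested above is to loop over a few drift values and plot each ACF side by side (a sketch; the particular drift values are an arbitrary choice of ours):
# Sketch: ACFs of random walks with different drift scales
drifts = [-1, -0.01, 0, 0.01, 1]  # negative, small, zero, large
fig, axes = plt.subplots(1, len(drifts), figsize=(15, 3))
for ax, d in zip(axes, drifts):
    rwd = random_walk_drift(nt, drift=d)
    plot_acf(rwd, lags=len(rwd) - 1, ax=ax, title=f'drift={d}')
plt.tight_layout()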
Real data¶
We have some examples below from the astsa package of real data shown in the Shumway and Stoffer textbook. Try plotting the autocorrelation of these real datasets (or choose others that interest you!). What do you notice about the stationarity of the time series, and about the relationship between the autocorrelation and autocovariance?
First we’ll print the list of potential datasets from astsa again -- these start with the word load_
dir(astsa.datasets)
['__builtins__',
'__cached__',
'__doc__',
'__file__',
'__loader__',
'__name__',
'__package__',
'__spec__',
'load_EQ5',
'load_EXP6',
'load_Hare',
'load_Lynx',
'load_chicken',
'load_djia',
'load_fmri1',
'load_gtemp_land',
'load_gtemp_ocean',
'load_jj',
'load_rec',
'load_soi',
'load_speech',
'load_sunspotz',
'utils']
# Let's try the fMRI data
fmri_data = astsa.load_fmri1()
fmri_data
# Let's also load the hare and lynx datasets
hare_data = astsa.load_Hare()
lynx_data = astsa.load_Lynx()
print(hare_data)
print(lynx_data)
    Time  Value
0   1845  19.58
1   1846  19.60
2   1847  19.61
3   1848  11.99
4   1849  28.04
..   ...    ...
86  1931  19.52
87  1932  82.11
88  1933  89.76
89  1934  81.66
90  1935  15.76

[91 rows x 2 columns]
    Time  Value
0   1845  30.09
1   1846  45.15
2   1847  49.15
3   1848  39.52
4   1849  21.23
..   ...    ...
86  1931   8.31
87  1932  16.01
88  1933  24.82
89  1934  29.70
90  1935  35.40

[91 rows x 2 columns]
# Now let's look at the autocorrelation function of the fMRI data from a particular region ('thal2')
# We'll also try applying a moving average to our fMRI data, with a window length of 11
region = 'thal2'
win = 11
ma_fmri = moving_average(fmri_data[region], n=win)
plt.figure(figsize=(10,3))
plt.subplot(1,3,1)
plt.plot(fmri_data[region], label='original fMRI data')
plt.plot(ma_fmri, label='smoothed fMRI data')
plt.legend()
plt.subplot(1,3,2)
plot_acf(fmri_data[region], lags=60, ax=plt.gca())
plt.gca().axis('tight')
plt.xlabel('Lag')
plt.ylabel('Autocorrelation')
plt.subplot(1,3,3)
plot_acf(ma_fmri[~np.isnan(ma_fmri)], lags=60, ax=plt.gca(), color='orange', vlines_kwargs={'color': 'orange'});
plt.gca().axis('tight')
plt.title('Autocorrelation for smoothed data')
plt.xlabel('Lag')
plt.ylabel('Autocorrelation')

What happens to the autocorrelation of these data after smoothing? Try it for other brain areas - any of the column names of the fmri_data dataset should work (see the cell above where we printed it).
# Now try this for the hare and lynx dataset. We will look at the cross-correlation of these datasets later.
plt.figure(figsize=(10,6))
plt.subplot(2,2,1)
plt.plot(hare_data['Value'])
plt.xlabel('Time point')
plt.ylabel('Hare pelts')
plt.title('Hare data')
plt.subplot(2,2,2)
plot_acf(hare_data['Value'], lags=len(hare_data['Value'])-1, ax=plt.gca());
plt.xlabel('Lag')
plt.ylabel('Autocorrelation')
plt.title('ACF - Hare')
plt.subplot(2,2,3)
plt.plot(lynx_data['Value'])
plt.xlabel('Time point')
plt.ylabel('Lynx pelts')
plt.title('Lynx data')
plt.subplot(2,2,4)
plot_acf(lynx_data['Value'], lags=len(lynx_data['Value'])-1, ax=plt.gca());
plt.xlabel('Lag')
plt.ylabel('Autocorrelation')
plt.title('ACF - Lynx');
plt.tight_layout()
A Practical use of autocorrelation¶
Finally, let’s try this for the speech dataset. Here, the autocorrelation function can be used to estimate the pitch of the speaker’s voice, if we express the time lags in seconds rather than in samples. We know that the sampling rate of the speech data is 10000 Hz (given in your book).
We can look for peaks in the autocorrelation function: large peaks outside of lag=0 correspond to the pitch period and its multiples (harmonics).
speech_data = astsa.load_speech()
sampling_rate = 10000 # This is the sampling rate of the speech data in samples per sec (Hz), according to SS
t = (speech_data['Time']-1)/sampling_rate # Subtract 1 so time series starts from 0 seconds
speech_acf = acf(speech_data['Value'], nlags=250)
plot_acf(speech_data['Value'], lags=250);
plt.gca().set_xticks(np.arange(251,step=50), labels=[t[a] for a in np.arange(251,step=50)]);
plt.xlabel('Time (s)')
# It looks like there is a large peak around t=0.01 seconds (and tellingly, another
# around t=0.02 seconds, which is a multiple). t=0.01 seconds is the pitch period,
# but typically we convert this to a frequency by dividing 1 second by this amount:
print(f'The pitch is approximately {1/0.01} Hz')
# We could use a function like `scipy.signal.find_peaks` to do this more accurately,
# but for now that is beyond the scope of the lab (you can try to implement it yourself!)
The pitch is approximately 100.0 Hz

Cross-covariance and cross-correlation¶
Often we don’t just want to describe the relationship between time points within the same series; we want to describe the relationship between multiple time series. For example, how does the time course of reported cases of a disease relate to the time course of reported deaths? How do the numbers of lynx and hare pelts relate to each other in our earlier example?
In this case, we can use the cross-correlation (or cross-covariance) of the two time series.
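The sample cross-covariance between series $x$ and $y$ at lag $h \ge 0$ is $\hat\gamma_{xy}(h) = \frac{1}{n}\sum_{t=0}^{n-h-1}(x_{t+h} - \bar{x})(y_t - \bar{y})$, normalized by $\sqrt{\hat\gamma_x(0)\,\hat\gamma_y(0)}$ to get the cross-correlation. A minimal sketch of the estimator (the helper name sample_ccf is ours; the statsmodels ccf used below computes the same quantity for all nonnegative lags):
# Sketch of the sample cross-correlation estimator for a nonnegative lag h
def sample_ccf(x, y, h):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    gxy = np.sum((x[h:] - x.mean()) * (y[:n - h] - y.mean())) / n
    return gxy / np.sqrt(x.var() * y.var())  # x.var() is gamma_x(0)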
#plot_ccf(lynx_data['Value'], hare_data['Value'], lags=len(lynx_data['Value'])-1, negative_lags=True)
x = lynx_data['Value'].values
y = hare_data['Value'].values
max_lag = 30 # Let's choose some value here explicitly
# Positive lags: lynx data leads hare data
ccf_xy = ccf(x, y, adjusted=False)[:max_lag+1]
# Negative lags: hare data leads lynx data
ccf_yx = ccf(y, x, adjusted=False)[:max_lag+1]
# Build symmetric lag axis
lags = np.arange(-max_lag, max_lag+1)
# Combine (exclude duplicate lag 0)
ccf_full = np.r_[ccf_yx[1:][::-1], ccf_xy]
plt.stem(lags, ccf_full)
plt.axvline(0, color='r')
plt.xlabel('Lag')
plt.ylabel('CCF')
plt.title('positive: lynx leads hare, negative: hare leads lynx');
plt.grid()
Interpreting cross correlation¶
One point of caution: the real data examples presented here are usually not weakly stationary, so interpreting the cross-correlation analysis can be problematic. In the future, we will use other tools such as prewhitening and detrending before we run the cross-correlation analysis. See Examples 1.29 and 1.30 in your book (SS pages 34-35) to read more about this.
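As a small preview, one simple step is to remove trends before computing the CCF, for example by first-differencing both series (a sketch only, not the full prewhitening procedure from the book; reuses x, y, and max_lag from the cell above):
# Sketch: first-difference both series to reduce nonstationarity,
# then recompute the cross-correlation (not full prewhitening)
dx, dy = np.diff(x), np.diff(y)
plt.stem(np.arange(max_lag + 1), ccf(dx, dy, adjusted=False)[:max_lag + 1])
plt.xlabel('Lag')
plt.ylabel('CCF of differenced series');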