This lab covers concepts from Lectures 3 and 4 (measures of dependence), including:
Calculating mean functions
Calculating autocovariance for white noise, moving average, and random walk data
Calculating autocorrelation
Calculating cross-covariance
Calculating cross-correlation
For this lab, you will fill in the aspects of the code marked ... or with the comment # FILL IN
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
from statsmodels.tsa.stattools import acf, acovf, ccf
from statsmodels.graphics.tsaplots import plot_acf, plot_ccf
#!pip install astsa
import astsa # You should have pip installed this for Lab 1; if not, uncomment the line above
# Set the random seed, this is so you will generate the same answers
# each time (for example, when generating white noise)
np.random.seed(42)
# Let's use the `moving_average`, `white_noise`, and `random_walk_drift` functions from Lab 1.
# Here we also extend the white_noise function so it returns a matrix
# with `nt` time points and `nobs` observations.
def white_noise(nt, nobs=1, var=1):
    '''
    Generate time series of uncorrelated random variables, `w_t`,
    with mean 0 and finite variance `var`.
    If var=1 this is the standard normal distribution.
    Inputs:
        nt [int] : number of time points
        nobs [int] : number of white noise time series to generate
        var [float] : variance
    Output:
        w [np.array] : array of shape (nt, nobs)
    '''
    w = np.sqrt(var) * np.random.randn(nt, nobs)
    return w
def moving_average(w, n=3):
    '''
    Create an acausal (centered) moving average of the time series `w`
    using an `n`-point moving average (default 3).
    The first and last `n // 2` points are left as NaN.
    '''
    w = np.asarray(w)
    if w.ndim != 2:
        w = np.atleast_2d(w).T
    nt, nobs = w.shape
    v = np.full((nt, nobs), np.nan)
    half_win = n // 2
    for t in np.arange(half_win, nt - half_win):
        v[t, :] = np.mean(w[t - half_win : t + half_win + 1, :], axis=0)
    return v
def random_walk_drift(nt, drift, x0=0):
    '''
    Create a random walk with drift for `nt` time points, `drift` drift, and
    with initial value `x0`.
    Inputs:
        nt (int) : number of time points
        drift (float) : constant drift
        x0 (float) : initial value
    '''
    x = np.zeros((nt,))
    x[0] = x0
    for n in np.arange(1, nt):
        x[n] = x[n-1] + white_noise(1).item() + drift
    return x
Mean and Variance Function of a Moving Average Series¶
We showed in class that applying a moving average to our time series does not change the mean function, but that it does reduce the variance. Here, let’s define a matrix of observations of white noise time series w_matrix and apply the moving_average function to it to get w_matrix_ma.
Unlike Lab 1, we’re going to initialize both of these variables as matrices so we don’t have to loop through the individual observations. This will make our code much more efficient.
nt = 150
nobs = 1000
moving_average_win = 3
# Initialize white noise matrix `nt` time points and `nobs` observations
w_matrix = white_noise(nt, nobs)
# Initialize moving average version of w
w_matrix_ma = moving_average(w_matrix, n=moving_average_win)
# Now let's plot the mean functions of the white noise time
# series, the moving average noise series, and their variance
# functions.
# Try also changing the `moving_average_win` above
# to see how it affects the mean and variance
plt.figure()
plt.subplot(1,2,1)
plt.plot(w_matrix.mean(1))
plt.plot(w_matrix_ma.mean(1))
plt.gca().set_ylim([-1,1])
plt.xlabel('Time')
plt.ylabel('Mean')
plt.subplot(1,2,2)
plt.plot(w_matrix.var(1))
plt.plot(w_matrix_ma.var(1))
plt.xlabel('Time')
plt.ylabel('Variance');
plt.tight_layout()
As we showed in class, the mean function theoretically should not change when we apply a moving average to a white noise time series - it should remain 0.
So why, in practice, did we get means that were slightly different from 0? (Hint: try changing nobs.)
On the other hand, the variance does indeed change when we smooth the data - notice how it changes as you increase the number of points in the moving average (change the value of moving_average_win). For an $n$-point average of white noise with variance $\sigma_w^2$, the smoothed variance should hover around $\sigma_w^2/n$ (about 1/3 for the default 3-point window with unit variance).
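We can verify both claims empirically. The sample mean at each time point is an average over nobs independent draws, so it fluctuates around 0 with standard error $\sigma_w/\sqrt{\text{nobs}}$, which is why increasing nobs tightens the mean function. A minimal check, reusing w_matrix and w_matrix_ma from above:
# Empirical check of the theoretical mean and variance (a sketch;
# assumes w_matrix and w_matrix_ma from the cells above are in scope)
valid = ~np.isnan(w_matrix_ma[:, 0])  # drop the NaN edges of the moving average
print(f'mean (white noise): {w_matrix.mean():.4f} (theory: 0)')
print(f'mean (smoothed):    {w_matrix_ma[valid].mean():.4f} (theory: 0)')
print(f'var (white noise):  {w_matrix.var():.4f} (theory: 1)')
print(f'var (smoothed):     {w_matrix_ma[valid].var():.4f} '
      f'(theory: {1/moving_average_win:.4f})')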
Autocovariance and Autocorrelation¶
In our lectures we also talked about autocovariance, which measures the linear dependence between two points of the same series observed at different times $s$ and $t$. We can define the autocovariance function as:
$$\gamma_x(s, t) = \mathrm{Cov}(x_s, x_t) = E[(x_s - \mu_s)(x_t - \mu_t)]$$
where $\mu_s$ and $\mu_t$ are the expected values of $x_s$ and $x_t$.
Autocorrelation is the normalized form of autocovariance:
$$\rho_x(s, t) = \frac{\gamma_x(s, t)}{\sqrt{\gamma_x(s, s)\,\gamma_x(t, t)}}$$
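In practice we estimate these from data. For $n$ samples of a stationary series, the sample autocovariance at lag $h$ is $\hat\gamma(h) = \frac{1}{n}\sum_{t=0}^{n-h-1}(x_{t+h} - \bar{x})(x_t - \bar{x})$, and the sample autocorrelation is $\hat\rho(h) = \hat\gamma(h)/\hat\gamma(0)$. A minimal sketch of this estimator (the helper name sample_acovf is ours, not a statsmodels function):
# Sketch of the sample autocovariance estimator (divides by n, as in
# Shumway & Stoffer); `sample_acovf` is our own hypothetical helper
def sample_acovf(x, lag):
    x = np.asarray(x, dtype=float).ravel()
    n = len(x)
    xbar = x.mean()
    return np.sum((x[lag:] - xbar) * (x[:n - lag] - xbar)) / n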
Most commonly we will use built-in functions; for example, the statsmodels package provides acf (autocorrelation) and acovf (autocovariance). statsmodels also has a plotting tool, plot_acf, which computes the ACF and can plot confidence intervals.
# Autocovariance and autocorrelation of white noise
nt = 10000
w = white_noise(nt).squeeze() # squeeze to 1-D, since acf/acovf expect 1-D input
# Now let's try using the statsmodels package
acf_values = acf(w, nlags=100)
acovf_values = acovf(w, nlag=100)
# Use statsmodels plot_acf, which also calculates the acf values:
plot_acf(w, lags=100, alpha=0.05)
plt.title('Autocorrelation Function (ACF) Plot')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.show()
# The acovf doesn't have a separate plotting function, so we
# will just use a stem plot.
plt.stem(acovf_values)
plt.title('Autocovariance Function Plot')
plt.xlabel('Lags')
plt.ylabel('Autocovariance')
plt.show();

Why are the autocovariance and autocorrelation the same here? What if we multiply our white noise series w by a constant? What happens to the values at lag=0? Does this white noise time series satisfy the constraints for weak and strict stationarity?
# Make a new white noise
c = 10
v = (white_noise(nt, var=1) * c).squeeze() # New scaled white noise time series (1-D)
acf_values = acf(v, nlags=100)
acovf_values = acovf(v, nlag=100)
plot_acf(v, lags=100, alpha=0.05)
plt.title('Autocorrelation Function (ACF) Plot')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.show()
# The acovf doesn't have a separate plotting function...
plt.stem(acovf_values)
plt.title('Autocovariance Function Plot')
plt.xlabel('Lags')
plt.ylabel('Autocovariance')
plt.show();

Autocorrelation of moving average¶
Now let’s look at the autocorrelation function of a moving average with different window lengths n. Note that we have to drop the edge samples (the first and last n // 2 points), which are not defined and are stored as NaNs. You do not want to set these NaN values to 0, as this can induce artificial correlation structure in your data.
Is the moving average (weakly) stationary? If it is, the autocorrelation and autocovariance should depend only on the lag; for an n-point moving average of white noise, the theoretical ACF decays linearly with lag and is zero for lags of n and beyond.
Try changing the window length win and the number of time points nt. How do these affect what you observe?
win = 6
nt = 1000
w = white_noise(nt, var=2)
ma_data = moving_average(w, n=win)
plot_acf(ma_data[~np.isnan(ma_data)], lags=100);
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
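To see why zero-filling the edges is risky, here is a quick sketch comparing the ACF of the NaN-dropped series against the same series with its NaNs replaced by zeros. With only a few edge points the difference is subtle; it grows as you increase win relative to nt. (Both variable names below are ours.)
# Sketch: compare dropping NaNs vs. replacing them with zeros.
# Zero-filling biases the sample mean and the ACF estimates.
ma_dropped = ma_data[~np.isnan(ma_data)]
ma_zeroed = np.nan_to_num(ma_data.squeeze())  # NaNs -> 0 (don't do this!)
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
plot_acf(ma_dropped, lags=100, ax=axes[0], title='NaNs dropped')
plot_acf(ma_zeroed, lags=100, ax=axes[1], title='NaNs set to 0')
plt.tight_layout()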
Autocorrelation of random walk with drift¶
Now try plotting the autocorrelation for the random walk with drift, generating several examples of the random walk time series. What do you notice? Is it stationary? What changes if you change the scale of the drift (positive, negative, zero, large, small)?
nt = 100
rwd = random_walk_drift(nt, drift=0.01)
plot_acf(rwd, lags=len(rwd)-1);
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
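One way to generate the several examples suggested above is to loop over a few drift values and plot each ACF side by side (a sketch; the particular drift values are an arbitrary choice of ours):
# Sketch: ACFs of random walks with different drift scales
drifts = [-1, -0.01, 0, 0.01, 1]  # negative, small, zero, large
fig, axes = plt.subplots(1, len(drifts), figsize=(15, 3))
for ax, d in zip(axes, drifts):
    rwd = random_walk_drift(nt, drift=d)
    plot_acf(rwd, lags=len(rwd) - 1, ax=ax, title=f'drift={d}')
plt.tight_layout()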
Real data¶
We have some examples below from the astsa package of real data shown in the Shumway and Stoffer textbook. Try plotting the autocorrelation of these real datasets (or choose others that interest you!). What do you notice about the stationarity of the time series, and about the relationship between the autocorrelation and autocovariance?
First we’ll print the list of potential datasets from astsa again -- these start with the word load_
dir(astsa.datasets)
['__builtins__',
'__cached__',
'__doc__',
'__file__',
'__loader__',
'__name__',
'__package__',
'__spec__',
'load_EQ5',
'load_EXP6',
'load_Hare',
'load_Lynx',
'load_chicken',
'load_djia',
'load_fmri1',
'load_gtemp_land',
'load_gtemp_ocean',
'load_jj',
'load_rec',
'load_soi',
'load_speech',
'load_sunspotz',
'utils']
# Let's try the fMRI data
fmri_data = astsa.load_fmri1()
fmri_data
# Let's also load the hare and lynx datasets
hare_data = astsa.load_Hare()
lynx_data = astsa.load_Lynx()
print(hare_data)
print(lynx_data)
    Time  Value
0   1845  19.58
1   1846  19.60
2   1847  19.61
3   1848  11.99
4   1849  28.04
..   ...    ...
86  1931  19.52
87  1932  82.11
88  1933  89.76
89  1934  81.66
90  1935  15.76

[91 rows x 2 columns]
    Time  Value
0   1845  30.09
1   1846  45.15
2   1847  49.15
3   1848  39.52
4   1849  21.23
..   ...    ...
86  1931   8.31
87  1932  16.01
88  1933  24.82
89  1934  29.70
90  1935  35.40

[91 rows x 2 columns]
# Now let's look at the autocorrelation function of the fMRI data from a particular region ('thal2')
# We'll also try applying a moving average to our fMRI data, with a window length of 11
region = 'thal2'
win = 11
ma_fmri = moving_average(fmri_data[region], n=win)
plt.figure(figsize=(10,3))
plt.subplot(1,3,1)
plt.plot(fmri_data[region], label='original fMRI data')
plt.plot(ma_fmri, label='smoothed fMRI data')
plt.legend()
plt.subplot(1,3,2)
plot_acf(fmri_data[region], lags=60, ax=plt.gca())
plt.gca().axis('tight')
plt.xlabel('Lag')
plt.ylabel('Autocorrelation')
plt.subplot(1,3,3)
plot_acf(ma_fmri[~np.isnan(ma_fmri)], lags=60, ax=plt.gca(), color='orange', vlines_kwargs={'color': 'orange'});
plt.gca().axis('tight')
plt.title('Autocorrelation for smoothed data')
plt.xlabel('Lag')
plt.ylabel('Autocorrelation')

What happens to the autocorrelation of these data after smoothing? Try it for other brain areas - any of the column names of the fmri_data dataset should work (see the cell above where we printed it).
# Now try this for the hare and lynx dataset. We will look at the cross-correlation of these datasets later.
plt.figure(figsize=(10,6))
plt.subplot(2,2,1)
plt.plot(hare_data['Value'])
plt.xlabel('Time point')
plt.ylabel('Hare pelts')
plt.title('Hare data')
plt.subplot(2,2,2)
plot_acf(hare_data['Value'], lags=len(hare_data['Value'])-1, ax=plt.gca());
plt.xlabel('Lag')
plt.ylabel('Autocorrelation')
plt.title('ACF - Hare')
plt.subplot(2,2,3)
plt.plot(lynx_data['Value'])
plt.xlabel('Time point')
plt.ylabel('Lynx pelts')
plt.title('Lynx data')
plt.subplot(2,2,4)
plot_acf(lynx_data['Value'], lags=len(lynx_data['Value'])-1, ax=plt.gca());
plt.xlabel('Lag')
plt.ylabel('Autocorrelation')
plt.title('ACF - Lynx');
plt.tight_layout()
A Practical use of autocorrelation¶
Finally, let’s try this for the speech dataset. Here, the autocorrelation function can be used to estimate the pitch of the speaker’s voice, if we express the time lags in seconds rather than in samples. We know that the sampling rate of the speech data is 10000 Hz (given in your book).
We can look for peaks in the autocorrelation function: large peaks outside of lag=0 correspond to the pitch period and its multiples (harmonics).
speech_data = astsa.load_speech()
sampling_rate = 10000 # This is the sampling rate of the speech data in samples per sec (Hz), according to SS
t = (speech_data['Time']-1)/sampling_rate # Subtract 1 so time series starts from 0 seconds
speech_acf = acf(speech_data['Value'], nlags=250)
plot_acf(speech_data['Value'], lags=250);
plt.gca().set_xticks(np.arange(251,step=50), labels=[t[a] for a in np.arange(251,step=50)]);
plt.xlabel('Time (s)')
# It looks like there is a large peak around t=0.01 seconds (and tellingly, another
# around t=0.02 seconds, which is a multiple). t=0.01 seconds is the pitch period,
# but typically we convert this to a frequency by dividing 1 second by this amount:
print(f'The pitch is approximately {1/0.01} Hz')
# We could use a function like `scipy.signal.find_peaks` to do this more accurately,
# but for now that is beyond the scope of the lab (you can try to implement it yourself!)
The pitch is approximately 100.0 Hz

Cross-covariance and cross-correlation¶
Often we don’t just want to describe the relationship between time points within the same series; we want to describe the relationship between multiple time series. For example, how does the time course of reported cases of a disease relate to the time course of reported deaths? How do the numbers of lynx and hare pelts relate to each other in our earlier example?
In this case, we can use the cross-correlation (or cross-covariance) of the two time series.
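The sample cross-covariance between series $x$ and $y$ at lag $h \ge 0$ is $\hat\gamma_{xy}(h) = \frac{1}{n}\sum_{t=0}^{n-h-1}(x_{t+h} - \bar{x})(y_t - \bar{y})$, normalized by $\sqrt{\hat\gamma_x(0)\,\hat\gamma_y(0)}$ to get the cross-correlation. A minimal sketch of the estimator (the helper name sample_ccf is ours; the statsmodels ccf used below computes the same quantity for all nonnegative lags):
# Sketch of the sample cross-correlation estimator for a nonnegative lag h
def sample_ccf(x, y, h):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    gxy = np.sum((x[h:] - x.mean()) * (y[:n - h] - y.mean())) / n
    return gxy / np.sqrt(x.var() * y.var())  # x.var() is gamma_x(0)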
#plot_ccf(lynx_data['Value'], hare_data['Value'], lags=len(lynx_data['Value'])-1, negative_lags=True)
x = lynx_data['Value'].values
y = hare_data['Value'].values
max_lag = 30 # Let's choose some value here explicitly
# Positive lags: lynx data leads hare data
ccf_xy = ccf(x, y, adjusted=False)[:max_lag+1]
# Negative lags: hare data leads lynx data
ccf_yx = ccf(y, x, adjusted=False)[:max_lag+1]
# Build symmetric lag axis
lags = np.arange(-max_lag, max_lag+1)
# Combine (exclude duplicate lag 0)
ccf_full = np.r_[ccf_yx[1:][::-1], ccf_xy]
plt.stem(lags, ccf_full)
plt.axvline(0, color='r')
plt.xlabel('Lag')
plt.ylabel('CCF')
plt.title('positive: lynx leads hare, negative: hare leads lynx');
plt.grid()
Interpreting cross correlation¶
One point of caution: the real data examples presented here are usually not weakly stationary, so interpreting the cross-correlation analysis can be problematic. In the future, we will use other tools such as prewhitening and detrending before we run the cross-correlation analysis. See Examples 1.29 and 1.30 in your book (SS pages 34-35) to read more about this.
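As a small preview, one simple step is to remove trends before computing the CCF, for example by first-differencing both series (a sketch only, not the full prewhitening procedure from the book; reuses x, y, and max_lag from the cell above):
# Sketch: first-difference both series to reduce nonstationarity,
# then recompute the cross-correlation (not full prewhitening)
dx, dy = np.diff(x), np.diff(y)
plt.stem(np.arange(max_lag + 1), ccf(dx, dy, adjusted=False)[:max_lag + 1])
plt.xlabel('Lag')
plt.ylabel('CCF of differenced series');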