Here you will be practicing nonlinear sinusoidal regressions on some fake data (that we will generate). Go through each of the cells and fill in the parts that say #FILLIN
import numpy as np
import math
from matplotlib import pyplot as plt
import statsmodels.api as sm
Creating a sinusoid¶
We will first make a sinusoid as we discussed this week in class. This is defined as

$$y_t = \beta_0 + R\cos(2\pi f t + \phi) + \epsilon_t$$

where $\epsilon_t \sim N(0, \sigma^2_\epsilon)$. We will also specify a sampling rate fs, a duration of our signal in seconds, and the initial parameters $\beta_0, R, f, \phi, \sigma^2_\epsilon$.
Recall that:
$\beta_0$ is the intercept that shifts the sinusoid away from mean 0.
$f$ is the frequency of the oscillation (how many peaks in one second).
$R$ is the amplitude.
$\phi$ is the phase shift.
$\sigma^2_\epsilon$ is the variance of our white noise signal.
fs = 500 # sampling rate
duration = 2
t = np.arange(0,duration,step=1/fs)
B0 = 2
phi = 0
f = 3.2
R = 2.5
var_eps = 0.2 # Variance of white noise
# Our true sinusoid
y = B0 + R*np.cos(2*math.pi*f*t + phi) + np.sqrt(var_eps)*np.random.randn(len(t))
# Let's plot it and our x and y origin lines
plt.plot(t, y)
plt.axhline(0, color='k', linewidth=0.5)
plt.axvline(0, color='k', linewidth=0.5)
plt.xlabel('Time')
plt.ylabel('value')
Finding the frequency, $f$¶
Let’s assume we have received these data and we have no idea what the true $f$ is. We can find it with a grid search across a number of potential values, then do OLS using the relationship that we discussed, which is that we can rewrite $R\cos(2\pi f t + \phi)$ as:

$$R\cos(2\pi f t + \phi) = \beta_1 \cos(2\pi f t) + \beta_2 \sin(2\pi f t)$$

where $\beta_1 = R\cos(\phi)$ and $\beta_2 = -R\sin(\phi)$. We will still solve for these values with OLS, since we also don’t know what $R$ or $\phi$ are.
We will start with a linearly spaced vector of values for the potential $\hat{f}$. Looking at the graph above, we can count the peaks and narrow this down to probably ~3 peaks, so we don’t actually have to test everything, though if we wanted to be exhaustive, we would test frequencies from $0$ to the Nyquist limit, $f_s/2$.
We then fit the regression model using our $X$ matrix, defined as:

$$X = \begin{bmatrix} 1 & \cos(2\pi \hat{f} t_1) & \sin(2\pi \hat{f} t_1) \\ \vdots & \vdots & \vdots \\ 1 & \cos(2\pi \hat{f} t_n) & \sin(2\pi \hat{f} t_n) \end{bmatrix}$$

and repeat this for every value of $\hat{f}$, then plot the residual sum of squares (RSS). We are looking for the minimum value of this plot. The number of values of $\hat{f}$ we test in the grid determines how accurate we can be, so try varying the number of values nf.
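As a quick numeric sanity check (my addition, not part of the assignment), we can confirm that the linearized form really does equal the original sinusoid for arbitrary values of $R$ and $\phi$:

```python
import numpy as np

t = np.arange(0, 2, 1/500)
R, f, phi = 2.5, 3.2, 0.7  # arbitrary example values
lhs = R * np.cos(2 * np.pi * f * t + phi)

# Linearized form with beta1 = R*cos(phi) and beta2 = -R*sin(phi)
beta1, beta2 = R * np.cos(phi), -R * np.sin(phi)
rhs = beta1 * np.cos(2 * np.pi * f * t) + beta2 * np.sin(2 * np.pi * f * t)

print(np.allclose(lhs, rhs))  # True
```

This is just the angle-sum identity $\cos(a + b) = \cos a \cos b - \sin a \sin b$, which is what lets us turn the nonlinear problem in $\phi$ into a linear one in $\beta_1, \beta_2$.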
nf = 100 # Test this number of frequencies
f_grid = np.linspace(0,fs/2,nf)
RSS = np.zeros(len(f_grid))
for idx, f_hat in enumerate(f_grid):
    cos_col = np.cos(2 * np.pi * f_hat * t)
    sin_col = np.sin(2 * np.pi * f_hat * t)
    # create our X matrix
    X = np.column_stack([np.ones(len(t)), cos_col, sin_col])
    model = sm.OLS(y, X).fit()
    betas = model.params
    RSS[idx] = np.sum(model.resid ** 2)
plt.plot(f_grid, RSS,'-')
best_f = f_grid[RSS.argmin()] # Find the best f_hat
plt.axvline(best_f, color='r', linewidth=0.5, linestyle='--')
plt.text(best_f + 1, np.min(RSS), f'f={best_f:.3f}')
plt.xlabel('Frequency')
plt.ylabel('RSS')
Use our best $\hat{f}$ to calculate the other coefficients¶
From this we can choose the best $\hat{f}$, fit OLS again, and look at the parameters and performance of the model.
Again:

$$y_t = \beta_0 + \beta_1 \cos(2\pi \hat{f} t) + \beta_2 \sin(2\pi \hat{f} t) + \epsilon_t$$

but with only the best $\hat{f}$.
cos_col = np.cos(2 * np.pi * best_f * t )
sin_col = np.sin(2 * np.pi * best_f * t )
X = np.column_stack([np.ones(len(t)), cos_col, sin_col])
model = sm.OLS(y, X).fit()
betas = model.params
y_hat = model.fittedvalues
plt.figure()
plt.plot(y, label='y')
plt.plot(y_hat, label='yhat')
plt.legend()
plt.figure()
plt.plot(y, y_hat,'.')
plt.xlabel('y')
plt.ylabel('yhat')

Estimating the original parameters of the sinusoid¶
Recall we can get back the original parameters through their relationship to the $\beta$ values:

$$\hat{B}_0 = \hat{\beta}_0, \qquad \hat{R} = \sqrt{\hat{\beta}_1^2 + \hat{\beta}_2^2}, \qquad \hat{\phi} = \arctan\!\left(-\frac{\hat{\beta}_2}{\hat{\beta}_1}\right)$$

Let’s see what we get.
B0_hat = betas[0]
R_hat = np.sqrt(betas[1]**2 + betas[2]**2)
phi_hat = np.arctan(-betas[2]/betas[1])
print(f'B0={B0}\t B0_hat={B0_hat}')
print(f'R={R}\t R_hat={R_hat}')
print(f'phi={phi}\t phi_hat={phi_hat}')
print(f'f={f}\t f_hat={best_f}')
B0=2	 B0_hat=2.029404412553876
R=2.5 R_hat=0.4737475535789633
phi=0 phi_hat=1.0137021097636343
f=3.2 f_hat=2.525252525252525
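One caveat worth flagging (my note, not from the assignment): np.arctan only returns angles in $(-\pi/2, \pi/2)$, so when $\hat{\beta}_1 < 0$ the recovered phase lands in the wrong quadrant. np.arctan2 resolves the quadrant from the signs of both arguments:

```python
import numpy as np

# Hypothetical coefficients for a sinusoid with R = 2 and phi = 2.5 rad
R_true, phi_true = 2.0, 2.5
beta1, beta2 = R_true * np.cos(phi_true), -R_true * np.sin(phi_true)

phi_arctan = np.arctan(-beta2 / beta1)    # off by pi, since beta1 < 0 here
phi_arctan2 = np.arctan2(-beta2, beta1)   # recovers 2.5 correctly
print(phi_arctan, phi_arctan2)
```

With $\phi = 0$ as in our simulation this doesn't matter, but it is worth keeping in mind if you change phi above.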
How does this look? It’s okay, but not quite right. Try redoing the grid search below in a way that you think will make it more accurate.
nf = 10000 # Test this number of frequencies
f_grid = np.linspace(0,fs/2,nf)
# or
nf = 1000
f_grid = np.linspace(2,4,nf)
RSS = np.zeros(len(f_grid))
for idx, f_hat in enumerate(f_grid):
    cos_col = np.cos(2 * np.pi * f_hat * t)
    sin_col = np.sin(2 * np.pi * f_hat * t)
    # create our X matrix
    X = np.column_stack([np.ones(len(t)), cos_col, sin_col])
    model = sm.OLS(y, X).fit()
    betas = model.params
    RSS[idx] = np.sum(model.resid ** 2)
plt.plot(f_grid, RSS,'-')
best_f = f_grid[RSS.argmin()] # Find the best f_hat
plt.axvline(best_f, color='r', linewidth=0.5, linestyle='--')
plt.text(best_f + 1, np.min(RSS), f'f={best_f:.3f}')
plt.xlabel('Frequency')
plt.ylabel('RSS')
cos_col = np.cos(2 * np.pi * best_f * t )
sin_col = np.sin(2 * np.pi * best_f * t )
X = np.column_stack([np.ones(len(t)), cos_col, sin_col])
model = sm.OLS(y, X).fit()
betas = model.params
y_hat = model.fittedvalues
plt.figure()
plt.plot(y, label='y')
plt.plot(y_hat, label='yhat')
plt.legend()

Compare $y$ and $\hat{y}$ directly¶
We can also just plot our actual data versus the estimate. If our linear model were perfect, we would see points along the unity line.
plt.figure()
plt.plot(y, y_hat,'.')
plt.plot(y,y,'-')
plt.xlabel('y')
plt.ylabel('yhat');
Estimating the original parameters of the sinusoid¶
Let’s get back the original parameters again with this (hopefully) improved regression:

$$\hat{B}_0 = \hat{\beta}_0, \qquad \hat{R} = \sqrt{\hat{\beta}_1^2 + \hat{\beta}_2^2}, \qquad \hat{\phi} = \arctan\!\left(-\frac{\hat{\beta}_2}{\hat{\beta}_1}\right)$$

Let’s see what we get.
B0_hat = betas[0]
R_hat = np.sqrt(betas[1]**2 + betas[2]**2)
phi_hat = np.arctan(-betas[2]/betas[1])
print(f'B0={B0}\t B0_hat={B0_hat}')
print(f'R={R}\t R_hat={R_hat}')
print(f'phi={phi}\t phi_hat={phi_hat}')
print(f'f={f}\t f_hat={best_f}')
B0=2	 B0_hat=1.9930655070550236
R=2.5 R_hat=2.4680140137480113
phi=0 phi_hat=-0.006247204538511174
f=3.2 f_hat=3.201201201201201
How does this look now?¶
What about new data?¶
We have fit our data using the sinusoid pretty successfully here. What if we tried to use this as a predictive model to forecast new data? Let’s generate some new data in the future from the same underlying sinusoid, then use the coefficients that we fit above on our “training set” to predict the new “test set” data.
# Generate some new data for some new time points in the future, t2
duration2 = 1
t2 = np.arange(t[-1],t[-1]+duration2,step=1/fs)
y_test = B0 + R*np.cos(2*math.pi*f*t2 + phi) + np.sqrt(var_eps)*np.random.randn(len(t2))
plt.plot(t, y, label='original training')
plt.plot(t2, y_test, label='new test data')
plt.legend()
plt.xlabel('t')
plt.ylabel('value');
# Make a new X matrix with the original best_f we estimated from the training data, and
# use t2 as our new time observations, then calculate our new estimate of y
cos_col = np.cos(2 * np.pi * best_f * t2 )
sin_col = np.sin(2 * np.pi * best_f * t2 )
X = np.column_stack([np.ones(len(t2)), cos_col, sin_col])
y_hat_test_orig = np.dot(X, model.params)
plt.plot(t, y)
plt.plot(t, model.fittedvalues)
plt.plot(t2, y_test)
plt.plot(t2, y_hat_test_orig)
train_RMSE = np.sqrt(np.mean((model.resid)**2))
test_RMSE = np.sqrt(np.mean((y_test - y_hat_test_orig)**2))
print(f'root mean squared error on training data: {train_RMSE}')
print(f'root mean squared error on test data: {test_RMSE}')
plt.figure()
plt.plot(t2, y_test - y_hat_test_orig)
plt.xlabel('time')
plt.ylabel('Residual')
plt.title('Residuals')
root mean squared error on training data: 0.44177693508270116
root mean squared error on test data: 0.4524990060606751


How does this look?¶
More noise¶
Now try re-running the analysis above using a higher noise level (i.e., set the error variance var_eps to be much larger). What do you notice? Fill in your edited code below, and first try doing a grid search over all frequencies from 0 to the Nyquist limit.
When do things break down?
fs = 500 # sampling rate
duration = 2
t = np.arange(0,duration,step=1/fs)
B0 = 2
phi = 0
f = 3.2
R = 2.5
var_eps = 1000 # Variance of white noise
# Our true sinusoid
y = B0 + R*np.cos(2*math.pi*f*t + phi) + np.sqrt(var_eps)*np.random.randn(len(t))
# Let's plot it and our x and y origin lines
plt.plot(t, y)
plt.axhline(0, color='k', linewidth=0.5)
plt.axvline(0, color='k', linewidth=0.5)
plt.xlabel('Time')
plt.ylabel('value')
nf = 1000 # Test this number of frequencies
f_grid = np.linspace(0,fs/2,nf)
# or
#nf = 1000
#f_grid = np.linspace(2,4,nf)
RSS = np.zeros(len(f_grid))
for idx, f_hat in enumerate(f_grid):
    cos_col = np.cos(2 * np.pi * f_hat * t)
    sin_col = np.sin(2 * np.pi * f_hat * t)
    # create our X matrix
    X = np.column_stack([np.ones(len(t)), cos_col, sin_col])
    model = sm.OLS(y, X).fit()
    betas = model.params
    RSS[idx] = np.sum(model.resid ** 2)
plt.plot(f_grid, RSS,'-')
best_f = f_grid[RSS.argmin()] # Find the best f_hat
plt.axvline(best_f, color='r', linewidth=0.5, linestyle='--')
plt.text(best_f + 1, np.min(RSS), f'f={best_f:.3f}')
plt.xlabel('Frequency')
plt.ylabel('RSS')
cos_col = np.cos(2 * np.pi * best_f * t )
sin_col = np.sin(2 * np.pi * best_f * t )
X = np.column_stack([np.ones(len(t)), cos_col, sin_col])
model = sm.OLS(y, X).fit()
betas = model.params
y_hat = model.fittedvalues
plt.figure()
plt.plot(y, label='y')
plt.plot(y_hat, label='yhat')
plt.legend()
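To make “when do things break down?” concrete, here is one possible sketch (my own, not part of the assignment): repeat the Nyquist-wide grid search for several noise variances and check how far the estimated frequency lands from the true $f = 3.2$. It uses np.linalg.lstsq in place of sm.OLS purely for speed; the fitted coefficients, and hence the RSS, are the same.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, duration = 500, 2
t = np.arange(0, duration, step=1/fs)
B0, R, f, phi = 2, 2.5, 3.2, 0
f_grid = np.linspace(0, fs/2, 1001)  # grid up to the Nyquist limit

results = {}
for var_eps in [0.2, 10, 100, 1000, 10000]:
    y = B0 + R*np.cos(2*np.pi*f*t + phi) + np.sqrt(var_eps)*rng.standard_normal(len(t))
    rss = np.empty(len(f_grid))
    for i, f_hat in enumerate(f_grid):
        X = np.column_stack([np.ones(len(t)),
                             np.cos(2*np.pi*f_hat*t),
                             np.sin(2*np.pi*f_hat*t)])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]  # same solution as sm.OLS
        resid = y - X @ beta
        rss[i] = resid @ resid  # residual sum of squares for this f_hat
    results[var_eps] = f_grid[np.argmin(rss)]
    print(f'var_eps={var_eps}: best_f={results[var_eps]:.3f}')
```

At low noise the minimum should sit near 3.2 (to within the grid spacing of 0.25 Hz); as var_eps grows, the RSS curve flattens and the argmin can land anywhere on the grid, which is exactly the breakdown we are probing.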
