
We now look at another example of nonlinear regression. The following dataset is downloaded from https://www.sidc.be/SILSO/datafiles (SILSO stands for Sunspot Index and Long-term Solar Observations). A description of the data can be found at https://www.sidc.be/SILSO/infosnytot. There are five columns:

  1. Column 1 is the year (for example, 2020.5 refers to the year 2020)

  2. Column 2 is the yearly mean total sunspot number (this is obtained by taking a simple arithmetic mean of the daily sunspot number over all the days for that year)

  3. Column 3 is the yearly mean standard deviation of the input sunspot numbers from individual stations (-1 indicates missing value)

  4. Column 4 is the number of observations used to compute the yearly mean sunspot number (-1 indicates a missing value)

  5. Column 5 is a definitive/provisional marker (1 indicates that the data point is definitive, and 0 indicates that it is still provisional)

We shall work with the data in column 2 (yearly mean total sunspot number).

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
#annual sunspots dataset:
sunspots = pd.read_csv('SN_y_tot_V2.0.csv', header = None, sep = ';')
print(sunspots.head())
sunspots.columns = ['year', 'sunspotsmean', 'sunspotssd', 'sunspotsnobs', 'isdefinitive']
print(sunspots.head(10))
        0     1    2  3  4
0  1700.5   8.3 -1.0 -1  1
1  1701.5  18.3 -1.0 -1  1
2  1702.5  26.7 -1.0 -1  1
3  1703.5  38.3 -1.0 -1  1
4  1704.5  60.0 -1.0 -1  1
     year  sunspotsmean  sunspotssd  sunspotsnobs  isdefinitive
0  1700.5           8.3        -1.0            -1             1
1  1701.5          18.3        -1.0            -1             1
2  1702.5          26.7        -1.0            -1             1
3  1703.5          38.3        -1.0            -1             1
4  1704.5          60.0        -1.0            -1             1
5  1705.5          96.7        -1.0            -1             1
6  1706.5          48.3        -1.0            -1             1
7  1707.5          33.3        -1.0            -1             1
8  1708.5          16.7        -1.0            -1             1
9  1709.5          13.3        -1.0            -1             1
y = sunspots['sunspotsmean']
print(y.head())
plt.plot(y)
plt.xlabel('Time (year)')
plt.ylabel('Number of sunspots')
plt.title('Annual sunspots data')
plt.show()
0     8.3
1    18.3
2    26.7
3    38.3
4    60.0
Name: sunspotsmean, dtype: float64
[Figure: annual sunspots data, number of sunspots vs. time (year)]

We will fit the sinusoidal model:

$$y_t = \beta_0 + \beta_1 \cos(2 \pi f t) + \beta_2 \sin(2 \pi f t) + \epsilon_t$$

with $\epsilon_t$ being i.i.d. $N(0, \sigma^2)$. When $f$ is known, this is a linear regression model. We can fit this model to the data and see how good the fit is while varying $f$. This will give us an idea of which frequencies give good fits to the data. Later we shall consider $f$ as an unknown parameter and discuss estimation and inference for $f$.

#Let us try some frequencies to get intuition:
n = len(y)
x = np.arange(1, n+1)
f = 1/10
#f = 1/11 #the Wikipedia article on sunspots states that the sunspot cycle is about 11 years
#f = 1/4
#f = 1/15
sinvals = np.sin(2 * np.pi * f * x)
cosvals = np.cos(2 * np.pi * f * x)
X = np.column_stack([np.ones(n), sinvals, cosvals])
md = sm.OLS(y, X).fit()
print(md.summary())

plt.plot(y)
plt.plot(md.fittedvalues)
plt.show()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           sunspotsmean   R-squared:                       0.132
Model:                            OLS   Adj. R-squared:                  0.127
Method:                 Least Squares   F-statistic:                     24.49
Date:                Thu, 18 Sep 2025   Prob (F-statistic):           1.26e-10
Time:                        22:46:41   Log-Likelihood:                -1778.8
No. Observations:                 325   AIC:                             3564.
Df Residuals:                     322   BIC:                             3575.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         78.8996      3.213     24.559      0.000      72.579      85.220
x1            -4.5149      4.543     -0.994      0.321     -13.453       4.423
x2            31.4722      4.543      6.928      0.000      22.535      40.410
==============================================================================
Omnibus:                        9.217   Durbin-Watson:                   0.360
Prob(Omnibus):                  0.010   Jarque-Bera (JB):                9.412
Skew:                           0.393   Prob(JB):                      0.00904
Kurtosis:                       2.720   Cond. No.                         1.41
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[Figure: annual sunspots data with the fitted sinusoid (f = 1/10) overlaid]

As before, we use the RSS to measure goodness of fit. We can vary $f$ over a grid of values and find the $f$ for which the RSS is minimized.

def rss(f):
    x = np.arange(1, n+1)
    xcos = np.cos(2 * np.pi * f * x)
    xsin = np.sin(2 * np.pi * f * x)
    X = np.column_stack([np.ones(n), xcos, xsin])
    md = sm.OLS(y, X).fit()
    rss = np.sum(md.resid ** 2)
    return rss
print(rss(1/15))
print(rss(1/4))
print(rss(1/11))
print(rss(1/10))
1235555.4316413044
1243980.0504058595
866129.2421378704
1079858.9542035519
num_f_vals = 10000 #this is the number of different values of f we will try
allfvals = np.linspace(0, 0.5, num_f_vals)
rssvals = np.array([rss(f) for f in allfvals])
plt.plot(allfvals, rssvals)
plt.show()
[Figure: RSS(f) plotted against the frequency grid]
fhat = allfvals[np.argmin(rssvals)]
print(fhat)
print(1/fhat) #this is the estimated periodicity in years
0.09090909090909091
11.0

To estimate $\beta_0, \beta_1, \beta_2$ and $\sigma$, we simply run linear regression with $f = \hat{f}$:

f = fhat
x = np.arange(1, n+1)
xcos = np.cos(2 * np.pi * f * x)
xsin = np.sin(2 * np.pi * f * x)
X = np.column_stack([np.ones(n), xcos, xsin])
md = sm.OLS(y, X).fit()
print(md.summary())
sighat = np.sqrt(np.sum(md.resid ** 2) / (n - 3))
print(md.params, sighat)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           sunspotsmean   R-squared:                       0.304
Model:                            OLS   Adj. R-squared:                  0.299
Method:                 Least Squares   F-statistic:                     70.26
Date:                Thu, 18 Sep 2025   Prob (F-statistic):           4.76e-26
Time:                        22:46:47   Log-Likelihood:                -1742.9
No. Observations:                 325   AIC:                             3492.
Df Residuals:                     322   BIC:                             3503.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         78.8760      2.877     27.414      0.000      73.215      84.537
x1           -38.2823      4.066     -9.414      0.000     -46.283     -30.282
x2           -29.2787      4.071     -7.192      0.000     -37.288     -21.269
==============================================================================
Omnibus:                       55.766   Durbin-Watson:                   0.383
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               91.522
Skew:                           0.998   Prob(JB):                     1.34e-20
Kurtosis:                       4.665   Cond. No.                         1.42
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
const    78.875992
x1      -38.282316
x2      -29.278663
dtype: float64 51.86369025520452

Let us now do uncertainty quantification for $f$. We use the following posterior formula, which is obtained by integrating the remaining parameters ($\beta_0, \beta_1, \beta_2$ and $\sigma$) out of the posterior:

$$\text{posterior}(f) \propto I\{0 < f < 1/2\} \, |X_f^T X_f|^{-1/2} \left(\frac{1}{RSS(f)}\right)^{(n-p)/2}$$
def logpost(f):
    x = np.arange(1, n+1)
    xcos = np.cos(2 * np.pi * f * x)
    xsin = np.sin(2 * np.pi * f * x)
    X = np.column_stack([np.ones(n), xcos, xsin])
    p = X.shape[1]
    md = sm.OLS(y, X).fit()
    rss = np.sum(md.resid ** 2)
    sgn, log_det = np.linalg.slogdet(np.dot(X.T, X)) #sgn gives the sign of the determinant (in our case, this should be 1)
    #log_det gives the logarithm of the absolute value of the determinant
    logval = ((p-n)/2) * np.log(rss) - (0.5)*log_det
    return logval

The main role in this calculation is played by the $RSS(f)$ term. The other term, $|X_f^T X_f|^{-1/2}$, generally does not vary much with $f$. This is demonstrated below.

def logpost_nodet(f): #here we are calculating the log posterior without the determinant term (just to see how much difference it makes)
    x = np.arange(1, n+1)
    xcos = np.cos(2 * np.pi * f * x)
    xsin = np.sin(2 * np.pi * f * x)
    X = np.column_stack([np.ones(n), xcos, xsin])
    p = X.shape[1]
    md = sm.OLS(y, X).fit()
    rss = np.sum(md.resid ** 2)
    sgn, log_det = np.linalg.slogdet(np.dot(X.T, X)) #sgn gives the sign of the determinant (in our case, this should be 1)
    #log_det gives the logarithm of the absolute value of the determinant
    logval = ((p-n)/2) * np.log(rss) #- (0.5)*log_det #we drop the determinant term
    return logval

While evaluating the log-posterior, we remove values of $f$ that are too close to 0 or 0.5.

allfvals_modified = allfvals[10:-10]
logpostvals = np.array([logpost(f) for f in allfvals_modified])
logpostvals_nodet = np.array([logpost_nodet(f) for f in allfvals_modified]) 

Below we plot the log-posterior with the determinant term (which is the correct formula), and without the determinant term.

plt.plot(allfvals_modified, logpostvals)
plt.plot(allfvals_modified, logpostvals_nodet, color = 'red')
plt.show()
[Figure: log-posterior of f with the determinant term (blue) and without it (red)]

The two curves above look largely parallel (except for frequencies $f$ close to the endpoints). This suggests that the determinant term has minimal impact on the posterior. Below we plot their difference.

plt.plot(allfvals_modified, logpostvals - logpostvals_nodet)
plt.show()
[Figure: difference between the log-posterior curves with and without the determinant term]

Again, the determinant term only has an effect near the endpoints. This shows that the main term in the posterior is the one involving the RSS.

allfvals_modified = allfvals[10:-10]
logpostvals = np.array([logpost(f) for f in allfvals_modified])
postvals = np.exp(logpostvals - np.max(logpostvals))
postvals = postvals/(np.sum(postvals))
plt.plot(allfvals_modified, logpostvals)
plt.show()
[Figure: log-posterior of f over the frequency grid]

The above plot looks similar to the plot of $RSS(f)$ (except it is upside down). The maximizer of the posterior density here is exactly the same as the MLE.

print(allfvals_modified[np.argmax(postvals)])
print(allfvals[np.argmin(rssvals)])
print(1/allfvals_modified[np.argmax(postvals)])
print(1/allfvals[np.argmin(rssvals)])
0.09090909090909091
0.09090909090909091
11.0
11.0

Below we plot the posterior. This curve is much more peaked than the log posterior plot. In fact, only one peak is visible in the posterior.

plt.plot(allfvals_modified, postvals)
plt.show()
[Figure: posterior density of f, showing a single sharp peak near 0.0909]

Below we find a 95% Bayesian uncertainty interval for $f$. We calculate the posterior probability around the posterior maximizer, and keep expanding the region until the probability reaches 0.95.

def PostProbAroundMax(m):
    #posterior probability of the 2m+1 grid points centered at the posterior maximizer
    est_ind = np.argmax(postvals)
    ans = np.sum(postvals[(est_ind-m):(est_ind+m+1)])
    return ans
print(PostProbAroundMax(0))
print(PostProbAroundMax(1))
print(PostProbAroundMax(6))
0.14143517124809554
0.4068681511534429
0.9779441017540897
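Rather than probing values of m by hand, the expansion can be automated with a short loop. The following is a minimal sketch, assuming the postvals array from above and a posterior peak well inside the grid (so the symmetric window never runs off the ends of the array):

#Keep widening the symmetric window around the posterior maximizer
#until it captures at least 95% of the posterior probability
#(assumes postvals from above; no handling of windows that hit the grid edges)
m = 0
while PostProbAroundMax(m) < 0.95:
    m += 1
print(m, PostProbAroundMax(m))

The manual probes above already show that m = 6 captures about 97.8% of the posterior probability, which is what the next cell uses.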
m = 6
est_ind = np.argmax(postvals)
f_est = allfvals_modified[est_ind]
#95% credible interval for f:
ci_f_low = allfvals_modified[est_ind - m]
ci_f_high = allfvals_modified[est_ind + m]
print(np.array([f_est, ci_f_low, ci_f_high])) #point estimate of f followed by the 95% uncertainty interval
[0.09090909 0.09060906 0.09120912]

To get the interval for the period (which is the inverse of frequency), do the following.

#For the period (1/frequency)
print(np.array([1/f_est, 1/ci_f_high, 1/ci_f_low]))
[11.         10.96381579 11.03642384]

This interval is highly concentrated around 11 years. The reason for such tight concentration is that $RSS(f)$ increases significantly even if $f$ changes slightly from $1/11$.

print(rss(1/11), rss(1/11.05), rss(1/10.95))
866129.2421378704 887294.206337193 891222.8998348333

The above clearly shows that the RSS increases significantly even if $f$ moves slightly away from $1/11$. This is the reason why we get such a narrow uncertainty interval for $f$ around $1/11$.

Efficient Computation of RSS(f) at Fourier Frequencies

Given a time series dataset $y_1, \dots, y_n$, the quantity $RSS(f)$ is defined by

$$RSS(f) := \min_{\beta_0, \beta_1, \beta_2} \sum_{t=1}^n \left(y_t - \beta_0 - \beta_1 \cos(2 \pi f t) - \beta_2 \sin(2 \pi f t)\right)^2$$

For each frequency $f$, $RSS(f)$ tells us how well the best sinusoid at frequency $f$ fits the data. This is used for identifying periodicities present in the data.

Computing $RSS(f)$ for each $f$ in a grid (as above) can be quite time-consuming if the sample size $n$ is large. For example, consider the following audio dataset. The file "Hear Piano Note - Middle C.mp3" is an audio recording about 14 seconds long, containing the sound of the Middle C note on a piano. The Python library librosa will be used for loading the audio file (see https://librosa.org/doc/latest/index.html for instructions on installing librosa, tutorials, etc.).

import librosa
y, sr = librosa.load("Hear Piano Note - Middle C.mp3")
n = len(y)
print(n)
print(sr)
print(n/sr)
301272
22050
13.66312925170068

This dataset has $n = 301272$ points; clearly this is a large time series dataset. The data (sound waveform) is plotted below.

plt.plot(y)
plt.xlabel("Time")
plt.ylabel("Sound Waveform")
plt.show()
[Figure: full sound waveform plotted against time]

The full plot of the data is not very revealing because the series is so long. But if we restrict to a smaller portion of the dataset, we can visualize the cyclical behavior more easily.

y_smallpart = y[50000:(50000 + 500)]
plt.plot(y_smallpart)
plt.xlabel('Time')
plt.title('A small segment of the full audio data')
plt.ylabel('Sound waveform')
plt.show()
[Figure: a small segment of the audio data, showing clear cyclical behavior]

In order to figure out the main frequencies present in the data, we can fit the sinusoidal model to this dataset. As before, we can compute $RSS(f)$ for each $f$ in a grid of frequencies in $[0, 1/2]$, and then find the best $f$ with the smallest $RSS(f)$. However, the following code takes far too long to run because of the large sample size.

ngrid = 10000
allfvals = np.linspace(0, 0.5, ngrid)
rssvals = np.array([rss(f) for f in allfvals])
plt.plot(allfvals, rssvals)
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[158], line 3
      1 ngrid = 10000
      2 allfvals = np.linspace(0, 0.5, ngrid)
----> 3 rssvals = np.array([rss(f) for f in allfvals])
      4 plt.plot(allfvals, rssvals)
...
KeyboardInterrupt: 

In the next lecture, we shall see how to compute the RSS much more efficiently at the Fourier frequencies ($f$ is a Fourier frequency if $nf$ is an integer). The basic formula is as follows. Recall first that, given a time series dataset $y_1, \dots, y_n$, the quantity $RSS(f)$ is defined by

$$RSS(f) := \min_{\beta_0, \beta_1, \beta_2} \sum_{t=1}^n \left(y_t - \beta_0 - \beta_1 \cos(2 \pi f t) - \beta_2 \sin(2 \pi f t)\right)^2$$

For each frequency ff, RSS(f)RSS(f) tells us how well the best sinusoid at frequency ff fits the data.

The following is an alternative formula for $RSS(f)$ which holds when $f \in (0, 1/2)$ is a Fourier frequency (i.e., $nf$ is an integer):

$$RSS(f) = \sum_{t} (y_t - \bar{y})^2 - 2 I(f)$$

where

$$I(f) := \frac{1}{n} \left| \sum_{t=1}^n y_t e^{-2 \pi i f t} \right|^2$$

is called the Periodogram. The quantity

$$\sum_{t=1}^n y_t e^{-2 \pi i f t}$$

when $f$ varies over the Fourier frequencies in $(0, 1/2)$, is called the Discrete Fourier Transform (DFT) of the data $y_1, \dots, y_n$. An efficient algorithm called the FFT (Fast Fourier Transform) allows fast computation of the DFT. We shall discuss the DFT in more detail in the next lecture, and see how it can be used to compute $RSS(f)$ efficiently at Fourier frequencies $f$.
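As a preview, the following is a minimal sketch (not the implementation we will develop next lecture) of how this formula can be used with numpy's FFT. The helper name rss_fourier is ours, and y is assumed to be a one-dimensional numpy array, such as the sunspots series from earlier. Note that np.fft.fft sums over t = 0, ..., n-1 while our definition sums over t = 1, ..., n; the two differ only by a unit-modulus phase factor, so the periodogram, and hence RSS(f), is unchanged.

import numpy as np

def rss_fourier(y):
    #RSS(f) at the Fourier frequencies f = j/n with 0 < f < 1/2,
    #using the identity RSS(f) = sum_t (y_t - ybar)^2 - 2*I(f)
    y = np.asarray(y, dtype=float)
    n = len(y)
    dft = np.fft.fft(y) #DFT values at f = j/n for j = 0, ..., n-1
    pgram = (np.abs(dft) ** 2) / n #periodogram I(f)
    j = np.arange(1, (n - 1) // 2 + 1) #indices with 0 < j/n < 1/2
    fvals = j / n
    rssvals = np.sum((y - y.mean()) ** 2) - 2 * pgram[j]
    return fvals, rssvals

#Example usage:
#fvals, rssvals = rss_fourier(y)
#print(1/fvals[np.argmin(rssvals)]) #estimated period at the best Fourier frequency

Because the FFT runs in O(n log n) time, this evaluates RSS(f) at all Fourier frequencies at once, avoiding a separate least squares fit for each $f$ in the grid.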