We ran through parts of this notebook in class for Lecture 7 and Lecture 8. Now we will extend these analyses to look at our data in more detail and apply some of the regression techniques from class.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from matplotlib import pyplot as plt
import glob
import os

summary_df = pd.read_csv('Lab4_data/summary_df.csv')
df_all = pd.read_csv('Lab4_data/df_all.csv')
df_bins = pd.read_csv('Lab4_data/df_bins.csv')
google_form = pd.read_csv('Lab4_data/google_form.csv')

# summary_df has one row per observation and shows the number of taps for that observation
summary_df

# df_all has all of the data from everyone, including all the individual taps, as well
# as repeated metadata about the person ('subj') who completed the task
# For example, the first several rows are from subj 0, who is left handed and used
# their left index finger for the first part of the task
df_all

# df_bins is the binned data when we use 10 second bins -- this is very
# much an oversimplification of the trend over time, but for now
# is simple to look at
# For example, the first row is for subj 0, time_bin 0 corresponds to the
# first 10 seconds of their attempt, and they had 66 taps in that time bin.
# In the second 10 seconds, they had 54 taps.
df_bins

# google_form has additional data about whether they slept enough, whether
# they play video games more than 10 hrs a week, and how many years they
# played a sport. Remember this doesn't fully overlap with our other dataset
# and has a different number of rows, though we could try to match them up
# later
google_form

Using seaborn to explore our dataset¶
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
We’ll use it to explore some of the features of our data.
!pip install seaborn # Comment out if already installed
import seaborn as sns
sns.lineplot(x='tap_index', y='t_seconds', hue='finger', data=df_all)
sns.violinplot(data=df_all, x='dominant_hand', y='dt_seconds', hue='finger')
summary_df

Fitting regression models¶
Now we’re going to fit some simple linear regression models with OLS. This is not necessarily the best choice for these data, but more on that later. For now, we’re interested in whether the number of taps that someone makes can be modeled as a function of whether they used their dominant hand and which finger they used:
As we saw in class, we could use dummy coded variables to assign Left/Right and Pinky/Index to new columns of zeros and ones, but we can also use the statsmodels formulas to make this a little more intuitive, as follows:
import statsmodels.formula.api as smf

model = smf.ols(formula="ntaps ~ C(dominant_hand) + C(finger)", data=summary_df).fit()
print(model.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable: ntaps R-squared: 0.262
Model: OLS Adj. R-squared: 0.247
Method: Least Squares F-statistic: 17.38
Date: Thu, 12 Feb 2026 Prob (F-statistic): 3.47e-07
Time: 15:23:19 Log-Likelihood: -531.87
No. Observations: 101 AIC: 1070.
Df Residuals: 98 BIC: 1078.
Df Model: 2
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
Intercept 323.5174 8.624 37.515 0.000 306.404 340.631
C(dominant_hand)[T.True] 46.7067 9.599 4.866 0.000 27.658 65.755
C(finger)[T.pinky] -28.2498 9.549 -2.958 0.004 -47.200 -9.300
==============================================================================
Omnibus: 6.839 Durbin-Watson: 1.930
Prob(Omnibus): 0.033 Jarque-Bera (JB): 7.578
Skew: 0.395 Prob(JB): 0.0226
Kurtosis: 4.085 Cond. No. 3.38
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# What if we just wanted to fit whether taps depend only on using dominant hand? Note
# the difference in Adj. R-squared compared to the last example.
# FILL IN

Plots to look at balance within our data¶
Are we sampling evenly across handedness? Around 10-12% of the world’s population is left-handed, so we should expect that to be roughly the case for the data we get here. Let’s see if that pans out.
sns.countplot(x='handedness', data=summary_df)
We do indeed have an imbalance in the data, with many more right-handed people than left-handed people. If we ran a regression just looking at whether using your right hand helps, it would show a strong positive effect, but this is driven by the fact that we are mostly running this on right-handed people! So remember to think about how to interpret your coefficients in the context of your actual data.
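This kind of confounding can be illustrated with a small simulation. All numbers and effect sizes below are hypothetical: we make using the dominant hand (not the right hand per se) worth extra taps, sample mostly right-handed people, and see that a hand-only regression still picks up a large "right hand" effect:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200

# 90% right-handed sample; each person is assigned a tapping hand at random
handedness = np.where(rng.random(n) < 0.9, 'Right', 'Left')
hand = np.where(rng.random(n) < 0.5, 'Right', 'Left')
dominant = hand == handedness

# True data-generating rule: dominant-hand use adds ~40 taps; hand itself does nothing
ntaps = 300 + 40 * dominant + rng.normal(0, 20, n)
toy = pd.DataFrame({'hand': hand, 'dominant_hand': dominant, 'ntaps': ntaps})

# Hand-only model: "right hand" looks helpful only because right-hand users
# are mostly tapping with their dominant hand in this right-handed-heavy sample
m_hand_only = smf.ols('ntaps ~ C(hand)', data=toy).fit()
print(m_hand_only.params['C(hand)[T.Right]'])

# Controlling for dominant-hand use removes most of the apparent hand effect
m_controlled = smf.ols('ntaps ~ C(hand) + C(dominant_hand)', data=toy).fit()
print(m_controlled.params['C(hand)[T.Right]'])
```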
Let’s also just plot which hand the person used to do the task, and whether it was their dominant hand:
sns.countplot(x='hand', data=summary_df, hue='dominant_hand')
Google form data¶
We also collected data from our google form on sports, gaming, and sleep. Let’s look at that here:
google_form

Investigate other model effects¶
The effects of playing sports and of being a gamer are shown below:
model = smf.ols(formula = "ntaps ~ C(finger) + C(dominant_hand) + sport", data=google_form).fit()
print(model.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable: ntaps R-squared: 0.308
Model: OLS Adj. R-squared: 0.274
Method: Least Squares F-statistic: 9.064
Date: Thu, 12 Feb 2026 Prob (F-statistic): 4.75e-05
Time: 15:23:19 Log-Likelihood: -348.99
No. Observations: 65 AIC: 706.0
Df Residuals: 61 BIC: 714.7
Df Model: 3
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
Intercept 325.4028 14.726 22.098 0.000 295.957 354.849
C(finger)[T.Pinky] -35.4348 13.660 -2.594 0.012 -62.749 -8.120
C(dominant_hand)[T.True] 51.0389 13.933 3.663 0.001 23.179 78.899
sport 1.1967 1.428 0.838 0.405 -1.660 4.053
==============================================================================
Omnibus: 2.871 Durbin-Watson: 1.779
Prob(Omnibus): 0.238 Jarque-Bera (JB): 2.078
Skew: 0.283 Prob(JB): 0.354
Kurtosis: 3.669 Cond. No. 20.4
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
model = smf.ols(formula = "ntaps ~ C(finger) + C(dominant_hand) + C(gamer)", data=google_form).fit()
print(model.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable: ntaps R-squared: 0.384
Model: OLS Adj. R-squared: 0.354
Method: Least Squares F-statistic: 12.69
Date: Thu, 12 Feb 2026 Prob (F-statistic): 1.51e-06
Time: 15:23:19 Log-Likelihood: -345.21
No. Observations: 65 AIC: 698.4
Df Residuals: 61 BIC: 707.1
Df Model: 3
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
Intercept 326.3398 12.480 26.148 0.000 301.384 351.296
C(finger)[T.Pinky] -36.3306 12.819 -2.834 0.006 -61.964 -10.697
C(dominant_hand)[T.True] 49.5358 13.084 3.786 0.000 23.372 75.699
C(gamer)[T.Yes] 58.5620 20.311 2.883 0.005 17.949 99.175
==============================================================================
Omnibus: 1.340 Durbin-Watson: 1.818
Prob(Omnibus): 0.512 Jarque-Bera (JB): 0.691
Skew: 0.163 Prob(JB): 0.708
Kurtosis: 3.386 Cond. No. 4.20
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
sns.boxplot(x='dominant_hand', y='ntaps', hue='gamer', data=google_form)
# can you make a boxplot showing whether using your pinky vs your index finger
# results in more taps for gamers vs. non gamers? Your data should be colored
# by gamer vs. nongamer.
sns.boxplot(x='finger', y='ntaps', hue='gamer', data=google_form)
# Let's show the same data (ntaps) but change which is x and which is hue.
# (Let's make the finger the color this time)
# Does this make you interpret the data differently or think about it
# differently? Do the two graphs make certain comparisons more obvious?
sns.boxplot(x='gamer', y='ntaps', hue='finger', data=google_form)
Look at the data as a time series¶
What if we want to see how the number of taps changes over time? We can bin the data (though this can be dangerous... so do look at how binning data affects your results). We’ll start here by binning our data in 10 second bins and looking at decay over time bins.
bin_size = 10 # in seconds -- you could also change this to something else!
df_all['time_bin'] = (df_all['t_seconds'] // bin_size).astype(int)
df_all
df_bins = (
    df_all.groupby(['subj', 'handedness', 'finger', 'hand', 'dominant_hand', 'time_bin'])
          .size()
          .reset_index(name='taps_bin')
)
df_bins

Examples of how taps change as a function of time bin¶
Again let’s do some exploration of the dataset, splitting by different categories.
sns.lineplot(x='time_bin', y='taps_bin', data=df_bins)
sns.lineplot(x='time_bin', y='taps_bin', hue='hand', data=df_bins)
sns.lineplot(x='time_bin', y='taps_bin', hue='finger', data=df_bins)
Fitting models¶
Now we can fit some models to see how the number of taps per bin varies as a function of the time bin and other covariates. Note the adjusted R-squared and other metrics. Are these good models? Why or why not?
model = smf.ols('taps_bin ~ time_bin', data=df_bins).fit()
print(model.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable: taps_bin R-squared: 0.062
Model: OLS Adj. R-squared: 0.061
Method: Least Squares F-statistic: 39.96
Date: Thu, 12 Feb 2026 Prob (F-statistic): 5.06e-10
Time: 15:23:19 Log-Likelihood: -2264.1
No. Observations: 603 AIC: 4532.
Df Residuals: 601 BIC: 4541.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 60.4624 0.746 81.047 0.000 58.997 61.927
time_bin -1.5608 0.247 -6.321 0.000 -2.046 -1.076
==============================================================================
Omnibus: 38.053 Durbin-Watson: 0.665
Prob(Omnibus): 0.000 Jarque-Bera (JB): 51.912
Skew: 0.520 Prob(JB): 5.34e-12
Kurtosis: 3.991 Cond. No. 5.76
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
model = smf.ols('taps_bin ~ time_bin*C(finger) + C(dominant_hand)', data=df_bins).fit()
print(model.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable: taps_bin R-squared: 0.282
Model: OLS Adj. R-squared: 0.278
Method: Least Squares F-statistic: 58.82
Date: Thu, 12 Feb 2026 Prob (F-statistic): 7.05e-42
Time: 15:23:19 Log-Likelihood: -2183.5
No. Observations: 603 AIC: 4377.
Df Residuals: 598 BIC: 4399.
Df Model: 4
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 58.5275 0.990 59.112 0.000 56.583 60.472
C(finger)[T.pinky] -6.3238 1.317 -4.800 0.000 -8.911 -3.736
C(dominant_hand)[T.True] 8.2628 0.749 11.027 0.000 6.791 9.734
time_bin -1.7458 0.291 -5.991 0.000 -2.318 -1.173
time_bin:C(finger)[T.pinky] 0.4467 0.435 1.026 0.305 -0.409 1.302
==============================================================================
Omnibus: 65.282 Durbin-Watson: 0.739
Prob(Omnibus): 0.000 Jarque-Bera (JB): 144.550
Skew: 0.608 Prob(JB): 4.09e-32
Kurtosis: 5.067 Cond. No. 14.9
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Try the regressions again but using a different binning ... what do you notice?¶
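One way to re-bin is to wrap the binning-and-counting step in a small helper so you can compare bin widths side by side. The tap times below are hypothetical stand-ins for `df_all`, and `rebin` is a helper invented here for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical tap times (seconds) for two subjects, standing in for df_all
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'subj': [0] * 50 + [1] * 50,
    't_seconds': np.concatenate([np.sort(rng.uniform(0, 30, 50)),
                                 np.sort(rng.uniform(0, 30, 50))]),
})

def rebin(df, bin_size):
    """Count taps per subject per time bin of width bin_size seconds."""
    out = df.copy()
    out['time_bin'] = (out['t_seconds'] // bin_size).astype(int)
    return (out.groupby(['subj', 'time_bin'])
               .size()
               .reset_index(name='taps_bin'))

# Same data, two different bin widths -- coarser bins smooth the trend,
# finer bins are noisier but track the decay more closely
bins_2s = rebin(toy, 2)
bins_5s = rebin(toy, 5)
print(bins_2s.head())
print(bins_5s.head())
```

You could then refit the regressions on each binned frame and compare the coefficients and R-squared values across bin widths.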
# binning of 2 seconds, 5 seconds, something else?

What assumptions are we making?¶
When we collected our data, we had each person tap twice, once with their pinky and once with their index finger, but using the same hand. Does this affect the validity of any of the assumptions we make with this analysis?
Relating to other topics¶
How do the data we’ve collected here relate to the topic of autocovariance? (Can you calculate the autocovariance of your own dataset by loading the CSV?)
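As a minimal sketch of what that calculation could look like, here is a sample autocovariance at a given lag, using the biased (divide-by-n) estimator and hypothetical per-bin tap counts rather than the class data:

```python
import numpy as np

def autocovariance(x, lag):
    """Sample autocovariance at the given lag (biased 1/n estimator)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    return np.sum((x[:n - lag] - xbar) * (x[lag:] - xbar)) / n

# Hypothetical taps-per-bin series for one subject (not the class data)
taps = np.array([66, 54, 58, 50, 47, 45, 44, 41, 40, 38], dtype=float)

gamma = [autocovariance(taps, k) for k in range(4)]
print(gamma)  # gamma[0] equals the (biased) sample variance
```

A slowly decaying series like this one has a large positive lag-1 autocovariance, which connects to the low Durbin-Watson statistics in the binned regressions above: the residuals are not independent over time.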
How do you think this regression would look if we had people trying to do the finger tapping task for an even longer time? Is linear regression the best solution?