We ran through parts of this notebook in class for Lecture 7 and Lecture 8. Now we will extend these analyses to look at our data in more detail and apply some of the regression techniques from class.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from matplotlib import pyplot as plt
import glob
import os

summary_df = pd.read_csv('Lab4_data/summary_df.csv')
df_all = pd.read_csv('Lab4_data/df_all.csv')
df_bins = pd.read_csv('Lab4_data/df_bins.csv')
google_form = pd.read_csv('Lab4_data/google_form.csv')

# summary_df has one row per observation and shows the number of taps for that observation
summary_df

# df_all has all of the data from everyone, including all the individual taps, as well
# as repeated metadata about the person ('subj') who completed the task
# For example, the first several rows are from subj 0, who is left handed and used
# their left index finger for the first part of the task
df_all

# df_bins is the binned data when we use 10 second bins -- this is very
# much an oversimplification of the trend over time, but for now
# is simple to look at
# For example, the first row is for subj 0, time_bin 0 corresponds to the
# first 10 seconds of their attempt, and they had 66 taps in that time bin.
# In the second 10 seconds, they had 54 taps.
df_bins

# google_form has additional data about whether they slept enough, whether
# they play video games more than 10 hrs a week, and how many years they
# played a sport. Remember this doesn't fully overlap with our other dataset
# and has a different number of rows, though we could try to match them up
# later
google_form

Using seaborn to explore our dataset¶
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
We’ll use it to explore some of the features of our data.
!pip install seaborn # Comment out if already installed
import seaborn as sns
sns.lineplot(x='tap_index', y='t_seconds', hue='finger', data=df_all)
sns.violinplot(data=df_all, x='dominant_hand', y='dt_seconds', hue='finger')
summary_df

Fitting regression models¶
Now we’re going to fit some simple linear regression models with OLS. This is not necessarily the best choice for these data, but more on that later. For now, we’re interested in whether the number of taps that someone makes can be modeled as a function of whether they used their dominant hand and which finger they used:
As we saw in class, we could use dummy coded variables to assign Left/Right and Pinky/Index to new columns of zeros and ones, but we can also use the statsmodels formulas to make this a little more intuitive, as follows:
import statsmodels.formula.api as smf

model = smf.ols(formula="ntaps ~ C(dominant_hand) + C(finger)", data=summary_df).fit()
print(model.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable: ntaps R-squared: 0.262
Model: OLS Adj. R-squared: 0.247
Method: Least Squares F-statistic: 17.38
Date: Thu, 12 Feb 2026 Prob (F-statistic): 3.47e-07
Time: 15:23:19 Log-Likelihood: -531.87
No. Observations: 101 AIC: 1070.
Df Residuals: 98 BIC: 1078.
Df Model: 2
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
Intercept 323.5174 8.624 37.515 0.000 306.404 340.631
C(dominant_hand)[T.True] 46.7067 9.599 4.866 0.000 27.658 65.755
C(finger)[T.pinky] -28.2498 9.549 -2.958 0.004 -47.200 -9.300
==============================================================================
Omnibus: 6.839 Durbin-Watson: 1.930
Prob(Omnibus): 0.033 Jarque-Bera (JB): 7.578
Skew: 0.395 Prob(JB): 0.0226
Kurtosis: 4.085 Cond. No. 3.38
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# What if we just wanted to fit whether taps depend only on using dominant hand? Note
# the difference in Adj. R-squared compared to the last example.
# FILL IN

Plots to look at balance within our data¶
Are we sampling evenly across handedness? Around 10-12% of the world’s population is left-handed, so we should expect that to be roughly the case for the data we get here. Let’s see if that pans out.
sns.countplot(x='handedness', data=summary_df)
We do indeed have an imbalance in the data, with many more right-handed people than left-handed people. If we ran a regression just looking at whether using your right hand helps, it would show a strong positive effect, but this is driven by the fact that we are mostly running this on right-handed people! So remember to think about how to interpret your coefficients in the context of your actual data.
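This kind of confounding can be illustrated with a small simulation. All numbers and effect sizes below are hypothetical: we make using the dominant hand (not the right hand per se) worth extra taps, sample mostly right-handed people, and see that a hand-only regression still picks up a large "right hand" effect:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200

# 90% right-handed sample; each person is assigned a tapping hand at random
handedness = np.where(rng.random(n) < 0.9, 'Right', 'Left')
hand = np.where(rng.random(n) < 0.5, 'Right', 'Left')
dominant = hand == handedness

# True data-generating rule: dominant-hand use adds ~40 taps; hand itself does nothing
ntaps = 300 + 40 * dominant + rng.normal(0, 20, n)
toy = pd.DataFrame({'hand': hand, 'dominant_hand': dominant, 'ntaps': ntaps})

# Hand-only model: "right hand" looks helpful only because right-hand users
# are mostly tapping with their dominant hand in this right-handed-heavy sample
m_hand_only = smf.ols('ntaps ~ C(hand)', data=toy).fit()
print(m_hand_only.params['C(hand)[T.Right]'])

# Controlling for dominant-hand use removes most of the apparent hand effect
m_controlled = smf.ols('ntaps ~ C(hand) + C(dominant_hand)', data=toy).fit()
print(m_controlled.params['C(hand)[T.Right]'])
```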
Let’s also just plot which hand the person used to do the task, and whether it was their dominant hand:
sns.countplot(x='hand', data=summary_df, hue='dominant_hand')
Google form data¶
We also collected data from our google form on sports, gaming, and sleep. Let’s look at that here:
google_form

Investigate other model effects¶
The effects of playing sports and of being a gamer are shown below:
model = smf.ols(formula = "ntaps ~ C(finger) + C(dominant_hand) + sport", data=google_form).fit()
print(model.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable: ntaps R-squared: 0.308
Model: OLS Adj. R-squared: 0.274
Method: Least Squares F-statistic: 9.064
Date: Thu, 12 Feb 2026 Prob (F-statistic): 4.75e-05
Time: 15:23:19 Log-Likelihood: -348.99
No. Observations: 65 AIC: 706.0
Df Residuals: 61 BIC: 714.7
Df Model: 3
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
Intercept 325.4028 14.726 22.098 0.000 295.957 354.849
C(finger)[T.Pinky] -35.4348 13.660 -2.594 0.012 -62.749 -8.120
C(dominant_hand)[T.True] 51.0389 13.933 3.663 0.001 23.179 78.899
sport 1.1967 1.428 0.838 0.405 -1.660 4.053
==============================================================================
Omnibus: 2.871 Durbin-Watson: 1.779
Prob(Omnibus): 0.238 Jarque-Bera (JB): 2.078
Skew: 0.283 Prob(JB): 0.354
Kurtosis: 3.669 Cond. No. 20.4
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
model = smf.ols(formula = "ntaps ~ C(finger) + C(dominant_hand) + C(gamer)", data=google_form).fit()
print(model.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable: ntaps R-squared: 0.384
Model: OLS Adj. R-squared: 0.354
Method: Least Squares F-statistic: 12.69
Date: Thu, 12 Feb 2026 Prob (F-statistic): 1.51e-06
Time: 15:23:19 Log-Likelihood: -345.21
No. Observations: 65 AIC: 698.4
Df Residuals: 61 BIC: 707.1
Df Model: 3
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
Intercept 326.3398 12.480 26.148 0.000 301.384 351.296
C(finger)[T.Pinky] -36.3306 12.819 -2.834 0.006 -61.964 -10.697
C(dominant_hand)[T.True] 49.5358 13.084 3.786 0.000 23.372 75.699
C(gamer)[T.Yes] 58.5620 20.311 2.883 0.005 17.949 99.175
==============================================================================
Omnibus: 1.340 Durbin-Watson: 1.818
Prob(Omnibus): 0.512 Jarque-Bera (JB): 0.691
Skew: 0.163 Prob(JB): 0.708
Kurtosis: 3.386 Cond. No. 4.20
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
sns.boxplot(x='dominant_hand', y='ntaps', hue='gamer', data=google_form)
# can you make a boxplot showing whether using your pinky vs your index finger
# results in more taps for gamers vs. non gamers? Your data should be colored
# by gamer vs. nongamer.
sns.boxplot(x='finger', y='ntaps', hue='gamer', data=google_form)
# Let's show the same data (ntaps) but change which is x and which is hue.
# (Let's make the finger the color this time)
# Does this make you interpret the data differently or think about it
# differently? Do the two graphs make certain comparisons more obvious?
sns.boxplot(x='gamer', y='ntaps', hue='finger', data=google_form)
Look at the data as a time series¶
What if we want to see how the number of taps changes over time? We can bin the data (though this can be dangerous... so do look at how binning data affects your results). We’ll start here by binning our data in 10 second bins and looking at decay over time bins.
bin_size = 10 # in seconds -- you could also change this to something else!
df_all['time_bin'] = (df_all['t_seconds'] // bin_size).astype(int)
df_all
df_bins = (
    df_all.groupby(['subj', 'handedness', 'finger', 'hand', 'dominant_hand', 'time_bin'])
          .size()
          .reset_index(name='taps_bin')
)
df_bins

Examples of how taps change as a function of time bin¶
Again let’s do some exploration of the dataset, splitting by different categories.
sns.lineplot(x='time_bin', y='taps_bin', data=df_bins)
sns.lineplot(x='time_bin', y='taps_bin', hue='hand', data=df_bins)
sns.lineplot(x='time_bin', y='taps_bin', hue='finger', data=df_bins)
Fitting models¶
Now we can fit some models to see how the number of taps per bin varies as a function of the time bin and other covariates. Note the adjusted R-squared and other metrics. Are these good models? Why or why not?
model = smf.ols('taps_bin ~ time_bin', data=df_bins).fit()
print(model.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable: taps_bin R-squared: 0.062
Model: OLS Adj. R-squared: 0.061
Method: Least Squares F-statistic: 39.96
Date: Thu, 12 Feb 2026 Prob (F-statistic): 5.06e-10
Time: 15:23:19 Log-Likelihood: -2264.1
No. Observations: 603 AIC: 4532.
Df Residuals: 601 BIC: 4541.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 60.4624 0.746 81.047 0.000 58.997 61.927
time_bin -1.5608 0.247 -6.321 0.000 -2.046 -1.076
==============================================================================
Omnibus: 38.053 Durbin-Watson: 0.665
Prob(Omnibus): 0.000 Jarque-Bera (JB): 51.912
Skew: 0.520 Prob(JB): 5.34e-12
Kurtosis: 3.991 Cond. No. 5.76
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
model = smf.ols('taps_bin ~ time_bin*C(finger) + C(dominant_hand)', data=df_bins).fit()
print(model.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable: taps_bin R-squared: 0.282
Model: OLS Adj. R-squared: 0.278
Method: Least Squares F-statistic: 58.82
Date: Thu, 12 Feb 2026 Prob (F-statistic): 7.05e-42
Time: 15:23:19 Log-Likelihood: -2183.5
No. Observations: 603 AIC: 4377.
Df Residuals: 598 BIC: 4399.
Df Model: 4
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 58.5275 0.990 59.112 0.000 56.583 60.472
C(finger)[T.pinky] -6.3238 1.317 -4.800 0.000 -8.911 -3.736
C(dominant_hand)[T.True] 8.2628 0.749 11.027 0.000 6.791 9.734
time_bin -1.7458 0.291 -5.991 0.000 -2.318 -1.173
time_bin:C(finger)[T.pinky] 0.4467 0.435 1.026 0.305 -0.409 1.302
==============================================================================
Omnibus: 65.282 Durbin-Watson: 0.739
Prob(Omnibus): 0.000 Jarque-Bera (JB): 144.550
Skew: 0.608 Prob(JB): 4.09e-32
Kurtosis: 5.067 Cond. No. 14.9
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Try the regressions again but using a different binning ... what do you notice?¶
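One way to re-bin is to wrap the binning-and-counting step in a small helper so you can compare bin widths side by side. The tap times below are hypothetical stand-ins for `df_all`, and `rebin` is a helper invented here for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical tap times (seconds) for two subjects, standing in for df_all
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'subj': [0] * 50 + [1] * 50,
    't_seconds': np.concatenate([np.sort(rng.uniform(0, 30, 50)),
                                 np.sort(rng.uniform(0, 30, 50))]),
})

def rebin(df, bin_size):
    """Count taps per subject per time bin of width bin_size seconds."""
    out = df.copy()
    out['time_bin'] = (out['t_seconds'] // bin_size).astype(int)
    return (out.groupby(['subj', 'time_bin'])
               .size()
               .reset_index(name='taps_bin'))

# Same data, two different bin widths -- coarser bins smooth the trend,
# finer bins are noisier but track the decay more closely
bins_2s = rebin(toy, 2)
bins_5s = rebin(toy, 5)
print(bins_2s.head())
print(bins_5s.head())
```

You could then refit the regressions on each binned frame and compare the coefficients and R-squared values across bin widths.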
# binning of 2 seconds, 5 seconds, something else?

What assumptions are we making?¶
When we collected our data, we had each person tap twice, once with their pinky and once with their index finger, but using the same hand. Does this affect the validity of any of the assumptions we make with this analysis?
Relating to other topics¶
How do the data we’ve collected here relate to the topic of autocovariance? (Can you calculate the autocovariance of your own dataset by loading the CSV?)
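As a minimal sketch of what that calculation could look like, here is a sample autocovariance at a given lag, using the biased (divide-by-n) estimator and hypothetical per-bin tap counts rather than the class data:

```python
import numpy as np

def autocovariance(x, lag):
    """Sample autocovariance at the given lag (biased 1/n estimator)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    return np.sum((x[:n - lag] - xbar) * (x[lag:] - xbar)) / n

# Hypothetical taps-per-bin series for one subject (not the class data)
taps = np.array([66, 54, 58, 50, 47, 45, 44, 41, 40, 38], dtype=float)

gamma = [autocovariance(taps, k) for k in range(4)]
print(gamma)  # gamma[0] equals the (biased) sample variance
```

A slowly decaying series like this one has a large positive lag-1 autocovariance, which connects to the low Durbin-Watson statistics in the binned regressions above: the residuals are not independent over time.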
How do you think this regression would look if we had people trying to do the finger tapping task for an even longer time? Is linear regression the best solution?