Race Results Part 2 - Analyzing and Visualizing Finish Times

cc

In my previous post, I used the BeautifulSoup Python library to scrape HTML data from the web, clean it, and read it into Pandas. The data that I use are the 2015 and 2016 results from the Empire Open Cross Country Meet - a 3.43 mile race that my running club hosts every year in Santa Rosa.

Now I have a cleaned dataset of 2 years of race results parsed from HTML. In this post I will actually analyze this data a bit and generate some interesting stats and plots. I’ll use the grouping operations in Pandas to compare team performances across years; Matplotlib and its wrapper seaborn to make plots that visualize these comparisons; and Statsmodels to build a basic regression model that can be used to control for age and gender when comparing times.

A Jupyter notebook of my code can be downloaded here.

#Import necessary Python modules

import pandas as pd
from datetime import datetime, timedelta
import seaborn as sns
import matplotlib.pylab as plt
from matplotlib.cm import get_cmap
import numpy as np
%matplotlib inline

First I read in the cleaned dataframe that I generated in the last post and saved as a Pickle file (this is just an efficient way to save a Python object to disk and then read it back in).

This data contains 547 race finish times across 2 years of the same race. I have the data sorted by finish time across both years. As you can see, from the first 5 rows printed below, the same runner placed 2nd in both 2015 and 2016. The data includes the information we would expect to see in race results - Place, Name, Team, Age, Gender, Time, etc.

df = pd.read_pickle('cleaned_race_results.p')
print df.shape
print df.head()
(547, 12)
   Place              Name                    Team  Bib No  Age Gender  \
0      1      Isaac Chavez     Aggies Running Club     414   24      M   
1      2  Malcolm Richards  West Valley Track Club     306   32      M   
2      1    Trevor Halsted              UNATTACHED     701   23      M   
3      2  Malcolm Richards  West Valley Track Club     623   33      M   
4      3  Alfonso Cisneros     Aggies Running Club     413   23      M   

    Age Group Total Time    Pace  Year     Time    Minutes  
0  1/52 16-99      16:51  4:54/M  2015 00:16:51  16.850000  
1  2/52 16-99      16:56  4:56/M  2015 00:16:56  16.933333  
2     1 16-99      16:56  4:56/M  2016 00:16:56  16.933333  
3     2 16-99      17:05  4:59/M  2016 00:17:05  17.083333  
4  3/52 16-99      17:07  4:59/M  2015 00:17:07  17.116667  

Get Average Time Team Rankings by Gender and Year

A useful piece of information would be to compare teams by gender and year to see which teams are the fastest. In cross-country races the fastest team is determined by the place of its top 5 runners. However, I am instead going to look at the average time of the top 5 runners, which will provide more useful comparisons across years.

The first step here is to identify the top 5 runners on each team for each year and gender. Not all teams had a “full-team” of 5 runners so the first step is to identify which runners are on full-teams. Then I subset the top 5 runners from each of these full teams. This is easy to do by applying functions to Pandas groupby objects, which I do below.

I can also easily determine the number of full-teams by gender and year, which I print out below. In 2015 there were 7 full women’s teams and 11 full men’s teams, and in 2016 there were 8 of each.

#define columns that will be used to aggregate by
group_cols = ['Team','Gender','Year']

#find out which runners are on a full team (at least 5 people and not "unattached")
full_team = (df.groupby(group_cols)['Name'].transform('size')>=5) & (df.Team<>'UNATTACHED')

#get the team, year, gender, and time of those top 5 runners on full teams 
top5 = df[full_team].groupby(group_cols)['Minutes'].nsmallest(5).reset_index()[['Team', 'Gender','Year','Minutes']]

#determine number of complete teams in each year
print top5.groupby(['Gender','Year'])['Team'].apply(lambda x:len(x.unique())).unstack()
Year    2015  2016
Gender            
F          7     8
M         11     8

Now that I have subsetted the top 5 runners in each full-team, I can calculate the average time for each of those teams. I do that below using the Pandas cross-tab function, specifying that I want the data tabulated by Team, Year, and Gender, while calculating the mean time.

The result is a dataframe indexed by each unique team, with columns for the average time of that team by year and gender. There are some cases where a team only has runners of one gender or only ran one year; and in these cases the team has a null value.

#use crosstab function and aggregate times by taking the mean
avg_times = pd.crosstab(top5.Team, [top5.Gender, top5.Year], values = top5.Minutes, aggfunc=lambda x:round(x.mean(),2))
print avg_times.sort_index()
Gender                                 F             M       
Year                                2015   2016   2015   2016
Team                                                         
Aggies Running Club                  NaN    NaN  17.63    NaN
Buffalo Chips                        NaN  27.02  29.11  25.47
Empire Runners Club                23.46  23.06  19.76  19.27
Excelsior Running Club               NaN    NaN  18.99  20.01
Golden Valley Harriers             22.90  24.38  20.66  19.75
Humboldt Track Club                  NaN    NaN  22.07    NaN
Impala Racing Team                 21.86  21.68    NaN    NaN
Lake Merritt Joggers and Striders  26.15  26.05    NaN    NaN
Pamakids                           23.33  24.41  21.20  22.66
Strawberry Canyon TC               21.06  22.27  18.13  17.94
Tamalpa Runners                    23.56  28.03  20.77  21.25
West Valley Joggers & Striders       NaN    NaN  19.85    NaN
West Valley Track Club               NaN    NaN  17.85  17.62

Then from my dataframe of average times, I can calculate the team rankings by age and gender. Note that I first exclude teams with null values (because they didn’t have a team for a gender or year), so that these teams do not appear in the listings.

I show the top 3 teams for each year and gender by average time below.

team_rankings = avg_times.apply(lambda x: x.index.where(x.notnull(), '')[x.values.argsort()]).reset_index(drop=True)
team_rankings.index = team_rankings.index+1
team_rankings.columns.names=[None, None]
print team_rankings[:3]
                        F                                             M  \
                     2015                  2016                    2015   
1    Strawberry Canyon TC    Impala Racing Team     Aggies Running Club   
2      Impala Racing Team  Strawberry Canyon TC  West Valley Track Club   
3  Golden Valley Harriers   Empire Runners Club    Strawberry Canyon TC   

                           
                     2016  
1  West Valley Track Club  
2    Strawberry Canyon TC  
3     Empire Runners Club  

Comparing Team Performances Across Years

Next, I will look among teams that competed both in 2015 and 2016 and compare their results to see which teams got faster and which teams got slower. Below, I calculate the number of seconds faster that each team ran in 2016. Below I identify that the Tamalpa Runners Women’s team improved the most, with a top 5 runner average 268 seconds faster (almost 4.5 minutes!) in 2016 than 2016.

team_time_compare = avg_times.stack(level=0).dropna()
seconds_faster = (team_time_compare[2016] - team_time_compare[2015]).sort_values(ascending=False)*60
print seconds_faster.idxmax(), seconds_faster.max()
(u'Tamalpa Runners', u'F') 268.2

Most other teams did not improve quite as much. About half of the teams that ran both years ran faster, and about half ran slower.

print "{} teams ran faster and {} teams ran slower".format((seconds_faster>0).sum(), (seconds_faster<0).sum())
print "Team average time improved by an average of {} seconds in 2016".format(int(seconds_faster.mean()))
7 teams ran faster and 8 teams ran slower
Team average time improved by an average of 20 seconds in 2016

To help visualize these team comparisons, I make a bar plot of team average time differences. Additionally, I use a Blue-to-Red colormap to visualize the amount faster or slower that each team ran. This is all pretty easy to do using Matplotlib along with the built in plotting functions in Pandas.

#define colormap
cmap = get_cmap('RdBu')

#normalize data 0-1 to get each teams color within the color map
norm_data = (seconds_faster-seconds_faster.min())/(seconds_faster.max()-seconds_faster.min())
colors = [cmap(pct) for pct in norm_data]

#define figure and axis
fig, ax = plt.subplots(figsize=(10,8))   
#use pandas buit in plotting features to plot to this axis
seconds_faster.plot(kind='bar', ax=ax, fontsize=14, color=colors, width=1, rot=80)

#Set title and labels
ax.set_title('Empire Open Team Average Time Differences 2015 - 2016',fontsize=18)
ax.set_ylabel('Seconds Faster in 2016', fontsize=18)

<matplotlib.text.Text at 0xcbe5170>

cc

Model Effect of Age and Gender on Race Time

The last thing I would like to do is use this data to visualize and quantify the effect that age and gender have on race times. Using the lmplot method in seaborn I can easily visualize the relationship between age, gender, and time in 1 plot for both years. I use different colors to identify male vs female, and then plot age against time, specifying that I want each year in a separate plot.

As we might expect to see, in general, times get slower as age increases, but this relationship is not linear. In the younger age groups, people run faster as they get older, but after 30 or so, people start running slower. This same general pattern holds in both years and for both men and women.

The lmplot method also fits a linear regression model and plots the line along with its confidence interval. Given the fact that the relationship between age and gender has a curve to it, I fit the model with a 2nd order polynomial, which fits the curvilinear relationship.

g=sns.lmplot('Age', 'Minutes', data=df, hue='Gender', order=2, palette='dark', col='Year')
t=plt.suptitle('Empire Open 3.43 Mile Race: Finish Times by Age and Gender', size=15)
g.fig.subplots_adjust(top=.85)

cc

While seaborn will fit and plot a regression line, you can’t actually obtain the results of that regression. For this I use statsmodels, a statistics module that has a formula framework similar to R and also interfaces really nicely with Pandas.

Below I fit a basic OLS regression model that replicates the model used by seaborn to generate the regression lines plotted above. I regress race minutes against age, age-squared, gender, and year. I include both age and age-squared in the model to take into account the 2nd order polynomial fit. I also specify that I want to treat gender and year as categorical variables (Statsmodels automatically chooses the excluded group for categorical variables).

import statsmodels.formula.api as smf
model = smf.ols(formula = 'Minutes ~ Age + I(Age**2) + C(Gender) + C(Year)', data = df).fit()
print model.summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Minutes   R-squared:                       0.560
Model:                            OLS   Adj. R-squared:                  0.557
Method:                 Least Squares   F-statistic:                     172.4
Date:                Sun, 26 Mar 2017   Prob (F-statistic):           3.69e-95
Time:                        21:36:48   Log-Likelihood:                -1408.5
No. Observations:                 547   AIC:                             2827.
Df Residuals:                     542   BIC:                             2848.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept          30.3114      0.970     31.264      0.000        28.407    32.216
C(Gender)[T.M]     -3.8807      0.278    -13.953      0.000        -4.427    -3.334
C(Year)[T.2016]     0.1722      0.274      0.628      0.530        -0.367     0.711
Age                -0.4188      0.045     -9.346      0.000        -0.507    -0.331
I(Age ** 2)         0.0068      0.001     13.452      0.000         0.006     0.008
==============================================================================
Omnibus:                      165.883   Durbin-Watson:                   1.173
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              491.159
Skew:                           1.456   Prob(JB):                    2.22e-107
Kurtosis:                       6.616   Cond. No.                     1.85e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.85e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

The coefficients of the terms in this model behave as we would expect given the plots above. The coefficient for male is negative and significant because in general times for men are faster than for women. The coefficient for year is not well-supported meaning there is no significant difference in times between the two years. The age effect also makes sense although is a bit more complicated to interpret because we include both age and age-squared to measure the curvilinear relationship. The negative age coefficient means that in general as age increases the number of minutes it takes to complete the race decreases (people get faster as they get older), but the positive squared term means that this effect decreases at higher ages (people start getting slower). Therefore the relationship between age and time is a u-shape.

With statsmodels I can use the predict method to get expected values given defined inputs. Below I show that the predicted time of a 26-year-old male in 2015 was 20.12. I then print out my time in that year and see that I beat my predicted time by 1.4 minutes! (although, I fear that my time would be much less stellar if I had used one of the traditionally more competitive meets to build this model).

p = model.predict({'Age':26, 'Gender':'M', 'Year':2015})[0]
print "Predicted time (in minutes) of a 26 year old male in 2015: {}".format(round(p,2))
Predicted time (in minutes) of a 26 year old male in 2015: 20.12
print df.loc[(df.Name=='Andrew Gaidus')&(df.Year==2015), 'Minutes'].iloc[0]
18.75

Comparing the predicted value versus the actual value (as I just did for myself above) in a sense gives an indication of how well someone ran, controlling for their age and gender. The difference in predicted values vs actual values is simply the model residuals, and these can easily be obtained from the model. Below I use observations in 2016 with the most negative residuals to identify the top performances controlling for age and gender (according to my very simple model).

According to this ranking the top performance in 2016 was by a 71 year-old man who ran his race at 7:02 mile pace! This definitely sounds impressive to me.

#Get model residuals 
res=pd.Series(model.resid,name='Residuals')

#Top 5 performances controlling for age and gender
best_performances = df.join(res).sort_values(by='Residuals')

print 'Top 3 performances controlling for age and gender'
print best_performances[best_performances.Year==2016][['Name', 'Age', 'Gender', 'Time', 'Pace']][:3]

Top 3 performances controlling for age and gender
            Name  Age Gender     Time    Pace
286  Len Goldman   71      M 00:24:08  7:02/M
504  Bill Dodson   81      M 00:31:49  9:16/M
349  Gene French   70      M 00:25:13  7:21/M

And lastly, I will now generate team rankings in 2016 controlling for age and gender, by using the residuals of the top 5 runners of each team.

top5_resid = df[full_team].join(res).groupby(['Team','Gender','Year'])['Residuals'].nsmallest(5).reset_index()
best_teams = top5_resid.groupby(['Team','Gender','Year'])['Residuals'].sum().unstack()[2016].sort_values()
print best_teams[:3]
Team                    Gender
Tamalpa Runners         M        -21.456713
Impala Racing Team      F        -19.757007
West Valley Track Club  M        -14.611984
Name: 2016, dtype: float64

As you can see, the Tamalpa Runners Men’s team has the most negative sum of residuals. If we look at the age and times of these runners we see that they are indeed very impressive. The average age of their top 5 was 46 years old and the average pace is 5:56 per mile!

df16=df[df.Year==2016]
top_team =  df.set_index(['Team','Gender']).loc[best_teams.idxmin()].reset_index()[:5][['Age','Gender','Time', 'Pace']]
print top_team
   Age Gender     Time    Pace
0   38      M 00:20:03  5:50/M
1   47      M 00:20:20  5:55/M
2   52      M 00:20:22  5:56/M
3   51      M 00:20:23  5:56/M
4   44      M 00:20:44  6:02/M
Written on March 26, 2017