Quantitative Analysis in Sports

I've written primarily about hockey and baseball, two sports that appear to be about as far apart as possible in the way the games are typically modelled, though the two share a fair amount of common ground.

Hockey projects:

  • The Impact of Puck Possession and Location on Ice Hockey Strategy. Journal of Quantitative Analysis in Sports, 2(1). [2006] -- Since offence and defence are highly entangled concepts -- a team is less likely to be scored upon in situations where they're likely to score -- I break each down into puck possession and location, and assess the offensive and defensive potential of each situation; the game is then simulated using a semi-Markov process. The data were manually collected from Harvard Crimson Men's games in 2004-2005 and are available upon request.
  • Inter-Arrival Times of Goals in Ice Hockey. Journal of Quantitative Analysis in Sports, 3(3). [2007] -- Taking the times between goals in NHL games, and accounting for the censored nature of the data, I estimate the probability distribution of the time between events using survival analysis, then use this to estimate the value of a goal in terms of the change in a team's win probability. This approach was originally inspired by work in baseball by George Lindsey and others. Data for this paper are available here.
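
To give a flavour of what accounting for censoring means in the inter-arrival paper, here is a minimal sketch of the Kaplan-Meier estimator, a standard nonparametric tool from survival analysis. It is an illustration only and not necessarily the estimator used in the paper; the function name and the sample times (in minutes) are made up for the example. An interval is treated as censored when the game ended before another goal was scored.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Kaplan-Meier survival estimate.

    durations: time until the next goal, or until observation stopped.
    observed:  True if a goal ended the interval, False if it was censored
               (for example, the game ended first).
    Returns a list of (time, survival probability) at each observed goal time.
    """
    goals = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)      # everyone leaving the risk set at time t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = goals.get(t, 0)
        if d > 0:
            # Fraction of intervals still running that survive past time t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]         # censored intervals drop out here too
    return curve


# Hypothetical inter-goal times in minutes; False marks a censored interval.
times = [5, 12, 12, 20, 30]
goal_seen = [True, True, False, True, False]
curve = kaplan_meier(times, goal_seen)
```

Censored intervals never count as goals, but they stay in the risk set until the moment they are cut off, which is what keeps the estimate from being biased toward short gaps.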

Baseball projects:

  • That's the Second-Biggest Hitting Streak I've Ever Seen! Verifying Simulated Historical Extremes in Baseball. Working paper, soon to be submitted. [2010] -- Given season statistics for players throughout the history of Major League Baseball, I shrink these statistics toward fitted career curves for each player, then use these to simulate player-games and determine hitting and on-base streaks, first under an assumption of independent game outcomes, then introducing a parameter for streakiness or anti-streakiness. Having verified that the model accurately captures lower-order hitting streaks, and that the variability of pitchers in their ability to get outs on balls in play has decreased, I conclude that the fabled DiMaggio 56-game streak is an impressive accomplishment in itself, and highly surprising in the modern era. A similar model addressing on-base streaks suggests that Ted Williams's 84-game record is less surprising, though single individuals have more sway over that result, which calls the model's validity into question. Supplementary material is available in an R package.
  • The Catcher Spotting Project is an attempt to measure pitcher intent by recording the potential target of each pitch, and adding this information to already available sources. I'm currently seeking funding and/or students to push this forward. An old version of this is available here.
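
As a rough illustration of the first stage of the hitting-streak simulation above (independent game outcomes only), the sketch below simulates seasons for a single player and records the longest streak in each. It is a simplified stand-in, not the code from the paper or the R package: the function names are hypothetical, the per-game hit probability is assumed constant, and the shrinkage toward career curves and the streakiness parameter are left out.

```python
import random


def longest_streak(outcomes):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for hit in outcomes:
        current = current + 1 if hit else 0
        best = max(best, current)
    return best


def simulate_longest_streaks(p_hit_game, n_games, n_seasons, seed=0):
    """Simulate seasons of independent games; return each season's longest streak.

    p_hit_game: assumed probability of at least one hit in any given game.
    """
    rng = random.Random(seed)   # seeded so runs are reproducible
    return [
        longest_streak(rng.random() < p_hit_game for _ in range(n_games))
        for _ in range(n_seasons)
    ]


# Hypothetical inputs: an 80% chance of a hit per game over 154-game seasons.
streaks = simulate_longest_streaks(p_hit_game=0.8, n_games=154, n_seasons=1000)
share_56_plus = sum(s >= 56 for s in streaks) / len(streaks)
```

Comparing the simulated distribution of longest streaks with the historical record is the kind of check that the verification of lower-order streaks refers to.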

Content copyright (c) 2009, Andrew C. Thomas.