Research: Analysis of Relational Data and Complex Networks
Broadly speaking, relational data are observations and outcomes measured between pairs of individual units -- people, schools, countries, and so forth -- a field that includes (binary) social network analysis as a subfield. I focus on methods for evaluating and predicting relations based on individual and relational characteristics, and on outcomes measured on the units that make up these networks. In particular, I use hierarchical/multilevel modelling and tools from Bayesian computational statistics to shed new light on old methods and models.
I research the conditions under which we can infer contagion on social networks, and how we can distinguish it from homophily, or the tendency of similar people to be connected. Cosma Shalizi and I are skeptical.
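The identification problem can be seen in a toy simulation: if a latent trait drives both tie formation and behaviour, neighbours' behaviours correlate even when there is no contagion at all. A minimal Python sketch, with every quantity invented purely for illustration:

```python
import random

random.seed(1)

# Latent trait drives both tie formation (homophily) and behaviour.
n = 500
trait = [random.gauss(0, 1) for _ in range(n)]

# Ties form preferentially between similar individuals: homophily.
edges = [(i, j) for i in range(n) for j in range(i + 1, n)
         if random.random() < 0.05 and abs(trait[i] - trait[j]) < 0.5]

# Behaviour depends only on the latent trait -- no contagion anywhere.
behaviour = [t + random.gauss(0, 0.5) for t in trait]

# Yet neighbours' behaviours are positively correlated, mimicking contagion.
pairs = [(behaviour[i], behaviour[j]) for i, j in edges]
mean_x = sum(x for x, _ in pairs) / len(pairs)
mean_y = sum(y for _, y in pairs) / len(pairs)
cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / len(pairs)
print(cov > 0)  # apparent "contagion" from pure homophily
```

The positive covariance here comes entirely from the shared latent trait, which is exactly why observational network data alone cannot settle the question.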
Research: Quantitative Analysis of Sports and Games
I've written primarily about hockey and baseball, two sports that are about as far apart as can be in how the games are typically modelled, though there is a fair amount of common ground to be had.
- Competing Process Hazard Function Models for Player Ratings in Ice Hockey, with Sam Ventura, Shane Jensen and Stephen Ma. Read more about it here.
- Inter-Arrival Times of Goals in Ice Hockey. Journal of Quantitative Analysis in Sports, 3(3).  -- Taking the times between goals in NHL games, and accounting for the censored nature of the data, I estimate the probability distribution of the time between events using survival analysis, then use this to estimate the value of a goal in terms of the change in a team's win probability. This approach was originally inspired by work in baseball by George Lindsey and others. Data for this paper are available here.
- The Impact of Puck Possession and Location on Ice Hockey Strategy. Journal of Quantitative Analysis in Sports, 2(1).  -- Since offence and defence are highly entangled -- a team is less likely to be scored upon in situations where it is likely to score -- I separate each into puck possession and location and assess the offensive and defensive potential of each situation, simulated using a semi-Markov process. The data were manually collected from Harvard Crimson men's games in 2004-2005 and are available upon request.
- That's the Second-Biggest Hitting Streak I've Ever Seen! Verifying Simulated Historical Extremes in Baseball. Journal of Quantitative Analysis in Sports, 6(4).  -- Given season statistics for players throughout the history of Major League Baseball, I shrink these statistics toward fitted career curves for each player, then use them to simulate player-games and identify hitting and on-base streaks, first under an assumption of independent game outcomes, then with an added parameter for streakiness or anti-streakiness. Having verified that the model accurately captures lower-order hitting streaks, and that the variability of pitchers' ability to get outs on balls in play has decreased, I conclude that DiMaggio's fabled 56-game streak is an impressive accomplishment in itself, and would be highly surprising in the modern era. A similar model for on-base streaks suggests that Ted Williams's 84-game record is less surprising, though single individuals have more sway there, which calls the model's validity into question. Supplementary material is available in an R package.
- Pitcher Accuracy Through Catcher Spotting: Assessing Rater Reliability (originally presented at NCSSORS, now in JQAS 7(2)) is a pilot study using the above applet to collect data on how different raters rate the same pitch targets and impacts under several different input methods. Ratings made directly on the screen proved considerably more reliable. The data set from this study is here.
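The censoring in the inter-arrival-time paper above arises because a period or game can end before the next goal is scored. One standard survival tool for such data is the Kaplan-Meier product-limit estimator; the sketch below (in Python, with made-up spell data) illustrates the idea, not the paper's actual estimates:

```python
# Each spell is (minutes since previous goal, indicator):
# 1 = next goal observed, 0 = spell right-censored at period end.
# These numbers are invented for illustration.
spells = [(3.2, 1), (7.5, 1), (12.0, 0), (1.8, 1), (20.0, 0),
          (5.5, 1), (9.1, 1), (15.0, 0), (2.4, 1), (6.7, 1)]

def kaplan_meier(spells):
    """Kaplan-Meier product-limit estimate of the survival function."""
    times = sorted({t for t, obs in spells if obs})
    surv, s = [], 1.0
    for t in times:
        at_risk = sum(1 for u, _ in spells if u >= t)  # spells still running
        deaths = sum(1 for u, obs in spells if u == t and obs)  # goals at t
        s *= 1 - deaths / at_risk
        surv.append((t, s))
    return surv

for t, s in kaplan_meier(spells):
    print(f"S({t:4.1f}) = {s:.3f}")
```

Censored spells contribute to the risk sets without ever registering an event, which is what keeps the estimate from being biased downward by truncated periods.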
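The semi-Markov simulation in the possession-and-location paper above can be sketched in miniature: each state combines possession and zone, transitions follow a Markov rule, and each visit lasts a random holding time. The states, transition probabilities and holding-time rate below are all invented for illustration; this is a Python sketch, not the paper's fitted model:

```python
import random

random.seed(0)

# Hypothetical joint possession-location states, plus absorbing goal states.
TRANSITIONS = {
    "our_zone_us":     {"neutral_us": 0.7, "our_zone_them": 0.3},
    "our_zone_them":   {"goal_them": 0.05, "our_zone_us": 0.45, "neutral_them": 0.5},
    "neutral_us":      {"their_zone_us": 0.6, "neutral_them": 0.4},
    "neutral_them":    {"our_zone_them": 0.6, "neutral_us": 0.4},
    "their_zone_us":   {"goal_us": 0.05, "their_zone_them": 0.45, "neutral_us": 0.5},
    "their_zone_them": {"neutral_them": 0.7, "their_zone_us": 0.3},
}

def simulate(start, horizon=1200.0):
    """Run one semi-Markov path until a goal or the time horizon (seconds)."""
    state, clock = start, 0.0
    while clock < horizon:
        if state in ("goal_us", "goal_them"):
            return state
        clock += random.expovariate(1 / 5.0)  # random holding time, mean 5 s
        r, cum = random.random(), 0.0
        for s, p in TRANSITIONS[state].items():
            cum += p
            if r < cum:
                state = s
                break
    return "no_goal"

runs = [simulate("neutral_us") for _ in range(2000)]
print(round(runs.count("goal_us") / len(runs), 2))
```

Averaging many simulated paths from each starting state is what yields the offensive and defensive potential of a situation.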
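The independence baseline in the streak paper above amounts to simulating each player-game as a Bernoulli trial (a hit in that game or not) and recording the longest run of successes. A Python sketch, with the per-game hit probability and career length chosen only for illustration (the paper's simulations use fitted career curves, and its supplementary code is in R):

```python
import random

random.seed(42)

def longest_streak(p_game, n_games):
    """Longest run of hit-games in a career of independent Bernoulli games."""
    best = run = 0
    for _ in range(n_games):
        run = run + 1 if random.random() < p_game else 0
        best = max(best, run)
    return best

# Distribution of the career-longest streak over many simulated careers:
# 15 seasons of 154 games, 80% chance of a hit in any given game.
sims = [longest_streak(p_game=0.8, n_games=154 * 15) for _ in range(1000)]
print(sum(s >= 56 for s in sims) / 1000)  # share of careers reaching 56 games
```

Even with a generous per-game hit probability, careers reaching a 56-game run are rare under independence, which is the sense in which the DiMaggio streak is surprising.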
Research: Algorithms for Perfect Sampling
Broadly speaking, Perfect Sampling refers to a method by which a stochastic input can be translated into a corresponding stochastic output with no estimation or added noise. In the Markov chain literature, this means drawing exactly from the stationary distribution of a Markov chain, the input being simulated steps of the chain itself. See David Wilson's primer on the history of the method, notably the development of Coupling From the Past.
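Coupling From the Past can be illustrated on a toy monotone chain: drive every state with the same randomness from further and further in the past, and once the extreme trajectories coalesce, the common value at time 0 is an exact draw from the stationary distribution. A Python sketch on a lazy random walk over {0,...,4} (whose stationary law happens to be uniform); the chain and update rule are invented for illustration:

```python
import random

def step(x, u):
    # Monotone update: the same u moves every state in the same direction,
    # so trajectories started at 0 and 4 sandwich all the others.
    if u < 0.5:
        return max(0, x - 1)
    return min(4, x + 1)

def cftp():
    """Coupling From the Past: exact draw from the stationary distribution."""
    T, us = 1, []
    while True:
        # Extend the SAME shared randomness further into the past.
        us = [random.random() for _ in range(T - len(us))] + us
        lo, hi = 0, 4
        for u in us:  # run both extreme chains from time -T up to time 0
            lo, hi = step(lo, u), step(hi, u)
        if lo == hi:  # coalescence: every start gives this same value
            return lo
        T *= 2        # no coalescence yet; go further back and reuse us

random.seed(7)
draws = [cftp() for _ in range(5000)]
print([round(draws.count(k) / 5000, 2) for k in range(5)])
```

Reusing the old randomness when extending into the past is essential; redrawing it would bias the output toward fast-coalescing paths.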
- A Practical Implementation of the Bernoulli Factory, with Jose Blanchet (submitted). This paper extracts and explains the Bernoulli Factory mechanism used in the work below, which deserved further illumination in its own right. R code is here.
- Exact Simulation and Error-Controlled Sampling via Regeneration and a Bernoulli Factory, with Jose Blanchet. This version was presented at the New England Statistics Symposium in 2007 and won an IBM T.J. Watson Student Paper Award. This incorporates two separate sampling methods:
- A Bernoulli Factory, which takes an input of a stream of Bernoulli random variables (coin flips) with unknown but fixed success probability p and outputs a stream of Bernoullis with success probability f(p), where the function itself is known.
- A regeneration scheme, wherein a Markov chain can be divided into subsections that are independent of each other and identically distributed.
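Two classical exact constructions give the flavour of the Bernoulli Factory idea (the factory in the papers above handles a more general class of functions f; this Python sketch is not their algorithm):

```python
import random

def make_coin(p):
    """A coin with hidden success probability p; the factory never sees p."""
    return lambda: random.random() < p

def factory_square(flip):
    """Exact Bernoulli(f(p)) draw with f(p) = p^2: the AND of two flips."""
    return flip() and flip()

def factory_half(flip):
    """von Neumann's trick: f(p) = 1/2 for any 0 < p < 1."""
    while True:
        a, b = flip(), flip()
        if a != b:
            return a  # heads-tails and tails-heads are equally likely

random.seed(3)
coin = make_coin(0.7)
est_sq = sum(factory_square(coin) for _ in range(20000)) / 20000
est_half = sum(factory_half(coin) for _ in range(20000)) / 20000
print(round(est_sq, 2), round(est_half, 2))
```

Both constructions use only flips of the p-coin, never p itself: the estimates should land near p^2 = 0.49 and 1/2. Functions like f(p) = min(2p, 1) are far harder, which is where the machinery in these papers comes in.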
Teaching
My involvement in teaching at CMU has been rewarding, educational and enjoyable. In addition to offering independent study courses, I've been involved in several lecture and seminar courses.
36-724, Applied Bayesian and Computational Methods, taught in Spring 2012, Fall 2010, Spring 2010.
36-757/758, Advanced Data Analysis, taught in Spring 2012, Fall 2011, Spring 2011.
I coordinated the department seminar series from 2009 to 2011.