Earlier implementations of hexagonal bin plotting for NHL shots on goal were very productive, so thanks to those who gave feedback and helped to improve it.

The new version has many upgrades:

  • A permanent name: Hextally, to represent the hexagonal binning process and to follow the tradition of PECOTA and other player-named methods. The name of course came second, but I do find it funny that this applet and method is used to judge shooting skill, and is named for the only goaltender to score two goals on two shots.

  • Player charts! We can now look not only at the shots taken by each player, but also at the differing performances of the team when that player is on and off the ice. 

  • Man situations: full strength, power-play/shorthanded and four-on-four are all available.

  • Rink adjustments. The total number of shots on goal is kept the same, but to correct the overall imbalance of shots by zone, I randomly select shots to move to a neighboring zone so that the proportion of shots of each type is the same at home and away. (Snap shots and wrist shots were pooled due to the systematic confusion between these types by the official scorers.)
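The zone-rebalancing step can be sketched in a few lines. This is a toy Python version, not the applet's actual code: the zone layout and function names are mine, invented for illustration.

```python
import random

# Hypothetical scoring-zone layout: each zone lists its neighbors.
NEIGHBORS = {
    "slot": ["left_circle", "right_circle"],
    "left_circle": ["slot", "left_point"],
    "right_circle": ["slot", "right_point"],
    "left_point": ["left_circle"],
    "right_point": ["right_circle"],
}

def rebalance(shots, target_share, seed=0):
    """Move randomly chosen shots into a neighboring zone until each
    zone's share of shots matches target_share (e.g. the away-rink
    proportions). The total shot count is left unchanged."""
    rng = random.Random(seed)
    shots = list(shots)
    n = len(shots)
    target = {z: round(p * n) for z, p in target_share.items()}
    counts = {z: shots.count(z) for z in NEIGHBORS}
    for zone in NEIGHBORS:
        while counts[zone] > target.get(zone, 0):
            # Candidate destinations: neighbors still short of their target.
            short = [z for z in NEIGHBORS[zone] if counts[z] < target.get(z, 0)]
            if not short:
                break
            idx = rng.choice([i for i, s in enumerate(shots) if s == zone])
            dest = rng.choice(short)
            shots[idx] = dest
            counts[zone] -= 1
            counts[dest] += 1
    return shots
```

The key property is the one stated above: shots are relabeled, never created or destroyed, so every total stays fixed.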

  • Adjustment for small sample sizes. Since the method estimates shooting rates per 60 minutes, players who have low time-on-ice in a particular scenario, like a penalty killer with minimal power-play time, will have high-variance rates. To compensate, we add "fake" shots on goal to each scoring zone at the same rate as the team without that player, adding enough extra time to bring each player up to 300 minutes of time on ice. (This cutoff was chosen by eye and has not been peer-reviewed, but it suits the goal of evaluating players relative to their teammates.)
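The padding amounts to a simple shrinkage formula. Here is a sketch, with a hypothetical function name and signature, assuming rates are quoted per 60 minutes:

```python
def padded_rate(player_shots, player_toi, team_rate_per60, pad_to=300.0):
    """Shrink a player's shots-per-60 rate toward the team's rate
    without that player, by adding 'fake' shots at the team rate for
    enough extra minutes to bring the player up to pad_to minutes of
    time on ice. Players already at pad_to minutes are untouched."""
    extra = max(0.0, pad_to - player_toi)
    fake_shots = team_rate_per60 * extra / 60.0
    return 60.0 * (player_shots + fake_shots) / (player_toi + extra)
```

A player with only 60 real minutes gets 240 fake minutes at the team rate, so his estimate sits much closer to the team's than his raw rate does; a player with 600 minutes is reported as-is.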
Comments to improve both the presentation and the methods at hand are appreciated!

Hockey writing these days is peppered with references to Corsi and Fenwick, which are fancier names for the differential of shot attempts made by and against a team (Corsi counts every attempt, Fenwick excludes blocked shots), either overall or with respect to when a particular player is on the ice. These are fairly predictive of future success (or failure) because they indicate the degree to which a team has possession of the puck and plays in its offensive zone -- the two requisite conditions for scoring a goal under most circumstances.
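For readers new to the terms, the bookkeeping behind both measures is trivial; a sketch, with illustrative team and event labels rather than the NHL feed's actual codes:

```python
def corsi_fenwick(events):
    """Compute Corsi and Fenwick differentials from the perspective of
    the 'FOR' team. Each event is (attempting_team, attempt_type).
    Corsi counts every attempt (goals, shots on goal, misses, blocks);
    Fenwick drops blocked attempts."""
    corsi = fenwick = 0
    for team, etype in events:
        sign = 1 if team == "FOR" else -1
        corsi += sign
        if etype != "BLOCK":
            fenwick += sign
    return corsi, fenwick
```

Note what's absent: nothing in either tally records *where* the attempt came from, which is the point of the complaint that follows.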

And yet, the shorthand for this in the media is that these are proxies for puck possession, leaving location out in the cold, with some exceptions that are regrettably in the minority. Leave aside the notion that a blocked shot or a shot from the point is less valuable than a bona fide scoring chance; what's the actual consequence of this preference for possession over location?

If this had been in widespread use 10 years ago, I'd have said that prioritizing possession over location would have had a negative impact for one reason: the era of the neutral-zone trap, and the explicit importance that successful teams placed on pinning the opposing team in its own zone, meant that of those two elements, location really was supreme. In my very first published paper, I used a very limited but fun-to-collect data set of zone time and puck possession; for those games, it was clear that being in the offensive zone without possession of the puck was, on average, better for that team in terms of net goals scored in the ensuing seconds. And this was in a league that already had two-line passes.

The real danger isn't so much for people who are in the know. But with the increasing acceptance of #fancystats into the public sphere, it would be far too easy to assume that playing keep-away is necessarily better than playing dump-and-chase when it's a word, not actual number-crunching, that's pushing that point.

Pucksberry: Adapting Hexagonal Bin Plots for NHL Display

One of my favorite classes of statistical graphics in sports media is the hexagonal bin plot, used by Grantland's Kirk Goldsberry to illuminate the shooting patterns and successes of shooters in the NBA. Combined with his access to the luxuriously rich SportVU data, Goldsberry has made a second career using a single graphic to tell stories (he's also a geography professor). 

As of this NBA season, SportVU gives the x-y locations of all shots taken along with their success or failure in scoring, so Goldsberry has two variables to plot: the relative location of shots, and the proportion of shots that go in. These make for glorious comparisons to make a point, like how the Spurs dominated during their winning streak:


So of course, as a statistics professor who teaches graphics and does research on hockey, my first instinct is to steal it for a massive profit... er, see how I can adopt, adapt and improve this method for the hockey community at large, particularly since x-y data on shot attempts has been available from the NHL since 2008.

So what are the big differences between NBA and NHL data that we have to bear in mind? And when do we get to see some pretty pictures? (The answer to both after the jump.)
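Before the differences, a word on the binning itself, since it drives everything that follows. Assigning a point to its hexagonal bin is the step that tools like matplotlib's hexbin perform internally; here is a pure-Python sketch using the standard axial-coordinate rounding trick (the grid parameters are illustrative):

```python
import math

def hex_bin(x, y, size=1.0):
    """Map an (x, y) point to the center of its hexagonal bin on a
    pointy-top hex grid with circumradius `size`."""
    # Convert to axial hex coordinates.
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2.0 / 3 * y) / size
    # Cube-coordinate rounding to the nearest hex center.
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    # Back to Cartesian coordinates of the bin center.
    return size * math.sqrt(3) * (rx + rz / 2.0), size * 1.5 * rz
```

Tally shots per bin center and you have the raw material for a Goldsberry-style chart: bin counts for size, success proportion for color.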

nhlscrapr: An R package whose purpose is right there in the name

In putting together the game data from the NHL for the games we needed, my students and I (namely Sam Ventura) have been digging through the NHL Real-Time Scoring System database, or at least its user-facing side, to get data down to our desired resolution. The data available from the NHL online extends back to the 2002-2003 season, though new elements appeared in 2007 (a better play-by-play data set) and again in 2008 (x-y coordinates of some events).

We decided that it would be in everyone's best interest to have this data available and shareable to everyone, without our having to host it ourselves, since the NHL is doing that already.

As a result, we created the R package nhlscrapr, which has been available on CRAN for some time but has been updated to be much more usable, particularly as new games are played and added to the NHL website.

Ultimately, everything we need for longer-term studies is contained in these tables:

1) A unique roster of all players, de-duplicating cases where players change numbers or where scorekeepers prefer different spellings. Jean-Sebastian Giguere and J.S. Giguere might be the same person, but many of these spellings change on a game-by-game basis, and even if the NHL has unique player identifiers, we don't.

2) A table of all games played in the regular season and playoffs. (We cared less about preseason games for our tasks, mainly because of the excess of players who would not play in the NHL in the future.)

3) A full, annotated and augmented play-by-play table for each recorded event, including (and especially!) player substitutions. This was most important for us, since our unit of interest was the "shift" -- the contiguous span of time between events with the same players on the ice that ends with a noteworthy event. In the beginning these events were simply goals and line changes; we have since extended this to all events, including shots, hits and penalties.
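As a sketch of how shifts fall out of the play-by-play, here is the segmentation step in Python (nhlscrapr itself is an R package, and its actual schema differs; the tuple layout here is illustrative):

```python
def shifts(events):
    """Collapse a play-by-play stream into 'shifts': maximal runs of
    consecutive events with the same set of players on the ice.
    Each event is (time_seconds, frozenset_of_players); returns a list
    of (start_time, end_time, players) tuples."""
    out = []
    start, players = None, None
    for t, on_ice in events:
        if players is None:
            start, players = t, on_ice
        elif on_ice != players:
            # Personnel changed: close the current shift, open a new one.
            out.append((start, t, players))
            start, players = t, on_ice
    if players is not None:
        out.append((start, events[-1][0], players))
    return out
```

With shifts in hand, time on ice and on/off-ice event differentials for any player are just sums over the shifts that contain that player.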

What's Pulling The Goalie Actually Worth To A Team?

Summary: It's an innovation with no monetary cost, so let's figure out the actual gain to pulling the goalie earlier, and show that there's little harm to changing.

So apparently I can complain about progress a little bit. Pulling the goalie is one of my favorite topics in analytics for many reasons, but the biggest is that it feels like the easiest sell to make to teams as to why they should trust data-driven analysis: a change in strategy that costs no money to implement, no new assets to acquire and no new technology to trust.

When I deconstructed the pulled-goalie timing data even further, it became clear that the driving force was not earlier pulls in one-goal games but in two-goal games. Here are all the teams' average times divided by season and score differential when the goalie was pulled at even strength:

"Data Science" Is A Useful Label, Even If It's Usually 5% Science

(Summary: I embrace the term Data Science because it lets us nurture a number of underappreciated talents in our students.)

One of the least developed skills that you'll find in the profession of statistics is how to name something appealingly. The discipline itself is a victim of that; not only is it less sexy than most of its competitors, the word has both plural and singular meanings. The plural is how the discipline is seen from the outside: a dry collection of summaries and figures boiled down to fit on the back of a baseball card, rather than its "singular" meaning: how to deal with uncertainty in data collection in a principled manner, which is a skill that, frankly, everyone we know can use.

The meaning comes out a bit better when you call it "the discipline of statistics", or "probability and statistics", which is connected but not identical, or the sleep-inducing "theoretical statistics" or seemingly redundant (but far cooler) "applied statistics". The buzz 10 years ago was to call it "statistical science", as if our whole process was governed by the scientific method, when math is developed by proof and construction and rarely by experiment or clinical observation.

We're seeing the whole thing cook up now with the emergence of the term Data Science, which again seems to have multiple meanings, depending on who you ask:

1) "Data Science" is a catch-all term for probabilistic inference and prediction, emerging as a kind of compromise to the statistics and machine learning communities. An expert in this kind of data science should be familiar with both inference and prediction as the end goal. This seems to be the term favored by academics, particularly in how they market these tools as the curriculum for Master's programs.

2) A "data scientist" is a professional who can manage the flow of data from its collection and initial processing into a form usable for standard inference and prediction routines, then report the results of these routines to a decision maker. This definition of "data science" as the process by which this happens is favored by people in industry. The idea that the source of this data should be "Big" is often assumed but not necessary.

It also doesn't help that the term has been coined at least 3 times in the past 10 years by 4 different people, each with a stake in making their definition stick; and, as I will hammer home, it isn't really science, but it is so essential *to* good science that I'm willing to give it a mulligan.

So why would I step into what looks like a silly semantic debate? Partly because I'm paid to. I'm teaching these skills to multiple audiences, and over the course of the past year, two books by colleagues of mine have been published by O'Reilly: "Data Science for Business" by NYU professor Foster Provost and quasi-academic Tom Fawcett, and "Doing Data Science" by industry authorities Rachel Schutt and Cathy O'Neil. Both came about because of courses with the words "Data Science" in the title, at NYU and Columbia respectively; both make excellent reading for people who want to work with data in any meaningful capacity but like me prefer an informal style; and both will be on the recommended list when I teach R for Data Science again in the spring of 2014. It is also no accident that the content of Data Science for Business hews closer to the academic definition, and Doing Data Science, with its multiple contributions from industry specialists, lines right up with the industry definition.

The fact that I teach such a broad range of students, many of whom are very smart but technically inexperienced, is what has motivated me to think more deeply about process and less about particular skills. I'd have to guess that, at best, the work I do that I would call "science" is no more than a quarter of my total output. Yes, I build models, make inferences and predictions and design experiments, but the actual engineering is the clear dominating factor; I write code according to design principles as much as scientific thinking -- if I know a quick routine will take one-tenth the time but be 95% as accurate as a slower but more correct routine, I'll weigh which method to use in the long run by some other function.

For all these reasons, we should probably call it Data Engineering (or Data Flow Management) but we're stuck with Data Science as a popular, job defining label. Far from an embarrassment of language (says the man who has effectively admitted that his blog's name is exaggerated by a factor of four), my preferred interpretation of a Data Scientist takes the best part of the previous two:

3) Someone who is *trained* to examine unprocessed data, learn something about its underlying structural properties, construct the appropriate structured data set(s), use those to fit inferential or predictive models (possibly of their own design) and effectively report on the consequences has earned the title of Data Scientist.

What I've seen in all my time in academia is the assumption that these ancillary skills are necessary but can -- if not should -- be self-taught, particularly for PhD students but even for MS students and undergraduates. Cosma's got it exactly right that any self-respecting graduate of our department should have those skills, but we never explicitly test them on it or venerate those students who prove it. And if the problem is getting rid of the posers, we need to do a lot better when it comes to emphasizing this in our culture. To add another term to the stew, do we need to emphasize Data Literacy as an explicit skill? Or would it not be easier to appropriate Data Science as a term that gets down to brass tacks?

Skating Toward Progress, 2.5 Seconds Per Year

I tuned in during third-period action to watch the Avalanche play the Devils last night, with the Avs trailing 1-0, and realized I might see something special: Avs coach Patrick Roy pulling his goaltender earlier than other coaches would. And of course, I looked away too early to see it actually happen, but there it was: Roy pulled J.S. Giguere with two and a half minutes to go in regulation, the Avs tied the game and won it in overtime. As someone convinced that NHL teams are far too conservative when it comes to pulling the goalie, that's one data point of vindication for pulling the goaltender earlier in the game! Right?

Well, sort of. While Roy's been known to pull the trigger far earlier than most, in his postgame comments he credited it to his instincts rather than his calculations: "sometimes you go with your feeling when to pull the goalie and fortunately it worked for us."

Still, Roy's Avalanche easily have the earliest empty-net trigger of any team in the last decade when trailing by a single goal in any end game situation:


The mean pull time has also increased over the decade, from 61 seconds in 2002-2003 to 86 seconds through this season (not including last night's game), but no team has yet approached the 3-minute mark in its average empty-net time -- the amount of time that most simple Poisson-type models suggest is the minimum for this situation -- and only two are over the 2-minute mark at all. Still, I can't complain about progress!
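To see where a figure like 3 minutes comes from, here is a minimal competing-Poisson sketch. The rates below are illustrative assumptions, not fitted values, and real models also have to account for what happens after each goal; the point is only that pulling earlier can raise the chance of tying even though it concedes empty-netters.

```python
import math

def p_tie(t_minutes, lam_for, lam_against):
    """Chance of scoring the tying goal before conceding within
    t_minutes, treating both goal streams as independent Poisson
    processes (competing exponentials) at the given per-minute rates."""
    lam = lam_for + lam_against
    return lam_for / lam * (1.0 - math.exp(-lam * t_minutes))

# Illustrative rates only: with the goalie pulled, scoring roughly
# doubles, but empty-net goals against become common.
p_pulled = p_tie(3.0, lam_for=0.10, lam_against=0.12)
p_keep = p_tie(3.0, lam_for=0.05, lam_against=0.04)
```

Under these made-up rates, pulling with 3 minutes left gives a noticeably better chance of tying than leaving the goalie in, and a goal against costs nothing extra since a one-goal loss and a two-goal loss are worth the same.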

Reflections on Teaching: Fall 2013

I last wrote a teaching statement 3 years ago, and the number of things that has changed in the meantime is considerable. I've now taught lecture classes for undergrads, master's students and doctoral candidates, supervised individual projects, served on dissertation committees in several departments and co-authored multiple papers with students. As I think across all those experiences, there are things I've taken to heart and others I've considered and discarded; times I've taken chances and times I've played it safe.

Beyond that, technology has come a long way since then in terms of its immediate applicability in the classroom, and when to take advantage of that has also become a key question. What follows are my experiences in that time and how they've affected my perspectives, with examples from the classes I've taught - particularly the two courses I recently concluded teaching, in Statistical Graphics and Programming in R.

Elsevier Bought Mendeley; Internet Freaks Out; I'm Barely Surprised

I love it when my nerdiest pastime and professional interest -- bibliometrics and academic paper management -- makes the news in a big way. I like it more when it's direct evidence of all the issues that academia faces as a public good.

Mendeley is a "freemium" service for managing collections of academic papers, offering cloud-based storage for personal libraries. Its users have considerable affection for the service, whose management team has proclaimed its dedication to the Open Access movement. In the process, and in contrast, the company has built an impressively large database of user activity, one that it kept to itself rather than making available to its users.

Which is why the backlash to its purchase by Elsevier, a company that takes advantage of our public good for its private enrichment, strikes me as extremely naive. Mendeley's supposed commitment to the open access movement was already betrayed by its Facebook-like business model.

I'm less shocked since this is only the latest in a series of "betrayals" by companies supposedly behind principles of openness:

Combine this with the recent rise of "predatory" journals, and you can see why my worry has less to do with any individual companies and much more about the need to solidify the process of scientific communication as a public good.

Resigned To Change

What follows: I resign from two editorial boards on principle. I don't feel heroic about it, but it had to be done.

Last year, I signed the Elsevier boycott as soon as it was announced. I firmly believed at the time that the principles of the boycott were sound: this was a company that had historically charged obscene prices, and made extreme profits, by selling other people's work with cartel-like levels of market control. I knew how this made sense in the past -- as both a filter and a distribution source, academics had little choice but to work with for-profit publishing companies. But now, the situation borders on the absurd. To make an example out of one of the biggest publishers seemed almost automatic, and I joined the official boycott without hesitation, in addition to years of avoiding Elsevier journals to publish my own work.

All that's needed for the system to work without big publishing companies is an environment of open publication, and so I've enthusiastically submitted my work to society journals and others with principles of openness. One of these was the Berkeley Electronic Press (bepress), which, as a non-profit electronic publisher committed to open access, promised a way forward: with the Internet as the ultimate distribution venue, all that would be needed is an editorial structure, handled as it always has been by academics, the vast majority of whom work pro bono.

And so I joined two such efforts: first, the nascent journal Statistics, Politics and Policy, in 2010; and second, the slightly more venerable Journal of Quantitative Analysis in Sports, which (to my delight, as a long-time author and reader) I was asked to join roughly a year ago. Both have sterling editorial boards (aside from me) and I've enjoyed my time and efforts with both groups. But things got complicated in September 2011, when for-profit publisher De Gruyter announced that it was buying many bepress journals, including both SPP and JQAS. Originally it seemed as though little would change; my back-channel inquiries suggested that the new bosses wanted to change very little from the original bepress setup, which is why I was comfortable joining JQAS after the transition.

