October 2009 Archives

More on Correlations and Language

I was delighted to see the response in AG's blog to some of my thoughts on causative language, part of which is the observation that word order itself suggests a causal direction. As most researchers know, that isn't even the beginning of the story: translating mathematical concepts into plain language is difficult even when the scientist is playing honestly with the facts.

Even in a simple case like baseball, we drop terms all the time from the explanation, as Phil Birnbaum demonstrates:

We can run a simple regression, runs scored vs. triples hit. I used a dataset consisting of all full team-seasons from 1961 to 2008 (only for teams that played at least 159 games, to omit strike seasons). That was 1,121 teams. The result of the regression:

Runs = 731 - (0.44 * triples)

That's not a misprint: the regression tells us that every triple actually *costs* its team almost half a run!

Birnbaum goes on to demonstrate, through a very nice matching argument, that a triple really does have positive value in runs, but I still think he undersells the problem with the language. Looking at this statement alone, it's worth noting what the mathematical statement actually says in English:

"The expected number of runs that a Major League Baseball team (one of thirty) scores in a year is negatively correlated with the number of triples hit by said team, given the population and the underlying distribution of covariates"

and not

"one additional triple results in a loss of 0.44 runs".

In short, simplifying the language can strip out the details that matter most.
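The sign flip in the quoted regression is a classic omitted-variable story. Here is a minimal simulated sketch (my own toy numbers, not Birnbaum's dataset) of how a variable with a genuinely positive effect can pick up a negative coefficient in a simple regression when it is correlated with an omitted confounder, such as teams built on speed rather than power hitting more triples but scoring fewer runs overall:

```python
# Hypothetical illustration with simulated data: the confounder "power"
# raises runs and lowers triples, so a simple regression of runs on
# triples alone inherits a negative slope.
import numpy as np

rng = np.random.default_rng(0)
n = 1121  # same number of team-seasons as in the quoted regression

power = rng.normal(0, 1, n)                       # unobserved hitting profile
triples = 30 - 5 * power + rng.normal(0, 3, n)    # power teams hit fewer triples
runs = 731 + 0.9 * triples + 60 * power + rng.normal(0, 20, n)

# Simple regression of runs on triples alone:
slope_simple = np.polyfit(triples, runs, 1)[0]

# Multiple regression controlling for the confounder:
X = np.column_stack([np.ones(n), triples, power])
beta, *_ = np.linalg.lstsq(X, runs, rcond=None)
slope_controlled = beta[1]

print(slope_simple)      # negative: triples appear to "cost" runs
print(slope_controlled)  # recovers a value near the true +0.9 per triple
```

The point is not the particular numbers (which I made up) but that the "controlling for everything else in the data" clause is exactly what gets dropped when the coefficient is read as a causal effect.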

Model Checking and Baseball

One of the reasons I'm in the business of stochastic modelling in sports is that with such an abundance of data, it's easy to check a model against real-world scenarios. There's a really nice discussion going on at Sabermetric Research that addresses why standard linear regression alone isn't going to give a good enough picture of the causal processes in baseball, let alone the non-game-based world.

This is related to what AG has discussed on the notion of model scaffolding: that by slightly changing the specification of a model, one can gauge how believable the model is for describing the situation at hand. It's also a good warning that a regression coefficient doesn't necessarily mean what you think it does; this is illustrated clearly in the above-linked article where a poorly-chosen model suggests that hitting triples leads to a decrease in run support.

What really impresses me about this one is that the proposed solution to the analysis problem is to run a matching-like experiment -- take all base-run situations, match up those for when triples are hit to when they aren't (for baserunners, pitching scenarios, hitters, etc), then compare the runs scored in the inning afterwards to get a plausible effect size.
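The matching procedure described above can be sketched in a few lines. These records and the situation keys are purely illustrative toys; the real analysis would match on baserunners, outs, pitcher, hitter, and so on:

```python
# A minimal sketch of situation matching with made-up records:
# each tuple is (situation_key, hit_triple, runs_scored_rest_of_inning).
from collections import defaultdict

def avg(xs):
    return sum(xs) / len(xs)

events = [
    ("bases_empty_0_out", True, 1), ("bases_empty_0_out", False, 0),
    ("bases_empty_0_out", False, 1), ("bases_empty_0_out", True, 2),
    ("man_on_first_1_out", True, 2), ("man_on_first_1_out", False, 1),
    ("man_on_first_1_out", False, 0),
]

# Group outcomes by matched situation, split by whether a triple was hit.
by_situation = defaultdict(lambda: {"triple": [], "no_triple": []})
for key, tripled, runs in events:
    by_situation[key]["triple" if tripled else "no_triple"].append(runs)

# Within each matched situation, compare average runs after a triple to
# average runs without one, then average those differences.
diffs = [
    avg(g["triple"]) - avg(g["no_triple"])
    for g in by_situation.values()
    if g["triple"] and g["no_triple"]
]
effect = avg(diffs)
print(effect)  # positive here: within matched situations, triples add runs
```

Because the comparison happens within situations, the confounding that sank the naive regression (what kind of team, and what kind of inning, produces triples) is held fixed by construction.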

Review: "SuperFreakonomics" by Steven D. Levitt and Stephen J. Dubner

There's been a whole lot of controversy about this follow-up to the original Freakonomics, largely because of two words in the subtitle that got attention: "global cooling". Because of the chapter in question on climate science, Levitt and Dubner have been assaulted with charges that they ignore the true state of the science, starting with the chapter's opening about cries from the 1970s that the Earth was cooling. The shrill tone of the opposition was enough that Amazon disabled text searches of the book on its website (though I don't know which party was responsible for that).

And all I can say to Levitt and Dubner is: you lucky bastards.

I'm afraid to say that the Steph/vens' new product was always going to be a victim of their previous success, as the market has been flooded by imitators in the past 5 years (just search Amazon). Combine this with a classic case of "sophomore slump" regression to the mean -- when a very entertaining book hits the jackpot and its title enters the zeitgeist, how could its successor top it on merits alone?

This book is as well written as the original, mainly in that the storytelling is as compelling as ever. But even as the authors admit that their overriding theme "people respond to incentives" leads directly into "beware unintended consequences", their own speculation on issues is rife with unaddressed consequences. As much as I think geoengineering is an option we should consider to mitigate the effects of rising carbon dioxide emissions, I come away feeling that we could be doing a lot better at thinking through those unintended consequences.

I found their attitude toward skepticism in their first book much healthier, in particular with their controversial suggestion that Roe v. Wade was a contributing factor in lowering the violent crime rate: they advanced it as a hypothesis while still acknowledging that an 18-year gap between cause and effect creates a lot of problems for verification. They do not show the same perspicacity when considering the smokestacks-to-the-sky or Salter Sink concepts, and the dangers that could lie 25 years down the road because we didn't brainstorm hard enough.

In the end, a book that matches up with the original only in its writing style is probably going to outsell the first one because of a public relations media storm, and I can't help but think that some editor at William Morrow/HarperCollins is laughing, and enjoying a healthy bonus, for suggesting that "global cooling" might be prominently featured in the subtitle.

Review: "Mathletics", by Wayne Winston

Wayne Winston's name has been in the news recently, largely for his predictions about the coming NBA season. His predictions are based around his adjusted plus-minus statistics, some of which are open access and some of which come from his proprietary association with the Dallas Mavericks.

Part of the media attention he's been getting is timed with the release of his book "Mathletics: How Gamblers, Managers and Sports Enthusiasts Use Mathematics in Baseball, Basketball and Football", in which he sums up a number of methods for making data-driven decisions for building a professional sports team, as well as making decisions in-game.

First, what I like about the book: it's a good primer in the technical workings for the non-expert. Winston gives recipes for the methods he details as Excel spreadsheets, meaning that beginners can get their hands dirty immediately, something that as a teacher I wholeheartedly endorse. I can accept Excel as the tool of choice for its universality (and OpenOffice.org Spreadsheet can handle the recipes as well).

Second, Winston's bibliography actually summarizes a bit about each of the academic papers he's citing, which is a nice introduction to the literature for the non-expert.

The biggest problem with the book, from my point of view, is that it's not the book I expected from its title; I was hoping that a consultant with the Mavericks would have a little more to say about how sports teams actually weigh this kind of evidence, and how they balance the statistics against personal experience.

Sure, I have lots of other gripes: the Acronym Soup that plagues a lot of the statistics-in-sports community (hockey especially); Winston's overuse of words like "brilliant" and "wonderful" when describing the authors and movies he likes, which I find grating (and redundant); the overabundance of tables (with too many significant figures) and poor-quality graphs (see Page 49, DICE, for an example; I blame Excel and Winston's graphics editor); and not least, the title Mathletics, which, catchy as it is, suffers from an unfortunate namespace overload.

All in all, I would recommend this book to the non-expert as a handy overview of the state of analysis in sports these days, and as a guide to the right kinds of questions sports analysts should be asking.

A Socially Responsible Method of Announcing Associations

AG was involved in a discussion regarding the use of causal language in associational studies (one discussion among many, it should be noted). The gist of his point was that we as scientists shouldn't use causal language when the analysis isn't causal in nature -- a regression analysis, for example, yields only partial correlations.

The trouble is, causal claims have an order to them (like "aliens cause cancer"), and so do most if not all human sentences ("I like ice cream"). It's all too tempting to read a non-directional association claim as if it were causal -- my (least) favourite was a radio blowhard who said that in teens, cellphone use was linked with sexual activity, and without skipping a beat angrily proclaimed that giving kids a cell phone was tantamount to exposing them to STDs. Even careful language like "linked" can be gently ignored by the reader if the word order is there.

So here's a modest proposal: when possible, beat back the causal assumption by presenting an associational idea in the order least likely to be given a causal interpretation by a layperson or radio host.

To try it out: a random Google News headline reads "Prolonged Use of Pacifier Linked to Speech Problems" and strongly implies a cause-and-effect relationship, despite the (weak) disclaimer from the quoted authors. Reverse it and you've got "Speech Problems Linked to Prolonged Use of Pacifier", which is less insinuating, at least to me.

P.S. Yes, it's tantamount to underselling your own research, but your scientific soul will be cleaner for it, and in the end I think the trend would have some small payoff to society.

New England Statistics in Sports: videos available online