(Summary: I embrace the term Data Science because it lets us nurture a number of underappreciated talents in our students.)
One of the least developed skills that you'll find in the profession of statistics is how to name something appealingly. The discipline itself is a victim of that; not only is it less sexy than most of its competitors, the word has both plural and singular meanings. The plural is how the discipline is seen from the outside: a dry collection of summaries and figures boiled down to fit on the back of a baseball card. The "singular" meaning is how to deal with uncertainty in data collection in a principled manner, which is a skill that, frankly, everyone we know can use.
The meaning comes out a bit better when you call it "the discipline of statistics", or "probability and statistics", which is connected but not identical, or the sleep-inducing "theoretical statistics" or the seemingly redundant (but far cooler) "applied statistics". The buzz 10 years ago was to call it "statistical science", as if our whole process were governed by the scientific method, when math is developed by proof and construction and rarely by experiment or clinical observation.
We're seeing the whole thing cook up now with the emergence of the term Data Science, which again seems to have multiple meanings, depending on who you ask:
1) "Data Science" is a catch-all term for probabilistic inference and prediction, emerging as a kind of compromise between the statistics and machine learning communities. An expert in this kind of data science should be familiar with both inference and prediction as the end goal. This seems to be the term favored by academics, particularly in how they market these tools as the curriculum for Master's programs.
2) A "data scientist" is a professional who can manage the flow of data from its collection and initial processing into a form usable for standard inference and prediction routines, then report the results of these routines to a decision maker. This definition of "data science" as the process by which this happens is favored by people in industry. The idea that the source of this data should be "Big" is often assumed but not necessary.
It also doesn't help that the term has been coined at least 3 times in the past 10 years by 4 different people, each with a stake in making their definition stick; and, as I will hammer in, it isn't really science, but it is so essential *to* good science that I'm willing to give it a mulligan.
So why would I step into what looks like a silly semantic debate? Partly because I'm paid to. I'm teaching these skills to multiple audiences, and over the course of the past year, two books by colleagues of mine have been published by O'Reilly: "Data Science for Business" by NYU professor Foster Provost and quasi-academic Tom Fawcett, and "Doing Data Science" by industry authorities Rachel Schutt and Cathy O'Neil. Both came about because of courses with the words "Data Science" in the title, at NYU and Columbia respectively; both make excellent reading for people who want to work with data in any meaningful capacity but, like me, prefer an informal style; and both will be on the recommended list when I teach R for Data Science again in the spring of 2014. It is also no accident that the content of Data Science for Business hews closer to the academic definition, and Doing Data Science, with its multiple contributions from industry specialists, lines right up with the industry definition.
The fact that I teach such a broad range of students, many of whom are very smart but technically inexperienced, is what's motivated me to think more deeply about process and less about particular skills. I'd have to guess that at best, the work I can do that I would call "science" is no more than a quarter of my total output. Yes, I build models, make inferences and predictions and design experiments, but the actual engineering I do is the clear dominating factor. I write code according to design principles as much as scientific thinking: if I know a quick routine will take one-tenth the time but be 95% as accurate as a slower but more correct routine, I'll weigh which method to use in the long run by some function other than raw accuracy.
For all these reasons, we should probably call it Data Engineering (or Data Flow Management), but we're stuck with Data Science as a popular, job-defining label. Far from an embarrassment of language (says the man who has effectively admitted that his blog's name is exaggerated by a factor of four), my preferred interpretation of a Data Scientist takes the best parts of the previous two:
3) Someone who is *trained* to examine unprocessed data, learn something about its underlying structural properties, construct the appropriate structured data set(s), use those to fit inferential or predictive models (possibly of their own design) and effectively report on the consequences is someone who has earned the title of Data Scientist.
What I've seen in all my time in academia is the assumption that these ancillary skills are necessary but can, if not should, be self-taught, particularly for PhD students but even for MS students and undergraduates. Cosma's got it exactly right that any self-respecting graduate of our department should have those skills, but we never explicitly test them on it or venerate those students who prove it. And if the problem is getting rid of the posers, we need to do a lot better when it comes to emphasizing this in our culture. To add another term to the stew, do we need to emphasize Data Literacy as an explicit skill? Or would it not be easier to appropriate Data Science as a term that gets down to brass tacks?