Statisticians aren’t the problem for data science. The real problem is too many posers
I met Cathy for coffee in Cambridge when she was presenting at MIT a while ago. I liked her style and knowledge. Re-posted with permission from the author.
by Cathy O’Neil
a data scientist who lives in New York City and writes at mathbabe.org
(Crossposted on Naked Capitalism)
I recently was hugely flattered by my friend Cosma Shalizi’s articulate argument against my position that data science distinguishes itself from statistics in various ways.
Cosma is a well-read, broadly educated guy, and a role model for what a statistician can be, not that every statistician lives up to his standard. I’ve enjoyed talking to him about data, big data, and working in industry, and I’ve blogged about his blogposts as well.
That’s not to say I agree with absolutely everything Cosma says in his post. In particular, there’s a difference between being a master of visualizations for a statistics audience and being able to put together a PowerPoint presentation for a board meeting, which some data scientists in the internet start-up scene definitely need to do. (Mostly this is a study in how to dumb stuff down without letting it become vapid, and in reading other people’s minds in advance to see what they find sexy.)
And communication skills are a funny thing; my experience is that communicating with an academic or a quant is a different kettle of fish from communicating with the Head of Product. Each audience has its own dialect.
But I totally believe that any statistician who willingly takes a job titled “Data Scientist” would be able to do these things; it’s a self-selection process, after all.
Statistics and Data Science are on the same team
I think that casting statistics as the enemy of data science is a straw man play. The truth is, an earnest, well-trained, and careful statistician in a data scientist role would adapt to it quickly and flourish, if he or she could learn to stomach the business-speak and hype (how much of that there is varies by role; for some data science jobs it’s really not a big part at all, for others it may be).
It would be a petty argument indeed to try to make this into a real fight. As long as academic statisticians are willing to admit that they typically spend more time wondering about the exact conditions under which a paper will be published than worrying about how long it will take to train a model (which isn’t to say they never worry about that), and as long as data scientists admit that they mostly just redo linear regression in weirder and weirder ways, then there’s no need for a heated debate at all.
Let’s once and for all shake hands and agree that we’re here together, and it’s cool, and we each have something to learn from the other.
Posers
What I really want to rant about today though is something else, namely posers. There are far too many posers out there in the land of data scientists, and it’s getting to the point where I’m starting to regret throwing my hat into that ring.
Without naming names, I’d like to characterize problematic pseudo-mathematical behavior that I witness often enough that I’m consistently riled up. I’ll put aside hyped-up, bullshit publicity stunts and generalized political maneuvering because I believe that stuff speaks for itself.
My basic mathematical complaint is that it’s not enough to just know how to run a black-box algorithm. You actually need to know how and why it works, so that when it doesn’t work, you can adjust. Let me explain this by analogy with the Rubik’s cube, which I taught my beloved math-nerd high school students to solve using group theory just last week.
First we solved the “position problem” for the 3-by-3-by-3 cube using 3-cycles, and proved it worked, by exhibiting the group acting on the cube, understanding it as a subgroup of the symmetric group on the movable pieces, and thinking hard about things like the sign of basic actions to prove we’d thought of and resolved everything that could happen. We solved the “orientation problem” similarly, with 3-cycles.
I did this three times, with the three classes, and each time a student would ask me if the algorithm is efficient. No, it’s not efficient, it takes about 4 minutes, and other people can solve it way faster, I’d explain. But the great thing about this algorithm is that it seamlessly generalizes to other problems. Using similar sign arguments and basic 3-cycle moves, you can solve the 7-by-7-by-7 (or any of them actually) and many other shaped Rubik’s-like puzzles as well, which none of the “efficient” algorithms can do.
Something I could have mentioned but didn’t is that the efficient algorithms are memorized by their users; they’re basically black-box algorithms. I don’t think people understand to any degree why they work. And when their users are confronted with a new puzzle, some of those tricks generalize but not all of them, and they need new tricks to deal with centers that get scrambled with “invisible orientations”. And it’s not at all clear they can solve a tetrahedron puzzle, for example, with any success.
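To make the sign argument above concrete, here’s a minimal sketch (my illustration, not from the group-theory lesson itself) that computes the sign of a permutation by decomposing it into cycles. The point is the invariant it exposes: a 3-cycle is even, so any sequence of 3-cycle moves can only reach even permutations, which is exactly the kind of thing you check before claiming an algorithm can “solve everything that could happen.”

```python
def sign(perm):
    """Sign of a permutation given as a list mapping i -> perm[i].

    Decomposes the permutation into disjoint cycles; a k-cycle
    contributes (-1) ** (k - 1) to the sign.
    """
    seen = set()
    s = 1
    for start in range(len(perm)):
        if start in seen:
            continue
        # Walk the cycle containing `start` and record its length.
        length, i = 0, start
        while i not in seen:
            seen.add(i)
            i = perm[i]
            length += 1
        s *= (-1) ** (length - 1)
    return s

# A 3-cycle is even, a single swap is odd:
print(sign([1, 2, 0]))  # +1  (0 -> 1 -> 2 -> 0)
print(sign([1, 0]))     # -1  (a transposition)
```

Because a face quarter-turn 4-cycles the corners and 4-cycles the edges at the same time, the two odd permutations compose to an even one, and parity bookkeeping like this tells you which piece arrangements are reachable at all, and hence what a 3-cycle-based solution must handle.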
Democratizing algorithms: good and bad
Back to data science. It’s a good thing that data algorithms are getting democratized, and I’m all for there being packages in R or Octave that let people run clustering algorithms or steepest descent.
But, contrary to the message sent by much of Andrew Ng’s class on machine learning, you actually do need to understand how to invert a matrix at some point in your life if you want to be a data scientist. And, I’d add, if you’re not smart enough to understand the underlying math, then you’re not smart enough to be a data scientist.
I’m not being a snob. I’m not saying this because I want people to work hard. It’s not a laziness thing, it’s a matter of knowing your shit and being for real. If your model fails, you want to be able to figure out why it failed. The only way to do that is to know how it works to begin with. Even if it worked in a given situation, when you train on slightly different data you might run into something that throws it for a loop, and you’d better be able to figure out what that is. That’s your job.
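Here’s a small sketch of what knowing the underlying math buys you, using ordinary least squares (my example; variable names are made up). The normal equations involve inverting (or solving against) the Gram matrix X^T X, and that derivation tells you exactly when and why a fit blows up: collinear features make the matrix singular, which a button-presser only experiences as mysteriously garbage coefficients.

```python
import numpy as np

# Ordinary least squares "by hand" via the normal equations:
#   beta = (X^T X)^{-1} X^T y
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
true_beta = np.array([2.0, -3.0])
y = X @ true_beta + 0.01 * rng.normal(size=50)

# In practice you solve the linear system rather than invert
# explicitly (better numerically), but the algebra is the same.
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # close to [2, -3]

# The failure mode the derivation predicts: duplicate a column and
# X^T X becomes singular, so the normal equations have no unique
# solution.
X_bad = np.column_stack([X, X[:, 1]])
print(np.linalg.matrix_rank(X_bad.T @ X_bad))  # 2, not 3
```

The rank check at the end is the diagnosis step: someone who knows the derivation looks at X^T X and immediately sees the redundancy in the training data; someone who only knows the button sees a model that “stopped working.”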
As I see it, there are three problems with the democratization of algorithms:
- As described already, it lets people who can load data and press a button describe themselves as data scientists.
- It tempts companies to never hire anyone who actually knows how these things work, because they don’t see the point. This is a mistake, and could have dire consequences, both for the company and for the world, depending on how widely their crappy models get used.
- Businesses might think they have awesome data scientists when they don’t. That’s not an easy problem to fix from the business side: posers can be fantastically successful exactly because non-data scientists who hire data scientists in business, i.e. business people, don’t know how to test for real understanding.
How do we purge the posers?
We need to come up with a plan to purge the posers; they’re annoying, and they’re giving data science a bad name.
One thing that will help in this direction is Rachel Schutt’s Data Science class at Columbia next semester, which is going to be a much-needed bullshit-free zone. Note there’s been a time change that hasn’t been reflected in the announcement yet: it will meet once a week, on Wednesdays, for three hours starting at 6:15pm. I’m looking forward to blogging about the contents of these lectures.
This reminded me of a comment made by a Los Alamos statistician in the mid-80s about the venture of chemists into data analysis:

Richard J. Beckman, Los Alamos National Laboratory:
“There has been an instrumentation revolution in the chemical world which has changed the way both chemists and statisticians think. Instrumentation has led chemists to multivariate data, much multivariate data. Gone are the days when the chemist takes three univariate measurements and discards the most outlying.

Faced with these large arrays of data the chemist can become somewhat lost in the large assemblage of multivariate methods available for the analysis of the data. It is extremely difficult for the chemist, and the statistician for that matter, to form hypotheses and develop answers about the chemical systems under investigation when faced with large amounts of multivariate chemical data.

Professor Harper proposes an intelligent instrument to solve the problem of the analysis and interpretation of the data. This machine will perform the experiments, formulate the hypotheses, and “understand” the chemical systems under investigation.

What impact will such an instrument have on both chemists and statisticians? For the chemist, such an instrument will allow more time for experimentation, more time to think about the chemical systems under investigation, a better understanding of the system, and better statistical and numerical analyses. There would be a chemometrician in every instrument! For the statistician, the instrument will mean the removal of outliers, trimmed data, automated regressions, and automated multivariate analyses. Most important, the entire model building process will be automated.

There are some things to worry about with intelligent instruments. Will the chemist know how the data have been reduced and the meaning of the analysis? Instruments made today do some data reduction, such as calibration and trimming, and the methods used in this reduction are seldom known by the chemist. With a totally automated system the chemist is likely to know less about the analysis than he does with the systems in use today.

The statistician, when reading the paper of Professor Harper, probably asks: what is the role of the statistician in this process? Will the statistician be replaced with a microchip? Can the statistician be replaced with a microchip? In my view the statistician will be replaced by a microchip in instruments such as those discussed by Professor Harper. This will happen with or without the help of the statistician, but it is with the statistician’s help that good statistical practices will be part of the intelligent instrument.

Professor Harper should be thanked for her view of the future chemical laboratory. This is an exciting time for both the chemist and the statistician to work and learn together.”
Source: http://nvlpubs.nist.gov/nistpubs/jres/090/6/V90-6.pdf
Do you really want to purge the posers, or do you want more than one level of “data scientist” identified? Seems to me that the people who are “posers” have contributions to make to the workforce, but the quality control function must be performed by “non-posers” and must be taken very seriously in the profession and by employers.