by Mike Kimel
Some Thoughts on Statistical Analysis (A bit wonky)
Via Tyler Cowen, a pretty good post on statistical analysis. While I suggest reading the post, I’ve reproduced the first sentence of each point the author makes below (though I’ve cut out the pieces in between):
1. When you don’t have to code your own estimators, you probably won’t understand what you’re doing…
2. When it’s extremely low cost to perform inference, you are likely to perform a lot of inferences…
3. When operating software doesn’t require a lot of training, users of that software are likely to be poorly trained…
4. When you use proprietary software, you are sending the message that you don’t care about whether people can replicate your analyses or verify that the code was correct.
I think for the most part, this is a brilliant post, and it is worth keeping the main points in mind as they are very, very good. That said, I tend to disagree slightly with a few points. One is the author’s love of R, stated repeatedly in the pieces of the post I didn’t quote. I find R to be a pain in the #$$. So does the author, but he feels that is a virtue as per item 2 above. Frankly, the older I get, the more I use (wait for it!) Excel to do statistical analysis. Excel doesn’t do much in the way of sophisticated work, and what it does is clunky. I’m guessing it would violate the author’s point 4, being proprietary, but I’ve checked enough results on Excel using matrix algebra that I’m satisfied its basic regression functionality works. If you want to do anything more sophisticated than a plain old OLS regression, you need to code it yourself in VBA.
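For what it’s worth, that matrix algebra check is nothing exotic: OLS is just the normal equations, beta = (X’X)^(-1) X’y. Here is a minimal sketch of the comparison, written in R (ironically, given the above) only because it is compact; the toy data are my own illustration, not anything from the post:

# Check a canned OLS fit against the normal equations. Toy data only.
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)    # true intercept 2, slope 3, plus noise
X <- cbind(1, x)               # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
coef(lm(y ~ x))                # the built-in estimator
beta_hat                       # should agree to rounding error

In Excel, the same arithmetic runs through MMULT, MINVERSE and TRANSPOSE, which is exactly the kind of cross-check I mean.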
But here’s what I love about using Excel… Odds are, whatever data you’re using to do your analysis, you started off by sorting it in Excel in the first place. It’s easy to sort and graph the data, and the eyeball and the nose are always the most important tools in statistics. Another cool thing about Excel – it will spit out the residuals. And you can graph and organize those residuals twenty-three ways from Sunday, which means you can build your own diagnostics appropriate for whatever task you’re doing (see the sketch below). Unless you absolutely positively have to run a system of equations fast, or you have a client who has glommed onto some absolutely useless statistical tool with a cool name (dig through an advanced econometrics book and you’ll spot ’em aplenty) and you can’t shake them from their ignorance, there isn’t much point to using R or Stata or you name the program. If your problem is not lack of time, but rather data sets that are too big for Excel to handle, find someone who is good in C++.
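To make the residuals point concrete, here is the kind of home-made diagnostic I mean, again sketched in R for brevity; the toy series is the same illustrative one as above, and the particular plots are assumptions about what you might look for, not a recipe:

# Home-made residual diagnostics on a toy fit. Illustrative only.
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)
res <- resid(fit)
plot(fitted(fit), res, xlab = "fitted values", ylab = "residuals",
     main = "Residuals vs. fitted: eyeball for patterns")
abline(h = 0, lty = 2)         # residuals should straddle zero
qqnorm(res); qqline(res)       # rough eyeball check on normality

In Excel you get the raw material for the same thing by checking the residual output box in the regression dialog, then charting or pivoting the result however the problem demands.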
Another thing… I suspect the author of that post and I have similar views on SAS. My own experience – I have seen or had associations with a number of organizations that consider themselves data savvy, and in my opinion, from what I can tell, if a company has X SAS licenses, it almost invariably means the organization has X junior analysts whose job it is to reach actionable conclusions using SAS who haven’t got the vaguest clue how to interpret the results of the simplest statistical analysis. This isn’t to say SAS is a bad tool, but merely that many organizations consider having a SAS license to be, in and of itself, a holy grail that allows them to dispense with any real expertise.
One final disagreement with the author – he seems to be falling into a fallacy many economists do, which is assuming that what economists do is a science. This to me indicates a failure to understand that a science is something scientists do, and what economists do is very different from what scientists do. In biology, you won’t get very far if you try to peddle Lamarckian evolution, Lysenkoism or ID. Try being a proponent of the theory of phlogiston or the aether wind and see what that does for your career in physics. But the equivalent behavior in Economics won’t stop you, and may even prove beneficial to those wishing to get tenure at Harvard or Chicago, or be Dean of an Ivy League business school. And then there are the think tanks…
Mike
i tend to discount anything Cowen says because he has said a number of things i think are pretty foolish.
but he sounds fairly likely to be right here.
but i would add a maxim i learned in college, and learned even more at work: “if you have to rely on statistics, you don’t know what you are talking about.”
(just to avoid the obvious question: i used to teach statistics to undergraduates, so my disparagement is not based entirely on ignorance.)
“Hard to use” is never a positive even if one thinks it puts them in some kind of enlightened software guild.
Mike:
Nice to see you back! When I am doing LSS, I post the data in Excel and then copy it to Minitab. Anything I could ever want to do is there. It is proprietary, but it is also well known.
Cowen is probably correct on his assumptions. The biggest mistake most will make is not determining whether the data is normal or not; each case requires a different set of testing criteria. The second mistake often made is to assume the results lead to a conclusion. Stats only point in a direction which needs further analysis.
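(To illustrate that first point with a minimal sketch – made-up data, in R, and the 0.05 cutoff is just the usual convention: test for normality first, then let that choose the testing criteria.)

# Check normality before choosing the test. Illustrative data only.
set.seed(42)
a <- rlnorm(30)                # deliberately skewed samples
b <- rlnorm(30) * 1.5
looks_normal <- shapiro.test(a)$p.value > 0.05 &&
                shapiro.test(b)$p.value > 0.05
if (looks_normal) {
  print(t.test(a, b))          # parametric criteria if both look normal
} else {
  print(wilcox.test(a, b))     # nonparametric criteria otherwise
}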
Gracious alive, I did die and go to stat hell. (Last week I thought it was political stat hell; but it is just stat hell.) What goes around comes around. The world really is an endless groundhog day.
Mike, I wanted to send this to you but you didn’t respond to my question about your old email addr. So I’ll just post it here. This was written by a “real” statistician when asked to respond to a presentation about intelligent chemical instruments:
“Richard J. Beckman
Los Alamos National Laboratory
There has been an instrumentation revolution in the chemical world which has changed the way both chemists and statisticians think. Instrumentation has led chemists to multivariate data – much multivariate data. Gone are the days when the chemist takes three univariate measurements and discards the most outlying.
Faced with these large arrays of data the chemist can become somewhat lost in the large assemblage of multivariate methods available for the analysis of the data. It is extremely difficult for the chemist – and the statistician for that matter – to form hypotheses and develop answers about the chemical systems under investigation when faced with large amounts of multivariate chemical data.
Professor Harper proposes an intelligent instrument to solve the problem of the analysis and interpretation of the data. This machine will perform the experiments, formulate the hypotheses, and “understand” the chemical systems under investigation.
What impact will such an instrument have on both chemists and statisticians? For the chemist, such an instrument will allow more time for experimentation, more time to think about the chemical systems under investigation, a better understanding of the system, and better statistical and numerical analyses.
There would be a chemometrician in every instrument! For the statistician, the instrument will mean the removal of outliers, trimmed data, automated regressions, and automated multivariate analyses. Most important, the entire model building process will be automated.
There are some things to worry about with intelligent instruments. Will the chemist know how the data have been reduced and the meaning of the analysis? Instruments made today do some data reduction, such as calibration and trimming, and the methods used in this reduction are seldom known by the chemist. With a totally automated system the chemist is likely to know less about the analysis than he does with the systems in use today.
The statistician, when reading the paper of Professor Harper, probably asks: what is the role of the statistician in this process? Will the statistician be replaced with a microchip? Can the statistician be replaced with a microchip? In my view the statistician will be replaced by a microchip in instruments such as those discussed by Professor Harper. This will happen with or without the help of the statistician, but it is with the statistician’s help that good statistical practices will be part of the intelligent instrument.
Professor Harper should be thanked for her view of the future chemical laboratory. This is an exciting time for both the chemist and the statistician to work and learn together.”
Well, the above was in 1985 and the stories I could tell!! But, statistics are abused badly by everyone from voodoo economists to statisticians themselves. Some of my best friends produce and sell software tools.
The fear was that canned programs would lead to people running data without really knowing how to prepare the data, how to analyze the data, or how to interpret the results. You can find plenty of examples in the early literature of ALL fields as they discovered computers and jumped in with both feet.
A lot of it was crap except that the crap got things going and today we have wonderful results in some quarters.
I actually kept my S-Plus and Matlab licenses almost to the end but rarely actually did anything important with them any more.
A few comments:
1. The post which I quoted is not by Tyler Cowen – it was linked to by Cowen.
2. Anna – try mike period kimel with a gmail period com at the end.