Elementary Statistics Review — Hypothesis Testing
I don’t know what to do about the widespread complete ignorance about the concepts of null hypothesis, rejection, failure to reject, p-levels.
I will rant after the jump
A glossary is urgently needed.
1. Size — the size of the test is the probability that a true null hypothesis will be rejected. For purely historical reasons the size 0.05 is often discussed. This is basically because the smallest 95% interval of a normal is roughly an even 4 standard deviations across. There is no particular importance of the number 0.05. Choosing it has nothing in particular to do with any mathematical result, theorem or conjecture. It is basically arbitrary.
2. Power. Power is not a number. It is a function. It is the probability that the null hypothesis will be rejected if the alternative hypothesis is true. Usually, the alternative hypothesis is not a particular probability distribution but the hypothesis that the data are generated by one element of a set of possible data generating processes (DGPs). The power is the probability of rejection as a function with that set as its domain.
3. p-level. This is the probability that a statistic will be in some set given some assumption about the DGP. Almost always it is the probability that a scalar will be greater than some number. It is related to size. If the p-level is p, then one would reject if one had decided to use a test with size > p. The appeal of p-levels is that they are dimensionless: one need not consider units of measure or what is, in a particular case, large enough to be important. Probabilities are always numbers between 0 and 1. By reducing a data set to a p-level, one can communicate without any trouble or any need to understand what the hell was measured or why.
This convenience of p-levels is a catastrophe. It is irresistibly tempting to look at p-levels, which are always elements of the same simple set. One can sound as if one understands what is being discussed with no clue as to what was measured, what the variables are, or why anyone would care.
On the other hand, one will often say things that are dangerously false if one decides to look only at p-levels so one doesn’t need to know or understand what is going on at all.
4. Statistically significant
I can’t define this because the two word phrase I wrote above doesn’t mean anything at all. It is a statement that a null was rejected by a test with some size. A meaningful phrase would be “statistically significant at size 5%”. Another equally meaningful phrase would be “statistically significant at size 3.14159 %”. But, by itself, “statistically significant” doesn’t mean anything.
Even if it is well defined (say by stating size 5%) it doesn’t necessarily mean much.
“Sign, size, significance” is a slogan I was taught. It is unfortunate alliteration, since what is meant is sign, magnitude, significance. That is, the first thing to look at is the sign of the point estimate (does the trial suggest the treatment is better or worse than the control?). Then, how big is the point estimate? Is it large enough that if the estimate were the truth we would care? What if the true value is half the estimate? Finally, third and last, is the null that the true parameter is within the set called the null hypothesis rejected at some arbitrary level? It is unfortunate that “size” means both magnitude and probability of rejecting a true null (don’t blame me, blame Neyman and Pearson).
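To make the glossary concrete, here is a minimal sketch of size, power and p-level for one particular test. Everything specific in it is my assumption for illustration: a one-sided z-test of "mean = 0" on iid normal data with known variance one, a sample of 100, and the function names. It is only meant to show that size is a single number, power is a function of the alternative, and a p-level leads to rejection exactly when it is below the chosen size.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def critical_value(size):
    """Reject the null when the z statistic exceeds this value (one-sided)."""
    return STD_NORMAL.inv_cdf(1.0 - size)


def power(true_mean, n, size=0.05):
    """Probability of rejecting 'mean = 0' when the data are iid N(true_mean, 1)."""
    # z = sqrt(n) * sample_mean is distributed N(sqrt(n) * true_mean, 1).
    return 1.0 - STD_NORMAL.cdf(critical_value(size) - sqrt(n) * true_mean)


def p_level(z):
    """Probability, under the null, of a z statistic at least this large."""
    return 1.0 - STD_NORMAL.cdf(z)


# Power depends on which alternative is true; at true_mean = 0 it equals the size.
for mu in (0.0, 0.1, 0.2, 0.5):
    print(f"true mean {mu:.1f}: power {power(mu, n=100):.3f}")

# The same z statistic is rejected at one size and not at another.
z = 1.8  # hypothetical observed statistic
print(f"p-level {p_level(z):.4f}")
print("reject at size 5%?", p_level(z) < 0.05)
print("reject at size 3.14159%?", p_level(z) < 0.0314159)
```

Note that nothing in the output says what was measured or whether a mean of 0.1 matters. That has to come from knowing the subject.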
It is clear that many people including most journalists do not understand these concepts. I am fairly sure that my effort at remedial education will be a complete (and irritating) waste of time.
Any discussion should start with all cases in which there are solid theorems. In fact this begins and ends with the Neyman-Pearson theorem, and the useful part is the Neyman-Pearson lemma.
The lemma asks how to test a point null against a point alternative. So the question is whether the data are generated by distribution F(.) or G(.). This implies that there is no parameter estimation. F is something like iid normal with mean zero and variance one (not normal with mean mu and variance sigma^2; that’s a hypothesis such that there is no clear answer on how to test it).
In the lemma Neyman and Pearson (hence NP) show that the correct test is to calculate the likelihood ratio G(X)/F(X), where X is the observed data, and to reject F when the ratio exceeds a critical level chosen to give the desired size of the test. This will give the test with the greatest power against G for any chosen size.
The theorem notes that if G is not one single distribution, then there is, in general, no best test. Even if G is one of only two distributions G1 and G2, the best test of F against G1 is based on G1(X)/F(X), which is not the same as G2(X)/F(X), the basis of the best test of F against G2.
Well that doesn’t get us very far. Also that is about as far as mathematical statistics has gotten. Beyond that there are choices, approximations, conventions but not, I claim, extremely useful theorems.
One advantage of the Neyman-Pearson example of a point null v a point alternative is that it makes it clear that the choice of which is the null and which is the alternative is arbitrary. One can make G the null and test it against the alternative F. There can be a range of F(X)/G(X) such that F is not rejected against G and G is not rejected against F.
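A minimal sketch of the NP likelihood ratio test, under assumptions I am adding for illustration: F is iid normal with mean 0 and variance 1, G is iid normal with mean 1 and variance 1, and the data lists are made up. For this pair the log of G(X)/F(X) works out to sum(X) - n/2, so the critical level for a given size can be read off the normal distribution of sum(X) under F.

```python
from math import log, sqrt
from statistics import NormalDist

F = NormalDist(0.0, 1.0)  # point null (assumed for illustration)
G = NormalDist(1.0, 1.0)  # point alternative (assumed for illustration)


def log_likelihood_ratio(data):
    """log of G(X)/F(X) for iid observations."""
    return sum(log(G.pdf(x)) - log(F.pdf(x)) for x in data)


def critical_log_ratio(n, size):
    """Level that log G(X)/F(X) exceeds with probability `size` when F is true."""
    # For these two normals log G(X)/F(X) = sum(X) - n/2,
    # and under F the sum of n observations is N(0, n).
    return sqrt(n) * NormalDist().inv_cdf(1.0 - size) - n / 2.0


def reject_F(data, size=0.05):
    """NP test: reject F in favour of G when the likelihood ratio is large."""
    return log_likelihood_ratio(data) > critical_log_ratio(len(data), size)


mild = [0.3, 1.1, -0.2, 0.9, 1.4]    # hypothetical data
strong = [1.5, 1.5, 1.5, 1.5, 1.5]   # hypothetical data
print(log_likelihood_ratio(mild), reject_F(mild))
print(log_likelihood_ratio(strong), reject_F(strong))
```

The closed form for the critical level is special to this pair of distributions. For other choices of F and G one would have to work out, or simulate, the distribution of the ratio under F.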
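Here is a sketch of why two alternatives can call for two different tests. The setup is my assumption, chosen so the point is stark: F is iid normal with mean 0 and variance 1, G1 has mean +1 and G2 has mean -1, with 5 observations and size 5%. The NP test against G1 rejects for a large z statistic, the NP test against G2 rejects for a small one, and each has almost no power against the other alternative.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf
N = 5          # hypothetical sample size
SIZE = 0.05
C = NormalDist().inv_cdf(1.0 - SIZE)  # one-sided critical value


def power_upper_test(true_mean):
    """Reject F when z > C: the NP test of F against G1 (mean +1)."""
    # z = sqrt(N) * sample_mean is N(sqrt(N) * true_mean, 1).
    return 1.0 - PHI(C - sqrt(N) * true_mean)


def power_lower_test(true_mean):
    """Reject F when z < -C: the NP test of F against G2 (mean -1)."""
    return PHI(-C - sqrt(N) * true_mean)


for name, mu in (("F ", 0.0), ("G1", 1.0), ("G2", -1.0)):
    print(f"truth {name}: upper test rejects with prob {power_upper_test(mu):.4f}, "
          f"lower test rejects with prob {power_lower_test(mu):.4f}")
```

Both tests have the same size, so size alone cannot tell you which to use. You have to decide which alternative you care about.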
Also for outcomes in this range, it is easy to see which hypothesis receives weak support from the data. If G(X)/F(X) > 1, then the data support G even if the value is not large enough to reject at, say, the 3.14159 % level.
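The undecided range can be sketched for a specific pair of hypotheses. As assumptions for illustration only, take F as iid normal with mean 0 and variance 1, G as iid normal with mean 1 and variance 1, and 5 observations. Both tests depend on the data only through the sum, so the range is an interval for the sum, and within it the likelihood ratio still says which hypothesis the data lean toward.

```python
from math import exp, sqrt
from statistics import NormalDist


def undecided_range(n, size=0.05):
    """Interval for sum(X) where neither F nor G is rejected at the given size."""
    z = NormalDist().inv_cdf(1.0 - size)
    lower = n - sqrt(n) * z  # under G, sum(X) ~ N(n, n); G is rejected below this
    upper = sqrt(n) * z      # under F, sum(X) ~ N(0, n); F is rejected above this
    return lower, upper      # the interval is empty if lower > upper


def likelihood_ratio(total, n):
    """G(X)/F(X) for these two normals; it depends on the data only through the sum."""
    return exp(total - n / 2.0)


n = 5
lower, upper = undecided_range(n)
print(f"neither hypothesis is rejected when {lower:.3f} <= sum(X) <= {upper:.3f}")

total = 3.5  # hypothetical observed sum, inside the undecided range
print(f"G(X)/F(X) = {likelihood_ratio(total, n):.3f}")
```

With these assumptions the interval shrinks as n grows and eventually disappears, which is one way to see what collecting more data buys you.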
OK so why am I totally losing it ?
First there are the two Remdesivir clinical trials. They give a point estimate of the proportional hazard of improvement on the same scale of sickness of 1.2 and 1.31 (or maybe 1/0.69; I swear no one makes this clear). These are similar results pointing in the same direction. Both say it helps but it isn’t a home run knockout punch magic bullet.
I will not bother to provide the hundreds of links to claims that the two results are very different, point in opposite directions and contradict each other. The reason for all this nonsense is that people look at the p-level, not the parameter estimate. In the Chinese trial the p-level that the ratio would be 1.2 or higher was greater than 5%. In the US trial it was less than 5%. The result is that many people said the Chinese trial showed that Remdesivir does not work (because 1.2 >1), that Remdesivir failed, etc.
It was considered something between a technical detail and special pleading to discuss the power of the trial. This is nonsense. When one discusses size, one should also discuss power. The idea that the main result is the p-level and the power against various alternatives is a technical detail is just wrong. For any data set, it is always possible to present a valid test which does not reject a null (the first step in a valid test can include throwing away 99% of the data). The only protection against this is conventions about what is done, which don’t amount to much and aren’t based on mathematical statistics.
This is not the first time I have lost it over misunderstanding of the Neyman-Pearson framework. There is also the Oregon Medicaid lottery result.
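The point about throwing away data can be sketched numerically. The numbers are hypothetical and have nothing to do with either trial: a one-sided z-test of "mean = 0" on iid normal data with variance one, a true mean of 0.05, and a sample of 10,000 of which 99% is discarded. Both tests are valid in the sense of having size 5%. They differ only in power.

```python
from math import sqrt
from statistics import NormalDist


def power(true_mean, n, size=0.05):
    """Rejection probability of a one-sided z-test of 'mean = 0' with n observations."""
    z_crit = NormalDist().inv_cdf(1.0 - size)
    return 1.0 - NormalDist().cdf(z_crit - sqrt(n) * true_mean)


full_n = 10_000          # hypothetical full sample
kept_n = full_n // 100   # what is left after discarding 99% of it
true_mean = 0.05         # hypothetical true effect

for label, n in (("all the data", full_n), ("1% of the data", kept_n)):
    print(f"{label:>15}: size {power(0.0, n):.3f}, "
          f"power against mean {true_mean} is {power(true_mean, n):.3f}")
```

A report that only says "not rejected at 5%" is equally consistent with either test, which is why the power has to be part of the discussion.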
OK now what if we don’t reject the null. Does that, at least, provide some support for the null ? No.
If the null is that something is zero and the alternative is that it is positive, does failure to reject the null at least show it is small? No.
Does it at least suggest that it is smaller than the researchers thought when they designed the test? No.
Should we at least have some qualitative sense that the data are good for the null and bad for the alternative? No.
Can we conclude anything at all? No.
Does it mean that the empirical research was wasted effort ? No.
The last is because the data are not just a p-level. One can look at the point estimate. If it is greater than the mean of the subjective prior, then the data should cause us to have a higher posterior mean than prior mean.
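A sketch of that updating, under an assumption the argument above does not require but which makes the arithmetic easy: a normal prior on the parameter and a normally distributed point estimate with known standard error. All the numbers are hypothetical. The estimate below is not significantly different from zero at size 5%, yet because it is above the prior mean, the posterior mean moves up.

```python
def posterior_mean(prior_mean, prior_sd, estimate, std_error):
    """Posterior mean for a normal prior and a normal estimate (precision weighting)."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / std_error ** 2
    weighted = prior_mean * prior_precision + estimate * data_precision
    return weighted / (prior_precision + data_precision)


prior_mean, prior_sd = 0.10, 0.20   # hypothetical subjective prior
estimate, std_error = 0.18, 0.15    # hypothetical point estimate and standard error

z = estimate / std_error  # test statistic for the null 'parameter = 0'
print(f"z = {z:.2f}, rejected at size 5% (one-sided)? {z > 1.645}")
print(f"prior mean {prior_mean:.3f} -> posterior mean "
      f"{posterior_mean(prior_mean, prior_sd, estimate, std_error):.3f}")
```

Whether a shift of that magnitude matters is, again, not something the arithmetic can tell you.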
But what the hell am I talking about ? There is no way to understand without knowing what the parameter means, what it means if it is high or low, what level is high and all that.
This is the way things should be. You should not be able to say what a study shows without some idea of what is being studied.
The idea that p-value > 0.05 at least means that the effect is small (or smaller than expected or smaller than hoped) must be a mistake. If you have no idea what is being studied, you should not be able to draw any conclusions from the study.
The appeal of p-values is that they make it possible to say something which sounds sophisticated even if one has no clue as to what the data mean at all.
Enabling people to talk about phenomena with no clue at all about what is being discussed is not a good thing.