Nate Silver Performed Much Better than I Guessed

Andrew C. Thomas asks about Silver’s uncertainty estimates (via @nikete). My very confident guess was that Silver’s forecasts were too cautious: that his probability estimates were too close to 50% and that he underestimated the precision of his point forecasts.

It is actually easy to check my hypothesis (not that I did) by looking at the p-values of outcomes. For each state, Silver had an estimate of the expected Obama minus Romney percent of the vote and an estimate of the standard error of that forecast (I will call the actual Obama minus Romney margin state_x below, so Ohio_x was positive). Given his approach of averaging lots of numbers, he was fairly confident that the forecast errors would be normally distributed. This means that he could calculate the probability of Romney winning Ohio, that is, the probability that Ohio_x would be negative. More generally, his forecast implies a cumulative distribution function F_Ohio_NS for the Ohio outcome, and, if that forecast distribution is right, F_Ohio_NS(Ohio_x) is uniform from 0 to 1, that is, U(0,1). Thomas calculates 51 statistics of the form F_state_NS(state_x) (where Washington DC gets temporary statehood for notational convenience). I have gone through all this tiresome notation to define the Nate Silver calculated p-values of state outcomes. If he is right not just about the mean but also about the spread around that mean, these 51 statistics should be distributed U(0,1).
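To make the construction concrete, here is a minimal sketch (in Python, my own illustration, not Silver’s or Thomas’s code) of the p-value calculation just described, assuming normal forecast distributions. The state names, forecast margins, standard errors, and actual margins are made-up numbers, not 538’s published figures.

```python
# Hypothetical forecasts: expected Obama-minus-Romney margin, its standard
# error, and the actual margin, all in percentage points (made-up numbers).
from scipy.stats import norm

forecasts = {
    # state: (forecast_margin, forecast_se, actual_margin)
    "Ohio":     (3.0, 2.5, 3.0),
    "Florida":  (0.2, 2.5, 0.9),
    "Virginia": (1.5, 2.5, 3.9),
}

# Probability of a Romney win in Ohio = P(Ohio_x < 0) under the normal
# forecast distribution.
mean, se, actual = forecasts["Ohio"]
p_romney_ohio = norm.cdf(0.0, loc=mean, scale=se)

# The Silver-calculated "p-values" of the outcomes: F_state_NS(state_x).
# If the forecast distributions are right, these should look like draws
# from U(0,1).
p_values = {
    state: norm.cdf(actual, loc=mean, scale=se)
    for state, (mean, se, actual) in forecasts.items()
}
print(p_romney_ohio, p_values)
```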

Are they uniformly distributed? Damned if the distribution doesn’t look as much like a uniform distribution as any uniform distribution I’ve ever seen.

“the 538 distribution is nearly uniform. The closer the points are to the diagonal, the better the fit to the uniform:”
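A sketch of the sort of check behind that graph (again my own illustration, reusing the hypothetical p_values dictionary from the sketch above, not Thomas’s code): sort the p-values and plot them against U(0,1) quantiles, with a Kolmogorov–Smirnov test thrown in as a formal, if low-powered, check.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import kstest

pv = np.sort(np.fromiter(p_values.values(), dtype=float))
n = len(pv)
uniform_quantiles = (np.arange(1, n + 1) - 0.5) / n  # expected positions under U(0,1)

plt.plot(uniform_quantiles, pv, "o")   # points near the diagonal = close to uniform
plt.plot([0, 1], [0, 1], "--")         # the diagonal itself
plt.xlabel("U(0,1) quantiles")
plt.ylabel("observed p-values")
plt.show()

# Formal test of uniformity (weak with only 51 observations).
print(kstest(pv, "uniform"))
```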

This is much much much more impressive than just calculating that the probability of an Obama win was over 50% for each state Obama won and under 50% for each state he lost. 

Wow. 

Now this isn’t the whole story. The overall probability of an Obama victory does not just depend on the 51 univariate distributions. The key question is how highly correlated across states the deviations (outcome minus expected outcome) are. Here there are three components:
1) sampling error (which should be independent across polls and therefore over time and across states) 
2) true change in population voting intentions from the time of the poll to the election. This is complicated, but all changes should be positively correlated across states. Silver had a monster estimate/guess of the correlation based on demographics somehow. I insist on reviewing subcases:

a) true change in opinions about who should be president (very hard to understand but not a big deal on Nov 5)
b) undecided voters deciding (again tricky, but there weren’t that many of them, and there is a clear pattern of deciding pretty close to 50-50 in Presidentials and definitely more for the challenger in Congressional races)
c) moderately likely voters deciding whether to vote or not, plus the effects of get out the vote efforts. This is horribly complicated but can be considered part of the real problem, number 3 below.

3) pollster bias. Notably, before the election Silver said that Romney’s simulated wins were almost all in simulations for which the computer guessed a large pro-Obama average bias:
a) from sampling (as in robo-pollers don’t sample cell-phone-only households) and weighting.
b) from a likely voter filter which doesn’t correctly find likely voters and the implicit rounding of probabilities of voting to 0 or 1.

Silver can’t know very much about this. In particular, he has almost no information on the average pollster bias. He only gets information on the bias of a particular pollster when there is an actual election, and midterms aren’t at all like Presidentials, so really there is one observation every four years. There were also new pollsters with no performance record at all. It is easy to measure the difference in bias between two pollsters (that is, the difference in their house effects). It is very hard to estimate the average bias, or the joint probability density that two pollsters are, say, both biased 1% for Romney or something. I think this must be a guess, and that Thomas’s beautiful graph doesn’t show that Silver guessed right.
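To see why that unknowable average bias is the crux, here is a toy Monte Carlo, entirely my own and with made-up numbers, in which every state’s polling error shares a single common bias draw. The overall win probability is much more sensitive to the assumed spread of that common bias than to the independent state-level noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical forecast margins (Obama minus Romney, percentage points) and
# electoral votes for a handful of swing states; every other state is
# treated as already decided. The 237 safe EV for Obama is an assumption.
margins = np.array([3.0, 0.2, 1.5, 5.0])
ev      = np.array([18, 29, 13, 6])
ev_safe_obama, ev_needed = 237, 270

def win_prob(bias_sd, state_sd=2.0, n_sims=100_000):
    """P(Obama reaches 270 EV) when all states share one pollster-bias draw."""
    bias  = rng.normal(0.0, bias_sd, size=(n_sims, 1))              # common to all states
    noise = rng.normal(0.0, state_sd, size=(n_sims, len(margins)))  # independent by state
    simulated = margins + bias + noise
    ev_won = ev_safe_obama + (simulated > 0).astype(int) @ ev
    return (ev_won >= ev_needed).mean()

print(win_prob(bias_sd=0.5))  # tight prior on the common bias
print(win_prob(bias_sd=3.0))  # loose prior on the common bias
```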

But it is a beautiful graph. He did the same thing for Votamatic. There he saw that too many p-values were near 0 and near 1, that is, too many outcomes were judged extreme by Drew Linzer: “the Votamatic predictions turned out to be too overly precise.” That is, Linzer had excessive confidence in his point estimates and underestimated the remaining uncertainty. Here I stress that the graph for Votamatic is pretty damn good looking. Performance not as good as Silver’s is perfectly consistent with amazingly good performance.