I dare to disagree with Nate Silver

Robert Waldmann

Nate Silver, like essentially all election handicappers, ignores internal polls, that is, polls financed by one of the candidates or by the party of one of the candidates. Unusually, he explained why, noting that even if the polls are unbiased there is extreme publication bias, as campaigns release the polls if and only if they like the results. I think that, if this is the explanation of the known bias in published internal polls, then internal polls should be included in averages of polls.

update: This post is basically a critique of a post by Nate Silver to which I didn’t link. The assumptions I use below are based on things he conceded for the sake of argument. He presents a simple formula which I critique. All quotes from the post are updates.

I’m not sure why people take polls released by campaigns at face value. This does not mean that campaigns don’t have very good pollsters working for them. But the subset of polls which they release to the general public is another matter, and are almost always designed to drive media narrative.

So the polls are OK when conducted, but the choice to release the polls is strategic and therefore the polls should not be taken “at face value.”

What we’ve found is that polls commissioned by campaigns and released to the public show, on average, a result that is about 6 points more favorable to their candidate’s standing than nonpartisan polls released at the same time. (Other analysts have found similar results.) So, just as a first cut, you might take a Democratic internal poll that shows a tied race and “translate” it into nonpartisan terms by adding 6 points to the Republican’s margin.

I note the weasel phrase “as a first cut” and argue below that a better first cut is to just include internal polls unadjusted if, as Silver semi asserts in the first quoted passage, the problem is just publication bias. The rough 6% correction is definitely not the best first cut. I say 0% is best but 6% is definitely too much because of a universally accepted argument discussed below.

end update

All campaigns appear to believe that reporting a poll which would be good for the candidate if accurate is definitely good for the candidate (they believe in a bandwagon effect). So the publication bias hypothesis makes sense. I am going to make absolutely key unsupported assumptions in order to write down a model. Then I will pretend the model is reality.

There are two candidates A and B.
Assume that there is an unbiased, agreed-upon average or consensus based on public polls: A ahead by dpublic. Assume each campaign conducts one unbiased poll and that these polls are of equal quality (same sample size, etc.). Assume A releases A’s poll if it shows A ahead by more than dpublic and B releases B’s poll if it shows A ahead by less than dpublic.

Claim 1: Then the average including internal polls is an unbiased estimate of the expected value of the actual vote.

Proof: everything is symmetric around A ahead by dpublic, so A’s expected lead including the internal polls is dpublic, which is assumed to be unbiased.

Claim 2: the variance of the average including the internal polls will be lower.

Non proof: This is obvious: it is an average with an equal or larger denominator.
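Both claims can be checked numerically. What follows is a minimal Monte Carlo sketch, not part of the original argument. It assumes Gaussian sampling error, a public consensus that is the plain mean of a handful of public polls, and internal polls that get the same weight as any one public poll. The names TRUE_LEAD, POLL_SD, N_PUBLIC and simulate_once, and every numeric value, are hypothetical choices made for illustration.

```python
import random
import statistics

TRUE_LEAD = 2.0   # A's true lead in points (illustrative value)
POLL_SD = 3.0     # sampling error of every poll, public or internal (assumed)
N_PUBLIC = 5      # number of public polls behind the consensus (assumed)


def simulate_once(rng):
    """Return (public average, average including released internal polls)."""
    public = [rng.gauss(TRUE_LEAD, POLL_SD) for _ in range(N_PUBLIC)]
    d_public = statistics.fmean(public)
    poll_a = rng.gauss(TRUE_LEAD, POLL_SD)  # A's unbiased internal poll
    poll_b = rng.gauss(TRUE_LEAD, POLL_SD)  # B's unbiased internal poll
    released = []
    if poll_a > d_public:   # A publishes only results above the consensus
        released.append(poll_a)
    if poll_b < d_public:   # B publishes only results below the consensus
        released.append(poll_b)
    return d_public, statistics.fmean(public + released)


rng = random.Random(0)
draws = [simulate_once(rng) for _ in range(100_000)]
public_avgs = [d for d, _ in draws]
combined_avgs = [c for _, c in draws]

mean_error = statistics.fmean(combined_avgs) - TRUE_LEAD
var_public = statistics.pvariance(public_avgs)
var_combined = statistics.pvariance(combined_avgs)

print(f"mean error of combined average: {mean_error:+.3f}")
print(f"variance, public polls only:    {var_public:.3f}")
print(f"variance, including internals:  {var_combined:.3f}")
```

Under these assumptions the first printed number should sit near zero (Claim 1) and the third should come out below the second (Claim 2). Changing N_PUBLIC or POLL_SD shows how much the result depends on how noisy the public consensus is.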

So what are the key assumptions? First, I don’t really need to assume one internal poll each, but I do need to assume equal numbers of internal polls run by both campaigns. Second, I need more for symmetry: I need that the internal polls are all unbiased and that they have equal variance. Also, I need that the rule for publishing them is the same for both campaigns.

Silver seems inclined to assume, for the sake of argument, that internal polls are generally reputable polls with bias due to publication bias. It seems to me that the key issue is the assumption that each side actually performs the same number (really approximately the same number).

This is clearly not true if a well financed campaign competes with a poorly financed campaign. This means that results including internal polls will depend partly on current public opinion and partly on which campaign has more money. The same is true of election outcomes. I’d consider that a feature not a bug.

My reasoning is used in informal discussions. The fact that a campaign hasn’t released an internal is interpreted as bad news for that campaign. It is assumed that they ran polls and didn’t like the results. Consider what Steve Singiser just typed.

Meanwhile, Louisiana looks competitive for the first time in months, at least according to a Dem poll. As of this afternoon, we haven’t seen the retaliatory GOP internal poll, suggesting that the Dem poll might at least be in the wheelhouse.

No one seems to doubt that this argument is valid, yet it is excluded from Silver’s model and from the more primitive polling averages presented by others.

update 2: Silver’s proposed 6% correction ignores this factor. Now ignoring all internal polls will not introduce bias, but including internal polls with the 6% correction will introduce bias in my example in the case in which only one internal poll is reported.

end update 2.

I think my alleged proof is rigorous, but it is very brief and I will tell stories to explain.

There are four possibilities:
1. Both internals give better results for A than the consensus based on independent polls
2. Both internals give better results for B
3. The poll conducted by A gives better results for A and the poll conducted by B gives better results for B
4. The poll conducted by A gives better results for B and the poll conducted by B gives better results for A

In case 4, no internal polls are released, so the average including reported internals (of which there are none) is just the average based on independent polls.
In case 3, each internal poll is published. Each is biased (by selection), and the two are equally biased in opposite directions. The final average is unbiased and based on a larger sample (with a larger denominator), so it has lower variance than the average of independent polls.

The interesting cases are 1 and 2. I will explain case 1, since they are symmetric. In case 1, one poll is included in the average. Its distribution is biased by selection. However, the fact that the poll run by campaign B gave a lead for A greater than dpublic is not included at all in calculating the average. The two biases exactly cancel. It is sufficient but not necessary to note that the distributions of (internal poll minus average of independent polls) are symmetric around 0. It is necessary that the two internal polls have the same distribution (the bit about equal quality).

update 3: I haven’t shown you the proof of this new claim. It actually adds something to the obvious claim based on symmetry. It means that not only is there no unconditional bias, but also there is no bias conditional on being in case 1, and similarly there is no bias in case 2. This means that Silver’s proposed “first cut” is not the correct formula given Silver’s data. The fact that one campaign is not releasing internal poll results means something, and the 6% correction is based on the assumption that it means exactly nothing. I’d say a better first cut is 0%, but I am absolutely sure that a 5% correction is better than a 6% correction (I’m willing to bet a modest amount of money on it using exactly the same data Silver used to calculate the 6%).

end update 3.
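The single-release case can also be looked at by simulation. The sketch below is a hedged illustration and is not Silver’s model. It assumes Gaussian errors and a public consensus that is the mean of N_PUBLIC polls. In place of the 6-point figure it uses a correction estimated the same way inside the simulation: the average amount by which released internal polls favor their sponsor relative to the consensus. All names and numbers are hypothetical. Because the consensus is itself noisy in this setup, the unadjusted error in case 1 need not come out exactly zero; the comparison of interest is which treatment of the released poll lands closer to the truth.

```python
import random
import statistics

TRUE_LEAD = 2.0   # A's true lead in points (illustrative value)
POLL_SD = 3.0     # sampling error of every poll (assumed)
N_PUBLIC = 5      # number of public polls behind the consensus (assumed)

rng = random.Random(1)
gaps = []     # how far each released internal poll favors its sponsor
only_a = []   # (consensus, A's poll) for trials in case 1: only A releases

for _ in range(100_000):
    public = [rng.gauss(TRUE_LEAD, POLL_SD) for _ in range(N_PUBLIC)]
    d_public = statistics.fmean(public)
    poll_a = rng.gauss(TRUE_LEAD, POLL_SD)
    poll_b = rng.gauss(TRUE_LEAD, POLL_SD)
    a_releases = poll_a > d_public
    b_releases = poll_b < d_public
    if a_releases:
        gaps.append(poll_a - d_public)
    if b_releases:
        gaps.append(d_public - poll_b)
    if a_releases and not b_releases:
        only_a.append((d_public, poll_a))

# Stand-in for the "6 points": the measured average gap in this simulation.
correction = statistics.fmean(gaps)


def case1_mean_error(shift):
    """Mean error of the average in case 1 when A's poll is lowered by `shift`."""
    errors = [
        (N_PUBLIC * d + (poll - shift)) / (N_PUBLIC + 1) - TRUE_LEAD
        for d, poll in only_a
    ]
    return statistics.fmean(errors)


err_unadjusted = case1_mean_error(0.0)
err_corrected = case1_mean_error(correction)

print(f"measured publication-bias gap:       {correction:.2f}")
print(f"case 1 error, internal unadjusted:   {err_unadjusted:+.3f}")
print(f"case 1 error, internal corrected:    {err_corrected:+.3f}")
```

In this setup, subtracting the measured gap from the lone released poll pushes the average further from the truth than leaving the poll alone, which is the direction of the argument above. How close the unadjusted error is to zero depends on the assumed noise in the consensus.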

Now someone at DailyKos noted (I think it was Steve Singiser, to whom I linked above) that averages including internals performed better than averages of independent polls in past cycles. I claim that I have an explanation of that fact which is perfectly consistent with the very significant bias in publicly reported internals.

Second, my argument for including internals will cease to hold if it is adopted. If it is known that internal polls are believed, the incentive to deliberately bias them increases until the effect of including them becomes a bias in favor of the less honest campaign. No campaign is very honest, but they do differ, so, if indeed they are not cooking the internal polls now, it would be better if people ignored the internal polls at the cost of not knowing so well how elections will turn out.

Not being Kant, I will include internal polls in my averages.

update 4: I haven’t proven any of my mathematical claims after the first (at most), but I will add one more for the patient reader. Even if internal polls are biased (not just reported internal polls but also the numbers which might or might not be reported), they should be included if both campaigns have the same bias and the same symmetric distribution around that biased mean, both know the bias, and both report the poll if it is better for their candidate than they expected given the bias.

The same reasoning as in the non proof of the claims of no bias in cases 1 and 2 works just as well.

A simpler way to put the new claim is that if half of internal polls are reported, and those are the half which are better for the campaign which financed the polling, then the best approach is to include internal polls in the average just as if they were independent. This is true even if internal polls are biased (and it’s not just that published internal polls are biased due to publication bias).
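A small simulation can illustrate the biased-internals version of the claim. This is a sketch under stated assumptions, not a proof: each internal poll leans toward its sponsor by the same known amount HOUSE_BIAS, errors are Gaussian, and a campaign releases a poll only when it beats the consensus by more than that lean. All names and values are hypothetical, and the check covers only the unconditional mean of the unadjusted average.

```python
import random
import statistics

TRUE_LEAD = 2.0    # A's true lead in points (illustrative value)
POLL_SD = 3.0      # sampling error of every poll (assumed)
N_PUBLIC = 5       # number of public polls behind the consensus (assumed)
HOUSE_BIAS = 2.0   # each internal poll leans this far toward its sponsor (assumed)
TRIALS = 100_000

rng = random.Random(2)
errors = []
n_released = 0

for _ in range(TRIALS):
    public = [rng.gauss(TRUE_LEAD, POLL_SD) for _ in range(N_PUBLIC)]
    d_public = statistics.fmean(public)
    poll_a = rng.gauss(TRUE_LEAD + HOUSE_BIAS, POLL_SD)  # leans toward A
    poll_b = rng.gauss(TRUE_LEAD - HOUSE_BIAS, POLL_SD)  # leans toward B
    released = []
    # Each campaign knows the lean and releases only better-than-expected polls.
    if poll_a > d_public + HOUSE_BIAS:
        released.append(poll_a)
    if poll_b < d_public - HOUSE_BIAS:
        released.append(poll_b)
    n_released += len(released)
    # Released internal polls enter the average with no adjustment at all.
    errors.append(statistics.fmean(public + released) - TRUE_LEAD)

mean_error = statistics.fmean(errors)
release_rate = n_released / (2 * TRIALS)

print(f"share of internal polls released: {release_rate:.3f}")
print(f"mean error of unadjusted average: {mean_error:+.3f}")
```

With mirror-image leans and mirror-image release rules, about half of the internal polls are released and the two leans offset each other on average, so the mean error should sit near zero.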

If more than half of internal polls are reported on average, then the simple average UNDERestimates the expected vote share of the campaign which reports the internal poll. The fact that the other campaign didn’t report its poll contains lots and lots of information if campaigns usually report their internal polls. More generally, when many internal polls are run, but in equal numbers by each campaign, this means that my 0% correction for internal polls is too high if most internal polls are reported.
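The effect of a looser release rule can be sketched the same way. The code below is a hedged illustration with hypothetical names and values: it assumes Gaussian errors and a consensus that is the mean of N_PUBLIC public polls, and it models "more than half reported" with a slack parameter, so that a campaign also releases polls up to slack points worse than the consensus. It then looks only at trials where A releases and B stays silent.

```python
import random
import statistics

TRUE_LEAD = 2.0   # A's true lead in points (illustrative value)
POLL_SD = 3.0     # sampling error of every poll (assumed)
N_PUBLIC = 5      # number of public polls behind the consensus (assumed)


def single_release_error(slack, trials=100_000, seed=3):
    """Mean error of the unadjusted average when only campaign A releases.

    slack = 0 is the rule where half of internal polls get reported;
    slack > 0 means campaigns report more than half of them.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        public = [rng.gauss(TRUE_LEAD, POLL_SD) for _ in range(N_PUBLIC)]
        d_public = statistics.fmean(public)
        poll_a = rng.gauss(TRUE_LEAD, POLL_SD)
        poll_b = rng.gauss(TRUE_LEAD, POLL_SD)
        a_releases = poll_a > d_public - slack
        b_releases = poll_b < d_public + slack
        if a_releases and not b_releases:
            errors.append(statistics.fmean(public + [poll_a]) - TRUE_LEAD)
    return statistics.fmean(errors)


err_half = single_release_error(slack=0.0)
err_most = single_release_error(slack=3.0)

print(f"A-only error, half of polls reported: {err_half:+.3f}")
print(f"A-only error, most polls reported:    {err_most:+.3f}")
```

A negative number means the simple average understates A, the campaign that released its poll. Under these assumptions the understatement grows as slack rises, because B’s silence then implies a poll that was well above the consensus.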

This is always assuming a lot of symmetry, importantly symmetry in the number of internal polls run and symmetric bias (indeed mirror image distributions) and symmetric rules for what to publish.