Data Integrity Requires Personal Integrity and Vetting
Anyone who has ever worked with data knows that making certain the information is “clean” is much more important than what you do with it. Brilliant analysis of inaccurate data may make heroes (Chamley, Prescott, etc.), but it doesn’t make sensible policy. Witness the release of “public” data from the state of Texas.
Jeremy once commented that the transition from a public to a private university was the choice between everyone knowing your earnings (public data) and everyone knowing your student reviews.
What happens when (1) everyone knows your salary but (2) what they know isn’t true:
“My salary on the spreadsheet is $30,000 higher than my annual contract salary,” a lecturer in the University of Texas system wrote in an e-mail message. He did not want to be named because of his status as a non-tenure-track employee. “Other details are correct, but the number 99 percent of people will want to see is not. The spreadsheet says that it’s me, but it is not me.”
So why would this happen? Because public universities in Texas aren’t allowed to treat their data as if they are private companies:
System officials said the decision to release the data even though it was in draft form and included many gaps resulted from receiving multiple requests for it from news-media outlets under the state’s open-records laws. After noting the data’s shortcomings, in a disclaimer cautioning that the information was “incomplete and has not been fully verified or cross-referenced,” the system released it “in the spirit of openness and transparency” because much of the data was already public anyway, said Anthony P. de Bruyn, director of public affairs.
Except, of course, that the data wasn’t already public. Public data is vetted; this had not been. So why was it released?
Apparently, because no one asked what the legal consequences would be if they made certain it was accurate first:
Thomas Kelley, a spokesman for the Texas attorney general’s office, said in an e-mail that the state’s Public Information Act “applies to records available on the date of request,” even if the university system thought the records being requested were incomplete. Mr. Kelley said system officials had two options: release the data available at the time or request a ruling from the attorney general’s office about whether the incomplete data must be released.
And what was wrong? Well, almost anything:
Mr. de Bruyn said the data, which spans the system’s nine academic campuses, was collected mostly at the institutional level. Yet professors have noticed some mistakes in the data that seem to point to a more-distant process.
For instance, Renee Rubin, an associate professor in the department of language, literacy, and intercultural studies at the University of Texas at Brownsville, said she and her department chair were listed in the wrong department. “So then you begin to wonder what else is wrong,” Ms. Rubin says. [emphasis mine]
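The errors quoted above — salaries off by $30,000, people listed in the wrong department — are exactly what routine vetting catches before release. As a minimal sketch (the field names, thresholds, and the contract-lookup source are all hypothetical, not drawn from the actual Texas release):

```python
# Hypothetical sketch of pre-publication vetting for a salary spreadsheet:
# plausibility range checks plus cross-referencing against a second system
# of record. All field names and limits here are illustrative assumptions.

def vet_salary_record(record, contract_lookup, salary_range=(20_000, 500_000)):
    """Return a list of problems found in one spreadsheet row."""
    problems = []
    low, high = salary_range

    # Plausibility check: is the published figure even a sensible number?
    salary = record.get("salary")
    if not isinstance(salary, (int, float)) or not (low <= salary <= high):
        problems.append("salary outside plausible range")
        return problems

    # Cross-reference against the contract system of record -- the kind of
    # check that would have caught the lecturer's $30,000 discrepancy.
    contract = contract_lookup.get(record.get("employee_id"))
    if contract is None:
        problems.append("no matching contract record")
    elif contract["salary"] != salary:
        problems.append("salary disagrees with contract")

    # Completeness check: catches rows filed under the wrong/missing unit.
    if not record.get("department"):
        problems.append("missing department")

    return problems

contracts = {"E1001": {"salary": 60_000}}
row = {"employee_id": "E1001", "salary": 90_000, "department": "English"}
print(vet_salary_record(row, contracts))  # flags the contract mismatch
```

None of this is sophisticated; the point is that it is cheap, mechanical, and was evidently not done before the spreadsheet went out the door.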
Is it more worrisome if Mr. de Bruyn is correct, or if he is not?
I’m thinking about this more intensely now, in part because of the pending end of publication of the invaluable Statistical Abstract of the United States. As Kieran Healy noted, “When it comes to the United States, the print and online versions of the SA are a peerless source of information for all your bullshit remediation needs.”
Unlike the Texas imbroglio, the Statistical Abstract has a 133-year publication history and a well-established reputation for accuracy. Destroying that reputation would have taken only one major incident; will anyone ever trust public data released in Texas again?
But the Obama Administration’s “transparency initiatives” appear to be failing here as well, as an unforced error (paging Brad DeLong). And the consequence will be something that more resembles the Texas debacle than accurate, independent policy analysis.
Unless the Administration considers providing piecemeal, unstandardized information to be a feature, it appears to be a substantive bug.
I have done a lot of research work using public record laws, and one of the quirky spots is whether or not draft materials are subject to discovery. I think it varies by state and sometimes within a state by interpretation.
I’ve found it increasingly hard to get data that was routinely available online from the Federal government. Of course, that data frequently said that unemployment was worse and inflation higher than widely reported by the (mis)Administration.
A big step in this degradation of honest statistics came when Ronald Reagan or his handlers changed the definition of unemployment in 1986. All subsequent data is not comparable to past data, and it greatly (a) understates the problem, and (b) understates the effect of increased employment, since it just brings discouraged workers back into the marketplace.
Some of the past data seems to have become more difficult to find. When Krugman reports on data and ends it in the ’50s — avoiding data on unemployment from the ’40s that voids his conclusion — he’s been hoodwinked by faulty data sets from the Feds. (Ryan’s budget did not seek the lowest unemployment ever, merely the lowest level since WWII.)
When “data” becomes politicized crap, it is misleading rather than merely useless. Its “use” is to bamboozle the public. Obama has reached yet another new low in his sorry Presidency. Corporate whore!
Yeah, I’m getting a bit worried. At work we use a lot of data from one gov’t source which is cutting back in a big way. None of the data affected is stuff that I can’t avoid using at least for a while. I estimate it will take no more than two years to degrade the quality of some of the models I built.
Incidentally… as I recall it wasn’t that long ago (a few years) that this started happening at Statistics Canada which is kind of a combination of Census/Bureau of Economic Analysis/etc. for our friends up North. These days, it seems Statistics Canada produces less data, and charges you for everything to boot.
It’s awfully hard to hold the line against the narrative spread so widely by those employed to tell the just-so story at places like Heritage and Hoover and Mercatus, but if the data goes away, we won’t even be able to counter with the facts.
“Anyone who has ever worked with data knows that making certain the information is “clean” is much more important than what you do with it.”
You know that expression – “this isn’t even wrong”? Well, there you go. “Anyone who has ever” and “much more important than” appear to be the sources of the problem, but it’s hard to tell when there is so little meaning to the statement.
Qui numerare incipit, errare incipit
He who begins to count begins to err
[Oskar Morgenstern, though I expect he was not the originator]
Simon Kuznets also said much the same re. economic statistics in particular.
Getting data and getting “clean” [much less accurate] data are two different things, and the difference places additional weight on the use of optimal theory.