Data Integrity Requires Personal Integrity and Vetting

Anyone who has ever worked with data knows that making certain the information is “clean” matters more than anything you do with it afterward. Brilliant analysis of inaccurate data may make heroes (Chamley, Prescott, etc.), but it doesn’t make sensible policy. Witness the release of “public” data from the state of Texas.

Jeremy once commented that the transition from a public to a private university was the choice between everyone knowing your earnings (public data) and everyone knowing your student reviews.

What happens when (1) everyone knows your salary but (2) what they know isn’t true:

“My salary on the spreadsheet is $30,000 higher than my annual contract salary,” a lecturer in the University of Texas system wrote in an e-mail message. He did not want to be named because of his status as a non-tenure-track employee. “Other details are correct, but the number 99 percent of people will want to see is not. The spreadsheet says that it’s me, but it is not me.”

So why would this happen? Because public universities in Texas aren’t allowed to treat their data as if they were private companies:

System officials said the decision to release the data even though it was in draft form and included many gaps resulted from receiving multiple requests for it from news-media outlets under the state’s open-records laws. After noting the data’s shortcomings, in a disclaimer cautioning that the information was “incomplete and has not been fully verified or cross-referenced,” the system released it “in the spirit of openness and transparency” because much of the data was already public anyway, said Anthony P. de Bruyn, director of public affairs.

Except, of course, that the data wasn’t already public. Public data is vetted; this had not been. So why was it released?

Apparently, because no one asked what the legal consequences would be if they made certain it was accurate first:

Thomas Kelley, a spokesman for the Texas attorney general’s office, said in an e-mail that the state’s Public Information Act “applies to records available on the date of request,” even if the university system thought the records being requested were incomplete. Mr. Kelley said system officials had two options: release the data available at the time or request a ruling from the attorney general’s office about whether the incomplete data must be released.

And what was wrong? Well, almost anything:

Mr. de Bruyn said the data, which spans the system’s nine academic campuses, was collected mostly at the institutional level. Yet professors have noticed some mistakes in the data that seem to point to a more-distant process.

For instance, Renee Rubin, an associate professor in the department of language, literacy, and intercultural studies at the University of Texas at Brownsville, said she and her department chair were listed in the wrong department. “So then you begin to wonder what else is wrong,” Ms. Rubin says. [emphasis mine]

Is it more worrisome if Mr. de Bruyn is correct, or if he is not?

I’m thinking about this more intensely now in part because of the pending end of publication of the invaluable Statistical Abstract of the United States. As Kieran Healy noted, “When it comes to the United States, the print and online versions of the SA are a peerless source of information for all your bullshit remediation needs.”

Unlike the Texas imbroglio, the Statistical Abstract has a 133-year publication history and a well-established reputation for accuracy. It takes only one major incident to destroy a reputation like that; after this episode, will anyone ever trust public data released in Texas again?

But the Obama Administration’s “transparency initiatives” appear to be failing here as well, an Unforced Error (paging Brad DeLong). And the consequence will be something that resembles the Texas debacle more than accurate, independent policy analysis.

Unless the Administration considers providing piecemeal, unstandardized information to be a feature, it appears to be a substantive bug.