The Harvard Journal of Law & Technology recently released its Fall 2011 issue, now available online. Jane Yakowitz, author of “Tragedy of the Data Commons” has written an abstract of her article for the Digest, presented below.
- The Digest Staff
JOLT Print Preview: Tragedy of the Data Commons
The data that fuels most of the quantitative health and policy research in this country is publicly available data that has undergone some sort of anonymization process. This is the data commons, and unwittingly, we are all in it. Our tax returns, medical records, and school records, among other things, seed its pastures and facilitate a wide range of empirical studies.
In theory, the data commons gives us the best of both worlds by allowing researchers to test hypotheses and produce generalizable results without exposing anybody’s personal information. But in practice, we all shoulder some risk that a bad actor might use auxiliary information to reidentify us and discover our private information. The looming policy question, raised by Paul Ohm and the Federal Trade Commission, is whether current data privacy policies in the United States strike the right balance between the risks of reidentification attacks and the utility of data-sharing. Ohm and other scholars believe the risk is too high and that we need stronger privacy laws to protect data subjects. This article comes to the exact opposite conclusion: the utility of public research data is so great, and the realistic risks so small, that the law should foster the sharing of anonymized data.
The value of the data commons is frequently overlooked. As soon as publicly available research data produces a useful study, little attention is paid to the provenance of the underlying data sources. Public data has been used to support and dispel a wide range of theories about education, welfare reform, and capital punishment. Anybody who has gotten a flu shot or waited at a traffic light has benefited, indirectly, from anonymized data. The data commons has played a particularly critical role in the exposure and redress of race and sex discrimination. The data commons is vital to what George Duncan calls “information justice.” It reveals what cannot be discerned through any one individual’s experience alone.
This abundant utility must be balanced against the risks to data subjects. The most compelling evidence that anonymized data poses great risk comes from the Netflix de-anonymization study and the reidentification of Governor Weld.
These studies demonstrate that malfeasors might be able to link the values in an anonymized research database to information reported in publicly available identified records, e.g., voter registration records or IMDb profiles. The media accepted uncritically the conclusions and purported implications of the de-anonymization attack literature despite significant limitations and flaws in the studies. Paul Ohm, too, interprets these studies to forebode the formation of a “database of ruin” composed of previously deidentified pieces of information. As a result, the public labors under a false impression that reidentification attacks on anonymized data are accurate and scalable.
The evidence of impending risk is wanting. Reidentification attacks would be costly and riddled with false-match error, which no doubt explains why a wide-scale reidentification attack has not happened. Moreover, an intruder can exploit lower-hanging fruit: data security systems to be breached and personal computers to be hacked. Even to a truly nefarious actor, the marginal value of anonymized data is trivial.
Since the social utility of research data greatly outweighs the risks, the law ought to encourage the flow of probative research data by creating a safe harbor for data producers who share research data responsibly. The proposal advanced by this Article has three aspects to its design. First, federal regulations should clarify what a data producer is expected to do in order to anonymize a dataset sufficiently and avoid the dissemination of Personally Identifiable Information (“PII”). Second, federal law should immunize the data producer from most (though not all) privacy-related liability. Third, the law should penalize any recipient of anonymized data who reidentifies a data subject in the dataset and further discloses the subject’s PII.
These proposals run against the tide of data privacy scholarship, which generally seeks to curb the spread of data and to give consumers more control over its flow. To be sure, the preservation and growth of the data commons will appeal only to those who are enthusiastic about health and social science research. Because this research is critical to sound policymaking, the article concludes that we have a civic responsibility to contribute our personal information to the data commons—the digital fields that describe none of us and all of us at the same time.