December 30, 2021 – Free Speech & Peace

The Unmet Potential of Open Data & Benefits of Democratizing Big Data

Open Data holds great promise — and more than thought leaders appreciate. 

Open access to data can lead to a much richer and more diverse range of research and development, hastening innovation. That’s why scientific journals are asking authors to make their data available, why governments are making publicly held records open by default, and why even private companies provide subsets of their data for general research use. Facebook, for example, launched an effort to provide research data that could be used to study the impact of social networks on election outcomes. 

Yet none of these moves have significantly changed the landscape. Because of lingering skepticism and some legitimate anxieties, we have not yet democratized access to Big Data.

There are a few well-trodden explanations for this failure — or this tragedy of the anti-commons — but none should dissuade us from pushing forward.

The most obvious impediment is data collectors’ proprietary interests. If data held by a private company is useful for the development of competing products or services, or if it could be used to criticize a company, businesses will be reluctant to share. Thus, in practice, data-sharing is typically limited to partners who are contractually obligated to limit use to purposes that don’t interfere with the company’s economic interests. 

In theory, we could pierce this dynamic through public policy that strips companies of any legal claim to proprietary interests over their data, or that would vest rights of access to others. In some limited sense, proposals to create a right of “data portability” do just this, allowing customers to compel a company to provide a usable version of the customer’s data so it can be accessed by competitors. Additionally, a more expansive version of data portability could require firms to make a deidentified version of data available to any third party for nearly any purpose (with exceptions for security and privacy). This approach could dovetail with U.S. competition policy, which is currently undergoing transformation as federal regulators debate what antitrust and consumer welfare look like in a digital economy.

But an open data mandate would require great care in its design to ensure it doesn’t undermine companies’ incentives to invest in innovation. For this reason, my colleague Andrew Woods and I are exploring the value of a compulsory licensing system that would allow researchers and potential competitors to access data after compensating the initial data collector (and possibly the data subject as well). Others, including Stanford’s Institute for Human-Centered Artificial Intelligence, have floated proposals to encourage individuals to join data cooperatives to enable more beneficial uses of the data that we passively produce across different platforms.

We also must tackle another problem: privacy. When it comes to personal data, open data initiatives run directly contrary to privacy law goals of user consent and data minimization. 

Historically, this tension has been managed through anonymization, but this practice has been broadsided by fears that deidentified data can easily be linked back to individuals using auxiliary information. The most influential voices in the debates insist research data can only be considered safe if it complies with a highly technical guarantee against attack by a hypothetical, nearly omniscient attacker. This standard, called Differential Privacy, has already wreaked havoc at the Census Bureau, which attempted to implement the standard for its decennial census. And Facebook’s attempt to use Differential Privacy for its open source research data repository resulted in a long delay and large reduction in data usability.
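
For readers who want to see why this standard can reduce data usability, here is a minimal sketch of the textbook Laplace mechanism, one common way to satisfy Differential Privacy for a simple counting query. It is an illustration only, not the method used by the Census Bureau or Facebook. The function names, the count of 120, and the epsilon values are hypothetical, and the sketch assumes each person changes the count by at most one.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centered at 0."""
    # The difference of two independent exponential draws, each with
    # mean `scale`, follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale 1/epsilon.

    Assumption: adding or removing one person changes the count by at
    most 1 (sensitivity 1). Smaller epsilon means stronger privacy and
    noisier output.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(true_count, epsilon, trials=10000, seed=0):
    """Average distance between the released count and the true count."""
    rng = random.Random(seed)
    total = sum(
        abs(noisy_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    )
    return total / trials


if __name__ == "__main__":
    # Hypothetical count of 120 people in a small census block.
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: mean error {mean_abs_error(120, eps):.2f}")
```

The average error in this sketch is roughly 1/epsilon, so a strict privacy setting can swamp small counts, which is the kind of usability loss described above.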

Nevertheless, because public guidance related to data anonymization continues to be very risk-averse, the public data commons is losing ground. Entities holding personal data are reluctant to become the poster children for perceived poor data stewardship out of fear they will receive the sort of public criticism that Netflix did when it attempted to make a database openly available for research. This is one reason that existing open data infrastructure, such as Figshare, houses primarily non-human subjects datasets.

The path forward will require a more practical approach to data anonymization: one that offers a good deal of protection from real-world risk without guaranteeing the elimination of all possible risk. My preferred approach to measuring reidentification risk offers one option. To further reduce privacy risks, those wanting access to open data can be required to pre-register, and can be restricted from de-anonymizing the datasets. These safeguards can be further buttressed by a law that prohibits the malicious reidentification of deidentified datasets, as the Uniform Law Commission’s draft privacy statute does. All of these requirements would facilitate “openish data”: data that’s open to most people for most uses.

It may not meet the Creative Commons standards for Open Data, but we should not let the perfectly Open be the enemy of the Good.

Finally, creating the infrastructure required to clean data, link it to other data sources, and make it useful for the most valuable research questions will not happen without a significant investment from somebody, be it the government or a private foundation. As Stefaan Verhulst, Andrew Zahuranec, and Andrew Young have explained, creating a useful data commons requires much more infrastructure and cultural buy-in than one might think. 

From my perspective, however, the greatest impediment to the open data movement has been a lack of vision within the intelligentsia. Outside a few domains like public health, intellectuals continue to traffic in and thrive on anecdotes and narratives. They have not perceived or fully embraced how access to broad and highly diverse data could radically change newsgathering (we could observe purchasing or social media data in real time), market competition (imagine designing a new robot using data collected from Uber’s autonomous cars), and responsive government (we could directly test claims of cause and effect related to highly salient issues during election time). 

With a quiet accumulation of use cases and increasing competence in handling and digesting data, we will eventually reach a tipping point where the appetite for more useful research data will outweigh the concerns and inertia that have bogged down progress in the open data movement.

Jane Bambauer is a professor of law at the University of Arizona. Her research assesses the social costs and benefits of Big Data, and questions the wisdom of many well-intentioned privacy laws.

This viewpoint is part of an ongoing series, Driving Discovery. In this series we amplify the voices of a diverse group of scholars, nonprofit leaders, and advocates who offer unique perspectives on how openness drives human progress.