Impact Stories
February 13, 2020 – Technology & Innovation

Social Science Meets Social Media

How independent researchers constructed one of the largest datasets in the social sciences for the academic community


Academics can now analyze 38 million Facebook URLs in more than 25,000 categories to study the effect of social media on civic discourse and open exchange of ideas—thanks to the work of Social Science One, a partnership between academia and private industry supported by private philanthropy.

Harvard political scientist Gary King and Stanford Law professor Nate Persily, who head up the project, talked to us about what they hope to learn about the impact of social media on democracy, the broader uses of this dataset, and how their work is an unprecedented opportunity to understand the role of emerging technologies in people’s lives.

How did you become interested in the bigger picture question of social media’s impact on democracy?

King: We were interested in the broader issue of making data available to academics. We have more data than we ever did before in the social sciences, which has created spectacular discoveries and tremendous progress. Most of that data, however, is tied up inside private companies, governments, and other organizations. We wanted to liberate that data for academic analysis to create public good. One of the topics we wanted to analyze was the effect of social media on elections and democracy.

Persily: The chance to investigate whether the conventional wisdom is right about the effect of social media on democracy around the world was a valuable and unique opportunity. There is an enormous amount of commentary out there based on anecdotes about the destructive effect of the internet, in general, and social media, in particular, on democracy, whether because of polarization, disinformation, hate speech, political advertising, foreign election interference, or any number of other phenomena. Most of the research on these phenomena has been based on Twitter, because its data are more broadly available. Conclusions derived from behavior on that platform, however, are necessarily drawn from a small and (probably) biased sample of social media users. Facebook use is more widespread and pervasive, and analysis of Facebook data can help researchers describe exactly what is going on in the information ecosystem.

What is your ultimate goal with the project?

Persily: We entered this project to provide a bridge between the academic community and Facebook. We hoped that this initiative could serve as a model for what could be possible for all kinds of companies that have data of relevance to social scientists and scientists in general. We started with one of the most controversial topics, the impact of social media on democracy, because that was viewed as most urgent from a societal perspective. The hope is that we could expand to other firms and other questions.

It’s been almost two years since the project was announced. How has the state of the field advanced since then?

Persily: What we’ve seen is a retreat from broader academic access in some settings: the closing down of APIs at Facebook, greater restrictions on academic access at Twitter, and almost nonexistent academic access to Google/YouTube outside of the normal APIs dealing with search and other items. We’re at an inflection point as to whether academics will be allowed even the meager access they have relied upon up to now. Some of this is the result of scandals, most prominently the Cambridge Analytica scandal. Company general counsels have reacted by trying to restrict access across the board. Some of it is because of legislation like GDPR in Europe and the FTC Consent Decree in the US. So the importance of our project has grown since we began our work. There’s a lot riding on our success to prove that we can provide a trusted, privacy-protected method for academic access to company data.

What’s new about the URL full data set? What can researchers study now that they wouldn’t have had access to before?

King: This data set has aspects that researchers have never had access to before. First of all, it has 38 million URLs. It includes all URLs that were clicked on and shared publicly at least 100 times worldwide. For those URLs, we have two types of information. One is the characteristics of the URL, like whether the URL is news, whether it was fact-checked, or whether it was declared by users to be fake news or hate speech. The other is data on what people were exposed to, what types of information they saw among these URLs, and what kinds of actions they took.

For instance, did they view something, click on it, read it, or share it? Did they share it without reading it themselves? This information can be really valuable in learning about the effects of different types of news, how fast content spreads, and all kinds of other topics we haven’t yet even imagined.

Given the range of data points you just laid out, what are the types of questions scholars are posing in anticipation of this data set being available?

Persily: Because the fact-checking determinations are in this data set, one can make some rough analyses of the virality, popularity, and exposure of people to different types of links in different countries. If you have a set of fake news URLs, or ones that have been marked as fake news, you can see the traffic to those sites through Facebook, and the engagement with them, in ways that you never could before. You have exposure data here, which is something that has never been produced to our knowledge. You can also do some interesting comparative work between the countries that are in the data set to get a sense of what is going viral and what is more prevalent in different places. You can do similar analyses with respect to hate speech labeled by users, and begin to answer questions about the structure of engagement and exposure to URLs on Facebook.

King: The data include URLs from 38 countries. And Facebook has committed that, if a researcher has legitimate research questions that touch other countries, they will add other countries to the data. And that’s pretty remarkable because the data set is already huge. It has 639 billion types of categories into which people and URLs were put.

Privacy has been a major factor of this project given the breadth and scale of the data set that researchers will have access to. What protections exist to protect information about individuals?

King: In the past, researchers crossed out the names and the personal identifiers to protect people’s privacy, but that’s not always sufficient. If you are very clever, you can sometimes re-identify people in data that had supposedly been de-identified. Now we can add privacy-protective procedures to the data set itself or to the results of the analysis, such as special types of random noise. Together these procedures are called “differential privacy”.

That means if you have a dataset (think of it as a gigantic Excel spreadsheet), you have the real data with random numbers added to each cell value. The random numbers are sometimes too big and sometimes too small but on average are about zero. If you were looking up some person in the data set, you would never really know who was who because none of the numbers are necessarily real.

But if you were looking for patterns, like in a simple case the average age, you can average those. And the noise would cancel itself out and your calculation would produce approximately the right average age. That’s the idea of differential privacy. You can learn about patterns, which we’re interested in as social scientists, but not about individuals. Differential privacy also makes sure that when the researchers analyze data, they don’t get biased results about the social science insights we seek.
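The averaging intuition King describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ages are invented, the noise is Gaussian with an arbitrarily chosen standard deviation (NOISE_SD), and nothing here is calibrated to a formal privacy guarantee or taken from the Social Science One release itself.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical "real" ages; these values are invented for illustration.
true_ages = [random.randint(18, 90) for _ in range(100_000)]

# Assumed noise scale. A real release would calibrate this to a privacy
# budget; 20 years is just a conveniently large number for the demo.
NOISE_SD = 20.0

# Add independent zero-mean Gaussian noise to every cell.
noisy_ages = [age + random.gauss(0.0, NOISE_SD) for age in true_ages]

true_mean = statistics.fmean(true_ages)
noisy_mean = statistics.fmean(noisy_ages)

# Any single noisy cell can be far from the real value...
largest_cell_error = max(abs(n - t) for n, t in zip(noisy_ages, true_ages))
# ...while the error of the average shrinks roughly like NOISE_SD / sqrt(n).
mean_error = abs(noisy_mean - true_mean)

print(f"true mean age:      {true_mean:.2f}")
print(f"noisy mean age:     {noisy_mean:.2f}")
print(f"largest cell error: {largest_cell_error:.1f} years")
print(f"error of the mean:  {mean_error:.3f} years")
```

Individual rows are unreliable, which is the point, while the average computed from the noisy column stays close to the true one.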

This technology solves political problems technologically. If there’s concern about whether researchers at legitimate institutions doing legitimate research are to be trusted, it is another way of absolutely protecting what is shared. We wanted to make sure that when researchers analyze these data, not only are individuals protected, but society is protected. That means making sure researchers don’t get biased results and are not misled by the noise into drawing the wrong conclusions. So we developed new statistical procedures to make it possible to analyze the data even in the presence of noise and other aspects of differential privacy, all without bias. We’ve done that. And we’ve made available papers and code that researchers can use.
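One simple way to see why noise can mislead, and why a correction is possible, is the variance of a noisy column. The sketch below is not the procedure King's team developed. It uses a textbook fact (independent noise adds its own variance to the data's variance) and assumes, hypothetically, that the noise standard deviation NOISE_SD is published alongside the data. All values are simulated.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Assumption: the noise scale is disclosed to researchers.
NOISE_SD = 10.0

# Simulated "real" values with a true variance of about 25.
true_values = [random.gauss(50.0, 5.0) for _ in range(50_000)]

# What a researcher would actually see: real value plus zero-mean noise.
noisy_values = [v + random.gauss(0.0, NOISE_SD) for v in true_values]

true_var = statistics.pvariance(true_values)

# Naive estimate: treats the noisy column as if it were the real data,
# so it is inflated by roughly NOISE_SD ** 2.
naive_var = statistics.pvariance(noisy_values)

# Corrected estimate: subtract the known noise variance.
corrected_var = naive_var - NOISE_SD ** 2

print(f"true variance:      {true_var:.2f}")
print(f"naive estimate:     {naive_var:.2f}")
print(f"corrected estimate: {corrected_var:.2f}")
```

The naive estimate overstates how spread out the data are, while subtracting the known noise variance recovers a value close to the truth.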

Persily: Yes, differential privacy is a technical solution to a political problem. When we started this project I thought I’d be using the political science side of my brain, whereas it’s been more the law professor side. That’s because we’ve been trying to find a way through the various regulations around the world, particularly the GDPR, the General Data Protection Regulation of the European Union, to provide researchers with privacy-protected data access.

What’s the relationship between privacy laws like GDPR and related actions on academic research more broadly?

Persily: To some extent, we’re living in the worst of all worlds right now. We have a legal regulation whose meaning in the context of academic research no one really knows. The European Data Protection Supervisor did issue some guidance lately, but for lawyers at the companies staring down potential billion-dollar fines, the guidance is insufficient.

The key for researchers is this: If you tell us the rule, we’re going to find a way to do research within that rule. But if you just have a vague prescription, the lawyers inside these companies naturally take the most risk-averse approach, even if it’s one that was not intended by the law.

This is happening not just in the context of social media research on democracy; you’re also seeing it with respect to medical and genetics research in the US and Europe. They’re worried about running into legal liability given the restrictions of GDPR.

But if we can prove the model under these conditions, then maybe we can get the right people in the room to provide clarity about what kind of research can be done going forward.

How do you envision this project fitting into our larger understanding of the impact of social media?

Persily: I think the cross-platform research that this dataset facilitates is going to prove quite critical to understanding how information flows on social media. Survey research is another area we want to explore. If we had a series of surveys joined with Facebook feeds from users who consent to turn them over, we could get a lot of information that would be useful for answering basic questions of political science.

For scholars who are intrigued by the dataset and want to get connected with Social Science One, what are the next steps?

King: We’ll post an RFP process on our website starting on Thursday, February 13. It’s relatively easy to apply, and it won’t take long to get researchers approved and get them access.

Where are scholars who are most interested in the project based?

Persily: It was quite important to us that if this was going to be a research program, it would be international in scope. We’re seeing great interest from around the world.