The Pitfalls of Using Online and Social Data in Big Data Analysis
New research suggests that big data analyses, particularly those built on social media data, can give a biased picture because the people who use each platform are not representative of society at large.
Striking new research out of Princeton University’s Center for Information Technology Policy and the University of North Carolina at Chapel Hill suggests that inferences based on how people use social media platforms like Twitter and Facebook should be reconsidered. The reason? These platforms represent skewed samples from which it is difficult to draw accurate conclusions.
In her draft paper, Big Data: Pitfalls, Methods and Concepts for an Emergent Field, UNC professor and Princeton CITP fellow Zeynep Tufekci (@zeynep) compares the methodological challenges of developing socially based big data insights from Twitter to those of biological testing on Drosophila, better known as fruit flies. Drosophila are usually chosen because they are relatively easy to use in lab settings, easy to breed, have rapid and “stereotypical” life cycles, and are small as adults. The problem? They’re not necessarily representative of non-lab (read: real-life) scenarios. Tufekci posits that the dominance of Twitter as the “model organism” for social media in big data analyses similarly skews analysis:
Each social media platform carries with it certain affordances which structure its social norms and interactions and may not be representative of other social media platforms, or general human social behavior …
Twitter is used by about 10% of the U.S. population, which is certainly far, far from a representative sample. While Facebook has a wider diffusion rate, its rates of use are structured by race, gender, class and other factors and are not representative. Using these sources as “big data” model organisms raises important questions of representation and visibility as demographic or social groups may have different behavior — online and offline — and may not be fully represented or even sampled via current methods.
Tufekci says that one of the biggest methodological dangers of big data analysis is “insufficient understanding of the underlying samples.” In her words,
It’s not enough to understand how many people have “liked” a Facebook status update, clicked on a link, or “retweeted” a message, without having a sense of how many people saw and chose to — or not to — take that option. That kind of normalization is rarely done, or may even be actively decided against because the results start appearing more complex or more trivial.
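To make the normalization point concrete, here is a minimal Python sketch. It is an illustration only, not a method from Tufekci's paper: the function name `engagement_rate`, the post labels, and all of the counts are hypothetical, and the sketch assumes that an impression count (how many people saw the post) is available, which is often not the case in practice.

```python
from typing import Optional


def engagement_rate(actions: int, impressions: int) -> Optional[float]:
    """Return the share of viewers who took an action (e.g. a "like").

    Returns None when the number of viewers is unknown or zero,
    because a raw count cannot be normalized without a denominator.
    """
    if impressions <= 0:
        return None
    return actions / impressions


# Hypothetical counts, invented purely for illustration.
posts = {
    "post_a": {"likes": 500, "impressions": 100_000},
    "post_b": {"likes": 50, "impressions": 1_000},
}

# Normalize each raw like count by the number of people who saw the post.
rates = {
    name: engagement_rate(p["likes"], p["impressions"])
    for name, p in posts.items()
}

for name, rate in rates.items():
    shown = "unknown" if rate is None else f"{rate:.1%}"
    print(f"{name}: {posts[name]['likes']} likes, rate = {shown}")
```

In this made-up example, post_a has ten times as many likes as post_b, yet post_b's rate is ten times higher once exposure is taken into account. Ranking by raw counts and ranking by rates would lead to opposite conclusions, which is the kind of complication the passage above describes.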