The Pitfalls of Using Online and Social Data in Big Data Analysis

New research suggests that big data analyses, particularly those built on social media data, can give a biased picture of human behavior because platform use is shaped by societal factors.

Striking new research out of Princeton University’s Center for Information Technology Policy and the University of North Carolina at Chapel Hill suggests that inferences based on how people use social media platforms like Twitter and Facebook should be reconsidered. The reason? These platforms represent skewed samples from which it is difficult to draw accurate conclusions.

In her draft paper, Big Data: Pitfalls, Methods and Concepts for an Emergent Field, UNC professor and Princeton CITP fellow Zeynep Tufekci (@zeynep) compares the methodological challenges of developing socially-based big data insights using Twitter to biological testing on Drosophila flies, better known as fruit flies. Drosophila flies are usually chosen because they’re relatively easy to use in lab settings, easy to breed, have rapid and “stereotypical” life cycles, and the adults are pretty small. The problem? They’re not necessarily representative of non-lab (read: real-life) scenarios. Tufekci posits that the dominance of Twitter as the “model organism” for social media in big data analyses similarly skews analysis:

Each social media platform carries with it certain affordances which structure its social norms and interactions and may not be representative of other social media platforms, or general human social behavior …

Twitter is used by about 10% of the U.S. population, which is certainly far, far from a representative sample. While Facebook has a wider diffusion rate, its rates of use are structured by race, gender, class and other factors and are not representative. Using these sources as “big data” model organisms raises important questions of representation and visibility as demographic or social groups may have different behavior — online and offline — and may not be fully represented or even sampled via current methods.

Tufekci says that one of the biggest methodological dangers of big data analysis is “insufficient understanding of the underlying samples.” In her words,

It’s not enough to understand how many people have “liked” a Facebook status update, clicked on a link, or “retweeted” a message, without having a sense of how many people saw and chose to, or not to, take that option. That kind of normalization is rarely done, or may even be actively decided against because the results start appearing more complex or more trivial.
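The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from Tufekci's paper: the `Post` class, the `engagement_rate` function, and all of the numbers are hypothetical, and the sketch assumes an impression count (how many people saw a post) is available, which is often not the case in practice.

```python
from dataclasses import dataclass


@dataclass
class Post:
    name: str
    retweets: int     # raw count of people who retweeted
    impressions: int  # people who saw the post (assumed to be known)


def engagement_rate(post: Post) -> float:
    """Share of viewers who chose to retweet; 0.0 if nobody saw the post."""
    if post.impressions == 0:
        return 0.0
    return post.retweets / post.impressions


# Hypothetical numbers, invented purely for illustration.
posts = [
    Post("A", retweets=500, impressions=1_000_000),
    Post("B", retweets=50, impressions=2_000),
]

# Ranking by raw count and ranking by normalized rate can disagree.
by_raw = max(posts, key=lambda p: p.retweets)
by_rate = max(posts, key=engagement_rate)

print("Most retweets:", by_raw.name)
print("Highest retweet rate:", by_rate.name)
```

With these made-up figures, post A collects ten times as many retweets as post B, yet a far smaller share of the people who saw it chose to retweet. Which post "performed better" depends on a denominator that raw counts leave out.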

On the conceptual side of the big data analysis challenge, Tufekci posits that more in-depth research is needed to understand exactly what social media footprints mean, and what can legitimately be inferred from big data analysis of those footprints.

A case in point: while retweets or mentions are often treated as a measure of “influence,” a retweet could actually mean something quite different, ranging from “affirmation to denunciation to sarcasm to approval to disgust.”

Tufekci makes three additional points regarding conceptual analysis of big data that can be applied in a business setting:

  • Not all networks operate the same way.

    Are social media networks similar to airline networks? Methodologies need to rely on more than “they’re both networks” as a basis of comparison; it’s crucial to examine the specific properties of nodes, edges, connectivity, flow, interaction and structure in different networks to understand which methods can be carried over from one type of network to another.

  • Humans do not interact only in networks.

    Human social information flows do not occur only through node-to-node networks, but also through field effects — large-scale societal events that impact a large group … through changes within whole social, cultural and political fields — that must be taken into consideration.

  • You name it, humans will game it.

    People will create false hashtag trends. They will “subtweet” as a way of talking about a topic or person, deliberately misspelling something or leaving out the @ sign so that they are not visible in a measurable way. They will game algorithms and metrics. This should be expected in all analysis.

When I asked Tufekci how she thinks her research applies to business managers using online and social media data, she said it’s important to keep in mind that more data does not necessarily mean more insight.

“A lot of big data research is done in an isolated, one-shot, single-method manner with no way to assess, interpret or contextualize the findings,” she said. “There is great potential for error and misunderstanding; worse, with a lot of money flowing into this space, there is a lot of pressure to produce ‘results’ and overlook the fact that methods that were not developed to study humans, and do not necessarily work the same way, are being applied widely.

“The online imprints that create these large, aggregate datasets are not just mere ‘mirrors’ of human activity; rather, they are partial, filtered, distorted and complex reflections.”