The Subtle Sources of Sampling Bias Hiding in Your Data

Plummeting data acquisition costs have been a big part of the surge in business analytics. We have much richer samples of data to use for insight. But more data doesn’t inherently remove sampling bias; in fact, it may make it worse.

When a group of Boston College students started an analytics project using data about UFO sightings, they thought they might learn more about visits from little green creatures, or at least something about how weather or movie releases influence human perceptions of those extraterrestrials. They had in mind something like The Economist’s finding that sightings are linked to times when people are “nursing their fourth beer,” leading to a “close encounter of the slurred kind.”

Instead, the students learned about sampling bias.

UFO sighting reports have increased exponentially since the National UFO Reporting Center first started tracking them in 1974. But this might not mean that we are in fact getting more visitors from outer space.

When the reporting center first began, communicating a sighting required making a telephone call to file a report. When the internet became publicly available, people could report a sighting using an online form. Unsurprisingly, increases in the number of sightings correspond with public adoption of the internet. When reporting UFOs became easier and cheaper, more data about sightings became available. Increased access to data fundamentally changes the sample.
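The arithmetic behind this point can be sketched in a few lines of Python. Everything here is hypothetical and for illustration only: the function name, the number of underlying events, and both reporting rates are assumptions, not figures from the reporting center.

```python
def expected_reports(true_events, reporting_rate):
    """Expected number of filed reports, given the chance a witness reports."""
    return true_events * reporting_rate


# Hypothetical: the number of underlying events is held constant.
TRUE_EVENTS_PER_YEAR = 1_000

# Hypothetical reporting rates before and after reporting became easier.
phone_era = expected_reports(TRUE_EVENTS_PER_YEAR, 0.02)
online_era = expected_reports(TRUE_EVENTS_PER_YEAR, 0.20)

print(f"phone era reports:  {phone_era:.0f}")
print(f"online era reports: {online_era:.0f}")
```

Under these assumed numbers, the count of reports rises tenfold even though nothing about the underlying events has changed. Only the cost of reporting moved.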

Decreased costs of data collection are great; we have much more data than ever before. But managers must be careful to understand how the data was generated to be able to make good decisions.

Unfortunately, in business, the sources of bias can be far subtler than in this UFO example. What should managers consider as they work to gain insight from increasingly available data?

Understand the history behind your data: Social media, for example, is an amazing source of detailed data about activity. Businesses now have unprecedented, individual-level access to information about their customers. But not everyone uses social media. And, shockingly, not everyone is honest online. Data obtained via social media may therefore show different trends than customer data from prior sources such as phone or written surveys, not because customers are behaving differently but because they respond differently depending on the medium.

This rich, new, low-cost data is fundamentally different in ways that managers must understand in order to make informed decisions. In the infamous Chicago Daily Tribune example that incorrectly concluded that “Dewey Defeats Truman,” the shift from manual polling to telephone polling changed the sample. Similarly, naïvely relying on Foursquare and Twitter data about Hurricane Sandy might lead businesses to direct supplies to less affected (but noisier) locations, because they fail to recognize that the locations hit hardest would generate less online activity precisely because of the severity of the storm’s impact. Understanding the history behind data sets is increasingly important for organizations seeking to become more data driven.

More data may not mean new data: Historically, variation within a sample has been used to infer sampling error. With increased data volumes, statistical significance is trivial to find, but it distracts from the larger problem of sampling bias: the sample may be internally consistent yet fail to reflect the desired population. Data volume then gives false comfort. In the best case, managers fall into the trap of thinking they have “more” data when they just have heavier weighting of the prior data. In a worse case, the increased data volume will hide sampling bias deeper in a haystack of hoarded data.
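A small simulation can make the false comfort of volume concrete. This is a minimal sketch under invented assumptions: a hypothetical customer base in which 30% are active online and score higher on some satisfaction measure than the other 70%, with a sample drawn only from the online group. All names and numbers are illustrative.

```python
import math
import random
import statistics

# Hypothetical population: two customer groups with different average scores.
ONLINE_SHARE = 0.3   # assumed share of customers reachable online
ONLINE_MEAN = 8.0    # assumed average score among online customers
OFFLINE_MEAN = 5.0   # assumed average score among everyone else
SCORE_SD = 2.0       # assumed spread of scores within each group

# The quantity a manager actually wants: the mean over the whole population.
TRUE_MEAN = ONLINE_SHARE * ONLINE_MEAN + (1 - ONLINE_SHARE) * OFFLINE_MEAN


def biased_sample_summary(n, seed=0):
    """Sample n scores from the online group only.

    Returns the sample mean and the half-width of a normal-approximation
    95% confidence interval computed from within-sample variation.
    """
    rng = random.Random(seed)
    scores = [rng.gauss(ONLINE_MEAN, SCORE_SD) for _ in range(n)]
    mean = statistics.fmean(scores)
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(n)
    return mean, half_width


for n in (100, 10_000, 100_000):
    mean, half_width = biased_sample_summary(n)
    print(f"n={n:>7}: estimate {mean:.2f} +/- {half_width:.3f} "
          f"(population mean {TRUE_MEAN:.2f})")
```

As the sample grows, the interval narrows around the online group's average rather than the population's. The within-sample variation measures precision only; it says nothing about whether the sample covers the people the decision is about.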

Old data sources were imperfect, too: While it may be tempting to benchmark new data sources against old sources, the old data sources had sampling bias, too. In the UFO example, most aliens seem to visit Washington state. But it turns out that, in the days before the internet, reporting a sighting by phone required a long-distance call, which (unless you lived in Washington state) could be expensive and may have made cost-conscious viewers of flying saucers think twice about phoning in. With internet-based reporting, that particular source of bias should dissipate. Similarly, social media does provide data that reduces sampling bias from older sources.

Intuition remains important: The changes in sampling bias provide yet another example of the importance of blending intuition and data. The need to understand potential sampling bias has not gone away; instead, with the increasing amount and variety of data, from an increasing number of sources, domain expertise may become even more important. As data becomes increasingly ubiquitous and plentiful, domain expertise can be a source of differentiation.

Widespread ability to collect data at low cost is leading to closer encounters through data. For UFO encounters, the Hynek scale classifies proximity to aliens as increasing “kinds,” from distant UFO sightings (first kind) to observations of physical effect (second kind) to contact with actual creatures (third kind). Likewise, businesses are getting ever closer to data processes through mobile, social, and internet-of-things technologies, and our closer encounters with data mean that sampling bias will become increasingly important to understand well.