The Subtle Sources of Sampling Bias Hiding in Your Data
Plummeting data acquisition costs have contributed to a surge in business analytics. But more data doesn’t inherently remove sampling bias, and in some cases it can make that bias worse.
When a group of Boston College students started an analytics project using data about UFO sightings, they thought they’d learn something about visits from spaceships and alien creatures — such as how weather and movie releases influence sightings. The Economist had done something similar, finding that most UFO reports are made during what it called “drinking hours” (5 to 11 p.m.), when people could be “nursing their fourth beer” — a possible connection that the publication dubbed “close encounters of the slurred kind.”
Instead, the students learned about sampling bias.
UFO sighting reports in the United States have increased substantially since the National UFO Reporting Center, a private organization based in Davenport, Washington, started tracking them in 1974. But this might not mean that we are getting more visitors from outer space.
When the reporting center first opened, communicating a sighting required making a telephone call to file a report. Once the internet became publicly available and people could make reports using an online form, the number of sightings began to rise. This easier and cheaper collection system provided more data about sightings. But the increase in the availability of data fundamentally changed the sample set, and any change in how data is collected affects the conclusions we can draw from it.
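To make the mechanism concrete, here is a minimal sketch in Python. Every number in it is hypothetical and chosen only for illustration; none comes from the National UFO Reporting Center. It assumes the true number of sightings per year stays constant and that only the share of witnesses who bother to file a report changes when the online form replaces the phone call.

```python
def reported_sightings(true_events, report_rate):
    """Expected number of reports, given how many events occur
    and the share of witnesses who file a report."""
    return round(true_events * report_rate)

# Hypothetical values for illustration only.
TRUE_EVENTS_PER_YEAR = 1000  # assumed constant across both eras
PHONE_RATE = 0.05            # assumed share who would phone in a report
ONLINE_RATE = 0.20           # assumed share who would fill in a web form

phone_era = reported_sightings(TRUE_EVENTS_PER_YEAR, PHONE_RATE)
online_era = reported_sightings(TRUE_EVENTS_PER_YEAR, ONLINE_RATE)
growth = online_era / phone_era

print(f"Phone era reports:  {phone_era}")
print(f"Online era reports: {online_era}")
print(f"Apparent growth:    {growth:.1f}x, with no change in true events")
```

Under these assumptions, reports quadruple even though nothing in the sky has changed. An analyst who looked only at the report counts would see a trend that is produced entirely by the collection method.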
Looking beyond the world of UFOs, lower costs of data collection provide value in many ways: We have much more data to work with and learn from than ever before. But managers must be careful to understand how the data was generated and how that might influence its value. The sources of bias in data sets can be far subtler than the ones that could be at play in the UFO data. What’s more, the task of interpreting data is falling on the shoulders of more people in organizations. What biases should managers be on the lookout for as they work to gain insight from increasing amounts of available data? And how can managers help their employees become better at spotting such biases?
Here are four practices that can help:
Understand the history behind your data. New data can be fundamentally different from older data in ways that managers must understand. In the infamous Chicago Daily Tribune “Dewey Defeats Truman” example, when the newspaper prematurely printed an incorrect headline about the winner of the U.S.