Plummeting data acquisition costs have contributed to a surge in business analytics. But more data doesn’t inherently remove sampling bias — and in some cases, it could make it worse.
When a group of Boston College students started an analytics project using data about UFO sightings, they thought they’d learn something about visits from spaceships and alien creatures, such as how weather and movie releases influence sightings. The Economist had done something similar, finding that most UFO reports are made during what it called “drinking hours” (5 to 11 p.m.), when people could be “nursing their fourth beer,” a possible connection that the publication dubbed “close encounters of the slurred kind.”
Instead, the students learned about sampling bias.
UFO sighting reports in the United States have increased substantially since the National UFO Reporting Center, a private organization based in Davenport, Washington, started tracking them in 1974. But this might not mean that we are getting more visitors from outer space.
When the reporting center first opened, communicating a sighting required making a telephone call to file a report. Once the internet became publicly available and people could make reports using an online form, the number of sightings began to rise. This easier and cheaper collection system provided more data about sightings. But the increase in the availability of data fundamentally changed the sample set — and any change in data affects the conclusions we can draw from that data.
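A back-of-the-envelope sketch makes the mechanism concrete. Every number below is a made-up assumption for illustration, not a figure from the reporting center: the count of underlying events is held constant, and only the share of events that people bother to report changes when reporting becomes easier.

```python
# Hypothetical numbers for illustration only; none come from the reporting center.

def expected_reports(true_events: int, reporting_rate: float) -> float:
    """Expected number of filed reports, given how many events occur
    and the share of them that actually get reported."""
    return true_events * reporting_rate

TRUE_EVENTS_PER_YEAR = 10_000   # assumed to be identical in both eras
PHONE_REPORTING_RATE = 0.02     # assumed: filing a report requires a phone call
ONLINE_REPORTING_RATE = 0.10    # assumed: filing a report through a web form is easier

phone_era = expected_reports(TRUE_EVENTS_PER_YEAR, PHONE_REPORTING_RATE)
online_era = expected_reports(TRUE_EVENTS_PER_YEAR, ONLINE_REPORTING_RATE)
apparent_growth = online_era / phone_era

print(f"Phone-era reports:  {phone_era:.0f}")
print(f"Online-era reports: {online_era:.0f}")
print(f"Apparent growth:    {apparent_growth:.1f}x, with no change in underlying events")
```

Under these assumptions, the report count rises purely because the collection channel changed, which is why a trend in the raw counts cannot be read as a trend in the events themselves.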
Looking beyond the world of UFOs, lower costs of data collection provide value in many ways: We have much more data to work with and learn from than ever before. But managers must be careful to understand how the data was generated and how that might influence its value. The sources of bias in data sets can be far subtler than the ones that could be at play in the UFO data. What’s more, the task of interpreting data is falling on the shoulders of more people in organizations. What biases should managers be on the lookout for as they work to gain insight from increasing amounts of available data? And how can managers help their employees become better at spotting such biases?
Here are four practices that can help:
Understand the history behind your data. New data can be fundamentally different from older data in ways that managers must understand. In the infamous Chicago Daily Tribune “Dewey Defeats Truman” example, when the newspaper prematurely printed an incorrect headline about the winner of the U.S. presidential election, the paper had based its conclusion on telephone polling, rather than door-to-door polling, which had been used in the past. That turned out to be a mistake because the underlying demographics of telephone owners at that time differed substantially from those of the national electorate as a whole.
In a similar vein, businesses today must be savvy about how they interpret the rich, low-cost data from online forums, keeping in mind that what they learn about customers using social media may show different trends than data from prior sources, such as phone or written surveys. Social media is an amazing new source of detailed data about consumer activity; it gives businesses access to unprecedented amounts of information about individuals. But not every customer uses social media. And not everyone is honest online: People intentionally shape their images on social media. Customers may not necessarily be behaving differently, but they could be responding differently based on the medium or the visibility of their response.
Organizations seeking to become more data driven must make it easy for managers to understand data’s lineage — its origins, the systems involved in collecting it, and intermediate processing steps. Managers must understand the need to ask: What do we know about where the data we are using comes from? What might have changed the data since its origin?
Acknowledge that more data may not mean new data. With increased amounts of data, statistically significant results are easier to find, but they can distract from the larger problem of sampling error: The sample may be internally consistent yet fail to reflect the desired population. Data volume can give false comfort: Managers may fall into the trap of thinking they have “better data” when they just have heavier weighting of the prior data. Worse still, the increased data volume can hide sampling errors deeper in a haystack of information.
It is here that the “big” of Big Data can fail us. Managers risk falling into thinking that an enormous number of data points just can’t be wrong, that the sample is too big to fail. What’s more, ingesting new information may require substantial processing to transform unstructured data into structured data. Or the new data may require that business processes change to incorporate real-time feeds. Each of these tasks takes resources, time, and effort. The failure lies in spending that effort to generate more data that adds nothing to what the organization already knows.
Before embarking on projects to acquire more data, managers need to assess what new information the additional data will bring — or do a pilot to find out. They should ask: What insights are we looking for? And if we don’t know, how can we find out without large investments?
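The false comfort of volume can be seen in a small simulation. This is a minimal sketch under invented assumptions: the two customer segments, their sizes, and their approval rates are all hypothetical, and the margin of error uses the standard normal approximation for a proportion. The sample is drawn only from the segment reachable through a new channel, so it grows without ever becoming more representative.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: two customer segments with different approval rates.
SHARE_REACHABLE = 0.40       # assumed share of customers the new channel can reach
APPROVAL_REACHABLE = 0.70    # assumed approval rate among reachable customers
APPROVAL_UNREACHABLE = 0.40  # assumed approval rate among everyone else

# Approval rate across the whole population (what managers actually want to know).
TRUE_APPROVAL = (SHARE_REACHABLE * APPROVAL_REACHABLE
                 + (1 - SHARE_REACHABLE) * APPROVAL_UNREACHABLE)

def biased_sample_estimate(n: int) -> tuple[float, float]:
    """Draw n responses from the reachable segment only.

    Returns the estimated approval rate and its 95% margin of error
    (normal approximation)."""
    approvals = sum(random.random() < APPROVAL_REACHABLE for _ in range(n))
    p_hat = approvals / n
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin

results = {n: biased_sample_estimate(n) for n in (100, 10_000, 1_000_000)}
for n, (p_hat, margin) in results.items():
    print(f"n={n:>9,}  estimate={p_hat:.3f} +/- {margin:.3f}  "
          f"(population rate {TRUE_APPROVAL:.2f})")
```

As the sample grows, the margin of error shrinks, but the estimate settles on the reachable segment's rate rather than the population's. The larger sample is more precise about the wrong quantity, which is the sense in which more data is not necessarily new data.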
Recognize that old data sources were imperfect, too. While it may be tempting to benchmark new data sources against old sources, the old data sources had sampling bias, too. In the UFO example, most reported sightings were in Washington state. But in the days before widespread use of the internet, reports of sightings had to be made by phone, and anyone outside of the local Davenport, Washington, calling area would have had to make a long-distance call to reach the National UFO Reporting Center — a factor that may have made cost-conscious people think twice before picking up the phone.
So how do you know which data source is best? It is likely that managers understand their older data sources better than their new data sources, since managerial experience grows over time. Just as it took time to get to know the old data, it will take time to learn the new. There are trade-offs in the biases of one source versus another, and understanding those biases will take experience. Using both old and new data sources can sometimes provide more insight than either alone, since each can help illuminate the sampling bias in the other.
To acknowledge this challenge, managers should ask: How is our existing data limited? Can a new data source get around that limitation?
Remember that intuition remains important. With an increasing volume of data from an ever-expanding variety of sources, blending information with intuition and understanding potential sampling bias has become more vital than ever.
Understanding sampling bias is an inherently human task. It requires knowing what is not in the data — and the data itself cannot tell you what it’s missing. Despite the rise of artificial intelligence and machine learning, human domain expertise is still needed to look at the big picture and understand which portion of that picture a particular data source shows, as well as what it doesn’t show. Human domain expertise is still needed to understand ongoing trends that likely began before the new data source existed. And human domain expertise is still needed to know what might happen in the business context that affects the sampling.
As data becomes increasingly ubiquitous, expertise in applying data to specific business problems will become a key resource and a potential source of differentiation for both individuals and the organizations that employ them. Managers should ask: What do I know about my business that the data does not?
Combining Two Types of Knowledge
Unfortunately for organizations, the burden of understanding sampling bias falls squarely on whichever staff members try to use and interpret analytical results. Just because an organization can produce analytics, that does not mean everyone (or anyone!) in the organization will be able to apply those results well. Getting value from data requires deep knowledge of the data-generating processes. But in many organizations, this information is not attached to the analytical results. That presents a crucial, growing challenge: conveying information about data lineage through inherently distributed processes.
As use of analytics becomes more pervasive throughout organizations, an increasing number of people will need to become savvy consumers of analytical results. To succeed as data consumers, managers must combine two vastly different types of knowledge. First, they must know the details behind the data-generating processes to understand what the data can and cannot say. Second, they must have a broad understanding and general knowledge of their business.
The burden of understanding sampling bias cannot be handled centrally. It is a challenge that will be felt throughout organizations, requiring many individuals to learn more to compensate for it. To address this challenge, managers need to ask: How can I develop employees who combine a broad knowledge of the business with an ability to interpret detailed data accurately?
This article was originally published on May 30, 2017. It has been updated to reflect edits made for its inclusion in our Fall 2017 print edition.