The Subtle Sources of Sampling Bias Hiding in Your Data

Plummeting data acquisition costs have contributed to a surge in business analytics. But more data doesn’t inherently remove sampling bias — and in some cases, it could make it worse.


Topics

Frontiers

An MIT SMR initiative exploring how technology is reshaping the practice of management.

When a group of Boston College students started an analytics project using data about UFO sightings, they thought they’d learn something about visits from spaceships and alien creatures — such as how weather and movie releases influence sightings. The Economist had done something similar, finding that most UFO reports are made during what it called “drinking hours” (5 to 11 p.m.), when people could be “nursing their fourth beer” — a possible connection that the publication dubbed “close encounters of the slurred kind.”

Instead, the students learned about sampling bias.

UFO sighting reports in the United States have increased substantially since the National UFO Reporting Center, a private organization based in Davenport, Washington, started tracking them in 1974. But this might not mean that we are getting more visitors from outer space.

When the reporting center first opened, communicating a sighting required making a telephone call to file a report. Once the internet became publicly available and people could make reports using an online form, the number of sightings began to rise. This easier and cheaper collection system provided more data about sightings. But the increase in the availability of data fundamentally changed the sample set — and any change in data affects the conclusions we can draw from that data.

Looking beyond the world of UFOs, lower costs of data collection provide value in many ways: We have much more data to work with and learn from than ever before. But managers must be careful to understand how the data was generated and how that might influence its value. The sources of bias in data sets can be far subtler than the ones that could be at play in the UFO data. What’s more, the task of interpreting data is falling on the shoulders of more people in organizations. What biases should managers be on the lookout for as they work to gain insight from increasing amounts of available data? And how can managers help their employees become better at spotting such biases?

Here are four practices that can help:

Understand the history behind your data. New data can be fundamentally different from older data in ways that managers must understand. In the infamous Chicago Daily Tribune “Dewey Defeats Truman” example, when the newspaper prematurely printed an incorrect headline about the winner of the U.S. presidential election, the paper had based its conclusion on telephone polling rather than the door-to-door polling it had used in the past. That turned out to be a mistake because the underlying demographics of telephone owners at that time differed substantially from those of the national electorate as a whole.

In a similar vein, businesses today must be savvy about how they interpret the rich, low-cost data from online forums, keeping in mind that what they learn about customers using social media may show different trends than data from prior sources, such as phone or written surveys. Social media is an amazing new source of detailed data about consumer activity; it gives businesses access to unprecedented amounts of information about individuals. But not every customer uses social media. And not everyone is honest online: People intentionally shape their images on social media. Customers may not necessarily be behaving differently, but they could be responding differently based on the medium or the visibility of their response.

Organizations seeking to become more data driven must make it easy for managers to understand data’s lineage — its origins, the systems involved in collecting it, and intermediate processing steps. Managers must understand the need to ask: What do we know about where the data we are using comes from? What might have changed the data since its origin?

Acknowledge that more data may not mean new data. With increased amounts of data, statistically significant results are easier to find, but they distract from the larger problem of sampling error — the sample may be internally consistent yet fail to reflect the desired population. Data volume can give false comfort: Managers may fall into the trap of thinking they have “better data” when they merely have a heavier weighting of the data they already had. Worse, the increased volume can bury sampling errors deeper in a haystack of information.

It is here that the “big” of big data can fail us. Managers risk assuming that an enormous number of data points just can’t be wrong — that the sample is too big to fail. What’s more, ingesting new information may require substantial processing to transform unstructured data into structured data, or the new data may require that business processes change to incorporate real-time feeds. Each of these tasks takes resources, time, and effort. Failure comes from generating more data that doesn’t add to what the organization already knows.
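The “too big to fail” intuition can be made concrete with a small simulation. The sketch below is purely illustrative (the survey scenario, response probabilities, and 50% true satisfaction rate are all invented assumptions): it draws ever-larger samples from a channel that over-represents satisfied customers and shows that the estimate converges tightly — to the wrong number.

```python
import random

random.seed(42)

# Hypothetical setup: the true population is 50% satisfied, but the
# survey channel over-samples happy customers — satisfied customers
# respond with probability 0.8, dissatisfied ones with probability 0.4.
TRUE_RATE = 0.5
P_RESPOND_SATISFIED = 0.8
P_RESPOND_DISSATISFIED = 0.4

def biased_sample(n):
    """Draw n responses from the biased survey channel."""
    responses = []
    while len(responses) < n:
        satisfied = random.random() < TRUE_RATE
        respond_prob = P_RESPOND_SATISFIED if satisfied else P_RESPOND_DISSATISFIED
        if random.random() < respond_prob:
            responses.append(satisfied)
    return responses

for n in (100, 10_000, 1_000_000):
    estimate = sum(biased_sample(n)) / n
    print(f"n = {n:>9,}: estimated satisfaction rate = {estimate:.3f}")

# As n grows, the estimate settles near 0.8 / (0.8 + 0.4) ≈ 0.667,
# not the true 0.5: more data shrinks the error bars around a biased answer.
```

More data from the same biased channel narrows the confidence interval without moving it toward the truth — which is exactly why volume alone cannot substitute for understanding how the sample was generated.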

Before embarking on projects to acquire more data, managers need to assess what new information the additional data will bring — or do a pilot to find out. They should ask: What insights are we looking for? And if we don’t know, how can we find out without large investments?

Recognize that old data sources were imperfect, too. While it may be tempting to benchmark new data sources against old sources, the old data sources had sampling bias, too. In the UFO example, most reported sightings were in Washington state. But in the days before widespread use of the internet, reports of sightings had to be made by phone, and anyone outside of the local Davenport, Washington, calling area would have had to make a long-distance call to reach the National UFO Reporting Center — a factor that may have made cost-conscious people think twice before picking up the phone.

So how do you know which data source is best? Managers likely understand their older data sources better than their new ones, since managerial experience grows over time. Just as it took time to get to know the old data, it will take time to learn the new. There are trade-offs in the biases of one source versus another, and understanding those biases will take experience. Using both old and new data sources can sometimes provide more insight than either alone, since each can help illuminate the sampling bias in the other.

To acknowledge this challenge, managers should ask: How is our existing data limited? Can a new data source get around that limitation?

Remember that intuition remains important. With an increasing volume of data from an ever-expanding variety of sources, blending information with intuition and understanding potential sampling bias has become more vital than ever.

Understanding sampling bias is an inherently human task. It requires knowing what is not in the data — and the data itself cannot tell you what it’s missing. Despite the rise of artificial intelligence and machine learning, human domain expertise is still needed to look at the big picture and understand which portion of that picture a particular data source shows, as well as what it doesn’t show. Human domain expertise is still needed to understand ongoing trends that likely began before the new data source existed. And human domain expertise is still needed to know what might happen in the business context that affects the sampling.

As data becomes increasingly ubiquitous, expertise in applying data to specific business problems will become a key resource — and a potential source of differentiation for both individuals and the organizations that employ them. Managers should ask: What do I know about my business that the data does not?

Combining Two Types of Knowledge

Unfortunately for organizations, the burden of understanding sampling bias falls squarely on whichever staff members try to use and interpret analytical results. Just because an organization can produce analytics, that does not mean everyone (or anyone!) in the organization will be able to apply those results well. Getting value from data requires deep knowledge of the data-generating processes. But in many organizations, this information is not attached to the analytical results. That presents a crucial, growing challenge: conveying information about data lineage through inherently distributed processes.

As use of analytics becomes more pervasive throughout organizations, an increasing number of people will need to become savvy consumers of analytical results. To succeed as data consumers, managers must combine two vastly different types of knowledge. First, they must know the details behind the data-generating processes to understand what the data can and cannot say. Second, they must have a broad understanding and general knowledge of their business.

The burden of understanding sampling bias cannot be handled centrally. It is a challenge that will be felt throughout organizations, requiring many individuals to learn more to compensate for it. To address this challenge, managers need to ask: How can I develop employees who combine a broad knowledge of the business with an ability to interpret detailed data accurately?


This article was originally published on May 30, 2017. It has been updated to reflect edits made for its inclusion in our Fall 2017 print edition.


Acknowledgments

The author thanks Boston College students Matthew Frederick, Puneet Nayyar, Amanda Valdes, Alexa Villalobos, and Valeria Yanes for insights from their analytics project about UFO sightings.

Reprint #:

59126
