The failure rate of data science initiatives, often estimated at over 80%, is way too high. We have spent years researching the reasons behind companies’ low success rates and have identified one underappreciated issue: Too often, teams skip right to analyzing the data before agreeing on the problem to be solved. This lack of initial understanding dooms many projects from the very beginning.
Of course, this issue is not a new one. Albert Einstein is often quoted as having said, “If I were given one hour to save the planet, I would spend 59 minutes defining the problem and one minute solving it.”
Consider how often data scientists need to “clean up the data” on data science projects, often as quickly and cheaply as possible. This may seem reasonable, but it ignores the critical “why” question: Why is there bad data in the first place? Where did it come from? Does it represent blunders, or are there legitimate data points that are just surprising? Will they occur in the future? How does the bad data impact this particular project and the business? In many cases, we find that a better problem statement is to find and eliminate the root causes of bad data.
Too often, we see people either assume that they understand the problem and rush to define it, or fail to build the consensus needed to actually solve it. We argue that a key to successful data science projects is to recognize the importance of clearly defining the problem and to adhere to proven principles when doing so. This problem is not confined to technology teams; we find that many business, political, management, and media projects, at all levels, also suffer from poor problem definition.
Toward Better Problem Definition
Data science applies the scientific method to problems that are often complex (or multifaceted) and unstructured, using data and analytics to solve them. In analytics, the term fishing expedition refers to a project that was never framed correctly to begin with and involves trolling the data for unexpected correlations. This type of data fishing does not meet the spirit of effective data science but is prevalent nonetheless. Consequently, defining the problem correctly needs to be step one. We previously proposed an