Framing Data Science Problems the Right Way From the Start

Data science project failure can often be attributed to poor problem definition, but early intervention can prevent it.

Reading Time: 5 min 


The failure rate of data science initiatives — often estimated at over 80% — is way too high. We have spent years researching the reasons contributing to companies’ low success rates and have identified one underappreciated issue: Too often, teams skip right to analyzing the data before agreeing on the problem to be solved. This lack of initial understanding guarantees that many projects are doomed to fail from the very beginning.

Of course, this issue is not a new one. Albert Einstein is often quoted as having said, “If I were given one hour to save the planet, I would spend 59 minutes defining the problem and one minute solving it.”

Consider how often data scientists need to “clean up the data” on data science projects, often as quickly and cheaply as possible. This may seem reasonable, but it ignores the critical “why” question: Why is there bad data in the first place? Where did it come from? Does it represent blunders, or are there legitimate data points that are just surprising? Will they occur in the future? How does the bad data impact this particular project and the business? In many cases, we find that a better problem statement is to find and eliminate the root causes of bad data.

Too often, we see examples where people either assume that they understand the problem and rush to define it, or they don’t build the consensus needed to actually solve it. We argue that a key to successful data science projects is to recognize the importance of clearly defining the problem and adhere to proven principles in so doing. This problem is not relegated to technology teams; we find that many business, political, management, and media projects, at all levels, also suffer from poor problem definition.

Toward Better Problem Definition

Data science uses the scientific method to solve often complex (or multifaceted) and unstructured problems using data and analytics. In analytics, the term fishing expedition refers to a project that was never framed correctly to begin with and involves trolling the data for unexpected correlations. This type of data fishing does not meet the spirit of effective data science but is prevalent nonetheless. Consequently, defining the problem correctly needs to be step one. We previously proposed an organizational “bridge” between data science teams and business units, to be led by an innovation marshal — someone who speaks the language of both the data and management teams and can report directly to the CEO. This marshal would be an ideal candidate to assume overall responsibility to ensure that the following proposed principles are utilized.

Get the right people involved. To ensure that your problem framing has the correct inputs, you have to involve all the key people whose contributions are needed to complete the project successfully from the beginning. After all, data science is an interdisciplinary, transdisciplinary team sport. This team should include those who “own” the problem, those who will provide data, those responsible for the analyses, and those responsible for all aspects of implementation. Think of the RACI matrix — those responsible, accountable, to be consulted, and to be informed — for each aspect of the project.

Recognize that rigorously defining the problem is hard work. We often find that the problem statement changes as people work to nail it down. Leaders of data science projects should encourage debate, allow plenty of time, and document the problem statement in detail as they go. This ensures broad agreement on the statement before moving forward.

Don’t confuse the problem and its proposed solution. Consider a bank that is losing market share in consumer loans and whose leadership team believes that competitors are using more advanced models. It would be easy to jump to a problem statement that looks something like “Build more sophisticated loan risk models.” But that presupposes that a more sophisticated model is the solution to market share loss, without considering other possible options, such as increasing the number of loan officers, providing better training, or combating new entrants with more effective marketing. Confusing the problem and proposed solution all but ensures that the problem is not well understood, limits creativity, and keeps potential problem solvers in the dark. A better statement in this case would be “Research root causes of market share loss in consumer loans, and propose viable solutions.” This might lead to more sophisticated models, or it might not.

Understand the distinction between a proximate problem and a deeper root cause. In our first example, the unclean data is a proximate problem, whereas the root cause is whatever leads to the creation of bad data in the first place. Importantly, “We don’t know enough to fully articulate the root cause of the bad data problem” is a legitimate state of affairs, demanding a small-scale subproject.

Do not move past problem definition until it meets the following criteria:

  1. It does no harm. It may not be clear how to solve the defined problem, but it should be clear that solving it will lead to a good business result. If it’s not clear, more refinement may be needed. Consider the earlier bank example. While it might be easy enough to adjust models in ways that grant more loans, this might significantly increase risk — an unacceptable outcome. So the real goal should be to improve market share without creating additional risk, hence the inclusion of “propose viable solutions” in the problem statement above.
  2. It considers necessary constraints. Using the bank example, we can recognize that more sophisticated models might require hiring additional highly skilled loan officers — something the bank might be unwilling to do. All constraints, including those involving time, budget, technology, and people, should be clearly articulated to avoid a problem statement misaligned with business goals.
  3. It has an accountability matrix (or its equivalent). Alignment is key for success, so ensure that those who are responsible for solving the problem understand their various roles and responsibilities. Again, think RACI matrix.
  4. It receives buy-in from stakeholders. Poorly defined or controversial problem statements often produce resistors within the organization. In extreme cases, they may become “snipers,” attempting to ensure project failure. Work to develop a general (not necessarily unanimous) consensus from leadership, those involved in the solution, and the ultimate customers (those who will be affected) on the problem definition.

Taking the time needed to properly define the problem can feel uncomfortable. After all, we live and work in cultures that demand results and are eager to “get on with it.” But shortchanging this step is akin to putting the cart before the horse — it simply doesn’t work. There is no substitute for probing more deeply, getting the right people involved, and taking the time to understand the real problem. All of us — data scientists, business leaders, and politicians alike — need to get better at defining the right problem the right way.


More Like This

Add a comment

You must to post a comment.

First time here? Sign up for a free account: Comment on articles and get access to many more articles.

Comments (2)
Tathagat Varma
While problem definition looks like a very obvious thing to do, in my view the key reason most teams are so bad at it is because there is no one single view of the problem itself. The marketing, sales, R&D, operations, data science....everyone sees it from their narrow lens and believe they have defined the problem correctly. Perhaps so, but they are only like the five blind men interpreting the elephant in their own limited ways. Unless they can agree on one single view, the disconnect will likely continue.
yi yao
Totally agree. We should make a SMART problem statement at the very beginning.