Why Big Data Isn’t Enough

There is a growing belief that sophisticated algorithms can explore huge databases and find relationships independent of any preconceived hypotheses. But in businesses that involve scientific research and technological innovation, the authors argue, this approach is misguided and potentially risky.

As “big data” becomes increasingly integrated into many aspects of our lives, we are hearing more calls for revolutionary changes in how researchers work. To save time in understanding the behavior of complex systems or in predicting outcomes, some analysts say it should now be possible to let the data “tell the story,” rather than having to develop a hypothesis and go through painstaking steps to prove it. The success of companies such as Google Inc. and Facebook Inc., which have transformed the advertising and social media worlds by applying data mining and mathematics, has led many to believe that traditional methodologies based on models and theories may no longer be necessary. Among young professionals (and many of the MBA students we see), there is almost a blind faith that sophisticated algorithms can be used to explore huge databases and find interesting relationships independent of any theories or prior beliefs. The assumption is, the bigger the data, the more powerful the findings.

As appealing as this viewpoint may be, we think it’s misguided — and could be risky for companies. For example, what if the data appears to support a new drug design or a new scientific approach when there actually isn’t a causal relationship? Although we acknowledge that data mining has enabled tremendous advances in business intelligence and in the understanding of consumer behavior — think of how Amazon.com Inc. figures out what you might want to buy or how content recommendation engines such as those used by Netflix Inc. work — it’s important for executives who oversee technical disciplines to be thoughtful about how they apply this approach to their areas.
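The risk of data that "appears to support" a relationship that is not there can be shown with a small simulation. The following is a minimal sketch and is not drawn from the research described here: it fills a table with independent random noise and reports the strongest pairwise correlation a hypothesis-free search would turn up. The function names `pearson` and `strongest_chance_correlation`, the sample size of 30, and the variable counts are all illustrative assumptions.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_variables, n_samples, seed=0):
    """Hypothetical helper: build columns of independent Gaussian noise and
    return the largest absolute pairwise correlation among them."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)]
               for _ in range(n_variables)]
    return max(abs(pearson(a, b))
               for a, b in itertools.combinations(columns, 2))


if __name__ == "__main__":
    # No column has any real relationship to any other, by construction.
    for n_variables in (5, 50, 200):
        r = strongest_chance_correlation(n_variables, n_samples=30)
        print(f"{n_variables:>4} unrelated variables -> strongest |r| = {r:.2f}")
```

Because every column is noise, any correlation the search reports is spurious. With a fixed number of observations, adding more variables can only raise the strongest correlation found, since there are more pairs to search. This is one reason that a bigger database does not by itself produce more trustworthy findings.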

Recently, we looked at several fields where massive amounts of data are collected and available: drug discovery and pharmaceutical research; genomics and species improvement; weather forecasting; the design of complex products like gas turbines; and speech recognition. (See “About the Research.”) In each setting, we asked a series of broad questions, including the following: How do data-driven research approaches fit with traditional research methods? In what ways could data-driven research extend the current understanding of scientific and engineering problems? And what cautions do managers need to exercise about the limitations and the proper use of statistical inference?



2 Comments On: Why Big Data Isn’t Enough

  • Russ Gentile | November 22, 2016

    Thank you for something to think about. We are all certainly rushing into Big Data, but this article reminds us to go slow in order to move fast. Fact-based, accepted research, including dissenting opinions, is a necessary prerequisite of any new AI research.

    -R. Gentile
    Purdue Krannert MBA 2018

  • Pavel Barseghyan | January 26, 2017

    This is a very interesting direction of research. Thank you.

    Three years ago I published two research papers on the same topic:

    1. Pavel Barseghyan (2013) “Equilibrium and Extreme Principles in Discovering Unknown Relationships from Big Data: Part 1: Methods of Advanced Data Analytics in Light of the Equations of Mathematical Physics” PM World Journal Volume 2, Issue 12 December 2013 – 9 pages. http://pmworldjournal.net/wp-content/uploads/2013/12/pmwj17-dec2013-barseghyan-big-data-part1-advanced-data-analytics-FeaturedPaper.pdf
    2. Pavel Barseghyan (2014) “Equilibrium and Extreme Principles in Discovering Unknown Relationships from Big Data: Part 2: Non-statistical Mathematical Methods in Project Management” PM World Journal Volume 3, Issue 1 January 2014 – 14 pages. http://pmworldjournal.net/wp-content/uploads/2014/01/pmwj18-jan2014-barseghyan-big-data-part2-non-statistical-mathematical-methods-FeaturedPaper.pdf

    Excerpt from paper 1
    “Any area of the classical quantitative science, including mechanics and electrodynamics, can be represented as an area of Big Data. For doing that, it is sufficient to collect a large amount of data of electrodynamic or mechanical nature from the surrounding environment.
    Having the scientific representations of some quantitative area in the form of fundamental equations on the one side, and in the form of Big Data on the other side, a natural question arises of whether there is a correspondence and an agreement between these two views?

    Also, if there is such a correspondence between them, are unambiguous mutual transitions between the fundamental equations and the Big Database possible?

    In particular, is it possible to obtain the well-known fundamental equations or some of their equivalents for a quantitative field of knowledge from its Big Data in a statistical or semi statistical way?”
    The main topic of paper 2 is the derivation, in an analytic way, of the relationships between the parameters of projects.
    This derivation is based on the state equation of projects as human systems and on the principle of headcount gradient.
    Pavel Barseghyan
