Why Big Data Isn’t Enough
There is a growing belief that sophisticated algorithms can explore huge databases and find relationships independent of any preconceived hypotheses. But in businesses that involve scientific research and technological innovation, the authors argue, this approach is misguided and potentially risky.
As “big data” becomes increasingly integrated into many aspects of our lives, we are hearing more calls for revolutionary changes in how researchers work. To save time in understanding the behavior of complex systems or in predicting outcomes, some analysts say it should now be possible to let the data “tell the story,” rather than having to develop a hypothesis and go through painstaking steps to prove it. The success of companies such as Google Inc. and Facebook Inc., which have transformed the advertising and social media worlds by applying data mining and mathematics, has led many to believe that traditional methodologies based on models and theories may no longer be necessary. Among young professionals (and many of the MBA students we see), there is almost a blind faith that sophisticated algorithms can be used to explore huge databases and find interesting relationships independent of any theories or prior beliefs. The assumption is, the bigger the data, the more powerful the findings.
As appealing as this viewpoint may be, we think it’s misguided — and could be risky for companies. For example, what if the data appears to support a new drug design or a new scientific approach when there actually isn’t a causal relationship? Although we acknowledge that data mining has enabled tremendous advances in business intelligence and in the understanding of consumer behavior — think of how Amazon.com Inc. figures out what you might want to buy or how content recommendation engines such as those used by Netflix Inc. work — it’s important for executives who oversee technical disciplines to be thoughtful about how they apply this approach to their areas.
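The risk of finding relationships that aren't really there grows with the number of variables mined. A minimal sketch (our illustration, not from the article) makes the point: if you search thousands of purely random candidate variables for the one best correlated with an outcome, you will reliably find a "strong" relationship even though no causal link exists anywhere in the data.

```python
import numpy as np

# Illustration of the multiple-comparisons trap in data mining:
# with enough candidate variables, some will correlate strongly
# with an outcome purely by chance.
rng = np.random.default_rng(0)

n_samples = 30        # e.g., 30 subjects in a small study
n_features = 10_000   # e.g., 10,000 candidate biomarkers -- all pure noise

outcome = rng.normal(size=n_samples)
features = rng.normal(size=(n_features, n_samples))

# Pearson correlation of each (random) feature with the (random) outcome.
y = outcome - outcome.mean()
x = features - features.mean(axis=1, keepdims=True)
corrs = (x @ y) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y))

best = np.abs(corrs).max()
print(f"strongest spurious correlation: |r| = {best:.2f}")
# Despite there being no causal relationship anywhere, the best |r|
# typically lands well above 0.5 with these sample sizes.
```

This is exactly the setting of drug discovery or genomics screens, where the number of measured variables dwarfs the number of samples: without a hypothesis or a held-out validation step, the "story the data tells" may be nothing more than noise.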
Recently, we looked at several fields where massive amounts of data are available and collected: drug discovery and pharmaceutical research; genomics and species improvement; weather forecasting; the design of complex products like gas turbines; and speech recognition. (See “About the Research.”) In each setting, we asked a series of broad questions, including the following: How do data-driven research approaches fit with traditional research methods? In what ways could data-driven research extend the current understanding of scientific and engineering problems? And what cautions do managers need to exercise about the limitations and the proper use of statistical inference?