At what point does a company’s data collection become self-defeating?
As organizations everywhere increasingly embrace analytics, it is tempting to think that additional data will provide the crucial insight, reveal the overlooked explanation, or crisply discern key solutions within a morass of muddled information. But “more data” is not the answer to every problem.
Organizations that add data indiscriminately run the risk of becoming data hoarders instead of data collectors. An analyst working in a large financial services institution offered this useful distinction: “Hoarders store everything and don’t know how to determine what is important. Collectors know exactly what is valuable and prioritize what to keep.”
As data storage costs continue to plummet, why not just save everything? Why not be a hoarder? The answer is: hoarding wastes resources and, paradoxically, reduces the usefulness of existing data.
First, costs still exist; while storage costs have decreased, they are not zero. The sheer volume of data produced by modern information technologies adds up quickly and relentlessly. Calvin Smith, principal manager of global innovation at EMC Corporation, observes that “… ‘big data’ does not describe a Holy Grail data set that some companies ‘get’ and other companies don’t … big data could really include all data … and it’s not easy or cheap to attempt to collect and store all the data out there.”
Furthermore, storage costs themselves are a small part of data’s expense; maintenance costs (such as provisioning, backing up, verifying, and recovering) can be substantial and require expensive staff involvement.
Second, hoarding interferes with the use of existing data because it diverts scarce analyst and managerial attention that could be better applied elsewhere. If actionable insights are the proverbial needle in the haystack, adding more data may just make the haystack bigger, and the needle that much harder to find. The financial services analyst notes, “Even if data is free to store, high-priced data scientists will still waste time looking at it and try to find spurious patterns or incorporate the data into models to no avail. There is still an opportunity cost to looking at the wrong data and not having a strong sense for what questions are important to answer.”
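The analyst's point about spurious patterns can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the target, the noise columns, the row count, and the column counts are all invented for the example and come from no real data set. Every column is random noise, yet the strongest apparent correlation with the target can only stay the same or grow as more columns are searched.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(target, noise_columns):
    """Largest absolute correlation between the target and any noise column."""
    return max(abs(pearson(target, col)) for col in noise_columns)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows = 50             # hypothetical number of observations

# Hypothetical measure the analyst cares about.
target = [rng.gauss(0, 1) for _ in range(n_rows)]

# Hypothetical hoarded columns: pure noise, unrelated to the target.
noise = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(500)]

# Search ever-larger piles of noise for the "best" pattern.
for k in (5, 50, 500):
    best = strongest_spurious_correlation(target, noise[:k])
    print(f"{k:>3} noise columns -> strongest |correlation| = {best:.2f}")
```

Because each larger pile contains the smaller one, the best-looking pattern never gets weaker as columns are added, even though none of the columns carries any information.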
Yet additional data can be valuable. What distinguishes collecting behavior from hoarding behavior? In the spirit of Isaac Asimov’s Three Laws of Robotics, I suggest Three Laws (plus a “Zeroth” Law) to guide your company’s approach to adding more data.
Law 1: “More data” should not obscure desirable information or, through distraction, allow ongoing analyses to come to harm.
Additional data should have a defined purpose in the analysis. Does it shed light on a potential source of customers missing from the current analysis? In what way does it reduce uncertainty or allow more precise estimates of an important measure? How does it improve discrimination, segmentation, or causal inference?
Data collectors use experience or sampling to add data rich with potential to further the purpose of the analysis; data hoarders add data fearfully and speculatively. The fear of making a wrong decision leads to keeping everything — and this can sometimes be counterproductive. As an example, in the context of searching for terrorist activity, security expert Bruce Schneier observed, “Piling more data onto the mix makes it harder, not easier. […] the last thing you want to do is increase the amount of hay you have to search through.”
Law 2: “More data” should be added only if other data will not suffice and its addition does not conflict with the First Law.
Not all data is created equal; there is considerable variation in quality and usefulness. As a result, alternative sources of data might achieve similar purposes. Data collectors consider multiple available alternatives to meet the purpose of the analysis; data hoarders simply add more measures to the stack.
For example, when GM investigated recent trouble with engine stalls, its OnStar system had not captured data specifically about the ignition switches. However, rather than collect additional data, GM was able to use existing data about engine status and shifter position as a proxy for the missing information about ignition switches. While an additional direct measure could improve accuracy, its cost must be balanced against the incremental accuracy it provides.
Law 3: “More data” should be added only if its addition does not exacerbate existing biases in the data, and its addition does not conflict with the First or Second Law.
All data is biased in some way. Adding more data can add more bias, undermining overall data quality. Social media, for example, is a rich source of information, but comes with demographic and socio-economic sampling bias. Analysis of Twitter and Foursquare activity during Hurricane Sandy offers fascinating and unprecedented insights into human behavior. However, one commentator noted that adding more data from other social media sites would not add more information about affected people without power or Internet access. Data collectors seek novel perspectives; data hoarders pile on convenient data and reinforce bias.
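A short simulation can make the sampling point concrete. This is a minimal sketch with invented numbers, not an analysis of any Hurricane Sandy data: the population size, the 70 percent online share, and the "hardship" scores are all hypothetical. It assumes that people who cannot post online are worse off on average, and shows that drawing a larger sample from the online group alone does not move the estimate toward the true population average.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population: each person has a "hardship" score. People who
# are offline (no power or Internet) are assumed to have higher hardship.
population = []
for _ in range(20000):
    online = rng.random() < 0.7
    hardship = rng.gauss(3.0 if online else 7.0, 1.0)
    population.append((online, hardship))

# The quantity we would like to know: average hardship across everyone.
true_mean = sum(h for _, h in population) / len(population)

# The only people social media data can reach: those who are online.
online_scores = [h for online, h in population if online]


def sample_mean(values, size):
    """Mean of a random sample of the given size, drawn without replacement."""
    return sum(rng.sample(values, size)) / size


# Collect more and more data from the same biased source.
for size in (100, 1000, 10000):
    estimate = sample_mean(online_scores, size)
    print(f"sample of {size:>5} online users: {estimate:.2f} "
          f"(population average: {true_mean:.2f})")
```

The larger samples estimate the online group's average more precisely, but the gap to the population average remains, because the missing people are missing from every sample.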
Law 0: “More data” must not harm the overall analytical process.
For analytical organizations, analyses are not isolated projects, but rather part of a process. Using data is like having a child: the process should not be orphaned, and it requires ongoing maintenance. As additional data is incorporated into analysis, what will the process be for incorporating updates to the data, ensuring quality, and so on? Is the process sustainable? Does the value added sufficiently offset the long-term costs of maintenance to create a positive return on investment? Data collectors integrate data into ongoing processes; data hoarders add data that requires new processes that interfere with other important organizational processes.