Does Data Have a Shelf Life?

Organizations are collecting and utilizing a massive amount of data. New research asks the question, “When is the right time to refresh that data?”

Creating insights from data is an important, and costly, issue for many companies. They spend time and effort collecting data, cleaning it, and using resources to find something meaningful from it. But what about after the insights have been generated? Do insights have a shelf life? If so, when should knowledge gleaned from old data be refreshed with new data?

In their new research paper, “When is the Right Time to Refresh Knowledge Discovered from Data?”, Xiao Fang and Olivia R. Liu Sheng of the University of Utah, in collaboration with Paulo Goes of the University of Arizona, suggest that for real-world applications of Knowledge Discovery in Databases (KDD), such as analyzing customer purchase patterns or public health surveillance, new data is imperative:

It could bring in new knowledge and invalidate part or even all of earlier discovered knowledge. As a result, knowledge discovered using KDD becomes obsolete over time. To support effective decision making, knowledge discovered using KDD needs to be kept current with its dynamic data source.

However, staying current with data sources is a fundamental challenge, one that knowledge management practitioners rank among their top three management issues, according to the researchers:

A trivial solution is to run KDD whenever there is a change in data. However, such solution is neither practical, due to the high cost of running KDD, nor necessary, since it often results in no new knowledge discovered. On the other hand, running KDD too seldom could lead to significant obsoleteness for the knowledge in hand. Therefore, it is critical to determine when to run KDD so as to optimize the trade-off between the obsoleteness of knowledge and the cost of running KDD.

Their solution? Model an optimal knowledge refresh policy.

The researchers’ areas of expertise span machine learning and data mining (Fang), data mining and optimization techniques (Sheng), and modeling of complex production and information systems (Goes). For managers who don’t have a PhD in information systems, the model they developed is hopelessly bound up in mathematical equations.

So I asked the authors to explain in lay terms how their research and model can help managers decide when to refresh knowledge discovered from data.

“The model itself aims at deciding when to run KDD to refresh knowledge such that the combined cost of knowledge loss and running KDD is minimized,” wrote Fang in an email.

He explained that knowledge loss refers to the gradual obsolescence of knowledge discovered by a previous KDD run as new data continue to accumulate after that run. Knowledge loss has real consequences: if KDD is run too infrequently, for instance, customers may not respond to promotions based on obsolete purchase patterns. Yet running it frequently isn’t the answer either, because there is a personnel cost of managing the KDD process and a computation cost of running KDD, whether it’s run in-house or in the cloud.

“We model the problem of deciding when to refresh knowledge as a Markov decision process and establish that it is always beneficial to run KDD such that the combined cost of knowledge loss and running KDD is the lowest when the amount of new data reaches a tipping point,” says Fang. “Such a tipping point can be computed based on the arrival rate of new data, the frequency of decisions that utilize knowledge discovered from data, as well as the estimated costs of knowledge loss and running KDD.”

Fang said that in using their solution, organizations can effectively decide when to refresh knowledge simply by monitoring the amount of new data accumulated against the computed tipping point.
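To make the idea of a tipping point concrete, here is a minimal sketch in Python. It is not the authors' Markov decision process, which the article does not spell out. It assumes a much simpler, hypothetical cost model: new records arrive at a steady rate, decisions are made at a steady rate, each decision loses a fixed amount of value for every record that has arrived since the last KDD run, and each KDD run has a fixed cost. Under those assumptions, refreshing every time n new records accumulate costs, on average per day, (run cost × arrival rate / n) + (decision rate × loss per decision per record × n / 2), which is lowest at n = sqrt(2 × run cost × arrival rate / (decision rate × loss per decision per record)). All function names, parameter names, and numbers below are illustrative.

```python
import math


def tipping_point(arrival_rate, decision_rate, loss_per_decision_per_record, kdd_run_cost):
    """Threshold of new records that minimizes average cost per day.

    ASSUMPTION: simplified linear-loss model, not the paper's model.
    arrival_rate: new records per day
    decision_rate: decisions per day that rely on the discovered knowledge
    loss_per_decision_per_record: cost of one decision per unseen record
    kdd_run_cost: fixed cost of one KDD run (people plus computation)
    """
    loss_slope = decision_rate * loss_per_decision_per_record
    return math.sqrt(2 * kdd_run_cost * arrival_rate / loss_slope)


def average_cost_per_day(threshold, arrival_rate, decision_rate,
                         loss_per_decision_per_record, kdd_run_cost):
    """Average daily cost when KDD is rerun every `threshold` new records."""
    # Runs per day is arrival_rate / threshold, each costing kdd_run_cost.
    run_cost = kdd_run_cost * arrival_rate / threshold
    # On average, threshold / 2 unseen records are outstanding at any moment.
    knowledge_loss = decision_rate * loss_per_decision_per_record * threshold / 2
    return run_cost + knowledge_loss


def should_refresh(new_records_since_last_run, threshold):
    """Monitoring rule: rerun KDD once the backlog reaches the tipping point."""
    return new_records_since_last_run >= threshold


# Hypothetical inputs: 1,000 records/day, 50 decisions/day,
# 0.002 loss per decision per unseen record, 500 per KDD run.
params = dict(arrival_rate=1000, decision_rate=50,
              loss_per_decision_per_record=0.002, kdd_run_cost=500)
threshold = tipping_point(**params)
print(f"Refresh after about {threshold:.0f} new records")
print(f"Average cost per day at that threshold: "
      f"{average_cost_per_day(threshold, **params):.2f}")
```

In this toy model, a fixed daily or weekly schedule corresponds to some other threshold (for example, 1,000 records for a daily refresh at 1,000 records per day), and `average_cost_per_day` can be used to compare it against the computed tipping point.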

“According to our empirical study in a major e-commerce company, our solution can reduce cost by at least 30%, compared to current industry practice of refreshing knowledge periodically (e.g., daily or weekly),” says Fang. “While many enterprises are deploying big data analytics, our solution will greatly sustain the value of these fast growing big data analytics applications.”

2 Comments On: Does Data Have a Shelf Life?

  • Andy Capaloff | September 10, 2013

    Is the old Data literally removed? It seems that trend analysis is also very important, so Data should somehow be designated as ‘not current’, as opposed to truly removed. Of course that makes Big Data bigger, but it also aids in predictive analysis. Perhaps I missed a point somewhere here.

  • Bryan Kolterman | December 30, 2013

    Hi Andy,

    I don’t think old data is literally removed so much as refreshed as new data becomes available. With the low cost of data storage, companies can archive a majority of old data so it is never truly lost. I completely agree with you that all data, old and new, is helpful for predictive analysis. Their examples of using KDD for customer purchase patterns and public health surveillance may add some clarity to your question. These are probably two applications where old and new data would predict vastly different trends, so the constant refresh is necessary to predict the next best move for the company. This is very ‘time dependent’ data, especially customer purchase patterns, since consumer behavior is constantly changing. With applications where trend analysis shows relative stability over time, old data and new data might be equally effective at what you are trying to predict. In this situation, it could be beneficial for the company to forgo the cost of running KDD and accept the knowledge loss.
