Organizations are collecting and utilizing a massive amount of data. New research asks the question, “When is the right time to refresh that data?”
Creating insights from data is an important, and costly, issue for many companies. They spend time and effort collecting data, cleaning it, and using resources to find something meaningful from it. But what about after the insights have been generated? Do insights have a shelf life? If so, when should knowledge gleaned from old data be refreshed with new data?
In their new research paper, “When is the Right Time to Refresh Knowledge Discovered from Data?”, Xiao Fang and Olivia R. Liu Sheng of the University of Utah, in collaboration with Paulo Goes of the University of Arizona, suggest that for real-world Knowledge Discovery in Databases (KDD) applications, such as analyzing customer purchase patterns or public health surveillance, new data is imperative:
It could bring in new knowledge and invalidate part or even all of earlier discovered knowledge. As a result, knowledge discovered using KDD becomes obsolete over time. To support effective decision making, knowledge discovered using KDD needs to be kept current with its dynamic data source.
However, staying current with data sources is a fundamental challenge, rated as one of the top three management issues cited by knowledge management practitioners, according to the researchers:
A trivial solution is to run KDD whenever there is a change in data. However, such solution is neither practical, due to the high cost of running KDD, nor necessary, since it often results in no new knowledge discovered. On the other hand, running KDD too seldom could lead to significant obsoleteness for the knowledge in hand. Therefore, it is critical to determine when to run KDD so as to optimize the trade-off between the obsoleteness of knowledge and the cost of running KDD.
Their solution? Model an optimal knowledge refresh policy.
The researchers’ areas span machine learning and data mining (Fang), data mining and optimization techniques (Sheng), and modeling of complex production and information systems (Goes). For managers who don’t have a PhD in information systems, the model they developed is hopelessly bound in mathematical equations.
So I asked the authors to explain in lay terms how their research, and their model, can help managers decide when to refresh knowledge discovered from data.
“The model itself aims at deciding when to run KDD to refresh knowledge such that the combined cost of knowledge loss and running KDD is minimized,” wrote Fang in an email.
He explained that knowledge loss refers to the way knowledge discovered by a previous run of KDD gradually becomes obsolete as new data continue to arrive after that run. The trade-off cuts both ways. If KDD is run too infrequently, customers may, for instance, not respond to promotions based on obsolete purchase patterns. But running it frequently isn’t the answer either: there is a personnel cost of managing the KDD process, and there are computation costs of running KDD, whether it’s run in-house or in the cloud.
“We model the problem of deciding when to refresh knowledge as a Markov decision process and establish that it is always beneficial to run KDD such that the combined cost of knowledge loss and running KDD is the lowest when the amount of new data reaches a tipping point,” says Fang. “Such a tipping point can be computed based on the arrival rate of new data, the frequency of decisions that utilize knowledge discovered from data, as well as the estimated costs of knowledge loss and running KDD.”
Fang said that, with their solution, organizations can decide when to refresh knowledge simply by monitoring the amount of new data accumulated since the last KDD run against the computed tipping point.
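For readers who think in code, the monitoring step Fang describes might look something like the minimal Python sketch below. It is an illustration only, not the authors’ implementation. The article does not give the formula for the tipping point, so the sketch assumes it has already been computed and is passed in as a plain record count. The names RefreshMonitor, record, mark_refreshed, and refresh_days are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RefreshMonitor:
    """Tracks new data accumulated since the last KDD run.

    tipping_point is ASSUMED to be computed elsewhere (the paper's formula
    is not reproduced here); in this sketch it is simply a record count.
    """
    tipping_point: int
    accumulated: int = 0

    def record(self, new_records: int) -> bool:
        """Add newly arrived records; return True if KDD should run now."""
        self.accumulated += new_records
        return self.accumulated >= self.tipping_point

    def mark_refreshed(self) -> None:
        """Reset the counter once a KDD run has refreshed the knowledge."""
        self.accumulated = 0


def refresh_days(daily_arrivals, tipping_point):
    """Return the (0-based) days on which a refresh would be triggered."""
    monitor = RefreshMonitor(tipping_point)
    days = []
    for day, count in enumerate(daily_arrivals):
        if monitor.record(count):
            days.append(day)          # run KDD on this day...
            monitor.mark_refreshed()  # ...then start counting again
    return days


# Made-up daily arrival counts with an assumed tipping point of 1,000 records.
print(refresh_days([400, 300, 500, 100, 900, 200], 1000))  # prints [2, 4]
```

Note that the refreshes in this example land on irregular days: the trigger is the volume of new data, not the calendar, which is the contrast with refreshing daily or weekly.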
“According to our empirical study in a major e-commerce company, our solution can reduce cost by at least 30%, compared to current industry practice of refreshing knowledge periodically (e.g., daily or weekly),” says Fang. “While many enterprises are deploying big data analytics, our solution will greatly sustain the value of these fast growing big data analytics applications.”