The processing required to prepare unstructured data for analysis can be cumbersome and prone to error. That’s why companies should do more to organize their data before it is ever collected.
The Smart Way to Deal With Messy Data
Unstructured data — data that is not organized in a predefined way, such as text — is now widely available. But structure must be added to the data to make it useable for analysis, which means significant processing. That processing can be a problem.
In a form of modern alchemy, analytics processes now transmute “base” unstructured data into “noble” business value. Systems everywhere greedily salt away every imaginable kind of data. Technologies such as Hadoop and NoSQL store this hoard easily in its native unstructured form. Natural language processing, feature extraction (distilling nonredundant measures from larger data), and speech recognition now routinely alchemize vast quantities of unstructured text, images, audio, and video, preparing them for analysis. These processes are nothing short of amazing, working against entropy to create order from disorder.
Unfortunately, while these processing steps are impressive, they are far from free or free from error. I can’t help but think that a better alternative in many cases would be to avoid the need for processing altogether.
We all know how each step in a process mangles information. In the telephone game, as each person whispers to the next player what they think was said to them, words can morph into an unexpected or misleading final message. In a supply chain, layers exacerbate distortion as small mistakes and uncertainty quickly compound.
By analogy, organizations are playing a giant game of telephone with data, and unstructured data makes the game far more difficult. In a context where data janitorial activities consume 50% to 80% of scarce data scientist resources, each round of data telephone costs organizations in accuracy, effort, and time — and few organizations have a surplus of any of these three.
Within organizations, each processing step can be expensive to develop and maintain. But the growth in importance of data sharing between organizations magnifies these concerns. Our recently published report, “Analytics Drives Success with IoT,” associates business value with sharing data between organizations in the context of the internet of things. And, to foreshadow our report to be released in January, we observe similar results in the broader analytics context. But with every transfer of data, more processes need to be developed and maintained.
If this processing were unavoidable, then it would just be a cost of data sharing within or between organizations. A disconcerting point, however, is that there is (or could be) structure in the ancestry of much of the data that is currently unstructured. For example, for every organization that generates a web page based on data in a database, there are likely multiple organizations scraping that data (either sanctioned or unsanctioned) and then processing it to try to regain that structure. In the best case, that’s a lot of thrashing just to end up with data in its original form. In the worst case, it’s a lot of effort to put toward obtaining data with many errors.
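To make that round trip concrete, here is a minimal Python sketch. Everything in it is hypothetical and chosen for illustration: the two sample records, the table layout, and the function names `render_html`, `scrape_html`, `publish_json`, and `consume_json` do not come from any real site or library. It contrasts a consumer that has to guess structure back out of a rendered page with one that receives the structured records directly.

```python
import json
import re
from html import escape, unescape

# Hypothetical source records, standing in for rows in the publisher's database.
RECORDS = [
    {"name": "Widget A", "price": 1250.0},
    {"name": "Widget B & Co.", "price": 9.5},
]


def render_html(records):
    """Publisher side: flatten structured rows into presentation markup."""
    rows = "".join(
        f"<tr><td>{escape(r['name'])}</td><td>${r['price']:,.2f}</td></tr>"
        for r in records
    )
    return f"<table>{rows}</table>"


def scrape_html(page):
    """Consumer side: guess the structure back out of the markup.

    The scraper has to know presentation details (exact tag layout, the
    currency symbol, thousands separators, entity encoding) that have
    nothing to do with the data itself.
    """
    pattern = re.compile(r"<tr><td>(.*?)</td><td>\$(.*?)</td></tr>")
    return [
        {"name": unescape(name), "price": float(price.replace(",", ""))}
        for name, price in pattern.findall(page)
    ]


def publish_json(records):
    """Publisher side, disintermediated: share the structure itself."""
    return json.dumps(records)


def consume_json(payload):
    """Consumer side, disintermediated: no guessing required."""
    return json.loads(payload)


scraped = scrape_html(render_html(RECORDS))
shared = consume_json(publish_json(RECORDS))
```

In this toy case both paths recover the original records, but only the scraper depends on the page layout. A cosmetic change to the markup, such as adding a CSS class to each cell, would make this particular scraper silently return nothing, while the structured path would be unaffected.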
All these layers of processing may not be necessary. Can they be removed? Disintermediation removes intermediate steps; in effect, it skips steps that may have once added value beyond their cost, but no longer do. While we typically think of disintermediation as removing “middleman” companies between the producers and consumers of products, the concept can work for data as well.
Organizations can disintermediate their data processes in at least two ways.
- Make structured data available to others through a formal application program interface (API). By providing data in easy-to-consume formats, organizations can save downstream effort. For example, the first product provided by The Echo Nest was its developer API. This allowed others to benefit easily from the analytics that The Echo Nest offered. Recent research finds evidence of increased income associated with businesses that offer APIs.
- Structure data at the source. As the growth of unstructured storage technologies (such as the Hadoop Distributed File System and many NoSQL variations) has made unstructured storage easier, it is tempting to design processes that take advantage of these by storing unstructured data with the idea that structure can be added at some nebulous later time. But the further data gets from the source, the more difficult it may be to add that structure later. Why not structure the data at the source when possible?
In his work at the online labor-matching platform oDesk, NYU Stern’s John Horton observed that systems originally allowed people to share information about their technology product experience in unstructured entries. As a result, users created more than 100,000 lexically distinct entries with numerous errors (e.g., “MS Powerpoint,” “Microsoft Power Point,” “PowerPoint,” “MSFT PPT,” “Microsoft PowerPoint,” “Microsoft PwerPoint,” etc.). Although only about 14% of the entries were used more than once, many were actually duplicates, intended to be the same. Worse, other systems and organizations that used that data each had to build processes to interpret the intent. Instead, oDesk redesigned data entry to support a structured list, which greatly enhanced both the company’s internal processes to analyze data and the processes of organizations using the oDesk API.
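The difference between the two designs can be sketched in a few lines of Python. This is an illustration only, not oDesk's implementation: the `Skill` vocabulary and the `record_skill` helper are hypothetical, and the free-text values are the variants quoted above.

```python
from collections import Counter
from enum import Enum

# Free-text entries quoted in the article; all of them mean the same product.
FREE_TEXT = [
    "MS Powerpoint",
    "Microsoft Power Point",
    "PowerPoint",
    "MSFT PPT",
    "Microsoft PowerPoint",
    "Microsoft PwerPoint",
]


class Skill(Enum):
    """Hypothetical controlled vocabulary offered at data entry."""

    POWERPOINT = "Microsoft PowerPoint"
    EXCEL = "Microsoft Excel"


def record_skill(choice: str) -> Skill:
    """Accept only values from the structured list.

    Enum lookup by value raises ValueError for anything not in the list,
    so misspellings are rejected at the source instead of being stored.
    """
    return Skill(choice)


# Unstructured entry: six users, six lexically distinct values.
free_text_counts = Counter(FREE_TEXT)

# Structured entry: the same six users each pick from the list.
structured_counts = Counter(
    record_skill("Microsoft PowerPoint") for _ in FREE_TEXT
)
```

With free text, every downstream consumer has to decide for itself whether "MSFT PPT" and "Microsoft PwerPoint" mean the same thing. With the structured list, that decision is made once, at entry, and every consumer sees a single value.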
The big payoff for oDesk was that the redesign improved recommendation quality. “Once everyone was speaking the same language vis-à-vis skills,” says Horton, “we could build a bunch of new features without having to do lots of machine learning to figure out what people were talking about. This kicked off a more general attempt in the company, named ‘Project Babel,’ to find important unstructured data and then structure it.”
Our collective proficiency at adding structure to unstructured data at each step of analytics processes is impressive and improving. But this alchemy is hard and, like the telephone game, adds error, cost, and time. Disintermediate to avoid it.