Big Data: Service From the Cloud

A number of companies are starting to look to the cloud, not only for storing massive amounts of data, but for analysis and deriving insights, too.

Google's Ju-kay Kwek, in a keynote speech at THINKstrategies’ Cloud Analytics Summit last month, talked about how Google is using the cloud to analyze massive amounts of data for itself — and for customers.

That the era of big data is upon us is no longer a question. What to do with all that data appears to be the next big hurdle. In certain industries — life sciences, for example — so much data is being generated collectively that researchers are unable to derive any real insights with any real speed.

A recent New York Times article, “DNA Sequencing Caught in the Deluge of Data,” discussed China-based BGI, the world’s largest genomics research company. BGI produces so much data — the equivalent of 2,000 human genomes a day — that it can’t transmit results to clients or collaborators over the Internet. To do so would take weeks. Instead, BGI puts the data on disks and sends it out via FedEx. Really.

The crux, according to The Times, is that “the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data.”

(A single human genome sequence consumes about two terabytes of data, according to Simon Robinson, research director, storage, at 451 Research, an analyst group focused on the business of enterprise IT innovation. Once you do some processing and analysis of that genome, it can quickly turn into seven terabytes of data. For a single genome. “That’s why this is such a problem,” Robinson told me in a phone interview. “It’s a step change in the sheer volume of data that is created.”)
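Taking Robinson's figures at face value, a quick back-of-the-envelope calculation shows why BGI resorts to FedEx. The link speed used below (a sustained 100 megabits per second) is an illustrative assumption, not a number from the article:

```python
# Back-of-the-envelope: how long does it take to move genome-scale data
# over a network link? (The link speed is an illustrative assumption.)

TB = 10**12  # bytes in a terabyte (decimal)

raw_genome_bytes = 2 * TB   # ~2 TB per sequenced genome, per Robinson
processed_bytes = 7 * TB    # ~7 TB after processing and analysis
link_bps = 100 * 10**6      # assumed sustained 100 Mbit/s connection

def transfer_days(num_bytes, bits_per_second):
    """Days needed to push num_bytes over a link at bits_per_second."""
    seconds = num_bytes * 8 / bits_per_second
    return seconds / 86400  # 86,400 seconds in a day

print(f"Raw genome:       {transfer_days(raw_genome_bytes, link_bps):.1f} days")
print(f"Processed genome: {transfer_days(processed_bytes, link_bps):.1f} days")
```

At that rate, a single processed genome takes the better part of a week to transmit, so a batch of results for a client stretches into weeks — roughly what The Times reports, and why disks in a FedEx box win.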

So what kinds of solutions might be possible? Increasingly, companies are turning to the cloud, not only to store massive amounts of data, but to analyze it and derive insights, too. Companies like Amazon, IBM and Rackspace offer flexible cloud storage models that, in the words of Rackspace, provide “dynamic scaling at a moment’s notice.” Other companies, including Google, are developing new methods for analyzing all that data — as a service.

In a recent Sandhill blog, “Big Data and Insight As a Service,” Evangelos Simoudis, senior managing director of Trident Capital, outlined two types of cloud-based, big data analytics services. One type operates on data that is primarily managed behind the firewall, such as transactional applications (ERP, HCM, CRM) that can be supplemented with syndicated or open source data. The other operates on the data generated by the software itself.

At THINKstrategies’ Cloud Analytics Summit in April, Ju-kay Kwek, product manager for Google’s cloud platform, talked about what Google has done internally to manage and analyze its big data, and what it is doing to help other companies do the same.

To set the stage, Kwek provided a few metrics: He said that 60 hours of video are uploaded to YouTube every minute (just two days ago YouTube updated that number to 72 hours of video a minute); Google’s index is 100 million gigabytes — 100,000 terabytes — which doesn’t include content; and there are 350 million active Gmail users.

“If I were a product manager on Gmail, how do I know what features are important to users, what features users are having problems with or what features are trending?” said Kwek in his keynote address. He added: “250 milliseconds is the average time a search takes to execute on Google. All that video, all those users, that huge index — how do you turn that into a quarter second search query that gives you the information you want? That is the kind of challenge that Google was forced to deal with to build the business we want.”

Kwek also discussed Google’s cloud services. Among these was Google BigQuery, a managed data analytics service in the cloud. “The really interesting thing about BigQuery is it is really, really great about analyzing big sets of data,” Kwek said. “You can literally run query analysis on big data sets — five to ten terabytes — and get a response back pretty quickly, in a matter of seconds, ten to twenty seconds. That’s pretty game changing.”
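BigQuery exposes a SQL-like query language over very large tables. The service itself is beyond the scope of this article, but the shape of the queries Kwek describes — scan an enormous event log, group, count, rank — can be sketched over a toy in-memory “table.” The column names and sample rows below are invented for illustration; they are not from Google’s documentation:

```python
from collections import Counter

# Toy stand-in for a BigQuery-style aggregation. BigQuery would run the
# equivalent of:
#   SELECT feature, COUNT(*) FROM events GROUP BY feature ORDER BY 2 DESC
# across terabytes of rows; here we do the same group-and-count over a
# small in-memory list of dicts.
events = [
    {"user": "a", "feature": "labels"},
    {"user": "b", "feature": "search"},
    {"user": "a", "feature": "search"},
    {"user": "c", "feature": "filters"},
    {"user": "b", "feature": "search"},
]

def top_features(rows):
    """Count rows per feature; return (feature, count) pairs, most common first."""
    counts = Counter(row["feature"] for row in rows)
    return counts.most_common()

print(top_features(events))  # "search" leads with 3 events
```

The point of BigQuery, per Kwek, is that this same group-and-count pattern comes back in seconds even when the events table holds terabytes rather than five rows.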

In limited release, BigQuery has been used by a number of businesses and developers for real-time big data analytics, according to a post by Kwek on the Google Developers Blog. Claritics, for example, has built an application for game developers to gather real-time insights into gaming behavior. Another company, Crystalloids, built an application to help a resort network “analyze customer reservations, optimize marketing and maximize revenue.”

As for all that genome data? Google is tackling that, too. Its venture capital arm recently invested in DNAnexus, a company that specializes in providing scalable storage — and analysis — as a service to DNA sequencing centers and the researchers working with sequence data.