Competing With Data & Analytics
With the bird’s-eye view of an analyst, Simon Robinson has paid close attention over the last 12 years to how companies are collecting and transmitting ever-larger amounts of information.
Since 2000, Robinson has been with 451 Research, an analyst group focused on enterprise IT innovation. Today he is research vice president, running the Storage and Information Management team. Based in 451 Research’s London office, Robinson and his team specialize in identifying emerging trends and technologies that are helping organizations optimize and take advantage of their data and information, and meet ever-evolving governance requirements. (He’s on Twitter at @simonrob451.)
“Storage is very complex,” Robinson says. And indeed, not only does it entail managing capacity and figuring out the best collection and retrieval methods, it also means syncing with both the IT and the business teams and paying attention to complex security and privacy issues.
In a conversation with Renee Boucher Ferguson, a researcher and editor at MIT Sloan Management Review, Robinson discussed the changing storage landscape in the era of big data and cloud computing.
Describe the problems you see the data deluge creating in terms of storage.
We’re getting to this stage for many organizations — large and small — where finding places to put data cost-effectively, in a way that also meets the business requirements, is becoming an issue. It’s certainly a top-five issue for most organizations from an IT perspective, and for many it’s in their top two or three.
In the past, it was always sufficient just to buy more storage, buy more disk. But we’re at the point where two things are happening.
First, the capital cost of buying more capacity isn’t going down. It continues to grow, along with the operational aspects of managing that capacity and the processes. Storage is very complex, with lots of different skills required. So you’ve got that on the operational response side.
Second, there’s an opportunity to really put that data to work in driving some kind of value for the business. The value could be in terms of being more efficient and responsive, or creating new revenue streams, or better mining customer insight to tailor products and services more effectively and more quickly.
So there are two sides of it. There’s the operational side, but also this analytic side.
Even thinking about how to integrate the two sides of the challenge sounds like a new skill set.
EMC is using the term “the data scientist.” They’re putting a lot of investment into trying to popularize this notion that every organization is likely to have somebody who is an expert in managing and manipulating data. We are just at the start of this step change in what we can do with data and in the platform that you need to perform analytics at scale. That’s where this notion of the data scientist starts to come into play.
“Data scientist” is interesting. I saw the term “total data landscape” on a 451 Research blog. What does that term mean?
Total data is Matt’s phrase [Matthew Aslett, 451 Research’s research manager for data management and analytics]. He says, “Everybody calls it big data, but big data doesn’t really do it justice,” so he came up with total data. Just to confuse the industry even more.
Ok. So tell us, how extreme are the storage issues?
Take something like big data in the life sciences space, something like sequencing the human genome. For a single sequence of one individual human, that sequence operation consumes about two terabytes of data. But that’s just the start. A data scientist from a life sciences company told me that once you begin doing additional processing and analysis on that data, that two terabytes can quickly turn into seven terabytes. For a single genome. This is a step change in the sheer volume of data being created.
The really significant issue for the life science guys is that not only are they creating these huge amounts of data, they also want to keep it for as long as possible. Because the more data you have, the more chance you can generate some insight through that historical aspect.
Video is a similar example. Obviously video takes up a lot of space, but once you move to high definition and 3D, suddenly it’s non-linear: it’s an order of magnitude more data to handle. That’s why big data is bubbling up as such a big issue in some industries.
This must create massive security issues, too, just to back up the data properly.
Definitely. It’s been a catalyst for innovation within the storage area. Hard disk drives always fail, and once you get to petabyte-scale storage systems, the traditional mechanisms that you use to protect that data, like RAID [redundant array of independent disks], start to fall down. They start to become much less effective and efficient. That’s been an engine for some new innovations that we’re seeing in the storage space.
What are some of those innovations in storage?
There are things like object storage systems that manage data as an individual object rather than a file or a volume, which is how storage is typically viewed. There are things like erasure coding, which is an evolution of RAID. It’s a different sort of algorithm that allows you to store more data more cost-effectively than with traditional RAID.
Companies such as Cleversafe, Amplidata, Scality, and EMC with Atmos are developing these more efficient means of storing huge volumes of data. This isn’t something where you can just “buy an individual system of storage.” This is about racks and racks and racks of data. It’s got to be one, overall scale-out system. EMC, for instance, bought a company called Isilon that does scale-out in NAS [network attached storage].
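The capacity argument Robinson makes for erasure coding can be sketched with back-of-envelope numbers. The sketch below is illustrative, not tied to any vendor’s product: it compares the raw-capacity overhead of keeping full replicas against a hypothetical erasure code that splits an object into k data chunks plus m parity chunks, any k of which can rebuild the object.

```python
# Back-of-envelope storage-overhead comparison (illustrative, not
# vendor-specific): full replication vs. a k+m erasure code.

def replication_overhead(copies):
    """Raw bytes stored per usable byte when keeping n full copies."""
    return float(copies)

def erasure_overhead(data_chunks, parity_chunks):
    """Raw bytes stored per usable byte for a (k data + m parity)
    erasure code. Any k of the k+m chunks suffice to rebuild the
    object, so up to m chunks can be lost without losing data."""
    return (data_chunks + parity_chunks) / data_chunks

# Triple replication: survives losing 2 copies, costs 3.0x raw capacity.
print(replication_overhead(3))    # 3.0
# A 10+4 erasure code: survives losing 4 chunks, costs only 1.4x.
print(erasure_overhead(10, 4))    # 1.4
```

The arithmetic shows why, at petabyte scale, coding schemes that tolerate more failures at lower overhead become attractive as replacements for replication and traditional RAID.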
What about the cloud? What’s happening there in terms of managing and storing big data? I’m aware of Amazon’s Elastic Compute Cloud, which has been out for a while. Are there a lot of projects like that underway?
It’s early days. We spend a lot of time speaking to end users, surveying them on what their focuses are and what their spending intentions are. When we talk to the IT infrastructure team about big data, most of them say, “It’s not really a focus for us.” When we speak to the business team, they say, “Oh, big data is huge.” So there’s a kind of disconnect.
Most big data projects are very much driven by a line of business. It’s outside the bounds of operational IT, and it’s basically the archetypal Ph.D. guy in the corner who’s been given a task to go out and play around with a data storage framework like Hadoop to see if it’s worth the company making any real commitment to.
I think the next stage that we’re going to see is that many of those initial experiments will come back with the conclusion, “Yes, this is something worth investing more in.” That will move the process into operational IT, into the core data center. And it’s there that the cloud will start to play a part.
If your understanding of cloud is basically that it’s a proxy for having an efficient and flexible infrastructure, then absolutely you need to have that as a starting point for these types of big data projects. If you want big data infrastructure purely as a service, that’s where some things like the Amazon cloud and other projects might come into play.
Just transmitting data to the cloud initially must be a challenge.
Yes. One of the challenges with the public cloud model is it works great once your data is there. But think about it: If your application is creating a two-terabyte file for each sequence operation, like with that life sciences example, you’ve got to move two terabytes up into the cloud — multiplied by hundreds or thousands of times.
You also want to be able to get your data off the cloud quickly. One of its big use cases is back-up, if your primary data center fails or burns down or is flooded. If that happens, you need to reconstitute your data somewhere else as quickly as possible.
That’s a lot of data to move backwards and forwards, and that’s one issue that hasn’t really yet been addressed. There are a variety of ways to tackle it, but it’s an issue that plenty of folks in the industry are working on.
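The scale of the data-movement problem Robinson describes is easy to quantify. A rough sketch, using the two-terabyte genome file from the earlier example and assumed link speeds (real throughput would be lower once protocol overhead and contention are counted):

```python
# Rough transfer-time arithmetic for moving a 2 TB sequence file
# over a WAN link. Link speeds are assumptions for illustration;
# real-world throughput is lower after protocol overhead.

def transfer_hours(size_tb, link_gbps):
    """Hours to move size_tb terabytes at link_gbps gigabits per
    second, assuming an ideal, fully utilized link (decimal units)."""
    size_bits = size_tb * 1e12 * 8            # terabytes -> bits
    seconds = size_bits / (link_gbps * 1e9)   # bits / (bits per second)
    return seconds / 3600

print(round(transfer_hours(2, 1), 1))     # ~4.4 hours on a 1 Gbps line
print(round(transfer_hours(2, 0.1), 1))   # ~44.4 hours on 100 Mbps
```

Multiply those hours by hundreds or thousands of sequence operations and it becomes clear why shipping physical drives can beat the network, and why the industry keeps working on this.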
Last November, the New York Times had an article, “DNA Sequencing Caught in Deluge of Data,” that talked about a genomics company in China that generates so much data that it can’t be transmitted electronically. What they do is back it up on disks and actually FedEx the disks to collaborators.
Yeah. There are work-arounds. Amazon has an import/export service where, as you say, you transfer your data to a hard drive and FedEx it. The issue there is that by the time you’ve done that FedEx from your facility to a cloud facility, the data might have changed. There also potentially are application consistency issues.
In terms of moving lots of data, what are some solutions that the industry is working on?
It depends very much on the application. Maybe you just need a fatter pipe, or maybe it’s a case of needing better optimization at either end. I spoke recently with a guy at a cloud storage company called Nirvanix that has set up something called a cloud competency center. The specific goal of that center is to find some creative solutions to this data movement issue. So, there’s stuff happening for sure.
You mentioned video. Can you say a little more about the issues in that space, and about which other industries are looking at big data storage and big data transmission head on?
The one that comes up all the time is HPC, meaning high-performance computing, which is big in media and entertainment. This is the kind of thing that companies creating the next “Avengers” movie in high def with worldwide distribution are thinking about. General content distribution is an issue, too. Video-on-demand is part of many applications now, and most websites have some kind of embedded video feature.
In the oil and gas industry, the sensors used to drill for and look for oil and gas throw out vast, vast amounts of data. Financial is starting to bubble up — obviously they do an awful lot of computation for their trading platforms, but they are increasingly required to retain more of their electronic processes and documents for longer. Healthcare is also a big one.
Really, it’s pretty much every industry that is being digitized. Many are being required to retain more information for longer. And they all deal with customers, so there is customer data to be mined for whatever reason.
Where there is customer data involved, there is always the question of privacy or security, particularly with cloud storage. Is that an issue that’s easily overcome?
Well, it depends on the vertical and it depends on the region. In general, our position there is, “Yes, it’s been an issue, but it’s becoming less of one.” Sometimes companies are dealing with real technical restrictions, and sometimes they’re dealing with an emotional response that people have: “Oh, I don’t want my data to be somewhere else where I don’t trust it to be, where I don’t have direct control over the infrastructure.”
You do hear the security and privacy card flagged up, especially over here in Europe. There are some specific issues within Europe itself where the cloud model generally doesn’t have the same economies of scale. In Germany, for example, certain data can’t leave the borders of the country. There are limitations with payment card industry data, as well, within Europe. It’s a factor, and it’s not going to completely go away. But it’s going to be less of an issue as the industry just gets more comfortable with the model.
Is it the European Union that is implementing restrictions or is it a country-by-country issue, or both?
The European Commission is actually trying to simplify and make it less of an issue. The challenge at the minute is that you’ve got 28 states within the EU, so you’ve basically got 28 individual policies. The EU is trying to confer some kind of standardization across that. It’s debatable how successful they will be, but they’re certainly looking to address it.
On the emotional side, I’d make the comparison to 10 years ago, when we were all a bit nervous about using our credit cards online. And now we all do it all the time. There is a comfort issue that will be addressed just through maturity of the model. In storage, two years ago we were struggling to find examples of enterprises using Amazon Web Services for their IT. Now Amazon has dozens and dozens of reference customers and examples.