Gone Fishing — For Data
GE’s Vince Campisi describes the company’s expedition to the data lake.
You’d have to be off the grid to have missed General Electric’s plugging of the industrial Internet — the use of sensors to collect data about things like turbines and jet engines and factory floors. For GE, the industrial Internet means being able to sell services to its customers based on detailed analysis of data streaming from its equipment and the ability to predict failures and other key events.
Doing this required the building of a “data lake” — a storage system to hold enormous amounts of raw data in its native format for future use — as well as the meshing of longstanding industrial culture with a newfangled approach to data and the hiring or development of a variety of analytics talent.
At the helm of much of the technology development is Vince Campisi, a 16-year veteran of GE who became CIO of GE Software at the launch of the GE Center of Software Excellence. Campisi spoke with MIT Sloan Management Review contributing editor Michael Fitzgerald.
What’s been your favorite example of what the data lake can really do for you as a company?
Last November [2013], we set a goal that said, 90 days from now, we want to demonstrate the power of a data lake and stand it up in order to demonstrate what we’re trying to prove with industrial Internet and industrial big data. We set out to connect with 25 airlines, collect and manage machine data from 3.4 million flights, and ingest all that information [toward] helping improve time on wing, which means revenue generated per engine.
We got it done in 70 days. We created the data lake, ingested and connected the full flight data coming off the engines, blended it with shop visits and parts information, and got it into the hands of our data science community to look at things that were reducing time on wing for customers. For example, the analytics showed that washing an engine more consistently and frequently improves its reliability and efficiency.
So we created the technology stack, made it accessible and put it in the hands of people who could take that data and put it into action in a quarter of the time.
How many bodies did it take?
Our mantra was “a two-pizza team,” meaning, a team no bigger than the number of people you could feed off of two pizzas. That was the level of resourcing that went into building the platform.
You came from another GE division, GE Water, where you had done some work to improve how it used data. How different is the task of what you’re doing now from what you’ve done before?
It’s quite different. We’re in a different era. Historically, you think about solving things through traditional data warehousing concepts and business intelligence, but I think with this whole push of the “Internet of things,” which we would call the Industrial Internet, there’s this whole thought process around the information coming off of machines. Being able to scale to support the amount of information that we were seeing coming from the operational technology space — the technology that relates to a power plant, or a rail provider — that sort of backdrop has driven us to think about it in a way we haven’t had to before.
How is this a different operational challenge?
With a data lake, you collect the data and keep it in its raw format. You don’t model it up front like you would in a data warehouse. In fact, it’s predicated on the presumption we don’t understand the relationships that matter and don’t understand fully what we’re going to find when we bring all of these data sets together. The data lake approach is collecting data in its raw format, pumping it into one place in order to break down data silos, and modeling the data based on the outcome you’re trying to solve for.
With a data warehouse, generally you spend a lot of time up front trying to understand the relationships that matter, figuring out how you’re going to structure the information before you put it in your data warehouse — that way, you can get to it quickly and easily. I’d say the shift in technology and the shift in the outcomes we’re trying to drive have enabled us to go at this in an entirely different way, and one we’re pretty excited about based on some of the early results we’ve had so far with our aviation business as well as a couple of others.
This thought process of, “I’m going to pump information in and then I’m going to let people model it based on their outcomes, on the fly,” requires people be savvy, able to interact with data in its raw format and in a very unstructured fashion. You’ll hear a lot of talk about data science, and we’re excited about that. It’s a place where we have experience and skills.
We have a history of strong machine-physics-based domain knowledge. We have lots of people who have built really powerful models over time to understand and predict how assets will perform. This now converges with computer science and statistics-based models, using the data to understand how things will perform and what can be expected.
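Campisi's contrast between the two approaches can be sketched in a few lines of Python. This is an illustrative toy, not GE code: the record fields (`egt_margin`, `part`, and so on) are invented. The point is that the lake keeps every record raw, and each outcome imposes its own structure only at read time.

```python
import json

# "Schema-on-read": raw records land in the lake untouched; structure
# is imposed only when a question is asked. (Hypothetical field names.)
raw_lake = [
    json.dumps({"engine_id": "E1", "flight": "F100", "egt_margin": 42.0}),
    json.dumps({"engine_id": "E1", "shop_visit": "SV-7", "part": "blade"}),
    json.dumps({"engine_id": "E2", "flight": "F101", "egt_margin": 35.5}),
]

def query(lake, outcome_fields):
    """Model the data for one outcome: keep only the records that carry
    the fields this analysis needs, in the shape it needs them."""
    rows = []
    for rec in lake:
        doc = json.loads(rec)
        if all(f in doc for f in outcome_fields):
            rows.append({f: doc[f] for f in outcome_fields})
    return rows

# Two different "models" over the same raw store, built on the fly.
flights = query(raw_lake, ["engine_id", "egt_margin"])
repairs = query(raw_lake, ["engine_id", "part"])
```

A warehouse would instead fix one schema up front and reject or transform records at write time; here the same raw store serves the engine-performance question and the shop-visit question without either schema existing in advance.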
So there’s an interesting rhetorical difference between data warehouse and data lake. We know we aren’t supposed to just walk into a warehouse, but a lake is open for us to swim in. How open is the data lake to groups at GE?
Today, we see three communities and usage patterns. The first is your data science community. They’ll have a very specific item or outcome they are trying to solve, or a question they are trying to answer. Why is part X failing in this model of asset for this particular part of the world? That’s where a data science group would leverage a data lake to look for the answer and be able to stitch data together in order to get to it. When an insight is found and an outcome is identified, how do we then codify it and add a new alert to detect when conditions match the ones that led us to that relationship?
Your second usage pattern would be your software engineering community, who will program something based on the data coming from the lake. They’ll take a model that was developed by data science and operationalize it into an analytic app.
The third, which we’re finding more and more to be the case, is your traditional business intelligence community, which is using it as a way to connect to the information in the data lake in order to unlock and answer questions that are more traditional in nature.
If you look at aviation, we started by trying to collect all the engine performance data coming off an aircraft engine in order to improve asset performance and other attributes. Within weeks of bringing that data into the lake, the first question that we were asked was, “Hey, could you bring in our ERP information so we can connect engine performance with shop visits and parts utilization?” And once you bring in that sort of data set, the sourcing group and others come in and say, “Hey, now that they’ve got all that parts information in there, can I start to model and do direct material deflation?”
What we’re starting to work towards is creating a semantic modeling layer on top of it, where eventually this turns into a search bar where I can ask a question in natural language. That’s when this becomes more accessible to a broader community of folks who are not as technically savvy and don’t have to use all these sophisticated tools in order to get value and answer questions coming from the data lake.
These will be people bringing in their Excel spreadsheets and laying them down on the sand?
(Laughs.) We’ll see. Excel serves a purpose. Over time, the belief is this data lake will largely disrupt many of the ways we manage data and do reporting in our companies. But the near term focus has been this place where you have this OT [operational technology] and IT [information technology] information blending together to unlock new outcomes that we haven’t been able to see before.
You’ve mentioned scale multiple times. What is the scale we’re talking about?
In the aviation example, I’d say 70% of all commercial aircraft have GE technology on them, [specifically] our engines. I think two of every three flights take off with GE technology on board. That’s about 25,000 flights a day. Each engine can produce, let’s say, half a terabyte of information per flight. So once you start adding that all up, the engine performance data coming off aircraft engines alone dwarfs the data sets we’ve had to handle before. Multiply that by all the different asset types that are out there, in addition to all of the other information we’ve been asked to manage today.
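Working through the arithmetic Campisi sketches (his half-terabyte-per-flight number is explicitly a "let's say" estimate, so the total is illustrative only):

```python
# Back-of-envelope version of the figures cited in the interview.
# 0.5 TB per flight is the speaker's rough estimate, not a measurement.
flights_per_day = 25_000
tb_per_flight = 0.5            # full-flight engine data, approximate

tb_per_day = flights_per_day * tb_per_flight
pb_per_day = tb_per_day / 1_000

print(f"{tb_per_day:,.0f} TB/day, i.e. about {pb_per_day:.1f} PB/day")
```

Even at that rough estimate, engine data alone would run to petabytes per day, which is why he calls it a different scale problem from traditional warehousing.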
Where do you find people to “dig” the data lake?
In between the data science group and your data management IT group, there’s data engineering. That is the bridge between raw technical computing and storing and IT infrastructure, and your data scientists, who are spending time in R and other languages to build models. Data engineering is a discipline that sits in between the two, makes data more accessible and provides the tools a data scientist would want to have. It allows the data scientist to focus more on developing the model, developing the insight, not on how to stitch the information or stitch the toolset to make it productive.
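A toy example of the bridging work he describes (all field names and source formats here are invented): data engineering maps each source's raw field names and types onto one canonical schema, so the data scientist loads a tidy table and never touches ingestion details.

```python
# Sketch of the data-engineering "bridge": raw machine records arrive
# in whatever shape each source emits; normalize() produces the one
# canonical schema the data scientist works against. Hypothetical data.
RAW = [
    {"eng": "E1", "ts": "2014-03-01T10:00Z", "egt": "612"},
    {"engine_id": "E2", "timestamp": "2014-03-01T10:05Z", "egt_c": 598},
]

# Map each source's field names onto the canonical names.
ALIASES = {
    "engine_id": ("eng", "engine_id"),
    "timestamp": ("ts", "timestamp"),
    "egt_c": ("egt", "egt_c"),
}

def normalize(record):
    out = {}
    for canon, candidates in ALIASES.items():
        for name in candidates:
            if name in record:
                out[canon] = record[name]
                break
    out["egt_c"] = float(out["egt_c"])  # coerce types once, here
    return out

tidy = [normalize(r) for r in RAW]
```

The division of labor is the point: stitching sources and tooling lives in this layer, so the scientist's time goes into the model rather than into data wrangling.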
How do you find people? Are they in IT already? Are they business analysts?
Technologies like Hadoop, Hive, Sqoop, and so on are newer, so you don’t find as many engineers who have experience with them. You have a pool of people out here on the West Coast, at some of the consumer Internet giants or cloud-native companies. Then you have your math-based and stats-based folks, who can be really effective and productive if they already have a software engineering orientation.
The third space that we’re starting to focus on is how effective we can be at converting our traditional data analysts and IT professionals into this domain — people who can learn this new paradigm.
These data engineers — I think the title I’ve heard is “data custodians” — how do you establish that role, and where does it fit into the maintenance of the data lake?
You have folks who architect and construct the architectures and the technologies for the data lake itself. That would be more your data management profiles. The data science community includes folks who leverage these data sets to do the work they’re doing. In between that, you have a combination of skills — folks who need to be very proficient in the tooling and the technology for how data management is evolving, but not necessarily building; they don’t have the domain context to build the models themselves.
That’s where this group, this data custodian/data engineer profile, really takes shape. It’s a hybrid of the software engineering personality. They’re not building the models directly, and they don’t have the domain context to build them, but they can be a horizontal capability that allows you to do data science at scale. Then you’re going to want domain-specific experts who actually do your blending of physics and statistics at the data science level.
Did you find that you need data custodians because otherwise your data scientists are spending a lot of time getting the data into shape to be useful for the models?
Yes. You go out and hunt for these coveted data scientists and bring them in, only to frustrate them. They spend 80% of their time trying to organize the information. In one of our first use cases, before our current approach of a data lake plus data engineering, we went through 10 months of organizing data, figuring out where it existed, and breaking down silos before someone could actually go after the outcome. That’s not effective. Let them focus on what they’re passionate about. There’s a different breed of folks who are passionate about how you scale this stuff and make it accessible horizontally. It’s a hand-in-glove relationship between the two.
Which is the hardest of those jobs to fill?
I wouldn’t claim one is harder than the other. The data management space is super hot. Anybody who can spell “Hadoop” is heavily recruited. It’s hard to find people who’ve really done it at the scale we’re talking about and looking to do it, so even in the data management space, it’s hard to find talent at the levels we’re constantly searching for.
Data engineering is a newer term, so finding folks who’ve had that title before may not be as likely, but I think we’ve had a fair amount of success finding people who are able to do it. They sit in the middle. They don’t have to be super conversant in how the latest and greatest technology is working at the deepest level. They just need to be proficient enough to understand how it works and how to use it. For data science, we’re fortunate in GE that for decades we have had people who’ve been building advanced models with lots of domain expertise. This is now giving them the contemporary tools and technology to allow them to revitalize those models and make them be more consumable across a wider variety of use cases.
Do you do special things to foster communications between data scientists and IT and the business side?
The industrial Internet is an exciting space. We tend to have people coming together and coalescing around the topic. This is such a new space, and such a new way of looking at things and bringing new data sets together, that it’s drawing people together. The silos of the past aren’t holding us up. Most of the time, we have an outcome in mind, a question we’re trying to answer: either a problem we’re trying to address or an opportunity we’re trying to explore. That’s usually a great forcing function to shape what the goal is and what we need to do in order to achieve it.
So what is the thing that frustrates you the most about this operationalizing process?
The space is so new, technically, and changing so quickly, we’re having to solve for things in a new way. When you think about how you handle data governance and data sovereignty and how policies are evolving around how data can be shared across borders, those are the things that we’re having to spend a tremendous amount of time thinking about and trying to solve for. Some of these technologies are so new that we’re having to solve them for the first time. Our ability to scale and go fast will be dictated by our ability to solve those things consistently.
If I’m not GE, what lesson can I draw from the way you’ve gone about organizing this? Are other parts of GE doing this differently than the software center?
We’re doing this in a “One GE” approach. Predix [GE’s industrial Internet platform] is how we’re staying connected and helping customers get better outcomes. Predix is how we do that at scale across our divisions. We’ve seen consumer Internet giants, retail and to some degree financial services become the earliest adopters of these ways of thinking. We’ve been trying to provoke and encourage industrial companies to move at the rate of the rest of the market. This paradigm shift toward the industrial Internet is taking shape.