Business Quandary? Use a Competition to Crowdsource Best Answers

Top data scientists often share three characteristics: they are creative, they are curious and they are competitive. Anthony Goldbloom, CEO of Kaggle, a company that hosts data prediction competitions, has figured out how to tap all three of these characteristics to help companies crowdsource their analytics problems.


Topics

Competing With Data & Analytics

How does data inform business processes, offerings, and engagement with customers? This research looks at trends in the use of analytics, the evolution of analytics strategy, optimal team composition, and new opportunities for data-driven innovation.

During a 2008 internship at the Economist, economist-turned-journalist Anthony Goldbloom made an important discovery: many organizations that want to use predictive analytics don’t have the skills to do it well, if at all. In response, he came up with an intriguing business idea: create a company that lets anyone, anywhere, compete to solve other businesses’ analytics problems.

Today, hundreds of organizations, both public and private, have submitted their datasets and business problems to Goldbloom’s company, Kaggle Inc. Sixty-five thousand data scientists from around the globe have signed up with Kaggle to compete to develop the best algorithm for a given problem, from predicting click-through rates on ads to predicting who will be admitted to a hospital within the next year (a current competition with a $3 million purse). Ongoing results are displayed in real time on a virtual leaderboard.

The leading data scientists are a motley crew. They include a French-born actuary, a computational neuroscientist from Harvard, an Oxford physicist and a Russian cybernetics professor.

In a conversation with MIT Sloan Management Review data & analytics contributing editor Renee Boucher Ferguson, Kaggle founder and CEO Goldbloom talks about the motivation behind Kaggle and the outcomes of its data science competitions.

So where did the idea come from to make data science into a sports competition?

I used to be an econometric modeler at the Australian Treasury, and after that at the Reserve Bank of Australia. Then in early 2008, I won an essay competition. The prize was a three-month internship at the Economist magazine, writing for the finance and economics section.

I pitched a piece on predictive analytics and it turned out to be a fabulous way to do market research. Pretty senior people were saying, “yes, predictive modeling, I know we need to be doing more of this.” And my frustration was that I was talking through some really interesting problems with reasonably senior people from companies, and always in the back of my mind was, “wow, I do a bit of programming. I can understand business problems and also like playing with data and statistics. I could do a really good job on these problems.” But I realized if in a different context I applied for a job to tackle some of these issues, I wouldn’t get one. They’d look at me and say, “you’re an economist, what do you know about predictive modeling?”

Anthony Goldbloom, CEO of Kaggle

That was the emotional rationale behind wanting to build a meritocratic labor market for data science and predictive models. Because I figured there would be more people like me who thought they could solve some of these problems, but they weren’t being given the chance.

There was also an intellectual, or commercial, reason why I thought it was a good idea. The quality of the data scientist is crucial in determining the quality of a data science solution. Data science is extremely high leverage, so a data set and an algorithm in the hands of a great data scientist can generate a $100 million return on investment, while the same data in the hands of somebody inexperienced can be extremely costly.

How do Kaggle’s big data competitions work?

We basically back-test people’s algorithms on what we call a holdout data set.

So let’s say you’re a bank and you want to predict who’s going to default on a loan, and you’ve got 20 years’ worth of data on how loan applicants have behaved: who has paid back their loans and who hasn’t. What we do is take that historical data, split it in two, and give data scientists in our competition access to half of that data set with the answers, like “John Smith, age 62, with an income of X, defaulted.” For the other half of the customers, we give information on their characteristics, like “Jack Smith, age 61, with income X,” but we withhold the information on whether or not he defaulted.

What participants do is train their algorithms on the first set, and then apply those algorithms to the second set, and we score them against the actual answers to see who’s getting the lowest error rate or the highest number of loan applicants correct. That’s what’s very neat about data science and predictive modeling algorithms. You can actually objectively score who is good and who isn’t.
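To make the mechanics concrete, here is a minimal sketch of that holdout evaluation in Python. It assumes scikit-learn and uses a small synthetic dataset standing in for the bank’s loan history; the column choices and numbers are purely illustrative, not Kaggle’s actual setup.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical loan data: age, income (in $000s), and whether the applicant defaulted.
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(18, 75, 1000),   # age
                     rng.normal(50, 15, 1000)])     # income in $000s
y = (rng.random(1000) < 0.2).astype(int)            # 1 = defaulted

# Split the history in two: a training half (answers visible to competitors)
# and a holdout half (answers withheld for scoring).
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.5, random_state=0)

# A competitor trains a model on the first half...
model = LogisticRegression().fit(X_train, y_train)

# ...and is scored on the withheld half against the true answers.
predictions = model.predict(X_hold)
print("holdout accuracy:", accuracy_score(y_hold, predictions))
```

In a real competition the organizer, not the competitor, holds the second half and its true labels, and the resulting score is what appears on the leaderboard.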

Who are the data scientists that participate in your public competitions and how can you tell who does good work and who doesn’t?

We have 65,000 data scientists signed up to Kaggle, and just like with golf tournaments, we have them all ranked from 1 to 65,000. And the top data scientists come from a whole lot of backgrounds. Number one is a French-born actuary named Xavier Conort who runs his own consulting firm. Number two is an Oxford PhD in physics called Jason Tigg. He runs his own hedge fund. As an intellectual challenge, he competes in Kaggle competitions at night. Number three is a professor of cybernetics from Moscow State University called Alexander D’yakonov. Number four, Sergey Yurgenson, is a professor of computational neuroscience at Harvard Medical School.

The backgrounds of these top data scientists are all over the map, but they have a handful of characteristics in common. They are all creative and curious. Building a really good data science algorithm is actually much less about complex mathematics. You have to be really good at math, but the complexity of the mathematics doesn’t correlate very well with how good a data science algorithm is. It’s more about the kind of features you’re able to find in the data.

That’s really interesting. Can you give an example?

Sure. We ran a competition for a big chain of used car dealers. They sell used cars all over the country. They’ll go to auctions of secondhand cars, purchase cars and sell them throughout the dealer network. Because they offer a warranty on any car they sell — and they don’t want to sell lemons — they wanted to predict which cars were good buys and which cars were likely to be lemons.

It turned out that the auction catalogues have all these things, like the make of the car, the model of the car, the number of miles, how many owners. Do you want to hazard a guess as to what the most predictive feature was?

Hmmm, I don’t have a clue — the color of the car?

Funnily enough, you’re 100 percent right! It was the color of the car. It turns out that if you group car colors into usual and unusual, the unusual-color cars were much more likely to be reliable than the usual-color cars. The intuition there is that if you were the first buyer of an orange car (orange was the most reliable color, incidentally), you probably were an enthusiast, because that’s kind of an odd color for a car and you would have had to go out of your way to find one. Therefore, you probably looked after it better, so by the time it goes up for sale in a secondhand auction, it’s in much better condition than a standard-color car that somebody didn’t particularly care that much about.

In order to find that in the data you’ve got to use some mathematics to discover that correlation, but really it’s all about having clever creative ideas and then testing those ideas against the data and seeing whether they hold true or not. That’s one of the characteristics that make a great data scientist.
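As an illustration of the kind of feature work Goldbloom is describing, here is a rough sketch in Python using pandas and made-up auction records rather than the dealer’s real data: group colors into usual and unusual, then compare outcomes across the two groups.

```python
import pandas as pd

# Hypothetical auction records: car color and whether the car turned out to be a lemon.
cars = pd.DataFrame({
    "color": ["white", "silver", "orange", "black", "green", "white", "orange", "silver"],
    "lemon": [1,        0,        0,        1,       0,       0,       0,        1],
})

# Engineer the feature: group colors into "usual" and "unusual".
usual_colors = {"white", "silver", "black", "gray", "blue"}
cars["unusual_color"] = ~cars["color"].isin(usual_colors)

# Test the idea against the data: compare lemon rates between the two groups.
print(cars.groupby("unusual_color")["lemon"].mean())
```

The mathematics here is nothing more than a group-by average; the value is in the idea of the feature itself.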

What are some other characteristics you’ve found that make a great data scientist?

You have to use good coding practices. If you set up a framework at the beginning of a predictive modeling problem that allows you to iterate very quickly and try lots of ideas, you’ll be able to try more ideas than somebody else you’re competing against, and so you’ll be able to discover more in the data.
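One way to read “a framework that lets you iterate quickly” in code, as a rough sketch only: keep each feature idea as a small function and score every idea through the same loop, so trying a new idea costs one extra line rather than a rewrite. The dataset and feature ideas below are hypothetical, and scikit-learn’s cross-validation stands in for whatever scoring a given competition uses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical dataset with a mild nonlinearity hidden in it.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Candidate feature ideas, each expressed as a small function over the raw data.
ideas = {
    "raw":           lambda X: X,
    "plus_squares":  lambda X: np.hstack([X, X ** 2]),
    "plus_products": lambda X: np.hstack([X, X[:, [0]] * X[:, [1]]]),
}

# The "framework": one loop that scores every idea the same way.
for name, make_features in ideas.items():
    score = cross_val_score(LogisticRegression(), make_features(X), y, cv=5).mean()
    print(f"{name:14s} cv accuracy = {score:.3f}")
```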

You also have to be tenacious. Typically in a competition, if you’re on top of the leaderboard and then you log in the next day and somebody’s ahead of you, you have to be determined enough to keep going until you get ahead of them. And then they leapfrog you and you have to keep working till you leapfrog them.

When a company hosts a competition, is it looking for just those top five data scientists? Does anybody ever want to work with the data scientists who rank lower?

We used to think that one of the reasons that we got really great results is because you get a lot of these offbeat ideas. We ran a problem for NASA, and somebody in glaciology made the first breakthrough. Now, we would not typically turn to a glaciologist to solve a problem for NASA. What we found, perhaps a little bit unromantically, is that the same people keep winning or performing well again and again and again, so there is this kind of core data science skill that allows somebody to do well, irrespective of the precise nature of the problem.

Kaggle’s business model is basically that once we discover who these top talents are, we put them to work on commercial problems. We do this in one of two ways. The first is through a private competition. Say you’re a bank: you want to build an algorithm, but you can’t release PII, data that identifies individual people. So we can host a private competition where we’ll invite five to ten of our very best data scientists, and they compete behind the scenes on your problem, under NDA. That way, your sensitive data is never exposed.

The other thing we do is for problems that are less well specified, we have a partnership with EMC Greenplum whereby any Greenplum customer can come into Kaggle and rent one of our top data scientists for a month, for instance, to build them a predictive modeling algorithm.

How often in the competition is IP [intellectual property] put into production?

It really depends. Typically, when we work with startups, for example, or early-stage companies, the algorithms get put into production very quickly. Often when we’re working in regulated industries, it takes a lot longer. For example, with Allstate, the first model we built for them was a claims prediction model. Any new claims model has to go through regulators, so there’s a very long lead time between an algorithm being developed and it being put into production.

The way a lot of companies are set up at the moment is that you’ll have one group that’s responsible for data and for handling the data warehouse, then an analysis group that’s responsible for building these algorithms, and then a third group, an IT group, that’s responsible for putting those algorithms into production. It’s sometimes challenging for the IT group to put sophisticated algorithms into production, and so we have a service where we host winning algorithms on behalf of companies.

When should an organization think about hosting a competition? Does it need to have big datasets or big cash? And what sorts of problems are best suited for competition?

I would say there are two situations where a company should be looking at a competition as opposed to a single data scientist. The first one is when the problem is very valuable.

What we find with a competition is that, because we’re showing people in real time how they’re performing against others on a leaderboard, what ends up happening is you might make an entry that you’re really happy with and, if you were working inside a company or by yourself in isolation, you would have stopped there. But then somebody else gets ahead of you on the leaderboard and you keep working until you get ahead of them. And they keep working until they get ahead of you.

This leapfrogging continues until the point where you reach what we call the limit of what’s possible given the information and the dataset. Each dataset only contains so much information and, when you put people together in a competition and you give them feedback on a live leaderboard, they’ll drive each other to the point where they’ve literally extracted everything there is to get out of the dataset. Having a live leaderboard in front of people means you get all the value there is to get out of a dataset.

If you’re building an extremely valuable algorithm, you want to squeeze all the juice out of it. If there could be half a billion dollars’ worth of ROI in your dataset, you want to make sure you get every little bit of it.

The second scenario for hosting a competition is when you have a very challenging problem. You might be dealing with unstructured data, such as text. Or you might be dealing with high-dimensional data, or sparse data with lots of missing values, or social graph data. In situations where the dataset is very difficult, we’ve noticed that the difference between what a company can do on its own and what it will get out of our data scientists is much bigger.

At the bottom end, competitions will give you a ten percent lift over what you’re already doing. At the top end, it’ll be on the order of three to four hundred percent. We’ve noticed that, generally, the more complex and challenging the data, the greater the lift.


Reprint #:

54222
