Suppose you work on Google’s advertising team and need to decide whether ads should have a blue background or a yellow background. You think that yellow would attract the most clicks; your colleague thinks that blue is better. How do you make the decision?
In Google’s early days, the two of you might have debated the issue until someone caved or you both agreed to kick the decision up to the boss. But ultimately, it dawned on leaders throughout Google that many of these debates and decisions were unnecessary.
“We don’t want high-level executives discussing whether a blue background or a yellow background will lead to more ad clicks,” Hal Varian, Google’s chief economist, told us. “Why debate this point, since we can simply run an experiment to find out?”
Varian worked with the team that developed Google’s systematic approach to experimentation. The company now runs experiments at an extraordinary scale — more than 10,000 per year. The results of these experiments inform managerial decisions in a variety of contexts, ranging from advertising sales to search engine parameters.
More broadly, an experimental mindset has permeated much of the tech sector and is spreading beyond it. These days, most major tech companies, such as Amazon, Facebook, Uber, and Yelp, wouldn’t make an important change to their platforms without running experiments to understand how the change might influence user behavior. Some traditional businesses, such as Campbell Soup Co., have been dipping their toes into experiments for decades.1 And many more are ramping up their efforts in experimentation as they undergo digital transformations. In a dramatic departure from its historic role as an esoteric tool for academic research, the randomized controlled experiment has gone mainstream. Startups, international conglomerates, and government agencies alike have a new tool to test ideas and understand the impact of the products and services they are providing.
Experiments complement intuition and guesswork with evidence-based insight. We’ve seen them lead to large returns for organizations — such as when an experiment alerted eBay that it was wasting millions of advertising dollars per year. (See “The Returns on Experimentation.”) The rise of experiments is generally a good thing: good for companies, which can use results to improve their people and operating practices and customer offerings; good for governments, which can use the results to better design and deliver essential services; and, when carefully designed and monitored, good for customers and citizens as well.
At the same time, it takes judgment to effectively incorporate experimental results into decisions and to determine when and how to experiment. For example, when behavioral economist Katy Milkman and psychologist Angela Duckworth, both at the University of Pennsylvania, designed experiments aimed at identifying ways to increase attendance rates for the gym chain 24 Hour Fitness, they set out to find strategies leading to long-term gym attendance by tracking behaviors such as number of gym visits per week and length of visits over months. Their approach highlights the value of central design choices in experiments: Had Milkman and Duckworth tracked gym visits only over a few weeks (it’s common for experiments to last only a few days), they would have reached very different conclusions — and 24 Hour Fitness might have implemented changes that were useful in the short run only to see effects peter out in a few months.
What’s more, experiments come with a certain amount of risk. They raise transparency concerns — especially given that the world’s stock of experiments and results is mostly sitting in corporate databases, hidden from the public and being used for unknown purposes. They can feel invasive and Big Brother-ish and even elicit outrage, such as when Facebook ran an experiment to see whether sad or negative posts affected the emotions of its users.2 Moreover, to the extent that companies rely on customer data for experiments, this concern also becomes entwined with and exacerbated by data privacy and security concerns.
Experiments also can create win-lose situations. StubHub, for instance, ran an experiment that led it to tack on ticket fees at the last minute, just before a purchase, rather than making them transparent during the shopping process. The research provided important insight into the decisions companies make around so-called fee-shrouding and the impact this practice has on customer outcomes. The results showed that the short-term value of shrouding fees was clear: More customers bought tickets. But the long-term value to the company was dubious, because fewer customers returned over the next year.3
We are in the early days of business experiments. Designing and interpreting them can be challenging tasks, requiring a combination of strong technical skills, analytical skills, and managerial judgment. As experiments become commonplace in companies, managers will increasingly need frameworks for running them effectively and for going from results to decisions in a manner that is productive and valuable for the business and customers alike. In this article, we will discuss four lessons that we have drawn from our interviews with managers in dozens of major companies that use experiments and from many other experts in experimentation, our own insights gained designing and teaching courses on experimentation and managerial decision-making at the graduate level, and our experiences developing and testing interventions based in behavioral economics for corporate and government clients.
Systematically embed experiments in your company’s decision-making processes.
It’s important for managers to be engaged in the design and interpretation of experiments, not only so the experiments can benefit from their insights and experience but also so managers develop a “testing” sensibility and become more inclined to check their assumptions and seek evidence in their decisions day-to-day. Moreover, companies should establish an infrastructure that allows all of that to happen across the organization so that experimentation is not siloed.
Consider digital travel platform Booking.com’s approach, which has caught the attention of managers and business scholars alike, in part because of the way it brings experiments closer to managerial decisions. At Booking, experiments are run in all parts of the business — both customer-facing platform experiments and partner-facing ones. Experiments are run by about 1,500 employees — many of whom have management or engineering backgrounds but had limited prior knowledge of statistics and experimentation. Any Booking product manager can easily run an experiment to test new features before rolling them out to customers. Indeed, approximately 80% of Booking’s product development teams are actively running experiments.4
Booking implemented a variety of managerial changes, such as having in-house data scientists (including director of experimentation Lukas Vermeer, who gave us an overview of the current process) run training sessions and providing bespoke analytical support for individual experiments. These changes have made it easier for teams to design and launch effective experiments, bolstering the company’s decision-making prowess.
Once an experiment is run, the results are logged into a centralized repository that allows people throughout the company to peruse them along with results from prior experiments. Everyone can see what features were tested, regardless of whether they were ultimately implemented. A standard set of metrics is reported by default, but teams can also customize the metrics for the particular experiments they are running.
Booking’s approach allows companies to infuse team decisions with experimental evidence, and teams retain control over how they move from experimental results to decisions. Teams also have some flexibility regarding which outcomes they analyze and how much weight to accord each. An integrated system also distributes a basic understanding of experimental methods and concepts — including hypothesis testing, practical significance (that is, the magnitude of effects), statistical significance, and the strengths and limitations of different metrics (and of experiments more broadly) — across the workforce.
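To make the distinction between statistical and practical significance concrete, here is a minimal sketch of a two-proportion z-test in Python, applied to the blue-versus-yellow ad question from the opening. The function name `compare_click_rates` and all click and view counts are hypothetical, chosen only for illustration; nothing here describes how Booking or Google actually analyze their experiments.

```python
from math import sqrt
from statistics import NormalDist


def compare_click_rates(clicks_a, views_a, clicks_b, views_b):
    """Two-sided, two-proportion z-test comparing variant B against variant A."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled click rate under the null hypothesis that both variants perform equally.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "lift": rate_b - rate_a,  # practical significance: size of the effect
        "z": z,
        "p_value": p_value,       # statistical significance: evidence against "no difference"
    }


# Hypothetical data: blue background (A) versus yellow background (B).
result = compare_click_rates(50_000, 1_000_000, 50_800, 1_000_000)
print(result)
```

With a million views per variant, even a lift of less than a tenth of a percentage point can clear a conventional 5% significance threshold. Whether a difference that small justifies a change is a question of practical significance, and that remains a managerial judgment.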
Push past whether things work in your experiments to explore why they work.
Consider Alibaba. To boost sales, the company was allowing sellers to offer deep discounts to customers who had left certain “promoted products” in their shopping carts for over 24 hours. But did the coupons actually change customer behavior? That is the question that company leaders set out to answer.
In early 2016, a team of management researchers partnered with Alibaba to conduct a randomized field experiment on two of the company’s platforms, aimed at measuring the precise impact of the shopping cart promotion program.5 The researchers randomly selected 1 million Alibaba customers who had been targeted by at least one promotion on the platform between March 12, 2016, and April 11, 2016, and assigned them to either the control condition or the treatment condition. Customers in the treatment condition were shown coupons for certain promoted products that they had left in their shopping carts for more than a day. The control group, meanwhile, received no coupons.
Alibaba and the research team focused on two sets of outcomes: (1) whether shoppers were more likely to purchase the items in their carts when they received a coupon and (2) whether those shoppers went on to buy more stuff overall from Alibaba than shoppers who did not receive the coupons.
The results of the experiment suggested that the shopping cart program was a bit of a wash for sellers: good for some, bad for others. But they also revealed that the program didn’t seem to increase shopping on Alibaba overall. Based on this, the company decided not to expand its shopping cart coupon program.
But note the question being addressed here. Alibaba’s experiment essentially asked, “What is the impact of the program as it exists?” as opposed to, “How should this program be designed?” This is an important distinction, because the coupons may have generated only minor effects simply because they were poorly designed, not because such discounts are inherently flawed. Although this experiment was valuable in helping to figure out that the existing program was not very effective, we’re left to wonder whether an alternative version of the program might have worked better.
Specifically, there are several elements of the shopping cart program that Alibaba might have considered testing. First, the promotion suffered from relatively low exposure: one-third of users did not revisit their shopping carts before the coupon expired, so they never knew the discount was being offered. Second, the program allowed sellers to choose the size of the coupon discount. Because discount sizes were not randomly assigned, the experimenters could not isolate the causal effect of coupon value on purchasing behavior. Third, the impact of the coupon program on user satisfaction and long-term engagement with Alibaba remains unclear and could have been explored.
Alibaba’s experiment highlights both the promise and the challenges of using experiments to guide managerial decisions. The company replaced intuition with data, and the researchers ran a useful experiment. But, like any single experiment, this is only one piece of the puzzle. Ultimately, running successful experiments in organizations is as much about asking the right questions — and asking enough questions — as it is about answering them. Managers should focus not only on testing an offering as is, but on developing frameworks that will help them make decisions about other potential offerings. This means teasing out boundary conditions and mechanisms — in this case, not just whether one particular form of pricing “works,” but also why and how it works, and how it might work better.
Managerial judgment is required to determine which questions experiments can answer. Nobel laureate Al Roth, who did pioneering work in the field of experimental economics, told us that in his opening chapter of The Handbook of Experimental Economics, he purposely referred to “series of experiments” instead of individual experiments.6 Roth’s rationale was simple: He didn’t want his readers to think that a single experiment could answer all their questions. Likewise, managers need to think about how a broader evidence base can help to inform their decisions. Moreover, they need to consider experiments with an understanding of their limitations and in the context of other forms of data-based evidence.
Test for broad-based and long-term outcomes.
Too often, managers focus on overly narrow or short-term outcomes in their experiments, because those are top of mind and easily measured. But they should also take a long view, measuring effects over several months or more. That means they must be explicit about the long-term outcomes that matter most to the business — and then incorporate that understanding into the design and interpretation of experiments.
When Milkman and Duckworth, directors of the University of Pennsylvania’s Behavior Change for Good (BCFG) initiative, partnered with 24 Hour Fitness, their goal was to discover interventions that would lead people to exercise more. This appealed to 24 Hour Fitness, which has 3.7 million members across more than 420 locations in 13 states. After all, people exercising more was likely to mean more revenue and profit.
In May 2017, Milkman and Duckworth convened the first meeting of BCFG team scientists. They introduced the scientists to StepUp, a web-based platform created to host their exercise interventions. Using StepUp as an experimentation platform, they could test and track specific interventions, adding content and questions to the registration process, and customizing emails and texts that participants would receive.
The preliminary results revealed that many of the interventions, such as signing pledges and making exercise plans, had some immediate benefits, but for some of the interventions, the gains were less impressive when looked at over multiple-month spans. In the words of Duckworth, “Behavior changes are really &#$@ing hard.”7
This is an important lesson and a chronic problem for experimentalists. Had BCFG tracked only short-run effects, the results would have looked very different. And the leaders of businesses such as 24 Hour Fitness need to think about both short- and long-term results in their managerial decisions. Almost always, it is easier (and, by definition, faster) to capture short-term outcomes, but those outcomes do not always endure. Worse, optimizing for them can undercut more valuable long-term outcomes.
In 2018, Uber was also thinking about measurement challenges when it had to decide whether to launch Uber Express Pool, a carpooling service. With Express Pool, riders would wait a little longer and walk a little farther to get to their destinations. While waiting, a pool of riders (to be picked up at designated “express” meeting spots) would be matched by an algorithm. The result would be a carpool experience that might take longer but should save riders money.
Duncan Gilchrist, a Ph.D. economist and manager on Uber’s rider-pricing and marketplace-experimentation data science team, was tasked with understanding the value of Express Pool. Gilchrist is quick to point out the challenges involved in this. Say, for example, that Uber were just to roll out the new product all at once, without first testing it. Even if the product became popular, it would be hard to tell whether it was cannibalizing other Uber products (Uber X, Uber Pool, and so on). Now suppose that Uber were to offer half of Bostonian riders Express Pool. The other half of Bostonians would be a poor control group for this experiment, since their experiences (including wait times) are affected by the demand created by other riders (including the treatment group). This type of spillover effect is common in experimentation.
To avoid spillover effects and understand how Uber Express Pool would affect its entire portfolio, Uber selected a set of large markets and rolled out the product in six cities that were randomly selected from this set. Leveraging the experimental tool kit — including a method that allowed Uber to use a weighted combination of other cities to form a more suitable “synthetic” control group — Gilchrist and his team were able to tease out the ways in which the rollout was influencing Uber usage.
Unsurprisingly, Express Pool created new kinds of trip matches. But the experiment also accounted for the effect of Express Pool on existing Uber products and made clear that launching it would make good business sense. As a result, Uber was able to confidently introduce Express Pool to many of its major markets. This confidence, and the finding that inspired it, would not have been possible without broad-based experimentation.
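The “synthetic” control idea mentioned above, a weighted combination of other cities standing in for the launch city, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Uber’s implementation: the trip counts are hypothetical, the helper names (`simplex_grid`, `blend`, `fit_weights`) are invented for this example, and a coarse grid search stands in for the constrained optimization that a real synthetic-control analysis would typically use.

```python
from typing import Iterator, List, Sequence, Tuple


def simplex_grid(k: int, steps: int) -> Iterator[Tuple[int, ...]]:
    """Yield every length-k tuple of non-negative integers summing to `steps`."""
    if k == 1:
        yield (steps,)
        return
    for first in range(steps + 1):
        for rest in simplex_grid(k - 1, steps - first):
            yield (first,) + rest


def blend(weights: Sequence[float], donors: Sequence[Sequence[float]]) -> List[float]:
    """Weighted combination of the donor-city series, period by period."""
    periods = len(donors[0])
    return [sum(w * d[t] for w, d in zip(weights, donors)) for t in range(periods)]


def fit_weights(treated_pre, donors_pre, steps: int = 20) -> List[float]:
    """Grid-search non-negative weights (summing to 1) whose blend best matches
    the launch city's pre-launch series in squared error."""
    best_w, best_err = [], float("inf")
    for combo in simplex_grid(len(donors_pre), steps):
        w = [c / steps for c in combo]
        synthetic = blend(w, donors_pre)
        err = sum((s - y) ** 2 for s, y in zip(synthetic, treated_pre))
        if err < best_err:
            best_w, best_err = w, err
    return best_w


# Hypothetical weekly trip counts (thousands) before the launch.
donors_pre = [
    [100, 110, 120, 130],  # donor city A (no launch)
    [200, 190, 210, 220],  # donor city B (no launch)
    [50, 80, 60, 90],      # donor city C (no launch)
]
treated_pre = [150, 150, 165, 175]  # launch city

weights = fit_weights(treated_pre, donors_pre)

# Hypothetical counts after the launch.
donors_post = [[140, 150], [230, 240], [70, 100]]
treated_post = [205, 215]

# The gap between the launch city and its synthetic twin estimates the launch effect.
synthetic_post = blend(weights, donors_post)
estimated_effect = [y - s for y, s in zip(treated_post, synthetic_post)]
print(weights, estimated_effect)
```

The design choice worth noting is that the weights are fitted only on pre-launch data. If the blend tracks the launch city well before the rollout, a gap that opens afterward is more plausibly attributable to the new product than to differences between cities.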
Aim for transparency.
Companies run tens of thousands of experiments per year — linking their results to detailed accounts of user behavior and using them to shape the way their customers engage with their products and services. Yet, little is publicly known about these experiments, in part because of concerns that customers will feel manipulated and become angry.
Consider the large-scale experiment that Facebook ran several years ago aimed at testing whether and how the emotional tone of posts affected the mood of its users. To get at this, the company manipulated the feeds of a randomly selected set of users, who were shown fewer positive posts than others. The researchers published the results in a paper that concluded, “Emotions expressed by others on Facebook influence our own emotions, constituting experimental evidence for massive-scale contagion via social networks.”8
The mood-altering effects revealed in the experiment were small, but the results were eye-opening nonetheless. Facebook faced a considerable backlash from its users, who may not have been aware that they had given permission to be experimented on when they signed up on the platform. The experiment also garnered significant and mostly negative media attention.
In retrospect, much of the criticism of this experiment focused on Facebook making product design changes with the understanding that they might affect emotions. But people shouldn’t have been shocked by this; of course Facebook’s choices affect emotions. The more upsetting part, to many, was the fact that Facebook was quietly running experiments to understand how its users react to changes in their feeds. It felt like spying.
Perhaps unsurprisingly, Facebook didn’t react to the backlash by trying to make the platform a happier or more neutral place, which suggests that having a positive emotional impact wasn’t the company’s main goal. However, the company did react to the concerns about the experiment in other ways. In the immediate aftermath, it shut down some of its outward-facing research and became less forthcoming about the experiments it was running internally. To its credit, Facebook did set up an internal research ethics board of five employees who review a select set of its experiments (although its workings are not entirely transparent either).9
Facebook and other companies would benefit from increasing their transparency around experiments. Instead of hiding their experimentation (or burying explanations in user agreements that very few of us read), companies should establish a transparent process to clarify their engagement with the experimental method. Such efforts should include the following goals:
- Acknowledge, in broad terms, that the company frequently creates experiments to gauge people’s responses (and, yes, takes those responses to the bank).
- Explain the value of experiments to the company and its customers and other stakeholders.
- To the extent possible, encourage discussion of the company’s experimental aims and methods.
- When appropriate, be transparent about the experiments’ findings.
Experiments are valuable not only to businesses but also to their customers, who presumably don’t want low-quality services as a result of gut decisions that could easily have been improved through data. This means that customers should be open to experimentation as long as their privacy and other interests are respected, and companies should stop shrouding the process in secrecy.
Customers are smart: They know that companies tinker with products and services and assess customers’ reactions. But companies should be held to high standards for transparency and ethics in experimentation. In turn, if they are candid about their research, they will be able to defuse suspicion, enlist support, and perhaps even deepen customer engagement.
We are at the beginning of the experimental revolution. Many companies are running experiments, some at impressive scales. But they are still learning how to do them effectively. Being systematic about experiments and decision-making, deliberate about test design and the outcomes that are being sought, thoughtful about the long-term implications of outcomes, and transparent with stakeholders will create value for companies and customers alike.
1. J.O. Eastlack, Jr. and A.G. Rao, “Advertising Experiments at the Campbell Soup Company,” Marketing Science 8, no. 1 (February 1989): 57-71.
2. R. Meyer, “Everything We Know About Facebook’s Secret Mood Manipulation Experiment,” The Atlantic, June 28, 2014, www.theatlantic.com.
3. T. Blake, S. Moshary, K. Sweeney, et al., “Price Salience and Product Choice,” working paper 25186, National Bureau of Economic Research, Cambridge, Massachusetts, October 2018.
4. R. Kaufman, J. Pitchforth, and L. Vermeer, “Democratizing Online Controlled Experiments at Booking.com” (presentation at the MIT Conference on Digital Experimentation, Cambridge, Massachusetts, Oct. 27-28, 2017).
5. D.J. Zhang, H. Dai, L. Dong, et al., “How Do Price Promotions Affect Customer Behavior on Retailing Platforms? Evidence From a Large Randomized Experiment on Alibaba,” Production and Operations Management 27, no. 12 (December 2018): 2343-2345.
6. “The Handbook of Experimental Economics,” eds. J.H. Kagel and A.E. Roth (Princeton, New Jersey: Princeton University Press, 1995).
7. More information on Behavior Change for Good’s research and experiments is available at https://bcfg.wharton.upenn.edu.
8. A.D.I. Kramer, J.E. Guillory, and J.T. Hancock, “Experimental Evidence of Massive-Scale Emotional Contagion Through Social Networks,” Proceedings of the National Academy of Sciences 111, no. 24 (June 17, 2014): 8788-8790.
9. Z. Boka, “Facebook’s Research Ethics Board Needs to Stay Far Away From Facebook,” Wired, June 23, 2016, www.wired.com.
i. M. Ostrovsky and M. Schwarz, “Reserve Prices in Internet Advertising Auctions: A Field Experiment,” working paper 2054, Stanford University Graduate School of Business, Stanford, California, November 2016.
ii. T. Blake, C. Nosko, and S. Tadelis, “Consumer Heterogeneity and Paid Search Effectiveness: A Large-Scale Field Experiment,” Econometrica 83, no. 1 (January 2015): 155-174.
iii. N. Bloom, B. Eifert, A. Mahajan, et al., “Does Management Matter? Evidence From India,” working paper 16658, National Bureau of Economic Research, Cambridge, Massachusetts, January 2011.
iv. E.P. Bettinger, B.T. Long, P. Oreopoulos, et al., “The Role of Application Assistance and Information in College Decisions: Results From the H&R Block FAFSA Experiment,” Quarterly Journal of Economics 127, no. 3 (August 2012): 1205-1242.