How to Build Good AI Solutions When Data Is Scarce
Data-efficient AI techniques are emerging — and that means you don’t always need large volumes of labeled data to train AI systems based on neural networks.
Conventional wisdom holds that you need large volumes of labeled training data to unlock value from powerful AI models. For the consumer internet companies where many of today’s AI models originated, such data hasn’t been difficult to obtain. But for companies in other sectors, such as industrial companies, manufacturers, health care organizations, and educational institutions, curating labeled data in sufficient volume can be significantly more challenging.
There’s good news on this front, however. Over the past few years, AI practitioners and researchers have developed several techniques to significantly reduce the volume of labeled data needed to build accurate AI models. Using these approaches, it’s often possible to build a good AI model with a fraction of the labeled data that might otherwise be needed.
Assembling lots of labeled data is expensive and difficult. Imagine that you’re the CEO of a manufacturer of home-office furniture. Your customers post reviews of your products on e-commerce sites and social media, and some of these reviews provide valuable insights into potential product defects and improvements.
Since your business is growing rapidly, the volume of review content has grown to a level where it is impossible to manually read through each piece and glean its potential for product improvement. You decide that you need to build an AI model that can “read” each review and assess whether it reports a defect, suggests an improvement, or does neither. With such a model in place, you can route the relevant reviews to the right teams for follow-up.
The traditional AI approach to solving this problem consists of these steps: (1) assemble a data set of reviews; (2) design a process for labeling each review with “improvement,” “defect,” or “neither”; (3) recruit a team of labelers and train them to label data accurately; (4) label thousands (if not tens of thousands) of reviews; and (5) with this review-and-label data set, build a series of AI models in a trial-and-error process until you arrive at one that can classify reviews with acceptable accuracy.
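To make step 5 concrete, here is a minimal sketch of what training on a review-and-label data set could look like. It is an illustration under stated assumptions, not the approach the article prescribes: the nine reviews are invented, the names `LABELED_REVIEWS`, `train`, and `classify` are hypothetical, and a small naive Bayes word-count classifier stands in for the neural network models discussed here so that the example runs with only the Python standard library.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-labeled reviews standing in for the output of steps 1-4.
# A real project would need thousands of these, which is the costly part.
LABELED_REVIEWS = [
    ("the desk leg cracked after a week", "defect"),
    ("drawer handle broke off and the screws were missing", "defect"),
    ("chair wheel snapped and the armrest cracked", "defect"),
    ("wish the desk had a cable tray", "improvement"),
    ("would be better if the chair had adjustable armrests", "improvement"),
    ("please add a taller option for the standing desk", "improvement"),
    ("arrived on time and looks great", "neither"),
    ("love this chair very comfortable", "neither"),
    ("good value and easy to assemble", "neither"),
]


def tokenize(text):
    """Lowercase the review and split it on whitespace."""
    return text.lower().split()


def train(examples):
    """Count words per label; these counts are the whole 'model'."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocab = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the log prior, i.e. how common this label is.
        score = math.log(label_counts[label] / total_examples)
        # Laplace (add-one) smoothing so unseen words do not zero out a label.
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # ignore words never seen during training
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(LABELED_REVIEWS)
for review in ["the chair leg cracked", "wish it had a cable tray"]:
    print(review, "->", classify(review, model))
```

With so few examples, the classifier only recognizes wording it has already seen, which is why the traditional approach calls for labeling reviews by the thousands before a model becomes reliable.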
Steps 3 and 4 can be more difficult and expensive than they might appear on the surface. Unlike looking at an image and deciding whether it shows a dog or a cat, determining whether a review contains a product improvement idea can be quite difficult.