Know Your Data to Harness Federated Machine Learning

A collaborative approach to training AI models can yield better results, but it requires finding partners with data that complements your own.

Nowadays, deploying artificial intelligence no longer guarantees a competitive edge. What truly sets companies apart is access to diverse, extensive, high-quality data that enhances their AI system’s performance compared with that of their competitors. But concerns over data privacy can limit the use of unique, relevant data for analysis.

This problem can be alleviated by means of privacy-preserving federated learning. This technique, in combination with a special type of encryption, enables an AI model or any other type of algorithm to be trained using data from multiple, decentralized servers controlled by different organizations, all while respecting the privacy of the individuals or organizations whose data is being used for the training.[1] Simply put, federated learning entails sending the algorithm to the data rather than sending the data to the algorithm.
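
To make "sending the algorithm to the data" concrete, here is a minimal sketch of one common federated learning scheme, federated averaging, on a toy one-weight model. It is an illustration under stated assumptions, not the platform used in the cases described here: the data, the function names (local_update, federated_round), and the learning rate are all invented, and the encryption layer is omitted entirely.

```python
# Toy federated averaging for the model y = w * x (one weight, no bias).
# All data and names below are hypothetical, chosen only for illustration.

def local_update(w, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one party's private data."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(w_global, parties):
    """Send the current weight to every party and average what comes back,
    weighting each party by the number of samples it holds."""
    total = sum(len(data) for data in parties)
    return sum(local_update(w_global, data) * len(data) for data in parties) / total

# Each list stays on its owner's server; only the weight travels.
party_a = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
party_b = [(4.0, 8.0), (5.0, 10.0)]

w = 0.0
for _ in range(50):
    w = federated_round(w, [party_a, party_b])

print(round(w, 3))
```

The coordinating code never reads party_a or party_b directly; it sees only the updated weight each party returns, which is the property that lets the raw records stay where they are.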

This is how Switzerland-based Zurich Insurance Group was able to improve a predictive algorithm with data from Orange, a French telecommunications company. Using a commercial federated learning platform, Zurich’s algorithm could be trained, and its predictive capabilities improved, without the need for Orange to release any data. The collaboration led to a 30% improvement in the AI system’s predictions, which translated into a significant revenue increase for Zurich. For Orange, it represented a new way of monetizing its data while keeping that data private.

Federated Learning Across and Within Industries

Real applications of federated learning are now rapidly emerging as organizations search for more data on which to train the AI systems they hope will deliver competitive advantage. For example, a large bank’s credit unit used the approach to fine-tune its algorithm for predicting loan defaults, using data owned by one of the largest global telecommunications companies, and improved prediction accuracy by about 10%.

The value of such collaborations stems from the ability to train AI systems on much richer data sets than any one organization could assemble on its own. To do this, organizations need to identify partners whose data could be used in a federated learning approach to improve their AI systems’ performance.

While it might appear more logical for organizations from different industries to collaborate than for companies in the same industry to do so, federated learning can also facilitate cooperation within industries, including between direct competitors.

Certain pathology departments within competing private hospitals that have struggled to compile robust data sets on their own are doing this. The departments train a shared diagnostic algorithm on a combination of their respective data sets — for example, images of physical tissue that are to be analyzed for disease. As a result, all participating hospitals benefit from an algorithm with much higher performance in providing valuable, potentially lifesaving insights than would be possible if they relied on smaller, in-house data sets.

The privacy-preserving collaboration that federated learning enables can result in a new type of data-driven business model for hospitals. Ownership shares of the collaboratively developed, collectively owned algorithm are determined based on the contribution of each hospital’s data to its accuracy. That algorithm can also be commercialized and rented out to third parties, such as other hospitals, and the resulting revenue can be shared based on hospitals’ respective ownership shares.
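
How contribution is measured is not specified here, so the following is only one possible sketch. It assumes a leave-one-out rule: each hospital's contribution is the accuracy the shared model loses when that hospital's data is excluded. The accuracy figures, hospital names, and function names are hypothetical.

```python
def ownership_shares(full_accuracy, accuracy_without):
    """Turn leave-one-out accuracy drops into ownership shares that sum to 1.

    accuracy_without maps each hospital to the model's accuracy when that
    hospital's data is left out of training (an assumed measurement rule).
    """
    gains = {h: max(full_accuracy - acc, 0.0) for h, acc in accuracy_without.items()}
    total = sum(gains.values())
    if total == 0:
        # No hospital changed the accuracy: fall back to equal shares.
        return {h: 1 / len(gains) for h in gains}
    return {h: g / total for h, g in gains.items()}

def split_revenue(revenue, shares):
    """Distribute rental revenue in proportion to ownership shares."""
    return {h: revenue * s for h, s in shares.items()}

# Hypothetical figures for three participating hospitals.
full = 0.90
without = {"hospital_a": 0.84, "hospital_b": 0.87, "hospital_c": 0.89}

shares = ownership_shares(full, without)
payouts = split_revenue(100_000, shares)
```

Other attribution rules, such as averaging each hospital's contribution over every possible ordering of participants, would give different shares; the choice of rule is something the partners would need to agree on up front.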

The financial sector takes another approach that enables competing organizations to leverage the benefits of federated learning. Using it with an encryption system called multiparty homomorphic encryption, each participating bank can check whether a client has been flagged by a competing bank as an unreliable or otherwise problematic customer — without informing one another about the clients they do or do not have. This reduces the cost of the “know your customer” process — which currently accounts for approximately 3% of banks’ operational costs globally — and results in a better experience for the customer.

Several hospitals are using the same kind of encrypted computing platform to securely train a machine learning model for diagnostics in dermatology across multiple data sets that reside in different countries. Other hospitals are using it to run simpler distributed analytics to produce descriptive statistics, allowing them to extract personalized reference intervals for healthy and sick patients based on the latest available data. This enables doctors to use personalized thresholds to diagnose their patients, without having to rely on a one-size-fits-all reference for the diagnosis.
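
The simpler distributed analytics can be sketched as follows: each hospital computes three aggregates locally, and only those aggregates are combined. This is a plain-arithmetic illustration with invented lab values. It leaves out the encryption the hospitals' platform applies, and the mean plus or minus 1.96 standard deviations interval is an assumed convention, not necessarily the one those hospitals use.

```python
import math

def local_summary(values):
    """Computed inside each hospital: only three aggregates leave the site."""
    return (len(values), sum(values), sum(v * v for v in values))

def pooled_mean_std(summaries):
    """Combine per-site aggregates into the overall mean and sample std."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    # Guard against tiny negative values caused by floating-point rounding.
    return mean, math.sqrt(max(variance, 0.0))

# Hypothetical lab values held by two hospitals.
site_1 = [4.8, 5.1, 5.3, 4.9]
site_2 = [5.0, 5.4, 4.7]

mean, std = pooled_mean_std([local_summary(site_1), local_summary(site_2)])
low, high = mean - 1.96 * std, mean + 1.96 * std
```

Running the same aggregation separately for each patient group (by age band, for example) is one way such a setup could yield the group-specific thresholds described above.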

The Need for Data Complementarity and Volume

When assessing which external organizations to partner with, business leaders need to consider the nature of the data that their own organization would be able to bring to the collaboration. The data used to train a given model must contain a large number of samples — clients, patients, insurance policies, or whatever it is that the system aims to understand — and a large number of features (variables) within each sample.

The number of samples and the number of features per sample are the two dimensions along which organizations can benefit from using federated learning techniques. The first, horizontal federated learning, entails increasing the number of samples with which the model is trained. The second, vertical federated learning, involves increasing the number of features the model analyzes for each of the samples. These two dimensions give rise to four situations in which business leaders might find themselves when exploring the federated learning landscape. (See “The Four Types of Data for Federated Learning.”)

Rich data, the target state, has a large number of samples and a large number of features for each. A data set with a large number of samples but few features of each is referred to as vertical data and requires vertical learning if it is to become rich data. Similarly, horizontal data has a small number of samples but many features of each and requires horizontal learning in order to become rich data. In poor data, the number of samples is low, as is the number of features for each of those samples.
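
The difference between the two dimensions is easiest to see in the shape of the resulting data. The sketch below uses invented records and hypothetical helper names. It merges the tables only to show what each kind of learning adds; in an actual federated setup the records would stay with their owners and never be pooled like this.

```python
# Records keyed by a shared client ID; all values are invented.
telecom = {"c1": {"monthly_calls": 120}, "c2": {"monthly_calls": 45}}
insurer = {"c1": {"claims": 1}, "c2": {"claims": 0}}
other_insurer = {"c3": {"claims": 2}, "c4": {"claims": 0}}

def horizontal_view(a, b):
    """Same features, different samples: the sample count grows."""
    return {**a, **b}

def vertical_view(a, b):
    """Same samples, different features: join on IDs both parties hold."""
    return {k: {**a[k], **b[k]} for k in a.keys() & b.keys()}

def shape(table):
    """Return (number of samples, number of distinct features)."""
    n_features = len({f for row in table.values() for f in row})
    return len(table), n_features

print(shape(insurer))                                   # the starting point
print(shape(horizontal_view(insurer, other_insurer)))   # more samples
print(shape(vertical_view(telecom, insurer)))           # more features
```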

Business leaders who are unaware of the distinguishing properties of these four types of data risk establishing federated learning collaborations that take them from poor data to either horizontal or vertical data but stop short of producing rich data. Starting with poor data and finishing with vertical or horizontal data may seem appealing, but it may not result in improved performance. For example, a hospital could transform its poor data into vertical data by incorporating data on many more patients without adding new relevant features of those patients. In this situation, it may not be able to improve the predictive power of its diagnostics AI tool, simply because it will not be able to isolate features that positively correlate with disease.

Similarly, moving from poor data to horizontal data might not create an advantage either, given that AI needs enough variation in the relevant features to find meaningful patterns that produce valuable insights.

Thus, when considering whether to collaborate with an external organization on a federated learning project, business leaders need to ask themselves, “What is the status of my data — poor, vertical, horizontal, or rich?” This question can be answered by assessing the vertical and horizontal dimensions of the data. (See “Creating Rich Data From Horizontal or Vertical Data.”)

The vertical dimension is assessed by considering whether the number of observations in the sample is representative of the whole population. A question like “Are all the segments of the population or reality represented in our sample?” should be answered with a “yes” for the data to be considered vertical. The horizontal dimension is assessed by considering whether the variables available about each of the observations in the sample are enough to explain the behavior of a single observation. A question like “Do we understand how the individuals in our sample will behave if we consider all the variables that we have recorded about them?” should be answered with a “yes” for the data to be considered horizontal. Answering both questions with a “yes” indicates rich data. Answering both questions with a “no” is a sign of poor data.

If the data is vertical, looking for partner organizations in other industries may be the right way forward, given that organizations that offer clients different services record different data on those clients. This is illustrated by the Orange-Zurich case, where the telecom company has data on clients’ communications and mobility while the insurance company has data on their risk aversion, consumption, and meaningful life events.

If the data is horizontal, looking for organizations in the same industry may be the right way forward because, in such a case, more samples are needed to enrich the data set.

If the data is poor, the organization needs to start a process of recording data. Studying the data that other organizations have already recorded and recording data that may be complementary to one or more of those organizations — by enriching either the organizations’ vertical or horizontal data sets — is the right strategy.

If the data is rich, the organization is already in a position to harness the power of AI through its own data set. But federated learning still has something to offer: the opportunity to monetize that rich data by contributing to the training of AI systems for other, external organizations while maintaining full ownership of its data set. This is what the Mayo Clinic is doing through Solutions Studio, a program that offers access to global, federated de-identified data.[2]
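
The two assessment questions and the four resulting strategies amount to a small decision table, sketched below. The function and variable names are hypothetical, and the one-line strategy summaries are condensed from the text rather than a substitute for it.

```python
def data_status(is_representative: bool, is_explanatory: bool) -> str:
    """Map the two yes/no assessment questions onto the four data types.

    is_representative: are all segments of the population in the sample?
    is_explanatory: do the recorded variables explain individual behavior?
    """
    if is_representative and is_explanatory:
        return "rich"
    if is_representative:
        return "vertical"    # enough samples, too few features
    if is_explanatory:
        return "horizontal"  # enough features, too few samples
    return "poor"

# Condensed summaries of the strategy suggested for each status.
NEXT_STEP = {
    "vertical": "seek partners in other industries to add features",
    "horizontal": "seek partners in the same industry to add samples",
    "poor": "start recording data that complements what others hold",
    "rich": "consider monetizing the data by helping train others' models",
}

status = data_status(is_representative=True, is_explanatory=False)
print(status, "->", NEXT_STEP[status])
```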

It’s worth noting that one interesting, often unexplored application of federated learning is within an organization. Data-sharing barriers in place for compliance or ethics reasons, and different legal entities or legal jurisdictions, may restrict some groups’ access to data that their company owns. Federated learning allows an organization to take advantage of its data across such barriers while remaining compliant with any data-sharing restrictions.

Overcoming Obstacles to Federated Learning Success

Once business leaders have identified organizations that own complementary data, there are additional obstacles to overcome, including converting unstructured data into a usable format. Unstructured data from sources like faxes, PDF files, or handwritten and scanned reports needs to be organized and machine-readable. Business leaders should anticipate and estimate the costs of such structuring in collaboration with the technical teams that will train the AI systems later.

Disparate labels can be another problem. Different organizations often use different terms to refer to the same feature, so the data catalogs of collaborating organizations will likely need to be synchronized, which can extend the time it takes to train the AI system. Business leaders need to factor this in when they project when initiatives will begin to deliver results.
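
A minimal sketch of what synchronizing data catalogs can involve is shown below. It assumes the partners agree on a shared list of canonical feature names and the aliases each organization uses for them. The catalog entries and field names are invented for illustration.

```python
# Hypothetical shared catalog: canonical name -> aliases seen across partners.
CATALOG = {
    "date_of_birth": {"dob", "birth_date", "birthdate"},
    "postal_code": {"zip", "zip_code", "postcode"},
}
ALIAS_TO_CANONICAL = {
    alias: name for name, aliases in CATALOG.items() for alias in aliases
}

def harmonize(record):
    """Rename known aliases to canonical names.

    Returns the cleaned record plus the labels nobody has mapped yet,
    so that someone can review them instead of silently dropping them.
    """
    clean, unmapped = {}, []
    for label, value in record.items():
        key = label.strip().lower()
        if key in CATALOG:
            clean[key] = value
        elif key in ALIAS_TO_CANONICAL:
            clean[ALIAS_TO_CANONICAL[key]] = value
        else:
            unmapped.append(label)
    return clean, unmapped

clean, unmapped = harmonize({"DOB": "1990-01-01", "Zip": "8001", "income": 5})
```

The unmapped list is where the time cost shows up: every label on it needs a decision from people in both organizations before training can proceed.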

The most critical challenge, however, is people and their fears. Resistance to collaboration with external organizations, particularly competitors, can hinder the exploration and implementation of federated learning.

It is crucial, therefore, that individuals from the various collaborating organizations actively engage with one another when implementing federated machine learning architectures. This is true for any digitization process, even within a single organization, but it is especially so in the context of federated learning, in which multiple organizations with different organizational cultures and concerns are involved.

When deciding which organizations to collaborate with on federated learning, there are six main factors that business leaders need to consider:

  1. The state of their company’s data. Before searching for external organizations with which to collaborate, leaders must determine whether their company’s own data is poor, vertical, horizontal, or rich. This will aid in deciding the type of organization with which they should collaborate.
  2. The structure of their company’s data. Organizations with vertical data should look outside their industry. Those with horizontal data should seek collaborations within their own industry — even, potentially, with competitors.
  3. A logical starting point. Start with one specific algorithm and trusted partner, even within the boundaries of the organization, to securely explore federated learning.
  4. The potential for data monetization. Federated learning presents a privacy-centric method for organizations to monetize their data by contributing to the AI training processes of others. Identifying partner organizations with a need for this data is crucial to such monetization efforts.
  5. Possible technical challenges. Recognize and plan for technical challenges and costs, such as harmonizing the data formats and structures between different organizations.
  6. Employee buy-in. Work with people, not around them. Federated learning needs a transformational approach. There should be widespread buy-in and clearly assigned roles and resources.

When seeking to gain or maintain competitive advantage, deploying an off-the-shelf AI solution won’t be enough. Organizations seeking standout performance need to train and fine-tune their AI systems with proprietary data. Federated learning enables an organization to do so using data from external partners.

References

1. Y. Bammens and P. Hünermund, “Using Federated Machine Learning to Overcome the AI Scale Disadvantage,” MIT Sloan Management Review 65, no. 1 (fall 2023): 54-57.

2. For more on this program, see D. Raths, “Mayo Clinic Platform Seeks to Accelerate Deployment of Digital Health Solutions,” Healthcare Innovation, March 15, 2024, www.hcinnovationgroup.com.

Reprint #: 66205
