Overview: Big Data, Big Silos
Big data can also build big silos. More data is captured today through sensors and digital sources than ever before, in theory creating a vast trove of information for data scientists. That doesn't mean they can access it, however.
Let's say you're a data scientist (Collaborator C in Figure 1) developing a model for early cancer detection. You know you'll need large quantities of diverse examples to train a robust machine learning (ML)/deep learning (DL) model. Your goal is to detect lung cancer, and your database already contains X-rays from a number of different patients, but what if you could also access patient exams from other clinics to get more examples?
Figure 1. All graphics are from “OpenFL: the open federated learning library” (Foley et al., 2022), released under a Creative Commons Attribution 4.0 license.
Your model would be more robust, but there's a major roadblock: data from Collaborators A and B is generally confidential, protected by laws like the Health Insurance Portability and Accountability Act (HIPAA) or the General Data Protection Regulation (GDPR), and cannot be easily shared.
How can you get access to more sources of private, sensitive patient data without violating laws?
Enter federated learning.
What is Federated Learning?
Federated learning (FL) is an ML technique in which data scientists collaboratively train a model, orchestrated by a central server, without centralizing the training data. The basic premise behind FL is that the AI model moves to meet the data, instead of the data moving to meet the model (Foley et al., 2022).
To train an AI model in this scenario, each collaborator prepares its own private data, and the model to be trained is sent to each collaborator. Collaborators train the model on their local data using their on-site compute. Periodically, each collaborator shares only its model weight updates (learnings) and metrics with the aggregation server, which combines the learnings from individual collaborators and sends the updated model back to each collaborator for further training (Figure 2). (For a real-world example, read this case study where Intel Labs collaborated with 71 international healthcare and research institutions to train AI models to identify brain tumors.)
During this entire process, the data used to train the model never leaves the collaborator's node, staying firmly behind institutional firewalls. The entire federated learning workflow is automated and orchestrated by a framework like Open Federated Learning (OpenFL).
Figure 2. The federated learning workflow: the model travels to each collaborator, and only weight updates and metrics return to the aggregation server.
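To make the loop concrete, here's a minimal sketch of that round-based exchange in plain Python. This is illustrative only, not OpenFL's internal implementation; every function and name in it is hypothetical, and the local training step is faked so the example runs end to end. The shape of the exchange, though (local training, weight updates out, aggregated model back), matches the workflow described above.

```python
# Illustrative sketch of federated averaging -- NOT OpenFL's internals.
# All names are hypothetical; "training" is faked so the example runs.
import numpy as np

def local_update(weights, local_data):
    """Stand-in for one round of on-site training at a collaborator."""
    fake_gradient = local_data.mean(axis=0) - weights
    return weights + 0.1 * fake_gradient  # hypothetical learning rate

def aggregate(updates, num_samples):
    """Server-side step: combine updates, weighted by local dataset size."""
    total = sum(num_samples)
    return sum(w * (n / total) for w, n in zip(updates, num_samples))

# Three collaborators (A, B, C), each with private data that never leaves
# its node -- only the updated weights are shared with the server.
rng = np.random.default_rng(0)
private_data = [rng.normal(loc=i, size=(100 * (i + 1), 4)) for i in range(3)]

global_weights = np.zeros(4)
for round_num in range(5):  # federated rounds
    updates = [local_update(global_weights, d) for d in private_data]
    global_weights = aggregate(updates, [len(d) for d in private_data])
    print(f"round {round_num}: weights = {global_weights.round(3)}")
```

Note that the aggregation step only ever sees `updates`, never `private_data`; that asymmetry is the whole point of the exchange.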
Why use OpenFL?
OpenFL is a Python* 3 library for federated learning that enables organizations to collaboratively train a model without sharing sensitive information.
FL simplifies issues around data sharing, but there are other important security and privacy considerations. AI model developers must protect their model intellectual property (IP) when training in potentially untrusted environments, and collaborators need assurance that their data cannot be extracted by inspecting model weights across federated rounds (reverse engineering). That's where OpenFL comes in: designed with privacy and security in mind, it employs narrow interfaces and allows all of its processes to run within Trusted Execution Environments (TEEs), which can provide confidentiality of data and models, integrity of computation, and attestation of compute resources.
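To give a feel for what using the library looks like, here is a condensed single-machine simulation in the style of OpenFL's 1.x native Python API tutorials. Treat the specifics as assumptions: the module paths, class names, and config keys below follow those tutorials and may differ in other OpenFL releases, so check the documentation for the version you install.

```python
# Single-machine FL simulation sketched after OpenFL's 1.x native Python
# API tutorials. Names and config keys are assumptions from those
# tutorials; verify against the docs for your OpenFL release.
import tensorflow as tf
import openfl.native as fx
from openfl.federated import FederatedModel, FederatedDataSet

fx.init('keras_cnn_mnist')  # bootstrap a workspace from a built-in template

# Toy data standing in for each clinic's private images and labels.
(X_train, y_train), (X_valid, y_valid) = tf.keras.datasets.mnist.load_data()
X_train, X_valid = X_train[..., None] / 255.0, X_valid[..., None] / 255.0

fl_data = FederatedDataSet(X_train, y_train, X_valid, y_valid,
                           batch_size=32, num_classes=10)

def build_model(feature_shape, classes):
    """Any Keras model; the library handles distribution and aggregation."""
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=feature_shape),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

fl_model = FederatedModel(build_model, data_loader=fl_data)

# Shard the data across two simulated collaborators and run five rounds.
collaborators = dict(zip(['clinic_a', 'clinic_b'],
                         fl_model.setup(num_collaborators=2)))
final_model = fx.run_experiment(
    collaborators,
    override_config={'aggregator.settings.rounds_to_train': 5})
```

In a real deployment, each collaborator and the aggregator run on separate machines with their own certificates, driven by OpenFL's fx command-line tool rather than a single script.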
Get Involved
Our example was related to healthcare, but OpenFL can be used in any environment where you want to apply federated learning, whether you need more diverse data to train your model or you're working on a project like self-driving cars, where each vehicle is a node that collects information while driving and sends what it learns back to the model owner.
You can learn more by checking out the OpenFL documentation.
The project welcomes contributions; if you're interested, head to the OpenFL GitHub page.
References
Foley, P., Sheller, M. J., Edwards, B., Pati, S., Riviera, W., Sharma, M., Narayana Moorthy, P., Wang, S., Martin, J., Mirhaji, P., Shah, P., & Bakas, S. (2022). OpenFL: The open federated learning library. Physics in Medicine & Biology, 67(21), 214001. https://doi.org/10.1088/1361-6560/ac97d9
About the authors
Ezequiel Lanza, Open Source Evangelist
Passionate about helping people discover the exciting world of artificial intelligence, Ezequiel is a frequent AI conference presenter and the creator of use cases, tutorials, and guides that help developers adopt open source AI tools like TensorFlow* and Hugging Face*. Find him on Twitter at @eze_lanza.
Morgan Andersen, Tech Evangelist
Tech professional who is passionate about helping others discover the wonderful world of coding and artificial intelligence through interactive experiences. Find her on LinkedIn.
Olga Perepelkina, AI Product Manager at Intel
Olga holds a PhD in Neuroscience and a postgraduate degree in ML/data science. She's also an industrial adviser at the School of Doctoral Training at the University of Glasgow. Find her on LinkedIn.