Federated learning (FL) allows parties to learn from each other without sharing their data. Instead, each party shares local updates to a global model with a server, which aggregates the updates to create the next version of the global model. But even though each party's data stays local, the shared updates and the global model can still leak sensitive information. For example, an adversary who observes the model updates in FL can infer the membership of individual data points with high accuracy [Nasr et al. 2019] or even reconstruct the private local data [Zhu et al. 2019]. Intel recently announced that the Linux Foundation* AI & Data Foundation Technical Advisory Council accepted Open Federated Learning (OpenFL) as an incubation project to further drive collaboration.
Data protection regulations like the General Data Protection Regulation (GDPR) require that personal data be safeguarded in AI systems and that users retain control over their data and know how it is used. Meeting these legal requirements calls for proper auditing tools that can anticipate potential privacy risks to users when their personal data is used to build AI systems. That's where the Privacy Meter comes in! Privacy Meter supports the data protection impact assessment process by providing a quantitative analysis of the fundamental privacy risks of a (machine learning) model.
OpenFL 1.5 integrates the Privacy Meter, a privacy auditing tool that provides real-time privacy auditing in the federated learning setting. This integration empowers collaborators to be informed about potential privacy risks and take control of them. The following sections explain how the Privacy Meter works and highlight the value of integrating it with OpenFL.
What is Privacy Meter?
Before introducing the Privacy Meter, it's essential to understand the definition of privacy.
What is the definition of privacy?
Privacy is a fundamental concept that everyone values, but it means different things to different people depending on the context. For example, some people might feel comfortable sharing their personal information online, while others might be more guarded. However, when it comes to the computation of sensitive personal information, like training machine learning models on medical records or financial data, it is essential to clearly define what privacy means.
Differential privacy is a concept that has gained widespread acceptance in recent years, with organizations such as the US Census Bureau, Apple, and Google using it to protect the privacy of individual participants in their data collection and processing. At its core, differential privacy is designed to enable the computation of statistics (including machine learning models) on private data without compromising the sensitive information of individuals in the dataset.
The key to differential privacy is ensuring that the information released from a private dataset reveals little more about any individual than could be learned without that individual's data. In other words, by observing the released information, an adversary should not gain meaningful additional insight into any individual in the dataset. For example, if you participate in a survey about your health, the released statistics should not reveal anything about you personally.
In practice, achieving perfect privacy is challenging and can result in a significant loss of utility. To address this, differential privacy provides a relaxed definition that allows a trade-off between privacy and utility. More specifically, we say that a computation on a private dataset is (𝜀, 𝛿)-differentially private if adding or removing any single individual's data changes the probability of any output by at most a multiplicative factor of e^𝜀, plus an additive slack of 𝛿. This means that even if an adversary has access to the output of the computation, their accuracy in guessing whether a particular individual is in the dataset is limited. For more information about differential privacy and its applications, we recommend reading the first two chapters of the book about differential privacy.
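For reference, this is the standard formal definition: a randomized algorithm M is (𝜀, 𝛿)-differentially private if, for every pair of datasets D and D′ that differ in a single individual's record, and for every set S of possible outputs,

P[M(D) ∈ S] ≤ e^𝜀 · P[M(D′) ∈ S] + 𝛿.

Setting 𝜀 = 𝛿 = 0 recovers perfect privacy (the two output distributions are identical), while larger values permit a controlled amount of leakage in exchange for utility.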
How to quantify the privacy risk?
Now that we have a clear understanding of differential privacy, it's essential to introduce ways to measure the risk of privacy breaches. As we discussed earlier, the privacy risk can be directly related to the accuracy of an adversary in guessing the membership of individual records with respect to the dataset used for computation.
This type of attack is called a membership inference attack [Shokri et al. 2017], where the adversary aims to accurately infer the membership information about individual points with respect to the dataset. In other words, the adversary determines whether a particular individual's data was part of the dataset used to train a machine learning model. It has been widely used to show how much information trained models memorize about each individual record in the training dataset.
To better understand how we can measure the privacy risk in machine learning models, let's introduce a game between two players: the model learner (or challenger), who trains a model on a sensitive training dataset, and the adversary, who has access to the trained model and wants to infer whether target points were members of that dataset. The game proceeds as follows (a minimal code sketch appears after the list):
Privacy Game
- The challenger samples a dataset Dt from the underlying data distribution D and trains a model θ on Dt using a public training algorithm.
- The challenger flips an unbiased coin b ∈ {0, 1} and samples a data point z from the data distribution D if b = 0, or from the dataset Dt if b = 1.
- The adversary is given z, some access to the model θ (such as query access) and to the data distribution D, and outputs a prediction b̂ of the coin flip using a membership inference attack algorithm.
- If b̂ = b, output 1 (the adversary succeeds), otherwise output 0.
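To make the game concrete, here is a minimal sketch in Python that simulates it with a toy synthetic dataset, a logistic-regression challenger, and a simple loss-threshold adversary. The dataset, model, and threshold are illustrative assumptions, not part of Privacy Meter; real audits use far stronger attacks.

```python
# A minimal sketch of the privacy game with a toy dataset, a logistic-regression
# challenger, and a loss-threshold adversary. All choices here are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)

# The underlying data distribution D, approximated by a large synthetic pool.
X_pool, y_pool = make_classification(n_samples=20000, n_features=20, random_state=0)

# Challenger: sample a training set Dt from D and train a model θ on it.
train_idx = rng.choice(len(X_pool), size=1000, replace=False)
model = LogisticRegression(max_iter=1000).fit(X_pool[train_idx], y_pool[train_idx])

def per_example_loss(x, y):
    """Cross-entropy loss of the model on a single point (the adversary's signal)."""
    prob = model.predict_proba(x.reshape(1, -1))
    return log_loss([y], prob, labels=[0, 1])

# Adversary: training points tend to have lower loss, so guess b̂ = 1 (member)
# when the loss falls below a threshold (chosen arbitrarily here).
THRESHOLD = 0.3
non_member_idx = np.setdiff1d(np.arange(len(X_pool)), train_idx)
n_trials, correct = 2000, 0
for _ in range(n_trials):
    b = rng.integers(0, 2)                                    # unbiased coin flip
    i = rng.choice(train_idx if b == 1 else non_member_idx)   # z from Dt or from D outside Dt
    b_hat = int(per_example_loss(X_pool[i], y_pool[i]) < THRESHOLD)
    correct += int(b_hat == b)

print(f"Adversary accuracy over {n_trials} trials: {correct / n_trials:.3f}")
```

An accuracy close to 0.5 means the adversary does no better than random guessing; the further the accuracy rises above 0.5, the more membership information the model leaks.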
To measure privacy risk, we can look at the adversary's performance across the whole ROC curve. This captures the trade-off between the true positive rate (i.e., the probability that the adversary correctly identifies a training point as a member, P[b̂=1|b=1]) and the false positive rate (i.e., the probability that the adversary incorrectly flags a non-member as a member, P[b̂=1|b=0]). We can also look at the true positive rate when the false positive rate lies in a small region (e.g., FPR < 0.1).
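Given the adversary's membership scores and the true membership labels from such a game, these metrics can be computed directly. Below is a short sketch, assuming scikit-learn and synthetic stand-in scores (in practice the scores would come from a real attack, such as the negative per-example loss above).

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Illustrative stand-in scores for 1000 members (b = 1) and 1000 non-members (b = 0).
rng = np.random.default_rng(1)
member_scores = rng.normal(loc=1.0, scale=1.0, size=1000)
non_member_scores = rng.normal(loc=0.0, scale=1.0, size=1000)

scores = np.concatenate([member_scores, non_member_scores])
labels = np.concatenate([np.ones(1000), np.zeros(1000)])

# Full trade-off between TPR = P[b̂=1|b=1] and FPR = P[b̂=1|b=0].
fpr, tpr, _ = roc_curve(labels, scores)
print(f"AUC: {auc(fpr, tpr):.3f}")

# TPR in the low-FPR regime, e.g., the best TPR achievable at FPR < 0.1.
print(f"TPR at FPR < 0.1: {tpr[fpr < 0.1].max():.3f}")
```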
The game can be formalized differently depending on the objective of the audit, such as auditing the privacy risk of a specific data point or auditing the privacy risk of a training algorithm. Check out [Ye et al. 2022] for more details on this topic. In summary, the performance of a membership inference attack can be used to quantify the privacy risk and provides a lower bound on how much information is leaked about the private dataset.
Privacy Meter
As we previously discussed, membership inference attacks can be used to measure privacy risk in machine learning models. However, to obtain a more accurate empirical analysis of the privacy risk, we need to use state-of-the-art (stronger) membership inference attacks and properly simulate different privacy games. This is where the Privacy Meter becomes valuable!
The Privacy Meter is a powerful open-source library for auditing data privacy in machine learning algorithms. It can help in the data protection impact assessment process by providing a quantitative analysis of the fundamental privacy risks of a model. The tool uses state-of-the-art membership inference techniques [Ye et al. 2022] to audit various machine learning algorithms, including those for classification, regression, computer vision, and natural language processing. It generates comprehensive reports about the aggregate and individual privacy risks for data records in the training set, at multiple levels of access to the model.
Privacy Meter is user-friendly, with easy-to-use tools for auditing privacy risks in different types of games. It supports models trained with different libraries, such as PyTorch, TensorFlow, and OpenVINO. Additionally, Privacy Meter provides ways to reproduce the results of existing membership inference attacks on benchmark datasets. Want to learn more about how the Privacy Meter can help you safeguard your sensitive data? Check out the Privacy Meter repository for further information.
Integration
Now, let's dive into the exciting integration of OpenFL and the Privacy Meter! As we mentioned, while federated learning (FL) provides data protection by not sharing data among parties, there is still a risk of sensitive information being leaked through shared model updates. This is where our integration comes in - we aim to provide participating collaborators with a way to anticipate their privacy risk during training and take action to minimize that risk.
Our goal is to audit potential privacy risks by taking on the role of an adversary who seeks to infer membership information. This allows us to identify and address privacy risks early in training, keeping sensitive data safe and secure. It is especially valuable because it lets collaborators take proactive measures to protect privacy rather than react after a breach has occurred.
How can it help?
Threat Model: In FL, an adversary may control the server or participating collaborators, so we consider two threat models:
- Server is trusted, and other parties are honest-but-curious (they follow the protocol but try to learn as much as possible from what information they have access to). In this threat model, each party can audit the privacy risk of the global model to quantify how much information will be leaked to other parties via the global model.
- Everyone, including the server, is honest-but-curious. In this threat model, each party can audit the privacy risk of the local and global models to quantify how much information will be leaked to the aggregator via the local model and to other parties via the global model.
Pipeline: In each round of FL, participating parties train on their local datasets, starting from the current global model as initialization. After training, the current global model and the updated local model are passed to the privacy auditing module to produce a privacy risk report.
Participating collaborators can take action based on the privacy risk report. For example, they may quit the federation, withhold the current local model update from the server, or add noise so that the updated local model is differentially private. Our real-time privacy analysis empowers collaborators to anticipate and take control of potential privacy risks before any sharing occurs, keeping sensitive data safe and secure.
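The per-round flow can be summarized with a short, hypothetical sketch. None of the names below (`train_locally`, `audit_privacy`, `add_noise`, `RISK_BUDGET`) come from the OpenFL or Privacy Meter APIs; they are trivial stand-ins that show where the audit sits in the round.

```python
# A simplified, hypothetical sketch of the per-round flow described above.
# The helpers are stand-ins, not the OpenFL or Privacy Meter APIs; a real
# collaborator would plug in its training code and the auditing module.

RISK_BUDGET = 0.2  # illustrative cap on, e.g., the attack's TPR at FPR < 0.1

def train_locally(global_model, local_dataset):
    """Stand-in for local training that starts from the current global model."""
    return {"weights": global_model["weights"], "n_samples": len(local_dataset)}

def audit_privacy(global_model, local_model, local_dataset):
    """Stand-in for the auditing module: it would run membership inference
    attacks against both models and return a privacy risk report."""
    return {"tpr_at_low_fpr": 0.15}

def add_noise(local_model):
    """Stand-in for a mitigation such as clipping and adding DP noise."""
    return dict(local_model, noised=True)

def run_round(global_model, local_dataset):
    # 1. Local training starts from the current global model.
    local_model = train_locally(global_model, local_dataset)

    # 2. Both models go to the privacy auditing module before anything is shared.
    report = audit_privacy(global_model, local_model, local_dataset)

    # 3. The collaborator acts on the report: share as-is, mitigate, or opt out.
    if report["tpr_at_low_fpr"] > RISK_BUDGET:
        local_model = add_noise(local_model)   # or skip sharing this round entirely

    # 4. Only now is the (possibly protected) update sent to the server.
    return local_model, report

update, report = run_round({"weights": [0.0]}, local_dataset=[1, 2, 3])
print(report)
```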
Example: Are you curious about how OpenFL and Privacy Meter can work together? If so, we've got you covered! We've created tutorials; check them out for yourself here.
The integration of OpenFL and Privacy Meter is a game-changer for data privacy in machine learning. By enabling collaborators to anticipate potential privacy risks during training, the integration empowers them to take action and minimize those risks before any information is shared. User-friendly tools, support for a wide range of machine learning libraries, and the ability to reproduce the results of existing attacks make Privacy Meter an excellent tool for auditing data privacy. This blog post has provided an introduction to Privacy Meter, explained how it works in FL, and offered examples and tutorials to get started. Together, OpenFL and Privacy Meter offer a promising solution for preserving data privacy in machine learning and for building a more trustworthy and secure data ecosystem.
References
- Shokri, Reza, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. "Membership inference attacks against machine learning models." In 2017 IEEE symposium on security and privacy (SP), pp. 3-18. IEEE, 2017.
- Ye, Jiayuan, Aadyaa Maddi, Sasi Kumar Murakonda, Vincent Bindschaedler, and Reza Shokri. "Enhanced Membership Inference Attacks against Machine Learning Models." In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pp. 3093-3106. 2022.
- Nasr, Milad, Reza Shokri, and Amir Houmansadr. "Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning." In 2019 IEEE Symposium on Security and Privacy (SP), pp. 739-753. IEEE, 2019.
- Zhu, Ligeng, Zhijian Liu, and Song Han. "Deep leakage from gradients." In Advances in Neural Information Processing Systems 32 (NeurIPS). 2019.
About the Author
Hongyan Chang
Passionate about building trustworthy machine learning, Hongyan is a PhD student at the National University of Singapore, supervised by Reza Shokri. With a focus on privacy and fairness, especially in decentralized settings, she’s dedicated to advancing the field of trustworthy AI. Connect with her on Twitter at @Hongyan_Chang.