MACHINE LEARNING AND DIFFERENTIAL PRIVACY: OVERVIEW
Advances in computer science, AI, and machine learning have led to great achievements in many aspects of our society and daily lives. However, with ever-growing digitalization and the extensive sharing of our personal lives online, privacy is rapidly gaining importance. This is especially true since the beginning of the century, when new business models emerged in which third-party entities purchase personal information to improve products and services.
AHMED ZEWAIN | DATA SCIENTIST | APRIL | 2019
One answer to the privacy question is data anonymization, where we “blur” out the most sensitive personal information before any kind of data analysis. By sensitive information we mean medical records, political affiliations, social security numbers, addresses, and so on. You might be tempted to think that anonymization is sufficient to preserve personal privacy, but as experts in the field have shown, these methods can leak information if an attacker already has some information about individuals in the dataset.
A well-known example is the case of William Weld, the governor of Massachusetts, whose personal health data was discovered in an anonymized public database. By matching the overlapping records of two databases, namely the health database and a voter registry, a team of researchers was able to identify the governor’s personal health records.
The classical approach to data anonymization is k-anonymity. The idea is that each record must be indistinguishable from at least k-1 other records with respect to its identifying attributes. It turns out, however, that with a little more background knowledge about the database, an attacker can still recover the original records with high certainty.
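To make the idea of such a linkage attack concrete, here is a minimal sketch in Python. All records are made up, and the choice of ZIP code, birth date, and sex as the overlapping fields is an assumption for illustration; the function name `link_records` is hypothetical as well.

```python
# Toy linkage attack. Every record below is invented for illustration only.
health_records = [  # "anonymized": names removed, other fields kept
    {"zip": "02138", "birth_date": "1950-01-01", "sex": "M", "diagnosis": "condition X"},
    {"zip": "02139", "birth_date": "1980-03-02", "sex": "F", "diagnosis": "condition Y"},
    {"zip": "02140", "birth_date": "1975-11-20", "sex": "M", "diagnosis": "condition Z"},
]

voter_registry = [  # public list that still contains names
    {"name": "Person A", "zip": "02138", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "Person B", "zip": "02139", "birth_date": "1980-03-02", "sex": "F"},
    {"name": "Person C", "zip": "02139", "birth_date": "1980-03-02", "sex": "F"},
    {"name": "Person D", "zip": "02141", "birth_date": "1990-06-15", "sex": "F"},
]

# Assumed overlapping fields shared by both datasets.
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def link_records(anonymized, identified, keys=QUASI_IDENTIFIERS):
    """Attach a name to every anonymized row that matches exactly one named row."""
    index = {}
    for row in identified:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    matches = []
    for row in anonymized:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match means re-identification
            matches.append({"name": candidates[0]["name"], **row})
    return matches


reidentified = link_records(health_records, voter_registry)
```

In this toy data only "Person A" is re-identified: two voters share the second health record's fields, so that match stays ambiguous, and nobody matches the third. The attack works precisely when a combination of harmless-looking fields is unique to one person.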
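As a rough illustration, checking whether a table is k-anonymous can be sketched in a few lines of Python. The column names, the generalized values, and the function name `is_k_anonymous` are assumptions made for this example.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


# Made-up table in which ZIP codes and ages were already generalized ("blurred").
table = [
    {"zip": "021**", "age": "30-39", "diagnosis": "condition X"},
    {"zip": "021**", "age": "30-39", "diagnosis": "condition Y"},
    {"zip": "021**", "age": "40-49", "diagnosis": "condition X"},
    {"zip": "021**", "age": "40-49", "diagnosis": "condition Z"},
]

two_anonymous = is_k_anonymous(table, ("zip", "age"), k=2)
three_anonymous = is_k_anonymous(table, ("zip", "age"), k=3)
```

Here each (zip, age) group contains two rows, so the table is 2-anonymous but not 3-anonymous. Note that the check says nothing about the sensitive column itself, which is one reason extra background knowledge can still reveal information.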
THE MOST POPULAR DEFINITION OF PRIVACY
In 2006 the mathematical notion of differential privacy was introduced, and it is currently one of the most popular definitions of privacy. Differential privacy ensures that the publicly visible output does not change much if the data of one individual in the dataset changes. This is done by adding random noise to the mechanism at work.
In short, we can say that differential privacy requires that a mechanism which outputs information about a dataset is robust to any change of one individual in the dataset. As the output cannot be significantly affected by any one individual, an attacker cannot confidently extract private information about any individual record.
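One common way to add such noise is the Laplace mechanism, sketched below for a simple counting query. This is a minimal sketch rather than a production implementation: the function names and the privacy parameter `epsilon` are not from this article, and the noise is drawn with Python's standard library only.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Noisy count of the values satisfying `predicate`.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so the noise scale is 1 / epsilon. A smaller epsilon means more noise.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)              # seeded only to make the example repeatable
ages = [23, 35, 41, 29, 52, 67, 38]  # made-up data
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
```

The true answer here is 3, but every call returns a slightly different number. Averaged over many people the statistic stays useful, while the presence of any single person is hidden in the noise.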
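Written formally, the requirement is usually stated as follows. The symbols are assumptions introduced here and do not appear elsewhere in this article: a randomized mechanism M is ε-differentially private if, for every pair of datasets D and D' that differ in the data of one individual, and for every set S of possible outputs,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
```

The smaller ε is, the closer the two output distributions must be, and the less an attacker can learn about whether any one individual's data was used.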
NOISE ADDITION IN DIFFERENTIAL PRIVACY
As you might have guessed, adding noise seems to conflict with the machine learning way of thinking, which aims to learn from data as accurately as possible. Researchers have therefore developed tools that capture the underlying distributions of large datasets while still guaranteeing privacy for the individuals in the data.
Differential privacy can be built into machine learning algorithms in different ways, depending on whether the task at hand is unsupervised or supervised learning. The 2014 paper “Differential Privacy and Machine Learning: A Survey and Review” by Ji et al. provides a comprehensive overview of existing work on differentially private algorithms in various fields of machine learning, which helps push forward the research and implementation of much-needed private algorithms.
One sought-after scenario is to have comprehensive open source libraries similar to the well-known machine learning library scikit-learn or the deep learning library Keras. Such differentially private libraries can facilitate the adoption of individual privacy guarantees as a standard part of the data science workflow.
The essence of differential privacy is that an adversary is not able to distinguish between the answers of a differentially private algorithm on two input databases that differ in a single user. It therefore does not matter whether the user is in the database or not.
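As one illustration for the supervised case, a widely used recipe is differentially private gradient descent: clip each example's gradient so that no single person can dominate the update, then add Gaussian noise before averaging. The sketch below is a simplified, standard-library version of that idea. It is not taken from the survey above, all names and parameter values are assumptions, and the privacy accounting that turns the noise level into a formal guarantee is omitted.

```python
import math
import random


def clip(gradient, max_norm):
    """Scale a gradient down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in gradient]


def private_gradient_step(weights, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One noisy gradient descent step on a batch of per-example gradients."""
    summed = [0.0] * len(weights)
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, max_norm)):
            summed[i] += g

    # Gaussian noise calibrated to the clipping bound, added once per coordinate.
    sigma = noise_multiplier * max_norm
    batch_size = len(per_example_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]

    return [w - lr * g for w, g in zip(weights, noisy_mean)]


rng = random.Random(0)
weights = [0.0, 0.0]
grads = [[3.0, 4.0], [0.1, -0.2], [0.0, 0.5]]  # made-up per-example gradients
weights = private_gradient_step(weights, grads, lr=0.1, max_norm=1.0,
                                noise_multiplier=1.1, rng=rng)
```

The clipping bound limits how much one example can influence the model, and the noise hides whatever influence remains. Both choices trade accuracy for privacy, which is exactly the tension described above.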
BUSINESSES EXPERIMENTING WITH DIFFERENTIAL PRIVACY
Some large IT companies have already led the way and begun adopting differentially private algorithms in their products. One example is Google, whose open source TensorFlow library has embraced differential privacy with the recent release of TensorFlow Privacy, which allows researchers to develop privacy-respecting machine learning algorithms. Other companies such as Apple and Amazon are also known to experiment with differential privacy as part of their workflow.
The recently introduced European General Data Protection Regulation (GDPR) puts great emphasis on respecting the individual’s privacy by means of data anonymization. However, it is still a long way from demanding privacy-preserving machine learning algorithms for maximum privacy protection.
One could argue that on the positive side, this gives researchers and contributors time to optimize machine learning algorithms and prepare them for the inevitable need for differential privacy in the near future.
However, it is also evident that the big companies are leading the way and taking responsibility for introducing differential privacy as the sole standard for privacy in future AI workflows.