Differential privacy with the Laplace mechanism, implemented in Python to protect data privacy

Nltd
4 min read · Oct 6, 2021

1. Problem

There are two types of information we care about:
+ general information: the aggregate information contributed by everyone in the figures.
+ private information: the information each individual contributes to the figures.

To make the two types concrete, consider an example: we run a survey on the effect of alcohol on health by asking people "Does alcohol make you more stressed?". From the aggregated figures we may conclude that alcohol increases stress; that is general information. Each individual's yes-or-no answer, however, is private information.

Data, and personal data in particular, have become the most important "resource" or "asset" of the digital economy. Nowadays many big-data systems hold enormous amounts of user data, and they need to publish statistics to support research projects. However, publishing data to the community carries the risk of leaking private information and the identities of the individuals included in the data through reverse inference.

For example, if we know the average weight of 3 students and the average weight of 2 of them, we can infer the weight of the remaining student. Organizations want to publish data but also need to protect the information of individuals; this is where differential privacy becomes essential.
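The arithmetic behind this reverse inference is trivial. A minimal sketch (the averages below are invented values for illustration):

```python
# Two hypothetically published statistics (values invented):
avg_three = 60.0   # average weight of students A, B, C (kg)
avg_two = 58.5     # average weight of students A, B (kg)

# Reverse inference: recover the third student's "private" weight.
total_three = avg_three * 3   # 180.0 kg in total
total_two = avg_two * 2       # 117.0 kg in total
weight_c = total_three - total_two
print(weight_c)  # 63.0
```

Two innocuous-looking aggregates are enough to expose one person's exact record, which is precisely the leak differential privacy is designed to prevent.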

The problem, then, is how to let someone obtain analyses and statistics while making it impossible to infer the individual information of the survey contributors.

2. Approach

According to the paper Calibrating Noise to Sensitivity in Private Data Analysis (Dwork et al., 2006), when a dataset D excludes any one person's data, the result of any statistical query should fluctuate only slightly.

The change in the statistic is bounded by ε, also known as the privacy loss. The objective of differential privacy is to make ε as small as possible while preserving accuracy: applying the mechanism should not change the output of the statistical functions too much compared to the original values.

To do this, we add a noise value L to the output of the statistical functions, so they return Y + L instead of Y. The result Y + L must keep ε tiny, meaning that removing any one person's data does not change Y + L significantly. Thus anyone mining the published results is unable to extract individual information. There are many ways to compute L; the standout choice is to draw it from the Laplace distribution.

2.1. Definition of ε-Differential Privacy

Let ε be a positive real number and A be a randomized algorithm that takes a dataset as input. The algorithm A is said to provide ε-differential privacy if, for all datasets D1 and D2 that differ in a single element (i.e., the data of one person), and for all subsets S of the image of A:

Pr[A(D1) ∈ S] ≤ exp(ε) · Pr[A(D2) ∈ S]

The inequality means that if one person's information is removed, the probability of the output landing in any set S does not change significantly. Suppose the data of a person X is in dataset D1 but not in dataset D2; anyone who acquires the statistics of both D1 and D2 is still unable to exploit X's data. ε represents the level of protection: the smaller ε is, the tighter the inequality and the stronger the privacy. Choosing ε is complex and depends on multiple conditions and requirements.
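The inequality can be checked empirically for a simple counting query with Laplace noise. This is only a sanity-check sketch: the neighboring counts, the output set S, ε, and the use of numpy's Laplace sampler are all my own choices for illustration.

```python
import numpy as np

# Neighboring datasets D1, D2 differ in one record, so their "yes" counts
# differ by exactly 1 (the sensitivity of a counting query).
rng = np.random.default_rng(0)
eps = 0.5
count_d1, count_d2 = 10, 9
n = 200_000

# A(D) = count(D) + Lap(0, 1/eps), i.e. scale = sensitivity / eps = 1/eps.
out_d1 = count_d1 + rng.laplace(0.0, 1.0 / eps, n)
out_d2 = count_d2 + rng.laplace(0.0, 1.0 / eps, n)

# Estimate Pr[A(D) in S] for an arbitrary output set S = [8, 11].
p1 = np.mean((out_d1 >= 8) & (out_d1 <= 11))
p2 = np.mean((out_d2 >= 8) & (out_d2 <= 11))

# Both directions of the eps-DP inequality should hold.
print(p1 <= np.exp(eps) * p2 and p2 <= np.exp(eps) * p1)
```

The two empirical probabilities stay within a factor of exp(ε) of each other, which is exactly what the definition requires.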

2.2. Sensitivity

The sensitivity ∆f of a function f is defined as:

∆f = max ‖f(D1) − f(D2)‖₁, where the maximum is taken over all pairs of datasets D1, D2 differing in a single element.

Here D1 and D2 are two datasets that differ in only one record. The sensitivity is the largest possible difference in the output of f on two neighboring datasets (such as D1 and D2). Note that the sensitivity depends only on the function f, not on the datasets (there is a variant called local sensitivity, which does depend on the dataset under consideration). For instance, the sensitivity of the counting function is 1, simply because the count changes by 1 whenever a single record is removed or added.
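This can be illustrated with a brute-force local-sensitivity check on a toy dataset. The helper name and the weight values below are invented for the sketch:

```python
# Brute-force local sensitivity: compare f(D) with f(D') for every
# neighbor D' obtained by removing one record from D.
def local_sensitivity(f, dataset):
    return max(abs(f(dataset) - f(dataset[:i] + dataset[i + 1:]))
               for i in range(len(dataset)))

weights = [55.0, 60.0, 63.0]              # toy dataset (invented values)

def count(d):
    return float(len(d))

def mean(d):
    return sum(d) / len(d)

print(local_sensitivity(count, weights))  # 1.0 -- a count always changes by 1
print(local_sensitivity(mean, weights))   # depends on the data itself
```

The count's sensitivity is the same for every dataset, while the mean's local sensitivity varies with the records, which is why the global sensitivity (the worst case over all datasets) is the quantity used to calibrate the noise.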

2.3. Laplace mechanism

The mechanism adds appropriate noise to the output of a function so that it satisfies differential privacy. Concretely, the function returns F(x) instead of f(x):

F(x) = f(x) + Lap(μ = 0, b = ∆f / ε)

Lap is a Laplace distribution with center μ = 0 and scale b = ∆f / ε.
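A minimal sketch of the mechanism, assuming numpy for the Laplace sampler (the function name and the example numbers are my own, not the repository's API):

```python
import numpy as np

# Laplace mechanism: release f(x) + Lap(0, delta_f / eps), where delta_f
# is the global sensitivity of f and eps is the privacy budget.
def laplace_mechanism(true_value, sensitivity, eps, rng=None):
    rng = rng or np.random.default_rng()
    scale = sensitivity / eps                  # b = delta_f / eps
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Example: privately release a count of "yes" answers.
# The sensitivity of a count is 1; eps = 0.1 gives fairly strong privacy.
rng = np.random.default_rng(42)
noisy_count = laplace_mechanism(true_value=128, sensitivity=1.0, eps=0.1, rng=rng)
print(noisy_count)
```

The noise is unbiased (centered at 0), so repeated releases average out to the true value; a smaller ε means a larger scale b and therefore noisier individual answers, which is the accuracy-versus-privacy trade-off in one line.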

3. Source code

The source code uses Python to build a command-line UI that lets users extract statistics from a dataset. I used the Laplace distribution and manually calculated the sensitivity based on the dataset to add noise to the output.

Source code https://github.com/nltd101/differential-privacy
