Wikimedia Privacy and Wikimedia Research

Differential Privacy + Wikipedia Exploration

No guarantees are made that this tool will be maintained.

This is an experimental tool hosted on Cloud VPS. No additional personal data is collected by this tool per the Cloud Services Terms of Use.

This tool showcases how different approaches to differential privacy might affect top-viewed lists. In practice, this approach would more likely be applied to data such as pageviews by country, but here we use the top-viewed articles in a wiki (public data) as a proxy. This is a clone of an earlier exploration of differential privacy, but it is built with Golang and the Privacy on Beam differential privacy library (which runs on Apache Beam), rather than Python, Flask, and a hand-coded version of differential privacy.

Top-viewed Articles on a Wikipedia

This tool fetches yesterday's top-viewed articles for a given wiki. First you see the actual data, i.e. accurate counts without any noise added. Then you see the same data after differential privacy (DP) has been applied (specifically, noise drawn from a Laplace distribution).

You can play around with the different hyperparameters to see how they affect the results. See this Facebook blogpost for a good worked example.

Language: which Wikipedia language to query.

Privacy Unit: which unit of privacy to use. Selecting "pageview" provides a guarantee that individual pageviews will be private, whereas "user" provides a guarantee that user sessions will be private. A user session can be capped at 1, 5, or 10 views per session, which covers most traffic (80-99%), depending on the cap and the size of the wiki.

Epsilon (ε): privacy parameter. Defaults to 0.1, but can also be 0.5, 1, or 2. The smaller you make it, the more privacy-preserving the differential privacy mechanism is, and the more noise is added, so the less accurate the results are.

Delta (δ): roughly, the probability that information about the database is accidentally leaked, i.e. that the ε guarantee fails to hold. The smaller you make it, the less likely a leak is. In Privacy on Beam, δ is also used to add noise to the threshold used to put a minimum bound on output values. Ideally, δ should be less than the inverse of any polynomial in the size of the database.

Sensitivity: the maximum amount that any individual can add to the result. With pageview-level privacy, this defaults to 1, as the maximum difference between two adjacent databases is 1 pageview. With user-level privacy, this can be set to 1, 5, or 10, to simulate varying thresholds for adjacent databases.
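The session cap under Privacy Unit and the Sensitivity setting are two sides of the same idea: a session can only be guaranteed to change the result by at most 1, 5, or 10 if its contributions are actually limited to that many views. The sketch below shows one simple way to enforce such a cap in Go. The View type and capPerSession function are hypothetical, and keeping the first views in input order is an assumption made for brevity; a real pipeline may sample which views to keep instead.

```go
package main

import "fmt"

// View is one pageview: the session that made it and the article viewed.
type View struct {
	Session string
	Article string
}

// capPerSession keeps at most maxViews pageviews per session (in input
// order) and returns per-article counts of the views that were kept.
func capPerSession(views []View, maxViews int) map[string]int {
	kept := make(map[string]int)   // views kept so far, per session
	counts := make(map[string]int) // resulting count, per article
	for _, v := range views {
		if kept[v.Session] >= maxViews {
			continue // this session has used up its contribution budget
		}
		kept[v.Session]++
		counts[v.Article]++
	}
	return counts
}

func main() {
	// Hypothetical sessions and articles, for illustration only.
	views := []View{
		{"s1", "A"}, {"s1", "B"}, {"s1", "A"},
		{"s2", "A"},
	}
	fmt.Println(capPerSession(views, 2)) // prints map[A:2 B:1]
}
```

With a cap of 2, the third view from session s1 is dropped, so removing any one session changes the counts by at most 2 in total.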