Data privacy. There’s a lot of misinformation and overreaction when it comes to data privacy, but that’s in large part due to the fact that there’s so little data privacy. People are rightly concerned. In this episode Tom describes, very simply and generally, what differential privacy is and what you need to know about how it’s used.
On DTNS we try to balance the idea that companies definitely need to improve data protection with the idea that sharing data isn’t a bad thing in itself. In fact, when done right, it can be a very good thing.
It’s not just companies: academic researchers and nonprofits benefit from research on datasets too. However, just taking data, even when names are stripped off, can lead to trouble. As far back as 2000, researchers were showing that the right analysis of raw datasets could deduce who people were even when the data was anonymized. That year, Latanya Sweeney showed that 87% of people in the US could be identified from ZIP code, birthdate and sex.
One attempt to make data workable is called differential privacy. Apple mentioned the use of differential privacy in its 2016 WWDC keynote.
What is differential privacy?
An algorithm is differentially private if you can’t tell who anybody is by looking at the output.
Here’s a simple example. Let’s say you want to publish the aggregate sales data of businesses by category. Stores want to keep sales data private, so you agree that only the total sales for a category will be published. That way you can’t tell how much came from which business. Which is great until you come to the category of shark repellent sales. There’s only one shark repellent business in your region. If you publish that category you won’t be saying the name of the business, but it will be easy to tell who it is.
So, you have an algorithm that looks for categories where that’s a problem and maybe it deletes them or maybe it folds them into another category.
This can get trickier if, say, there’s a total sales number for the region and only one category was deleted. You just add up all the published categories and subtract that sum from the published total, and the difference is the missing business.
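To make that subtraction concrete, here’s a minimal Python sketch. The sales figures, the category names other than shark repellent, and the `publish_totals` helper are all made up for illustration; they aren’t from any real dataset or tool.

```python
# Hypothetical data: category -> sales figure for each business in it.
sales_by_business = {
    "groceries": [120, 80, 200],
    "hardware": [50, 70],
    "shark repellent": [30],  # only one business in this category
}

def publish_totals(data, min_businesses=2):
    """Publish per-category totals, dropping categories with too few businesses."""
    return {
        category: sum(figures)
        for category, figures in data.items()
        if len(figures) >= min_businesses
    }

published = publish_totals(sales_by_business)
regional_total = sum(sum(figures) for figures in sales_by_business.values())

# The attack: published regional total minus the sum of published categories.
recovered = regional_total - sum(published.values())
print(published)
print(recovered)  # the suppressed shop's sales figure
```

The suppression step did its job on the category list, but the regional total quietly gave the number back.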
And remember there’s other data out there to use. Some attacks on data use data from elsewhere to deduce identities. Let’s say you study how people walk through a park and you discover that of 100 people observed, 40 walk on the path and 60 cut through the grass. Seems private enough, right? There’s no leakage of data in the published results.
But an adversary discovers the names of the people who participated in the study. And they want to find out if Bob walks on the grass so they can embarrass him. They also found out that of the 99 people in the study who weren’t Bob, 40 walked the path and 59 walked on the grass. BINGO! Bob is a grass walker. Now I know it sounds unrealistic that the adversary got that much info without just getting all of it. But differential privacy would protect Bob’s identity even if the adversary had all that info.
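The Bob example boils down to comparing two counts. Here’s a small Python sketch of it; the `observations` dictionary and the `count_grass_walkers` function are hypothetical stand-ins for the study data and the query the adversary gets answered.

```python
def count_grass_walkers(observations):
    """Return how many participants cut through the grass."""
    return sum(1 for on_grass in observations.values() if on_grass)

# Hypothetical study: 99 people who aren't Bob (59 on the grass, 40 on the path).
observations = {f"person_{i}": i < 59 for i in range(99)}
observations["bob"] = True  # the fact the adversary wants to learn

published_count = count_grass_walkers(observations)  # the public result
count_without_bob = count_grass_walkers(
    {name: on_grass for name, on_grass in observations.items() if name != "bob"}
)

# One exact count minus the other isolates Bob's answer.
bob_walks_on_grass = (published_count - count_without_bob) == 1
print(published_count, count_without_bob, bob_walks_on_grass)
```

Neither count names anybody, yet the difference between them is exactly one person’s data.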
So what do we do? How do we do this differential privacy thing?
In 2003 Kobbi Nissim and Irit Dinur demonstrated that, mathematically speaking, you can’t publish arbitrary queries of a database without revealing some amount of private info. Thus the Fundamental Law of Information Recovery, which says that privacy cannot be protected without injecting noise. In 2006 Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam D. Smith published an article formalizing the amount of noise that needed to be added and how to do it. That work used the term differential privacy.
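To give a rough feel for what injecting noise means, here’s a minimal Python sketch that adds random noise to the grass-walker counts from the Bob example. It uses noise drawn from a Laplace distribution, an approach commonly called the Laplace mechanism. The names `epsilon`, `sensitivity` and `noisy_count`, and the value 0.5, are assumptions made for this illustration and aren’t from the episode; treat it as a sketch, not a production implementation.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return the count plus noise scaled to sensitivity / epsilon.

    sensitivity is how much one person can change the count (1 for a
    simple head count); a smaller epsilon means more noise and more privacy.
    """
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random()  # hypothetical setup; not cryptographically secure

# The adversary now only sees noisy versions of the two counts.
with_bob = noisy_count(60, epsilon=0.5, rng=rng)
without_bob = noisy_count(59, epsilon=0.5, rng=rng)
print(with_bob, without_bob, with_bob - without_bob)
```

Run it a few times and the difference between the two answers jumps around, so it no longer pins down whether Bob is a grass walker, while the counts themselves stay in the neighborhood of 60.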
A little bit on what that…