Privacy Enhancing Technologies: An Introduction for Technologists
Privacy Enhancing Technologies (PETs) are technologies that provide increased privacy or secrecy for the persons whose data is processed, stored and/or collected by software and systems. Three PETs that are valuable and ready for use are: Differential Privacy, Distributed & Federated Analysis & Learning, and Encrypted Computation. They provide rigorous guarantees for privacy and, as such, are becoming increasingly popular ways to provide data for analysis while minimizing the exposure of private information.
30 May 2023
- What are PETs?
- Differential Privacy
- Distributed and Federated Analysis and Learning
- Encrypted Computation
- Related Technologies
- Engineer Privacy In
Privacy Enhancing Technologies are in the news fairly regularly, with open calls from NIST and the UK government, Singapore and Europe to determine how and where these technologies can and should be used. As a developer, architect or technologist, you might have already heard about or even used these technologies, but your knowledge might be outdated, as research and implementations have shifted significantly in recent years.
This introduction takes you through the most prominent technologies that provide solid privacy guarantees. At the end of this article, you'll better understand where you might apply them and how to get started. These learnings were hard won, and further detailed in my newly released O'Reilly Book Practical Data Privacy. I wrote the book to share practical shortcuts and advice and to significantly reduce the learning curve to confidently using privacy technologies. By demystifying the field of privacy engineering, I hope to inspire you to build privacy into your architectures, applications and data flows from the start.
What are PETs?
Privacy Enhancing Technologies (from here on: PETs) are technologies that provide increased privacy or secrecy for the persons whose data is processed, stored and/or collected by software and systems. These technologies are often used as a part of this processing and modify the normal ways of handling (and often, hoarding) raw or plaintext data directly from users and internal participants, such as employees. By increasing the privacy offered, you are both reducing the owned risk and providing users with better choices about how they'd like their data to be treated.
Privacy is a technical, legal, political, social and individual concept. In this article, you'll learn the basic technical aspects of privacy to enable more choices for users to navigate their identity and information when they interact with systems. There are, of course, many other aspects to building privacy into products. For now, these are out-of-scope for this article, but I can highly recommend exploring Privacy by Design and diving deeper into the fields of privacy and security engineering.
The proliferation of machine learning systems, often trained on person-related data, has increased the threat surface for privacy. Systems like ChatGPT, Stable Diffusion and other large language and vision models provide fun new ways of interacting with machine learning and can be transformative or useful for particular tasks. Unfortunately, they also use massive quantities of personal data, often without consent or opt-out options, and are trained under murky labor conditions. These are not only open privacy concerns, but also ownership concerns, reflected in several ongoing lawsuits from creators, coders and people who would rather their data not be used for training.
These systems sometimes produce unknown and new risks, as exposed in expanding research on how to extract private information and training data population information directly from the models themselves. There is also significant research on how generative AI reproduces data very close to the training data, and the general issues of memorization of extremely large models. This memorization is very risky, as it can expose outliers whose location in the encoded space is inherently sparse, and, therefore, quite telling. Due to space and time constraints, I'll leave out the societal, ethical and environmental risks of these models and their use.
Thankfully, these problems are receiving increased attention and there is growing awareness of the risks at hand. It's no longer acceptable to simply hoover up all data with complete disregard for users' wishes and rights. It is not cool to randomly scrape data and post it publicly as part of your "research". And companies, along with governments, are starting to ask how they can keep doing data science and useful data analysis while also giving users more choice, transparency and consent options.
Privacy technologies are one way to align the needs of data science with those of user consent, awareness and privacy. Until recent years, these technologies were mainly in research and innovation labs. In the past 5 years, they've moved out of the lab and into production systems. These are not the only ways to provide people with better privacy, but they're a good start for organizations that are already well on their data maturity journey, with a need to build better privacy into both current and new data systems.
Which PETs should you know about?
In this post, you'll learn only about PETs that:
- are ready for production systems, provided you have engineering teams to architect, integrate, monitor and manage them
- provide rigorous and scientific guarantees for privacy
- are future-proof and increasing in usage
I want you to learn these first, and then evaluate related technologies, so you start with the best choices and deviate from them only when you've exhausted the primary modern technologies.
Differential Privacy

Differential privacy is a rigorous, scientific definition of how to measure and understand privacy, and today's "gold standard" for thinking through problems like anonymization. It was first formalized in 2006 by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith, and has since been extended by many researchers, including Aaron Roth. The original definition and its implementations have vastly expanded, and differential privacy is now in daily use at several large data organizations like Google and Apple.
Differential privacy is essentially a way to measure the privacy loss of an individual. The original definition defines two databases, which differ by the addition or removal of one person. The analyst querying these databases is also a potential attacker looking to figure out if a given person is in or out of the dataset, or to learn about the persons in the dataset. Your goal, as database owner, is to protect the privacy of the persons in the databases, but also to provide information to the analysts. But each query you answer could potentially leak critical information about one person or several in the database. What do you do?
As per the definition of differential privacy, you have two databases that differ by one person, who is either removed or added. Suppose an analyst queries the first database—without the person—and then queries the second, comparing the results. The information gained from those results is the privacy loss of that individual.
Let's take a concrete example from a real-world privacy implementation: the US Census. Every 10 years the US government attempts to count every person residing in the US only once. Accurately surveying more than 330 million people is about as difficult as it sounds, and the results are then used to support things like federal funding, representation in the US Congress and many other programs that rely on an accurate representation of the US population.
Not only is that difficult from a data validation point of view; the US government would also like to provide privacy for the participants, thereby increasing the likelihood of honest responses and protecting people from unwanted attention from people or organizations that might use the public release nefariously (e.g. to link their data, contact them or otherwise use their data for another purpose). In the past, the US government used a variety of techniques to suppress, shuffle and randomly alter entries in hopes this would provide adequate privacy.
It unfortunately did not, especially as consumer databases became cheaper and more widely available. Using solver software, Census Bureau researchers were able to attack previous releases and reconstruct 45% of the original data, using only a few datasets available at low cost. Imagine what an attacker could do with a consumer database covering a large portion of Americans.
For this reason, they turned to differential privacy to provide rigorous guarantees. Let's use a census block example. Say you live on a block where only one person is a First American, another term for Native American. You might simply not include that person, as a way to protect their privacy.
That's a good intuition, but differential privacy actually provides a way to quantify how much privacy loss that person incurs by participating, and to use that calculation to decide when to respond and when not to respond. To figure this out, you need to know how much one person can change any given query. In the current example, the person would change the count of First Americans by 1.
So if I am an attacker and I query the database for the total count of First Americans before the person is added, I get 0; if I query after, I get 1. This means the maximum contribution of one person to this query is 1. In differential privacy, this maximum contribution is called the sensitivity of the query.
Once you know the maximum contribution and, therefore, the sensitivity, you can apply what is called a differential privacy mechanism. This mechanism can take the actual answer (here: 1) and apply carefully constructed noise to the answer to add enough room for uncertainty. This uncertainty allows you to bound the amount of privacy loss for an individual, and information gain for an attacker.
So let's say I query beforehand and the number I get isn't 0, it's actually 2. Then, the person is added and I query again, and now I get an answer of 2 again — or maybe 3, 1, 0, or 4. Because I can never know exactly how much noise was added by the mechanism, I am unsure if the person is really there or not — and this is the power of differential privacy.
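To make this concrete, here is a minimal sketch of the Laplace mechanism for a counting query. The function and parameter names are my own illustration, not taken from any particular library:

```python
import numpy as np

def laplace_count(true_count, sensitivity=1.0, epsilon=1.0, rng=None):
    """Return a noisy count: the true answer plus Laplace noise
    with scale sensitivity / epsilon."""
    if rng is None:
        rng = np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Two neighboring databases: one without the person (count 0), one with (count 1).
print(laplace_count(0), laplace_count(1))
# Both answers are noisy, so an attacker comparing them cannot tell
# whether the person is actually present.
```

A smaller epsilon means more noise and therefore more privacy; a larger epsilon means more accurate answers and greater privacy loss.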
Differential privacy tracks this leakage and provides ways to reduce and cleverly randomize some of it. When you send a query, there is a probability distribution over the results that can be returned, where the highest probability lies close to the real result. But you could also receive a result within some error range around it. This uncertainty inserts plausible deniability, or reasonable doubt, into differentially private responses, which is how differential privacy guarantees privacy in a scientific and real sense. While plausible deniability is a legal concept, allowing a defendant to offer a plausible (or possible) counterargument which could be factual, it can be applied to other situations. Differential privacy, by its very nature, inserts some probability that another answer could be possible, leaving space for participants to neither confirm nor deny their real number (or even their participation).
Sure, sounds nice... but how do you actually implement it? Probabilistic processes called differential privacy mechanisms assist in providing these guarantees. They do so by:
- creating bounds for the original data (to remove the disparate impact of outliers and to create consistency)
- adding probabilistic noise with particular distributions and sampling requirements (to increase doubt and maintain bounded probability distributions for the results)
- tracking the measured privacy loss variable over time to reduce the chance that someone is overexposed.
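The three steps above can be sketched together in one toy example. This is my own illustration of a differentially private mean, not a production mechanism; all names are invented:

```python
import numpy as np

class PrivateMeanQuery:
    """Toy differentially private mean: clip contributions, add
    calibrated Laplace noise, and track the privacy budget spent."""

    def __init__(self, total_budget=1.0):
        self.total_budget = total_budget
        self.spent = 0.0

    def mean(self, values, lower, upper, epsilon, rng=None):
        if self.spent + epsilon > self.total_budget:
            raise RuntimeError("privacy budget exhausted")
        if rng is None:
            rng = np.random.default_rng()
        clipped = np.clip(values, lower, upper)          # 1. bound the data
        sensitivity = (upper - lower) / len(clipped)     # one person's max effect
        noise = rng.laplace(0.0, sensitivity / epsilon)  # 2. add calibrated noise
        self.spent += epsilon                            # 3. track privacy loss
        return clipped.mean() + noise

query = PrivateMeanQuery(total_budget=1.0)
ages = np.array([23.0, 35.0, 41.0, 29.0, 62.0, 38.0, 54.0, 47.0])
print(query.mean(ages, lower=18, upper=90, epsilon=0.5))
```

Once the budget is exhausted, the query object refuses to answer, which is how overexposure of any one individual is prevented.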
Libraries implementing these mechanisms usually integrate into the data engineering or preparation steps, or into machine learning training. To use them appropriately, you'll need to have some understanding of your data, know the use case at hand and set a few other parameters to tune the noise (for example, the number of times an individual can appear in the dataset).
Differential privacy isn't going to replace all data access anytime soon, but it is a crucial tool when you are being asked questions around anonymization. If you are releasing data to a third-party, to the public, to a partner or even to a wider internal audience, differential privacy can create measurable safety for the persons in your data. Imagine a world where one employee's stolen credential just means leaking fuzzy aggregate results instead of your entire user database. Imagine not being embarrassed when a data scientist reverse engineers your public data release to reveal the real data. And imagine how much easier it would be to grant differentially private data access to internal use cases that don't actually need the raw data—creating less burden for the data team, decreasing risk and the chance of 'Shadow IT' operations popping up like whack-a-mole.
Differential privacy fits these use cases, and more! If you'd like to walk through some examples, I recommend reading Damien Desfontaines' posts on differential privacy and testing out some of the libraries mentioned, like Tumult Analytics. The book's repository also has a few examples to walk through.
It should be noted that differential privacy does indeed add noise to your results, requiring you to reason about the actual use of the data and what you need to provide in order for the analysis to succeed. This is potentially a new type of investigation for you, and it promotes thinking through the privacy vs. utility problem, where you want to maximize both the information available for the particular use case and the privacy offered. Most of the technologies in this post will require you to analyze these tradeoffs and make decisions. To be clear, no data is ever 100% accurate because all data is some representation of reality; these tradeoffs just become more obvious when implementing privacy controls.
Distributed and Federated Analysis and Learning
Martin Fowler previously featured the concept of Datensparsamkeit, also known as data minimization, an idea of using only the data you actually need and not collecting or harboring any additional data. With this concept in mind, distributed or federated analysis (and their machine learning counterparts) leave data at the edge, in the original data storage and on user devices in order to guarantee data minimization. Instead of taking the data and storing it centrally, you ship the analysis, machine learning model and training or processing directly to the data and only collect the results.
In today's data science, you are often already dealing with distributed data. Your data is stored across data centers, machines and containers, and this distribution is abstracted via an interface or framework, such as your Apache Spark code. Distributed or federated analysis and learning call for a larger network, pushing the federation of the actual physical storage directly to the edge, or at minimum across several large data formations.
Federated learning was first implemented by Google in 2016, although there were plenty of examples of doing edge computing and data analysis across distributed devices before that time. Their initial implementation took user phones and used the local keyboard data to train language models for better keyboard predictions. Instead of collecting sensitive keyboard data centrally, which would have likely raised eyebrows and regulatory pressure, they deployed distributed aggregators that coordinate the training rounds and collect the gradient updates for each training round from the phones. These updates are then averaged and sent to all participants for a new training round. The model is shared across all devices, but the training data for each individual stays on their device.
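The round structure described above is often called federated averaging. Here is a highly simplified sketch with a linear model and three simulated clients; the setup and all names are my own illustration, not Google's implementation:

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training: a few gradient steps on a linear model.
    The raw data (X, y) never leaves this function."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(global_weights, clients):
    """The aggregator ships the model out, collects updated weights
    and averages them; it never sees any client's data."""
    updates = [local_update(global_weights, X, y) for X, y in clients]
    return np.mean(updates, axis=0)

rng = np.random.default_rng(42)
true_w = np.array([2.0, -1.0])
# Three clients, each holding its own private local dataset.
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + rng.normal(scale=0.1, size=50)))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, clients)
print(w)  # close to [2.0, -1.0], learned without centralizing any data
```

In a real deployment, the aggregator and clients communicate over a network, clients drop in and out between rounds, and secure aggregation or differential privacy protects the individual updates.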
There are now many extensions of this initial implementation, also allowing for federated data analysis, where instead of training a machine learning model, a query or other data analysis is run across the devices and returns an aggregated result. There's also been significant work to incorporate differential privacy or leverage encrypted computation to improve privacy and secrecy protections for these gradient updates or aggregate responses, which can also leak information about the underlying data. There are a growing variety of statistical and machine learning algorithms that support federated approaches, along with varied architectures for deploying and managing edge compute or cross-silo setups. Cross-silo setups join two or more data partners who want to use a distributed setup for shared analysis or learning instead of sharing the raw data without privacy protections.
Distributed or federated analysis and learning are a good fit for any organization working directly with highly sensitive data that should never be centralized. They also work well for data sharing use cases, replacing setups where partners currently share raw or poorly anonymized data across or within organizations.
Distributed data enables true Datensparsamkeit and could be considered every time a team is asking for more data to be collected from users. Storing personal data centrally is a naïve way to do data science and analysis, generating essentially endless new risks and encouraging murky business models. Asking users for consent, removing unnecessary data, getting creative with shipping data analysis, machine learning or other processing to the edge rather than collecting data are the habits to form now to make your work and organization consensual, user-empowered and privacy-first.
If you want to explore federated learning further, take a look at Flower, and run a few of their examples for whatever machine learning framework you normally use. If you want to learn more about federated architectures, take a look at my InfoQ talk and review the insightful summary paper, Advances and Open Challenges in Federated Learning, written by experts across several large organizations and institutions working on federated learning.
The privacy and security guarantees offered by federated learning can be enhanced by using encrypted computation, which allows the participants to encrypt their contributions. Encrypted computation and encrypted learning offer new ways to securely compute on distributed data. Let's explore this technology in the next section.
Encrypted Computation

What if I told you that you could actually compute on data without decrypting it? Sounds like magic, right? It's not—it's cryptography! The field of encrypted computation has experienced massive growth and new breakthroughs in the past 5 years, moving these technologies out of research labs and into production systems.
You're likely already familiar with encryption at rest for your data or file storage, and with end-to-end encryption used in web development and many secure messaging and file transfer applications. Encrypted computation encrypts data differently from those two. Normally, when you encrypt data, you insert quite a bit of randomness to hide any potential information left in the ciphertext; this fits your security model and needs in those use cases. In encrypted computation, you still encrypt the plaintext, but do so using either cryptosystems or protocols like secret sharing which allow you to keep computing on the encrypted data. At the end, you can decrypt the final result of your computations and it will reveal the real result, as if you had computed with plaintext data.
How does this affect privacy? In cryptography, you often regard privacy differently—let's call this new concept of privacy secrecy. If you want to keep a value secret, you want to control exactly who can see it, and how and when it is revealed. Obviously, this also benefits privacy since it gives more control over access to unencrypted data. Additionally, it provides an extra layer of protection by enabling computation without actually revealing the individual inputs. The final analysis can then only be revealed with consent and participation of the original persons.
There are two major branches of the field: homomorphic encryption (HE) and secure multi-party computation (MPC). Homomorphic encryption uses cryptosystems that have homomorphic properties and follows a more traditional cryptographic protocol, where you have a key that is used to encrypt and one used to decrypt. HE systems are computationally expensive, but can be accelerated with specialized hardware or optimizations based on your particular use case—especially if you have a small input size.
Secure multi-party computation is built for data sharing encryption use cases, where multiple parties compute something together or in a communal setting (like elections, auctions or cross-organization scenarios). The data is encrypted using a variety of MPC protocols, chosen to fit the particular scenario's security, participant requirements and use case. One popular choice is secret shares, which allow you to take a secret value and split it into encrypted shares that can be distributed among participants. When multiple players contribute shares, the group can compute values together and combine them at the end to reveal the decrypted result of the shared computation. As you might guess, MPC protocols require several interactions, which means network latency, synchronization and encrypted message size are your biggest performance factors.
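Secret shares can be illustrated with additive sharing over a finite field. This is a toy sketch of the core idea only; real MPC protocols add much more, including multiplication protocols and protections against malicious participants:

```python
import random

Q = 2**61 - 1  # a large prime; all arithmetic is modulo Q

def share(secret, n_parties=3):
    """Split a secret into n additive shares that sum to it mod Q.
    Any n-1 shares together reveal nothing about the secret."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

# Two parties each share a private value...
a_shares = share(25)
b_shares = share(17)
# ...each participant locally adds the shares it holds...
sum_shares = [(a + b) % Q for a, b in zip(a_shares, b_shares)]
# ...and only the combined result is ever revealed.
print(reconstruct(sum_shares))  # 42
```

Notice that addition happens share-by-share without any party seeing the other's input; only the final reconstruction reveals the (aggregate) answer.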
Encrypted computation is a great replacement for plain-text operations that expose sensitive data in unwanted ways. For example, you could use homomorphic encryption to have users submit sensitive data and get a result via your algorithm or system. The user would be the only person who can decrypt the result, but the result was produced by your system on the encrypted input. Or, you can use MPC to replace current plaintext data sharing and computation with partners to create actual secrecy, and therefore more privacy, for the data you bring to the computation. You can architect these computations in ways such that only one or more parties can reveal the final output, helping you design data sharing systems with clear protections.
There are many other use cases for this technology, like voting, auctions and confidential computing. If you'd like to explore it further, check out Zama.ai's work on homomorphic encryption, resources from the MPC Alliance or Morten Dahl's introduction to secret sharing. There are also Jupyter notebooks from my book repository, and the encrypted computation chapter covers the fundamental building blocks of these protocols and shows how you can use them in real data science and encrypted learning setups.
There are two specific use cases I frequently see in current data architectures that can be improved using encrypted computation. I'll outline them here to accelerate your encryption use in your own architecture.
Finding Joins: Private Set Intersection (PSI)
Private Set Intersection is an application of encrypted computation that allows two or more parties to compare their datasets and find intersections without revealing the values directly. This technology could replace much of today's insecure data sharing, which is used to identify shared users for marketing or data processing purposes.
Instead of sharing identifiers, such as emails, usernames or phone numbers, the organizations encrypt these identifiers using specific cryptosystems which allow them to compare the encrypted identifiers and find the matching identifiers. There are some security caveats in how this is implemented and choices regarding performance optimizations—particularly if the organizations have mismatched dataset sizes. This intersection step can be combined with further encrypted computation to analyze the intersection or additional data related to these identifiers without decrypting the intersection. This provides the added benefit that no human will see the direct intersection in decrypted space.
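One classic construction is a Diffie-Hellman-style PSI: both parties exponentiate hashed identifiers with secret keys, and because exponentiation commutes, doubly "encrypted" values match exactly when the underlying identifiers match. The sketch below is my own toy illustration; a real implementation would need hardened group and parameter choices:

```python
import hashlib
import random

P = 2**127 - 1  # a prime modulus for the toy group

def hash_to_group(item):
    """Map an identifier to a group element via hashing."""
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest, "big") % P

def psi(set_a, set_b):
    key_a = random.randrange(2, P - 1)  # party A's secret exponent
    key_b = random.randrange(2, P - 1)  # party B's secret exponent
    # Party A sends H(x)^a for its items; B raises those to b, and vice versa.
    a_once = {item: pow(hash_to_group(item), key_a, P) for item in set_a}
    b_once = {item: pow(hash_to_group(item), key_b, P) for item in set_b}
    a_twice = {pow(v, key_b, P) for v in a_once.values()}
    b_twice = {item: pow(v, key_a, P) for item, v in b_once.items()}
    # Matching doubly-exponentiated values reveal membership, not raw values.
    return {item for item, v in b_twice.items() if v in a_twice}

emails_a = {"ada@example.com", "grace@example.com", "alan@example.com"}
emails_b = {"grace@example.com", "alan@example.com", "edsger@example.com"}
print(psi(emails_a, emails_b))  # only the shared identifiers
```

Each party sees only exponentiated values from the other side, never the other side's raw identifiers.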
There are several concrete examples—including code—in the book and book repository if you are interested in learning more.
Private Queries: Private Information Retrieval (PIR)
Private Information Retrieval allows a person to request information, like a database query, without revealing their query or request to the database owner. It leverages encrypted computation building blocks to do so. This is particularly useful when the data owner holds extremely sensitive and private data, such as lab results or highly confidential documents. By providing the user with request secrecy, you also enforce some plausible deniability—a key factor in guaranteeing privacy.
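A simple flavor of PIR is the two-server XOR scheme: the client sends each (non-colluding) server a random-looking selection vector, each server XORs together the records its vector selects, and XORing the two replies reveals only the requested record. A toy sketch, with names of my own invention:

```python
import secrets

def server_reply(database, selection_bits):
    """XOR together every record whose selection bit is 1.
    The bit vector looks random, so the server learns nothing."""
    reply = bytes(len(database[0]))
    for record, bit in zip(database, selection_bits):
        if bit:
            reply = bytes(a ^ b for a, b in zip(reply, record))
    return reply

def private_query(database, index):
    """Client side: build two vectors that differ only at the queried
    index, send one to each server, and XOR the replies."""
    n = len(database)
    v1 = [secrets.randbelow(2) for _ in range(n)]
    v2 = list(v1)
    v2[index] ^= 1
    r1 = server_reply(database, v1)
    r2 = server_reply(database, v2)
    return bytes(a ^ b for a, b in zip(r1, r2))

records = [b"lab:neg ", b"lab:pos ", b"lab:n/a "]
print(private_query(records, 1))  # b'lab:pos '
```

Because the two selection vectors individually look uniformly random, neither server can tell which record was requested, as long as they don't collude.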
Now that you've explored the best PETs in production today, let's analyze some related technologies in the broader category of privacy technologies.
Related Technologies

There are many related technologies, far too many for one post! In this section, I've selected several based on popularity or interesting properties. While the list is not exhaustive, it can provide an initial overview of other possible options for you when working with sensitive data.
Detecting PII

Detecting Personally Identifiable Information (PII) is a difficult, but necessary, problem for many organizations that manage person-related data. Over the past decade, organizations have increasingly applied mixed technologies (rulesets + machine learning) to better identify and label PII. If your organization doesn't already have a strong data governance and data privacy program, you'll need to start with the basics. Focus on creating appropriate documentation, PII labeling and building data governance and data privacy understanding before incorporating PETs into your daily work.
Format-preserving encryption for pseudonymization
There are use cases where PETs don't fit because you need to work with data in an unencrypted and centralized way. If this is the case, you enter the world of very basic privacy technologies, like pseudonymization. The best pseudonymization for general use is also cryptographic in nature, leveraging the field of format-preserving encryption to create unique and difficult to reverse engineer identifiers. It is not nearly as private or secret as the other technologies in this article, but it is much better than using raw, plaintext data in situations where an encrypted pseudonym would work! There are also several other forms of pseudonymization you should review if you have specific use case constraints or requirements, such as masking, tokenization and redaction.
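As a simpler stand-in, you can do keyed pseudonymization with an HMAC. This is not format-preserving encryption (for true FPE, such as NIST's FF1 mode, you'd use a vetted library), but it shows the core property: pseudonyms stay stable so joins still work, while reversing them requires the secret key. The key and function names below are my own illustration:

```python
import hmac
import hashlib

# Placeholder key: in practice, generate randomly and keep in a secrets manager.
SECRET_KEY = b"rotate-me-and-store-me-in-a-secrets-manager"

def pseudonymize(identifier: str) -> str:
    """Deterministic keyed pseudonym: same input + same key -> same output,
    but hard to reverse without the key."""
    digest = hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

print(pseudonymize("ada@example.com"))
```

Rotating the key breaks linkability across datasets, which can itself be a privacy feature or a data engineering headache, depending on your use case.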
Enclaves

Enclaves are secure compute environments, sometimes called Trusted Execution Environments (TEEs), that protect the processing of data from the rest of the computer. In a situation where you want to keep a running process—not individuals—private, enclaves are appropriate. They fit only very specific parts of the secrecy problem, where you don't trust your shared compute environment or the organization running your cloud, but can also be used along with other privacy technologies to add an additional layer of security. Generally, enclaves are most appropriate for state-level security problems, such as operating computer infrastructure in a hostile cloud. They are expensive and an improper fit for most privacy and secrecy problems faced by organizations processing sensitive data today.
Clean rooms

Clean rooms are a way to control the environment, software and context of data use, often used when the data analyst or scientist is not trusted and their work must be monitored. In theory, restricting access and observing the data scientist or analyst's work is enough to provide privacy. However, a qualified individual must then audit the work to confirm that no private data was released. Often, these clean rooms offer raw, plaintext access to the data, meaning it is extremely likely that the analyst or scientist learns something about the individuals in the dataset. Since you would need an equally qualified analyst or scientist to audit the activities, it is probably better to have that person do the analysis instead of outsourcing the work to an external party. Clean rooms are typically used by companies that want to do more with sensitive data, but are unfamiliar or unsophisticated with regard to modern privacy technologies. To offer stronger privacy guarantees, use the aforementioned recommended technologies to create solid data science and analysis environments. Have your analysts learn how to use and tune these technologies for their work instead of surveilling their work in hopes of catching poor practices. Technologies like clean rooms often create a false assumption of security, making them less safe than no privacy technology at all.
Synthetic data

Synthetic data systems create realistic-looking data from either real data or an understanding of the data. Using synthetic data instead of real data can support privacy in several parts of software and system design and development, such as debugging, testing, prototyping and system validation. Some synthetic data systems use safer methods that enhance privacy, and some are less safe. Sadly, it is not easy for a non-practitioner to recognize these differences.
If you are interested in reviewing machine learning synthetic data possibilities, take a look at Gretel.ai's work on creating differentially private synthetic data. You still need to learn about differential privacy to properly leverage this part of their software, but using differential privacy in your synthetic data is the safest choice if you plan on using machine learning. Otherwise, I recommend non-machine learning methods and diving deeper into the methodologies if you are asked to input any real data.
At some point in the future, I hope there will be widely available synthetic data systems that are always privacy-respecting; this would be a huge help to engineers, developers and data persons to safely test, model and experiment with the software, architecture and pipelines.
Engineer Privacy In
I hope you are inspired by this whirlwind tour of PETs and potential use cases—better informed and motivated to begin engineering privacy into your systems in real ways. This isn't a one-time or an absolute process, but instead an incremental and agile one—driven by the risk appetite, technological readiness and privacy awareness of your organization.
Any step forward to offer users more privacy, transparency and choice is a small win. If you find your organization is not ready for PETs, you can still work on evangelizing privacy and increasing awareness of the changing risk and technological landscape. Having a conversation about these topics as a regular part of product design and implementation opens new pathways to evolve PETs from a 'nice idea' into a real system.
If you are looking for ways to shift or change your career, investigate the growing field of privacy engineering. Privacy engineers hold the responsibility of designing, architecting, integrating and implementing PETs. I wrote Practical Data Privacy for data scientists and technologists who want to fundamentally change the way they implement data systems—enabling user choice and real privacy through better understanding of privacy technology.
A final note: privacy is much more than technology. It's personal, social, cultural and political. Applying technology to societal problems is often naive and even dangerous. Privacy technology is one tool of many to help address real inequalities in access to privacy and power in the world. It cannot and does not address the core issues of data access, surveillance and inequalities reproduced or deepened by data systems. These problems are multi-disciplinary in nature and require much expertise outside of our technical realm.
Conversation, awareness, multi-disciplinary teams and true shifts in data power and responsibility can fundamentally change the current gaps in privacy and create empowered, user-centric, privacy-aware software and systems. If you choose to take the next steps, you'll be one of many technologists who design, build and run user-centric data systems with privacy technology to support a future where data use is transparent, just and user-driven.
Special thanks to Lauris Jullien, whose feedback greatly improved this post.
30 May 2023: Published