Anonymous Data? How Anonymous?

According to CCRi’s Ethical Principles, individual privacy is a right. When statistical microdata (data about individuals) is released to the public or other non-trusted parties, it must go through a process of anonymization to protect the privacy of the individuals represented. This discussion will analyze 1) the potential costs of insufficient anonymization, 2) a few broad categories of anonymization techniques, and 3) how technical professionals can most effectively communicate about anonymization. If you are a data professional who has not yet studied anonymization, this should serve as a good introduction.

What are the costs of insufficient anonymization?

Private information such as financial records, website credentials, personal conversations, and location data is widely digitized. Some of these data, when shared with a larger audience, are useful to businesses, academics, and policymakers, but insufficient anonymization can make it possible for malicious actors to target individuals or groups.

Spatiotemporal data is particularly sensitive and difficult to sufficiently anonymize. A recent New York Times study analyzed a commercially-available set of “anonymous” cell phone location data. By linking the commercial data’s quasi-identifiers (attributes that can be matched up with external data) with public records (a technique that has also been used to infer individuals’ political affiliations based on the location and duration of their Thanksgiving dinners), the journalists identified and tracked individual military officials, law enforcement officers, lawyers, performers, protestors, and a Secret Service agent tasked with protecting the President. It’s easy to imagine what could go wrong if that data fell into the wrong hands.

And “the wrong hands” could be anyone who has the means to buy the data, who is employed with the right company, or who is on the receiving end of a leak, as the Times journalists were. Even if malicious actors never gain inside access, hacks of media companies, retail stores, credit reporting agencies, banks, airlines, train companies, law firms, and defense contractors give us reason to believe that sensitive data will eventually be breached.

In recent months, COVID-19 has spurred interest in harnessing cell phone location technology to support “contact tracing,” which identifies people who came into contact with someone carrying a contagious disease. While there have been impressive efforts (most notably specifications designed by Apple and Google) to enable privacy-conscious contact tracing, there has already been at least one example of poorly-implemented contact tracing enabling a stalker and concerns among South Korean LGBTQ communities that their tracing program could result in forced outing.

Insufficiently-anonymized data can also cause problems when its precision conveys a false sense of certainty. In Arizona, a man was arrested and jailed, with devastating consequences for his reputation and livelihood, based on cell phone location data that was used as evidence that he was present at the scene of a murder. The data, collected under a geofence warrant (a general request for geolocation data in a particular area and time), exhibited questionable accuracy by placing the subject in two locations at once a number of times. It was still treated as sufficient evidence for an arrest. In another example, data from no more than seven cell phones at a virology lab in Wuhan were used as evidence that COVID-19 had escaped from a lab. The story, shared on Twitter by at least three members of Congress, has been heavily criticized for ignoring significant gaps in the geospatial data as well as other contextual data such as Facebook posts by lab employees. Yet the story and underlying report have fueled conspiracy theories.

Besides the risks to national and individual security as well as those of misled law enforcement and the spread of misinformation, many believe erosion of privacy is an erosion of human rights. Legal theorists have proposed thinking of privacy as part of the environment, and so in need of similar protection. Courts have awarded damages for “subjective” harms, such as emotional distress, due to breaches of privacy. Article 12 of the Universal Declaration of Human Rights explicitly marks privacy as fundamental to humanity, a position echoed in CCRi’s own Statement of Ethical Principles.

What anonymization methods are available?

Anonymization of data sets, like encryption of data, requires striking a balance. For encryption, the balance is between easily communicating among trusted parties and hiding sensitive data completely from an untrusted public. For anonymization, the balance is between relaying useful data to benevolent actors in the public and hiding sensitive information in the same data from malevolent actors in the same public. Encryption describes the balance in terms of (among other measurements) cipher suites (tools) and key lengths (security metrics). Anonymization has its own set of tool names and metrics to describe quality and degree of anonymity. Below is a brief overview of some of these tools and their associated metrics.

Descriptors and measures of data anonymity

k-Anonymity describes an individual’s level of protection from being uniquely identified by linking to external data sets. If a data set is 10-anonymous, then identifiers are generalized so that no fewer than ten individuals share the same generalized identifiers. If external data can be used to place an individual in some “bucket” of people represented in the data set, any one record in that bucket can be attributed to the individual with at most 10% probability. For example, someone knowing that I am age 20-30 and live in Charlottesville may find the ten (or more) records representing my “bucket” in a larger, 10-anonymous set of health records, but they would not know which of those ten or more records was mine.
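
As a rough illustration of the definition above, the k of a data set can be measured by grouping records on their quasi-identifiers and taking the size of the smallest group. This is a minimal sketch, not a production tool: the function name `k_anonymity`, the field names, and the sample records are all hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest bucket of records that share the same
    quasi-identifier values. The data set is k-anonymous for that k."""
    if not records:
        raise ValueError("k-anonymity is undefined for an empty data set")
    buckets = Counter(
        tuple(record[name] for name in quasi_identifiers) for record in records
    )
    return min(buckets.values())


# Hypothetical health records whose identifiers are already generalized.
records = [
    {"age_range": "20-30", "city": "Charlottesville", "diagnosis": "flu"},
    {"age_range": "20-30", "city": "Charlottesville", "diagnosis": "asthma"},
    {"age_range": "20-30", "city": "Charlottesville", "diagnosis": "flu"},
    {"age_range": "30-40", "city": "Charlottesville", "diagnosis": "diabetes"},
    {"age_range": "30-40", "city": "Charlottesville", "diagnosis": "flu"},
]

k = k_anonymity(records, ["age_range", "city"])
print(k)      # size of the smallest bucket
print(1 / k)  # upper bound on the chance of picking the right record
```

Which columns count as quasi-identifiers is itself a judgment call. Treating one more column as linkable can only keep k the same or shrink it.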

While k-anonymity limits the ability to identify an individual in a data set, t-closeness, p-sensitivity, and ℓ-diversity prevent disclosures of information about individuals who are externally known to be represented in a dataset. Consider the example above where I am known to be in a group of no fewer than ten people in a set of health records. If every person in my group of ten shares some sensitive health attribute (such as diagnosis of a particular disease), then k-anonymity won’t keep an adversary from learning my diagnosis. It’s enough to simply know I am in a particular bucket. If the bucket were ℓ-diverse, then there would be at least ℓ diagnoses possibly associated with me. t-Closeness and p-sensitivity measure different kinds of resistance to similar attacks that rely on knowing the bucket in which an individual is represented.
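
The attack described above can be made concrete with a short sketch. It measures the simplest variant, sometimes called distinct ℓ-diversity, by counting the different sensitive values in each bucket. Stricter variants exist and are not shown. The function name, field names, and records are hypothetical.

```python
from collections import defaultdict


def distinct_l_diversity(records, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values found in any
    bucket of records that share the same quasi-identifier values."""
    values_by_bucket = defaultdict(set)
    for record in records:
        key = tuple(record[name] for name in quasi_identifiers)
        values_by_bucket[key].add(record[sensitive])
    if not values_by_bucket:
        raise ValueError("l-diversity is undefined for an empty data set")
    return min(len(values) for values in values_by_bucket.values())


# Hypothetical records: every bucket holds three people (3-anonymous),
# but everyone in the first bucket has the same diagnosis.
records = [
    {"age_range": "20-30", "diagnosis": "flu"},
    {"age_range": "20-30", "diagnosis": "flu"},
    {"age_range": "20-30", "diagnosis": "flu"},
    {"age_range": "30-40", "diagnosis": "asthma"},
    {"age_range": "30-40", "diagnosis": "flu"},
    {"age_range": "30-40", "diagnosis": "diabetes"},
]

print(distinct_l_diversity(records, ["age_range"], "diagnosis"))
```

A result of 1 means at least one bucket discloses its members’ sensitive value outright, no matter how large that bucket is.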

These four metrics and their associated anonymization techniques help protect anonymity in controlled public data releases, with special care taken in series of releases based on the same private data.

However, different considerations apply when anonymizing data services that support ongoing, arbitrary user queries. ε-Differential privacy measures resistance to inference attacks on such services.
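
One widely used building block for ε-differential privacy is the Laplace mechanism, which adds calibrated random noise to each query result. The sketch below applies it to a simple counting query, where adding or removing one person changes the true answer by at most 1. It is an illustration under those assumptions, not a hardened implementation, and the names `laplace_noise` and `private_count` are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw from a Laplace distribution centered on zero.

    The difference of two independent exponential draws with mean `scale`
    follows Laplace(0, scale).
    """
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Answer "how many records match?" with noise of scale 1 / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    # One person changes a count by at most 1, so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1 / epsilon, rng)


ages = [23, 35, 41, 29, 52, 38, 27, 61]  # hypothetical data
for epsilon in (0.1, 1.0, 10.0):
    print(epsilon, private_count(ages, lambda age: age < 40, epsilon))
```

Smaller values of ε add more noise and so give stronger protection at the cost of accuracy. Each answered query also spends part of the overall privacy budget, which is why long-running services need to account for it.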

Each of the letters prefixed to these terms is a partial answer to the crucial question for data technologists: “How anonymized?” Each metric measures a trade-off between data privacy and utility. The k in k-anonymity refers to the smallest number of individuals represented in some “bucket” of an aggregated data set. As k is made larger, each individual is less identifiable, but what is known about the larger groups is less granular. t, p, ℓ, and ε (all having somewhat more complex explanations than that of k) likewise measure how “hardened” a data set is against re-identification.

Data professionals planning an anonymized data release or designing a data service must rely on contextual information such as the problem being addressed, data sensitivity, and adversary threat models to determine appropriate levels for these trade-off metrics. Determining “How anonymous should this data be?” is non-trivial and should be done by technologists familiar with all the above techniques and their various strengths and weaknesses.

These metrics provide a basis to answer two basic questions about anonymization of data: “How was the data anonymized, and to what extent?” Because method and degree selection are contextual, there is no universal anonymization “score,” just as there is no universal encryption “score.” But data professionals should communicate their choices and justifications. Doing so avoids leading stakeholders to an unrealistically simple understanding of how their or others’ data is being protected.

And, while avoiding overwhelming the audience with unnecessary detail, the anonymizer must avoid excessive generality. For example, while all the tools mentioned above involve aggregation of some type, one should never imply that aggregation necessarily anonymizes data. It does not. A Google search for the phrase “aggregated and therefore anonymous” returns 137 million results, 118 million of them coming from boilerplate cookie notice text. By communicating “aggregation is equivalent to anonymization” to consumers via these cookie notices, companies are reinforcing the idea that data protection is a simple matter of “doing the anonymization thing.” It also communicates naiveté, if not actual malice, on the part of the communicating company and calls into question the quality of their anonymization efforts.

Anonymization of geospatial data

We at CCRi pride ourselves on our domain knowledge of geospatial data, especially very large datasets and streams. This domain faces unique challenges when it comes to protecting privacy. Geospatial data can be tricky to anonymize, particularly when it is tied to temporal information. Varied datasets require varied anonymization approaches, and dynamic data such as regularly-updated aggregation services require yet another set of techniques.

Perhaps the simplest approach is aggregation. While aggregation is not necessarily sufficient for strong anonymization, it’s a useful tool to move data towards anonymization. For example, information about all points within a given postal code may be presented as counts or averages. Or, points in tracks may be treated as quasi-identifiers and grouped into k-anonymous buckets. But aggregation at this level loses significant granularity.

More finely-tuned techniques fall under the term “geomasking.” One popular approach, called “donut masking,” displaces points in a donut-shaped area around the original point. The “donut hole” prevents the point from being accidentally placed too close to its original location. The outer bound prevents points from being displaced too far, excessively degrading the information quality. k-Anonymity under this approach may be calculated by counting the number of locations (for example, residences and buildings) that are closer to the displaced point than the original point is.
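
A minimal sketch of donut masking follows. It assumes planar coordinates in meters rather than raw latitude and longitude, and its names (`donut_mask`, `spatial_k`), radii, and grid of residences are hypothetical. The k calculation uses one common reading of spatial k-anonymity: it counts candidate locations that lie no farther from the displaced point than the original point does.

```python
import math
import random


def donut_mask(x, y, r_min, r_max, rng):
    """Displace (x, y) by a random distance in [r_min, r_max] in a random direction."""
    if not 0 <= r_min < r_max:
        raise ValueError("expected 0 <= r_min < r_max")
    angle = rng.uniform(0, 2 * math.pi)
    # Sampling the squared radius uniformly spreads points evenly over the
    # donut's area instead of crowding them near the inner edge.
    radius = math.sqrt(rng.uniform(r_min ** 2, r_max ** 2))
    return x + radius * math.cos(angle), y + radius * math.sin(angle)


def spatial_k(original, masked, candidates):
    """Count candidate locations no farther from the masked point than the original is."""
    displacement = math.dist(original, masked)
    return sum(1 for c in candidates if math.dist(c, masked) <= displacement)


rng = random.Random(42)  # seeded only so the sketch is repeatable
home = (500.0, 500.0)
# Hypothetical residences on a 100 m grid.
residences = [
    (float(x), float(y)) for x in range(0, 1001, 100) for y in range(0, 1001, 100)
]

masked = donut_mask(*home, r_min=50, r_max=300, rng=rng)
print(masked, spatial_k(home, masked, residences))
```

In a real release the random generator would not be seeded with a published value, since anyone who can replay the random draws can undo the displacement.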

Within geomasking, a variety of tools are available to match various shapes of input data. While simple donut masking may be sufficient for homogeneously-populated areas like a city suburb, the same algorithm may leave people trivially identifiable when applied to data that includes both urban and rural areas. In heterogeneously-populated areas you need more dynamic approaches. “Adaptive random perturbation” is one such approach that shows promise. It adapts the geomask size to the characteristics of the area being masked to ensure k-anonymity.

Activity-based information sets such as social media check-ins (points) or GPS travel records (tracks), in which each individual is associated with many points, are more difficult to anonymize than data with a one-to-one relationship between points and individuals. Researchers found that in a dataset containing hourly location “pings” from 1.5 million people, similar to what may be gathered by cell service providers or cell phone apps, 95% of individuals can be identified from only four points. Geomasking techniques like donut masking are ineffective when too many points are recorded. If, for example, there are 100 points recorded in and around someone’s home, donut masking will simply place the person’s home in the center of the donut hole. (Analogously, if points are perturbed along a known track such as a road, de-obfuscation may be as simple as “snapping” points back along the known path.) Depending on the target use for the data, the anonymizer could apply a clustering algorithm to identify the “home” location and then apply geomasking. It is less clear how masking would be applied if not all points make clusters, as in data covering a person’s commute.
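
The weakness of masking many points around one home can be seen in a small simulation. The sketch below masks the same hypothetical home location repeatedly and independently, then measures how far the centroid of the masked points lands from the true location. The coordinates, radii, and function names are assumptions made for illustration.

```python
import math
import random


def donut_mask(x, y, r_min, r_max, rng):
    """Displace (x, y) by a random distance in [r_min, r_max] in a random direction."""
    angle = rng.uniform(0, 2 * math.pi)
    radius = math.sqrt(rng.uniform(r_min ** 2, r_max ** 2))
    return x + radius * math.cos(angle), y + radius * math.sin(angle)


def centroid(points):
    """Return the average position of a list of (x, y) points."""
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def recovery_error(home, n_points, r_min, r_max, rng):
    """Mask the same location n_points times and return the distance between
    the true location and the centroid of the masked points."""
    masked = [donut_mask(*home, r_min, r_max, rng) for _ in range(n_points)]
    return math.dist(home, centroid(masked))


rng = random.Random(7)
home = (1000.0, 2000.0)  # hypothetical planar coordinates in meters
for n in (1, 10, 100, 1000):
    print(n, round(recovery_error(home, n, 50, 200, rng), 1))
```

With a single point the error can never be smaller than the inner radius. As the number of points grows, the displacements tend to cancel and the centroid drifts back toward the donut hole, which is the home.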

Associating temporal with geospatial data compounds the difficulty of anonymization. “A point representing someone’s location at some point in time” may not reveal anything sensitive. But “a point representing someone’s location at 3AM” may reveal the location of someone’s home. While most examples here focus on geospatial data, strong anonymization must consider all dimensions.

If the data recipient is somewhat trusted, then less aggressive anonymization techniques may be sufficient. For example, if the recipient is a researcher within a health organization who wants to apply a clustering algorithm, data could be obfuscated by simply translating and rotating a set of points. A motivated adversary could find a way to reverse the transformation. But for some recipients, it may be sufficient to simply ensure that the data is not trivially identifiable.
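
The translate-and-rotate idea can be sketched in a few lines. Because the transformation is rigid, every pairwise distance is preserved, so a distance-based clustering algorithm should see the same structure in the obfuscated points. The function names, the sample points, and the choice to rotate about the origin are assumptions.

```python
import itertools
import math
import random


def rigid_transform(points, angle, dx, dy):
    """Rotate points about the origin by `angle` radians, then shift by (dx, dy)."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [
        (x * cos_a - y * sin_a + dx, x * sin_a + y * cos_a + dy)
        for x, y in points
    ]


def pairwise_distances(points):
    """Return the distance between every pair of points, in a fixed order."""
    return [math.dist(a, b) for a, b in itertools.combinations(points, 2)]


rng = random.Random()
# The angle and offsets are the secret. Anyone who learns them can reverse the transform.
angle = rng.uniform(0, 2 * math.pi)
dx, dy = rng.uniform(-1e5, 1e5), rng.uniform(-1e5, 1e5)

points = [(0.0, 0.0), (30.0, 40.0), (35.0, 45.0), (400.0, 10.0)]  # hypothetical
obfuscated = rigid_transform(points, angle, dx, dy)

print(pairwise_distances(points))
print(pairwise_distances(obfuscated))  # same values, up to floating-point rounding
```

The preserved distances are also what make the approach weak. An adversary who can match even a few obfuscated points to known locations can solve for the rotation and offset.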

A key part of creating and maintaining an effective anonymization scheme is good communication both within the team charged with anonymization and with the consumers of the data. It is easy to either over-simplify or over-complicate.

How should we communicate about anonymization?

CCRi’s Ethical Principles state that “Users should be given meaningful information and empowering choices.” Specifically, “Information must be summarized enough to be useful, but detailed enough not to be misleading.” When communicating anonymized data or about anonymized data, we must consider the audience and tailor our messages for clarity, conciseness, and completeness.

A typical lay-person’s understanding of anonymized probably assumes irreversibility, and that definition has official support. Merriam-Webster’s definition of anonymization, for example, implies anonymized data cannot be re-identified, because the original source “cannot be known”:

to remove identifying information from (something, such as computer data) so that the original source cannot be known : to make (something) anonymous

Given this definition, the question “How anonymized is the data?” would be nonsense – a data set is either anonymized or it is not. Legal standards also support this dichotomy. Chinese and European law require irreversible anonymization, though the courts have had difficulty enforcing or even fully understanding that standard.

Recent news headlines, often with “anonymous” in scare quotes, show us that the term has been used to gloss over the incredible technical challenges associated with disconnecting data from individuals. For example, the word has been used to market cell phone location data which has proven almost trivial to re-identify.

There is a critical disconnect between how anonymization is marketed and how it is actually performed. That disconnect is not a mere side-effect of different professional areas’ varied lexicons. Behind this clash of definitions are competing interests. The surveilled desire complete, irreversible anonymity; surveillants desire useful, marketable data or, in some cases, something serving more nefarious ends.

Data scientists are charged with striking the proper balance between those needs. They must apply anonymization techniques to a sufficient extent, as contextually determined by practical, ethical, and legal standards. But, they must do so without degrading the anonymized data beyond some threshold of usefulness determined in cooperation with the stakeholders who wish to use that data or its products. Thus, for data professionals, anonymization is a matter of degrees. For them, the question “How anonymized is the data set?” is both sensible and crucially important.

Thankfully, there is a linguistic tool at our disposal. By qualifying the word anonymized – for example by describing a data set as “heavily anonymized” – we can make the adjective explicitly gradable and force the audience to discard the more typical but overly-optimistic definition.

Lessons from Information Security

Security professionals have dealt with a similar problem with the word encrypted. By a strict technical definition, text encrypted with the humorously-insecure ROT-13 algorithm is “encrypted,” although a non-technical person with five minutes’ training could decrypt ROT-13 encoded text with a pen and paper. However, a responsible e-commerce website would never describe its site as “encrypted” if ROT-13 were its encryption algorithm. It would be disingenuous, because consumers equate “encrypted” with “secure,” and it might expose the company to legal action under consumer protection laws. For example, companies that handle credit card transactions without Payment Card Industry Data Security Standard (PCI DSS) compliant encryption could run afoul of the GDPR in Europe or the GLBA in the US.
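
To show just how weak ROT-13 is, here is a short sketch using the text transform that ships with Python’s standard `codecs` module. The wrapper name `rot13` and the sample message are hypothetical.

```python
import codecs


def rot13(text):
    """Shift each letter 13 places through the alphabet. Other characters pass through unchanged."""
    return codecs.encode(text, "rot_13")


message = "Meet at noon"
scrambled = rot13(message)
print(scrambled)         # Zrrg ng abba
print(rot13(scrambled))  # Meet at noon
```

There is no key at all. Applying the same function a second time restores the original text, so anyone who recognizes the scheme can read the message.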

Website maintainers encrypting site communications have one major set of tools at their disposal that does not yet exist for data professionals anonymizing sensitive information: well-defined standards, such as PCI DSS. Web browser vendors and security professionals have together defined lists of acceptably-secure encryption algorithms. These lists are regularly updated, as certain vulnerabilities are uncovered for certain encryption algorithms. And, users are clearly warned when a website attempts to use insecure encryption. ROT-13, though technically a type of encryption, is not on anyone’s list of “secure” ciphers.

Given these clear and commonly-accepted standards, website maintainers can ethically describe their properly-configured website as “encrypted.” In this case, the technical professionals have made certain that the user’s basic understanding of the word encryption at least approaches the reality of its technical implementation.

Not all encryption problems are as clear-cut as website encryption. In contexts where standards are less well-defined, cryptographers will often clarify with qualifiers like “heavily encrypted,” or “encrypted with military-grade cryptography” (explicitly invoking an external standard). By qualifying the term, the technical professionals 1) help the lay-person understand that encryption is not a simple yes/no question in a given context, and 2) offer a word or phrase to serve as the basis for further research.

Thus, qualification and clarification help enable informed decisions. Such efforts to be honest and upfront stand in stark contrast to tech companies’ linguistic sleight-of-hand in security marketing – for example, substituting the blanket term encryption for the more specific phrase end-to-end encryption to gloss over important limitations of security features that do not actually use end-to-end encryption.

Qualifying anonymization

Unlike the standards for encryption on e-commerce sites, widely-accepted standards for data anonymization do not yet exist. After all, cryptography has been around for thousands of years, while the explosion of publicly-available personal data is relatively recent.

Until better anonymization standards exist and are widely understood (by technical and non-technical audiences alike), data professionals should always qualify the term anonymous and its derivatives when communicating with stakeholders. Simply affixing grading adverbs such as heavily and carefully is a decent start. They achieve the first benefit described above with respect to encryption: they help the lay person understand that anonymization is not a simple yes/no question.

To achieve the second benefit of qualifying terms – to offer a basis for deeper understanding of the techniques used – we should take advantage of the anonymization techniques mentioned in the previous section. Each comes with some quantifying value. That value can be mentioned and then justified for the data in question.

Anonymizing your data

Anonymization is difficult. A proper balance must be struck between data utility and identifiability. Finding that balance requires an understanding of current anonymization techniques and current re-identification attacks. Anonymization of geospatial data requires unique expertise in that domain. Experts must take care to communicate their efforts without either over-simplifying or over-complicating the process for a given audience.

CCRi specializes in building strong geospatial data solutions and communicating about them in a way that maximizes mutual understanding and utility. We also exercise a commitment to professional ethics as outlined in our Statement of Ethical Principles. We hope this overview of the state of anonymization technology will help technologists more effectively protect the right to privacy and more effectively communicate about this relatively young and quickly-developing field.