Part of the difficulty and sensitivity of working with Human Data is protecting the privacy of the individuals that data represents. The most important principle to remember is that if you do not need to be processing personally identifiable information (PII), then don’t. When the nature of the analysis does require you to work with this type of data, it needs to be anonymized. In this piece, I will take a look at some techniques that can be used to ensure that Human Data programs protect consumer privacy while retaining value as a behavior prediction tool.
- Encrypting personally identifiable information. If the existing structure of your data is important and you do not want to risk changing or upsetting it before you perform your analysis and testing, consider identifying the fields and data types that are personally identifiable and protecting just those: encrypt them with a strong key, or replace them with a salted one-way hash. This ensures that the integrity of the record itself remains intact, with no changes to its form; all you have done is obfuscate and secure the personal information stored in that record. For analysis done internally that will not be published or shared with affiliate companies, this is probably a satisfactory way to meet any anonymity requirements you have or plan to put in place.
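The salted-hash variant of this can be sketched in a few lines of Python. This is a minimal illustration, not a production scheme: the field names and the hard-coded key are hypothetical, and a real deployment would load the key from a secrets manager rather than the source file.

```python
import hmac
import hashlib

# Hypothetical secret key -- in practice, load this from a secrets
# manager; never hard-code it.
SECRET_KEY = b"replace-with-a-strong-secret"

def pseudonymize(value: str) -> str:
    """Replace a PII value with a keyed one-way hash (HMAC-SHA256).

    The record keeps its shape -- the field is still a string --
    but the original value cannot be recovered without the key.
    """
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"name": "Jane Doe", "zip": "90210", "purchases": 17}

# Obfuscate only the personally identifiable fields, leaving the
# behavioral data untouched for analysis.
for field in ("name", "zip"):
    record[field] = pseudonymize(record[field])

print(record["purchases"])  # behavioral data intact: 17
```

Because the hash is keyed, the same input always maps to the same token, so joins across tables still work internally without exposing the underlying values.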
- K-anonymity. K-anonymity refers to a technique where, if you release one record that contains particular identifiable information, you also release or use several other records that have the same or substantially similar values for those attributes. For example, if you are using an amalgamation of records from a certain zip code, ensure that more than one record from any given zip code is included in your sample set. This way, it is more difficult to single out one record and tie it back to an individual. K-anonymity traces its roots to 2002, when medical researchers sought a mathematical way to release medical data for scientific research without identifying patients and their individual diseases and treatments. The k is an integer: you can say your data set has 4-anonymity, for example, if every combination of identifying attributes in your table appears in at least four rows.
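One way to make this definition concrete is a small function that computes a data set's anonymity level: the minimum number of rows sharing any one combination of quasi-identifier values. The column names and sample rows below are hypothetical.

```python
from collections import Counter

def anonymity_level(rows, quasi_identifiers):
    """Return k such that the data set is k-anonymous with respect to
    the given quasi-identifier columns: every combination of those
    values appears in at least k rows."""
    counts = Counter(
        tuple(row[col] for col in quasi_identifiers) for row in rows
    )
    return min(counts.values())

# Hypothetical sample: zip code and age bracket are the quasi-identifiers;
# income is the sensitive attribute being studied.
rows = [
    {"zip": "02139", "age": "30-39", "income": 58_543},
    {"zip": "02139", "age": "30-39", "income": 90_893},
    {"zip": "02139", "age": "30-39", "income": 61_200},
    {"zip": "10001", "age": "40-49", "income": 164_000},
    {"zip": "10001", "age": "40-49", "income": 72_000},
    {"zip": "10001", "age": "40-49", "income": 55_000},
]

print(anonymity_level(rows, ("zip", "age")))  # 3 -> the set is 3-anonymous
```

A check like this is useful as a release gate: if the computed k falls below your threshold, the offending rows need further generalization or suppression before publication.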
- Generalization. Generalization is one of the ways k-anonymity can be achieved, although k-anonymity is more of a standard or a goal than an official technique for anonymizing data. When you generalize data, you remove specificity from it. Suppose you have a table that includes individual household income levels: $164,000, $58,543, $90,893, and $232,234, for example. Generalizing these specific numbers would mean reporting the values as "more than $150,000," "less than $60,000," "between $90,000 and $100,000," and "more than $225,000," respectively. Essentially, you take exact figures, establish baseline categories, and then obfuscate the data by assigning each figure to one of your categories, removing any sense of specificity from it.
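A generalization step can be sketched as a simple bucketing function. The bracket boundaries below are illustrative assumptions, not a fixed set of categories; in practice you would choose brackets wide enough to satisfy your k-anonymity target.

```python
def generalize_income(income: int) -> str:
    """Map an exact income to a coarse category (hypothetical brackets)."""
    if income < 60_000:
        return "less than $60,000"
    if income < 100_000:
        return "$60,000 to $100,000"
    if income < 150_000:
        return "$100,000 to $150,000"
    return "more than $150,000"

incomes = [164_000, 58_543, 90_893, 232_234]
categories = [generalize_income(i) for i in incomes]
print(categories)
```

The exact figures never leave the function; only the category labels appear in the released data.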
- Perturbation. Data perturbation is an anonymization technique that actually changes the data in a record set, albeit in a statistically insignificant way. There are several sub-techniques that can be put into play here, depending on the type of data you use and the level of anonymity your application requires. Perhaps the most useful for Big Data and Human Data applications is microaggregation: you sort the personally identifiable values in a given order (smallest to largest or largest to smallest, for instance), take groups of similarly sized numbers, average each group, and replace the specific values with that group's average. Suppose, for example, you had a table of six records, keyed 1 through 6, each containing an income figure.
You would group the income rows into three groups of two, average the incomes for each of these three subgroups, and then replace the income figures with those averages. So keys 1 and 2 would average $56,683.50, keys 3 and 4 would average $37,386.50, and keys 5 and 6 would average $30,511. The newly perturbed table would then look like the following:

| Key | Income     |
|-----|------------|
| 1   | $56,683.50 |
| 2   | $56,683.50 |
| 3   | $37,386.50 |
| 4   | $37,386.50 |
| 5   | $30,511.00 |
| 6   | $30,511.00 |
In this way, no individual income is available any longer, but you have not skewed the numbers in a way that would make analysis and prediction difficult.
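The microaggregation steps above can be sketched as follows. The individual incomes are hypothetical, chosen so that the pairwise averages match the figures in the text; the function itself is generic over any group size.

```python
def microaggregate(records, key, group_size):
    """Sort records by `key` (largest to smallest), average `key`
    within each consecutive group of `group_size`, and write the
    group average back over the individual values."""
    ordered = sorted(records, key=lambda r: r[key], reverse=True)
    for i in range(0, len(ordered), group_size):
        group = ordered[i:i + group_size]
        mean = round(sum(r[key] for r in group) / len(group), 2)
        for r in group:
            r[key] = mean  # mutates the original dicts in place
    return records

# Hypothetical six-record table; pairs average to the figures in the text.
table = [
    {"key": 1, "income": 58_000},
    {"key": 2, "income": 55_367},
    {"key": 3, "income": 38_000},
    {"key": 4, "income": 36_773},
    {"key": 5, "income": 31_022},
    {"key": 6, "income": 30_000},
]
microaggregate(table, "income", group_size=2)
print([r["income"] for r in table])
# [56683.5, 56683.5, 37386.5, 37386.5, 30511.0, 30511.0]
```

Note that aggregate statistics over the whole column (such as its sum and mean) are preserved exactly, which is why analysis and prediction are largely unaffected.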
There are other methods of data perturbation, including:
- Data swapping, where you swap pairs of values so that the total dataset still contains the same data but the values' locations and associated records are changed. This is somewhat less useful in Human Data, because it breaks the links between the behavioral insights within a record.
- Post-randomization, which replaces values at random according to a defined probability mechanism.
- Adding noise, which perturbs values by adding random offsets; on its own this is not very effective at anonymizing data, since it does not actually remove the links between records and individuals.
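As a brief illustration of the first of these, swapping a single column amounts to permuting that column's values across records: the released data set contains exactly the same values, but each value's association with its original record is broken. The sample records and the fixed seed below are assumptions, used only so the sketch is reproducible.

```python
import random

def swap_column(records, column, rng=None):
    """Randomly permute one column across records. The multiset of
    values in the column is unchanged; only their assignment to
    individual records is scrambled."""
    rng = rng or random.Random()
    values = [r[column] for r in records]
    rng.shuffle(values)
    for record, value in zip(records, values):
        record[column] = value
    return records

rows = [
    {"id": 1, "income": 40_000},
    {"id": 2, "income": 55_000},
    {"id": 3, "income": 72_000},
]
swap_column(rows, "income", rng=random.Random(0))
print(sorted(r["income"] for r in rows))  # same multiset of incomes
```

Column-level statistics survive the swap untouched, which is exactly why the technique is weak for Human Data: any analysis that depends on how a record's fields relate to one another is destroyed along with the links.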