We spend a lot of time talking about how we’re identified in the digital world. But tell us, what is de-identification?
De-identification refers to the process of removing personal identifiers from a database so that the data cannot be attributed to specific individuals. This ensures that individual privacy is preserved while maintaining the utility of the database for analytics and other purposes.
For example, in a bank’s database, this could involve de-identifying an individual’s name, account number, address, contact details, and so on. The de-identified database could continue to have the user’s transaction history and other attributes, such as gender, city, and employment type. This allows the bank to analyse customers’ transaction patterns and use these insights to inform product design, all without accessing sensitive personal data.
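As a rough sketch of that bank example (the record schema and field names here are hypothetical assumptions, not a real banking schema), a first step might simply drop the direct identifiers while keeping the analytical attributes:

```python
# Hypothetical bank customer record; all field names and values are
# illustrative assumptions, not a real schema.
record = {
    "name": "Jane Doe",
    "account_number": "7654139854215",
    "address": "42 Main St",
    "contact": "555-0100",
    "gender": "F",
    "city": "Pune",
    "employment_type": "salaried",
    "transactions": [120.0, 89.5, 430.25],
}

# Attributes that directly identify the individual.
DIRECT_IDENTIFIERS = {"name", "account_number", "address", "contact"}

def de_identify(rec):
    """Drop direct identifiers, keeping attributes useful for analytics."""
    return {k: v for k, v in rec.items() if k not in DIRECT_IDENTIFIERS}

print(de_identify(record))
```

The de-identified record still supports analysis of transaction patterns by city, gender, or employment type, without exposing who the customer is.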
How has de-identification gained prominence in recent years?
In my experience, there are three primary factors that are driving enterprises to re-evaluate their current data handling practices:
- Increasing data breaches: A report by Breach Level Index – a project that tracks publicly disclosed information on data breaches – suggests that the number of data breaches in 2017 was almost double that in 2013. So, while cybersecurity measures are getting more sophisticated, so are the hacking attempts. Because of this, user data – itself an asset – is becoming a liability too.
- Introduction of stringent regulations: Regulations like GDPR and CCPA have tightened the rules governing enterprises and imposed restrictions on data collection, storage and processing.
- Growing user awareness: At present, users are a lot more conscious about who is using their personal data and for what purposes than they were a few years ago.
As a result, enterprises are feeling some pressure to adopt responsible ways of storing and processing user data. This is resulting in a gradual shift away from unscrupulous data use towards a more balanced approach between data usability and user privacy. This is where de-identification solutions come in.
With the rise of regulation addressing data processing, how has the approach to de-identification changed specifically?
Early understanding saw data de-identification as a binary state wherein the data was transformed from ‘Identified’ to ‘Anonymized’: a process that was regarded as inherently irreversible. But the notion of a binary state has evolved; we now recognise a spectrum with multiple shades of identifiability. As a result, a variety of de-identification approaches, tools, and algorithms have also emerged.
There are four key categories of identifiability within this spectrum:
- Identified Data directly identifies, or is directly linked to data that identifies, a specific individual (such as a name or email address).
- Identifiable Data cannot be attributed to a specific individual without the use of additional information (e.g. a lookup table). But there exists a known, systematic way to reliably create or re-create a link with identifying data.
- Anonymized Data is purged of all direct and indirect identifiers such that there is no systematic way for the data controller to re-identify individuals. However, given the abundance of data available on the internet and through other sources, a small risk of re-identification remains.
- Ideal-Anonymous Data represents the far end of the de-identification spectrum where data is completely anonymous with no risk of re-identification. This is extremely hard to achieve.
As you increase the privacy levels, data utility will naturally go down, but so will the regulatory oversight. That trade-off will determine which solution is adopted by which enterprise.
Figure 1: Categories of de-identification. Source: ‘Viewing the GDPR Through a De-Identification Lens: A Tool for Clarification and Compliance’, by Mike Hintze
You mentioned regulations are forcing enterprises to change their data handling practices and that regulatory oversight varies with the category of de-identification solution. Can you elaborate on that?
Regulations take into account the varying degrees of identifiability and so they specify different rules for different approaches. In the context of GDPR, identified data is subject to all constraints, including storage limitation, purpose limitation, data subject rights (access, rectification, erasure, portability etc.), data breach notification, and more.
Identifiable data, the second category, provides data controllers relief from certain obligations. It can be used beyond the purpose for which it was originally collected, it is accepted as a valid measure for meeting “privacy by design and default” requirements, and it provides relief from breach notification requirements.
Going a step further, anonymization relaxes the requirement to meet data subject rights, in addition to the exemptions that apply to identifiable data. Some forms of anonymized data may also fall outside the scope of GDPR altogether.
The last category of ideal-anonymous data is completely outside the scope of GDPR.
This graduated system of regulations incentivizes enterprises to adopt stricter de-identification approaches. However, there is ambiguity around which de-identification approach falls within which category and its corresponding set of regulations. A standardized certification process is needed to address this ambiguity.
What types of tools and approaches are available to help with de-identification?
There are multiple tools and approaches available:
- Noise addition/ Perturbation: Making random, statistically insignificant changes to a data set.
- Pseudonymization/ Tokenization/ Substitution: Replacing sensitive data with substitutes or pseudonyms.
- Data masking: Hiding sensitive parts of the data with either random characters or other data. For example, the account number “7654139854215” can be stored as “765XXXXXXXXXX”.
- Data suppression: Removing the sensitive data from the database altogether.
- Generalization: Grouping data. For example, generalizing age into ranges (< 20, 20-30, etc.).
- Permutation: Shuffling values within an attribute of a database. This maintains the statistical characteristics of the original database.
- Aggregation: Leveraging a database for only summarized values such as sum, mean, ranges, etc.
Modern solutions to de-identification deploy two or more of these tools. We are also seeing a transition away from static application of these tools towards dynamic application, that is, applying de-identification measures at every instance of database usage.
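A minimal Python sketch of a few of the techniques listed above, purely for illustration (the secret key, bucket sizes, and noise scale are assumed parameters, not recommendations for a production design):

```python
import hashlib
import hmac
import random

# Assumed secret key; in practice this would be stored and managed separately.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(value: str) -> str:
    """Pseudonymization/tokenization: replace a value with a keyed token."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:12]

def mask(account_number: str, visible: int = 3) -> str:
    """Data masking: keep the first few characters, hide the rest."""
    return account_number[:visible] + "X" * (len(account_number) - visible)

def generalize_age(age: int) -> str:
    """Generalization: bucket exact ages into ranges."""
    if age < 20:
        return "<20"
    lo = (age // 10) * 10
    return f"{lo}-{lo + 10}"

def perturb(values, scale=1.0, seed=0):
    """Noise addition/perturbation: small random changes to numeric data."""
    rng = random.Random(seed)
    return [v + rng.gauss(0, scale) for v in values]

def permute(values, seed=0):
    """Permutation: shuffle values within an attribute (column)."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

A real solution would combine several of these per attribute, as noted above, and apply them dynamically at query time rather than once at rest.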
Can you give us examples of companies that are providing these tools?
Identified data is the status quo, meaning that it is the default state in the absence of any de-identification solution.
Examples of ‘identifiable solutions’ include Anonos, which provides a pseudonymization solution in which sensitive identifiers are replaced by tokens. However, the tokens change dynamically, ensuring that the modified database cannot be linked to the original database without specific additional information.
‘Anonymized solutions’ are offered by Aircloak, which has created a proprietary algorithm called “Diffix” based on noise addition and aggregation. Diffix adds varying noise to each query dynamically, making the aggregated results imprecise; this approach does not require the use of direct or indirect identifiers. Separately, Mostly.ai has created an algorithm that learns the patterns and structure of existing data to generate a new ‘synthetic’ dataset. These synthetic records retain the statistical properties of the original data but have no direct link to actual individuals.
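The idea of query-seeded noisy aggregation can be sketched as follows. To be clear, this toy function is not Aircloak’s actual Diffix algorithm, just an illustration of the principle that noise tied to the query itself cannot be averaged away by repeating the query:

```python
import hashlib
import random

def noisy_count(rows, query_id: str, scale: float = 2.0) -> int:
    """Toy noisy aggregation (NOT the real Diffix): return a count perturbed
    by noise deterministically seeded from the query, so re-running the same
    query yields the same noisy answer and repetition cannot average it away."""
    seed = int(hashlib.sha256(query_id.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return len(rows) + round(rng.gauss(0, scale))
```

Because the seed is derived from the query, an analyst who repeats the same query receives the same perturbed result every time, while different queries receive independently perturbed results.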
‘Ideal-anonymous solutions’ are difficult to identify because the bar for true data anonymization is extremely high.
Where are we seeing adoption of these de-identification solutions?
The use-cases of these solutions are broad and numerous, as they can operate on a variety of different databases. However, the financial services and healthcare sectors are both early adopters, as both are data-rich and deal with sensitive data.
What are the challenges and risks embedded in de-identification?
A couple of popular examples of re-identification attacks are:
- “Netflix Prize”: Netflix launched a competition for participants to create a filtering algorithm that could predict film ratings based on past ratings. To support this, Netflix released a database comprising past ratings, movie information, and rating dates, with user details pseudonymized. Even though the database was de-identified in isolation, two researchers were able to identify users by cross-referencing background knowledge from IMDb. This led to concerns that a user’s viewing history could reveal sensitive information, and a closeted lesbian woman filed a lawsuit over the possibility of being outed.
- “New York Taxi”: In 2014, New York City released a taxi-trip database with details like pick-up and drop-off locations, times, and fare amounts. Rider details were not released, and the taxi and driver details were de-identified. Someone cross-referenced this de-identified database with public data, including images of celebrities getting in or out of taxis, and used it to isolate journey details for Bradley Cooper and Jessica Alba. With a small amount of auxiliary knowledge, anyone could do the same for a relative or colleague, figuring out the locations to which a given individual has travelled.
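A toy sketch of how such a linkage attack works in principle; every value below is fabricated for illustration:

```python
# "De-identified" trip records: rider names removed, but precise quasi-
# identifiers (pickup location and time) remain. All values are fabricated.
trips = [
    {"pickup": "Greenwich Village", "time": "2013-07-08 23:15", "fare": 21.5, "tip": 0.0},
    {"pickup": "SoHo",              "time": "2013-07-08 23:15", "fare": 9.0,  "tip": 2.0},
    {"pickup": "Greenwich Village", "time": "2013-07-09 01:40", "fare": 14.0, "tip": 3.0},
]

# Auxiliary knowledge: the attacker knows (e.g. from a photo) that the target
# got into a taxi in Greenwich Village at 23:15 on 2013-07-08.
matches = [
    t for t in trips
    if t["pickup"] == "Greenwich Village" and t["time"] == "2013-07-08 23:15"
]

# A unique match re-identifies the trip and exposes its remaining attributes.
if len(matches) == 1:
    print("Re-identified trip:", matches[0])
```

The attack needs no cryptographic sophistication: it only requires that the quasi-identifiers left in the “de-identified” data uniquely pin down one record.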
What would make de-identification technology more attractive for investors to support and easier for data managers to adopt?
I see three primary bottlenecks that are leading to long sales cycles and slower adoption of de-identification solutions. If solved, de-identification technology would undoubtedly increase in value and utility:
- Credibility of solution: Given the sensitive nature of the data being dealt with, the credibility of the selected solution is of paramount importance. In the absence of a standardized certification process, approaches like bug bounty challenges or validation through third parties can give investors and end-customers the required assurance.
- Integration: The products currently in the market are not easily integrable into the existing tech infrastructure of enterprises. Naturally, enterprises are wary of making changes to the backend to try something new. If this blocker can be addressed by enabling deployment through a SaaS platform or APIs, enterprises will likely be much more open to pilots and short-term contracts.
- Lack of customization: Every enterprise requires a unique balance between preserving user privacy and processing data under different use-cases. The flexibility to customize will provide convenience and accelerate adoption.
What changes to this sub-segment of privacy-enhancing technologies do you foresee that’ll support the adoption of de-identification solutions?
Firstly, from a technology or product perspective, I see these solutions evolving to allow processing on big data and unstructured databases, thereby expanding the depth and breadth of use-cases.
Secondly, I expect the sales process to become more streamlined as the market grows more educated. Specifically, I see data protection officers – the ideal buyers for these solutions – playing a more significant role in the future. At present, their role is largely administrative, while the decision-making and spending power rests elsewhere within the organization. As their role gains prominence, sales cycles will become shorter.
Thirdly, I foresee de-identification solution providers partnering with database or analytics solution providers to offer integrated solutions. Such partnerships could support large-scale deployment through the existing distribution channels of these potential partners.
Fourthly, for better or worse, regulation will continue to be a driver, more so as enforcement becomes stricter. The introduction of data protection regulations outside the EU and US will also expand the geographic scope.
Fifthly, de-identification could open up the possibility of monetizing data-sharing with third parties. This could prove to be an additional incentive for enterprises.
Finally, user privacy could become a differentiator and a competitive advantage in the long-term. We are already seeing Apple take this approach. As more enterprises realize this, such solutions will become a default, that is, ‘privacy by design’. At least, that is our hope!
Omidyar Network is collaborating with at least 13 venture capitalists and five other thought-leaders, including the National Venture Capital Association, to shift the current data paradigm toward a Race to the Top. We need a data economy that respects both innovation and customer values and human dignity. More regulation won’t solve this problem; we also need more “trust-first” businesses and technologies that meet changing customer demands and expectations for privacy and security, including de-identification solutions. VCs can help determine what types of businesses and technologies stand out and succeed, and they can put more capital in the market to encourage and fuel its growth.
Together, this group of VCs is co-creating a set of investment tools that will lead us to the most promising trust-first businesses and enable us to support portfolio companies in adopting stronger data practices. We welcome any investors or startups to join us in the journey to find more sustainable and responsible alternatives to the “collect everything, protect nothing” status quo.