
Estimating the success of re-identifications in incomplete datasets using generative models

– Nature

This study reveals the ease with which research participants can be re-identified based on supposedly anonymized personal data

Robust research depends on high-quality medical, behavioral, and socio-demographic data. Large volumes of personal data are collected on study participants in order to draw reliable conclusions. However, this large-scale data collection presents privacy risks for the participants involved.

This research, carried out by an international team of researchers, describes a new method for estimating the risk that a participant in a given dataset can be re-identified. Using this model, the authors found that 99.98% of Americans could be correctly re-identified in any dataset using 15 demographic attributes.
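To make the idea of re-identification risk concrete, the sketch below counts how many records in a small table become unique as more attributes are combined. It is only an illustration on hypothetical toy records, not the authors' generative model: the records, the choice of attributes, and the `unique_fraction` helper are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy records: (ZIP code, birth year, gender). Not real data.
records = [
    ("02139", 1985, "F"),
    ("02139", 1985, "M"),
    ("02139", 1990, "F"),
    ("02140", 1985, "F"),
    ("02139", 1985, "F"),
    ("02140", 1972, "M"),
]

def unique_fraction(rows, n_attributes):
    """Fraction of rows whose first n_attributes values occur exactly once."""
    keys = [row[:n_attributes] for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)

# Combining more attributes leaves fewer people sharing the same profile.
for n in range(1, 4):
    print(n, "attribute(s):", round(unique_fraction(records, n), 2))
```

In this toy table, no record is unique on ZIP code alone, but most are unique once birth year and gender are added. A naive count like this only describes the records at hand; the study's contribution is estimating how likely a matching record is to be the right person in the full population when the dataset is incomplete.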

Research datasets are supposed to be anonymized prior to sharing, so that they are no longer regarded as personal data. However, this study demonstrates that de-identification methods are not robust enough to guarantee participants’ privacy. The consequences, the authors note, are potentially serious for participants who are identified: it could affect their insurance status, employment, and relationships.

Even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.
