Blinded Anonymization: a method for evaluating cancer prevention programs under restrictive data protection regulations

Stud Health Technol Inform. 2015:210:424-8.

Abstract

Evaluating cancer prevention programs requires collecting and linking data on a case-specific level from multiple sources of the healthcare system. Therefore, one has to comply with data protection regulations, which are restrictive in Germany and will likely become stricter in Europe in general. To facilitate the mortality evaluation of the German mammography screening program, with more than 10 million eligible women, we developed a method that does not require written individual consent and is compliant with existing privacy regulations. Our setup is composed of different data owners, a data collection center (DCC) and an evaluation center (EC). Each data owner uses dedicated software that preprocesses plain-text personal identifiers (IDAT) and plain-text evaluation data (EDAT) in such a way that only irreversibly encrypted record assignment numbers (RAN) and pre-aggregated, reversibly encrypted EDAT are transmitted to the DCC. The DCC uses the RANs to perform a probabilistic record linkage which is based on an established and evaluated algorithm. For potentially identifying attributes within the EDAT ('quasi-identifiers'), we developed a novel process, named 'blinded anonymization'. It allows selecting a specific generalization from the pre-processed and encrypted attribute aggregations, to create a new data set with assured k-anonymity, without using any plain-text information. The anonymized data is transferred to the EC, where the EDAT is decrypted and used for evaluation. Our concept was approved by German data protection authorities. We implemented a prototype and tested it with more than 1.5 million simulated records, containing realistically distributed IDAT. The core processes worked well with regard to performance parameters. We created different generalizations and calculated the respective suppression rates. We discuss modalities, implications and limitations for large data sets in the cancer registry domain, as well as approaches for further improvements such as l-diversity and automatic computation of 'optimal' generalizations.
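
The idea of choosing a generalization and computing its suppression rate without plain-text access can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the encryption scheme, so a keyed one-way hash (HMAC-SHA256) stands in here for the encrypted attribute aggregations, which means the decryption step at the EC is not reproduced. The quasi-identifier (age), the band widths, the key and all function names are hypothetical and chosen only for illustration.

```python
import hashlib
import hmac
from collections import Counter


def blind(value: str, key: bytes) -> str:
    """Keyed one-way hash; an assumed stand-in for the data owner's encryption."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def generalize_age(age: int, width: int) -> str:
    """Map an age to a band of the given width, e.g. 52 -> '50-54' for width 5."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def preprocess(ages, widths, key):
    """Data-owner side: one blinded token per record and generalization level."""
    return [
        {w: blind(f"{w}:{generalize_age(a, w)}", key) for w in widths}
        for a in ages
    ]


def suppression_rate(records, width, k):
    """Collection-center side: share of records in groups smaller than k.

    Only blinded tokens are compared; no plain-text value is needed.
    """
    counts = Counter(r[width] for r in records)
    suppressed = sum(n for n in counts.values() if n < k)
    return suppressed / len(records)


def anonymize(records, width, k):
    """Keep only tokens whose group has at least k members (k-anonymity)."""
    counts = Counter(r[width] for r in records)
    return [r[width] for r in records if counts[r[width]] >= k]


if __name__ == "__main__":
    key = b"hypothetical-shared-secret"  # assumption: known to data owners only
    ages = [50, 51, 52, 53, 54, 55, 56, 61, 62, 69]
    records = preprocess(ages, widths=(1, 5, 10), key=key)
    for width in (1, 5, 10):
        print(width, suppression_rate(records, width, k=2))
```

In this toy data set, coarser age bands lower the suppression rate at the cost of detail, which is the trade-off behind comparing several generalizations before selecting one.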

MeSH terms

  • Breast Neoplasms / epidemiology
  • Breast Neoplasms / prevention & control*
  • Confidentiality / legislation & jurisprudence*
  • Data Anonymization / legislation & jurisprudence*
  • Data Mining / methods*
  • Electronic Health Records / legislation & jurisprudence*
  • Female
  • Germany
  • Government Regulation
  • Humans
  • Medical Record Linkage / methods
  • Program Evaluation / methods