Assessing privacy risks in population health publications using a checklist-based approach

Abstract

Objective: Recent growth in the number of population health researchers accessing detailed datasets, either on their own computers or through virtual data centers, has the potential to increase privacy risks. In response, a checklist for identifying and reducing privacy risks in population health analysis outputs has been proposed for use by researchers themselves. In this study we explore the usability and reliability of such an approach by investigating whether different users identify the same privacy risks when applying the checklist to a sample of publications.

Methods: The checklist was applied to a sample of 100 academic population health publications distributed among 5 readers. Cohen's κ was used to measure interrater agreement.

Results: Of the 566 instances of statistical output types found in the 100 publications, the most frequently occurring were counts, summary statistics, plots, and model outputs. Application of the checklist identified 128 outputs (22.6%) with potential privacy concerns, most of them associated with the reporting of small counts. Among these identified outputs, the readers found no substantial actual privacy concerns once context was taken into account. Interrater agreement for identifying potential privacy concerns was generally good.

Conclusion: This study has demonstrated that a checklist can be a reliable tool to assist researchers with anonymizing analysis outputs in population health research. It further suggests that such an approach could be developed into a broadly applicable standard providing consistent confidentiality protection across multiple analyses of the same data.

Keywords: data anonymization, confidentiality, privacy, biomedical research, health services research

INTRODUCTION

Access to data for population health and health services research is a powerful tool for influencing health policy and health promotion.1,2 At the same time, analysis of health data must comply with relevant standards of privacy and confidentiality. Increasingly, richer datasets are being created by linking (or integrating) datasets. For example, linking a cancer registry with hospital and/or pharmaceutical data can give a more complete picture of the factors influencing health outcomes. But richer datasets also contain more detail about individuals, so the risk to privacy is generally higher. Recent growth in the number of population health researchers using linked data may lead to an increase in risk from multiple analyses of the same data, suggesting the need for broadly applicable standards of confidentiality protection.

The objective of this study is to investigate the usability and reliability of a checklist-driven approach to assist researchers with assessing privacy risks and anonymizing outputs of statistical analysis in population health research conducted on datasets they either hold or access through a virtual data center such as the Secure Unified Research Environment (www.saxinstitute.org.au/our-work/sure/). This objective is achieved by applying a recently proposed checklist4 to a sample of published papers.
BACKGROUND

In response to the need to protect individuals' privacy during access and use of data for population health research, various approaches and methods have been developed.4–8 Currently, researchers holding datasets are responsible for anonymizing analysis results, while in most virtual data centers analysis results are checked for privacy risk by an expert at the hosting agency using guidelines.6 It is recognized, however, that in-house checking may not be feasible in the long term as demand rises.4,9

The majority of published anonymization guidelines, reidentification risk measures, and reidentification risk evaluation studies exclusively address the release of anonymized datasets for research.6–8,10–14 There are few published anonymization guidelines for virtual data centers. One example is the European Statistical System Centres and Networks of Excellence Statistical Disclosure Control guidelines,6 designed for use by experts within national statistical agencies to determine whether outputs generated by researchers could be released for use in publications or presentations. In another example, the Statistics New Zealand microdata output guide15 describes methods and rules researchers must use for anonymizing output produced from Statistics New Zealand's microdata before it is subject to manual checks by experts within Statistics New Zealand.

A previous paper4 reported the outcomes of a project to address the challenge of protecting the privacy of individuals whose data are made available for public health and health services research through a virtual data center. The recommended approach involved:

1. Data Preparation: Data custodians apply basic anonymization treatments to datasets before making them available to researchers through a secure interface.
2. Output Anonymization: Researchers identify potential privacy risks, further examine them in context, and then apply anonymization treatments as needed to reduce risks to acceptable levels.

To assist researchers who are not necessarily experts in statistics or statistical disclosure control with identifying and treating potential privacy risks in the outputs of statistical analyses in population health research, a checklist for output anonymization was developed.4 The checklist is suitable for use by researchers holding datasets or accessing them via virtual data centers. The checklist is used to identify outputs with potential privacy risks by applying a number of tests to each output. Researchers then need to examine any identified risks further, since an output failing a test does not mean there is a definite privacy risk. The actual privacy risk of an output will depend on the study context and, particularly, on other information that might be available, for example, external datasets that could be used to reidentify health information through data matching.

The checklist is organized by the types of statistical analysis output occurring in public health research results: statistics such as means, graphical outputs such as Kaplan-Meier plots, modeling outputs such as relative risks, and tables. For a given output type in the first column of the table, say a Kaplan-Meier plot, the Anonymity Test column presents the applicable anonymity tests, each in its own row. If an output fails an anonymity test (for example, if a Kaplan-Meier plot fails the individual data test by revealing actual dates of death), it is examined in context to determine whether a privacy risk remains.
For example, some plots reveal dates of death only as a number of days from an unknown start point, or report deaths only by month, and so would have low privacy risk. For any outputs with remaining potential privacy concerns, the Anonymization Treatment column explains how to reduce the potential privacy risk; in this case, by smoothing the plot or coarsening time periods in the underlying data. A final column provides notes, which could be deleted in a production version of the checklist. An important benefit of this checklist approach is that researchers can compare the original output with the anonymized output to ensure that the anonymization treatments applied do not adversely impact the statistical inferences and conclusions they have drawn.

Given a particular analysis output, the process involves 3 steps. The first step is:

Step 1: Locate the row of the checklist corresponding to that output type, apply each anonymity test in that row, and identify any output that fails a test as a potential privacy risk.

Thus, for example, if an output comprises a count (number), the threshold n test would identify the output as a potential privacy risk if it is less than n. This test is equivalent to requiring that if a combination of variables occurs for an individual in the data, then it occurs for at least n individuals (this property is often called n-anonymity). We emphasize that it is the reidentification risk of the analysis output that is of concern, not the reidentification risk of the dataset itself. However, attacks on analysis output often involve attempts to reconstruct data values, either exactly or with low uncertainty.3
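A minimal sketch of how the step 1 threshold n test might be applied to counts is shown below, in Python. The threshold value (n = 5), the variable names, and the toy records are assumptions chosen for illustration; they are not prescribed by the checklist.

```python
# Minimal sketch of the step 1 "threshold n" anonymity test for counts.
# The threshold value and the toy data are illustrative assumptions only.
from collections import Counter

THRESHOLD_N = 5  # assumed threshold; the checklist leaves the value of n to local policy

# Toy records: each tuple is one individual's combination of reported variables.
records = [
    ("female", "50-54", "area_A"),
    ("female", "50-54", "area_A"),
    ("male",   "50-54", "area_A"),
    ("male",   "85+",   "area_B"),  # a rare combination, so a small count
]

cell_counts = Counter(records)  # number of individuals per combination (table cell)

def flag_small_counts(counts, n=THRESHOLD_N):
    """Return the cells whose count is below n, i.e. potential privacy risks
    under the step 1 threshold n test (failures of n-anonymity)."""
    return {cell: k for cell, k in counts.items() if k < n}

# Every cell in this toy table fails with n = 5 and would be flagged for
# further examination in context before any treatment is applied.
print(flag_small_counts(cell_counts))
```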
Applying the checklist tests and treatments to outputs computed on all variables may not always be necessary for privacy protection, and we introduce the concept of quasi-identifier to explain why. Quasi-identifiers are background knowledge variables about individuals that an intruder could use, individually or in combination, to reidentify a record with high probability. Unlike direct identifiers, quasi-identifiers are normally useful for analytic purposes and cannot be used alone to uniquely identify an individual with certainty.16 The most common way that reidentification of a data record can occur is by matching quasi-identifiers to an external database containing those quasi-identifiers together with identifying variables such as names and addresses. Examples of quasi-identifiers are: sex, date of birth, and age; location (such as post codes or census geography); ethnic origin; total years of education; marital status; criminal history; income; profession; event dates (such as admission, discharge, procedure, or visit); codes; and country of birth.

Examples of variables that are generally not quasi-identifiers in the health setting are clinical observations such as blood pressure or body mass index. However, care must still be taken with such clinical observations, since they can present a reidentification risk in some situations. For example, an individual who is the only morbidly obese person in a dataset is at higher risk of being recognized and reidentified. When working with quasi-identifiers, note that attributes that are not quasi-identifiers in one situation can be quasi-identifiers in another. Further, attributes that are not quasi-identifiers today might become quasi-identifiers in the future, so this practical simplification should be implemented with care.

More generally, the actual privacy risk of any research output will depend on the study context and other information available, particularly external datasets facilitating matching of quasi-identifiers, and that context may change over time. For these reasons, it is generally not possible to give a definitive list of quasi-identifiers. More information on selecting and using quasi-identifiers can be found elsewhere.16,17 Where it is clear that the analysis output involves variables that are not quasi-identifiers, the researcher would apply a second (optional) step:

Step 2 (optional): Where step 1 identifies a potential privacy risk, apply the same anonymity test from the checklist to the sub-cell or restricted statistic determined by only the quasi-identifiers.

In the example of a count, step 2 would involve recomputing the value while disregarding variables that are obviously not quasi-identifiers. A consequence is that step 1 will normally identify a larger number of outputs as having associated potential privacy concerns; subsequent application of step 2 will normally reduce this number substantially. Finally, the researcher would conduct a third step:

Step 3: Where steps 1 and 2 identify a potential issue, apply one of the suggested anonymization treatments from the checklist.

The checklist was designed for the context of publication of results from Australian population health research using linked data. We first produced a draft checklist and then tested it on a sample of publications read by a small panel. The aim was to look for omissions and obscurities in the draft checklist, for example, outputs in the papers read that were not covered, and descriptions, tests, and treatments that were unclear or unhelpful. This led to a revised checklist.4

METHODS

In the present study we broadened the scope of publications considered but retained the emphasis on linked data, for two reasons: to give continuity with our previous work, and because linked data may a priori carry higher privacy risks due to the generally larger number of variables. We selected a new pool and sample of publications, and a panel of readers to apply the checklist as in steps 1 and 2 above. The readers were all academic researchers with master's or doctoral degrees in mathematics or statistics, but with very little or no experience in population health research, practical disclosure risk assessment, or anonymization. Despite our efforts to ensure completeness and clarity in the checklist, we recognized that its interpretation could be subject to variation, so we also checked for consistency of interpretation by having a subsample of the papers read by pairs of readers and then comparing their results using Cohen's κ.18

Our initial pool of publications was created from two sources:

1. All 64 journal publications using linked data provided by the Centre for Health Record Linkage in Australia (see www.cherel.org.au) published between 2009 and 2013, and
2. A random selection of 300 of the 2236 journal publications returned by the PubMed search engine (see www.ncbi.nlm.nih.gov/pubmed) for the years 2008–2013, with the keywords "record linkage" and full text available.

A sample of 100 publications was selected randomly from the pool of 364 publications. Each of the 5 readers was allocated 10 papers, and the remaining 50 papers were allocated in groups of 10 to the pairs of readers (1, 2), (2, 3), (3, 4), (4, 5), and (5, 1).
So each reader read 30 papers: 10 read alone and 20 that were also read independently by another reader. The sizes of the samples and the overlaps were chosen so that the reviewing task was a reasonable size for each reader while enabling a pairwise comparison of readers for consistency. Researchers would normally undertake a training session on the tests and checklist before using them. To simulate this in our study, each reader initially evaluated a common set of 3 publications and then participated in a discussion session to compare results.

To understand the prevalence of potential privacy risks identified by applying steps 1 and 2 of the checklist, we collected baseline data on the occurrence of statistical analysis output types in the publications. The readers categorized analysis outputs into the types listed in column 1 of the checklist and counted the number of publications containing at least one output of each type, to give some indication of the popularity of each output. So each output type was given a score of 1 or 0 (presence/absence) for each publication; counting the actual number of outputs of each type in each publication was considered unnecessary for the purpose intended.

The readers then used the anonymity tests in the checklist to identify potential privacy risks among the individual statistical outputs found in the publications, ignoring whether variables were quasi-identifiers or not (step 1). The extent of agreement between each pair of readers in their identification of outputs as potential privacy risks was determined by calculating Cohen's κ.18 This is a statistical measure of interrater agreement for categorical items and is generally regarded as more robust than a simple percent agreement calculation, since it takes into account agreement occurring by chance. Values of κ above 0.4 are generally accepted as representing fair to good agreement.19 Finally, the readers considered which variables were quasi-identifiers and reduced the collection of identified potential privacy risks accordingly (step 2).
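To make the agreement measure concrete, the sketch below computes Cohen's κ for two readers' binary flags (output identified as a potential privacy risk or not). The formula follows the standard definition,18 but the reader labels are hypothetical toy data, not results from this study.

```python
# Minimal sketch of Cohen's kappa for two readers rating the same items.
# The reader labels below are illustrative assumptions, not study data.
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each rater's marginal totals."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    marg_a, marg_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(marg_a[c] * marg_b[c] for c in set(marg_a) | set(marg_b)) / n**2
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Hypothetical example: 1 = output flagged as a potential privacy risk, 0 = not flagged.
reader_1 = [1, 0, 0, 1, 0, 0, 1, 0]
reader_2 = [1, 0, 0, 0, 0, 0, 1, 0]
print(round(cohen_kappa(reader_1, reader_2), 2))  # 0.71 for this toy data
```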
RESULTS

Baseline data on occurrence of statistical analysis output types

The readers identified 25 different output types in the sample, with a total of 566 occurrences. The majority of the outputs were of the following types:

- Summary statistics: count/number, percentage, mean, median, mode, ratio, maximum, minimum, percentile, variance, standard deviation
- Plots: scatterplot, histogram, Kaplan-Meier survival curve
- Modeling outputs: parameter estimate, relative risk, odds ratio, hazard ratio, confidence interval, P-value, chi-square

Among these, the most common classes of output were number, percentage, confidence interval, individual data, P-value, and odds ratio. Figure 1 shows the 25 output types as column labels and the number of publications in the sample containing at least one output of each type as the height of each column. Thus, for example, 57 publications in the sample reported a confidence interval, while 1 publication reported a mode. The most common types were number (91) and percentage (85); so, for instance, the output type number was present in 91 of the 100 papers (91%) and thus represented 16% of all outputs present in the sample.

Figure 1. Number of outputs of each type among the 566 statistical outputs identified in the sample of 100 publications, also showing the number of outputs of each type that were identified by one or more of the step 1 anonymity tests as applied by one or more readers.

A small number of papers in the sample did not include any statistical analysis of actual linked health data, including one that made no mention of data linkage in the text despite its appearing as a keyword.

Application of step 1 anonymity tests

There were 128 instances of individual statistical output types identified as potential privacy risks by one or more of the checklist anonymity tests by one or more readers (in step 1). If a type was identified more than once in a paper, it was counted only once (presence/absence again). Figure 1 displays the 128 instances by type; so, for example, 42 instances of a reported number and 2 instances of a scatter plot were identified. The most common output type identified was individual data. Specific reasons for identification in some of the common output classes were:

- Inclusion of overly detailed contextual information, such as extremely fine-grained location and dates of the study, that could be interpreted as individual data
- Reporting of summary statistics based on a small number of values (number, percentage, odds ratio, mean, frequency table, ratio)
- High precision/unnecessary number of significant figures (confidence interval, P-value)
- Nonsmoothed curves (Kaplan-Meier estimates)
- Graphs and plots below a threshold or revealing individual values (graph of grouped data)

It can be seen from Figure 1 that the published output types most commonly identified under the tests were individual data, scatter plots, maxima, minima, and Kaplan-Meier plots. As an example of the interpretation of these results, consider the output class Kaplan-Meier estimates: at least one Kaplan-Meier estimate occurred in 11 of the publications, and in 7 of these, one or more estimates were identified by at least one of the step 1 anonymity tests by at least one reader.

Note that there is a low correlation between the potential privacy risk of an output type and its frequency of occurrence. This is not unexpected, since an output would be included if it contributes to the analysis, regardless of its potential privacy risk. Of the 2 types always identified in step 1, one occurs in nearly half the papers read, the other in just 2 papers; the more frequently occurring one is, at least potentially, the more concerning.

Interrater agreement for step 1 anonymity testing

In this study, we found full agreement (κ = 1) between the 3 pairs of readers 1 and 2, 2 and 3, and 3 and 4. For readers 5 and 1, κ = −0.01, while for readers 4 and 5, κ = 0.3. Further investigation of the data from these 2 pairings reveals that the differences are either due to differences in the readers' labeling of output types when collecting the baseline data or can readily be addressed with researcher training.

Readers 4 and 5 differed in their reporting of output types across several papers. In one paper, reader 4 reported individual data, number, percentage, confidence interval, P-value, and odds ratio, while reader 5 reported number, percentage, percentile, scatter plot, confidence interval, P-value, chi-square, odds ratio, and logistic regression.
On inspection, the publication includes both a plot and the associated data, an odds ratio within the output of a logistic regression, and a chi-square test mentioned in the analysis but not reported in the text. So the difference between the readers is due to differences in how these nested output types were labeled and, consequently, in the counts they assigned. A check shows that the 2 readers' identification of potential risks was, however, effectively identical across all their common publications. Thus, although there is not full agreement between readers, the differences only affect the baseline counts and do not affect outcomes from the step 1 anonymity tests.

The difference between readers 1 and 5 is due to the fact that reader 1 did not identify individual data as potentially risky in any of the papers shared with reader 5. This discrepancy can be explained by a difference in training. All readers were trained on a common set of 3 publications; however, readers 2, 3, 4, and 5 had also participated in a similar exercise of applying a draft version of the checklist to a different sample of publications (see the "Background" section), while reader 1 was only involved in the current study. Subsequent investigation showed that the common training publications here did not contain any instances of individual data, so reader 1 had no training or experience with this output type. The fact that reader 1 had perfect agreement with reader 2 on their shared papers, which, as it happened, did not include any occurrences of individual data, strongly suggests that the checklist is otherwise clear.

Application of step 2 anonymity tests

Subsequently, the step 2 anonymity tests were applied. For each output identified as a potential privacy risk in step 1, the tests were reapplied, disregarding variables that were obviously not quasi-identifiers. None of the outputs identified by the step 1 threshold n test were subsequently identified by the step 2 threshold n test when counts based on only quasi-identifiers were considered. An example of such an output is a count of one child with diabetes of an unknown type in a reasonably sized geographical area (population 8 million) over a 5-year period. While this is a small count, the associated variables of geographic area, 5-year period, and diagnosis are considered unlikely to appear in an external database together with identifying information.

There were only 6 individual outputs identified by the step 1 anonymity tests that were still identified under the step 2 tests. Two instances were a confidence interval and a P-value given to an unnecessarily high number of significant figures. These could have been avoided without any loss of analytic utility by simply rounding the values to fewer significant figures.
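As an aside, rounding to a fixed number of significant figures is straightforward to apply in code. The sketch below is a minimal illustration in Python; the choice of 3 significant figures and the example values are assumptions for illustration rather than rules from the checklist.

```python
# Minimal sketch: round reported statistics to a fixed number of significant
# figures, a simple anonymization treatment for over-precise outputs.
# The choice of 3 significant figures is an illustrative assumption only.
from math import floor, log10

def round_sig(x, sig=3):
    """Round x to `sig` significant figures (zero is returned unchanged)."""
    if x == 0:
        return 0.0
    return round(x, sig - 1 - floor(log10(abs(x))))

print(round_sig(0.0473819))  # 0.0474 (e.g. an over-precise P-value)
print(round_sig(1.2837465))  # 1.28   (e.g. a confidence interval limit)
```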
Two Kaplan-Meier survival curves were cross-classified by at least one quasi-identifier (ethnicity) and were provided in a format that enabled the exact time of an event to be recovered to the nearest day. However, the time to event was calculated for each participant from an unknown time of birth, so the residual risk of reidentification by linkage to an external database was judged to be negligible in each case. Although none of these 4 cases had a high privacy risk in its study context, each could have been avoided by applying one of the suggested anonymization treatments without affecting the final conclusions of the publication.

The remaining 2 outputs involved one instance of the researchers providing an unnecessary amount of detail about their study name, location, and dates, and one instance of researchers describing an outlier using the values of some quasi-identifiers. In each case, the privacy risk was probably quite low, but it could have been avoided had the researchers provided slightly more general information, again without affecting the final conclusions of the publication.

DISCUSSION AND CONCLUSION

While data use agreements, access controls, and anonymization measures applied directly to population health and health services data do provide good privacy protection during access and use by trusted researchers, there can still be privacy concerns associated with analysis outputs published in the academic literature and other publicly accessible formats. Virtual data centers are becoming an increasingly popular setting in which researchers are given access to detailed confidential data. However, there are concerns that the current manual output-checking procedures will not scale up as demand for such services grows.

In seeking to explore options for meeting this challenge, this study has demonstrated the usability and reliability of a checklist designed to assist researchers with anonymizing outputs of statistical analysis in population health and health services research. The checklist can be used by researchers for datasets they either hold or access through virtual data centers. The checklist was applied to a sample of 100 academic population health publications distributed among 5 readers, and Cohen's κ was used to measure interrater agreement. Agreement between readers in the identification of potential privacy risks was generally good. This further suggests that such an approach could be developed into a broadly applicable standard providing consistent confidentiality protection across multiple analyses of the same data.

In addition to addressing the primary objective, the study found little evidence of actual privacy risk from published analysis outputs in our sample of publications involving statistical analysis of linked population health and health services data. However, we also observe that privacy risks could have been further reduced by applying simple anonymization treatments without affecting the final conclusions of the research.

Our study has underlined the need for training of researchers before they anonymize their own statistical analysis outputs using our checklist tool. Such training should cover the regulatory environment and privacy expectations and requirements; privacy risk and how disclosures can occur; use of the checklist for privacy risk assessment and application of anonymization treatments as needed; and, finally, how to ensure that anonymization treatments do not adversely impact statistical inferences and conclusions.4

Public health and medical journals are encouraging authors to submit supplementary materials, including datasets and additional analysis results, for online publication. The ease of online publication is likely to increase both the volume and detail of analysis results publicly available, which in turn is likely to increase privacy risk.4

ACKNOWLEDGMENTS

The authors thank Joseph Chien, Daniel Elazar, and Joanna Khoo for valuable contributions to the project, and the associate editor and reviewers for their helpful suggestions.
The first author thanks the Isaac Newton Institute for Mathematical Sciences, University of Cambridge, for support and hospitality during the data linkage and anonymization program, during which some work on this paper was undertaken.

FUNDING

This work was supported by the Population Health Research Network, an initiative of the Australian government conducted as part of the Super Science Initiative and funded by the Australian Government Department of Education and Training Education Investment Fund. This work was also partially supported by Engineering and Physical Sciences Research Council grant no. EP/K032208/1 and a grant from the Simons Foundation.

COMPETING INTERESTS

The authors have no competing interests to declare.

CONTRIBUTORS

CMO'K led the overall project, participated as a reader, and drafted and revised the manuscript. AI designed the study, participated as a reader, and analyzed the data. TC proposed the study and participated as a reader in the initial testing of the checklist. MW participated as a reader, contributed to the analysis, and contributed to the drafting and revision of the manuscript. MO'S managed the overall project, participated as a reader, and provided comments on the draft manuscript. AK participated as a reader.

REFERENCES

1. Safran C, Bloomrosen M, Hammond WE, et al. Toward a national framework for the secondary use of health data: an American Medical Informatics Association White Paper. J Am Med Inform Assoc. 2007;14:1–9.
2. Weiner MG, Embi PJ. Toward reuse of clinical data for research and quality improvement: the end of the beginning? Ann Intern Med. 2009;151:359–60.
3. O'Keefe CM, Chipperfield JO. A summary of attack methods and confidentiality protection measures for fully automated remote analysis systems. Int Stat Rev. 2013;81:426–55.
4. O'Keefe CM, Westcott M, O'Sullivan M, Ickowicz A, Churches T. Anonymization for outputs of population health and health services research conducted via an online data center. J Am Med Inform Assoc. 2017;24:544–49.
5. O'Keefe CM, Rubin DB. Individual privacy versus public good: protecting confidentiality in health research. Stat Med. 2015;34:3081–103.
6. Hundepool A, Domingo-Ferrer J, Franconi L, et al. Statistical Disclosure Control. Hoboken, NJ: John Wiley & Sons; 2012.
7. Duncan GT, Elliot M, Salazar-Gonzalez J-J. Statistical Confidentiality. New York: Springer; 2011.
8. Malin BA, El Emam K, O'Keefe CM. Biomedical data privacy: problems, perspectives, and recent advances. J Am Med Inform Assoc. 2013;20:2–6.
9. O'Keefe CM, Westcott M, Ickowicz A, et al. Protecting confidentiality in statistical analysis outputs from a virtual data centre. Working Paper, Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality. Ottawa, Canada; 2013. www.unece.org/stats/documents/2013.10.confidentiality.html. Accessed March 11, 2016.
10. Elliot M, Mackey E, O'Hara K, Tudor C. The Anonymisation Decision-Making Framework. UK Anonymisation Network. http://ukanon.net/wp-content/uploads/2015/05/The-Anonymisation-Decision-making-Framework.pdf. Accessed August 20, 2016.
11. US Department of Health & Human Services. Guidance Regarding Methods for De-Identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/. Accessed August 20, 2016.
12. Xia W, Kantarcioglu M, Wan Z, Heatherly R, Vorobeychik Y, Malin BA. Process-driven data privacy. In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM; 2015:1021–30.
13. Malin BA, Karp D, Scheuermann RH. Technical and policy approaches to balancing patient privacy and data sharing in clinical and translational research. J Invest Med. 2010;58(1):11–18. http://dx.doi.org/10.2310/JIM.0b013e3181c9b2ea. Accessed October 11, 2016.
14. El Emam K, Jonker E, Arbuckle L, Malin BA. A systematic review of re-identification attacks on health data. PLoS One. 2011;6(12):e28071.
15. Statistics New Zealand. Data Lab Output Guide. Wellington: Statistics New Zealand; 2011. www.stats.govt.nz/tools_and_services/microdata-access/data-lab.aspx. Accessed April 6, 2016.
16. El Emam K. A Guide to the De-identification of Health Information. New York: CRC Press; 2013.
17. Dankar FK, El Emam K, Neisa A, Roffey T. Estimating the re-identification risk of clinical data sets. BMC Med Inform Decis Mak. 2012;12(1):66.
18. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37–46.
19. Fleiss JL. Balanced incomplete block designs for inter-rater reliability studies. Appl Psychol Meas. 1981;5:105–12.

Abstract Objective Recent growth in the number of population health researchers accessing detailed datasets, either on their own computers or through virtual data centers, has the potential to increase privacy risks. In response, a checklist for identifying and reducing privacy risks in population health analysis outputs has been proposed for use by researchers themselves. In this study we explore the usability and reliability of such an approach by investigating whether different users identify the same privacy risks on applying the checklist to a sample of publications. Methods The checklist was applied to a sample of 100 academic population health publications distributed among 5 readers. Cohen’s κ was used to measure interrater agreement. Results Of the 566 instances of statistical output types found in the 100 publications, the most frequently occurring were counts, summary statistics, plots, and model outputs. Application of the checklist identified 128 outputs (22.6%) with potential privacy concerns. Most of these were associated with the reporting of small counts. Among these identified outputs, the readers found no substantial actual privacy concerns when context was taken into account. Interrater agreement for identifying potential privacy concerns was generally good. Conclusion This study has demonstrated that a checklist can be a reliable tool to assist researchers with anonymizing analysis outputs in population health research. This further suggests that such an approach may have the potential to be developed into a broadly applicable standard providing consistent confidentiality protection across multiple analyses of the same data. data anonymization, confidentiality, privacy, biomedical research, health services research INTRODUCTION Access to data for population health and health services research is a powerful tool for influencing health policy and health promotion.1,2 At the same time, analysis of health data must comply with relevant standards of privacy and confidentiality. Increasingly, richer datasets are being created by linking (or integrating) datasets. For example, linking a cancer registry with hospital and/or pharmaceutical data can give a more complete picture of factors influencing health outcomes. But richer datasets also contain more details about individuals, so the risk to privacy is generally higher. Recent growth in the number of population health researchers using linked data may lead to an increase in risk from multiple analyses of the same data, suggesting the need for broadly applicable standards of confidentiality protection. The objective of this study is to investigate the usability and reliability of a checklist-driven approach to assist researchers with assessing privacy risks and anonymizing outputs of statistical analysis in population health research conducted on datasets they either hold or access through a virtual data center such as the Secure Unified Research Environment (www.saxinstitute.org.au/our-work/sure/). This objective is achieved through a study involving the application of a recently proposed checklist4 to a sample of published papers. 
BACKGROUND In response to the need to protect individuals’ privacy during access and use of data for population health research, various approaches and methods have been developed.4–8 Currently, researchers holding datasets are responsible for anonymizing analysis results, while in most virtual data centers analysis results are checked for privacy risk by an expert of the hosting agency using guidelines.6 It is recognized, however, that in-house checking may not be feasible in the long term as demand rises.4,9 The majority of published anonymization guidelines, reidentification risk measures, and reidentification risk evaluation studies exclusively address the release of anonymized datasets for research.6–8,10–14 There are few published anonymization guidelines for virtual data centers. One example is the European Statistical System Centres and Networks of Excellence Statistical Disclosure Control guidelines,6 designed for use by experts within national statistical agencies, to determine whether outputs generated by researchers could be released for use in publications or presentations. In another example, the Statistics New Zealand Microdata output guide15 describes methods and rules researchers must use for anonymizing output produced from Statistics New Zealand’s microdata before being subject to manual checks by experts within Statistics New Zealand. A previous paper4 reported the outcomes of a project to address the challenge of protecting the privacy of individuals whose data are made available for public health and health services research through a virtual data center. The recommended approach involved: Data Preparation: Data custodians apply basic anonymization treatments to datasets before making them available to researchers through a secure interface. Output Anonymization: Researchers identify potential privacy risks, further examine them in context, and then apply anonymization treatments as needed to reduce risks to acceptable levels. To assist researchers who are not necessarily experts in statistics or statistical disclosure control with identifying and treating potential privacy risks in the outputs of statistical analyses in population health research, a checklist for output anonymization was developed.4 The checklist is suitable for use by researchers holding datasets or accessing them via virtual data centers. The checklist is used to identify outputs with potential privacy risks by applying a number of tests to the output. Researchers then need to further examine any identified risks, since an output failing a test does not mean there is a definite privacy risk. The actual privacy risk of an output will depend on the study context and, particularly, other information that might be available; for example, external datasets that could be used to reidentify health information using data matching. The checklist is presented according to the types of statistical analysis output occurring in public health research results: statistics such as means, graphical outputs such as Kaplan-Meier plots, modeling output such as relative risk, and tables. For a given output type in the first column of the table, say a Kaplan-Meier plot, the Anonymity Test column presents the applicable anonymity tests, each in its own row. If an output fails an anonymity test – for example, if a Kaplan-Meier plot fails the individual data test by revealing actual dates of death – then this is examined in context to determine if there remains a privacy risk. 
For example, some plots reveal dates of death in terms of number of days from an unknown start point, or report deaths only by month, so would have low privacy risk. For any outputs with remaining potential privacy concerns, the Anonymization Treatment column explains how to reduce the potential privacy risk; in this case, by smoothing the plot or coarsening time periods in the underlying data. A final column provides some notes, which could be deleted in a production version of the checklist. An important benefit of this checklist approach is that researchers can compare the original output with the anonymized output to ensure that the anonymization treatments applied do not adversely impact the statistical inferences and conclusions they have drawn. Given a particular analysis output, the process involves 3 steps. The first step is: Step 1: Locate the row of the checklist corresponding to that output type, apply each anonymity test in that row, and identify any that fail the test as a potential privacy risk. Thus, for example, if an output comprises a count (number), the threshold n test would identify the output as a potential privacy risk if it is less than n. This test is equivalent to requiring that if a combination of variables occurs for an individual in the data, then it occurs for at least n individuals (this property is often called n-anonymity). We emphasize that it is the reidentification risk of the analysis output that is of concern, not the reidentification risk of the dataset itself. However, attacks on analysis output often involve attempts to reconstruct data values, either exactly or with low uncertainty.3 Applying the checklist tests and treatments to outputs computed on all variables may not always be necessary for privacy protection, and we introduce the concept of quasi-identifier to explain why this is the case. Quasi-identifiers are background knowledge variables about individuals that an intruder could use, individually or in combination, to reidentify a record with high probability. Unlike direct identifiers, quasi-identifiers are normally useful for analytic purposes and cannot be used alone to uniquely identify an individual with certainty.16 The most common way that reidentification of a data record can occur is by matching quasi-identifiers to an external database containing those quasi-identifiers together with identifying variables such as names and addresses. Examples of quasi-identifiers are: sex, date of birth, and age; location (such as post codes or census geography); ethnic origin; total years of education; marital status; criminal history; income; profession; event dates (such as admission, discharge, procedure, or visit); codes; and country of birth. Examples of variables that are generally not quasi-identifiers in the health setting are clinical observations such as blood pressure or body mass index. However, care must still be taken with such clinical observations, since they can present a reidentification risk in some situations. For example, an individual who is the only morbidly obese person in a dataset is at higher risk of being recognized and reidentified. When working with quasi-identifiers, note that it is possible that attributes that are not quasi-identifiers in one situation can be quasi-identifiers in another situation. Further, attributes that are not quasi-identifiers today might become quasi-identifiers in the future, so this practical simplification should be implemented with care. 
More generally, the actual privacy risk of any research output will depend on the study context and other information available, particularly external datasets facilitating matching of quasi-identifiers; that may change over time. For these reasons, it is generally not possible to give a definitive list of quasi-identifiers. More information on selecting and using quasi-identifiers can be found elsewhere.16,17 Where it is clear that the analysis output involves variables that are not quasi-identifiers, the researcher would apply a second (optional) step: Step 2 (optional): Where step 1 identifies a potential privacy risk, apply the same anonymization test from the checklist to the sub-cell or restricted statistic determined by only the quasi-identifiers. In the example of a count, step 2 would involve recomputing the value while disregarding variables that are obviously not quasi-identifiers. A consequence is that step 1 will normally identify a larger number of outputs as having associated potential privacy concerns. Subsequent application of step 2 will normally significantly reduce this number. Finally, the researcher would conduct a third step: Step 3: Where steps 1 and 2 identify a potential issue, apply one of the suggested anonymization treatments from the checklist. The checklist was designed for the context of publication of results from Australian population health research using linked data. We first produced a draft checklist and then tested it by choosing a sample of publications, which were read by a small panel. The aim was to look for omissions and obscurities in the draft checklist; for example, outputs in the papers read that were not included and descriptions, tests, and treatments that were unclear or unhelpful. This led to a revised checklist.4 METHODS In the present review we broadened the scope of publications considered but retained the emphasis on linked data for two reasons: to give continuity with our previous work, and because linked data may a priori have potentially higher privacy risks due to the generally larger number of variables. We selected a new pool and sample of publications, and a panel of readers to apply the checklist as in steps 1 and 2 above. The readers were all academic researchers with master’s or doctoral degrees in mathematics or statistics, but with very little or no experience in population health research, practical disclosure risk assessment, or anonymization. Despite our efforts to ensure completeness and clarity in the checklist, we recognized that its interpretation could be subject to variation. So we also checked for consistency of interpretation by having a subsample of the papers read by pairs of readers and then compared their results using Cohen’s κ.18 Our initial pool of publications was created from two sources: All 64 journal publications using linked data provided by the Centre for Health Record Linkage in Australia (see www.cherel.org.au) published between 2009 and 2013, and A random selection of 300 of the 2236 journal publications returned by the PubMed search engine (see www.ncbi.nlm.nih.gov/pubmed) for the years 2008–2013, with the keywords “record linkage” and full text available. A sample of 100 publications was selected randomly from the pool of 364 publications. Each of 5 readers was allocated 10 papers, and the remaining 50 papers were allocated in groups of 10 to the pairs of readers (1, 2), (2, 3), (3, 4), (4, 5), and (5, 1). 
So each reader read 30 papers, 10 by themselves and 20 that were also read independently by another. The sizes of the samples and the overlaps were chosen so the reviewing task was a reasonable size for each reader while enabling a pairwise comparison of readers for consistency. Researchers would normally undertake a training session on the tests and checklist before using them. To simulate this in our study, each reader initially evaluated a common set of 3 publications and then participated in a discussion session to compare results. To understand the prevalence of potential privacy risks identified by applying steps 1 and 2 of the checklist, we collected baseline data about the occurrence of statistical analysis output types in the publications. The readers categorized analysis outputs into the types listed in column 1 of the checklist and counted the number of publications containing at least one output from each type to give some indication of the popularity of each output. So each output type was given a score of 1 or 0 (presence/absence) for each publication. Counting the actual number of outputs of each type in each publication was considered unnecessary for the purpose intended. The readers then used the anonymity tests in the checklist to identify potential privacy risks among the individual statistical outputs found in the publications, ignoring whether variables were quasi-identifiers or not (step 1). The extent of agreement between each pair of readers in their identification of outputs for potential privacy risk was determined by calculating Cohen’s κ.18 This is a statistical measure of interrater agreement for categorical items and is generally regarded as a more robust measure than a simple percent agreement calculation, since it takes into account agreement occurring by chance. Values of κ above 0.4 are generally accepted as representing fair to good agreement.19 Finally, the readers considered which variables were quasi-identifiers and reduced the collection of identified potential privacy risks accordingly. RESULTS Baseline data on occurrence of statistical analysis output types The readers identified 25 different output types in the sample, with a total of 566 occurrences. The majority of the outputs were of the following types: Summary statistics: count/number, percentage, mean, median, mode, ratio, maximum, minimum, percentile, variance, standard deviation Plots: scatterplot, histogram, Kaplan-Meier survival curve Modeling outputs: parameter estimate, relative risk, odds ratio, hazard ratio, confidence interval, P-value, chi-square Among these, the most common classes of output were number, percentage, confidence interval, individual data, P-value, and odds ratio. Figure 1 shows the 25 output types as column labels and the number of publications in the sample containing at least one output of each type as the height of each column. Thus, for example, 57 publications in the sample reported a confidence interval, while 1 publication reported a mode. The most common types were number (91) and percentage (85). So, for instance, the output type number was present in 91 of the 100 papers (91%) and thus represented 16% of all outputs present in the sample. Figure 1. View largeDownload slide Number of outputs of each type among the 566 statistical outputs identified in the publications of the sample of 100 publications, also showing the number of outputs of each type that were identified by one or more of the step 1 anonymity tests as applied by one or more readers. Figure 1. 
View largeDownload slide Number of outputs of each type among the 566 statistical outputs identified in the publications of the sample of 100 publications, also showing the number of outputs of each type that were identified by one or more of the step 1 anonymity tests as applied by one or more readers. There was a small number of papers in the sample that did not include any statistical analysis of actual linked health data, including one that had no mention of data linkage in the text despite its appearing as a keyword. Application of step 1 anonymity tests There were 128 instances of individual statistical output types identified as potential privacy risks by one or more of the checklist anonymity tests by one or more readers (in step 1). If a type was so identified more than once in a paper, it was only counted once (presence/absence again). Figure 1 displays the 128 instances by type; so, for example, 42 instances of a reported number and 2 instances of a scatter plot were identified. The most common output type identified was individual data. Specific reasons for identification in some of the common output classes were: Inclusion of overly detailed contextual information, such as extremely fine-grained location and dates of the study, that could be interpreted as individual data Reporting of summary statistics based on a small number of values (number, percentage, odds ratio, mean, frequency table, ratio) High precision/unnecessary number of significant figures (confidence interval, P-value) Nonsmoothed curves (Kaplan-Meier estimates) Graphs and plots below a threshold or revealing individual values (graph of grouped data) It can be seen from Figure 1 that the published output types most commonly identified under the tests were individual data, scatter plots, maxima, minima, and Kaplan-Meier plots. As an example of the interpretation of these results, consider the output class Kaplan-Meier estimates. At least one example of a Kaplan-Meier estimate occurred in 11 of the publications, and in 7 of these, one or more estimates were identified by at least one of the step 1 anonymity tests by at least one reader. Note that there is a low correlation between the potential privacy risk of an output type and its frequency of occurrence. This is not unexpected, since an output would be included if it contributes to the analysis, regardless of its potential privacy risk. For the 2 types always identified in step 1, one occurs in nearly half the papers read, the other in just 2 papers. The more frequently occurring one is, at least potentially, the more concerning. Interrater agreement for step 1 anonymity testing In this study, we found full agreement (κ = 1) between the 3 pairs of readers 1 and 2, 2 and 3, and 3 and 4. For readers 5 and 1, κ = −0.01, while for readers 4 and 5, κ = 0.3. Further investigation of the data from these 2 pairings reveals that either the differences are due to differences in the readers’ labeling of output types in the collection of the baseline data, or they can readily be addressed with researcher training. Readers 4 and 5 differed in their reporting of output types across several papers. In one paper, reader 4 reported individual data, number, percentage, confidence interval, P-value, and odds ratio, while reader 5 reported number, percentage, percentile, scatter plot, confidence interval, P-value, chi-square, odds ratio, and logistic regression. 
On inspection, the publication includes both a plot and the associated data, an odds ratio within the output of a logistic regression, and a chi-square test mentioned in the analysis but not in the text. So the difference between the readers is due to the difference in labeling of these nested output types and subsequent different counts assigned by the readers. A check shows that the 2 readers’ identification of potential risks was, however, effectively identical across all the common publications. Thus, although there is not full agreement between readers, the differences do not affect outcomes from the step 1 anonymity tests and only affect the baseline counts. The difference between readers 1 and 5 is due to the fact that reader 1 did not identify individual data as potentially risky in any of the papers shared with reader 5. This discrepancy can be explained in terms of difference in training. All readers were trained on a common set of 3 publications; however, readers 2, 3, 4, and 5 participated in a similar exercise of applying a draft version of the checklist to a different sample of publications (see the “Background” section), while reader 1 was only involved in the current study. Subsequent investigation showed that the common training publications here did not contain any instances of individual data, hence reader 1 had no training or experience with individual data. The fact that reader 1 had perfect agreement with reader 2 on their papers, which, as it happened, did not include any occurrences of individual data, strongly suggests that the checklist is otherwise clear. Application of step 2 anonymity tests Subsequently, the step 2 anonymity tests were applied. For each output identified as a potential privacy risk in step 1, the tests were reapplied, disregarding variables that were obviously not quasi-identifiers. None of the outputs identified by the step 1 threshold n test were subsequently identified by the step 2 threshold n test, when counts based on only quasi-identifiers were considered. An example of such an output is a count of one child in a reasonably sized geographical area (population 8 million) over a 5-year period with diabetes of an unknown type. While this is a small count, the associated variables of geographic area, 5-year period, and diagnosis are considered unlikely to appear in an external database together with identifying information. There were only 6 individual outputs identified by the step 1 anonymity tests that were still identified under the step 2 tests. Two instances were a confidence interval and a P-value being given to an unnecessarily high number of significant figures. These could have been avoided without any loss of analytic utility by simply rounding the values to fewer significant figures. Two Kaplan-Meier survival curves were cross-classified by at least one quasi-identifier (ethnicity) and were provided in a format that enabled exact time of event to be recovered to the nearest day. However, the time to event was calculated for each participant from an unknown time of birth. The residual risk of reidentification by linkage to an external database in each case was judged to be negligible. Although none of these 4 cases had a high privacy risk in its study context, each could have been avoided by applying one of the suggested anonymization treatments without affecting the final conclusions of each publication. 
DISCUSSION AND CONCLUSION
While data use agreements, access controls, and anonymization measures applied directly to population health and health services data provide good privacy protection during access and use by trusted researchers, there can still be privacy concerns associated with analysis outputs published in the academic literature and other publicly accessible formats. Virtual data centers are becoming an increasingly popular setting in which researchers are given access to detailed confidential data; however, there are concerns that the current manual output-checking procedures will not scale up as demand for such services grows.

In seeking to explore options for meeting this challenge, this study has demonstrated the usability and reliability of a checklist designed to assist researchers with anonymizing outputs of statistical analysis in population health and health services research. The checklist can be used by researchers for datasets they either hold or access through virtual data centers. The checklist was applied to a sample of 100 academic population health publications distributed among 5 readers, and Cohen's κ was used to measure interrater agreement. Agreement between readers on the identification of potential privacy risks was generally good. This further suggests that such an approach may have the potential to be developed into a broadly applicable standard providing consistent confidentiality protection across multiple analyses of the same data.

In addition to the primary objective, the study found little evidence of actual privacy risk from published analysis outputs in our sample of publications involving statistical analysis of linked population health and health services data. However, we also observe that privacy risks could be further reduced by applying simple anonymization treatments without affecting the final conclusions of the research.

Our study has underlined the need for training of researchers before they anonymize their own statistical analysis outputs using our checklist tool. Such training should cover the regulatory environment and privacy expectations and requirements; privacy risk and how disclosures can occur; using the checklist for privacy risk assessment and applying anonymization treatments as needed; and, finally, how to ensure that anonymization treatments do not adversely affect statistical inferences and conclusions.4

Public health and medical journals are encouraging authors to submit supplementary materials, including datasets and additional analysis results, for online publication. The ease of online publication is likely to increase both the volume and detail of analysis results publicly available, which in turn is likely to increase privacy risk.4

ACKNOWLEDGMENTS
The authors thank Joseph Chien, Daniel Elazar, and Joanna Khoo for valuable contributions to the project, and the associate editor and reviewers for their helpful suggestions.
The first author thanks the Isaac Newton Institute for Mathematical Sciences, University of Cambridge, for support and hospitality during the data linkage and anonymization program, where some work on this paper was undertaken.

FUNDING
This work was supported by the Population Health Research Network, an initiative of the Australian Government conducted as part of the Super Science Initiative and funded by the Australian Government Department of Education and Training Education Investment Fund. This work was partially supported by Engineering and Physical Sciences Research Council grant no. EP/K032208/1 and a grant from the Simons Foundation.

COMPETING INTERESTS
The authors have no competing interests to declare.

CONTRIBUTORS
CMO’K led the overall project, participated as a reader, and drafted and revised the manuscript. AI designed the study, participated as a reader, and analyzed the data. TC proposed the study and participated as a reader in the initial testing of the checklist. MW participated as a reader, contributed to the analysis, and contributed to the drafting and revision of the manuscript. MO’S managed the overall project, participated as a reader, and provided comments on the draft manuscript. AK participated as a reader.

REFERENCES
1 Safran C, Bloomrosen M, Hammond WE, et al. Toward a national framework for the secondary use of health data: an American Medical Informatics Association White Paper. J Am Med Inform Assoc. 2007;14:1–9.
2 Weiner MG, Embi PJ. Toward reuse of clinical data for research and quality improvement: the end of the beginning? Ann Intern Med. 2009;151:359–60.
3 O’Keefe CM, Chipperfield JO. A summary of attack methods and confidentiality protection measures for fully automated remote analysis systems. Int Stat Rev. 2013;81:426–55.
4 O’Keefe CM, Westcott M, O’Sullivan M, Ickowicz A, Churches T. Anonymization for outputs of population health and health services research conducted via an online data center. J Am Med Inform Assoc. 2017;24:544–49.
5 O’Keefe CM, Rubin DB. Individual privacy versus public good: protecting confidentiality in health research. Stat Med. 2015;34:3081–103.
6 Hundepool A, Domingo-Ferrer J, Franconi L, et al. Statistical Disclosure Control. Hoboken, NJ: John Wiley & Sons; 2012.
7 Duncan GT, Elliot M, Salazar-Gonzalez J-J. Statistical Confidentiality. New York: Springer; 2011.
8 Malin BA, El Emam K, O’Keefe CM. Biomedical data privacy: problems, perspectives, and recent advances. J Am Med Inform Assoc. 2013;20:2–6.
9 O’Keefe CM, Westcott M, Ickowicz A, et al. Protecting confidentiality in statistical analysis outputs from a virtual data centre. Working paper, Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality. Ottawa, Canada; 2013. www.unece.org/stats/documents/2013.10.confidentiality.html. Accessed March 11, 2016.
10 Elliot M, Mackey E, O’Hara K, Tudor C. The Anonymisation Decision-Making Framework. UK Anonymisation Network. http://ukanon.net/wp-content/uploads/2015/05/The-Anonymisation-Decision-making-Framework.pdf. Accessed August 20, 2016.
11 US Department of Health & Human Services. Guidance Regarding Methods for De-Identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/. Accessed August 20, 2016.
12 Xia W, Kantarcioglu M, Wan Z, Heatherly R, Vorobeychik Y, Malin BA. Process-driven data privacy. In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM; 2015:1021–30.
13 Malin BA, Karp D, Scheuermann RH. Technical and policy approaches to balancing patient privacy and data sharing in clinical and translational research. J Invest Med. 2010;58(1):11–18. http://dx.doi.org/10.2310/JIM.0b013e3181c9b2ea. Accessed October 11, 2016.
14 El Emam K, Jonker E, Arbuckle L, Malin BA. A systematic review of re-identification attacks on health data. PLoS One. 2011;6(12):e28071.
15 Statistics New Zealand. Data Lab Output Guide. Wellington: Statistics New Zealand; 2011. www.stats.govt.nz/tools_and_services/microdata-access/data-lab.aspx. Accessed April 6, 2016.
16 El Emam K. A Guide to the De-identification of Health Information. New York: CRC Press; 2013.
17 Dankar FK, El Emam K, Neisa A, Roffey T. Estimating the re-identification risk of clinical data sets. BMC Med Inform Decis Mak. 2012;12(1):66.
18 Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37–46.
19 Fleiss JL. Balanced incomplete block designs for inter-rater reliability studies. Appl Psychol Meas. 1981;5:105–12.

© The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com
