Abstract

The growth of administrative data repositories worldwide has spurred the development and application of data quality frameworks to ensure that research analyses based on these data can be used to draw meaningful conclusions. However, the research literature on administrative data quality is sparse, and there is little consensus regarding which dimensions of data quality should be measured. Here we present the core dimensions of the data quality framework developed at the Manitoba Centre for Health Policy, a world leader in the use of administrative data for research purposes, and provide examples and context for the application of these dimensions to conducting data quality evaluations. In sharing this framework, our ultimate aim is to promote best practices in rigorous data quality assessment among users of administrative data for research.

Keywords: data quality, administrative data, framework

INTRODUCTION

Recent developments in information technology have spurred the growth of data repositories and sparked renewed interest in the use of administrative data for research purposes.1 Administrative data, generally described as data derived from the operation of administrative systems (eg, the health care system) for the purpose of registration, transaction, and/or record-keeping,2 are advantageous for research because they do not require time-consuming participant consent or primary data collection.3 The many research possibilities for administrative data have generated "information-rich" environments in many research institutions, built on linking records across datasets and across sectors.4 As the use of routinely collected data in research becomes more common, data quality is an increasingly important consideration.
Data quality is a context-specific concept defined broadly as "fitness for use."5 Administrative data quality can be influenced by many factors: inconsistent methods of data collection, temporary coding problems when introducing new systems, and/or systematic biases in reporting.6 Establishing a rigorous data quality framework is critical to ensuring that comprehensive and consistent evaluations form the basis of increasingly complex data analyses, and it is particularly important for secondary users of information, as they often have little control over the data collection and maintenance processes.6 However, few research institutions have adopted structured frameworks to assess data quality, and there is little consensus regarding which dimensions of data quality should be measured.7 In addition, the research literature on administrative data quality is sparse and has rarely included discussion of data quality evaluation methods.8 Therefore, in order to promote the use of rigorous data quality assessment and advance the research literature in this emerging area, we describe here the data quality framework developed and implemented at the award-winning Manitoba Centre for Health Policy (MCHP).9 In sharing this framework, our ultimate aim is to support best practices in evaluating administrative data for research purposes.

CASE DESCRIPTION

The Manitoba Population Research Data Repository at MCHP is a comprehensive population-based collection of administrative data, capturing information on virtually all residents (approximately 1.3 million individuals) of the province of Manitoba, Canada.10,11 The databases are grouped into 6 domains (health, education, social, justice, registries, and support files) and are updated on an annual or semiannual basis. The repository contains no personal identifying information, but datasets are linkable across files and over time by way of scrambled numeric identifiers.
Research using linked repository datasets can describe and explain Manitoba residents' patterns of health and health care use, social services use, and education and justice system contacts, and the findings can inform policy development and implementation in Manitoba.12 Developing a structured process to efficiently and comprehensively assess the quality of different kinds of data has been crucial to MCHP's success as a leader in population health research.13

METHODS

To develop the MCHP data quality framework, we conducted a comprehensive search of the published and gray literature for information on data quality assessment practices in other Canadian and international research institutions. We reviewed data quality practices from the Canadian Institute for Health Information, the Public Health Agency of Canada, Statistics Canada, the Australian Bureau of Statistics, and the Institute for Clinical Evaluative Sciences (ICES) in Ontario, Canada. From these examples, we selected 5 key data quality dimensions based on their relevance to population-based research analyses and the availability of operational indicators at MCHP and incorporated them into our framework.

RESULTS

Data quality is a broad concept that is both relative and multidimensional in nature, as was evident in all of the frameworks we examined. Table 1 describes the dimensions included in data quality frameworks from Canadian institutions, which were the ones we found most helpful in developing the MCHP framework. To some degree, all of these data quality frameworks share common features. The concepts of accuracy (how well the data reflect the reality of what they were meant to measure) and timeliness (how current the data are) are included universally, and all frameworks also incorporate some measure of how "useful," "serviceable," or "relevant" the data are – that is, the degree to which the data meet the needs of users, although the exact definitions of this parameter varied.

Table 1.
Comparison of Canadian data quality frameworks. The 5 frameworks compared are those of the Canadian Institute for Health Information, the Public Health Agency of Canada, Statistics Canada, the Institute for Clinical Evaluative Sciences, and the Manitoba Centre for Health Policy; each "x" marks a framework that includes the dimension.

Accuracy: x x x x x
Correctness: x x
Completeness: x x
Reliability: x
Reproducibility: x
Validity: x x
Measurement error: x
Level of bias: x
Consistency: x
Timeliness: x x x x x
Comparability: x
Accessibility: x
Usability/serviceability/relevance/interpretability: x x x x x
Coherence: x
Linkability: x

To some extent, the dimensions included in each framework depend on the purpose of individual data repositories. For example, Statistics Canada assesses accessibility, or the ease with which the data can be obtained from the agency. This dimension is of much less concern for research institutions like MCHP, where accessing the data records is a highly regulated process. And while the ICES framework includes measures of anonymity and linkability, all datasets in the MCHP repository are both deidentified (personal information removed) and linkable (using scrambled numeric identifiers); therefore, including these dimensions in our data quality assessment would be redundant. In developing MCHP's data quality framework, we took a pragmatic approach to ensure that the quality assessments we conducted fit the scope of the available data, would be generalizable across different types of structured data, and could be conducted within the limitations of our legislative environment.
Thus, the MCHP data quality framework is based on the dimensions of accuracy, internal validity, external validity, timeliness, and interpretability (Figure 1). A description of the 5 dimensions and their subcomponents follows, with examples of output presented to demonstrate their utility and research relevance.

Accuracy: Accuracy is defined as the degree to which the data correctly describe the phenomena they were designed to measure,5 demonstrating their capability to support research conclusions.14 The concept of accuracy encompasses 5 subcomponents: completeness, correctness, measurement error, level of bias, and consistency.

Completeness: MCHP measures completeness as the percentage of missing values in a given dataset field.9 Missing values may include blank fields for character variables or periods for numeric variables. The related concept of "missingness" tracks trends in missing data over time (Figure 2), indicating whether the amount of missing data is changing and serving as an indicator of potential data quality problems.

Correctness: Correctness is measured by the percentage of invalid codes (values that do not match provided formats) and invalid dates or out-of-range numeric values (values that fall outside the possible or established range).9 Outliers or extreme values are also flagged but not removed, as they do not always indicate poor quality; instead, flags alert users to investigate possible reasons for the occurrence.
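The completeness check described above reduces to counting blank or period values per field per year. The following is an illustrative Python sketch of that calculation, not the macro code MCHP actually uses; the record layout (a list of dicts, each with a "year" key) is an assumption made for the example.

```python
from collections import defaultdict

def missingness_by_year(records, fields):
    """Percent of missing values per field per calendar year.

    A value counts as missing when it is None, an empty string, or a
    period, matching the blank/period conventions described above.
    """
    # field -> year -> [missing count, total count]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for rec in records:
        year = rec["year"]
        for field in fields:
            cell = counts[field][year]
            cell[1] += 1
            if rec.get(field) in (None, "", "."):
                cell[0] += 1
    return {f: {y: round(100.0 * m / t, 1) for y, (m, t) in by_year.items()}
            for f, by_year in counts.items()}
```

Tabulating the output by year, as in Figure 2, makes sudden changes in data capture easy to spot.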
MCHP uses the Canadian Institute for Health Information's suggested rankings of minimal, moderate, or significant to categorize completeness and correctness.15 MCHP has also developed the valid, invalid, missing, outlier (VIMO) macro to evaluate correctness, generating output that includes variable labels and corresponding percentages of valid, missing, and outlier values (Figure 3).16 VIMO also generates the mean, minimum, maximum, median, and standard deviation for numeric variables, and a list of the 10 most frequent values for character variables.

Measurement error: Measurement errors occur when recorded values result from incorrect answers or miscoding.15 Such errors can be caused by confusing definitions or weaknesses in data collection procedures. One example of measurement error is a data field where either "yes" or "no" would be appropriate, but it instead contains "b"; another is a patient erroneously documented as not hypertensive because he or she is taking medication to manage blood pressure. Good documentation and automated data collection methods can help reduce measurement error.15

Level of bias: Bias refers to systematic differences between reported values and the values that should have been reported.15 For example, sex- or age-specific biases can occur in datasets documenting certain types of chronic disease. While true bias is hard to establish concretely other than through re-abstraction studies, possible biases can be detected when sampling errors occur or when coverage or responses are incomplete. Correlated bias can occur when one data element is correlated with another, such as length of observation time with outcome; it is not generally assessed when acquiring data into the repository but can, if necessary, be evaluated as part of the research enterprise.
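A VIMO-style classification can be sketched in a few lines of Python. This is an illustrative sketch under assumed conventions (missing encoded as blank or period, an allowed-code set for character fields, a plausible range for numeric fields), not the actual macro, which is written as SAS code.

```python
import statistics

def vimo_summary(values, valid_codes=None, low=None, high=None):
    """VIMO-style summary: percent valid, invalid, missing, and outlier.

    `valid_codes` is the allowed set for a character field; `low`/`high`
    bound the plausible range for a numeric field. Out-of-range values
    are flagged as outliers rather than removed.
    """
    tally = {"valid": 0, "invalid": 0, "missing": 0, "outlier": 0}
    numeric = []
    for v in values:
        if v in (None, "", "."):
            tally["missing"] += 1
        elif isinstance(v, (int, float)):
            numeric.append(v)
            out_of_range = (low is not None and v < low) or (high is not None and v > high)
            tally["outlier" if out_of_range else "valid"] += 1
        elif valid_codes is not None and v not in valid_codes:
            tally["invalid"] += 1
        else:
            tally["valid"] += 1
    percents = {k: round(100.0 * n / len(values), 1) for k, n in tally.items()}
    # Summary statistics are reported for numeric fields only.
    stats = ({"mean": statistics.mean(numeric), "median": statistics.median(numeric),
              "min": min(numeric), "max": max(numeric)} if numeric else None)
    return percents, stats
```

Rendering the percentages with color coding, as VIMO does, turns this output into an at-a-glance dashboard for a dataset.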
Consistency: Consistency, also referred to as reliability, is measured by the amount of variation that would occur if repeated measurements were taken.15 Consistency is often an issue for subjective data elements that may not have a single correct answer, such as a rating on a scale of 1–5, and is most effectively evaluated in re-abstraction studies.

MCHP often assesses measurement error, level of bias, and consistency at a granular level.9 This may also require linking data across several datasets, which is only permitted when appropriate ethical approval has been obtained. For these reasons, such quality assessment is not typically conducted during the data acquisition phase.

Internal validity: Internal validity measures the consistency between values in 2 data fields derived from the same source.17 At MCHP, internal validity comprises 3 subcomponents: internal consistency, temporal consistency, and linkability (the ability to readily link 2 data files using a common key or identifier).

Internal consistency: Internal consistency is a measure of the numeric agreement between fields or the logical relationship between fields.18 Examples of inconsistencies include a field noting a pregnant man or an 80-year-old woman having a baby. To measure such consistency, we use the VALIDATION macro to scan the dataset and count the number of data inconsistencies based on user-specified validation rules.16

Temporal consistency: Temporal consistency is the degree to which a set of time-related observations conforms to a smooth line or curve over time, together with the percentage of observations classified as outliers from that line or curve.18 However, the trend analysis must be correctly informed before conclusions can be drawn. If a field such as "date of admission" in a trauma program were interpreted without accounting for historical changes in the program, the results could be misleading.
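The rule-scanning approach described for internal consistency can be sketched as follows. This is an illustrative Python sketch in the spirit of the VALIDATION macro, not its actual code; the record layout and the example rules (mirroring the pregnant-man and implausible-maternal-age inconsistencies mentioned above) are assumptions made for the example.

```python
def count_inconsistencies(records, rules):
    """Count violations of user-specified validation rules.

    Each rule is a (name, predicate) pair; a record violates the rule
    when the predicate returns False for that record.
    """
    failures = {name: 0 for name, _ in rules}
    for rec in records:
        for name, passes in rules:
            if not passes(rec):
                failures[name] += 1
    return failures

# Hypothetical rules mirroring the inconsistencies described in the text.
example_rules = [
    ("no pregnant males", lambda r: not (r["sex"] == "M" and r["pregnant"])),
    ("plausible maternal age", lambda r: not r["pregnant"] or r["age"] <= 55),
]
```

Expressing rules as named predicates keeps the checks declarative, so a data analyst can extend the rule set without touching the scanning logic.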
MCHP's TREND macro measures temporal consistency over a specified time period (Figure 4).16 The macro fits a series of common models and selects the one with the minimum mean squared error, estimates studentized residuals for each observation, and flags significant observations as potential outliers.9 The macro also flags values as potential problems if it detects repeated observations with exactly the same value (indicating no change over time).

Linkability: MCHP defines linkability as the percentage of records having common identifiers in 2 or more administrative databases. Linkability is important for determining the data's utility for research.18 Unique record identifiers that correspond to a personal health insurance number facilitate linkage based on deterministic matching.19 Personal health insurance numbers attached to records are scrambled before MCHP acquires the deidentified data. MCHP's LINK macro shows the number and percentage of linkable records for a specific dataset or list of datasets.16

External validity: External validity refers to the relationship between the values in a data file and an external source of information, an assessment often referred to as "data confrontation."18 For example, the level of agreement between summaries of the data and available literature, reports, and general knowledge can be an indicator of external validity. Outside content experts might also be consulted in this assessment.

Timeliness: Timeliness refers to how up to date the data are at the time of release.9 It reflects the time between data request and data acquisition, and the time between data acquisition and data release for research use. Long delays between acquisition and release might suggest resourcing issues that need to be resolved with data providers. The currency of documentation has also been added as a new indicator of timeliness.9 Metadata serve to inform users of important data characteristics and limitations.
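Because linkage is deterministic on scrambled identifiers, the linkability measure reduces to a set membership count. The following is an illustrative Python sketch in the spirit of the LINK macro, not its actual code; the function name and inputs are assumptions made for the example.

```python
def linkability(dataset_ids, registry_ids):
    """Percentage of records in a dataset whose scrambled identifier
    also appears in a reference registry (deterministic matching).
    """
    registry = set(registry_ids)
    linked = sum(1 for i in dataset_ids if i in registry)
    return round(100.0 * linked / len(dataset_ids), 1)
```

A low percentage can flag identifier quality problems in a newly acquired dataset before it is approved for linked research use.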
Decisions to use data must therefore be based on up-to-date documentation. Documentation currency is measured by the difference in time between data release and release of the associated metadata.9

Interpretability: Interpretability is defined as the ease with which a user can understand the data.15 This is based on the quality of the documentation provided, policies and procedures, formats, and metadata. If documentation is poor, data quality issues may go unrecognized.9 MCHP does not currently have operational measures for evaluating the interpretability of data, but is developing these for future use.

Figure 1. Schematic of the Manitoba Centre for Health Policy's data quality framework.

Figure 2. Example of data missingness. This output describes trends in missing data over several calendar years. Variables are listed on the Y-axis and years on the X-axis. The cells contain the percentage of missing data for each variable in each year. For example, 100% of the data for mets_brain and mets_bone were missing from 2005 to 2009, but data capture then improved so that ≤30% was missing in the following years.

Figure 3. Example of VIMO macro output: data correctness.

Figure 4.
Example of TREND macro analysis: data trends over time. This trend line shows participation in a treatment program in Winnipeg, Manitoba. Services increase steadily over time, with a dip in 2005. The trend line and regression are computed from measurements across the data points.

DISCUSSION

The growth of data repositories globally has necessitated the development and application of data quality frameworks to ensure that research using administrative data is based on sound, high-quality information. This case report describes the 5 core dimensions of MCHP's data quality framework, which may serve as an exemplar for other research institutions working with administrative data and seeking to improve their data quality assessment process. Early work on the MCHP data quality framework informed the development and adoption of a data quality assessment framework by 2 other administrative data research institutions: ICES in Ontario, Canada,8 and the Secure Anonymized Information Linkage Databank in Swansea, UK.20 The information presented here has the potential to initiate this transformative process for other institutions worldwide. Failing to monitor data quality has multiple repercussions: increased project duration, effort, and costs; poorly informed, biased, or outdated decision-making; damaged trust in study results; and decreased end-user satisfaction. Our rigorous framework mitigates these negative consequences and provides several advantages for researchers working with administrative data. First, it provides a metric for the degree to which data quality varies over time.
These comparisons can help determine whether specific fields exhibit sudden changes in missing values or numbers of cases. Second, the framework serves to communicate data quality issues to data users in an accessible, user-friendly way. For example, the color-coded VIMO macro output serves as a kind of "dashboard" for assessing the quality indicators of a particular dataset at a glance. The ability to easily and rapidly compare data quality across different datasets is increasingly important for studies involving multiple collaborators and/or spanning numerous jurisdictions. Finally, and uniquely among the frameworks we examined, the macros and other coding tools developed as part of the MCHP framework are freely available under a General Public License,21 allowing interested users to adopt or adapt specific framework components to suit their individual needs.

Future directions

At MCHP, we are developing the capacity to use big data analytic techniques to unlock the potential of unstructured (or free-text) data, such as physicians' notes in electronic medical records, emergency room triage notes, and imaging reports. Developing the technology to analyze these free-text data sources will add value and sophistication to traditional analytic approaches for structured administrative data, but will also present new challenges for data quality assessment.22,23 We will draw on expertise from other fields (including engineering, computer science, and mathematics) to develop appropriate techniques for assessing the data quality of unstructured health data.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sector.

Competing Interests

The authors have no competing interests to declare.

Contributors

The need for the data quality framework described in this study was conceptualized by MS, LL, MA, and LR.
MA led the literature review; MS, LL, and MA designed the framework; and SH implemented it. All authors participated in interpreting the framework outputs. The manuscript was drafted by JE and JO, and all authors read and revised the content critically. The final version was approved by all authors, who agree to be accountable for the work presented.

References

1. Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Inf Sci Syst. 2014;2:3.
2. Elias P. Administrative data. In: Dusa A, Nelle D, Stock G, Wagner G, eds. Facing the Future: European Research Infrastructures for the Humanities and Social Sciences. Berlin: SCIVERO; 2014:47–48.
3. Weiskopf N, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 2013;20(1):144–51.
4. Roos LL, Gupta S, Soodeen RA, Jebamani L. Data quality in an information-rich environment: Canada as an example. Can J Aging. 2005;24(Suppl 1):153–70.
5. Statistics Canada. Statistics Canada's Quality Assurance Framework. http://www.statcan.gc.ca/pub/12-586-x/12-586-x2002001-eng.pdf. 2002. Accessed April 5, 2017.
6. Hirdes JP, Poss JW, Caldarello H, et al. An evaluation of data quality in Canada's Continuing Care Reporting System (CCRS): secondary analyses of Ontario data submitted between 1996 and 2011. BMC Med Inform Decis Mak. 2013;13:27.
7. Chen H, Hailey D, Wang N, Yu P. A review of data quality assessment methods for public health information systems. Int J Environ Res Public Health. 2014;11(5):5170–207.
8. Iron K, Manuel DG. Quality Assessment of Administrative Data (QuAAD): An Opportunity for Enhancing Ontario's Health Data. http://www.ices.on.ca/∼/media/Files/Atlases-Reports/2007/Quality-assessment-of-administrative-data/Full%20report.ashx. 2007. Accessed April 5, 2017.
9. Azimaee M, Smith M, Lix L, Ostapyk T, Burchill C, Orr J. MCHP Data Quality Framework. Winnipeg, Manitoba: Manitoba Centre for Health Policy, University of Manitoba; 2015.
10. Roos LL, Nicol JP. A research registry: uses, development, and accuracy. J Clin Epidemiol. 1999;52(1):39–47.
11. Roos LL Jr, Nicol JP, Cageorge SM. Using administrative data for longitudinal research: comparisons with primary data collection. J Chronic Dis. 1987;40(1):41–49.
12. Jutte DP, Roos LL, Brownell MD. Administrative record linkage as a tool for public health research. Annu Rev Public Health. 2011;32:91–108.
13. Roos LL, Menec V, Currie RJ. Policy analysis in an information-rich environment. Soc Sci Med. 2004;58(11):2231–41.
14. Richesson RL, Horvath MM, Rusincovitch SA. Clinical research informatics and electronic health record data. Yearb Med Inform. 2014;9:215–23.
15. Canadian Institute for Health Information. The CIHI Data Quality Framework. https://www.cihi.ca/en/data_quality_framework_2009_en.pdf. 2009. Accessed April 5, 2017.
16. Manitoba Centre for Health Policy. Data Quality Macros – Development Data Analysis Environment. http://umanitoba.ca/faculties/health_sciences/medicine/units/community_health_sciences/departmental_units/mchp/protocol/media/DQMacros_GPL3_Version2.pdf. 2013. Accessed April 5, 2017.
17. Cook TD, Campbell DT. Quasi-Experimentation: Design and Analysis Issues for Field Settings. 1st ed. Boston: Houghton Mifflin; 1979.
18. Lix LM, Smith S, Azimaee M, et al. A Systematic Investigation of Manitoba's Provincial Laboratory Data. http://mchp-appserv.cpe.umanitoba.ca/reference/cadham_report_WEB.pdf. 2012. Accessed April 5, 2017.
19. Roos LL, Wajda A. Record linkage strategies. Part I: estimating information and evaluating approaches. Methods Inf Med. 1991;30(2):117–23.
20. Jones KH, Ford DV, Jones C, et al. A case study of the Secure Anonymous Information Linkage (SAIL) Gateway: a privacy-protecting remote access system for health-related research and evaluation. J Biomed Inform. 2014;50:196–204.
21. Manitoba Centre for Health Policy. Data Quality Resources. http://umanitoba.ca/faculties/health_sciences/medicine/units/chs/departmental_units/mchp/resources/repository/dataquality.html. 2016. Accessed April 5, 2017.
22. Carlo B, Daniele B, Federico C, Simone G. A data quality methodology for heterogeneous data. Int J Database Manage Syst. 2011;3(1):60–79.
23. Kiefer C. Assessing the quality of unstructured data: an initial overview. In: Proceedings of the LWDA. Potsdam, Germany: Hasso Plattner Institute, University of Potsdam; 2016.

© The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved.
Journal of the American Medical Informatics Association – Oxford University Press
Published: Mar 1, 2018