Downloaded from https://academic.oup.com/jamiaopen/article-abstract/1/1/26/5001910 by Ed 'DeepDyve' Gillespie user on 07 November 2018

JAMIA Open, 1(1), 2018, 26–31
doi: 10.1093/jamiaopen/ooy012
Advance Access Publication Date: 23 May 2018
Application Notes

tableone: An open source Python package for producing summary statistics for research papers

Tom J. Pollard, Alistair E. W. Johnson, Jesse D. Raffa, and Roger G. Mark

Massachusetts Institute of Technology (MIT), MIT Laboratory for Computational Physiology, Cambridge, Massachusetts, USA

Corresponding Author: Tom Pollard, PhD, Massachusetts Institute of Technology (MIT), Laboratory for Computational Physiology, 77 Massachusetts Ave, Cambridge, MA 02139, USA (email@example.com)

Received 7 December 2017; Revised 2 March 2018; Accepted 20 April 2018

ABSTRACT

Objectives: In quantitative research, understanding basic parameters of the study population is key for interpretation of the results. As a result, it is typical for the first table (“Table 1”) of a research paper to include summary statistics for the study data. Our objectives are 2-fold. First, we seek to provide a simple, reproducible method for providing summary statistics for research papers in the Python programming language. Second, we seek to use the package to improve the quality of summary statistics reported in research papers.

Materials and Methods: The tableone package is developed following good practice guidelines for scientific computing and all code is made available under a permissive MIT License. A testing framework runs on a continuous integration server, helping to maintain code stability. Issues are tracked openly and public contributions are encouraged.

Results: The tableone software package automatically compiles summary statistics into publishable formats such as CSV, HTML, and LaTeX. An executable Jupyter Notebook demonstrates application of the package to a subset of data from the MIMIC-III database.
Tests such as Tukey’s rule for outlier detection and Hartigan’s Dip Test for modality are computed to highlight potential issues in summarizing the data.

Discussion and Conclusion: We present open source software for researchers to facilitate carrying out reproducible studies in Python, an increasingly popular language in scientific research. The toolkit is intended to mature over time with community feedback and input. Development of a common tool for summarizing data may help to promote good practice when used as a supplement to existing guidelines and recommendations. We encourage use of tableone alongside other methods of descriptive statistics and, in particular, visualization to ensure appropriate data handling. We also suggest seeking guidance from a statistician when using tableone for a research study, especially prior to submitting the study for publication.

Key words: descriptive statistics, python, quantitative research

© The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

OBJECTIVES

Research is highly dependent on the quality of its underpinning data. To assist with the interpretation of an analysis, biomedical research guidelines typically include recommendations for describing the data with summary statistics. The CONSORT (CONsolidated Standards of Reporting Trials) guidelines, for example, indicate the importance of a “table showing baseline demographic and clinical characteristics for each group”. The authors note that this information “allows readers, especially clinicians, to judge how relevant the results of a trial might be to an individual patient”. Other popular reporting guidelines, such as those found on the EQUATOR (Enhancing the QUAlity and Transparency Of health Research) Network, offer similar advice.

It is typical for the first table of a biomedical research paper, the so called “Table 1”, to provide the baseline characteristics of the patient population. The presentation of this table is relatively consistent across studies, showing statistics such as number and proportions of patients, means and medians, and the frequency of missing data. The measures may be stratified across a categorical variable such as the study’s primary outcome in order to show how the population characteristics differ between subgroups. While the computation of summary statistics is conceptually straightforward, the technical task is typically cumbersome and offers ample opportunities for the introduction of misleading and avoidable errors through flaws in data entry, coding mistakes, and incorrect table formatting.

A recently published Correction in JAMA Psychiatry, titled “Errors in Table 1”, offers an example: “the rate of 300.096 was replaced with 30.0096; and for a maternal age of older than 40 years, the rate of 73.199 was replaced with 7.3199”. Another recent correction in the New England Journal of Medicine notes that “‘Nonelective’ should have been ‘Elective’” in the summary of the clinical trial population. These kinds of errors are easy to make, difficult to detect, and happen in many studies, not just the examples provided here.

Providing software to simplify the creation of Table 1 has several benefits: reduction in time spent tediously calculating and formatting results, prevention of common errors when creating summary statistics, and greater consistency in reporting summary statistics. Yoshida and Bohn [5] created a package in the programming language R to automatically create the relevant summary statistics in the appropriate format. This package has become increasingly popular among researchers using R. To date, there is no analogous software to produce a similar table in Python.

We sought to provide a simple, reproducible method for creating summary statistics for research papers in the Python programming language, which has become increasingly popular for scientific studies in recent years. In addition, we sought to encourage better practice for study reporting by highlighting issues relating to the appropriateness of summary statistics. The package is maintained as a public project named tableone, enabling the research community to develop a centralized toolkit that can help to promote reproducible, better quality reporting of data characteristics as the toolkit matures over time. These technical tools are intended to complement recommendation documents and guidelines for reporting on research studies.

Table 1. Example of a table produced by the tableone package when applied to a small subset of data from MIMIC-III

Variables                      Level   Is null   Overall
n                                                1000
Age (years), median (IQR)              0         68 (53–79)
SysABP (mmHg), mean (SD)               291       114.25 (40.16)
Height (cm), mean (SD)                 475       170.09 (22.06)
Weight (pounds), mean (SD)             302       82.93 (23.83)
ICU type, n (%)                CCU     0         162 (16.2)
                               CSRU              202 (20.2)
                               MICU              380 (38.0)
                               SICU              256 (25.6)
In-hospital mortality, n (%)   0       0         864 (86.4)
                               1                 136 (13.6)

Warnings about inappropriate summaries of the data are raised during generation and displayed below the table.
Warning, Hartigans Dip Test reports possible multimodal distributions for: Age, Height, SysABP.
Warning, Tukey rule indicates far outliers in: Height.
IQR: interquartile range; SysABP: systolic arterial blood pressure; ICU: intensive care unit.

BACKGROUND AND SIGNIFICANCE

The Statistical Analyses and Methods in the Published Literature (SAMPL) Guidelines note that reporting errors are common in published biomedical literature. Citing several studies, the authors suggest that the problem of poor statistical reporting is “long-standing, widespread, [and] potentially serious” and that this problem is common even in “the world’s leading peer-reviewed general medical and specialty journals”. While we might expect statistical errors to arise mostly in more complex areas of analysis, it appears that the problem concerns mostly basic statistics. A commentary on how to detect and prevent errors in medical literature suggests that virtually all of the errors in question deal with misuse of material discussed in most introductory statistics textbooks.

As an example, a commonly reported issue is the use of standard error of the mean, rather than standard deviation, as a summary of data variability. The suggestion is that this occurs either due to tradition or, more worryingly, as a result of researcher bias because “the standard error of the mean is always smaller than the standard deviation”. In an editorial titled Ten Rules for Reading Clinical Research Reports, Yancey insists the reader should “Question the validity of all descriptive statistics”, echoing this common and inappropriate use of standard error of the mean.

The extent to which a biomedical journal can and should review the methodology of submitted papers is an open question for editors. In Statistical Reviewing Policies of Medical Journals, the authors explain that a large barrier to methodologic reviews is the availability of resources for doing so [9]. Where a statistical reviewer does happen to be available, it is still common for data and code to be unavailable, and our own experiences have shown that simply reproducing the patient cohort of a study is non-trivial at best [10,11].

According to Glantz, many statisticians would prefer not to spend their time “grinding out garden-variety statistics for other people”, and the job of summarizing data is often best done by the investigators themselves [7]. This is not to give the job of a statistician to a clinical researcher, but to allow the researcher to carry out introductory statistics, while leaving the more complex statistical tasks and reviews to the expert statisticians.

MATERIALS AND METHODS

Python is a rapidly growing programming language with a number of mature libraries for data analysis. Researchers are increasingly using Python due to its large and active scientific computing community, ease of interactive data analysis, and utility as a general purpose programming language [13]. The software library Pandas is central to conducting data analysis in Python. Pandas introduces a DataFrame object which simplifies manipulation of structured datasets. When working with a DataFrame, Pandas provides a number of convenient routines to calculate averages, medians, and other aggregate measures. tableone utilizes DataFrames to summarize and present data, leveraging the popularity of Pandas among the scientific community and the excellent integration of Pandas with literate computing approaches such as Jupyter Notebooks [15,16].

Our aim in developing tableone is to provide a simple, reproducible method for providing summary statistics for research papers in the Python programming language. In doing this, we provide features such as: automatic detection of categorical variables; reporting of P-values with adjustments for multiple hypothesis testing; grouping of measures by a variable such as the primary outcome; and customizable formatting options. Variables defined as normally
distributed are summarized by mean and standard deviation by default, while non-normally distributed variables are summarized by median and interquartile range.

Mean and standard deviation are often poor estimates of the center or dispersion of a variable’s distribution when the distribution: is asymmetric, has “fat” tails and/or outliers, contains only a very small finite set of values or is multimodal. Median and interquartile range may offer a more robust summary than mean and standard deviation for skewed distributions or in the presence of outliers, but may be misleading in cases such as multimodality. Several tests have therefore been incorporated to raise potential issues with reported summary statistics. For example, Hartigan’s Dip Test is computed and a warning message is generated if the test results indicate a possible multimodal distribution [7,17]. Similarly, Tukey’s Rule highlights outliers in distributions that may distort the mean. While formal statistical checks can be useful in detecting potential issues, they often are not very useful in small sample sizes so these tests should be used alongside standard visualization methods.

When multiple hypotheses are tested, as may be the case when numerous variables are summarized in a table, there is a higher chance of observing a rare event. To help address this issue, corrections for multiple comparisons have been implemented. By default, the package computes the Bonferroni correction, which addresses the issue in a simple way by dividing the prespecified significance level (Type I error rate, α) by the number of hypothesis tests conducted. This approach is known to over-correct, effectively reducing the statistical power of the tests, particularly when the number of hypotheses is large or when the tests are positively correlated. There are many alternatives which may be more suitable and also widely used, and which should be considered in situations that would be adversely affected by the conservative nature of the Bonferroni correction [20–22].

The tableone package was developed following good practice guidelines for scientific computing. The code is openly available on GitHub under a permissive MIT License, enabling continuous, collaborative development. Issues are tracked publicly in the repository and guidelines for contributing to the package are provided, promoting transparency and helping to ensure that the software functionality meets the demand of the scientific community. Contributions that address known issues such as feature developments and bug fixes are actively encouraged. A continuous integration server is used to test new contributions, adding an additional level of quality control to proposed changes. Package dependencies, defined in the repository, include Pandas, NumPy, SciPy, and StatsModels [25–28].

RESULTS

The tableone package has been published on the Python Package Index (PyPI), a repository of software for the Python programming language. It is therefore straightforward to install using the standard installation command: “pip install tableone”. The dataset to be summarized must be provided as a Pandas DataFrame, structured so that each row captures a unique case (eg a patient) and each column pertains to an observation associated with the case (eg patient age or a laboratory test result).

After importing the package into the Python environment, the simplest application of it is to create an instance of the TableOne class with the DataFrame to be summarized (“data”) as a single input argument, as follows:

    mytable = TableOne(data)

In this case, the package will create a new DataFrame containing the summary statistics, automatically identifying continuous and categorical variables within the data and summarizing them appropriately. Once generated, the table may be viewed on screen or exported to a range of established formats, including LaTeX, CSV, and HTML using the “to_format()” methods (for example, “mytable.to_latex()”). When the table is generated, automated tests will print a series of remarks that highlight potential issues to the researcher. For example, if outliers are indicated by Tukey’s rule, the researcher is warned to consider the implications of this with respect to the summary statistics.

We provide an executable Jupyter Notebook alongside the code that demonstrates the application of the package to a small cohort of patients in MIMIC-III (Figure 1). MIMIC-III is a large, publicly available dataset of critically ill patients admitted to intensive care units (ICUs) at the Beth Israel Deaconess Medical Center in Boston, MA, USA. The example subset corresponds to 1000 patients who stayed at least 48 h in the ICU and contains demographics, treatment, and survival status at hospital discharge. Table 1 shows an example of the output of the tableone package, and Table 2 shows the first 5 rows of the dataset prior to summarization. Figure 2 shows a kernel smoothed density for the Age and SysABP variables, highlighting the multimodality concerns raised by the tableone package. Figure 3 shows a box-plot of the data, with circles indicating outlying points warned about by Tukey’s test. The package is under continuous development, so for up-to-date information we suggest reviewing the package documentation, which is available online.

DISCUSSION

We encourage use of tableone alongside other methods of descriptive statistics and, in particular, visualization to ensure appropriate data handling. When used in this way, the package helps researchers to create summary statistics for study populations, an integral task for almost any research study. The default settings have been carefully chosen to match the preferences of most researchers and to adhere to best practices, with the intention that only minor configurations are generally necessary when generating the table. Such configurations would include specifying grouping variables (such as study outcome), adding alternative labels for variable names, and limiting the number of levels to display for a categorical variable.

In providing a reproducible approach to generating a summary table from a dataset, we hope to reduce the contribution of coding and data entry errors to misreported statistics. The consistency of a standardised approach will help to discourage some of the common reporting issues discussed previously. Automated tests for issues such as multimodality and outliers will raise warnings for the researcher, helping to catch and prevent potentially misleading summary statistics before they are reported. Plotting the distribution of each variable by group level via histograms, kernel density estimates and boxplots is a crucial component to data analysis pipelines, however, and these tests are not intended to replace such methods. Visualization is often the only way to detect problematic variables in many real-life scenarios.

By default we do not support statistical hypothesis tests for comparison of distributions, because as a general rule we believe that it is best practice not to do so [1,2,6,31]. However, as has been highlighted elsewhere, many journals still require P-values alongside summary statistics. In their guidelines for authors, for example, the New England Journal of Medicine includes the following statement: “For tables comparing treatment groups at baseline in a randomized trial (usually the first table in the manuscript), significant differences between or among groups (ie, P < 0.05) should be identified in a table footnote and the P-value should be provided in the format specified above.” To encourage the wider adoption of methods which account for multiple comparisons, we have implemented methods such as the Bonferroni and Sidak corrections.

Sharing a tool such as tableone creates a responsibility to promote better practice and to avoid propagating poor practice, and we are committed to working with the research community to ensure this is done. Documentation and example code will be continuously improved and used to encourage authors to observe study reporting guidelines. Statistical referees of research studies using tableone should benefit from the fact that their feedback can be fed into the package for future users, helping to promote good practice within a community rather than simply being directed at the authors of a single study.

Figure 1. An executable Jupyter Notebook provides worked examples for applying the TableOne package to exemplar data.

Table 2. Example of the data used, showing the first 5 rows

Age   SysABP   Height   Weight   ICU    MechVent   LOS   death
54    NaN      NaN      NaN      SICU   0          5     0
76    105.0    175.3    80.6     CSRU   1          8     0
44    148.0    NaN      56.7     MICU   0          19    0
68    NaN      180.3    84.6     MICU   0          9     0
88    NaN      NaN      NaN      MICU   0          4     0

Each row captures a unique case (eg a patient) and each column pertains to an observation associated with the case (eg patient age).
NaN: Not a Number; SysABP: systolic arterial blood pressure; ICU: intensive care unit; SICU: surgical ICU; CSRU: cardiac surgery recovery unit; MICU: medical ICU; MechVent: mechanical ventilation; LOS: hospital length of stay.
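As a minimal sketch of the Bonferroni and Sidak corrections named above, and not the package's own implementation, the per-test significance level for m hypothesis tests at an overall level α can be computed as α/m (Bonferroni) or 1 − (1 − α)^(1/m) (Sidak). The function names below are hypothetical and chosen for illustration only.

```python
# Illustrative helpers only; these names are hypothetical and are not part of tableone.


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level under the Bonferroni correction: alpha / m."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def sidak_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level under the Sidak correction: 1 - (1 - alpha)^(1/m)."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


def adjust_pvalues_bonferroni(pvalues):
    """Bonferroni-adjusted P-values: multiply each by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


if __name__ == "__main__":
    alpha, m = 0.05, 8  # e.g. eight variables compared between two groups
    print(f"Bonferroni per-test alpha: {bonferroni_alpha(alpha, m):.5f}")
    print(f"Sidak per-test alpha:      {sidak_alpha(alpha, m):.5f}")
    print(adjust_pvalues_bonferroni([0.001, 0.02, 0.2]))
```

For α = 0.05 and m = 8 the Bonferroni threshold is 0.00625, and the Sidak threshold is slightly larger, which makes it marginally less conservative when the tests are independent.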
In addition, referees carrying out detailed methodological code reviews on a study-by-study basis should find it more straightforward to assess a single function call to tableone (with publicly discussed strengths and weaknesses) than to review custom code for this task in each case.

Figure 2. A test for modality raises a warning message for both “Age” and “SysABP” (systolic arterial blood pressure). Upon inspection, SysABP shows clear peaks at both 0 and 120.

Figure 3. Box-plot of 3 variables with whiskers located at a distance of three times the interquartile range. Points outside these whiskers are labeled “far outliers” and denoted by circles. A test for far outliers with Tukey’s rule raises a warning for height but not age or systolic arterial blood pressure (SysABP).

CONCLUSION

We describe the release of the tableone package for Python. The package provides a reproducible approach for compiling summary statistics for research papers into a publishable format. The package will be continuously improved and updated, based on community feedback, and will encourage good practices for scientific reporting. It should be noted that while we have tried to follow best practices, automation of even basic statistical tasks can be unsound if done without supervision. We, therefore, suggest seeking guidance from a statistician when using tableone for a research study, especially prior to submitting the study for publication.

FUNDING

The authors were supported by grants NIH-R01-EB017205 and NIH-R01-EB001659 from the National Institutes of Health.

Conflict of interest statement. None declared.

CONTRIBUTORS

TJP, AEWJ, and JDR developed the software. TJP, AEWJ, JDR, and RGM contributed to the paper and approved the final submission.

ACKNOWLEDGEMENTS

We would like to thank Kazuki Yoshida and Justin Bohn for creating the tableone package for R, which inspired this work. We would also like to thank the reviewers, and especially Reviewer 1, for providing thoughtful and constructive suggestions for improving the package.

REFERENCES

1. Schulz KF, Altman DG, Moher D. CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials. BMJ 2010; 340: c332.
2. The EQUATOR Network: Enhancing the QUAlity and Transparency of Health Research. http://www.equator-network.org/ (Accessed March 20, 2018).
3. Meier SM, Mattheisen M, Mors O, Schendel DE, Mortensen PB, Plessen KJ. Correction: errors in Table 1. JAMA Psychiatry 2018; 75 (1): 104.
4. Myles PS, Smith JA, Forbes A, et al. Correction: tranexamic acid in patients undergoing coronary-artery surgery. N Engl J Med 2018; 378 (8):
5. Yoshida K, Bohn J. Package ‘tableone’ for R. https://cran.r-project.org/web/packages/tableone/tableone.pdf (Accessed December 31, 2017).
6. Lang TA, Altman DG. Basic statistical reporting for articles published in Biomedical Journals: The “Statistical Analyses and Methods in the Published Literature” or the SAMPL Guidelines. Int J Nurs Stud 2015; 52 (1): 5–9.
7. Glantz SA. Biostatistics: how to detect, correct and prevent errors in the medical literature. Circulation 1980; 61 (1): 1–7.
8. Yancey JM. Ten rules for reading clinical research reports. Am J Orthod Dentofacial Orthoped 1996; 109 (5): 558–64.
9. Goodman SN, Altman DG, George SL. Statistical Reviewing Policies of Medical Journals. J Gen Intern Med 1998; 13 (11): 753–6.
10. Johnson AEW, Stone DJ, Celi LA, Pollard TJ. The MIMIC Code Repository: enabling reproducibility in critical care research. J Am Med Inform Assoc 2018; 25 (1): 32–9.
11. Johnson AEW, Pollard TJ. Reproducibility in critical care: a mortality prediction case study. In: Proceedings of Machine Learning for Healthcare. W&C Track Volume 68. http://proceedings.mlr.press/v68/johnson17a/johnson17a.pdf. Accessed May 1, 2018.
12. Perkel JM. Programming: pick up Python. Nature 2015; 518 (7537): 125–6.
13. McKinney W. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. Boston, MA: O’Reilly Media, Inc.; 2012.
14. McKinney W. pandas: a foundational Python library for data analysis and statistics. In: Proceedings of Python for High Performance and Scientific Computing (PyHPC); 2011: 1–9.
15. Kluyver T, Ragan-Kelley B, Pérez F, et al. Jupyter Notebooks – a publishing format for reproducible computational workflows. In: Loizides F, Schmidt B, ed. Positioning and Power in Academic Publishing: Players, Agents and Agendas. Clifton, VA: IOS Press; 2016: 87–90.
16. Ragan-Kelley M, Perez F, Granger B, Kluyver T, et al. The Jupyter/IPython architecture: a unified view of computational research, from interactive exploration to communication and publication. In: American Geophysical Union, Fall Meeting Abstracts 2014.
17. Hartigan JA, Hartigan PM. The dip test of unimodality. Ann Stat 1985; 13 (1): 70–84.
18. Mohd Razali N, Yap B. Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling Tests. J Stat Model Anal 2011; 2: 21–33.
19. Shaffer JP. Multiple hypothesis testing. Annu Rev Psychol 1995; 46 (1): 561–84.
20. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 1995; 57 (1): 125–33.
21. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat 1979; 6 (2): 65–70.
22. Šidák ZK. Rectangular confidence regions for the means of multivariate normal distributions. J Am Stat Assoc 1967; 62 (318): 626–33.
23. Wilson G, Bryan J, Cranston K, et al. Good enough practices in scientific computing. PLoS Comput Biol 2017; doi:10.1371/journal.pcbi.1005510.
24. Pollard TJ, Johnson AEW. Source code for the tableone package. https://github.com/tompollard/tableone (Accessed December 31, 2017).
25. Jones E, Oliphant E, Peterson P, et al. SciPy: Open Source Scientific Tools for Python, http://www.scipy.org/ (Accessed December 31, 2017).
26. Pérez F, Granger BE. IPython: a system for interactive scientific computing. Comput Sci Eng 2007; 9 (3): 21.
27. Walt SV, Colbert SC, Varoquaux G. The NumPy array: a structure for efficient numerical computation. Comput Sci Eng 2011; 13 (2): 22–30.
28. Seabold S, Perktold J. Statsmodels: Econometric and statistical modeling with python. In: van der Walt S, Millman J, eds. Proceedings of the 9th Python in Science Conference. Austin, TX: SciPy; 2010: 57–62.
29. Johnson AEW, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci. Data 2016; doi: 10.1038/sdata.2016.35.
30. tableone Documentation. http://tableone.readthedocs.io/en/latest/ (Accessed March 20, 2018).
31. Murray GD. Statistical aspects of research methodology. Br J Surg 1991; 78 (7): 777–81.
32. Palesch YY. Some common misperceptions about p-values. Stroke 2014; 45 (12): e244–6.
33. New England Journal of Medicine: Instructions for Authors. http://www.nejm.org/author-center/new-manuscripts (Accessed March 20, 2018).
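Supplementary sketch. As a rough illustration of the far-outlier check described in the Materials and Methods and in the caption of Figure 3, the following assumes that “far outliers” are values lying more than three interquartile ranges beyond the first or third quartile. The function name, the quartile convention, and the example heights are assumptions made for illustration; the package's own implementation may differ.

```python
import math
import statistics


def tukey_far_outliers(values, k=3.0):
    """Return the values lying more than k * IQR outside the quartiles.

    k=3.0 corresponds to Tukey's "far out" fences; k=1.5 gives the usual
    box-plot fences. Missing values (None or NaN) are ignored.
    """
    clean = [v for v in values if v is not None and not math.isnan(v)]
    if len(clean) < 2:
        return []
    # Quartile conventions differ between tools; "inclusive" is one common
    # choice and is an assumption here, not necessarily what tableone uses.
    q1, _, q3 = statistics.quantiles(clean, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in clean if v < lower or v > upper]


if __name__ == "__main__":
    # Made-up heights in cm, with one missing value and one implausible entry.
    heights = [160.0, 165.1, 170.2, 172.7, 175.3, 180.3, 182.9, float("nan"), 431.8]
    print(tukey_far_outliers(heights))
```

A flagged value is a prompt to inspect the variable, for example with a box-plot, rather than a reason to drop the observation automatically.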
JAMIA Open – Oxford University Press
Published: Jul 1, 2018