TY - JOUR AU - Verdier, James M AB - On 8 December 2015, the American Institute of Biological Sciences (AIBS) convened a meeting titled Addressing Biological Informatics Workforce Needs, bringing together key stakeholders representing federal agencies, universities, scientific societies, research organizations, funders, and others, with the aim of addressing education and training issues related to the biological informatics workforce. The meeting, held in conjunction with AIBS's annual Council of Member Society and Organization meeting, considered biological informatics broadly and defined it as the interdisciplinary science of collecting, curating, analyzing, publishing, planning, documenting, and archiving complex biological data, including molecular and sequence data. The meeting built on the foundation laid by two previous AIBS workshops: Changing Practices in Data Publication and Enhancing Complex Data Integration across Research Domains. The reports from those workshops identified education, training, governance, and infrastructure as potential cross-cutting barriers to the best use of an increasing amount and array of biological and environmental data. This plethora of data is a result of rapid technological progress, notably in instrumentation (including molecular techniques) and from digitization. In this article, we summarize the key discussions and proposed recommendations for future action, with the aim of invigorating the present discussion on workforce training. The areas identified for continued development were undergraduate and graduate training, as well as the need for training and new career paths for informatics researchers and practitioners. Here, we share recommendations for scientific societies, faculty members, academic libraries, and research funders, focusing on several key areas for scientific inquiry.
Undergraduate education The importance of providing undergraduate students with exposure to data science was highlighted when participants noted that many doctoral programs rate students who have a bioinformatics background as better qualified than other applicants. Scripting, in particular, is a much-sought-after skill. Python and R are languages that provide a gentle learning curve for scripting and for learning how to handle data. Bioinformatics experience is marketable for those with undergraduate or graduate degrees, with R, Perl, Python, C/C++, Java, and MySQL skills being particularly valued. According to the workshop participants, there are too few high-quality programs providing undergraduate educational experiences. One example of an enabling resource is CourseSource, an online journal supported by the Howard Hughes Medical Institute, which allows people to upload classroom modules. The authors of these modules receive credit for a publication. Departmental homes Computer science departments can be good partners for biology departments that want their students to develop scripting skills, but achieving the necessary compromises between departments is sometimes an obstacle. A participant shared an example of a scientific computing department that grew out of an interdisciplinary science program. Unfortunately, the process was complicated by the challenges associated with aligning computer science and biology departmental objectives for curricular breadth and depth. Continuing needs Biologists often learn to code out of necessity. This model is not ideal and contributes to the development of bad practices. One participant argued that, in general, computer science departments are not good places for biologists to look for best practices in programming. The importance of improved instruction in statistics for all students was reaffirmed and offered as one part of the solution to preventing the adoption of bad practices. 
DataONE, which has produced many informatics training materials, suggests that every undergraduate student taking an introductory biology course would benefit from at least a 1-hour lecture on how data are acquired and managed. Advanced undergraduates (along with graduate students and postdocs) who want to develop higher-level expertise would benefit from a seminar course that covers best practices and tools for managing data throughout the data and research life cycles. [Photograph: Stephanie Hampton, former deputy director of the National Center for Ecological Analysis and Synthesis.] Undergraduate biology courses rarely require students to have data training, although there are exceptions. One is the Berkeley Data Science Education Program, which is starting to expose undergraduates to computational thinking through hands-on work with real data. However, informatics-training initiatives, which are often organized by professional societies, do not typically reach undergraduates, who tend not to be involved with professional societies. Yet these programs, if paired with sufficient outreach, might give professional societies an opportunity to offer tangible, career-focused benefits to students. The question of how to teach the teachers of data science is rarely addressed, and undergraduate education tends to rely on traditional courses and on faculty. In addition, undergraduate data education is often restricted to analysis and does not teach a broader understanding of data and data collection skills. Educators therefore need to define a core set of data skills that are desirable and marketable to employers, but no consensus on that core set of skills exists at present. [Photograph: Clifford Duke, presently with the National Academy of Sciences and a member of the BioScience Editorial Board.]
The Quantitative Undergraduate Biology Education and Synthesis (QUBES) project has done important work centralizing quantitative course materials. However, faculty members can sometimes find it difficult to negotiate space in the curriculum for new material. Bringing research data into undergraduate classrooms has proven a promising way to teach data science (e.g., Teaching Issues and Experiments in Ecology, an initiative of the Ecological Society of America). Such projects could form the basis of a semester-long course or a workshop. Other promising approaches include small-group virtual meetings. For systematic biology and the digitization of biodiversity collections, there is, at the undergraduate level, a need to develop cross-disciplinary integrations of data and a need to have organismal classes that incorporate taxonomy and field collections. There is also a need to teach data literacy and quantitative, geographic, and soft skills. Examples of soft skills are the ability to communicate effectively in written and oral formats, the ability to work independently and collaboratively, the ability to innovate, and the ability to make careful observations. QUBES, NIBLSE (the Network for Integrating Bioinformatics into Life Sciences Education), and AIM-UP! (Advancing Integration of Museums in Undergraduate Programs) have all done valuable work in these areas. Training in programming and the use of databases, as well as specialist tools such as GIS (geographic information systems), can even be delivered to high-school students in summer courses or labs. Teachers can retrieve modularized pedagogical material via the Internet and fit it into existing curricula as “microinsertions.” Overall, educators need to better integrate the various efforts now under way.
There is an unmet need for meaningful undergraduate research experiences involving specimen collection and curation, for mentoring of student workers, and for the use of educational modules that incorporate natural history collection data. Graduate training Many thoughtful observers believe that society urgently needs more interdisciplinary research, because the tools for it now exist and solutions to many of society's most pressing problems span multiple disciplines. There remain many barriers to interdisciplinary research in academia, however. Credit and incentives to encourage people to learn skills relevant to interdisciplinary research are not yet common. Perhaps partly in consequence, students with good data skills often leave academia for industry. The Gordon and Betty Moore Foundation's (GBMF) Data-Driven Discovery Initiative, a joint project with the Sloan Foundation, is one important example of a response; it will support researchers with computational, math, and statistics skills, as well as domain expertise. [Photograph: Cynthia Parr, US Department of Agriculture Agricultural Research Service.] Continuing needs Reports starting in 2012, including from the Council of Graduate Schools, the National Research Council, the National Institutes of Health (NIH), and the American Chemical Society, have been critical of graduate student training broadly in the sciences. Their principal criticisms were that the time needed to obtain a degree was too long; the master's degree was undervalued; the training was often narrow and provided few transferable skills; career mentoring was focused mainly on an envisaged future career in academia; and training was not aligned with disciplinary, workforce, societal, and student needs.
Their recommendations for graduate training included expanding and enhancing professional skills, preparing students for multiple career pathways, creating incentives for university–industry partnerships, expanding interdisciplinary training, and using evidence-based approaches to increase retention and reduce time to degree. Responding to these issues, the National Science Foundation (NSF) launched a new flagship graduate traineeship initiative—the NSF Research Traineeship (NRT) 
program. Established in 2014, the program is intended 
to catalyze and advance cutting-edge interdisciplinary research in high-priority areas; increase the capacity of graduate programs to produce interdisciplinary science, technology, engineering, and mathematics (STEM) professionals with technical and transferable professional skills for a range of careers; and develop innovative approaches and knowledge that will promote transformative improvements in graduate education. Since its inception, the NRT has explicitly sought and funded new approaches to integrating data science into graduate education and to helping institutions build training capacity in data-enabled science and engineering. The program consists of two tracks. The NRT traineeship track is a traditional comprehensive, interdisciplinary graduate STEM traineeship in high-priority research areas. Funding is provided to institutions for up to 5 years, with maximum awards of $3 million to support master's or doctoral degree students. The track has one priority interdisciplinary research theme—data-enabled science and engineering (DESE). One common feature of DESE awards is intensive, vertically integrated training, in which faculty and postdocs collaborate. This feature was introduced to counter the loss of expertise that previously occurred when students graduated. The innovations in graduate education track consists of smaller, 3-year awards of up to $500,000. It provides no student support but, rather, pilots graduate education projects. These projects are intended to help students learn how to exploit data and use novel data-driven approaches. Following from the experience of DataONE, meeting participants supported the idea that graduate students should take a seminar course that covers the best practices and tools for managing data throughout the data and research life cycles. The meeting participants supported exposing first-year graduate students to real data challenges. 
The Biodiversity Collections Network (BCoN) is building on efforts begun by AIM-UP! and working to foster the development of a community of practice that infuses specimen-based learning and exploration into formal and informal science education. Students are often interested in data infrastructure and training, but there are few programs available to which they can be directed. The lack of available training for graduate students in computing and informatics was noted by Hernandez and colleagues in BioScience (doi:10.1525/bio.2012.62.12.8) as a factor limiting data integration: Over 80 percent of the students in California who participated in a survey said they had received no training in computing and informatics. A survey conducted by Strasser and Hampton attributed the deficiency to a shortage of time and to the lack of preparation of students and instructors. The National Center for Ecological Analysis and Synthesis (NCEAS) Distributed Graduate Seminars, which are focused on a scientific synthesis project, are one promising route to providing such training. Another is the NCEAS Summer Institute, which combines hands-on exercises and small group sessions. The program has achieved success in teaching version control, data sharing, data “wrangling,” and collaboration skills. These institutes are, however, very oversubscribed. For example, in 2013, NCEAS received over 400 applications for 22 seats in one course. Training and career paths for researchers and practitioners According to an estimate by Change the Equation, 7.7 million people use complex computing in their jobs, which is 3.9 million more than the US Bureau of Labor Statistics reports. Data entropy is a major problem in science: Data cited in older publications are very often unavailable. For publications more than 15 years old, more than half of the data referred to are no longer accessible, according to one estimate. 
Handling data constitutes a cycle that involves data production, data reuse, data cleaning, data exploration, and data preservation. Increasingly, we can expect to see automated processes involved. However, when many researchers are asked what metadata standard they use, they answer “none” or “one created in my lab.” The BCoN, as one example, seeks to enhance the training of existing collections staff and create the next generation of biodiversity information managers to improve the existing state of affairs. Continuing needs Early-career training is essential and should recognize that information scientists often do not have domain knowledge in the fields from which they are using data. Programming and the use of databases, as well as specialist tools such as GIS, can be incorporated into such training. Panelists at the meeting recommended intensive “research sprints” or “hackathons” that would bring together groups of researchers to solve problems quickly, particularly as a tool for teaching basic programming, data standards, and semantics. Intellectual property is also a challenge, with important and complex issues that are not necessarily widely understood by researchers. For example, data cannot be copyrighted, but specific data compilations can be. Training should therefore cover the legal aspects of intellectual property, privacy issues, and similar concerns. Instruction in data management can conveniently and usefully be joined with ethics training, such as that mandated by the responsible conduct of research provisions of the America COMPETES Act of 2007. Published standards can and should continue to play an important role in helping researchers to adopt good data practices. Coursework in standards is an essential antidote to the widespread tendency to create a new data “standard” rather than use an existing one. That tendency has negative consequences: Data ostensibly supporting publications are being published in formats that make them impossible to reuse.
Formal semantic schemes are becoming increasingly important, not just for good data practice but also for machine-learning approaches to text and data mining, for annotation, and for data integration. The OBOE ontologies for scientific observations, developed for the NCEAS, are one such valuable development. Researchers must become comfortable with change and be willing to learn on the job, because the tools for data-intensive science are in a state of flux. They should be unafraid to ask questions, because a thriving online culture supports learners. GitHub and Stack Overflow are helpful sites for such learning. Some commentators have argued that professional recognition for data skills is generally lacking in academia. A few universities are, however, making appointments in data science, and some federal agencies are setting up specialized data units, so it appears that attitudes are changing. Three data science environments supported by the Gordon and Betty Moore Foundation (at New York University, the University of Washington, and the University of California, Berkeley) have a careers working group that seeks to find professional roles for computationally savvy researchers. The foundation also supports the Jupyter interactive lab notebook for sharing workflows, the scientific programming language Julia, individual investigators, and Data Carpentry, an organization that runs scores of intensive advanced data training workshops around the country. Data Carpentry recognized the demand for computational expertise several years ago. Many domain scientists have very little programming or computational experience when they start training. Data Carpentry is now helping to fill the strong unmet training need by providing domain-specific, hands-on intensive workshops around the country (34 events were held during 2015). These are developed by and for practitioners and are intended to identify best skills and practices, with an emphasis on foundational skills.
Their format as add-ons solves the problem of the lack of time in curricula for data training. Data Carpentry is also working on a train-the-trainers approach with Software Carpentry, a slightly older organization with some similarities. Several workshop participants voiced support for the sort of short workshops provided by Data Carpentry, as well as for cross-disciplinary hackathons. A major effort at the NIH is the extramural Big Data to Knowledge (BD2K) initiative, which complements an intramural program. The BD2K expends funds in a disease-agnostic fashion on efforts that support data use across domains; data integration of a wide range of data types is a major focus. The initiative is meant to develop and improve data science skills, build a diverse workforce, ensure that training opportunities are available at all levels from undergraduate to senior faculty, and foster collaborations between data scientists and biomedical scientists. Almost 20 percent of the BD2K budget goes to training. This supports a wide variety of educational resources, courses (including massive open online courses), and training and career development programs. Funding is also available to give students at less-research-intensive universities experience working with data science. The NIH emphasizes making educational resources easy to discover through its training coordination center. One meeting participant and observer of the data deluge in modern biology noted that the term bioinformatics, although it might be taken to refer to any basic biological data, in practice, is often understood as referring to molecular data. Biodiversity informatics, which can help researchers understand functional biodiversity, is very much a developing field. Although some understand it to concern analysis workflows and methods, others include within biodiversity informatics data infrastructure and knowledge assembly and provisioning. These aspects of biodiversity informatics have different audiences. 
The most fragile part is data infrastructure and training, which seem to be less valued than analytical methods and application services. There are fewer clear career pathways for those interested in the data side than for those on the analysis workflows side. Some incentives exist, but in general, there are too few opportunities on the data side. A workshop held in September 2015 at NCEAS on data-intensive skills across the environmental sciences emphasized the importance of the concordance of the skill classes needed, such as data management and processing, software skills needed for science, analysis, visualization, communication and dissemination, and collaboration and synthesis. To scale up existing efforts, independent training should be encouraged, with workshops and materials coordinated through other organizations (e.g., Data Carpentry). Scaling up will also depend on networked assessments to generate a higher-level view of what is needed. A networked graduate course run using Software Carpentry techniques at multiple universities could be part of the solution; it would be focused on a different topic each year. An idea incubated at this meeting was based on the recognition that teaching can be very demanding. It may therefore be useful for less experienced educators, perhaps postdocs, to partner with more experienced instructors for a time, then move on to a different institution to train others once they have gained experience. CyVerse (formerly iPlant Collaborative) has developed powerful computational resources that are now more widely available. CyVerse is dedicated to advancing team science and is recognized by the National Center for Biotechnology Information as a center for data provision. This designation makes it easier for individual researchers to use its services. CyVerse hews to a platform philosophy that recognizes the importance of working at scale and avoids the mistake of assuming that one size fits all.
Users can deploy and use different building blocks as they need them in a cloud space called iPlant Atmosphere. They can also custom design appliances by extending existing components and publish their findings. In addition, CyVerse makes available a number of science APIs (application programming interfaces). It thus provides tools that allow people to manage their digital assets and improve computational productivity. Over 40 herbaria are integrated into the collaborative, but usage extends beyond plants and life sciences (e.g., breast cancer research, psychological and social research, and climate research). CyVerse has created a course on applied concepts in cyberinfrastructure and special topics workshops. DataONE has provided informatics training in a range of venues and at varying degrees of depth for the past 6 years, from screencast tutorials only a few minutes long to 2-week-long intensive graduate courses. High-quality, modular slide presentations, handouts, and exercises enable faculty members to easily create and modify informatics seminars and lectures for students. Recommendations Breakout groups generated the following recommendations for various stakeholder groups on the basis of the discussions outlined in the previous sections. Recommendations for scientific societies Scientific societies should provide plenary sessions with keynote speakers, as well as sections on pedagogy and symposia exploring the importance of data curation. These sessions should be scheduled for well-attended portions of meetings and not limited to pre- or postmeeting workshops. Professional communities should also promote and enforce data archiving policies by encouraging their journals to publish articles on data science methodology and software.
Journal editors and researchers expressed concern that the federal push for data publication was requiring editors to take on an enforcement function (i.e., to ensure that authors publish adequate data), without financial support. A contrasting sentiment expressed was that some journals profit from the products of research while not paying reviewers, so they are morally obliged to house data products that support research. Others wondered how many of the data associated with a research project will have to be made public and whether tangential material, such as researchers’ emails, would be included. The community and funders must resolve these expectations and standards of practice. The push to publish data has prompted biological societies and journals to encourage, if not require, the publication of original data supporting a scientific article with the report, possibly after an embargo period. Many journals require that data be made available to reviewers at the time an article is submitted, but some researchers object to this requirement. Shared standards of practice are required to prevent an uneven playing field and to ensure that data are indeed available. Because reviewers are not trained adequately to evaluate the data submitted with a manuscript, many papers are now being published with data that have been processed, precluding replication of the original analysis or reuse of the data. This tendency undermines the value of data publication. Scientific societies could take responsibility for helping to prevent this damaging outcome. Recommendations for faculty members Universities should require basic data standards instruction for newly hired staff in the same way that they now require human resources training. If developing a course between a biology department and another academic entity, such as a library or information school, proves impossible, creating a new interdisciplinary institute may be a solution. 
One participant shared the promising idea of having a computer scientist and a biologist coteach a semester-long data science course at a university, with one postdoc partner. The postdoc would then move on to another institution to share the expertise.
Additional resources
Ag Data Commons https://data.nal.usda.gov
Advancing Integration of Museums in Undergraduate Education Research Coordination Network (AIM-UP!) http://aimup.unm.edu
Berkeley Data Science Education Program https://requestinfo.datascience.berkeley.edu
Big Data To Knowledge (BD2K) https://datascience.nih.gov/bd2k
Biodiversity Collections Network (BCoN) https://bcon.aibs.org
CourseSource www.coursesource.org
CyVerse (the iPlant Collaborative) www.cyverse.org
Data Carpentry www.datacarpentry.org
DataONE www.dataone.org
Data-Driven Discovery Initiative www.moore.org/programs/science/data-driven-discovery
Encyclopedia of Life's TraitBank http://eol.org/info/traitbank
GitHub https://github.com
NCEAS Distributed Graduate Seminars www.nceas.ucsb.edu/research/dgs
NCEAS Summer Institute www.nceas.ucsb.edu/outreach/summer-institute/2013/summer-institute-2013
National Science Foundation Research Traineeship (NRT) Program www.nsf.gov/publications/pub_summ.jsp?ods_key=nsf16503
Network for Integrating Bioinformatics into Life Sciences Education (NIBLSE) https://qubeshub.org/groups/niblse
OBOE Semantic Tools Project https://github.com/NCEAS/oboe/
Stack Overflow http://stackoverflow.com
Teaching Issues and Experiments in Ecology www.esa.org/tiee/misc/about.html
The Quantitative Undergraduate Biology Education and Synthesis (QUBES) project https://qubeshub.org
Recommendations for academic libraries Libraries were recognized as important potential transmitters of good data practices, and they are well positioned to help in the preparation of data management plans. Forward-looking libraries are already sharing information about basic data management and intellectual property awareness.
Recommendations for funders Funders, whether government or foundation, can play an important role in addressing education and training issues related to the biological informatics workforce. Individuals and institutions alike are prone to act when a funding source offers clear incentives. In addition, funding sources should support the hiring of data professionals to advance their requirements. Conclusions Ever-increasing torrents of data have become a hallmark of twenty-first century science, and managing the deluge constitutes one of the principal hurdles facing governmental agencies, scientific societies, researchers, funders, and students. Although data-management challenges are often acute, it is clear that the organizations and individuals who best manage them will lead the way in developing the interdisciplinary tools and techniques needed to conduct sound science in this new era. Furthermore, those who rise to the challenge will be instrumental in the formation of an effective bioinformatics workforce, well positioned to transform data challenges into burgeoning scientific opportunities. Acknowledgments AIBS gratefully acknowledges financial support for the meeting from the New Mexico Experimental Program to Stimulate Competitive Research (EPSCoR), the Society for the Study of Evolution, The Society for Integrative and Comparative Biology, the Ecological Society of America, iDigBio, the iPlant Collaborative (now CyVerse), the Biodiversity Collections Network (NSF Division of Biological Infrastructure grant no. 1441785), the Society for the Preservation of Natural History Collections, DataONE, the Computational Biology Institute of George Washington University, and the University of Kansas Biodiversity Institute. Author Biographical Timothy M. Beardsley is the executive editor at the Endocrine Society but was with AIBS during the meeting that is the subject of this report. Robert E.
Gropp is the executive director of the American Institute of Biological Sciences. James M. Verdier is the senior editor of BioScience. © The Author(s) 2018. Published by Oxford University Press on behalf of the American Institute of Biological Sciences. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model) TI - Addressing Biological Informatics Workforce Needs: A Report from the AIBS Council JF - BioScience DO - 10.1093/biosci/biy116 DA - 2018-11-01 UR - https://www.deepdyve.com/lp/oxford-university-press/addressing-biological-informatics-workforce-needs-a-report-from-the-Kk2dXhkHKk SP - 847 VL - 68 IS - 11 DP - DeepDyve ER -