Purpose
This paper aims to identify data set entities in scientific literature. To address the poor recognition performance caused by a lack of training corpora in existing studies, a distant supervised learning-based approach is proposed to identify data set entities automatically from large-scale scientific literature in an open domain.

Design/methodology/approach
Firstly, the authors use a dictionary combined with a bootstrapping strategy to create a labelled corpus for supervised learning. Secondly, a bidirectional encoder representations from transformers (BERT)-based neural model is applied to identify data set entities in the scientific literature automatically. Finally, two data augmentation techniques, entity replacement and entity masking, are introduced to enhance the model's generalisability and improve the recognition of data set entities.

Findings
In the absence of manually labelled training data, the proposed method can effectively identify data set entities in large-scale scientific papers. The BERT-based vectorised representation and the data augmentation techniques enable significant improvements in the generality and robustness of named entity recognition models, especially in long-tailed data set entity recognition.

Originality/value
This paper provides a practical research method for automatically recognising data set entities in scientific literature. To the best of the authors' knowledge, this is the first attempt to apply distant supervision to the study of data set entity recognition. The authors introduce a robust vectorised representation and two data augmentation strategies (entity replacement and entity masking) to address the problems inherent in distant supervised learning methods, which existing research has mostly ignored. The experimental results demonstrate that the proposed approach effectively improves the recognition of data set entities, especially long-tailed data set entities.
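The abstract names three mechanical steps: dictionary-based labelling, entity replacement and entity masking. The following Python sketch illustrates one plausible reading of them. It is not the authors' implementation. The BIO tag names, the `[MASK]` placeholder, the choice to mask every token of a mention, and all function names are assumptions made for illustration; the bootstrapping step and the BERT-based tagger are omitted.

```python
import random


def label_with_dictionary(tokens, dictionary):
    """Assign BIO tags by longest-match lookup against a dictionary of data set names."""
    # Dictionary entries are compared as lowercase token tuples.
    entries = {tuple(name.lower().split()) for name in dictionary}
    max_len = max((len(e) for e in entries), default=0)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(t.lower() for t in tokens[i:i + n]) in entries:
                matched = n
                break
        if matched:
            tags[i] = "B-DATASET"
            for j in range(i + 1, i + matched):
                tags[j] = "I-DATASET"
            i += matched
        else:
            i += 1
    return tags


def entity_spans(tags):
    """Return (start, end) index pairs, end exclusive, for each tagged mention."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B-DATASET":
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def replace_entities(tokens, tags, dictionary, rng):
    """Entity replacement (assumed form): swap each mention for a random dictionary name."""
    names = sorted(dictionary)  # sorted so a seeded rng gives repeatable choices
    new_tokens, new_tags, prev = [], [], 0
    for start, end in entity_spans(tags):
        new_tokens += tokens[prev:start]
        new_tags += tags[prev:start]
        substitute = rng.choice(names).split()
        new_tokens += substitute
        new_tags += ["B-DATASET"] + ["I-DATASET"] * (len(substitute) - 1)
        prev = end
    new_tokens += tokens[prev:]
    new_tags += tags[prev:]
    return new_tokens, new_tags


def mask_entities(tokens, tags, mask_token="[MASK]"):
    """Entity masking (assumed form): hide each mention token but keep its tag."""
    masked = [mask_token if tag != "O" else tok for tok, tag in zip(tokens, tags)]
    return masked, list(tags)


# Hypothetical example data, not taken from the paper.
dictionary = {"ImageNet", "Penn Treebank"}
tokens = "We train on Penn Treebank and ImageNet .".split()
tags = label_with_dictionary(tokens, dictionary)
augmented = replace_entities(tokens, tags, dictionary, random.Random(0))
masked = mask_entities(tokens, tags)
```

Under these assumptions, replacement exposes the tagger to dictionary names in new contexts, while masking removes the surface form so that the model has to rely on the surrounding words. Both plausibly help with rarely seen (long-tailed) names, which is the effect the abstract reports.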
The Electronic Library – Emerald Publishing
Published: Nov 4, 2021
Keywords: Data set entity recognition; Distant supervision; Scientific literature; Data augmentation; Long-tailed entities; Library automation; Distance learning; Database management