
LoGE: an unsupervised local-global document extension generation in information retrieval for long documents



References (44)

Publisher
Emerald Publishing
Copyright
© Emerald Publishing Limited
ISSN
1744-0084
eISSN
1744-0084
DOI
10.1108/ijwis-07-2023-0109

Abstract

Purpose
This paper aims to bridge the word gap in information retrieval (IR), especially for long documents belonging to specific domains. With the continuous growth of the text data that modern IR systems must manage, effective solutions are needed to efficiently find the best set of documents for a given request. The words used to formulate a query can differ from those used in relevant documents; despite their closeness in meaning, such non-overlapping words are challenging for IR systems, and this word gap becomes significant for long documents from specific domains.

Design/methodology/approach
To generate new words for a document, a deep learning (DL) masked language model is used to infer related words. The DL models used are pretrained on massive text data and carry common or domain-specific knowledge, yielding a better document representation.

Findings
The authors evaluate the approach on specific IR domains with long documents to show the genericity of the proposed model and achieve encouraging results.

Originality/value
To the best of the authors' knowledge, this paper introduces an original unsupervised and modular IR system based on recent DL methods.
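The abstract and keywords indicate a pipeline in which a masked language model proposes expansion words for each document before lexical matching with BM25. As a minimal sketch of why such expansion helps, the snippet below implements standard Okapi BM25 scoring and shows that a document mentioning only "automobile" cannot match the query term "car" until expansion words are appended. The expansion vocabulary here is hand-picked purely for illustration; in the paper it would be produced by the masked language model, and the function names are this sketch's own, not the authors' implementation.

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Score one tokenised document against a query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf = Counter(doc)
    score = 0.0
    for term in query:
        n_t = sum(1 for d in docs if term in d)            # document frequency
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)  # smoothed IDF
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

# Toy corpus of pre-tokenised documents. The query says "car"; the relevant
# document only says "automobile", so plain BM25 misses that term entirely.
docs = [
    "the automobile engine needs maintenance".split(),
    "the recipe needs flour and sugar".split(),
]
query = "car engine".split()

# Hypothetical expansion step: in the paper these words would be inferred by
# a masked language model; here they are hand-picked to show the effect.
expanded = docs[0] + ["car", "vehicle"]
expanded_docs = [expanded, docs[1]]

print(bm25_score(query, docs[0], docs))            # only "engine" matches
print(bm25_score(query, expanded, expanded_docs))  # both query terms match
```

The expanded document scores higher because the generated word "car" now overlaps the query, which is precisely the word-gap problem the abstract describes; the expansion itself stays unsupervised since the language model needs no relevance labels.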

Journal

International Journal of Web Information Systems (Emerald Publishing)

Published: Nov 28, 2023

Keywords: Unsupervised document expansion; BERT; Information retrieval; BM25
