Towards Kurdish Information Retrieval

KYUMARS SHEYKH ESMAILI, Technicolor, France
SHAHIN SALAVATI, University of Kurdistan, Iran
ANWITAMAN DATTA, Nanyang Technological University, Singapore

The Kurdish language is an Indo-European language spoken in Kurdistan, a large geographical region in the Middle East. Despite having a large number of speakers, Kurdish is among the less-resourced languages and has received little attention from the IR and NLP research communities. This article reports on the outcomes of a project aimed at providing essential resources for processing Kurdish texts. A principal output of this project is Pewan, the first standard test collection for evaluating Kurdish Information Retrieval systems. The other language resources that we have built include a lightweight stemmer and a list of stopwords. Our second principal contribution is using these newly built resources to conduct a thorough experimental study on Kurdish documents. Our experimental results show that normalization and, to a lesser extent, stemming can greatly improve the performance of Kurdish IR systems.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
General Terms: Design, Measurement, Experimentation, Performance
Additional Key Words and Phrases: Kurdish language, Sorani Kurdish, Kurmanji Kurdish, test collection, stemming, cross-lingual information retrieval
ACM Reference Format: Sheykh
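To make concrete how a test collection supports the kind of comparison the abstract describes, here is a minimal sketch that scores two retrieval runs against relevance judgments using mean average precision (MAP). MAP is a common IR metric and is used here as an assumption; the abstract does not say which measures the article reports. The query IDs, document IDs, judgments, and both runs are hypothetical illustration data, not taken from Pewan.

```python
from typing import Dict, List, Set


def average_precision(ranked_doc_ids: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against its relevance judgments."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only at relevant documents.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of per-query average precision over all judged queries."""
    if not qrels:
        return 0.0
    total = sum(
        average_precision(runs.get(query_id, []), relevant)
        for query_id, relevant in qrels.items()
    )
    return total / len(qrels)


# Hypothetical relevance judgments: query ID -> set of relevant document IDs.
qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}

# Hypothetical rankings from two system configurations (for example,
# without and with a preprocessing step such as normalization).
baseline_run = {"q1": ["d2", "d1", "d3"], "q2": ["d1", "d3", "d2"]}
preprocessed_run = {"q1": ["d1", "d3", "d2"], "q2": ["d1", "d2", "d3"]}

baseline_map = mean_average_precision(baseline_run, qrels)
preprocessed_map = mean_average_precision(preprocessed_run, qrels)
print(f"baseline MAP:     {baseline_map:.4f}")
print(f"preprocessed MAP: {preprocessed_map:.4f}")
```

Holding the queries and judgments fixed while varying only the system configuration is what lets a test collection attribute a score difference to a single preprocessing choice.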



Publisher
Association for Computing Machinery
Copyright
Copyright © 2014 by ACM Inc.
ISSN
1530-0226
DOI
10.1145/2556948


Journal

ACM Transactions on Asian Language Information Processing (TALIP), Association for Computing Machinery

Published: Jun 1, 2014
