Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

A set of parameters for automatically annotating a Sentiment Arabic Corpus

A set of parameters for automatically annotating a Sentiment Arabic Corpus This paper aims to propose an approach to automatically annotate a large corpus in Arabic dialect. This corpus is used in order to analyse sentiments of Arabic users on social medias. It focuses on the Algerian dialect, which is a sub-dialect of Maghrebi Arabic. Although Algerian is spoken by roughly 40 million speakers, few studies address the automated processing in general and the sentiment analysis in specific for Algerian.Design/methodology/approachThe approach is based on the construction and use of a sentiment lexicon to automatically annotate a large corpus of Algerian text that is extracted from Facebook. Using this approach allow to significantly increase the size of the training corpus without calling the manual annotation. The annotated corpus is then vectorized using document embedding (doc2vec), which is an extension of word embeddings (word2vec). For sentiments classification, the authors used different classifiers such as support vector machines (SVM), Naive Bayes (NB) and logistic regression (LR).FindingsThe results suggest that NB and SVM classifiers generally led to the best results and MLP generally had the worst results. Further, the threshold that the authors use in selecting messages for the training set had a noticeable impact on recall and precision, with a threshold of 0.6 producing the best results. Using PV-DBOW led to slightly higher results than using PV-DM. Combining PV-DBOW and PV-DM representations led to slightly lower results than using PV-DBOW alone. The best results were obtained by the NB classifier with F1 up to 86.9 per cent.Originality/valueThe principal originality of this paper is to determine the right parameters for automatically annotating an Algerian dialect corpus. This annotation is based on a sentiment lexicon that was also constructed automatically. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png International Journal of Web Information Systems Emerald Publishing

A set of parameters for automatically annotating a Sentiment Arabic Corpus

Loading next page...
 
/lp/emerald-publishing/a-set-of-parameters-for-automatically-annotating-a-sentiment-arabic-X9Kxdo0NV0
Publisher
Emerald Publishing
Copyright
© Emerald Publishing Limited
ISSN
1744-0084
DOI
10.1108/ijwis-03-2019-0008
Publisher site
See Article on Publisher Site

Abstract

This paper aims to propose an approach to automatically annotate a large corpus in Arabic dialect. This corpus is used in order to analyse sentiments of Arabic users on social medias. It focuses on the Algerian dialect, which is a sub-dialect of Maghrebi Arabic. Although Algerian is spoken by roughly 40 million speakers, few studies address the automated processing in general and the sentiment analysis in specific for Algerian.Design/methodology/approachThe approach is based on the construction and use of a sentiment lexicon to automatically annotate a large corpus of Algerian text that is extracted from Facebook. Using this approach allow to significantly increase the size of the training corpus without calling the manual annotation. The annotated corpus is then vectorized using document embedding (doc2vec), which is an extension of word embeddings (word2vec). For sentiments classification, the authors used different classifiers such as support vector machines (SVM), Naive Bayes (NB) and logistic regression (LR).FindingsThe results suggest that NB and SVM classifiers generally led to the best results and MLP generally had the worst results. Further, the threshold that the authors use in selecting messages for the training set had a noticeable impact on recall and precision, with a threshold of 0.6 producing the best results. Using PV-DBOW led to slightly higher results than using PV-DM. Combining PV-DBOW and PV-DM representations led to slightly lower results than using PV-DBOW alone. The best results were obtained by the NB classifier with F1 up to 86.9 per cent.Originality/valueThe principal originality of this paper is to determine the right parameters for automatically annotating an Algerian dialect corpus. This annotation is based on a sentiment lexicon that was also constructed automatically.

Journal

International Journal of Web Information SystemsEmerald Publishing

Published: Oct 15, 2019

Keywords: Arabic sentiment analysis; Algerian dialect; Sentiment lexicon; Sentiment corpus; Doc2vec

References