Evaluating a 12-million-Word Corpus as a Source of Dictionary Data

Abstract

In this paper, we aim to evaluate the 12-million-word Helsinki Corpus of Swahili as a source of dictionary data, used among other purposes for the creation of the lemma list for a new Swahili-Polish dictionary. We analyse the dictionary log files in order to answer a question already asked by De Schryver et al. (2006), Koplenig et al. (2014) and Trap-Jensen (2014): do dictionary users actually look up frequent words? The issue of utmost importance to us, however, is whether a ten-thousand-item frequency list derived from a 12-million-word corpus meets the needs of a Swahili-Polish dictionary user.

1. Introduction

Swahili, the most widely used language of Sub-Saharan Africa, with tens of millions of people using it as a lingua franca throughout the East African region, has attracted the interest of a large variety of scholars, including many who have published numerous Swahili dictionaries around the world. The Swahili lexicographic tradition is over 150 years old and has been discussed in many papers, often focusing on specific subject matters, for example the construction of a dictionary article in a student's dictionary (Mdee 1984), the role of a dictionary in language teaching (Chuwa 1999, Mbaabu 1995), or the standardization of the Swahili language (Kiango 1995, Mdee 1999). Critical analyses and various attempts at comprehensive summaries have also been undertaken (cf. Chuwa 1996, Herms 1995). However, only a few publications have been devoted to modern corpus-based Swahili lexicography (cf. De Schryver et al. 2006, De Pauw and De Schryver 2008, De Pauw et al. 2009, De Schryver and Prinsloo 2001, Prinsloo and De Schryver 2001), with few of them presenting the only publicly available annotated electronic corpus of Swahili – the Helsinki Corpus of Swahili (HCS 2004) – as a source of dictionary data (e.g. Hurskainen 1994, 2002, 2003, Wójtowicz 2016).
The aim of this paper is to evaluate the HCS as a basis for lemma list compilation for a Swahili-based bilingual dictionary. Our objective is to analyse the log files of a Swahili-Polish dictionary whose macrostructure was based on an HCS-derived frequency list. We will investigate whether dictionary users actually look up words that are frequent in the corpus. The research was also motivated by related questions, such as how to update an already existing corpus-based dictionary, how to improve the dictionary so as to best meet users' expectations with regard to lemma selection, and whether the frequency list should be the main source of new lemma candidates.

2. Towards a modern lexicography of Swahili

The main sources of knowledge on Swahili lexicography are the dictionaries themselves. The expanded introductions that precede the actual dictionaries in the oldest editions document the 19th-century beginnings of this new field on the African continent. They help to reconstruct a coherent history and provide insight into the work of the lexicographers. Since the 19th century, hundreds of Swahili-language dictionary publications of various types have been issued (cf. Ohly 2002). These include bilingual dictionaries, both to and from such languages as English, French, German, Russian, Japanese, etc., monolingual dictionaries, and various specialized dictionaries. However, the vast majority of Swahili dictionaries in no way specify the methods used in their compilation. What we can learn from some of the introductory notes is that most of the dictionaries were based on their predecessors. First, Krapf's manuscripts as well as the dictionary itself (Krapf 1882) were sources referenced both by contemporary scholars and by later Swahili-language researchers. Together with the works of other German missionaries, they also served as a basis for other publications (cf. Benson 1964).
Next, the root-based dictionary by Johnson (1939a) revolutionized Swahili lexicography for years to come and became a fundamental source for future lexicographers and the most widely recognized work in the field (cf. Chuwa 1996). The publication of the monolingual Swahili dictionary by the African TUKI institute in 1981 marks the beginning of the next lexicographical period and fulfils Benson's (1964) hope of increased involvement of African researchers. The dictionary's importance is of special note, as in the following years it became the fundamental source for subsequent bilingual works. It was aimed at supplanting the already 40-year-old dictionary by Johnson (1939b) and replacing it as the main language standardization resource. It has enjoyed unwavering and overwhelming popularity (cf. Herms 1995, also Hurskainen 1994), and eighteen reprints had been published by 1992. A new era of Swahili lexicography was initiated with the publication of the Swahili-Finnish-Swahili dictionary (Abdulla et al. 2002). The dictionary, compiled at the University of Helsinki, was the first to be based on the modern dictionary method exploiting electronic text corpora. The authors of this traditional paper publication relied on data from an electronic corpus of the Swahili language – the Helsinki Corpus of Swahili. This marked the beginning of a modern corpus-based lexicography, by then fairly standard in the Western world. The other HCS-based dictionary is the Swahili-Polish Dictionary, which was released as a simplified print version of its electronic counterpart (Wójtowicz 2013). Only two years after the publication of the Finnish dictionary, the TshwaneDJe Swahili–English Dictionary (De Schryver et al. 2006, De Pauw et al. 2009) joined the market. It was the first corpus-driven electronic dictionary of Swahili, with a new approach to the lemmatisation of headwords (cf. De Schryver et al. 2006).
The content of this dictionary is based on web-based corpus data – ‘a balanced and representative Swahili corpus of around fifteen million running words' (De Schryver et al. 2006: 70). As of 2004, it was made freely available online, and it can also be downloaded as a stand-alone application for a fee. The dictionary includes over 15,000 entries. An important innovation was the inclusion among the headwords of orthographic forms, in addition to the usual stems, selected on the basis of a frequency count. It features basic morphological decomposition, corpus-based examples, and a system of cross-references linking root-derivative pairs.

3. Size of the corpus

With the new Swahili-Polish dictionary to be based on corpus data, the main concern was the size of the available corpus. The use of electronic corpora for English lexicography was already well established in the 1980s (Atkins and Rundell 2008). Lesser-resourced languages have struggled to keep up, compiling and exploiting language corpora to enhance their lexicography. But in the case of such languages, the compilation of a corpus is more challenging and the issue of corpus size becomes meaningful. As Kilgarriff (2013) clearly states, ‘for lexicography, the relevant kind of data is a large collection of text: a corpus. A corpus is just that: a collection of data – text and speech – when viewed from the perspective of language research'. But the question of how to construct a collection appropriate to our needs remains pertinent. A further question is whether we in fact have to construct anything at all. Similarly to Kilgarriff, Fuertes-Olivera (2012: 51) defines a lexicographical corpus as ‘any collection of texts where lexicographers can find inspiration'. It has generally been assumed that corpora for lexicography should be large and diverse. It has been demonstrated many times that the bigger the corpus, the better (cf. Hanks 2002, Atkins and Rundell 2008, Kilgarriff et al. 2012).
If the corpus is big enough, it will provide evidence about anything that should be in the dictionary. Otherwise it may lack certain terms. Until the 1980s, the standard size of a corpus amounted to one million words, after which the expected corpus sizes grew larger (cf. De Schryver and Prinsloo 2000). At the turn of the 21st century, Hanks (2002: 157) claimed that ‘in a corpus of 100 million words, a simple right- or left-sorted concordance shows clearly most of the normal patterns of usage for all words except the very rare or very unusual’. However, the judgement depends on one's needs. A 100-million-word corpus may also be regarded as insufficient for a larger dictionary, as shown by Kilgarriff et al. (2012), who state that ‘for a 40,000-headword dictionary, a corpus of 2 billion words is substantially better, missing much less, than a corpus of 100 million words’. As has been demonstrated for English, 100 million words is large enough for many empirical studies of language, providing information on the dominant meanings and usage patterns of the top 10,000 core words, but it lacks evidence for rarer words, rare meanings of common words, and combinations of words (Kilgarriff and Grefenstette 2003). In the 1990s, the British National Corpus, comprising 100 million words of spoken and written British English, was such a model lexicographic corpus. For a time, it was a source for lexicographic research, but since then the amount of data used to build various corpora has increased significantly. The size of corpora grew as material collected from web pages was added, and the notion of representativeness was set aside. For example, the Oxford English Corpus, collected from web pages, contains nearly 2.5 billion English words from all parts of the world, and new texts are continually being added.
As another example of very large corpora, Prinsloo (2015) mentions Google Books, with 155 billion words for American English, 45 billion for Spanish and 34 billion for British English. These are extreme examples; nevertheless, a figure of at least a few hundred million words is regarded as the unofficial standard for electronic corpora nowadays. However, this is only true for better-resourced languages.

3.1. Corpora of lesser-resourced languages

Things are quite different when it comes to resources for lesser-resourced languages. Typically, only limited, small, unbalanced and raw corpora are available, if any at all. The scarcity of printed material, and often its unavailability, makes a compilation of ten million words a big corpus (cf. Prinsloo 2015). De Schryver and Prinsloo (2000) mention the compilation of the Lubà corpus, for which there were only 100 written sources available at that time. In the field of African language corpora, the acknowledged pioneers include Prinsloo (1991) for Pedi, and Hurskainen (1992a, 1992b) for Swahili. The first corpora were collected in a traditional way (cf. De Schryver 2002), while nowadays corpora for lesser-resourced languages, as in the case of most African tongues, are typically compiled automatically from web pages. For example, the Crúbadán Project (Scannell 2007) offers corpora for over 2,000 languages created by crawling the web, many of them African languages. Although the idea of the Web as Corpus raises a number of objections, one being its opportunistic nature (cf. Gatto 2011), for some languages it is in fact the only free and available source of language data; therefore, its value has come to be widely recognized, as corpora for lesser-resourced languages, if available at all, are often of an opportunistic type as well (Kilgarriff and Grefenstette 2003, Ghani et al. 2001, De Schryver and Prinsloo 2000, Fuertes-Olivera 2012, Tarp and Fuertes-Olivera 2016).
The other arguments against using the web as a corpus concern its constantly changing content, the presence of duplicates, and its unsupervised and unedited texts. One practical consequence of the web’s dynamic nature is that it is impossible to strictly reproduce a corpus study. Nevertheless, the web as source data for African languages has been exploited, for example, by De Schryver (2002) and De Schryver and Prinsloo (2000). These papers discuss its potential and present endeavours to collect corpora for some African languages. Data are not available for all languages spoken in Africa, but at least some of them can take advantage of this method. The first collected corpora were small and have grown substantially since the beginning of the century; the Kiswahili Internet Corpus, for instance, has increased in size from 1.7 million to 20 million words (De Schryver and Prinsloo 2000, De Pauw et al. 2009), and a version of it was used to build the Swahili-English Dictionary (De Schryver et al. 2006). Although Swahili and South African languages have been leading the field of corpus compilation and annotation for years, new projects are also being undertaken. Most of the (web-crawled) African language corpora lack any linguistic annotation, since tools for performing such annotations do not yet exist, but this is not always the case.1 An annotated Amharic corpus of 23 million tokens is available, while three million tagged words are accessible for Somali (Jama Musse Jama 2016). But how big should a corpus be to provide reliable evidence on the language for lexicographic purposes? Obviously, for those working with lesser-resourced languages, it would be interesting to establish what minimum corpus size is required. The issue of corpus size has been raised by De Schryver and Prinsloo (2000), who presented a comparison of data from different phases of the Pedi corpus compilation.
Corpus size varied from a hundred thousand words in phase one to over one million words in phase three. The experiments revealed that ‘as far as high-frequency items are concerned, it can be predicted from these comparisons that increasing the size of a small-size corpus does not substantially influence the stability of the corpus’ and ‘that less-frequently-used items detected in a small-size corpus retain a low frequency value even if the corpus is substantially enlarged’. This means that in a corpus five or ten times bigger than the initial one, the relative frequency of items remains more or less stable. However, a bigger corpus still has the advantage of capturing a greater number of types. Similar research by Prinsloo (2015) on corpora for Pedi, English and Afrikaans examines how relatively small and often unbalanced corpora can be utilized for lexicographic purposes in the absence of large collections. The author investigates how enlarging a corpus from one to ten million, and then from ten million to a hundred million words, influences its usability. While it was possible to compare the English corpus frequency lists with the basic words for English from the Macmillan English Dictionary for Advanced Learners, the experiment of comparing lists derived from corpora of different sizes was conducted for Afrikaans and Pedi. An overlap of 72.8% for Pedi and 83.4% for Afrikaans was observed when the list derived from a one-million-word corpus was compared to that derived from a ten-million-word corpus. Predictably, a substantial difference appeared when it came to the raw number of occurrences of items. An item that occurs 11 times in the one-million-word (Afrikaans) corpus will occur 100 times in one consisting of ten million words and over 1,000 times in a hundred-million-word corpus. This matters enormously when working on the microstructural level, when we need examples to describe the meaning of a word, its senses, collocations, etc.
A one-million-word corpus may produce an insufficient number of concordance lines. When it comes to real examples, Prinsloo found a one-million-word corpus to be quite adequate for commonly used words, the conclusion being that even small corpora of one million words can assist the lexicographer quite well in the compilation of small bilingual and monolingual dictionaries of approximately 5,000 lemmas. When the corpus is enlarged to ten million words, coverage of commonly used words becomes more reliable, and less work is needed to find missing items. All in all, the research confirms that the bigger the corpus the better, as the author concludes by stating that ‘a 100 million corpus will be extremely valuable’ (Prinsloo 2015: 299). Much as a bigger corpus is desirable, such an amount of data is simply not always available for lesser-resourced languages; in the study itself the author had to make do with a 10-million-word corpus of Pedi, while for Afrikaans a 100-million-word corpus was used.

4. The Helsinki Corpus of Swahili

The Helsinki Corpus of Swahili is the only large and annotated corpus of standard Swahili available to the linguistic community for free academic use. Only recently, in 2016, was it moved to a new location; it is now available in Kielipankki – the Language Bank of Finland. The previous version of the corpus contained around 12 million words, and an annotated version was available after a signed agreement through the Lemmie web browser and on a Linux server. At present, an expanded version of the corpus – the Helsinki Corpus of Swahili 2.0, of about 25 million words – is available in two formats. The annotated version, available in Korp, may be accessed after logging in with university or CLARIN credentials, or by applying for access. It contains morphological and syntactic annotation. The non-annotated version, comprising the same content but in the form of plain text without any linguistic codes, is available for free download.
The HCS 2.0 consists of two parts: (1) old material; and (2) new material added after 2004. The old material contains two types of resources: books and news from before 2003. The data are basically the same as those found in the HCS 1.0, with the only difference being that whole texts, and not only sections of books, are included. In addition, in the new version the sentences are shuffled (for reasons related to copyright licensing), while in the old corpus they were in the original order. The books are mostly by such renowned Swahili authors as Shaaban Robert, E. Kezilahabi, E. Hussein, A. Lihamba, Mohamed S. Mohamed, and others. However, most of the texts come from Internet newspapers from between 1998 and 2003. The new material added after 2004 consists of two main sources – Bunge, the transcripts of Tanzanian Parliament debates from 2004–2006, and news texts from 2004–2015. The sizes of the two parts, the old and the new, are more or less equal. Most of the corpus material was retrieved from the Web, especially as of 2000, when texts on the Web became increasingly available. The texts come from news media and open government pages. Some texts, such as books, were scanned and proofread. Furthermore, some of the oldest news material was manually copied and transferred online, and all material has gone through a series of formatting and correction routines, on the basis of which a few thousand errors were identified and corrected. The texts have an annotation layer provided by SALAMA – the Swahili Language Manager (cf. Hurskainen 1999, 2008), an environment for the computational processing of the Swahili language. SALAMA includes a comprehensive language analyser of Swahili text; therefore, the corpus contains various kinds of linguistic information attached to each token.
Each word in the annotated corpus contains the following types of information: the token, the stem, part of speech, morphological description, an English gloss, a syntactic tag, and other descriptors for verbs. The other component, SALAMA-DC – the SALAMA Dictionary Compiler – is a comprehensive system for producing dictionary entries from any word-form in Swahili. It produces entries with appropriate linguistic information, single-word headwords, multiword headwords, various types of cross-references, and a selection of usage examples in context. The example text may be further translated into English.

4.1. The HCS as a source of dictionary data

Despite its unimposing size by current standards, the HCS is one of the biggest annotated corpora for African or other lesser-resourced languages. The data from the HCS were used to build a new electronic dictionary of Swahili – the Swahili-Polish dictionary, available since 2013 for free online use. Work on the dictionary started in 2010, using a corpus version based on HCS 1.0. It consisted of over 12.5 million words taken from numerous literary books and current news sources, most of which came from texts written after 2000. The HCS was the best resource we could get considering our dictionary needs. Although it was neither representative nor balanced, the corpus seemed to contain appropriate data given our dictionary's target users, that is, learners of Swahili in Poland and other travellers to Swahili-speaking African countries. Prinsloo’s findings (2015) showed that a small-sized corpus should supply lexicographers with data to compile a good dictionary for our target group, as we aimed at creating a dictionary of up to 10,000 entries.
The issue of the balance and representativeness of a corpus was discussed by De Schryver and Prinsloo (2000), who after presenting various viewpoints conclude that ‘it is clear that linguists disagree whether a corpus should try to be balanced or representative’ (De Schryver and Prinsloo 2000: 92). The literature does not provide an answer to the question of what the corpus should be representative of. Atkins, Clear and Ostler (1992) introduced the concept of organic corpora, which fits the situation for African languages and reflects a living language (Gouws and Prinsloo 2005). Lexicographers need to be aware that the type of data they are using should be matched to the aims of the particular dictionary (cf. Atkins, Clear and Ostler 1992, Atkins and Rundell 2008, Kilgarriff 2012). What has to be borne in mind is that each collection contains noise and biases stemming from the types of texts used. Hurskainen (2003) notes that a corpus, no matter its size, usually does not provide all words needed, even for a fairly modest dictionary. Even basic vocabulary may be missing in the corpus, as everyday matters may not be reflected in the component texts. Apart from its acceptable size, the corpus consists of texts that an average student of Swahili comes across in his day-to-day learning process. The literary works of renowned Swahili authors are read in language classes, while news items written in Swahili are the most easily accessible texts on the Internet. We used the corpus data for the compilation of the lemma list and for microstructural aspects, such as sense distinction, collocations, idioms and examples of usage. At present, the dictionary contains over 6,000 Swahili entries and over 7,000 entries in the searchable Polish index. The dictionary can be searched in both directions, but the user has to choose which language he is querying. The lemma list of the dictionary is mainly based on a HCS-derived frequency list of over ten thousand lemmatised entries. 
According to lexicographic practice, the dictionary compilation process should take into account the following three components: copying, introspection and looking at data (Kilgarriff 2013, Atkins and Rundell 2008). Therefore, apart from basing the lemma list on the frequency count, we also looked at other dictionaries, and the list was compared with the vocabulary from students’ books (McGrath and Marten 2003 and Muaka and Muaka 2006). We hoped to identify missing basic vocabulary this way. Furthermore, closed sets such as days of the week, months, pronouns, and names of countries and continents were verified. Entries identified as missing were added, and additional vocabulary was supplemented by students who worked on selected sets they found useful, such as animals, musical instruments, or means of transport (cf. Wójtowicz 2016). Every dictionary entry includes POS information and additional morphological features. Derivatives are treated as main entries, and we have decided to follow the standard solutions of the Swahili lexicographic tradition, used in nearly all Swahili dictionaries, in regard to the process of lemmatization (see Kiango 2000 for a thorough discussion, and De Schryver et al. 2006 for a novel approach). Therefore, we have ignored prefixes and listed the stems alone for verbs, numerals, and inflected adjectives. On the other hand, in response to beginner learners’ needs, we have broken with this tradition in some other cases and listed, for example, pronouns in their full form (with the stems also included as separate entries).

5. Analysing dictionary log files

Whether our decision to use the HCS as a source of dictionary data was sound may be evaluated by the users, and hopefully their searches can shed some light on that issue. To that end, a study of the dictionary’s log files was conducted. While research into dictionary use has become a respectable domain in its own right (cf.
Lew 2011b, Lew and De Schryver 2014), the digital revolution in lexicography has introduced, among other things, the possibility of tracking users’ behaviour while they use a digital dictionary. Lexicographers can tap this record of user behaviour to inform lemma list selection and to update a dictionary. According to Bergenholtz and Johnsen (2005), analyses of log files may be used as a tool for improving Internet dictionaries. De Schryver and Joffe (2004: 188) go further in suggesting ‘that an automated analysis of the log files will enable the dictionary to tailor itself to each and every particular user’. While this remains an attractive prospect, for the time being a manual analysis of log files may help reveal which headwords have been successfully retrieved, as well as those that have been sought but not found. Based on this information, we can modify the content of a dictionary to meet the users’ needs. However, the limitations of this method have to be taken into consideration as well (cf. Lew 2011a, Bergenholtz and Johnsen 2007). The possibility of analysing log files attracted the attention of researchers as early as the 1980s, but only recently have more research experiments been described (cf. De Schryver and Joffe 2004, Lew 2011a, Koplenig et al. 2014). The empirical basis of this research differs significantly, from over 20,000 look-ups in De Schryver and Joffe (2004), to over one million look-ups in Bergenholtz and Johnsen (2005), to almost 30 million in Trap-Jensen et al. (2014). In this regard, our study has a moderate empirical base, as it concerns a less popular language pair.

5.1. Swahili-Polish dictionary log files

The Swahili-Polish dictionary was placed online, and since January 2013 the dictionary user queries have been saved in four log files: Swahili found entries, Swahili not-found, Polish found and Polish not-found.
While conducting a search, users need to indicate which language they are searching in, and as a result the searches are saved in different files. The files include the strings that users have typed in the search box, along with the number of searches for each string. No IP address or any other identification has been stored. The data were saved in .csv files, and the analysis was carried out by the author manually and with the use of regular expressions only. In the Swahili found file, in addition to the string and the number of searches, information on the POS and ID of the entry that was returned to the user is provided, so polysemous entries can be traced easily. Additional analyses may be conducted using data from Google Analytics. In order to answer the research question of whether dictionary users most frequently search for entries from the HCS-based frequency list, we will investigate mainly the Swahili found and Swahili not-found log files. As the log files only record what the users have entered into the search box, it should be stressed that we can never be sure who the user was or what the user’s actual intentions were (cf. Lew 2011a). Over a four-year period, up to 15 February 2017, a total of 53,592 queries were made. This makes for an average of 36 look-ups per day, with an increase observed in searches over time, from 25 in 2013–14 to 46.5 since 2015. The number drops during holidays, especially the long summer vacations, to an average of 15 look-ups per day, which supports our assumption that the dictionary is mostly used by students of the language. The number of look-ups noted in each file is, respectively, 25,466 (48% of the total) in the Swahili found, 14,708 (27%) in the Swahili not-found, 8,052 (15%) in the Polish found, and 5,366 (10%) in the Polish not-found. Thus, the majority of the look-ups (75%) were in the Swahili-Polish direction (cf. Figure 1).

Figure 1. Breakdown of all searches as noted per file.
When we compare only the searches for Swahili entries, 63% of them are noted as found and 37% as not-found. Since the user has to manually switch direction to look for Polish words, unfortunately as many as 11% of the searched strings noted in the Swahili not-found file were identified as Polish words, and these items constitute 25% of the strings searched for at least twice. Of the 8,912 strings, the majority (73%) were searched for only once. Most of these were misspellings, wrongly decomposed verbs, full orthographic words, and multiword expressions.

5.2. Correlation between look-ups and corpus frequency

In this study, we aim to compare the words that users actually look up with the corpus representation of these words, that is, their frequencies in the HCS. In the Swahili found file, 4,430 different strings were searched for over 25,000 times. This also includes searches for separate letters, like k or f, or other strings that do not represent a lemma. Before we look at the analysis of the files, we shall list the main assumptions of the dictionary that influence user search behaviour. The search is carried out on Swahili headwords and on plural forms of nouns, which are included in the dictionary in their full form. Derivatives such as pronouns with class prefixes and irregular verbal forms are treated as separate entries, and users can also search for them. The most difficult operation is the decomposition of the verbal complex: the user has to cut off all prefixal morphemes and search for the verbal root or extended root instead. The top 100 lemmas searched for most frequently constitute 16% of all look-ups, while 64% of all searches were carried out for the top 1,000 lemmas. Of the almost 4,500 strings searched for, only 1,600 have been looked up five times or more. Within the top 100 searches, the number of look-ups falls from 162 searches to only 28.
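Cumulative-share figures of this kind (the top 100 lemmas accounting for 16% of all look-ups, the top 1,000 for 64%) can be computed directly from a per-string count file. The sketch below is a minimal illustration, not the actual analysis script: the CSV layout and column names are assumptions, and the sample rows simply reuse a few counts from Table 1.

```python
import csv
from io import StringIO

# Hypothetical sample in the spirit of the "Swahili found" log file:
# one row per searched string with its total number of look-ups.
# The column names and layout are assumptions, not the dictionary's format.
sample = """string,count
tamu,162
pata,99
ja,93
toa,72
"""

rows = list(csv.DictReader(StringIO(sample)))
counts = sorted((int(r["count"]) for r in rows), reverse=True)
total = sum(counts)

def share_of_top(n):
    """Fraction of all look-ups accounted for by the n most-searched strings."""
    return sum(counts[:n]) / total

print(round(share_of_top(2), 3))  # → 0.613 (261 of 426 look-ups in the toy data)
```

On the real log files the same function, applied at n = 100 and n = 1,000, would reproduce the percentages reported above.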
In the Swahili found file, eight out of the top ten Swahili searches are for verbs (cf. Table 1). Only two items do not represent verbs; these are the most often looked-up adjective tamu ‘sweet, delicious’, and, at rank ten, kama, which may be a verb or a conjunction.

Table 1. Top 10 Swahili searches

Entry     Look-up rank   No of searches   Corpus rank   English
tamu      1              162              1989          sweet
pata      2              99               14            get
ja        3              93               60            come
toa       4              72               17            put, offer
acha      5              61               177           stop, quit
weza      6              57               4             can
chukua    7              57               143           take
tumia     8              55               28            use
tegemea   9              55               257           rely on
shika     10             54               437           hold
kama      10             54               8             squeeze/if; like

The top ten most often searched items do not correspond to the top ten most frequent words in the corpus, which was expected (cf. De Schryver et al. 2006). However, it is more interesting to see to what extent the users look up words derived from our modest corpus, as opposed to other items. If one compares the look-ups with the HCS-based frequency list, it turns out that 3,220 (out of 4,430), that is 73% of the strings that represent unique lemmas of the dictionary, can be found in the corpus list.
Since the dictionary also allows searching on the plural forms of nouns, a further 454 searches were for plural forms of nouns that are on the list as well, so 83% of all the Swahili found strings were entries from the corpus frequency list. Searches for these entries were carried out over 23,000 times, that is, 91% of all look-ups. If one inspects the top 100 searches more closely, one notices that 34 lemmas of the top 100 searches can also be found in the corpus top 100, 79 in the corpus top 500, and 88 in the corpus top 1,000. This corresponds with the finding of De Schryver and Joffe (2004: 190) that ‘users indeed look up the frequent words of the language’. It should be noted that in their study on Sesotho sa Leboa the numbers were slightly lower: with a frequency list derived from a 6.1-million-word Sesotho sa Leboa corpus, 30 of the top 100 searches were found in the corpus top 100, and 63 in the corpus top 1,000. On the other hand, a close examination of the frequency list as compared to all of the searches reveals that as many as 99 of the top 100 most frequent Swahili words have been queried (cf. Figure 2). The only entry that has never been looked up, or in other words never entered in precisely such a form, is the multiword expression wa na ‘have; lit. to be with’. Moving down the list, 480 of the top 500 lemmas (96%) have been searched for, 935 of the top 1,000 (94%), and 1,751 of the top 2,000 (88%). If we disregard the 42 multiword expressions that are on the list but have not been searched for, the numbers go up to 97% for the top 500, 94% for the top 1,000, and 90% for the top 2,000.

Figure 2. Proportion of lemmas in the frequency list NOT searched for.

The lemmas that have never been looked up mainly come from newspapers and are associated with football (e.g.
soka ‘soccer’), technology (teknolojia ‘technology’, kompyuta ‘computer’), business (bosi ‘boss’, menejimenti ‘management’, bilioni ‘billion’, milioni ‘million’), the Catholic press (katoliki ‘Catholic’, padri ‘priest’, ubatizo ‘baptism’), and politics (demokrasia ‘democracy’). A similar analysis was carried out two years earlier, when only half as many searches had been conducted. At that time, 456 of the top 500 lemmas (91%) had been searched for, 832 of the top 1,000 (83%), and 1,436 of the top 2,000 (72%). The numbers were thus initially lower but rose steadily over the period discussed, so the coverage of the looked-up lemmas increased over time. The data show that users look up frequent words of the language and that the more look-ups are made, the better the coverage of the most frequent lemmas. As the frequency goes down, the number of corresponding lemmas searched for also falls. On this basis, we can state that usage should be an essential requirement for a word to be included in a small general-purpose dictionary. At the same time, it has been pointed out (De Schryver et al. 2006, Verlinde and Binon 2010) that we cannot predict user behaviour beyond the top few thousand words. On the other hand, research by Koplenig et al. (2014), followed up in Müller-Spitzer et al. (2015) and Trap-Jensen et al. (2014), has shown that dictionary compilers do not overestimate the value of corpus-based lexicography, since users frequently look up frequent words even beyond the first few thousand. So there is a relationship between corpus frequency and the frequency of look-ups. Müller-Spitzer et al. (2015) conclude that ‘frequency does matter – even in lower frequency bands’. While 83% of the look-ups were for items from the frequency list, another 10% did not match any of the dictionary entries.
These were strings standing for single letters, like k or f, or other strings that do not represent a lemma but were handled by the ‘lemma begins with’ regular expression. The remaining 7% of the searches were for entries added manually: pronouns, conjunctions, days of the week, nationalities, languages, geographical terms, and names of fruit, some musical instruments, and animals. Searches for these additional lexemes (160 items) made up 4%, while the other 3% were for full forms of pronouns and adjectives with class prefixes, irregular forms of imperatives, and multiword expressions, the last representing only 0.5% of the searches. Altogether these strings were searched for 1,425 times, only 5.6% of all the queries. In the Swahili not-found file, when we leave out searches identified as Polish words, the top ten searches are: mzuri, kuja, kwenda, nzuri, hakuna matata, kufa, mufasa, ninakupenda, njema, patia. Seven items represent orthographic forms of entries that are present in the dictionary: the adjectives with a class prefix mzuri and nzuri ‘nice’ and njema ‘good’, the infinitives kuja ‘to come’, kwenda ‘to go’ and kufa ‘to die’, and the inflected verbal form ninakupenda ‘I love you’. There is also a multiword expression and the name Mufasa, known from the film ‘The Lion King’. Only one item, patia, may be identified as a lemma lacuna. The top 500 strings of the Swahili not-found file were analysed one by one and annotated as Polish words, mistakes, orthographic words, proper names or lemma lacunae. Of the 500 strings, 144 (28%) were identified as lemma lacunae – possible new lemma candidates. Among them, 92% represent lemmas that are also present on the HCS frequency list – mostly low-frequency words, like burudika ‘be appeased’, egesha ‘bring up close’, manukato ‘perfume’. Only 12 are not on the frequency list, like sarabi ‘mirage’, erevuka ‘be enlightened’, pepa ‘sway’, kauri ‘cowrie shell’.
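Mechanically, this annotation exercise reduces to a tally over labelled strings. A minimal sketch with invented example data follows; the labels mirror the categories above, but the strings and their assignments are illustrative, not the study's actual annotations.

```python
from collections import Counter

# Invented example annotations for not-found search strings; the labels
# mirror the study's categories, but the assignments are illustrative.
annotations = {
    "mzuri": "orthographic word",   # adjective with class prefix
    "kuja": "orthographic word",    # infinitive of an existing entry
    "mufasa": "proper name",
    "szukam": "Polish word",
    "ptaia": "mistake",             # hypothetical misspelling
    "patia": "lemma lacuna",        # candidate for a new entry
    "sarabi": "lemma lacuna",
}

counts = Counter(annotations.values())
lacunae = sorted(s for s, lab in annotations.items() if lab == "lemma lacuna")
print(counts["lemma lacuna"], lacunae)  # 2 ['patia', 'sarabi']
```

Cross-checking the resulting lacunae against the corpus frequency list is then a simple set membership test, which is how candidates such as burudika or sarabi can be sorted into in-corpus and out-of-corpus groups.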
Based on these data, we aim to expand the dictionary further with lemmas from the HCS frequency list. The 12-million-word HCS corpus can thus be validated as a good source of data for the compilation of a lemma list for a small general-purpose dictionary, such as the six-thousand-entry Swahili-Polish dictionary. Among the look-ups, only a modest number of searches were for entries that are not present on the ten-thousand-item frequency list. It has to be noted, though, that this result may be related to the fact that many uses of the dictionary are presumably by Polish students of Swahili, working on assignments that themselves involve a controlled, limited vocabulary.

6. Conclusion

The analysis of the log files of the Swahili-Polish dictionary has shown that dictionary users do look up frequent words of the language. The dictionary was based on a limited 12-million-word corpus of the Swahili language, and only a few look-ups fall outside the ten thousand most frequent corpus items. The majority of the new lemma candidates identified in the study are also included in the frequency list. The finding that many of the unsuccessful look-ups are due to wrongly recognised stems or searches for full orthographic words suggests that it is the lemmatisation strategy that should be further evaluated.

References

Abdulla A., Halme R., Harjula L. and Pesari-Pajunen M. (eds). 2002. Swahili–Suomi–Swahili-sanakirja. Helsinki: Suomalaisen Kirjallisuuden Seura.
Amharic Corpus. Created by Maria Obedkova under the guidance of Boris Orekhov. Accessed on 5 April 2017. http://web-corpora.net/AmharicCorpus.
HCS – Helsinki Corpus of Swahili. 2004. Compilers: Institute for Asian and African Studies (University of Helsinki) and CSC – Scientific Computing Ltd.
HCS 2.0 – Helsinki Corpus of Swahili 2.0. Accessed on 5 April 2017. http://urn.fi/urn:nbn:fi:lb-2014032624.
Johnson F. 1939a. A Standard Swahili-English Dictionary (founded on Madan’s Swahili-English Dictionary).
Nairobi, Dar-es-Salaam: Oxford University Press.
Johnson F. 1939b. Kamusi ya Kiswahili yaani Kitabu cha Maneno ya Kiswahili. Nairobi: Oxford University Press in association with Sheldon Press.
Krapf L. 1882. A Dictionary of the Suahili Language. London: Trubner and Company Ludgate Hill.
Oxford English Corpus. Oxford University Press. Accessed on 5 April 2017. https://en.oxforddictionaries.com/explore/oxford-english-corpus.
(TUKI) Taasisi ya Uchunguzi wa Kiswahili. 1981. Kamusi ya Kiswahili Sanifu. Nairobi/DSM: Oxford University Press.
Wójtowicz B. 2013. Słownik suahili-polski. Warszawa: Elipsa. Accessed on 5 April 2017. http://kamusi.pl/.
Atkins B. T., Clear J. and Ostler N. 1992. ‘Corpus Design Criteria.’ Literary and Linguistic Computing 7.1: 1–16.
Atkins B. T. and Rundell M. 2008. The Oxford Guide to Practical Lexicography. Oxford: Oxford University Press.
Benson T. G. 1964. ‘A Century of Bantu Lexicography.’ African Language Studies 5: 64–91.
Bergenholtz H. and Johnsen M. 2005. ‘Log Files as a Tool for Improving Internet Dictionaries.’ Hermes, Journal of Linguistics 34: 117–141.
Bergenholtz H. and Johnsen M. 2007. ‘Log Files Can and Should Be Prepared for a Functionalistic Approach.’ Lexikos 17: 1–20.
Chuwa A. 1996. ‘Problems in Swahili Lexicography.’ Lexikos 6: 323–329.
Chuwa A. 1999. ‘Umuhimu wa Kamusi katika Ufundishaji wa Kiswahili’ In Tumbo-Masabo Z. Z. and Chiduo E. K. F. (eds), Kiswahili katika elimu. Dar-es-Salaam: TUKI, 125–136.
De Pauw G. and de Schryver G.-M. 2008. ‘Improving the Computational Morphological Analysis of a Swahili Corpus for Lexicographic Purposes.’ Lexikos 18: 303–318.
De Pauw G., de Schryver G.-M. and Wagacha P. W. 2009. ‘A Corpus-based Survey of Four Electronic Swahili–English Bilingual Dictionaries.’ Lexikos 19: 340–352.
De Schryver G.-M. 2002.
‘Web for/as Corpus: A Perspective for the African Languages.’ Nordic Journal of African Studies 11.2: 266–282.
De Schryver G.-M. and Joffe D. 2004. ‘On How Electronic Dictionaries are Really Used’ In Williams G. and Vessier S. (eds), Proceedings of the 11th EURALEX International Congress. France: Université de Bretagne Sud, 187–196.
De Schryver G.-M., Joffe D., Joffe P. and Hillewaert S. 2006. ‘Do Dictionary Users Really Look Up Frequent Words? – On the Overestimation of the Value of Corpus-based Lexicography.’ Lexikos 16: 67–83.
De Schryver G.-M. and Prinsloo D. J. 2000. ‘The Compilation of Electronic Corpora, with Special Reference to the African Languages.’ Southern African Linguistics and Applied Language Studies 18: 89–106.
De Schryver G.-M. and Prinsloo D. J. 2001. ‘Towards a Sound Lemmatisation Strategy for the Bantu Verb through the Use of Frequency-based Tail Slots – with Special Reference to Cilubà, Sepedi and Kiswahili’ In Mdee J. S. and Mwansoko H. J. M. (eds), Makala ya kongamano la kimataifa Kiswahili 2000. Proceedings. Dar es Salaam: TUKI, Chuo Kikuu cha Dar es Salaam, 216–242.
Fuertes-Olivera P. A. 2012. ‘Lexicography and the Internet as a (Re-)source.’ Lexicographica 28: 49–70.
Gatto M. 2011. ‘The ‘Body’ and the ‘Web’: The Web as Corpus Ten Years On.’ ICAME Journal 35: 35–58.
Ghani R., Jones R. and Mladenic D. 2001. ‘Mining the Web to Create Minority Language Corpora’ In Paques H., Liu L. and Grossmann D. (eds), Proceedings of the 10th International Conference on Information and Knowledge Management, Atlanta, GA, USA, November 5–10, 2001. New York: ACM, 279–286.
Gouws R. H. and Prinsloo D. J. 2005. Principles and Practices of South African Lexicography. Stellenbosch: African Sun Media.
Hanks P. 2002. ‘Mapping Meaning onto Use’ In Corréard M. H. (ed.), Lexicography and Natural Language Processing: A Festschrift in Honour of B.T.S. Atkins. UK: Euralex, 156–198.
Herms I. 1995. ‘Swahili – Lexikographie: Eine Kritische Bilanz.’ Afrikanistische Arbeitspapiere 42: 192–196.
Hurskainen A. 1992a. ‘A Two-Level Computer Formalism for the Analysis of Bantu Morphology. An Application to Swahili.’ Nordic Journal of African Studies 1.1: 87–122.
Hurskainen A. 1992b. ‘Computer Archives of Swahili Language and Folklore – What is it?’ Nordic Journal of African Studies 1.1: 123–127.
Hurskainen A. 1994. ‘Kamusi ya Kiswahili Sanifu in Test: A Computer System for Analyzing Dictionaries and for Retrieving Lexical Data.’ Afrikanistische Arbeitspapiere 37 (Swahili Forum I): 169–179.
Hurskainen A. 1999. ‘SALAMA: Swahili Language Manager.’ Nordic Journal of African Studies 8.2: 139–157.
Hurskainen A. 2002. ‘Tathmini ya Kamusi Tano za Kiswahili.’ Nordic Journal of African Studies 11.2: 283–301.
Hurskainen A. 2003. ‘New Advances in Corpus-based Lexicography.’ Lexikos 13: 111–132.
Hurskainen A. 2008. ‘SALAMA Dictionary Compiler – a Method for Corpus-Based Dictionary Compilation.’ Technical Reports in Language Technology, Report No 2. Accessed on 5 April 2017. http://www.njas.helsinki.fi/salama/salama-dictionary-compiler.pdf.
Jama Musse Jama. 2016. ‘Somali Corpus: State of the Art, and Tools for Linguistic Analysis.’ Accessed on 5 April 2017. http://www.somalicorpus.com/documents/JamaMusse-J-Somali-Corpus-StateOfTheArt.pdf.
Kiango J. G. (ed.) 1995. Dhima ya Kamusi katika Kusanifisha Lugha. Dar es Salaam: TUKI/UDSM.
Kiango J. G. 2000. Bantu Lexicography: a Critical Survey of the Principles and Process of Constructing Dictionary Entries. Tokyo: Tokyo University of Foreign Studies.
Kilgarriff A. 2012. ‘Getting to Know your Corpus’ In Sojka P., Horak A., Kopecek I. and Pala K.
(eds), Text, Speech and Dialogue. 15th International Conference, TSD 2012, Brno, Czech Republic, September 3–7, 2012, Proceedings. Springer, Lecture Notes in Computer Science, 3–15.
Kilgarriff A. 2013. ‘Using Corpora as Data Sources for Dictionaries’ In Jackson H. (ed.), The Bloomsbury Companion to Lexicography. London: Bloomsbury, 77–96.
Kilgarriff A. and Grefenstette G. 2003. ‘Introduction to the Special Issue on the Web as Corpus.’ Computational Linguistics 29.3: 333–347.
Kilgarriff A., Pomikalek J., Jakubíček M. and Whitelock P. 2012. ‘Setting up for Corpus Lexicography’ In Fjeld R. V. and Torjusen J. M. (eds), Proceedings of the 15th EURALEX International Congress, 7–11 August 2012. Oslo: Department of Linguistics and Scandinavian Studies, University of Oslo, 778–785.
Koplenig A., Meyer P. and Müller-Spitzer C. 2014. ‘Dictionary Users do Look up Frequent Words. A Log File Analysis’ In Müller-Spitzer C. (ed.), Using Online Dictionaries. Berlin: Walter de Gruyter (Lexicographica Series Maior 145), 229–249.
Lew R. 2011a. ‘User Studies: Opportunities and Limitations’ In Akasu K. and Satoru U. (eds), ASIALEX2011 Proceedings. Lexicography: Theoretical and Practical Perspectives. Kyoto: Asian Association for Lexicography, 7–16.
Lew R. 2011b. ‘Studies in Dictionary Use: Recent Developments.’ International Journal of Lexicography 24.1: 1–4.
Lew R. and de Schryver G.-M. 2014. ‘Dictionary Users in the Digital Revolution.’ International Journal of Lexicography 27.4: 341–359.
Mbaabu I. 1995. ‘Dhima ya Kamusi katika Kufundisha na Kujifunza Kiswahili’ In Kiango J. G. (ed.), Dhima ya Kamusi katika Kusanifisha Lugha. Dar es Salaam: TUKI/UDSM, 47–59.
Mdee J. S. 1984. ‘Constructing an Entry in a Learner’s Dictionary of Standard Kiswahili’ In Hartmann R. R. K. (ed.), Lexeter ’83 Proceedings.
Papers from the International Conference on Lexicography at Exeter, 9–12 September 1983. Tübingen: Max Niemeyer Verlag, 237–241.
Mdee J. S. 1999. ‘Dictionaries and the Standardization of Spelling in Swahili.’ Lexikos 9: 119–134.
Muaka L. and Muaka A. 2006. Tusome Kiswahili. Let’s Read Swahili: Intermediate Level. Madison, Wisconsin: NALRC Press.
McGrath D. and Marten L. 2003. Colloquial Swahili: The Complete Course for Beginners. London: Routledge.
Müller-Spitzer C., Wolfer S. and Koplenig A. 2015. ‘Observing Online Dictionary Users: Studies Using Wiktionary Log Files.’ International Journal of Lexicography 28.1: 1–26.
Ohly R. 2002. ‘Globalization of Language and Terminology in African Context.’ Africana Bulletin 50: 135–157.
Prinsloo D. J. 1991. ‘Towards Computer-Assisted Word Frequency Studies in Northern Sotho.’ South African Journal of African Languages 11.2: 54–60.
Prinsloo D. J. 2015. ‘Corpus-based Lexicography for Lesser-resourced Languages – Maximizing the Limited Corpus.’ Lexikos 25: 285–300.
Prinsloo D. J. and de Schryver G.-M. 2001. ‘Taking Dictionaries for Bantu Languages into the New Millennium – with Special Reference to Kiswahili, Sepedi and isiZulu’ In Mdee J. S. and Mwansoko H. J. M. (eds), Makala ya kongamano la kimataifa Kiswahili 2000. Proceedings. Dar es Salaam: TUKI, Chuo Kikuu cha Dar es Salaam, 188–215.
Scannell K. 2007. ‘The Crúbadán Project: Corpus Building for Under-resourced Languages’ In Fairon C., Naets H., Kilgarriff A. and de Schryver G.-M. (eds), Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop. Louvain-la-Neuve, Belgium, 5–15. Accessed on 5 April 2017. https://borel.slu.edu/pub/wac3.pdf.
Tarp S. and Fuertes-Olivera P. A. 2016.
‘Advantages and Disadvantages in the Use of Internet as a Corpus: The Case of the Online Dictionaries of Spanish Valladolid-UVa.’ Lexikos 26: 273–295.
Trap-Jensen L., Lorentzen H. and Sørensen N. H. 2014. ‘An Odd Couple – Corpus Frequency and Look-Up Frequency: What Relationship?’ Slovenščina 2.0, 2: 94–113. Accessed on 5 April 2017. https://dsl.dk/medarbejdere/medarbejdere-publikationer-m-m/ltj/an-odd-couple.
Verlinde S. and Binon J. 2010. ‘Monitoring Dictionary Use in the Electronic Age’ In Dykstra A. and Schoonheim T. (eds), Proceedings of the XIV EURALEX International Congress, 6–10 July 2010. Leeuwarden/Ljouwert: Fryske Akademy – Afûk, 1144–1151.
Wójtowicz B. 2016. ‘Learner Features in a New Corpus-based Swahili Dictionary.’ Lexikos 26: 402–415.

Footnotes
1 The collecting of data on resources for African languages was undertaken by http://aflat.org/. Unfortunately, the site seems to no longer be updated [accessed 21.02.2017].

© 2017 Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com

Publisher: Oxford University Press
ISSN: 0950-3846
eISSN: 1477-4577
DOI: 10.1093/ijl/ecx011

The aim of this paper is to evaluate the HCS as a basis for lemma-list compilation for a Swahili-based bilingual dictionary. Our objective is to analyse the log files of a Swahili-Polish dictionary whose macrostructure was based on an HCS-derived frequency list. We will investigate whether dictionary users actually look up words that are frequent in the corpus. The research was also motivated by related questions concerning the updating of an already existing corpus-based dictionary, establishing how to improve the dictionary to best meet users' expectations with regard to lemma selection, and whether the frequency list should be the main source of new lemma candidates.

2. Towards a modern lexicography of Swahili

The main sources of knowledge on Swahili lexicography are the dictionaries themselves. The expanded introductions that precede the actual dictionaries in the oldest editions document the 19th-century beginnings of this new field on the African continent. They help to reconstruct a coherent history and provide insight into the work of the lexicographers. Since the 19th century, hundreds of Swahili-language dictionary publications of various types have been issued (cf. Ohly 2002). These include bilingual dictionaries, both to and from such languages as English, French, German, Russian and Japanese, monolingual dictionaries, and various specialized dictionaries. However, the vast majority of Swahili dictionaries in no way specify the methods used in their compilation. What we can learn from some of the introductory notes is that most of the dictionaries were based on their predecessors. First, Krapf's manuscripts, as well as the dictionary itself (Krapf 1882), were sources referenced both by contemporary scholars and by later Swahili-language researchers. Together with the works of other German missionaries, they also served as a basis for other publications (cf. Benson 1964).
Next, the root-based dictionary by Johnson (1939a) revolutionized Swahili lexicography for years to come and became a fundamental source for future lexicographers and the most widely recognized work in the field (cf. Chuwa 1996). The publication of the monolingual Swahili dictionary by the African TUKI institute in 1981 marks the beginning of the next lexicographical period and fulfils Benson's (1964) hope for increased involvement of African researchers. The dictionary's importance is of special note, as in the following years it became the fundamental source for subsequent bilingual works. It was aimed at supplanting the already 40-year-old dictionary by Johnson (1939b) and replacing it as the main language-standardization resource. It has enjoyed unwavering and overwhelming popularity (cf. Herms 1995, also Hurskainen 1994), and eighteen reprints had been published by 1992. A new era of Swahili lexicography was initiated with the publication of the Swahili-Finnish-Swahili dictionary (Abdulla et al. 2002). This dictionary, compiled at the University of Helsinki, was the first to be based on the modern method of exploiting electronic text corpora. The authors of this traditional paper publication relied on data from an electronic corpus of the Swahili language – the Helsinki Corpus of Swahili. This marked the beginning of a modern corpus-based Swahili lexicography, by then fairly standard in the Western world. The other HCS-based dictionary is the Swahili-Polish Dictionary, which was released as a simplified print version of its electronic counterpart (Wójtowicz 2013). Only two years after the publication of the Finnish dictionary, the TshwaneDJe Swahili–English Dictionary (De Schryver et al. 2006, De Pauw et al. 2009) joined the market. It was the first corpus-driven electronic dictionary of Swahili, with a new approach to the lemmatisation of headwords (cf. De Schryver et al. 2006).
The content of this dictionary is based on web-derived corpus data – ‘a balanced and representative Swahili corpus of around fifteen million running words’ (De Schryver et al. 2006: 70). It was made freely available online in 2004, and can also be purchased as a stand-alone download. The dictionary includes over 15,000 entries. An important innovation was the inclusion among the headwords of orthographic forms, in addition to the usual stems, selected on the basis of a frequency count. It features basic morphological decomposition, corpus-based examples, and a system of cross-references linking root-derivative pairs.

3. Size of the corpus

With the new Swahili-Polish dictionary to be based on corpus data, the main concern was the size of the available corpus. The use of electronic corpora in English lexicography was already well established in the 1980s (Atkins and Rundell 2008). Lesser-resourced languages have struggled to keep up, compiling and exploiting language corpora to enhance their lexicography. But in the case of such languages the compilation of a corpus is more challenging, and the issue of corpus size becomes meaningful. As Kilgarriff (2013) clearly states, ‘for lexicography, the relevant kind of data is a large collection of text: a corpus. A corpus is just that: a collection of data – text and speech – when viewed from the perspective of language research’. But the question of how to construct an appropriate collection to address our needs remains pertinent, as does the question of whether we in fact have to construct anything at all. Similarly to Kilgarriff, Fuertes-Olivera (2012: 51) defines a lexicographical corpus as ‘any collection of texts where lexicographers can find inspiration’. It has generally been assumed that corpora for lexicography should be large and diverse, and it has been demonstrated many times that the bigger the corpus, the better (cf. Hanks 2002, Atkins and Rundell 2008, Kilgarriff et al. 2012).
If the corpus is big enough, it will provide evidence about anything that should be in the dictionary; otherwise it may lack certain terms. Until the 1980s, the standard size of a corpus amounted to one million words, after which expected corpus sizes grew (cf. De Schryver and Prinsloo 2000). At the turn of the 21st century, Hanks (2002: 157) claimed that ‘in a corpus of 100 million words, a simple right- or left-sorted concordance shows clearly most of the normal patterns of usage for all words except the very rare or very unusual’. However, the judgement depends on the needs. A 100-million-word corpus may also be regarded as insufficient for a larger dictionary, as shown by Kilgarriff et al. (2012), who state that ‘for a 40,000-headword dictionary, a corpus of 2 billion words is substantially better, missing much less, than a corpus of 100 million words’. A hundred million words is large enough for many empirical studies of language, as it provides information on the dominant meanings and usage patterns of the roughly 10,000 core words, as has been demonstrated for English, but it lacks evidence for rarer words, rare meanings of common words, and combinations of words (Kilgarriff and Grefenstette 2003). In the 1990s, the British National Corpus, comprising 100 million words of spoken and written British English, was such a model lexicographic corpus. For a time it was a source for lexicographic research, but since then the amount of data used to build corpora has increased significantly. Corpus sizes grew as material collected from web pages was added, and the notion of representativeness was set aside. For example, the Oxford English Corpus, collected from web pages, contains nearly 2.5 billion English words from all parts of the world, and new texts are continually being added.
As further examples of very large corpora, Prinsloo (2015) mentions Google Books, with 155 billion words for American English, 45 billion for Spanish and 34 billion for British English. These are extreme examples; nevertheless, a figure of at least a few hundred million words is nowadays regarded as the unofficial standard for electronic corpora. However, this is only true for better-resourced languages.

3.1. Corpora of lesser-resourced languages

The situation is quite different when it comes to lesser-resourced languages. Typically, only limited, small, unbalanced and raw corpora are available, if any at all. The scarcity of printed material, and often its unavailability, makes a compilation of ten million words a big corpus (cf. Prinsloo 2015). De Schryver and Prinsloo (2000) mention the compilation of the Lubà corpus, for which only 100 written sources were available at the time. In the field of African-language corpora, the acknowledged pioneers include Prinsloo (1991) for Pedi, and Hurskainen (1992a, 1992b) for Swahili. The first corpora were collected in a traditional way (cf. De Schryver 2002), while nowadays corpora for lesser-resourced languages, as in the case of most African tongues, are typically compiled automatically from web pages. For example, the Crúbadán Project (Scannell 2007) offers corpora for over 2,000 languages created by crawling the web, many of them African. Although the idea of the Web as Corpus raises a number of objections, one being its opportunistic nature (cf. Gatto 2011), for some languages it is in fact the only free and available source of language data; its value has therefore come to be widely recognized, as corpora for lesser-resourced languages, if available at all, are often of an opportunistic type as well (Kilgarriff and Grefenstette 2003, Ghani et al. 2001, De Schryver and Prinsloo 2000, Fuertes-Olivera 2012, Tarp and Fuertes-Olivera 2016).
Other arguments against using the web as a corpus concern its constantly changing content, the presence of duplicates, and its unsupervised and unedited texts. One practical consequence of the web's dynamic nature is that it is impossible to strictly reproduce a corpus study. Nevertheless, the web as source data for African languages has been exploited, for example, by De Schryver (2002) and De Schryver and Prinsloo (2000), whose papers discuss its potential and the ongoing endeavour of collecting corpora for some African languages. Data are not available for all the languages spoken in Africa, but at least some of them can take advantage of this method. The first collected corpora were small and have grown substantially since the beginning of the century; the Kiswahili Internet Corpus, for instance, has increased in size from 1.7 million to 20 million words (De Schryver and Prinsloo 2000, De Pauw et al. 2009), and a version of it was used to build the Swahili-English Dictionary (De Schryver et al. 2006). Although Swahili and South African languages have been leading the field of corpus compilation and annotation for years, new projects are also being undertaken. Most of the (web-crawled) African-language corpora lack linguistic annotation, since tools for performing such annotation do not yet exist, but this is not always the case.1 An annotated Amharic corpus of 23 million tokens is available, while three million tagged words are accessible for Somali (Jama Musse Jama 2016). But how big should a corpus be to provide reliable evidence on the language for lexicographic purposes? Obviously, for those working with lesser-resourced languages, it would be interesting to establish the minimum corpus size required. The issue of corpus size was raised by De Schryver and Prinsloo (2000), who presented a comparison of data from different phases of the Pedi corpus compilation.
Corpus size varied from a hundred thousand words in phase one to over one million words in phase three. The experiments revealed that ‘as far as high-frequency items are concerned, it can be predicted from these comparisons that increasing the size of a small-size corpus does not substantially influence the stability of the corpus’ and ‘that less-frequently-used items detected in a small-size corpus retain a low frequency value even if the corpus is substantially enlarged’. This means that in a corpus five or ten times bigger than the initial one, the relative frequency of items remains more or less stable. However, a bigger corpus still has the advantage of capturing a greater number of types. Similar research by Prinsloo (2015) on corpora for Pedi, English and Afrikaans examines how relatively small and often unbalanced corpora can be utilized for lexicographic purposes in the absence of large collections. The author investigates how enlarging a corpus from one to ten million, and then from ten million to a hundred million words, influences its usability. While it was possible to compare the English corpus frequency lists with the basic words for English from the Macmillan English Dictionary for Advanced Learners, the experiment of comparing lists derived from corpora of different sizes was conducted for Afrikaans and Pedi. An overlap of 72.8% for Pedi and 83.4% for Afrikaans was observed when the list derived from a one-million-word corpus was compared to one derived from a ten-million-word corpus. Predictably, a substantial difference appeared in the raw number of occurrences of items. An item that occurs 11 times in the one-million-word (Afrikaans) corpus can be expected to occur around 100 times in a ten-million-word corpus and over 1,000 times in a hundred-million-word corpus. This matters enormously at the microstructural level, when we need examples to describe the meaning of a word, its senses, collocations, etc.
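The top-N overlap comparison described above comes down to set intersection over ranked frequency lists. A minimal sketch, with invented toy counts rather than data from any of the corpora discussed:

```python
# Sketch of the overlap experiment: given two frequency lists
# (lemma -> count) from corpora of different sizes, measure what share of
# the smaller corpus's top-n lemmas also appears in the larger corpus's
# top-n. The lemma counts below are invented for illustration.
from collections import Counter

def top_n(freq, n):
    """Return the set of the n most frequent lemmas."""
    return {lemma for lemma, _ in Counter(freq).most_common(n)}

def overlap(freq_small, freq_large, n):
    """Percentage of freq_small's top-n lemmas found in freq_large's top-n."""
    return 100.0 * len(top_n(freq_small, n) & top_n(freq_large, n)) / n

small_corpus = {"na": 900, "ya": 850, "kwa": 600, "watu": 120, "soka": 3}
large_corpus = {"na": 9500, "ya": 8300, "kwa": 6100, "nchi": 1400, "watu": 1100}

print(overlap(small_corpus, large_corpus, 4))  # 75.0
```

On real data the same function, applied to the one-million-word and ten-million-word lists, would yield figures comparable to the 72.8% (Pedi) and 83.4% (Afrikaans) overlaps reported.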
A one-million-word corpus may produce an insufficient number of concordance lines. When it comes to real examples, Prinsloo found a one-million-word corpus to be quite adequate for commonly used words, the conclusion being that even small corpora of one million words can assist the lexicographer quite well in the compilation of small bilingual and monolingual dictionaries of approximately 5,000 lemmas. When the corpus is enlarged to ten million words, coverage of commonly used words becomes more reliable, and less work is needed to find missing items. All in all, the research confirms that the bigger the corpus the better, as the author concludes by stating that ‘a 100 million corpus will be extremely valuable’ (Prinsloo 2015: 299). Much as a bigger corpus is desirable, such an amount of data is simply not always available for lesser-resourced languages, and in the study itself the author had to make do with a 10-million-word corpus of Pedi, while for Afrikaans a 100-million-word corpus was used. 4. The Helsinki Corpus of Swahili The Helsinki Corpus of Swahili is the only large and annotated corpus of standard Swahili available to the linguistic community for free academic use. Only recently, in 2016, was it moved to a new location, and it is now available in Kielipankki – the Language Bank of Finland. The previous version of the corpus contained around 12 million words, and an annotated version was available, after a signed agreement, through the Lemmie web interface and on a Linux server. At present, an expanded version of the corpus – the Helsinki Corpus of Swahili 2.0 of about 25 million words – is available in two formats. The annotated version, available in Korp, may be accessed after logging in with university or CLARIN credentials, or by applying for access. It contains morphological and syntactic annotation. The non-annotated version, comprising the same content but in the form of plain text without any linguistic codes, is available for free download.
The HCS 2.0 consists of two parts: (1) old material; and (2) new material added after 2004. The old material contains two types of resources: books and news from before 2003. The data are basically the same as those found in the HCS 1.0, the only difference being that whole texts, and not only sections of books, are included. In addition, in the new version the sentences are shuffled (for reasons related to copyright licensing), while in the old corpus they were in the original order. The books are mostly by such renowned Swahili authors as Shaaban Robert, E. Kezilahabi, E. Hussein, A. Lihamba, Mohamed S. Mohamed, and others. However, most of the texts come from Internet newspapers from between 1998 and 2003. The new material added after 2004 consists of two main sources – Bunge, the transcripts of Tanzanian Parliament debates from 2004–2006, and news texts from 2004–2015. The sizes of the two parts, the old and the new, are more or less equal. Most of the corpus material was retrieved from the Web, especially from 2000 onwards, as texts on the Web became increasingly available. The texts come from news media and open government pages. Some texts, such as books, were scanned and proofread. Furthermore, some of the oldest news material was manually copied and transferred online, and all material has gone through a series of formatting and correction routines, in the course of which a few thousand errors were identified and corrected. The texts have the annotation layer provided by SALAMA – the Swahili Language Manager (cf. Hurskainen 1999, 2008), an environment for the computational processing of the Swahili language. SALAMA includes a comprehensive language analyser of Swahili text; therefore, the corpus contains various kinds of linguistic information attached to each token.
Each word in the annotated corpus carries the following types of information: the token, the stem, part-of-speech, a morphological description, an English gloss, a syntactic tag, and other descriptors for verbs. The other component, SALAMA-DC – the SALAMA Dictionary Compiler – is a comprehensive system for producing dictionary entries from any word-form in Swahili. It produces entries with appropriate linguistic information, single-word headwords, multiword headwords, various types of cross-references, and a selection of usage examples in context. The example text may be further translated into English. 4.1. The HCS as a source of dictionary data Despite its modest size by current standards, the HCS is one of the biggest annotated corpora for African or other lesser-resourced languages. Data from the HCS were used to build a new electronic dictionary of Swahili – the Swahili-Polish dictionary, available since 2013 for free online use. Work on the dictionary started in 2010, using a corpus version based on HCS 1.0. It consisted of over 12.5 million words taken from numerous literary books and current news sources, most of which came from texts written after 2000. The HCS was the best resource we could obtain considering our dictionary needs. Although it was neither representative nor balanced, the corpus seemed to contain appropriate data for our target users, that is, learners of Swahili in Poland and other travellers to Swahili-speaking African countries. Prinsloo’s (2015) findings showed that a small-sized corpus should supply lexicographers with enough data to compile a good dictionary for our target group, as we aimed to create a dictionary of up to 10,000 entries.
The issue of the balance and representativeness of a corpus was discussed by De Schryver and Prinsloo (2000), who, after presenting various viewpoints, conclude that ‘it is clear that linguists disagree whether a corpus should try to be balanced or representative’ (De Schryver and Prinsloo 2000: 92). The literature does not provide an answer to the question of what the corpus should be representative of. Atkins, Clear and Ostler (1992) introduced the concept of organic corpora, which fits the situation for African languages and reflects a living language (Gouws and Prinsloo 2005). Lexicographers need to be aware that the type of data they are using should be matched to the aims of the particular dictionary (cf. Atkins, Clear and Ostler 1992, Atkins and Rundell 2008, Kilgarriff 2012). What has to be borne in mind is that each collection contains noise and biases stemming from the types of texts used. Hurskainen (2003) notes that a corpus, no matter its size, usually does not provide all the words needed, even for a fairly modest dictionary. Even basic vocabulary may be missing from the corpus, as everyday matters may not be reflected in the component texts. Apart from its acceptable size, the corpus consists of texts that an average student of Swahili comes across in their day-to-day learning. The literary works of renowned Swahili authors are read in language classes, while news items written in Swahili are the most easily accessible texts on the Internet. We used the corpus data for the compilation of the lemma list and for microstructural aspects, such as sense distinction, collocations, idioms and examples of usage. At present, the dictionary contains over 6,000 Swahili entries and over 7,000 entries in the searchable Polish index. The dictionary can be searched in both directions, but the user has to choose which language they are querying. The lemma list of the dictionary is mainly based on an HCS-derived frequency list of over ten thousand lemmatised entries.
According to lexicographic practice, the dictionary compilation process should take into account the following three components: copying, introspection and looking at data (Kilgarriff 2013, Atkins and Rundell 2008). Therefore, apart from basing the lemma list on the frequency count, we also looked at other dictionaries, and the list was compared with the vocabulary from students’ books (McGrath and Marten 2003, Muaka and Muaka 2006). We hoped to identify missing basic vocabulary this way. Furthermore, closed sets such as days of the week, months, pronouns, and names of countries and continents were verified. Entries identified as missing were added, and additional vocabulary was contributed by students who worked on selected sets they found useful, such as animals, musical instruments, or means of transport (cf. Wójtowicz 2016). Every dictionary entry includes POS information and additional morphological features. Derivatives are treated as main entries, and with regard to lemmatization we have decided to follow the standard solutions of the Swahili lexicographic tradition, used in nearly all Swahili dictionaries (see Kiango 2000 for a thorough discussion, and De Schryver et al. 2006 for a novel approach). Therefore, we have ignored prefixes and listed the stems alone for verbs, numerals, and inflected adjectives. On the other hand, in response to beginner learners’ needs, we have broken with this tradition in some other cases and listed, for example, pronouns in their full form (with the stems also included as separate entries). 5. Analysing dictionary log files Whether our decision to use the HCS as a source of dictionary data was sound can be evaluated by the users, and hopefully their searches can shed some light on that issue. To that end, a study of the dictionary’s log files was conducted. While research into dictionary use has become a respectable domain in its own right (cf.
Lew 2011b, Lew and De Schryver 2014), the digital revolution in lexicography has introduced, among other things, the possibility of tracking users’ behaviour while they use a digital dictionary. Lexicographers can tap this record of user behaviour to inform lemma-list selection and dictionary updating. According to Bergenholtz and Johnsen (2005), analyses of log files may be used as a tool for improving Internet dictionaries. De Schryver and Joffe (2004: 188) go further in suggesting ‘that an automated analysis of the log files will enable the dictionary to tailor itself to each and every particular user’. While this remains an attractive prospect, for the time being a manual analysis of log files may help reveal which headwords have been successfully retrieved, as well as the ones that have been sought but not found. Based on this information, we can modify the content of a dictionary to meet the users’ needs. However, the limitations of this method have to be taken into consideration as well (cf. Lew 2011a, Bergenholtz and Johnsen 2007). The possibility of analysing log files attracted the attention of researchers as early as the 1980s, but only recently have more research experiments been described (cf. De Schryver and Joffe 2004, Lew 2011a, Koplenig et al. 2014). The empirical basis of such research varies significantly, from over 20,000 look-ups in De Schryver and Joffe (2004), to over one million look-ups in Bergenholtz and Johnsen (2005), to almost 30 million in Trap-Jensen et al. (2014). In this regard, our study has a moderate empirical base, as it concerns a less popular language pair. 5.1. Swahili-Polish dictionary log files The Swahili-Polish dictionary was placed online, and since January 2013 the dictionary user queries have been saved in four log files: Swahili found entries, Swahili not-found, Polish found and Polish not-found.
While conducting a search, the users need to indicate which language they are searching, and as a result the searches are saved in different files. The files include the strings that users have typed in the search box, along with the number of searches for each string. No IP addresses or other identifying information have been stored. The data were saved in .csv files and the analysis was carried out by the author manually and with the use of regular expressions only. In the Swahili found file, in addition to the string and the number of searches, information on the POS and ID of the entry that was returned to the user is provided, so polysemous entries can be traced easily. Additional analyses may be conducted using data from Google Analytics. In order to answer the research question of whether dictionary users most frequently search for entries from the HCS-based frequency list, we will investigate mainly the Swahili found and Swahili not-found log files. As the log files only record what the users have entered into the search box, it should be stressed that we can never be sure of who the user was or of the user’s actual intentions (cf. Lew 2011a). Over a four-year period, up to 15 February 2017, a total of 53,592 queries were made. This makes for an average of 36 look-ups per day, with an increase in searches over time from 25 per day in 2013–14 to 46.5 since 2015. The number drops during holidays, especially the long summer vacations, to an average of 15 look-ups per day, which supports our assumption that the dictionary is mostly used by students of the language. The number of look-ups noted in each file is, accordingly, 25,466 (48% of the total) in the Swahili found, 14,708 (27%) in the Swahili not-found, 8,052 (15%) in the Polish found, and 5,366 (10%) in the Polish not-found. Thus, the majority of the look-ups (75%) were in the Swahili-Polish direction (cf. Figure 1). Figure 1. Breakdown of all searches as noted per file.
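The per-file tallies above are the kind of summary that falls straight out of the .csv log files. A minimal sketch, assuming a column layout of string, search count, POS and entry ID for the Swahili found file (the exact field names are our illustration, not the dictionary's actual schema), with invented sample rows:

```python
# Toy sketch of reading one dictionary log file. Each row pairs a searched
# string with its number of searches; the "Swahili found" log additionally
# records the POS and entry ID returned to the user. Column names and the
# sample data below are assumptions for illustration only.
import csv
import io

sample = """string,searches,pos,entry_id
tamu,162,adj,1041
pata,99,v,87
weza,57,v,12
"""

# csv.DictReader yields one dict per logged search string.
rows = list(csv.DictReader(io.StringIO(sample)))

# Total look-ups in this file = sum of the per-string search counts.
total = sum(int(row["searches"]) for row in rows)
print(total)  # 318
```

In the real study the same aggregation over the four files yields the 25,466 / 14,708 / 8,052 / 5,366 breakdown reported above.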
When we compare only the searches for Swahili entries, 63% of them are noted as found and 37% as not-found. Since the user has to manually switch direction to look for Polish words, as many as 11% of the searched strings noted in the Swahili not-found file were identified as Polish words, and these items constitute 25% of the strings searched for at least twice. Out of the 8,912 strings, the majority (73%) were searched for only once. Most of these were misspellings, wrongly decomposed verbs, full orthographic words, and multiword expressions. 5.2. Correlation between look-ups and corpus frequency In the study, we aim to compare the words that users actually look up with the corpus representation of these words, that is, their frequencies in the HCS. In the Swahili found file, 4,430 different strings were searched for over 25,000 times. This also includes searches for separate letters, like k or f, or other strings that do not represent a lemma. Before we look at the analysis of the files, we shall list the main assumptions of the dictionary that influence user search behaviour. The search is carried out on Swahili headwords and on plural forms of nouns, which are included in the dictionary in their full form. Derivatives such as pronouns with class prefixes and irregular verbal forms are treated as separate entries, and users can also search for them. The most difficult operation is the decomposition of the verbal complex: the user has to cut off all prefixal morphemes and search for the verbal root or extended root instead. The top 100 lemmas searched for most frequently constitute 16% of all look-ups, while 64% of all searches were carried out for the top 1,000 lemmas. Out of almost 4,500 strings searched for, only 1,600 have been looked up five times or more. Within the top 100 searches, the number of look-ups falls from 162 searches to only 28.
In the Swahili found file, eight out of the top ten Swahili searches are for verbs (cf. Table 1). Only two items do not represent verbs: the most often looked-up adjective tamu ‘sweet, delicious’, and, at rank ten, kama, which may be a verb or a conjunction.

Table 1. Top 10 Swahili searches

Entry     Look-up rank   No of searches   Corpus rank   English
tamu      1              162              1989          sweet
pata      2              99               14            get
ja        3              93               60            come
toa       4              72               17            put, offer
acha      5              61               177           stop, quit
weza      6              57               4             can
chukua    7              57               143           take
tumia     8              55               28            use
tegemea   9              55               257           rely on
shika     10             54               437           hold
kama      10             54               8             squeeze/if; like

The top ten most often searched items do not correspond to the top ten most frequent words in the corpus, which was expected (cf. De Schryver et al. 2006). However, it is more interesting to see to what extent the users look up words derived from our modest corpus, as opposed to other items. If one compares the look-ups with the HCS-based frequency list, it turns out that 3,220 (out of 4,430), that is 73% of the strings that represent unique lemmas of the dictionary, can be found in the corpus list.
Since the dictionary also allows searching on the plural forms of nouns, a further 454 searches were for plural forms of nouns that are on the list as well, so 83% of all the Swahili found strings were entries from the corpus frequency list. Searches for these entries were carried out over 23,000 times, that is 91% of all look-ups. If one inspects the top 100 searches more closely, one notices that 34 of the top 100 searched lemmas can also be found in the corpus top 100, 79 in the corpus top 500, and 88 in the corpus top 1,000. This corresponds with the findings of De Schryver and Joffe (2004: 190) that ‘users indeed look up the frequent words of the language’. It should be noted that in their study on Sesotho sa Leboa the numbers were slightly lower: with a frequency list derived from a 6.1-million-word Sesotho sa Leboa corpus, 30 of the top 100 searches were found in the corpus top 100, and 63 in the corpus top 1,000. On the other hand, a close examination of the frequency list as compared to all of the searches reveals that as many as 99 of the top 100 most frequent Swahili words have been queried (cf. Figure 2). The only entry that has never been looked up, or in other words never entered in precisely such a form, was the multiword expression wa na ‘have; lit. to be with’. Moving down the list, out of the top 500 lemmas, 480 (96%) have been searched for; out of the top 1,000 lemmas – 935 (94%); and from among the top 2,000 lemmas – 1,751 (88%). If we disregard the 42 multiword expressions that are on the list but have not been searched for, the numbers go up to 97% for the top 500, 94% for the top 1,000, and 90% for the top 2,000. Figure 2. Proportion of lemmas in the frequency list NOT searched for. The lemmas that have never been looked up mainly come from newspapers and are associated with football (e.g.
soka ‘soccer’), technology (teknolojia ‘technology’, kompyuta ‘computer’), business (bosi ‘boss’, menejimenti ‘management’, bilioni ‘billion’, milioni ‘million’), the Catholic press (katoliki ‘Catholic’, padri ‘priest’, ubatizo ‘baptism’), and politics (demokrasia ‘democracy’). A similar analysis was carried out two years earlier, when only half of the present number of searches had been made. At that time, from among the top 500 lemmas, 456 (91%) had been searched for; out of the top 1,000 lemmas – 832 (83%); and from among the top 2,000 lemmas – 1,436 (72%). The numbers were thus initially lower but rose steadily over the period discussed, so the coverage of the looked-up lemmas increased over time. The data show that the users look up frequent words of the language and that the more look-ups have been made, the better the coverage of the top frequent lemmas. As the frequency goes down, the number of corresponding lemmas searched for also falls. On this basis, we can state that usage should be an essential requirement for a word to be included in a small general-purpose dictionary. At the same time, it has been pointed out (De Schryver et al. 2006, Verlinde and Binon 2010) that we cannot predict user behaviour beyond the top few thousand words. On the other hand, research by Koplenig et al. (2014), followed up in Müller-Spitzer et al. (2015) and Trap-Jensen et al. (2014), has shown that dictionary compilers do not overestimate the value of corpus-based lexicography, since users frequently look up the frequent words even beyond the first few thousand. So, there is a relationship between corpus frequency and the frequency of look-ups. Müller-Spitzer et al. (2015) conclude that ‘frequency does matter – even in lower frequency bands’. While 83% of the look-ups were for items from the frequency list, another 10% did not match any of the dictionary entries.
These are strings that stand for separate letters, like k or f, or other strings that do not represent a lemma but were treated in accordance with the regular expression ‘lemma begins with’. The remaining 7% of the searches were for entries added manually: pronouns, conjunctions, days of the week, nationalities, languages, geographical terms, and names of fruit, some musical instruments, or animals. Only 4% of the searches (160 items) were for these additional lexemes; the other 3% were for full forms of pronouns and adjectives with class prefixes, irregular forms of imperatives, and multiword expressions, the last of these representing only 0.5% of the searches. These strings were searched for 1,425 times, only 5.6% of all the queries. In the Swahili not-found file, when we leave out searches identified as Polish words, we have the following top ten searches: mzuri, kuja, kwenda, nzuri, hakuna matata, kufa, mufasa, ninakupenda, njema, patia. Seven items represent orthographic forms of entries that are present in the dictionary: the adjectives with a class prefix mzuri and nzuri ‘nice’ and njema ‘good’, the infinitives kuja ‘to come’, kwenda ‘to go’ and kufa ‘to die’, and the inflected verbal form ninakupenda ‘I love you’. There is also a multiword expression, hakuna matata, and the name Mufasa, known from the film ‘The Lion King’. Only one item, patia, may be identified as a lemma lacuna. The top 500 strings of the Swahili not-found file were analysed one by one and annotated as a Polish word, a mistake, an orthographic word, a proper name or a lemma lacuna. Out of the 500, 144 (28%) strings were identified as lemma lacunae – possible new lemma candidates. Among them, 92% represent lemmas that are also present on the HCS frequency list – mostly low-frequency words, like burudika ‘be appeased’, egesha ‘bring up close’, manukato ‘perfume’. Only 12 are not on the frequency list, like sarabi ‘mirage’, erevuka ‘be enlightened’, pepa ‘sway’, kauri ‘cowrie shell’.
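The coverage figures in the preceding paragraphs (e.g. 480 of the top 500 lemmas searched for at least once) reduce to a membership check of each frequency-list lemma against the set of searched strings. A minimal sketch with invented data:

```python
# Sketch of the top-N coverage check: what percentage of the first n lemmas
# on the corpus frequency list (in rank order) was ever looked up by users.
# Both the ranked list and the searched set below are invented examples.

def coverage(freq_ranked, searched, n):
    """Percentage of the first n lemmas of freq_ranked present in searched."""
    top = freq_ranked[:n]
    return 100.0 * sum(1 for lemma in top if lemma in searched) / n

freq_ranked = ["na", "ya", "kwa", "wa na", "watu", "nchi"]  # rank order
searched = {"na", "ya", "kwa", "watu", "tamu"}              # ever looked up

# "wa na" is a multiword expression never entered in this form,
# mirroring the one gap observed in the real top 100.
print(coverage(freq_ranked, searched, 4))  # 75.0
```

Run over the real log data, the same check produces the 99/100, 480/500 and 935/1,000 figures reported above.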
Based on these data, we aim to expand the dictionary further, drawing on the HCS frequency-list lemmas. The 12-million-word HCS corpus can be validated as a good source of data for the compilation of a lemma list for a small general-purpose dictionary, such as the six-thousand-entry Swahili-Polish dictionary. Among the look-ups, only a modest number of searches were for entries that are not present on the ten-thousand-item frequency list. It has to be noted, though, that this result may be related to the fact that many uses of the dictionary are presumably by Polish students of Swahili, working on assignments that themselves involve a controlled, limited vocabulary. 6. Conclusion The analysis of the log files of the Swahili-Polish dictionary has shown that dictionary users do look up frequent words of the language. The dictionary was based on a limited 12-million-word corpus of the Swahili language, and only a few look-ups fall outside the ten thousand most frequent corpus items. The majority of the new lemma candidates identified in the study are also included in the frequency list. The finding that many of the unsuccessful look-ups are due to wrongly recognised stems or to searches for full orthographic words suggests that it is the lemmatisation strategy that should be further evaluated. References Abdulla A., Halme R., Harjula L. and Pesari-Pajunen M. (eds). 2002. Swahili–Suomi–Swahili-sanakirja. Helsinki: Suomalaisen Kirjallisuuden Seura. Amharic Corpus. Created by Maria Obedkova under the guidance of Boris Orekhov. Accessed on 5 April 2017. http://web-corpora.net/AmharicCorpus. HCS – Helsinki Corpus of Swahili. 2004. Compilers: Institute for Asian and African Studies (University of Helsinki) and CSC – Scientific Computing Ltd. HCS 2.0 – Helsinki Corpus of Swahili 2.0. Accessed on 5 April 2017. http://urn.fi/urn:nbn:fi:lb-2014032624. Johnson F. 1939a. A Standard Swahili-English Dictionary (founded on Madan’s Swahili-English Dictionary).
Nairobi, Dar-es-Salaam: Oxford University Press. Johnson F. 1939b. Kamusi ya Kiswahili yaani Kitabu cha Maneno ya Kiswahili. Nairobi: Oxford University Press in association with Sheldon Press. Krapf L. 1882. A Dictionary of the Suahili Language. London: Trubner and Company Ludgate Hill. Oxford English Corpus. Oxford University Press. Accessed on 5 April 2017. https://en.oxforddictionaries.com/explore/oxford-english-corpus. (TUKI) Taasisi ya Uchunguzi wa Kiswahili. 1981. Kamusi ya Kiswahili Sanifu. Nairobi/DSM: Oxford University Press. Wójtowicz B. 2013. Słownik suahili-polski. Warszawa: Elipsa. Accessed on 5 April 2017. http://kamusi.pl/. Atkins B. T., Clear J. and Ostler N. 1992. ‘Corpus Design Criteria.’ Literary and Linguistic Computing 7.1: 1–16. Atkins B. T. and Rundell M. 2008. The Oxford Guide to Practical Lexicography. Oxford: Oxford University Press. Benson T. G. 1964. ‘A Century of Bantu Lexicography.’ African Language Studies 5: 64–91. Bergenholtz H. and Johnsen M. 2005. ‘Log Files as a Tool for Improving Internet Dictionaries.’ Hermes, Journal of Linguistics 34: 117–141. Bergenholtz H. and Johnsen M. 2007. ‘Log Files Can and Should Be Prepared for a Functionalistic Approach.’ Lexikos 17: 1–20. Chuwa A. 1996. ‘Problems in Swahili Lexicography.’ Lexikos 6: 323–329. Chuwa A. 1999. ‘Umuhimu wa Kamusi katika Ufundishaji wa Kiswahili’ In Tumbo-Masabo Z. Z. and Chiduo E. K. F. (eds), Kiswahili katika elimu. Dar-es-Salaam: TUKI, 125–136. De Pauw G. and de Schryver G.-M. 2008. ‘Improving the Computational Morphological Analysis of a Swahili Corpus for Lexicographic Purposes.’ Lexikos 18: 303–318. De Pauw G., de Schryver G.-M. and Wagacha P. W. 2009. ‘A Corpus-based Survey of Four Electronic Swahili–English Bilingual Dictionaries.’ Lexikos 19: 340–352. De Schryver G.-M. 2002.
‘Web for/as Corpus: A Perspective for the African Languages.’ Nordic Journal of African Studies 11.2: 266–282. De Schryver G.-M. and Joffe D. 2004. ‘On How Electronic Dictionaries are Really Used’ In Williams G. and Vessier S. (eds), Proceedings of the 11th EURALEX International Congress. France: Université de Bretagne Sud, 187–196. De Schryver G.-M., Joffe D., Joffe P. and Hillewaert S. 2006. ‘Do Dictionary Users Really Look Up Frequent Words? – On the Overestimation of the Value of Corpus-based Lexicography.’ Lexikos 16: 67–83. De Schryver G.-M. and Prinsloo D. J. 2000. ‘The Compilation of Electronic Corpora, with Special Reference to the African Languages.’ Southern African Linguistics and Applied Language Studies 18: 89–106. De Schryver G.-M. and Prinsloo D. J. 2001. ‘Towards a Sound Lemmatisation Strategy for the Bantu Verb through the Use of Frequency-based Tail Slots – with Special Reference to Cilubà, Sepedi and Kiswahili’ In Mdee J. S. and Mwansoko H. J. M. (eds), Makala ya kongamano la kimataifa Kiswahili 2000. Proceedings. Dar es Salaam: TUKI, Chuo Kikuu cha Dar es Salaam, 216–242. Fuertes-Olivera P. A. 2012. ‘Lexicography and the Internet as a (Re-)source.’ Lexicographica 28: 49–70. Gatto M. 2011. ‘The ‘Body’ and the ‘Web’: The Web as Corpus Ten Years On.’ ICAME Journal 35: 35–58. Ghani R., Jones R. and Mladenic D. 2001. ‘Mining the Web to Create Minority Language Corpora’ In Paques H., Liu L. and Grossmann D. (eds), Proceedings of the 10th International Conference on Information and Knowledge Management, Atlanta, GA, USA, 5–10 November 2001. New York: ACM, 279–286. Gouws R. H. and Prinsloo D. J. 2005. Principles and Practices of South African Lexicography. Stellenbosch: African Sun Media. Hanks P. 2002. ‘Mapping Meaning onto Use’ In Corréard M.-H. (ed.), Lexicography and Natural Language Processing: A Festschrift in Honour of B.T.S. Atkins. UK: Euralex, 156–198.
Herms I. 1995. ‘Swahili – Lexikographie: Eine Kritische Bilanz.’ Afrikanistische Arbeitspapiere 42: 192–196. Hurskainen A. 1992a. ‘A Two-Level Computer Formalism for the Analysis of Bantu Morphology. An Application to Swahili.’ Nordic Journal of African Studies 1.1: 87–122. Hurskainen A. 1992b. ‘Computer Archives of Swahili Language and Folklore – What is it?’ Nordic Journal of African Studies 1.1: 123–127. Hurskainen A. 1994. ‘Kamusi ya Kiswahili Sanifu in Test: A Computer System for Analyzing Dictionaries and for Retrieving Lexical Data.’ Afrikanistische Arbeitspapiere 37 (Swahili Forum I): 169–179. Hurskainen A. 1999. ‘SALAMA: Swahili Language Manager.’ Nordic Journal of African Studies 8.2: 139–157. Hurskainen A. 2002. ‘Tathmini ya Kamusi Tano za Kiswahili.’ Nordic Journal of African Studies 11.2: 283–301. Hurskainen A. 2003. ‘New Advances in Corpus-based Lexicography.’ Lexikos 13: 111–132. Hurskainen A. 2008. ‘SALAMA Dictionary Compiler – a Method for Corpus-Based Dictionary Compilation.’ Technical Reports in Language Technology, Report No 2. Accessed on 5 April 2017. http://www.njas.helsinki.fi/salama/salama-dictionary-compiler.pdf. Jama Musse Jama. 2016. ‘Somali Corpus: State of the Art, and Tools for Linguistic Analysis.’ Accessed on 5 April 2017. http://www.somalicorpus.com/documents/JamaMusse-J-Somali-Corpus-StateOfTheArt.pdf. Kiango J. G. (ed.) 1995. Dhima ya Kamusi katika Kusanifisha Lugha. Dar es Salaam: TUKI/UDSM. Kiango J. G. 2000. Bantu Lexicography: a Critical Survey of the Principles and Process of Constructing Dictionary Entries. Tokyo: Tokyo University of Foreign Studies. Kilgarriff A. 2012. ‘Getting to Know your Corpus’ In Sojka P., Horak A., Kopecek I. and Pala K.
(eds), Text Speech Dialogue. 15th International Conference, TSD 2012, Brno, Czech Republic, September 3–7, 2012, Proceedings. Springer, Lecture Notes in Computer Science, 3–15 . Kilgarriff A. 2013 . ‘ Using Corpora as Data Sources for Dictionaries ’ In Jackson H. (ed.), The Bloomsbury Companion to Lexicography . London : Bloomsbury , 77 – 96 . Kilgarriff A. and Grefenstette G. . 2003 . ‘ Introduction to the Special Issue on the Web as Corpus .’ Computational Linguistics 29 . 3 : 333 – 347 . Google Scholar Crossref Search ADS Kilgarriff A. , Pomikalek J. , Jakubíček M. , and Whitelock P. . 2012 . ‘ Setting up for Corpus Lexicography ’ In Fjeld R. V. and Torjusen J. M. (eds), Proceedings of the 15th EURALEX International Congress. 7–11 August 2012. Oslo: Department of Linguistics and Scandinavian Studies, University of Oslo, 778–785 . Koplenig A. , Meyer P. , and Müller-Spitzer C. . 2014 . ‘ Dictionary Users do Look up Frequent Words. A Log File Analysis ’ In Müller-Spitzer C. (ed.), Using Online Dictionaries . Berlin : Walter de Gruyter (Lexicographica Series Maior 145) , 229 – 249 . Lew R. 2011a . ‘ User Studies: Opportunities and Limitations ’ In Akasu K. and Satoru U. (eds), ASIALEX2011 Proceedings Lexicography: Theoretical and practical perspectives. Kyoto: Asian Association for Lexicography, 7–16 . Lew R. 2011b . ‘ Studies in Dictionary Use: Recent Developments .’ International Journal of Lexicography 24 . 1 : 1 – 4 . Google Scholar Crossref Search ADS Lew R. and de Schryver G.-M. . 2014 . ‘ Dictionary Users in the Digital Revolution .’ International Journal of Lexicography 27 . 4 : 341 – 359 . Google Scholar Crossref Search ADS Mbaabu I. 1995 . ‘ Dhima ya Kamusi katika Kufundisha na Kujifunza Kiswahili ’ In Kiango J. G. (ed.), Dhima ya Kamusi katika Kusanifisha Lugha . Dar es Salaam : TUKI / UDSM , 47 – 59 . Mdee J. S. 1984 . ‘ Constructing an Entry in a Learner’s Dictionary of Standard Kiswahili ’ In Hartmann R. R. K. (ed.), Lexeter ’83 Proceedings. 
Papers from the International Conference on Lexicography at Exeter, 9–12 September 1983. Tübingen: Max Niemeyer Verlag, 237–241 . Mdee J. S. 1999 . ‘ Dictionaries and the Standardization of Spelling in Swahili .’ Lexikos 9 : 119 – 134 . Muaka L. and Muaka A. . 2006 . Tusome Kiswahili . Let’s Read Swahili : Intermediate Level . Madison Wisconsin : NALRC Press . McGrath D. and Marten L. . 2003 . Colloquial Swahili: The Complete Course for Beginners . London : Routledge . Müller-Spitzer C. , Wolfer S. and Koplenig A. . 2015 . ‘ Observing Online Dictionary Users: Studies Using Wiktionary Log Files .’ International Journal of Lexicography 28 ( 1 ): 1 – 26 . Google Scholar Crossref Search ADS Ohly R. 2002 . ‘ Globalization of Language and Terminology in African Context .’ Africana Bulletin 50 : 135 – 157 . Prinsloo D. J. 1991 . ‘ Towards Computer-Assisted Word Frequency Studies in Northern Sotho .’ South African Journal of African Languages 11 ( 2 ): 54 – 60 . Prinsloo D. J. 2015 . ‘ Corpus-based Lexicography for Lesser-resourced Languages – Maximizing the Limited Corpus .’ Lexikos 25 : 285 – 300 . Google Scholar Crossref Search ADS Prinsloo D. J. and de Schryver G.-M. . 2001 . ‘ Taking Dictionaries for Bantu Languages into the New Millennium – with special reference to Kiswahili, Sepedi and isiZulu ’ In Mdee J. S. and Mwansoko H. J. M. (eds), Makala ya kongamano la kimataifa Kiswahili 2000. Proceedings . Dar es Salaam : TUKI , Chuo Kikuu cha Dar es Salaam, 188 – 215 . Scannell K. 2007 . ‘ The Crúbadán Project: Corpus Building for Under-resourced Languages ’ In Fairon C. , Naets H. , Kilgarriff A. and de Schryver G-M. (eds), Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop. Louvainla-Neuve, Belgium, 5–15. Accessed on 5 April 2017. https://borel.slu.edu/pub/wac3.pdf . Tarp S. and Fuertes-Olivera P. A. . 2016 . 
‘ Advantages and Disadvantages in the Use of Internet as a Corpus: The Case of the Online Dictionaries of Spanish Valladolid-UVa .’ Lexikos 26 : 273 – 295 . Google Scholar Crossref Search ADS Trap-Jensen L. , Lorentzen H. , and Sørensen N. H. . 2014 . ‘ An Odd Couple - Corpus Frequency and Look-Up Frequency: What Relationship? ’ Slovenščina 2.0, 2: 94–113. Accessed on 5 April 2017. https://dsl.dk/medarbejdere/medarbejdere-publikationer-m-m/ltj/an-odd-couple . Verlinde S. and Binon J. . 2010 . ‘ Monitoring Dictionary Use in the Electronic Age ’ In Dykstra A. and Schoonheim T. (eds), Proceedings of the XIV Euralex International Congress. 6–10 July 2010. Leeuwarden/Ljouwert: Fryske Akademy – Afûk, 1144–1151 . Wójtowicz B. 2016 . ‘ Learner Features in a New Corpus-based Swahili Dictionary .’ Lexikos 26 : 402 – 415 . Footnotes 1The collecting of data on resources for African Languages was undertaken by http://aflat.org/ Unfortunately, the site seems to no longer be updated [accessed 21.02.2017]. © 2017 Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal: International Journal of Lexicography, Oxford University Press. Published: Sep 1, 2018.
