Darja Fišer and Nikola Ljubešić

Abstract
This paper gives an overview of distributional modelling of word meaning for contemporary lexicography. We also apply it in a case study on automatic semantic shift detection in Slovene tweets. We use word embeddings to compare the semantic behaviour of frequent words from a reference corpus of Slovene with their behaviour on Twitter. Words with the highest model distance between the corpora are considered as semantic shift candidates. They are manually analysed and classified in order to evaluate the proposed approach as well as to gain a better qualitative understanding of the problem. Apart from the noise due to pre-processing errors (45%), the approach yields many valuable candidates, especially the novel senses occurring due to daily events and the ones produced in informal communication settings.

1. Introduction
Meanings of words are not fixed but undergo changes, either due to the advent of new word senses or the established ones taking new shades of meaning or becoming obsolete (Mitra et al. 2015). Word meanings can expand and become more generalized, narrow down to include fewer referents, or shift to include a new set of referents (Sagi et al. 2009). A classic example of meaning expansion is the noun ‘miška’ (Eng. ‘mouse’), which used to refer to the small rodent but is now also used to describe the computer pointing device. The reverse process occurred with the noun ‘faks’ (Eng. ‘fax’), which used to mean both the machine for telephonic transmission of printed documents and a higher education institution, only the latter of which remains in use in contemporary colloquial Slovene. Also frequent are amelioration and pejoration (Cook and Stevenson 2010): processes through which words acquire new positive or negative connotations. Amelioration, which is especially frequent in slang, can be observed in the use of the adverb ‘hudo’ (Eng. ‘terrific’), which has a strong negative connotation in standard Slovene but has acquired a distinctly positive one in colloquial Slovene. Pejoration, a semantic shift in the opposite direction, can be observed in the use of the noun ‘blondinka’ (Eng. ‘blonde’), which is neutral in standard Slovene but used distinctly pejoratively in informal settings.

In this paper, which builds upon our pilot study (Fišer and Ljubešić 2016), we give an overview of distributional approaches to modelling word meaning and present the results of a case study on identifying semantic shifts via distributional models in Slovene tweets. While the method we propose is basic and as such unable to detect the type of semantic shift, we show that it is robust and could easily be integrated in a lexicographer’s workbench to facilitate dictionary updates.

2. Distributional modelling of meaning
The thriving area of Distributional Modelling of Meaning of Words (Lenci 2008) is based on the well-known distributional hypothesis popularized by Firth’s (1957) quotation: ‘You shall know a word by the company it keeps’. The distributional approach to modelling word meaning uses all the contexts of a word’s occurrences in a large corpus and encodes them in some way.
There are three main uses of such models: analysing typical co-occurrence patterns (e.g., a ‘groom’ can be ‘handsome’, while a ‘view’ is rather ‘beautiful’); measuring the semantic similarity of word pairs (e.g., ‘dog’ and ‘cat’ are much more similar than ‘dog’ and ‘car’); and using word meaning representations in various downstream prediction tasks, from part-of-speech tagging via machine translation to image captioning.

2.1. The distributional vector space model
In computational linguistics, the most popular representation of words is the Vector Space Model (Salton et al. 1975). In modelling word meaning with the distributional hypothesis, the phenomena of interest are words, while the dimensions of our vector space are the words occurring in the proximity of the words we model, called the context window, which has a specific size. For example, a window of size 2 means that for each occurrence of a word we want to model, we take into account two words occurring to the left and two to the right of that word. To distinguish between the word whose meaning we want to model and the words occurring in a window of that word, we call the former the focus word and the latter its features.

If we encode only whether a word has occurred at any point in the window around the focus word, given the set of sentences S={‘A furry cat runs from a big dog’, ‘The cute dog runs after the red car’, ‘A speeding car hit a cute cat’}, the set of words we want to model V={‘car’, ‘cat’, ‘dog’}, a window size of 1, and the set of dimensions representing our vector space F={‘big’, ‘cute’, ‘furry’, ‘hit’, ‘red’, ‘runs’, ‘speeding’} (we define as our features only words occurring one position to either side of the words we want to model), the vector representation of our three words would be as in Table 1.

Table 1. Example of word vector representations. Each row is a vector representation of a word.

      big  cute  furry  hit  red  runs  speeding
car    0    0     0      1    1    0      1
cat    0    1     1      0    0    1      0
dog    1    1     0      0    0    1      0

By inspecting the content of those three vectors, we can easily observe that the vectors for ‘cat’ and ‘dog’ are more similar (identical values in 5 out of 7 dimensions) than the vector of ‘car’ and either ‘cat’ or ‘dog’ (both having a single identical value to ‘car’). While our toy corpus is contrived, it shows the overall principle, and one can imagine that, once such a co-occurrence matrix of words in a corpus is constructed, words with similar contexts, i.e., meanings, will have similar vector representations.

In the remainder of this section, we look more closely at two basic types of distributional models: counting-based models, which are based on counting the frequency of occurrences of features around a focus word; and prediction-based models, which are based on learning parameters that optimize the task of predicting the focus word given the features (or vice versa).

2.2. Counting-based models
Counting-based models count the number of co-occurrences of specific features and focus words, encoding these frequencies, or their derivatives, in the corresponding dimensions of the vector representation.
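To make the counting procedure concrete before turning to its refinements, the following minimal Python sketch (our own illustration; the helper names are not from any published implementation) reproduces the binary vectors of Table 1 from the toy corpus and compares them with cosine similarity.

```python
from itertools import combinations
import math

sentences = [
    "a furry cat runs from a big dog",
    "the cute dog runs after the red car",
    "a speeding car hit a cute cat",
]
focus_words = ["car", "cat", "dog"]
features = ["big", "cute", "furry", "hit", "red", "runs", "speeding"]

def cooccurrence_vectors(sentences, focus_words, features, window=1):
    """Binary co-occurrence vectors: dimension f is 1 if feature f ever
    occurs within `window` positions of the focus word, 0 otherwise."""
    vectors = {w: [0] * len(features) for w in focus_words}
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token not in vectors:
                continue
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i and tokens[j] in features:
                    vectors[token][features.index(tokens[j])] = 1
    return vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

vecs = cooccurrence_vectors(sentences, focus_words, features)
for w1, w2 in combinations(focus_words, 2):
    print(w1, w2, round(cosine(vecs[w1], vecs[w2]), 2))
```

On the toy corpus this prints a cosine similarity of 0.67 for ‘cat’–‘dog’ and 0.0 for both pairs involving ‘car’, mirroring the observation made above.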
2.2.1. Feature weighting
Given the Zipfian distribution of words (Zipf 1935), i.e. the fact that some words occur very frequently while most words occur infrequently, it becomes clear that encoding simple frequencies in vector representations of focus words is probably not the best approach, as the dominant dimensions in each word representation will be function words. Intuitively, we can assume that the meaning of a word is better encoded through co-occurring lexical words (e.g., ‘runs’ for ‘dog’ and ‘cat’) rather than function words such as ‘the’, especially in a simple model (as in Table 1) in which word order does not play a role, just word proximity. There are many methods for weighting features other than just counting the number of times they co-occur with the focus word, the most prominent being: PMI (Pointwise Mutual Information; Church and Hanks 1990); TF-IDF (Term Frequency - Inverse Document Frequency; Sparck Jones 1972); and Dice (1945). We will not go into the technical details of the weighting techniques here, but will try to outline their basic principle: all these weights encode how informative a feature is for a given focus word. If a feature co-occurs with most focus words (like the feature ‘the’), it is less informative than a feature that co-occurs with just some focus words (like the feature ‘runs’).

2.2.2. Feature definition
Features other than the co-occurring word forms can be used in the same way. For example, in morphologically complex languages, in which lexemes occur in many different forms, it could be useful for features not to be surface forms (e.g., ‘runs’), but lemmas of co-occurring words (e.g., ‘run’). Furthermore, we may want to encode additional information, such as the order in which words co-occur. Encoding that the feature ‘eats’ occurs regularly before the focus word ‘bananas’ includes the useful information that bananas do not eat, but are eaten. We can encode this information by defining features as pairs of a co-occurring word and its relative position in the window. For example, in the sentence ‘The monkey eats bananas’, for the focus word ‘bananas’ and a window size of 2, we would encode not just the features ‘monkey’ and ‘eats’, but also the more complex positional features (‘monkey’, -2) and (‘eats’, -1). Such an approach introduces many more features and thereby increases the number of dimensions of the vector space significantly. As a consequence, it simultaneously increases its sparsity, i.e. most of the values in the vector space being zeroes. This problem will be addressed in subsection 2.2.4. Finally, if we are interested in syntactic rather than semantic similarity of words, we may not wish to encode words as features at all but, for instance, their parts-of-speech together with their relative position in the window. In the previous example, assuming that the part-of-speech of ‘monkey’ is ‘NN’ and that of ‘eats’ is ‘VB’, the two features of ‘bananas’ would be (‘NN’, -2) and (‘VB’, -1). Calculating the most similar representations to a word of interest in this way would not produce words similar in meaning, but rather in their syntactic roles in the sentence. A sketch combining feature weighting and positional features is given below.
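The following small sketch combines the two refinements just described: positional features from a window of 2, weighted with positive PMI. The function names and the second toy sentence are our own additions for illustration; this is not code from any of the cited works.

```python
import math
from collections import Counter

corpus = [
    "the monkey eats bananas".split(),
    "the gorilla eats bananas".split(),  # hypothetical extra sentence, added so counts differ
]

def positional_features(tokens, i, window=2):
    """Features of the focus word tokens[i]: (word, relative position) pairs."""
    return [(tokens[i + o], o)
            for o in range(-window, window + 1)
            if o != 0 and 0 <= i + o < len(tokens)]

# Count (focus word, feature) co-occurrence events over the corpus.
pair_counts, word_counts, feat_counts = Counter(), Counter(), Counter()
for tokens in corpus:
    for i, word in enumerate(tokens):
        for feat in positional_features(tokens, i):
            pair_counts[(word, feat)] += 1
            word_counts[word] += 1
            feat_counts[feat] += 1
total = sum(pair_counts.values())

def ppmi(word, feat):
    """Positive PMI: max(0, log2(P(word, feat) / (P(word) * P(feat))))."""
    if pair_counts[(word, feat)] == 0:
        return 0.0
    p_pair = pair_counts[(word, feat)] / total
    p_word = word_counts[word] / total
    p_feat = feat_counts[feat] / total
    return max(0.0, math.log2(p_pair / (p_word * p_feat)))

print(round(ppmi("bananas", ("eats", -1)), 2))  # informative lexical feature
print(round(ppmi("eats", ("the", -2)), 2))      # function-word feature scores lower
```

On this toy input the lexical feature (‘eats’, -1) receives a PPMI of about 2.32 for ‘bananas’, while the function-word feature (‘the’, -2) receives only about 1.74 for ‘eats’, illustrating how weighting demotes uninformative features.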
2.2.3. Window size
The size of the window that we take into account when building the representation of a word has a major impact on the final content of that representation. Using smaller window sizes (most usually between 2 and 5) encodes the specific semantics of the word, while using larger window sizes, like the whole context of the document, encodes topic or domain information (Levy and Goldberg 2014a). Calculating the most similar focus words represented through a narrower window would, for the focus word ‘cat’, probably return ‘dog’ and ‘mouse’, while for wider windows those focus words would include items like ‘sleep’ and ‘jump’.

2.2.4. Dimensionality reduction
Given that distributional models are built from large corpora, which contain many different words, counting-based models will have a very high dimensionality. There are various problems associated with high-dimensional spaces of word meaning representations, one of them being space-inefficiency, i.e., most of the features of a vector representation do not contain any information. Another issue of sparse matrices is the ‘curse of dimensionality’, i.e. the effect that distance metrics are less reliable in high-dimensional vector spaces (Aggarwal et al. 2001). The traditional way of dealing with the large number of dimensions is to perform some sort of Dimensionality Reduction (Fodor 2002). Dimensionality reduction techniques search for a representation of the original vector space in another space with a significantly lower number of features, thereby preserving most of the information from the original vector space. While distributional word representations prior to dimensionality reduction can consist of millions of features (dimensions), researchers use these procedures to shrink them to a few hundred. The most widely used dimensionality reduction techniques are Random Indexing (Kanerva et al. 2000), Singular Value Decomposition (Golub and Reinsch 1970) and Principal Component Analysis (Pearson 1901).

The biggest drawback of these techniques is that the vector space loses its transparency. As long as features correspond to specific words, we can contrast two very similar vector representations like ‘cat’ and ‘dog’ and identify the dimensions on which they differ most, like ‘meows’ and ‘barks’. On the other hand, projecting the original feature space into a lower-dimensional space enables the identification of new dimensions which encode information not previously encoded in any specific dimension. For instance, if we extend our two illustrative focus words ‘cat’ and ‘dog’ with ‘tiger’ and ‘wolf’, dimensionality reduction could result in one dimension in which ‘cat’ and ‘tiger’ on one side, and ‘dog’ and ‘wolf’ on the other, are closer together (the ‘species’ dimension), and another dimension in which ‘cat’ and ‘dog’ on one side, and ‘tiger’ and ‘wolf’ on the other, are closer together (the ‘pet vs. predator’ dimension; Agres et al. 2015).
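A minimal sketch of such a reduction, using NumPy’s singular value decomposition on a small invented count matrix (the counts and words are our own toy data, not from the cited works):

```python
import numpy as np

# Toy co-occurrence count matrix: rows are focus words, columns are
# the context features 'meows', 'barks', 'purrs' and 'howls'.
words = ["cat", "dog", "tiger", "wolf"]
counts = np.array([
    [8.0, 1.0, 5.0, 0.0],  # cat
    [1.0, 9.0, 0.0, 6.0],  # dog
    [7.0, 0.0, 4.0, 1.0],  # tiger
    [0.0, 8.0, 1.0, 5.0],  # wolf
])

# Truncated SVD: keep only the k strongest singular directions.
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
reduced = U[:, :k] * S[:k]  # each row is now a dense k-dimensional word vector

for word, vec in zip(words, reduced):
    print(word, np.round(vec, 2))
```

The two new dimensions no longer correspond to individual context words; rows with similar count profiles (here ‘cat’/‘tiger’ and ‘dog’/‘wolf’) end up close together in the reduced space.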
2.3. Prediction-based models
While the basic idea behind counting-based models is to simply count the number of co-occurrences of a focus word and various features in the context window, prediction-based models, also called Word Embeddings (Mikolov et al. 2013a), are more technically challenging. Those models are based on supervised machine learning techniques which build predictive models from existing annotated data (e.g., corpora manually annotated with part-of-speech information) and then perform predictions on unseen data (e.g., corpora to be annotated). An increasingly popular supervised learning technique is Neural Networks (McCulloch and Pitts 1943, Schmidhuber 2015), which are used for constructing prediction-based distributional models. While for most supervised tasks obtaining annotated data requires costly and time-consuming manual annotation campaigns, the beauty of using supervised learning for building prediction-based distributional models of words is that plain text is converted into a series of prediction tasks.

2.3.1. Basic approaches
The two basic approaches to building prediction-based models, Continuous Bag of Words (CBOW) and skip-gram (Mikolov et al. 2013a), rely on a trick for presenting a large collection of texts as prediction tasks. In the CBOW model, the task is to predict the focus word from the available context features, while in the skip-gram model the task is inverted: given the focus word, predict the context features. Returning to our simple example, ‘The monkey eats bananas’, in the CBOW model the task is to learn to predict that, if the context features are ‘monkey’ and ‘eats’, the focus word is ‘bananas’ (or possibly ‘fruit’), while in the skip-gram model we want to learn to predict the context features ‘eats’ and ‘monkey’ given the focus word ‘bananas’. During model training, a specified set of parameters per focus word is estimated. Given the features, these parameters are meant to predict the focus word (or vice versa) and, intuitively, for words that have similar contexts, these parameters will be similar. Once the training has been completed, those learned parameters per focus word are the distributional representation of each word. While these procedures are quite different from the counting-based ones, Levy and Goldberg (2014b) have shown that the results of a specific skip-gram model are very similar to those of a counting-based model with dimensionality reduction. Various properties of count-based and prediction-based word representations can also be exploited in a multilingual setting, as methods for calculating multilingual word embeddings, i.e., representations of words of various languages in a single vector space, are becoming increasingly efficient (Ammar et al. 2016, Smith et al. 2017).

2.3.2. Analogical properties
The learned word representations have some interesting properties, the best known being analogy: if we take the vector representation of ‘king’, subtract from it the vector for ‘man’ and add the vector for ‘woman’, the vector closest to the result will be that for ‘queen’. Similarly, if we take the vector of ‘Paris’, subtract ‘France’ and add ‘England’, we will obtain a vector very close to the one for ‘London’ (Mikolov et al. 2013a). The analogy feature of the distributional representations does not cover only the semantic cases above, but can be used for identifying words with other linguistic properties, like the adjectival base-superlative relation (‘best’-‘good’+‘ugly’=‘ugliest’) or the verbal past-3rd person singular present relation (‘saw’-‘sees’+‘returns’=‘returned’) (Mikolov et al. 2013b).
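In practice such models are rarely implemented from scratch. The sketch below shows how a skip-gram model can be trained and queried with the gensim library (we assume the gensim 4.x API; the two toy sentences only illustrate the input format, and meaningful analogy results such as those cited above require training on corpora of many millions of tokens).

```python
from gensim.models import Word2Vec

# A real application would stream tokenized sentences from a large corpus.
sentences = [
    ["the", "monkey", "eats", "bananas"],
    ["the", "gorilla", "eats", "fruit"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the learned vectors
    window=5,         # context window size
    min_count=1,      # keep even hapaxes in this toy setting
    sg=1,             # 1 = skip-gram, 0 = CBOW
)

# Nearest neighbours of a focus word in the learned space.
print(model.wv.most_similar("monkey", topn=3))

# The classic analogy query, king - man + woman, would be posed as follows
# on a model trained on a sufficiently large corpus:
# print(model.wv.most_similar(positive=["king", "woman"], negative=["man"]))
```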
3. Distributional modelling for semantic shift detection
In this section we present a case study in which we employed the distributional modelling approach on an emerging lexicographic task: automated semantic shift detection. The case study is performed on Slovene as a proof of concept, but the approach is language-independent and can easily be extended to other languages. We focus on Twitter data which, due to its increasing popularity and heterogeneous use(r)s, is a very attractive source of novel word usage, typically not covered by the authoritative lexical and language resources. We first give an overview of related work. We then describe the resources and the method used, after which we perform a manual analysis and discuss the results obtained with the approach.

3.1. Related work
While automatic discovery of word senses has been studied extensively (Sparck Jones 1986, Ide and Véronis 1998, Schütze 1998, Navigli 2009), changes in the range of meanings expressed by a word have received much less attention, despite the fact that detecting them is an important challenge in lexicography, where it is needed to keep the description of dictionary entries up to date. Apart from lexicography, up-to-date semantic inventories are also required for a wide range of human-language technologies, such as question answering and machine translation. As more and more diachronic, genre- and domain-specific corpora become available, automatic semantic shift detection is becoming an increasingly attainable goal.

Most work in semantic shift detection focuses on diachronic changes in word usage and meaning by utilizing large historical corpora spanning several decades or even centuries (Mitra et al. 2015, Tahmasebi et al. 2011, Hamilton et al. 2016). Another popular focus is the comparison of word senses in two or more corpora containing texts from different time points or genres. Cook et al. (2013), for example, induce word senses and identify novel senses by comparing a new ‘focus corpus’ with a ‘reference corpus’, using topic modelling for word sense induction. Simpler and potentially more robust approaches do not need to discriminate between specific senses but measure the contextual difference of a lexeme in two or more corpora. Gulordava and Baroni (2011), for example, detect semantic change based on distributional similarity between word vectors built from two different corpora.

In the case study presented in this paper, we employ distributional modelling to identify novel senses in Slovene Twitterese, an approach similar to that of Gulordava and Baroni (2011). Compared to diachronic approaches, we work with substantially less data and do not take advantage of any metadata available in the corpus. However, we propose a more fine-grained and informative typology of semantic shifts and conduct a better-documented and more comprehensive linguistic analysis of semantic shift candidates (cf. Section 3.4).

3.2. Corpora
In order to detect novel senses in non-standard Internet Slovene, we compared the usage of selected focus words in tweets from the Janes corpus (Fišer et al. 2016) with standard written Slovene as found in the reference corpus Gigafida (Logar et al. 2012).

3.2.1. The Janes corpus
The Janes corpus 1.0 (Fišer et al. 2016) is the first extensive and publicly available corpus of Slovene computer-mediated communication (CMC), containing five different text types of public user-generated content of varying lengths and communicative purposes: tweets, forum posts, user comments from on-line news portals, talk and user pages from Wikipedia, and blog posts along with user comments on these blogs. The corpus contains around 9 million texts, comprising roughly 200 million tokens. In this case study we only use the subcorpus Janes-Tweet v0.4, comprising 107 million tokens (henceforth: the Twitter corpus). The novel step in the annotation chain is the normalisation of non-standard word tokens to their standard form (Ljubešić et al. 2016), which was performed with character-level statistical machine translation in order to translate non-standard variants (e.g., ‘jest’, ‘jst’, ‘jas’, ‘js’) to their standard equivalent (‘jaz’, Eng. ‘I’).
This achieves two goals: first, it becomes possible to search for a word without having to consider or be aware of all its spelling variants, and second, standard language processing tools can be used for part-of-speech tagging and lemmatisation (Ljubešić and Erjavec 2016).

3.2.2. The Gigafida corpus
The Gigafida corpus is a 1.2-billion-word reference corpus of Slovene and one of the largest and most extensive text collections of the language (Logar et al. 2012). It contains texts of various types and genres, such as newspaper articles, magazines, literary texts, textbooks, non-fiction literature and web pages of major public institutions and private companies as well as educational, research and cultural institutions, published between 1995 and 2011. Each document in the corpus is equipped with the following metadata: source of the text, year of publication, type of text, title, and author. The corpus was split into paragraphs and sentences, tokenized, part-of-speech tagged and lemmatised with the Obeliks tagger (Grčar et al. 2012). Since the tagging and lemmatisation of each corpus used in the case study was performed with a different tool, this introduces some noise into the results (see Section 3.4).

3.3. Method
We build two distributional models for each focus word (lemma), one representing the focus word in standard language (from the Gigafida reference corpus) and the other in non-standard language (from the Twitter corpus). Given that the two representations have to be comparable, i.e. lie in the same vector space, the representation learning has to be performed on both corpora in a single process. This requires encoding whether an occurrence of a focus word comes from the standard or the non-standard dataset in the form of a prefix to the headword itself (e.g., ‘s_miška#Nc’ for an occurrence of the common noun ‘mouse’ in the standard data and ‘n_miška#Nc’ for its occurrence in the non-standard data). Context features, on the other hand, cannot have the corpus information encoded, as this is the information that is supposed to be shared between the two corpora. Because of this, the representations cannot be learned from running text as usual: since we cannot feed a tool a sequence of words but have to feed it pairs of focus words and features, we used the only tool we know of that is suitable for this type of input data, word2vecf (Levy and Goldberg 2014a). As context features we use surface forms, thereby avoiding the significant noise introduced during the tagging and lemmatisation of non-standard texts. The features are taken from a punctuation-free window of two words to each side of the headword. The relative position of each feature to the headword is not encoded.

By following the described method, we produced vector representations of 200 dimensions for each of the 5,425 intersecting lemmas passing the frequency threshold of 500 in each of the two corpora. More vocabulary could easily be covered by lowering the rather strict frequency threshold of 500, but our primary goal in this case study was general vocabulary that is very commonly used across genres. We calculate the semantic shift of a word as the cosine similarity, transformed into a distance, between the representations of the word built from the standard and the non-standard data.
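The following sketch illustrates the core of the procedure as described above: corpus-prefixed focus words paired with unprefixed context features, in the word-context pair format consumed by word2vecf, followed by the cosine-distance ranking of shift candidates. All helper names are ours, and the snippet is a simplified illustration rather than the exact code used in the study.

```python
import numpy as np

def training_pairs(tokens, corpus_prefix, window=2):
    """Yield (prefixed focus word, context feature) pairs, e.g.
    ('s_miška#Nc', 'hitra'); features deliberately carry no prefix,
    so that they are shared between the two corpora."""
    for i, focus in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield corpus_prefix + focus, tokens[j]

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def shift_candidates(vectors, lemmas):
    """Rank shared lemmas by the distance between their standard ('s_')
    and non-standard ('n_') representations, largest distance first."""
    scored = [(lemma, cosine_distance(vectors["s_" + lemma], vectors["n_" + lemma]))
              for lemma in lemmas]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

After training, `vectors` would map each prefixed headword to its 200-dimensional embedding, and the top of the ranked list is what is handed over for the manual analysis described in Section 3.4.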
Our hypothesis is that the representations of words in which no semantic shift has occurred will be much closer to each other than the representations of words which are used in novel ways. The presented method is rather simple and ignores the fact that most words in each of the corpora are used in multiple meanings. The alternative to our approach would be to use word sense disambiguation, an automated procedure of discriminating between the various senses of a word. However, word sense disambiguation is more suitable for coarse-grained and frequent senses. We believe that our simplistic approach can be a very useful tool for the lexicographer, as it can point to lexemes that are either used in different senses or with a different frequency distribution of senses in the two corpora, both of which are relevant for the process of describing word usage.

3.4. Analysis
We performed a manual linguistic analysis of the 200 top-ranking lemmas from the reference and Twitter corpora that, according to the method, display the largest differences in their contexts. The comparative analysis was performed by comparing Word Sketches of the same lemma in both corpora in the Sketch Engine tool (Kilgarriff et al. 2014), using the Sketch Grammar for Slovene developed by Krek and Kilgarriff (2006). Word Sketches are one-page summaries of the grammatical and collocational behaviour of a focus word. They show the focus word’s collocates categorised by grammatical relations, such as words that serve as the object of the verb, words that serve as the subject of the verb, words that modify the word, etc., as shown in Figure 1. While Word Sketches significantly speed up and simplify the analysis, other similar tools, or even simple concordances, could be used for the analysis as well.

Figure 1. Word Sketch for ‘politik’ (Eng. ‘politician’) in the Twitter corpus. The labels at the top of each column are names of grammatical relations, e.g. S_kakšen? (Eng. ‘Premodified by’). The phrases in grey show examples of how the word combines with its collocates, e.g. ‘najbolj priljubljen politik’ (Eng. ‘most popular politician’). The collocations in bold offer further word sketches for multi-word phrases, e.g. ‘pravnomočno obsojen politik’ (Eng. ‘legally convicted politician’). Frequency counts for each collocation contain hyperlinks to their concordances.

Based on the comparison of the Word Sketches in the Twitter and the Gigafida corpus, we perform an analysis of semantic shifts as illustrated in Table 2. We build Word Sketches for the selected focus word from both corpora, using the following (default) settings: minimum frequency threshold: 5; maximum number of items per grammatical relation: 25; association score to sort the collocations: logDice (Rychlý 2008).
Table 2. Top five collocates for the three most productive Word Sketches for the word ‘pirat’ (Eng. ‘pirate’) in the Twitter and the Gigafida corpus. Asterisked collocates indicate a novel sense in the Twitter corpus not attested in the Gigafida corpus. Each cell: collocation (English translation), frequency / logDice.

Twitter focus word: pirat (Eng. ‘pirate’), frequency: 1,034 (9.65 per million)
Gigafida focus word: pirat (Eng. ‘pirate’), frequency: 9,941 (7.05 per million)

Relation: modifiers of FW
  Twitter:  somalski (Somalian) 7/10.48; somalijski (Somalian) 5/9.61; islandski* (Icelandic) 6/8.80; vesoljski (space) 6/7.70; spleten (web) 8/3.97
  Gigafida: somalijski (Somalian) 203/11.19; somalski (Somalian) 48/9.10; zdelan (worn) 28/8.29; karibski (Caribbean) 60/8.18; novodoben (modern) 38/6.71
Relation: verbs with FW as object
  Twitter:  voliti* (vote) 6/7.45; podpreti* (support) 5/6.11
  Gigafida: preganjati (persecute) 20/6.65; kaznovati (punish) 6/5.12; snemati (shoot) 15/5.07; loviti (catch) 12/4.56; prehiteti (be faster) 6/4.44
Relation: nouns followed by FW
  Twitter:  sestanek* (meeting) 5/8.05
  Gigafida: bitka (battle) 18/7.15; zatočišče (shelter) 9/7.05; preganjanje (persecution) 18/6.69; jahta (yacht) 6/6.68; prekletstvo (curse) 8/6.53

Although, for space reasons, we only list the strongest five collocates from the three most productive Word Sketches in Table 2, we analysed all the extracted relations and collocations in both corpora. Where needed, we also examined the concordances of the relevant collocation. As can be seen from the collocations in Table 2, three senses can be identified in the Gigafida corpus, some with overlapping collocations: a person who robs ships (e.g. ‘somalijski/Somalian’, ‘jahta/yacht’, ‘zatočišče/shelter’); a person who illegally reproduces copyrighted content (e.g. ‘novodoben/modern’, ‘preganjati/persecute’, ‘kaznovati/punish’); and a metaphorical sense or a book, movie or TV show title (e.g. ‘zdelan/worn’, ‘prekletstvo/curse’, ‘snemati/shoot’). In the Twitter corpus, similar senses can be identified based on the following collocations in the Word Sketch: ‘somalski/Somalian’, ‘spleten/web’, and ‘vesoljski/space’. But in addition to those, collocations such as ‘islandski/Icelandic’, ‘voliti/vote’, ‘podpreti/support’ and ‘sestanek/meeting’ that are not found in the Gigafida corpus indicate a new sense in tweets published in 2014 and 2015, namely: persons who are members of new political parties from Slovenia and other European countries. This sense is missing from the Gigafida corpus, not because it is only used on social media or in informal communication settings, but because the corpus only includes texts published up to 2011, whereas the political movement gained momentum after its election successes in Germany in 2011, Iceland in 2013, and the EU in 2014.

Clearly, not all the discrepancies between the corpora point to novel senses; some reflect more subtle differences in usage, such as semantic narrowing or a redistribution of sense frequencies due to differences in the topics, genres and registers represented in the corpora. This is why we distinguished between major and minor semantic shifts and further classified them into subcategories as follows:

Major semantic shifts
  - Event-based
  - Register-based
  - Medium-based
Minor semantic shifts
  - Shifts in sense distribution
  - Pattern restrictions
  - Semantic narrowing
Errors
  - Pre-processing errors
  - False positives

3.4.1 Major semantic shifts
In order to gain a more thorough understanding of the more substantial differences in the semantic footprint of the vocabulary that displays the biggest difference in usage on Twitter with respect to the reference corpus, we categorised them into three classes: event-based, register-based and medium-based.

3.4.1.1 Event-based semantic shifts
As the first type of major shifts we consider novel usage of words that is a consequence of daily events, political situations, natural disasters, or social circumstances.
One such example is the already mentioned ‘pirat’ (‘pirate’), which used to be confined to the sea but can now also be found on the internet and even in politics, as a member of the new political parties, only in the latter case in distinctly positive contexts. Another such example is the noun ‘vztrajnik’ (Eng. ‘protester’ / ‘flywheel’), which is quite rare in Gigafida and appears only in the sense of ‘flywheel’, but is used on Twitter much more frequently and almost exclusively to refer to persistent protesters (see Table 3).

Table 3. Top five collocates for the three most productive Word Sketches for the word ‘vztrajnik’ (Eng. ‘protester/flywheel’) in the Twitter and the Gigafida corpus. The Twitter collocates reflect the novel ‘protester’ sense that is not observed in the Gigafida corpus. Each cell: collocation (English translation), frequency / logDice.

Twitter focus word: vztrajnik (Eng. ‘protester’), frequency: 816 (7.62 per million)
Gigafida focus word: vztrajnik (Eng. ‘flywheel’), frequency: 423 (0.30 per million)

Relation: modifiers of FW
  Twitter:  drag (dear) 7/4.53; pravi (true) 5/2.24
  Gigafida: dvomasen (dual-mass) 12/11.41; vrteč (rotating) 5/6.24; magneten (magnetic) 16/5.37; lahek (light) 15/2.45; težek (heavy) 7/0.50
Relation: verbs with FW as subject
  Twitter:  zapeti (sing) 7/8.81; vztrajati (persist) 5/8.05
  Gigafida: skrbeti (care) 6/1.97
Relation: nouns followed by FW
  Twitter:  Viktor (Viktor) 8/10.78; Odbor (Committee) 10/7.33
  Gigafida: motor (motor) 13/3.43; sistem (system) 5/0.30

It is interesting to note that the few earliest usage examples from the Twitter corpus that predate the period of political and social unrest in 2013-2014 belong to the ‘flywheel’ sense, just as in the reference Gigafida corpus. We begin to see a sharp increase in usage in 2014. Along with the increased frequency, we observe the rise of the ‘protester’ sense, first appearing alongside the ‘flywheel’ sense and then prevailing completely. The spike in word usage coincides with the period of two waves of civil protests. First, there was a large nation-wide protest by left-leaning citizens against the government on the grounds of suspected corruption by the Prime Minister Janez Janša. When Janša was imprisoned on corruption charges, there followed a second round of smaller-scale but sustained protests by Janša supporters against the court ruling. A closer examination of the sentiment of the tweets including the word reveals that ‘vztrajnik’ has a markedly positive connotation. By contrast, the connotation of a very similar expression, ‘vztajnik’ (Eng. ‘rebel’), is distinctly negative; it is used much more frequently (freq. 1,767, 16.51 per million), by more or less the same users, to refer to the protesters against the government whose protests had started when Janša was still Prime Minister.

3.4.1.2 Register-based semantic shifts
Many senses absent from the reference corpus can be detected in the Twitter corpus because a lot of informal communication is performed via Twitter and colloquial language is frequent there. Such an example is the noun ‘penzion’, which in standard Slovene means ‘guesthouse’ but is also used for ‘retirement’ in non-standard language, as the excerpt from the Word Sketches in Table 4 shows. Detecting this type of usage is valuable because it is typically insufficiently covered by traditional lexical resources and not represented in most existing reference corpora. With the growing volume and importance of communication on social media, it is becoming increasingly important to study this segment of language as well and include it in lexical resources. Coverage of non-standard language is also required to enable robust processing of noisy internet texts.
Table 4. Examples of concordances for the only productive grammatical relation (preposition + FW) for the word ‘penzion’ (Eng. ‘guesthouse/retirement’) in the Twitter and the Gigafida corpus. The Twitter examples illustrate the novel ‘retirement’ sense, which in the Twitter corpus occurs in addition to the ‘guesthouse’ sense and is unattested in the Gigafida corpus. Each collocate: preposition (English translation), frequency / logDice.

Twitter focus word: penzion (Eng. ‘guesthouse, retirement’), frequency: 1,073 (10.02 per million)
Gigafida focus word: penzion (Eng. ‘guesthouse’), frequency: 4,898 (3.47 per million)

Twitter: iz (from) 127/5.04
  original: ‘moja mam bi šla iz Penzion paketa na Enostavni 300’
  translation: ‘my mum would like to switch from the Senior package to the Simple 300’
Gigafida: izpred (in front of) 107/8.22
  original: ‘odhod bo ob 17. uri izpred penziona Špik’
  translation: ‘departure at 5 p.m. in front of the Špik guesthouse’

Twitter: v (in) 589/4.25
  original: ‘ker sem se odloču da se mi ne da več, grem z naslednjim letom v penzion’
  translation: ‘because I decided that I can’t be bothered anymore I’ll retire next year’
Gigafida: pred (in front of) 53/0.72
  original: ‘koncert narodnozabavne skupine Gašperji na plaži pred penzionom Tiha dolina’
  translation: ‘concert of the Oberkrainer band Gašperji on the beach in front of the Tiha dolina guesthouse’

Twitter: do (to) 9/1.73
  original: ‘vsako leto mi manjka več do penziona’
  translation: ‘each year I am further away from retirement’
Gigafida: v (in) 757/0.44
  original: ‘Prenočiti je možno le v penzionih ali najeti počitniško hišico.’
  translation: ‘It’s only possible to spend the night in a guesthouse or rent a cabin.’

3.4.1.3 Medium-based semantic shifts
The last type of major semantic shifts concerns new communication conventions that have emerged on social media and have appropriated some existing vocabulary for a new or specialised purpose. An example of this phenomenon is the noun ‘sledilec’ (Eng. ‘follower’), for which it is immediately obvious that some transformation in word usage has probably occurred. For one, we detect a substantial increase in usage (601 hits or 0.43 per million in the 1.2-billion-word reference corpus vs. 2,854 hits or 26.65 per million in the ten times smaller Twitter corpus). A closer examination of the Word Sketches shows a specialization of its meaning on Twitter, from the following senses: a follower of the beliefs and works of influential politicians, religious leaders or artists (e.g. ‘predan/devoted’, ‘zvest/loyal’, ‘nauk/teaching’, ‘ideja/idea’, ‘gibanje/movement’); a person or organisation that copies what others are doing or saying and is not a leader (e.g. ‘slep/blind’, ‘podrejen/inferior’, ‘trend/trend’, ‘četica/troop’, ‘prepisovalec/copier’); and a tracking device or medium (e.g. ‘izotopski/isotope’, ‘satelitski/satellite’, ‘vgrajen/in-built’, ‘radioaktiven/radioactive’, ‘silicijski/silicon’); to a user who follows you on Twitter or other social media (e.g. ‘nov/new’, ‘število/number’, ‘nabirati/collect’, ‘meja/threshold’, ‘milijon/million’).
3.4.2 Minor semantic shifts
Among the words displaying minor differences in the semantic footprint of the vocabulary on Twitter with respect to the reference corpus, we distinguished between the following three categories: shifts in sense distribution, pattern restrictions, and semantic narrowing.

3.4.2.1 Shifts in sense distribution
As the first type of minor shifts we consider those cases in which we identified the same senses in both corpora, but with different frequency distributions. An example of this is the noun ‘sesalec’, which can mean both ‘mammal’ and ‘vacuum cleaner’ in both corpora, but the ‘mammal’ sense predominates in the reference corpus, while the ‘vacuum cleaner’ sense predominates in the Twitter corpus. As can be seen from Table 5, the noun is more prominent in Twitter conversations, and only two collocations (‘edin/only’ and ‘vrsta/species’) out of the top five in the three most prolific grammatical relations refer to the mammal sense in the Twitter corpus, while the opposite is true for the reference corpus, in which only ‘globinski/high-power’ and ‘prodajati/sell’ refer to the vacuum cleaner sense.

Table 5. Top five collocates for the three most productive Word Sketches for the word ‘sesalec’ (Eng. ‘mammal/vacuum cleaner’) in the Twitter and the Gigafida corpus, illustrating the redistribution of senses in favour of the non-standard ‘vacuum cleaner’ sense, which dominates in the Twitter corpus. Each cell: collocation (English translation), frequency / logDice.

Twitter focus word: sesalec (Eng. ‘vacuum cleaner’), frequency: 701 (6.54 per million)
Gigafida focus word: sesalec (Eng. ‘mammal’), frequency: 7,047 (4.99 per million)

Relation: modifiers of FW
  Twitter:  robotski (robot) 10/9.91; globinski (high-power) 7/9.53; voden (water) 8/6.58; edin (only) 6/3.70; nov (new) 11/1.26
  Gigafida: kopenski (terrestrial) 81/8.01; rastlinojed (herbivorous) 30/8.00; morski (sea) 502/7.71; globinski (high-power) 47/7.68; kloniran (cloned) 26/7.49
Relation: verbs with FW as object
  Twitter:  imeti (have) 7/8.81; kupiti (buy) 5/8.05
  Gigafida: klonirati (clone) 9/8.57; pleniti (prey on) 9/8.52; loviti (hunt) 16/4.98; napadati (attack) 6/4.88; prodajati (sell) 10/3.27
Relation: nouns followed by FW
  Twitter:  vrečka (bag) 7/9.01; zvok (noise) 10/8.24; vrsta (species) 5/5.52
  Gigafida: kloniranje (cloning) 53/8.89; samica (female) 18/7.61; tkivo (tissue) 19/7.18; genom (genome) 10/7.09; mladič (cub) 12/7.07

3.4.2.2 Pattern restrictions
Next, we register cases exhibiting distinct discrepancies in the patterns in which the focus word is regularly used, influencing the sense of the target word. A typical example of this category is the noun ‘eter/ether’. According to its Word Sketch, summarised in Table 6, the usage of the word is both frequent and versatile in Gigafida (6,264 hits or 4.44 per million), ranging from the literal chemical sense (e.g. ‘molekula/molecule’, ‘element/element’, ‘alkohol/alcohol’, ‘ester/ester’, ‘dietil/diethyl’), through the metaphorical sense, including names of companies, movies, etc. (e.g. ‘moralni/moral’, ‘brazilski/Brazilian’, ‘svoboden/free’, ‘življenje/life’, ‘svetloba/light’), to the broadcasting sense (e.g. ‘radio/radio’, ‘televizijski/television’, ‘postaja/station’, ‘oddaja/show’, ‘v/on’), which is the least frequent.
Table 6. Top five collocates for the two most productive Word Sketches for the word ‘eter’ (Eng. ‘ether/on air’) in the Twitter and the Gigafida corpus, illustrating the redistribution of senses in favour of the non-standard ‘on air’ sense, which dominates in the Twitter corpus. Each cell: collocation (English translation), frequency / logDice.

Twitter focus word: eter (Eng. ‘on air, ether’), frequency: 870 (8.12 per million)
Gigafida focus word: eter (Eng. ‘ether, on air’), frequency: 6,264 (4.44 per million)

Relation: modifiers of FW
  Twitter:  Petričev (Petrič’s) 5/11.17; radijski (radio) 5/6.21
  Gigafida: škroben (starch) 7/7.18; radijski (radio) 279/6.71; toploten (thermal) 11/4.17; svetloben (light) 9/3.71; kemičen (chemical) 9/3.47
Relation: preposition + FW
  Twitter:  izven (off) 8/6.77; v (in) 707/4.52; proti (against) 17/3.85; iz (from) 6/0.63
  Gigafida: proti (against) 627/5.27; izven (off) 9/3.80; preko (through) 18/3.33; zunaj (outside) 8/3.16; skozi (through) 29/2.67

In tweets, on the other hand, by far the most frequent sense is the broadcasting one (e.g. ‘v/on’, ‘proti/against’, ‘iz/from’, ‘radijski/radio’), with only a handful of collocations of the chemical sense (e.g. ‘borov/boron’, ‘teorija/theory’) and some mentions of company names that are lemmatisation errors (e.g. ‘Etra’, ‘eTRI’). Again, the noun seems more central to Twitter conversations (8.12 vs. 4.44 per million), where it is used almost exclusively in the ‘on air’ sense. In Gigafida, the preposition relation also largely belongs to this sense, while virtually all the other relations pertain to the chemical or the metaphorical ether senses.
3.4.2.3 Semantic narrowing

The third type of minor shift we detect is the narrowing of a word's semantic range, which is most likely not a sign of a word sense dying out, but rather a consequence of the more limited set of topics discussed on Twitter compared to the reference corpus. Such an example is the verb 'posodobiti' (Eng. 'update/improve'), which, based on the Word Sketches, is used with a very wide range of objects in the Gigafida corpus, such as 'infrastruktura/infrastructure', 'proizvodnja/production', 'park/park', 'oprema/equipment', and 'flota/fleet'. In the Twitter corpus, on the other hand, the same verb is confined to IT-related objects: 'aplikacija/application', 'stran/page', 'seznam/list', 'sistem/system'. Another example of a narrower semantic footprint in the Twitter corpus is the noun 'faks', which is polysemous in the Gigafida corpus and used either in the 'telefax' or the 'university' sense, the former being the predominant one. This can be clearly illustrated with an excerpt of the strongest collocations in the verb + object relation: 'pošiljati/send', 'oddajati/send', 'sprejemati/receive', 'dokončati/finish', 'vpisati/enrol'. While the noun is much more prominent in the Twitter corpus (45.78 vs. 16.76 per million), there it is used distinctly in the 'university' sense: 'končati/finish', 'pustiti/drop out', 'pogrešati/miss', 'narediti/end', 'imeti/have'.

3.4.3. Error analysis

Apart from a detailed analysis of the detected semantic shifts, we also examined the words that appeared at the top of the list erroneously. In total, no real semantic shift could be detected for 103 (51%) of the 200 top-ranking words. We classified them into two categories. The first contains pre-processing errors, in which the focus word in one of the corpora was assigned a wrong lemma, making its vector very different from its counterpart in the other corpus, as the two do not actually share semantic properties. This category will gradually shrink as the processing tools improve. The second contains false positives that were ranked high by the method proposed in this paper, but for which an examination of the Word Sketches does not reveal any semantic shift. These are the true errors that reveal the limitations of the proposed approach and should be addressed in future refinements of the method.

3.4.3.1 Pre-processing errors

Ninety (45%) of the 200 candidates were errors caused by the NLP processing tool chain. This level of noise is not surprising, as we are dealing with highly non-standard data that is difficult to process with high accuracy, as well as with corpora that were tagged and lemmatised with two different tools. The noise is highest at the top of the list and steadily decreases, which means that these errors are outliers that are easy to spot. As Table 7 shows, the most common source of pre-processing errors (30%) are foreign words, either from tweets written in a foreign language that was incorrectly identified as Slovene or from Slovene tweets partially written in a foreign language. These are closely followed by non-standard words or spellings that the tagger and lemmatiser do not recognize correctly (29%). In third place are problems related to proper names (17%), which are especially pronounced when a proper name overlaps with a common noun. This shows how difficult user-generated content is to process automatically, and how important it is to make NLP tools more robust to the phenomena of social media language.

Table 7. Distribution of errors. The percentages are calculated over the 90 pre-processing errors.
Error type     Freq    %     Example
foreign         27    30%    duda ('pacifier', instead of Eng. 'dude')
nonstandard     26    29%    ajda ('buckwheat', instead of 'ajde/c'mon')
name            15    17%    Tanko (surname, also meaning 'thin')
lemma           13    14%    meni ('menu', instead of 'jaz/I')
PoS              6     7%    dobro ('good', adj. as well as adv.)
diacritics       1     1%    sel ('messenger', instead of 'šel/went')
tool             2     2%    nazadnje ('last time', normalised as 'ne nazadnje')
Grand Total     90   100%

3.4.3.2 False positives

We are particularly interested in the 13 (6.5%) candidates for which no semantic shift could be detected on the basis of the Word Sketches and which are not pre-processing errors. With the help of a concordance analysis, we classified them into two groups: words that are used in automatically generated tweets or in repetitive ad-style texts in Gigafida (9 or 69%), and words that are used in a telegraphic Twitter or journalistic style (4 or 31%). We believe that this level of noise is acceptable, given the high complexity of the task. Naturally, more candidates lower down the list would have to be examined in order to evaluate the approach more comprehensively for words displaying a lower degree of contextual difference, which is planned for future work.

In Figure 2 we give an example of the stylistically different use of the word 'neuradno' (Eng. 'off the record') in the Twitter corpus with respect to the reference corpus. In Gigafida the word appears mostly in newspaper and magazine articles and is syntactically integrated into the sentence, whereas in tweets even non-media accounts use it in a distinctly telegraphic style.

Figure 2. Examples of stylistically different usage of the word 'neuradno' (Eng. 'off the record') in the Twitter and the Gigafida corpus.

3.5. Results and discussion

An overview of our results is presented in Table 8.
Some type of semantic shift was detected in just under half of the analysed sample, which suggests that, while the proposed approach is not accurate enough to be used in a fully automated scenario, it could be useful in a semi-automatic procedure in which the lexicographer manually examines the output of the automatic ranking. Taking into account the rather large share of pre-processing errors (45%), the results could be substantially improved with more robust NLP tools for non-standard language at various stages: better language identification, more robust processing of non-standard spelling variants and non-standard vocabulary, and better PoS tagging of proper nouns.

Table 8. Distribution of the types of semantic shifts in Slovene Twitterese. The figures in brackets are subcategory counts and their percentages are calculated within the category.

Category    Subcategory              Freq.     %
Major                                 51      25%
            Register-based           (26)    (51%)
            Event-based              (18)    (35%)
            Medium-based              (7)    (14%)
Minor                                 46      23%
            Sense redistribution     (24)    (52%)
            Semantic narrowing       (20)    (43%)
            Pattern restriction       (2)     (4%)
No shift                             103      51%
            Pre-processing errors    (90)    (87%)
            False positives          (13)    (13%)
Total                                200     100%

Major and minor semantic shifts were almost equally frequent (about a quarter of the sample each), with a large difference in frequency bands between the two corpora (e.g. 'vztrajnik': 7.62 per million on Twitter vs. 0.30 per million in Gigafida) being a strong indicator of a semantic shift. Unsurprisingly, most semantic shifts can be attributed to the less formal register and the topics characteristic of Twitter discussions (jointly amounting to a quarter of the analysed sample), which systematically reflect the differences in focus and range of topics between the two corpora. The fact that many more novel usages (new senses due to register, social context and the medium: 25%) were detected than narrowings (topic-restricted senses and patterns: 11%) suggests that the reference corpus could be further enhanced with more recent texts and with texts from social media and other less formal communication practices, as these contain rich and valuable linguistic material that is currently almost entirely absent from it.

Of special interest are the detected novel senses that result from the wider social context in which users express themselves on Twitter, adapting to emerging news (9%) as well as to the dynamic communication conventions of social media (4%). This lexicographic material is particularly valuable because it could serve as the basis for updating the existing lexico-semantic resources of Slovene, and it highlights the need for a monitor corpus, which does not yet exist for Slovene.
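As a concrete illustration of the candidate-ranking step that produced these lists, the sketch below shows one possible implementation; it is a simplified sketch rather than our exact experimental setup. It assumes two embedding dictionaries (word to vector) trained separately on the reference and the Twitter corpus; since independently trained embedding spaces are not directly comparable, the sketch first aligns them with an orthogonal Procrustes rotation over the shared vocabulary, a standard alignment technique, and then ranks words by the cosine distance between their two vectors, so that the strongest semantic shift candidates surface at the top.

    import numpy as np

    def rank_shift_candidates(emb_ref, emb_twt):
        # emb_ref, emb_twt: dicts mapping each word to a 1-D numpy vector,
        # trained separately on the two corpora (hypothetical inputs).
        vocab = sorted(set(emb_ref) & set(emb_twt))
        A = np.vstack([emb_ref[w] for w in vocab]).astype(float)
        B = np.vstack([emb_twt[w] for w in vocab]).astype(float)
        # Length-normalise rows so cosine similarity is a plain dot product.
        A /= np.linalg.norm(A, axis=1, keepdims=True)
        B /= np.linalg.norm(B, axis=1, keepdims=True)
        # Orthogonal Procrustes: the rotation R minimising ||B @ R - A||
        # makes the two independently trained spaces comparable.
        U, _, Vt = np.linalg.svd(B.T @ A)
        B = B @ (U @ Vt)
        # Cosine distance per word; a larger distance marks a stronger
        # semantic shift candidate.
        dist = 1.0 - np.sum(A * B, axis=1)
        return sorted(zip(vocab, dist), key=lambda p: -p[1])

The top of such a ranked list, after filtering by a minimum frequency in both corpora, is then handed over to the kind of manual Word Sketch analysis described above.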
5. Conclusion

In this paper, we presented the increasingly important approach of distributional modelling for automating lexicographic tasks, which we tested on semantic shift detection for Slovene as used in social media. We measured the semantic shift of a word as the distance between the word embedding representation learned from a reference corpus of Slovene and the word embedding learned from a Twitter corpus of Slovene. We performed a manual analysis of the 200 top-ranking words. Apart from the noise due to pre-processing errors (45%), which are easy to spot, the approach yields many highly valuable semantic shift candidates, especially novel senses occurring due to daily events and novel senses produced in informal communication settings. The results show that the approach presented in this paper could significantly contribute to regular semi-automatic updates of corpus-based general as well as specialized lexical resources. But the contribution of this paper reaches well beyond that, as it demonstrates how distributional approaches can be successfully employed for a number of lexicographic tasks in any language for which two large corpora are available: a specialised corpus and a reference corpus with basic linguistic annotation. Our future work will focus on extending the manual analysis to lower-ranked candidates, extending the approach to lower-frequency candidates, comparing our method with alternative approaches such as representing words through word sketches or syntactic patterns, and using supervised learning to detect semantic shifts, discriminate between specific types of semantic shifts, and filter out pre-processing errors.

Footnotes

1 https://bitbucket.org/yoavgo/word2vecf/
2 The full vocabulary lists with annotated semantic shifts are available at: http://nl.ijs.si/janes/wp-content/uploads/2017/07/IJS17-Appendix.pdf
3 https://www.sketchengine.co.uk

Acknowledgments

The work described in this paper was funded by the Slovenian Research Agency within the national basic research project 'Resources, Tools and Methods for the Research of Nonstandard Internet Slovene' (J6-6842, 2014-2017), the COST Action 'European Network of e-Lexicography' (IS1305, 2013-2017), and the national basic research project 'Resources, methods, and tools for the understanding, identification, and classification of various forms of socially unacceptable discourse in the information society' (J7-8280, 2017-2019).