A statistical model for the joint inference of vertical stability and horizontal diffusibility of typological features

Abstract

A major pursuit within the study of language evolution is to advance understanding of the historical behavior of typological features. Previous studies have identified at least three factors that determine the typological similarity of a pair of languages: (1) vertical stability, (2) horizontal diffusibility, and (3) universality. Of these factors, the first two are of particular interest. Although observed data are affected by all three factors to a greater or lesser degree, previous studies have not jointly modeled them in a straightforward manner. Here, we propose a solution derived from the field of cultural anthropology. We present a simple and extensible Bayesian autologistic model to jointly infer the three factors from observed data. Although a large number of missing values in the data set pose serious difficulties for statistical modeling, the proposed model can robustly estimate these parameters as well as the missing values. Applying missing value imputation to indirectly evaluate the estimated parameters, we quantitatively demonstrated that they were meaningful. In conclusion, we briefly compare our findings with those of previous studies and discuss future directions.

1. Introduction

Linguistic typology is concerned with the systematic comparison of language structures. Major linguistic types, or features, by which the world’s languages are classified, include basic word order and the presence or absence of tone. For example, basic word order concerns the ordering of subject, object, and verb: English is SVO (Subject-Verb-Object), while Japanese is SOV (Subject-Object-Verb). Typological features constitute a promising resource that can potentially be used to uncover the evolutionary history of languages.
It has been argued that in exceptional cases, typological features can reflect a time span of 10,000 years or more (Nichols 1994). One of the greatest challenges entailed in typology-based phylogenetic inference is that, in contrast to lexical evidence, which is the default choice of historical–comparative linguistics, typological features are characterized by uncertainty. Assessing whether words grouped together are cognates exhibiting regular sound correspondences requires binary judgments, for which the proficiency of logic-based human inference has already been demonstrated. Because of the arbitrary nature of linguistic signs, it is extremely unlikely that unrelated words coincidentally follow sound laws. Thus, once a sufficient number of cognates are identified within a group of languages, we can reasonably conclude that they share a common ancestor. In contrast, the sharing of the same basic word order, for example SOV, is only a weak indicator of shared ancestry because this pattern recurs multiple times in multiple places. Whereas human beings are not adept at handling uncertainties, we believe that computer-intensive statistical methods (Tsunoda et al. 1995; Dunn et al. 2005; Longobardi and Guardiano 2009; Murawaki 2015) can overcome this problem because they provide effective ways of combining weak signals. In this article, we take a step forward toward developing typology-based phylogenetic inference. Specifically, we engage with the problem of the obscuring of vertical (phylogenetic) signals by horizontal (spatial or areal) transmission. Like loanwords, typological features can be borrowed from one language to another. In fact, a non-tree-like mode of evolution has emerged as one of the central topics in linguistic typology (Sijthoff 1928; Campbell 2006). However, the tree model, which is the standard statistical model used for phylogenetic inference, only reflects vertical transmission.
The task of incorporating both vertical and horizontal transmission within a statistical model is notoriously challenging because of the excessive flexibility of horizontal transmission. Although the tree model also entails a vast search space, this issue can be handled using computer-intensive methods (Felsenstein 1981; Gray and Atkinson 2003). Whereas the degree of uncertainty can intuitively be posited to increase as we trace the states of languages back into the past, the tree model simultaneously reduces the degrees of freedom by repeatedly merging nodes into a parent. Horizontal transmission introduces degrees of freedom that currently cannot be modeled without imposing some strong assumptions (Daumé 2009; Nelson-Sathi et al. 2010). Consequently, here we pursue a line of research in linguistic typology that draws on information on the current distribution of typological features without explicitly requiring the reconstruction of previous states (Nichols 1992, 1995; Parkvall 2008; Wichmann and Holman 2009). Studies that have applied this approach have attempted to quantify how stable typological features are. Typological features constitute a mosaic exhibiting varying degrees of stability. If we succeed in discovering stable features, these could be regarded as typological equivalents of basic vocabulary, that is, of a list of basic concepts whose words are assumed to be resistant to borrowing (Swadesh 1971), some of which are even considered to be extremely stable (Pagel et al. 2013). Such stable traits can potentially enable deep phylogenetic relations among the world’s languages to be uncovered. The question, then, is how the notion of stability is connected to vertical and horizontal transmission. Answering this question is not as simple as we might think. Previous studies offer a surprisingly wide variety of perspectives on it (Dediu and Cysouw 2013).
However, it is evident that at least three factors must be considered: (1) vertical stability, (2) horizontal diffusibility, and (3) universality. In other words, there are three possible explanations for the sharing of the same feature value by a pair of languages. First, the two languages may be phylogenetically related, with the feature demonstrating a strong phylogenetic association (i.e., vertical stability). Second, the two languages may be spatially proximate, with the feature demonstrating a strong spatial association (i.e., horizontal diffusibility). The third explanation lies in a bias term: the more often the feature value appears globally, the higher the probability of a language assuming that value. Of course, these three explanations are not mutually exclusive. The current distribution of typological features is more or less affected by all three factors. Thus, the first step toward developing typology-based phylogenetic inference is to model all these factors jointly.

1.1 A solution from cultural anthropology

We suggest that a parallel can be drawn between cultural anthropology and linguistics with respect to vertical and horizontal transmission. Within cultural anthropology, these two modes of transmission are known as phylogenesis and ethnogenesis, respectively (Collard and Shennan 2000). Phylogenesis involves the transmission of cultural traits primarily from an ancestral population to its descendants, whereas ethnogenesis entails strong influences resulting from transmissions between populations. A model that is potentially capable of disentangling these factors has already been proposed by Towner et al. (2012). This model is a variant of the autologistic model (Besag 1974), which has been widely used to model the spatial distribution of a feature. The model assumes that the value of a random variable depends probabilistically on the values of its neighbors. Towner et al.
(2012)’s extension incorporates two neighbor graphs: one for vertical transmission and the other for horizontal transmission. These graphs are associated with scalar parameters indicating the relative strength of the two transmission modes. Towner et al. (2012) applied the autologistic model to examine cultural traits (features) of North American Indian societies, such as the presence or absence of agriculture, the tendency toward exogamy, and types of social structure. They constructed a spatial neighbor graph using longitudinal and latitudinal data on the societies, as well as a phylogenetic neighbor graph created by collapsing known language families. Applying this model, they empirically demonstrated that both transmission modes were non-negligible for the majority of traits. The same model was subsequently applied to folktales of Indo-European societies by da Silva and Tehrani (2016). As an intermediate step toward discovering folktales with deep historical roots, the model was used to filter out tales with strong horizontal signals. Although the key idea of jointly modeling vertical and horizontal transmission was built into the model formulated by Towner et al. (2012), it was not readily applicable to typological data. For this reason, we modified the model and added a new inference algorithm, as described in this article. Our model differs from theirs primarily in two ways. First, whereas the original model only deals with binary features, our extended model can handle categorical features. Although the original database of cultural traits was coded categorically, Towner et al. (2012) selected a small subset of it by excluding unsuitable features and merging feature values. Second, because Towner et al. (2012) only focused on high-coverage features, they simply excluded languages with missing values from the neighbor graphs. However, the linguistic typological database is too sparse for missing values to be ignored.
The situation is unlikely to change in the foreseeable future because, of the thousands of languages in the world, only a handful are amply documented. Consequently, we take a Bayesian approach because it is known for its robustness in modeling uncertainties (Murphy 2012: Chapter 2). We used a Markov chain Monte Carlo (MCMC) method to jointly infer the scalar parameters and missing values. The source code is available from GitHub at: https://github.com/murawaki/bayes-autologistic.

1.2 Brief review of previous studies

Though a thorough review of previous studies is beyond the scope of this article, to the best of our knowledge, no previous study in linguistic typology has jointly modeled all three factors in a straightforward manner. More details on these earlier studies can be found in a review by Wichmann (2015), as well as in a review and empirical comparison conducted by Dediu and Cysouw (2013). Here we briefly describe how three approaches described in the literature have ignored one or more factors or have suffered from contamination. Dediu (2010) used the tree model to infer stability (hereafter, stability D). By design, stability D does not explicitly reflect horizontal diffusibility. He assumed that a feature is passed on from a parent to a child within a family tree and obeys a continuous-time Markov chain model in which the transition rate matrix is constrained to have only one parameter. This parameter is estimated for each of several given trees and is then merged into one index, that is, stability D. In general, a transition rate matrix controls not just the speed of change (i.e., instability) but also the stationary distribution toward which the model converges. We can consider the stationary distribution to reflect universality because it is where vertical influence vanishes.
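For a binary feature, this convergence can be illustrated with a minimal sketch (our own illustration, not Dediu’s implementation; the function name is ours). A single rate parameter r yields a symmetric two-state rate matrix, whose transition probabilities have a simple closed form:

```python
import math

def transition_matrix(r, t):
    """Transition probabilities P(t) = exp(Qt) for the two-state rate
    matrix Q = [[-r, r], [r, -r]], in closed form: the probability of
    staying in the current state decays toward 0.5 as t grows."""
    stay = 0.5 + 0.5 * math.exp(-2.0 * r * t)
    return [[stay, 1.0 - stay], [1.0 - stay, stay]]

# Whatever the single rate parameter is, the chain forgets its starting
# state: each row of P(t) approaches the uniform distribution (0.5, 0.5).
for t in (0.1, 1.0, 10.0):
    row = transition_matrix(r=1.0, t=t)[0]
    print(t, [round(p, 4) for p in row])
```

The rate parameter thus controls only the speed of convergence; the stationary distribution itself is fixed.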
However, the stationary distribution in Dediu’s model is trivial because only one parameter controls the transition rate matrix. A binary-coded feature, for example, is doomed to take value 0 with probability 0.5 and value 1 with probability 0.5. In short, stability D also ignores universality. Parkvall (2008) presented a series of calculations for deriving a stability index (stability P). First, he computed the Herfindahl–Hirschman index for a given group G, defined either phylogenetically or areally (spatially):

DG = 1 − ∑j (pj)^2,

where pj is the relative frequency of the feature value j in G. Because this is a diversity index, he calculated a homogeneity index, HG, as the reciprocal of DG. He then averaged HG across all phylogenetic groups to obtain HFAM. HARE denotes the areal equivalent of HFAM. As a final step, he divided HFAM by HARE to obtain stability P. Stability P takes into account both vertical stability (HFAM) and horizontal diffusibility (HARE). However, DG suffers from noise because G does not reflect purely phylogenetic (or spatial) signals. A model that performs one-step disentanglement, rather than a combination of noisy estimation and correction, would thus be ideal. The question of universality, which is obscured by the series of calculations, remains; HFAM should probably be seen as an amalgam of vertical stability and universality (the same is true of HARE). The stability index developed by Wichmann and Holman (2009) (stability W) is similar to that of Parkvall, but only considers phylogenetic groups. Whereas Wichmann and Holman introduced three metrics, here we focus on metric C, which has been found to perform the best in simulation experiments. For each phylogenetic group G, they calculated VG, the proportion of language pairs in G that share the same feature value. Next, they averaged VG across all phylogenetic groups, weighted by the inverse of the number of languages in G:

R = ∑G VG / |G|.
Similarly, they calculated U as the proportion of pairs of unrelated languages sharing the same value. Stability W was obtained by correcting R for U as follows:

stability W = (R − U) / (1 − U).

Though stability W accounts for vertical stability and universality, it does not explicitly consider horizontal diffusibility. Wichmann and Holman (2009) justified this exclusion by pointing out that diffusibility is largely area-specific. Nevertheless, it is desirable to control for horizontal diffusibility when computing vertical stability. To summarize, the methods reviewed above either simply ignore one or more factors or perform a noisy estimation followed by a correction step. In what follows, we show that all three factors can be modeled jointly in a straightforward manner. We also argue that it is better to keep vertical stability and horizontal diffusibility apart, rather than conflating them into a single stability index.

2. Materials and methods

2.1 Data set

Although the proposed model does not depend on a particular data set, in this article we focus on one database, namely, the online edition of the World Atlas of Language Structures (WALS) (Haspelmath et al. 2005). As of 2017, this atlas covered 2,679 languages and 192 typological features. However, the associated language–feature matrix was very sparse, with fewer than 15% of its items present. Each item of the language–feature matrix is a categorical value. For example, Feature 81A, ‘Order of Subject, Object and Verb’, has seven possible values: SOV, SVO, VSO, VOS, OVS, OSV, and No dominant order, and each language takes one of these seven values. We used these categorical feature values without making any modifications. Although merging some fine-grained feature values appears desirable (Daumé and Campbell 2007; Dediu 2010; Greenhill et al. 2010; Dunn et al. 2011), we leave this work for a future investigation.
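To fix ideas, such a sparse language–feature matrix can be represented as a nested dictionary keyed by language and feature. The snippet below is a toy sketch (the language names and most of the feature values are illustrative, with 81A’s values drawn from the inventory above), including a per-feature coverage computation of the kind used for feature selection:

```python
# Toy sparse language–feature matrix: {language: {feature: categorical value}}.
# The real WALS matrix has 2,679 languages and 192 features, with fewer
# than 15% of its items present.
data = {
    "English":  {"81A": "SVO", "83A": "VO"},
    "Japanese": {"81A": "SOV", "83A": "OV", "97A": "OV and AdjN"},
    "Mandarin": {"81A": "SVO"},
}

def coverage(data, feature):
    """Ratio of languages with an observed value for the given feature."""
    observed = sum(1 for values in data.values() if feature in values)
    return observed / len(data)

print(coverage(data, "81A"))  # observed for all three languages -> 1.0
print(coverage(data, "97A"))  # observed for one of the three languages
```

Every cell absent from a language’s inner dictionary is a missing value; coverage is simply the filled fraction of a feature’s column.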
Some previous studies discarded languages with few observed features (Murawaki 2015; Takamura et al. 2016). To avoid introducing arbitrariness of our own, we included all but 72 languages in the database. The excluded languages were sign languages and pidgin and creole languages, all of which are categorized in WALS as belonging to a dummy phylogenetic group labeled ‘other’. Of the 192 features, we selected the 82 features that covered at least 10% of the remaining languages and applied to the entire world. For example, we dropped Feature 144H (‘NegSVO Order’) because it is only defined for SVO languages. WALS was also the source for constructing the phylogenetic and spatial neighbor graphs. The atlas provides two levels of grouping: family and genus. We used the genus level to construct a phylogenetic neighbor graph in which every pair of languages within a genus was linked. The average number of phylogenetic neighbors was 30.8. Other resources, such as Glottolog (Hammarström et al. 2016) and Ethnologue (Lewis et al. 2014), provide more detailed hierarchical classifications. However, there seemed to be no way to select groups of languages of comparable time depth from them. Each language is associated with a single-point geographical coordinate (longitude and latitude), as shown in Fig. 1. We constructed a spatial neighbor graph by linking all language pairs located within a distance of R km. Following da Silva and Tehrani (2016), we set the value of R to 1,000. The average number of spatial neighbors was 89.1. The phylogenetic and spatial neighbor graphs had substantial overlap. On average, 71.4% of phylogenetic neighbors were simultaneously spatial neighbors, while 20.1% of spatial neighbors were simultaneously phylogenetic neighbors.

Figure 1. The geographical distribution of Feature 97A (‘Relationship between the Order of Object and Verb and the Order of Adjective and Noun’).
Each point denotes a language. There are a non-negligible number of missing values (denoted as N/A).

2.2 Counting functions

We begin our description of the model by introducing three counting functions corresponding to the vertical, horizontal, and universal factors. A counting function receives the feature values of the world’s languages and returns a non-negative scalar. Formally, let xi = (x1,i, ⋯, xL,i) ∈ Xi be a sequence of feature values, where xl,i is the value of the l-th language for feature type i. xl,i takes one of Fi categorical values, and Fi differs according to feature type. For now, we assume that there are no missing values. The vertical factor is represented by V(xi), which returns the number of pairs sharing the same value in the phylogenetic neighbor graph. Figure 2 provides an example. The intuition behind this function is that if the feature in question is vertically stable, then a phylogenetically defined group of languages will tend to share the same value. A more direct counting method would check how many changes have occurred in parent-to-child transitions within the trees. However, this alternative would be difficult to implement because the feature values of ancestors remain unknown. Consequently, we chose a counting function that only refers to contemporary languages.

Figure 2. The phylogenetic neighbor graph and V(xi). Each node denotes a language. Languages are grouped by common ancestors whose feature values are unknown. The neighbor graph consists of edges, each of which connects a pair of languages within the same group. In this example, two edges indicate the sharing of feature values.
H(xi) is the spatial equivalent of V(xi), as illustrated in Fig. 3. If the feature in question is horizontally diffusible, then spatially close languages would be expected to frequently share the same feature value.

Figure 3. The spatial neighbor graph and H(xi). Of the five edges in the neighbor graph, four indicate feature value-sharing.

Last, Uj(xi) is the number of languages that take the value j. If j is universally common, then Uj(xi) must be large.

2.3 Autologistic model

We now introduce the following parameters: vertical stability vi, horizontal diffusibility hi, and universality ui = (ui,1, ⋯, ui,Fi) for each feature i. Applying the autologistic model, the probability of xi, conditioned on vi, hi, and ui, can be expressed as:

P(xi | vi, hi, ui) = exp(vi V(xi) + hi H(xi) + ∑j ui,j Uj(xi)) / ∑xi′∈Xi exp(vi V(xi′) + hi H(xi′) + ∑j ui,j Uj(xi′)). (1)

The denominator is a normalization term, ensuring that the distribution sums to 1. This term is computationally intractable because there are (Fi)^L possible assignments of xi. However, as we will show, the parameters can be estimated without exactly calculating the normalization term. The autologistic model can be interpreted in terms of competition among possible assignments of xi for the probability mass 1. If a given assignment xi has a relatively large V(xi), then setting a large value for vi enables it to appropriate fractions of the mass from its weaker rivals.
However, if too large a value is set for vi, it will be overwhelmed by its stronger rivals. To acquire further insight into the model, consider the probability of language l taking the value k, conditioned on the rest of the languages, x−l,i:

P(xl,i = k | x−l,i, vi, hi, ui) ∝ exp(vi Vl,i,k + hi Hl,i,k + ui,k), (2)

where Vl,i,k is the number of language l’s phylogenetic neighbors that assume the value k, and Hl,i,k is its spatial counterpart. The derivation of Equation (2) is given in Supplementary Section S.1. The probability of language l assuming value k is thus expressed as a weighted linear combination of the three factors in log-space. If the vertical stability parameter vi is positive, then the probability increases with the number of phylogenetic neighbors that assume value k. However, this probability depends not only on the phylogenetic neighbors of language l but also on its spatial neighbors and on universality. How strongly these factors affect the stochastic selection is controlled by vi, hi, and ui. We set prior distributions on vi, hi, and ui,j as follows: vi ∼ Gamma(κ, θ), hi ∼ Gamma(κ, θ), and ui,j ∼ N(0, σ²), where κ, θ, and σ are hyperparameters. Note that by setting Gamma priors, we constrained vi and hi to always be positive. Negative vi and hi would be difficult to justify from a linguistic point of view because they would indicate that languages attempt to differentiate themselves from the languages to which they are related. We set shape κ = 1.0, scale θ = 1.0, and SD σ = 10.0. These priors were not non-informative because we wanted to penalize extreme values, but they were sufficiently gentle in the regions where these parameters typically resided. The fact that the phylogenetic and spatial neighbor graphs had substantial overlap was likely to make it difficult to disentangle the two factors.
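The counting functions and the conditional in Equation (2) can be made concrete with a small sketch (toy languages, edges, and parameter values; all names are ours), which also counts the edges shared by the two graphs, the overlap at issue here:

```python
import math

# Toy feature assignment and undirected neighbor graphs (edges as pairs).
x = {"a": "SOV", "b": "SOV", "c": "SVO", "d": "SOV"}
phylo_edges = {("a", "b"), ("b", "d")}                # phylogenetic neighbors
spatial_edges = {("a", "b"), ("a", "c"), ("c", "d")}  # spatial neighbors

def agree(edges, x):
    """Number of neighboring pairs sharing the same feature value."""
    return sum(1 for p, q in edges if x[p] == x[q])

V = agree(phylo_edges, x)                  # vertical counting function V(x)
H = agree(spatial_edges, x)                # horizontal counting function H(x)
M = agree(phylo_edges & spatial_edges, x)  # pairs neighboring in both graphs

def conditional(lang, x, v, h, u):
    """P(x_lang = k | rest) as in Equation (2): proportional to
    exp(v * V_k + h * H_k + u_k), where V_k (H_k) is the number of
    lang's phylogenetic (spatial) neighbors taking value k."""
    def neighbors_with(edges, k):
        return sum(1 for p, q in edges
                   if (p == lang and x[q] == k) or (q == lang and x[p] == k))
    scores = {k: math.exp(v * neighbors_with(phylo_edges, k)
                          + h * neighbors_with(spatial_edges, k) + u[k])
              for k in u}
    z = sum(scores.values())  # normalization over the possible values
    return {k: s / z for k, s in scores.items()}

probs = conditional("a", x, v=1.0, h=0.5, u={"SOV": 0.0, "SVO": 0.0})
```

With these toy graphs, V = 2, H = 1, and M = 1; the disjoint variant introduced next ties vi and hi to V − M and H − M instead.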
As a possible solution to this problem, we considered a model variant with an additional parameter mi:

P(xi | vi, hi, mi, ui) = exp(vi Ṽ(xi) + hi H̃(xi) + mi M(xi) + ∑j ui,j Uj(xi)) / ∑xi′∈Xi exp(vi Ṽ(xi′) + hi H̃(xi′) + mi M(xi′) + ∑j ui,j Uj(xi′)), (3)

where M(xi) counts pairs of languages that are simultaneously phylogenetic and spatial neighbors, Ṽ(xi) = V(xi) − M(xi), and H̃(xi) = H(xi) − M(xi). By tying vi and hi to disjoint sets of language pairs, we expected the autologistic model to detect vertical and horizontal signals more easily.

2.4 Inference

We can now proceed to parameter estimation. Let xi be decomposed into an observed portion xiobs and the remaining missing portion ximis. The latter, in addition to vi, hi, and ui, needs to be inferred. We used a standard MCMC sampling algorithm (Bishop 2006: Chapter 11). Specifically, we employed Gibbs sampling to draw posterior samples of vi, hi, ui, and ximis from

P(vi, hi, ui, ximis | xiobs) ∝ P(vi, hi, ui) P(xi | vi, hi, ui) = P(vi) P(hi) P(ui) P(xi | vi, hi, ui), (4)

where hyperparameters are omitted for brevity. The algorithm is sketched in Supplementary Section S.2. Note that we needed to sample vi (and likewise hi and ui,j) from P(vi | hi, ui, xi) ∝ P(vi) P(xi | vi, hi, ui). This belongs to a class of problems known as sampling from doubly intractable distributions (Møller et al. 2006; Murray et al. 2006). While this remains a challenging problem in statistics, it is not difficult to sample the variables approximately if we give up theoretical rigor (Liang 2010). The details of the algorithm we used can be found in Supplementary Section S.3.

3. Results and discussion

3.1 Missing value imputation

While most previous studies have relied on a subjective analysis of stability indices, the quality of the estimated parameters can be measured quantitatively. This is an advantage of a probabilistic model with predictive ability. Specifically, the estimated parameters can be indirectly evaluated by means of missing value imputation.
If the autologistic model predicts missing values better than reasonable baselines do, we can say that the estimated parameters are justified. Although no ground truth exists for the missing portion of the database, missing value imputation can be evaluated by hiding some observed values and verifying how well they are recovered. For each feature type, we conducted tenfold cross-validation. We randomly partitioned the observed values into ten blocks of almost equal size. For each of the ten runs, we treated one block in the same way as the other missing values and performed posterior sampling. After 200 burn-in iterations, we ran 2,495 iterations and collected samples at intervals of 5 iterations. Among the collected samples, we chose the most frequent feature value for each language as the final output. The autologistic model had two variants: the basic model (autologistic-basic) defined in Equation (1) and the extension (autologistic-disjoint) defined in Equation (3). They were compared with three baselines:

Universal: A value is drawn from the empirical distribution of xiobs.

Global majority: The most frequent value in xiobs is always output.

Neighborhood: The observed values of each language’s neighbors are compiled from xiobs, and a value is drawn from their empirical distribution. If both graphs have observed neighbors, one graph is chosen at random; if only one of the two graphs has observed neighbors, that one is chosen; if neither is available, the universal baseline provides a fallback.

Table 1 shows the overall results. The universal baseline performed the worst, followed, in ascending order, by the global majority and the neighborhood baselines. The autologistic model outperformed all three baselines, and its two variants achieved comparable performance. This indicates that the estimated parameters were reasonable.

Table 1. Results of missing value imputation.
Model                  Accuracy (%)
                       Macro-average  Micro-average
Universal              42.21          42.05
Global majority        54.85          54.20
Neighborhood           57.87          58.95
Autologistic-basic     60.32          62.03
Autologistic-disjoint  60.92          62.84

Let us now take a closer look at the results. Supplementary Table S1 shows the accuracy of missing value imputation for each feature. Figure 4 and Table 2 compare the performance of the autologistic model with those of the baselines. Although the autologistic model outperformed the neighborhood baseline for the majority of features, it showed a performance loss for some. For these features, the autologistic model fell between the neighborhood and global majority baselines, suggesting that the effect of universality was overestimated.

Table 2. Features ranked by performance gain of autologistic-basic compared with the neighborhood baseline.
Feature                                                  Accuracy (%)  Gain (%)
16A   Weight factors in weight-sensitive stress systems  52.52         +14.69
15A   Weight-sensitive stress                            56.54         +12.68
100A  Alignment of verbal person marking                 56.38         +11.17
116A  Polar questions                                    61.32          +9.85
111A  Nonperiphrastic causative constructions            82.20          +9.71
53A   Ordinal numerals                                   36.77          −6.77
119A  Nominal and locational predication                 69.29          −7.61
17A   Rhythm types                                       44.41          −8.70
9A    The velar nasal                                    54.27         −11.32
118A  Predicative adjectives                             49.61         −17.32

The top five and bottom five ranked features are shown.

Figure 4. Accuracy per feature.

3.2 Estimated parameters

We now turn to an examination of the estimated parameters. The model settings were the same as those in the preceding experiment, except that all of the observed values were used. Figure 5 depicts the estimated parameters for the autologistic model. Note that comparing the absolute values of a vi and an hi makes no sense because they are tied to different neighbor graphs. Table 3 lists the top five and bottom five features ranked according to vi and hi (see Supplementary Tables S2 and S3 for the complete lists). While Fig. 5 represents each feature as a single point, we can use the posterior samples to measure the uncertainty of the estimated parameters, as visualized in Supplementary Fig. S3.

Table 3. Features ranked with vi and hi of autologistic-basic.
(a) Features ranked by the vertical stability parameter vi.

Feature                                                                                vi
73A   The optative                                                                     0.0281
33A   Coding of nominal plurality                                                      0.0280
6A    Uvular consonants                                                                0.0278
97A   Relationship between the Order of Object and Verb and the Order of Adjective and Noun  0.0251
86A   Order of genitive and noun                                                       0.0243
101A  Expression of pronominal subjects                                                0.0035
105A  Ditransitive constructions: the verb ‘give’                                      0.0034
130A  Finger and hand                                                                  0.0034
72A   Imperative–hortative systems                                                     0.0030
18A   Absence of common consonants                                                     0.0024

(b) Features ranked by the horizontal diffusibility parameter hi.

Feature                                                                                hi
144A  Position of negative word with respect to Subject, Object, and Verb              0.0222
97A   Relationship between the Order of Object and Verb and the Order of Adjective and Noun  0.0182
93A   Position of interrogative phrases in content questions                           0.0166
81A   Order of Subject, Object, and Verb                                               0.0163
82A   Order of Subject and Verb                                                        0.0163
34A   Occurrence of nominal plurality                                                  0.0024
8A    Lateral consonants                                                               0.0023
94A   Order of adverbial subordinator and clause                                       0.0022
18A   Absence of common consonants                                                     0.0019
111A  Nonperiphrastic causative constructions                                          0.0013

The top five and bottom five features are shown for vi and hi. See Supplementary Tables S2 and S3 for the complete lists and Supplementary Tables S4 and S5 for autologistic-disjoint.

Figure 5. Scatter plots of estimated parameters for the autologistic model. Each point denotes the arithmetic mean of the posterior samples. A larger vi (hi) indicates that feature i is more stable (diffusible). The shapes of the points represent broad categories of features (called Area in WALS).

As we stated in Section 2.1, the phylogenetic and spatial neighbor graphs had substantial overlap. In fact, the arithmetic mean of vi was moderately correlated with that of hi for autologistic-basic (Spearman’s ρ = 0.454). While the disjointness extension reduced the correlation to 0.170, we refrain from concluding that the two neighbor graphs exhibited exactly what we expected them to exhibit, namely, vertical and horizontal signals, respectively. We investigated the effects of coverage (the ratio of languages whose feature values are present) on the estimated parameters.
The results are shown in Supplementary Fig. S1. Coverage was moderately positively correlated with both vi and hi (Spearman’s ρ=0.416 and 0.425, respectively). This can be explained by the fact that word order-related features, which happened to have high coverage thanks to Matthew Dryer (Haspelmath et al. 2005), tended to be both vertically stable and horizontally diffusible. Not surprisingly, coverage showed moderate negative correlations with the widths of the 95% highest posterior density credible intervals (ρ=−0.414 for vi and −0.473 for hi), meaning that the autologistic model was more uncertain about features with larger numbers of missing values. We also checked correlation statistics regarding the number of unique values of each feature and found only very weak to weak correlations (Supplementary Fig. S2). As a final step, we compared the estimated parameters with statements in the literature. There is no clear-cut methodological solution for comparing previously proposed stability indices with a combination of two parameters, namely, vertical stability vi and horizontal diffusibility hi. We believe that the two distinct concepts of vertical stability and horizontal diffusibility should be kept apart rather than being conflated into a single stability index. As a tentative solution, we devised our own index by subtracting hi from vi. Table 4 summarizes the findings of the comparison. Whereas we used a list created by Wichmann and Holman (2009), only a subset of it is shown here, for the following reasons. First, some of the statements referred to feature values rather than feature types. Second, some of the statements concerned features below the coverage threshold we set in Section 2.1. Although our results were not strongly supported in every case, they were largely consistent with previous findings. Somewhat surprisingly, tone was judged only moderately horizontally diffusible by the autologistic model, even though it is known as a typical areal feature.
As Wichmann and Holman (2009) observed, horizontal diffusion appears to exhibit area specificity. A possible solution to this heterogeneity is to introduce a hierarchical Bayesian model over hi, in a way similar to relaxed molecular clock methods in evolutionary biology (Drummond and Bouckaert 2015: Chapter 4.3): instead of letting the single parameter hi control the whole world, we first draw a global parameter and then use it to draw area-specific parameters that are close to the global parameter on average but can be distant from it.

Table 4. A comparison of the estimated parameters with statements in the literature compiled by Wichmann and Holman (2009).

Statements in the literature  Feature  vi  hi  vi−hi  Agreement
i. Dem-N, Num-N, Adj-N less stable than Gen-N and Rel-N (Hawkins, 1983: 93)
  88A  0.0136  0.0134  0.0002
  89A  0.0147  0.0150  −0.0003
  87A  0.0183  0.0106  0.0077  Yes
  86A  0.0243  0.0140  0.0102
  90A  0.0168  0.0043  0.0125
ii. Place of adposition appears to be stable (Nichols, 1995: 352); adpositions are stable (Croft, 1996: 206–7)  85A  0.0228  0.0144  0.0084  Yes
iii. Definite articles are stable (Croft, 1996: 206–7)  37A  0.0126  0.0102  0.0024  Moderate
iv. Tones are stable but also areal (Nichols, 1995: 343, 2003: 307)  13A  0.0091  0.0025  0.0066  Partial
v. Cases are stable (Nichols, 2003: 286, 307)  51A  0.0164  0.0110  0.0054  Yes
vi. Numerical classifiers do not have high probabilities for inheritance (Nichols, 2003: 299)  55A  0.0051  0.0071  −0.0020  Yes
vii. A-removing inflection (or very regular derivation) on verbs (passive, etc.) is stable (Nichols, 1995: 343)  107A  0.0070  0.0140  −0.0070  No?

For practical purposes, we labeled vi≥0.01 as vertically stable, 0.01>vi≥0.005 as moderate, and vi<0.005 as unstable. Similarly, we labeled hi≥0.01 as horizontally diffusible, 0.01>hi≥0.005 as moderate, and hi<0.005 as undiffusible. For the tentative stability index vi−hi, we set two thresholds, namely, 0.003 and −0.003, to create a three-way division.

3.3 Possible extensions

The autologistic model is conceptually simple and easy to extend. In this section, we discuss several possible extensions that can be made to it.

3.3.1 Phylogenetically and spatially coherent samples

To account for uncertainty about data, a Bayesian model sometimes incorporates posterior samples generated by another Bayesian model (Pagel et al. 2004). Given that our model yielded reasonably good results on missing value imputation, we consider that the samples of x_i^mis are potentially useful for such a model. It is noteworthy that another approach exists for imputing missing values (Murawaki 2015; Takamura et al. 2016).
This approach is based on the idea that some features depend on others. Greenberg (1963) presents several rules that hold true for the world’s languages, one of which stipulates the following: in languages with prepositions, the genitive almost always follows the governing noun, whereas in languages with postpositions, it almost always precedes it. Practical applications are not limited to pairs of features; instead, a set of features is used to predict a missing feature. During a preprocessing step, Murawaki (2015) used a variant of multiple correspondence analysis (Josse et al. 2012) to impute missing values. Takamura et al. (2016) chose a logistic model to investigate the predictive power of features. In their experiments, the discriminative classifier was given all but one feature of a given language, and the value of the remaining feature was predicted. One language at a time was selected for evaluation, and the classifier was trained using all of the other languages in the typological database. Compared with the dependency-based approach, the autologistic model can be considered a proximity-based approach. Because of different experimental settings, we cannot directly compare our results with those of dependency-based methods, but the dependency-based methods appear to achieve larger improvements over their baselines. It is noteworthy that proximity is implicitly exploited by the dependency-based approach because languages with similar feature value combinations often happen to be phylogenetically or spatially close. In fact, Takamura et al. (2016) conducted a type of ablation experiment in which they modified the training data either by removing languages that shared an ancestral population with the target language or by excluding languages that exhibited spatial proximity to the target. They demonstrated that accuracy declined in both settings.
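The leave-one-language-out setup described above can be sketched as follows. This is a toy illustration, not the authors’ code: the minimal from-scratch softmax classifier below stands in for whatever logistic implementation Takamura et al. (2016) used, and the one-hot encoding of categorical features is our own assumption.

```python
import numpy as np

def one_hot(values, n_classes):
    """Encode an integer-coded feature column as one-hot vectors."""
    out = np.zeros((len(values), n_classes))
    out[np.arange(len(values)), values] = 1.0
    return out

def fit_softmax(X, y, n_classes, lr=0.5, steps=500):
    """Minimal multinomial logistic regression trained by gradient ascent."""
    W = np.zeros((X.shape[1], n_classes))
    Y = one_hot(y, n_classes)
    for _ in range(steps):
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        W += lr * X.T @ (Y - P) / len(y)
    return W

def impute_feature(data, n_values, target_feat, target_lang):
    """Predict data[target_lang, target_feat] from that language's other
    features, training on all other languages (leave-one-language-out)."""
    feats = [f for f in range(data.shape[1]) if f != target_feat]
    X = np.hstack([one_hot(data[:, f], n_values[f]) for f in feats]
                  + [np.ones((len(data), 1))])  # bias column
    train = np.arange(len(data)) != target_lang
    W = fit_softmax(X[train], data[train, target_feat], n_values[target_feat])
    return int(np.argmax(X[target_lang] @ W))
```

Note that in a realistic run, missing values in the predictor features would themselves have to be handled first, which is exactly where the preprocessing step of Murawaki (2015) comes in.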
This finding implies that vertical and horizontal clues are useful for imputing missing values, although the experiments do not control for the tendency of smaller amounts of training data to decrease test accuracy. Nevertheless, the proximity-based approach is largely complementary to the dependency-based one. An advantage of the proximity-based approach over the dependency-based one is that the former produces phylogenetically and spatially coherent samples. Because a dependency-based model imputes the missing features of each language independently of those of other languages, the imputed languages are likely to disrupt vertical and horizontal signals. A combined model, which we would like to work on in the future, would be expected to improve the accuracy of missing value imputation while retaining phylogenetic and spatial coherence.

3.3.2 Parameters for feature values

We defined the parameters of vertical stability vi and horizontal diffusibility hi for each feature type. However, some feature values could be more vertically stable and/or horizontally diffusible than others. Estimating these parameters for each feature value is straightforward: it merely requires replacing the scalar vi with a vector v_i=(v_{i,1},…,v_{i,F_i}), and hi with h_i=(h_{i,1},…,h_{i,F_i}), and modifying the model accordingly.

3.3.3 Geography and cultural contacts

The spatial neighbor graph was constructed by connecting all languages located within a distance of 1,000 km of each other. This appears to be a rather arbitrary method because, for example, the Himalayan range is much more difficult to cross than the Eurasian Steppe. A sophisticated landscape-based (Bouckaert et al. 2012) and/or human-centric (Marck 1986) model would thus be needed to take account of such ecological factors. A much simpler extension to the model is a weighted variant of the spatial neighbor graph in which a spatially distant pair of languages carries a smaller weight.
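Construction of the spatial neighbor graph, with and without distance weighting, can be sketched as follows. The great-circle (haversine) distance and the linear decay of edge weights are our illustrative choices; the article does not specify the weighting scheme beyond saying that a distant pair carries a smaller weight.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def spatial_neighbor_graph(coords, threshold_km=1000.0, weighted=False):
    """Connect every pair of languages within threshold_km of each other.

    coords: dict mapping language name -> (lat, lon).
    Returns a dict mapping (lang_a, lang_b) -> edge weight.  Every edge
    has weight 1 in the unweighted variant; in the weighted variant the
    weight decays linearly with distance (one possible scheme)."""
    langs = list(coords)
    edges = {}
    for i, a in enumerate(langs):
        for b in langs[i + 1:]:
            d = haversine_km(*coords[a], *coords[b])
            if d <= threshold_km:
                edges[(a, b)] = 1.0 - d / threshold_km if weighted else 1.0
    return edges
```

In the weighted variant, the pair counts H(xi) used by the model would then be replaced by sums of these edge weights, which, as reported below, changed the imputation results only marginally (62.03 to 62.16% microaveraged accuracy).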
For the weighted variant, the number of pairs H(xi) is replaced with the sum of the edge weights. Preliminary experiments indicated that this simple weighting scheme did not lead to significant differences in the results, only slightly improving performance in terms of missing value imputation (from 62.03 to 62.16% in microaveraged accuracy). We also tested the spatial neighbor graph with different distance thresholds. In terms of missing value imputation, the 500 km threshold yielded the best performance, but the performance gaps were not large (Supplementary Fig. S4). Thus, tuning the distance threshold is of minor importance. Another factor that needs to be considered is cultural contact, because the duration of contact between a pair of languages is not necessarily proportional to their spatial proximity. If some proxy data for the degree of cultural contact could be obtained, these data could be incorporated into H(xi).

3.3.4 Explaining the universality factor

In the present study, universality is another name for a bias term. It was incorporated into the autologistic model only to correct for the default distribution explained neither by vertical stability nor by horizontal diffusibility. However, uncovering universals is one of the main goals of linguistic typology. In fact, the autologistic model can be viewed from the opposite side: to uncover something truly universal, we need to correct for the fact that languages are not independent (Daumé and Campbell 2007), and the autologistic model does exactly that. Thus, explaining the universality factor by means of extralinguistic features (Lupyan and Dale 2010; Everett et al. 2015), for example, would be interesting future work. If data for such features are available, they can be straightforwardly incorporated into the model.

In conclusion, we have presented a statistical model that jointly models the distribution of typological features in terms of vertical stability, horizontal diffusibility, and universality.
The application of missing value imputation demonstrated that the estimated parameters were meaningful. We believe that the present study will contribute to the broader interdisciplinary research community on language evolution. Though we have emphasized statistical modeling in this article, a future in-depth analysis of the estimated parameters would be fruitful. Although computer-intensive, the autologistic model is conceptually simple and extensible. Extensions to the model are a promising future direction.

Supplementary data

Supplementary data are available at Journal of Language Evolution online.

Footnotes

1. Once the genetic relationship of languages is established, other types of evidence, such as the ordering of regular sound changes and historical changes in inflectional paradigms, would be needed to subgroup them.

2. Like Dediu and Cysouw (2013) and others, we treat each typological feature separately. Modeling the dependencies of a typological feature on other typological features (Murawaki, 2015) or on extralinguistic features such as social structure (Lupyan and Dale, 2010) and climate (Everett et al., 2015) is beyond the scope of the present study.

3. While one of the main goals of linguistic typology is to uncover universals, universality in our framework is nothing more than a factor that can be explained neither phylogenetically nor areally. We will revisit this point in Section 3.3.4.

4. da Silva and Tehrani (2016) chose the distance threshold such that the average number of spatial neighbors was roughly the same as that of phylogenetic neighbors. In our case, however, the former is nearly three times as large as the latter.

5. Towner et al. (2012) performed model selection to assess the necessity of the vertical and horizontal factors. In Bayesian statistics, the standard choice for model selection is to use Bayes factors.
However, estimating Bayes factors for the autologistic model is challenging because it entails yet another layer of intractability (this task is thus called a triply intractable problem) (Friel, 2013), not to mention the marginalization of missing values. We leave it for future work.

Acknowledgements

The key ideas and early experimental results were presented at the 26th International Conference on Computational Linguistics (Yamauchi and Murawaki 2016). This article, which presents updated results, is a substantially extended version of the earlier conference paper. The work of the author K.Y. was done when he was a student in the Graduate School of Informatics, Kyoto University.

Funding

This work was partly supported by JSPS KAKENHI (Grant Number 26730122).

References

Besag J. (1974) ‘Spatial Interaction and the Statistical Analysis of Lattice Systems’, Journal of the Royal Statistical Society. Series B (Methodological), 36/2: 192–236.

Bishop C. M. (2006) Pattern Recognition and Machine Learning. Springer.

Bouckaert R., et al. (2012) ‘Mapping the Origins and Expansion of the Indo-European Language Family’, Science, 337/6097: 957–60. doi: 10.1126/science.1219669.

Campbell L. (2006) ‘Areal Linguistics’, in Encyclopedia of Language and Linguistics, 2nd edn, pp. 454–60. Elsevier.

Collard M., Shennan S. J. (2000) ‘Ethnogenesis Versus Phylogenesis in Prehistoric Culture Change: A Case-Study Using European Neolithic Pottery and Biological Phylogenetic Techniques’, in Renfrew C., Boyle K. (eds) Archaeogenetics: DNA and the Population Prehistory of Europe. McDonald Institute for Archaeological Research.

Croft W. (1996) Typology and Universals, 1st edn. Cambridge University Press.

da Silva S. G., Tehrani J. J. (2016) ‘Comparative Phylogenetic Analyses Uncover the Ancient Roots of Indo-European Folktales’, Royal Society Open Science, 3/1: 150645. doi: 10.1098/rsos.
Daumé H. III (2009) ‘Non-Parametric Bayesian Areal Linguistics’, in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 593–601, http://www.aclweb.org/anthology/N/N09/N09-1067.bib.

Daumé H. III, Campbell L. (2007) ‘A Bayesian Model for Discovering Typological Implications’, in Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 65–72, http://www.aclweb.org/anthology/P/P07/P07-1009.bib.

Dediu D. (2010) ‘A Bayesian Phylogenetic Approach to Estimating the Stability of Linguistic Features and the Genetic Biasing of Tone’, Proceedings of the Royal Society of London B: Biological Sciences, 278/1704: 474–9. doi: 10.1098/rspb.2010.1595.

Dediu D., Cysouw M. (2013) ‘Some Structural Aspects of Language are More Stable than Others: A Comparison of Seven Methods’, PLoS One, 8/1: e55009. doi: 10.1371/journal.pone.0055009.

Drummond A. J., Bouckaert R. R. (2015) Bayesian Evolutionary Analysis with BEAST. Cambridge University Press.

Dunn M., et al. (2005) ‘Structural Phylogenetics and the Reconstruction of Ancient Language History’, Science, 309/5743: 2072–5. doi: 10.1126/science.1114615.

Dunn M., et al. (2011) ‘Evolved Structure of Language Shows Lineage-Specific Trends in Word-Order Universals’, Nature, 473/7345: 79–82. doi: 10.1038/nature09923.

Everett C., Blasi D. E., Roberts S. G. (2015) ‘Climate, Vocal Folds, and Tonal Languages: Connecting the Physiological and Geographic Dots’, Proceedings of the National Academy of Sciences, 112/5: 1322–7. doi: 10.1073/pnas.1417413112.

Felsenstein J. (1981) ‘Evolutionary Trees from DNA Sequences: A Maximum Likelihood Approach’, Journal of Molecular Evolution, 17/6: 368–76. doi: 10.1007/BF01734359.

Friel N. (2013) ‘Evidence and Bayes Factor Estimation for Gibbs Random Fields’, Journal of Computational and Graphical Statistics, 22/3: 518–32. doi: 10.1080/10618600.2013.778780.

Gray R. D., Atkinson Q. D. (2003) ‘Language-Tree Divergence Times Support the Anatolian Theory of Indo-European Origin’, Nature, 426/6965: 435–9. doi: 10.1038/nature02029.

Greenberg J. H., ed. (1963) Universals of Language. MIT Press.

Greenhill S. J., et al. (2010) ‘The Shape and Tempo of Language Evolution’, Proceedings of the Royal Society B: Biological Sciences, 277/1693: 2443–50. doi: 10.1098/rspb.2010.0051.

Hammarström H., et al., eds. (2016) Glottolog 2.7. Max Planck Institute for the Science of Human History. <http://glottolog.org>

Haspelmath M., et al., eds. (2005) The World Atlas of Language Structures. Oxford University Press.

Hawkins J. A. (1983) Word Order Universals. Academic Press.

Josse J., et al. (2012) ‘Handling Missing Values with Regularized Iterative Multiple Correspondence Analysis’, Journal of Classification, 29/1: 91–116. doi: 10.1007/s00357-012-9097-0.

Lewis M. P., et al., eds. (2014) Ethnologue: Languages of the World, 17th edn. SIL International. <http://www.ethnologue.com>

Liang F. (2010) ‘A Double Metropolis–Hastings Sampler for Spatial Models with Intractable Normalizing Constants’, Journal of Statistical Computation and Simulation, 80/9: 1007–22. doi: 10.1080/00949650902882162.

Longobardi G., Guardiano C. (2009) ‘Evidence for Syntax as a Signal of Historical Relatedness’, Lingua, 119/11: 1679–706. doi: 10.1016/j.lingua.2008.09.012.

Lupyan G., Dale R. (2010) ‘Language Structure Is Partly Determined by Social Structure’, PLoS One, 5/1: e8559. doi: 10.1371/journal.pone.0008559.

Marck J. C. (1986) ‘Micronesian Dialects and the Overnight Voyage’, The Journal of the Polynesian Society, 95/2: 253–8.

Møller J., et al. (2006) ‘An Efficient Markov Chain Monte Carlo Method for Distributions with Intractable Normalising Constants’, Biometrika, 93/2: 451–8.

Murawaki Y. (2015) ‘Continuous Space Representations of Linguistic Typology and Their Application to Phylogenetic Inference’, in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 324–34. http://www.aclweb.org/anthology/N/N15/N15-1036.bib.

Murphy K. P. (2012) Machine Learning: A Probabilistic Perspective. MIT Press.

Murray I., Ghahramani Z., MacKay D. J. C. (2006) ‘MCMC for Doubly-Intractable Distributions’, in Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pp. 359–66, https://arxiv.org/html/1208.5160.

Nelson-Sathi S., et al. (2010) ‘Networks Uncover Hidden Lexical Borrowing in Indo-European Language Evolution’, Proceedings of the Royal Society of London B: Biological Sciences, 278/1713: 1794–803. doi: 10.1098/rspb.2010.1917.

Nichols J. (1992) Linguistic Diversity in Space and Time. University of Chicago Press.

Nichols J. (1994) ‘The Spread of Language Around the Pacific Rim’, Evolutionary Anthropology: Issues, News, and Reviews, 3/6: 206–15. doi: 10.1002/evan.1360030607.

Nichols J. (1995) ‘Diachronically Stable Structural Features’, in Andersen H. (ed.) Historical Linguistics 1993. Selected Papers from the 11th International Conference on Historical Linguistics, Los Angeles 16–20 August 1993. John Benjamins Publishing Company.

Nichols J. (2003) ‘Diversity and Stability in Languages’, in Joseph B. D., Janda R. D. (eds) The Handbook of Historical Linguistics, pp. 283–310. Oxford University Press.

Pagel M., et al. (2013) ‘Ultraconserved Words Point to Deep Language Ancestry across Eurasia’, Proceedings of the National Academy of Sciences, 110/21: 8471–6. doi: 10.1073/pnas.1218726110.

Pagel M., Meade A., Barker D. (2004) ‘Bayesian Estimation of Ancestral Character States on Phylogenies’, Systematic Biology, 53/5: 673–84. doi: 10.1080/10635150490522232.

Parkvall M. (2008) ‘Which Parts of Language are the Most Stable?’, STUF: Language Typology and Universals / Sprachtypologie und Universalienforschung, 61/3: 234–50.

Sijthoff A. W. (1928) Actes du premier congrès international de linguistes à la Haye, du 10–15 avril 1928.

Swadesh M. (1971) The Origin and Diversification of Language. Aldine Atherton.

Takamura H., Nagata R., Kawasaki Y. (2016) ‘Discriminative Analysis of Linguistic Features for Typological Study’, in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 69–76, http://www.lrec-conf.org/proceedings/lrec2016/summaries/114.html.

Towner M. C., et al. (2012) ‘Cultural Macroevolution on Neighbor Graphs: Vertical and Horizontal Transmission among Western North American Indian Societies’, Human Nature, 23/3: 283–305. doi: 10.1007/s12110-012-9142-z.

Tsunoda T., Ueda S., Itoh Y. (1995) ‘Adpositions in Word-Order Typology’, Linguistics, 33/4: 741–62. doi: 10.1515/ling.1995.33.4.741.

Wichmann S. (2015) ‘Diachronic Stability and Typology’, in Bowern C., Evans B. (eds) The Routledge Handbook of Historical Linguistics, pp. 212–24. Routledge.

Wichmann S., Holman E. W. (2009) Temporal Stability of Linguistic Typological Features. Lincom Europa.

Yamauchi K., Murawaki Y. (2016) ‘Contrasting Vertical and Horizontal Transmission of Typological Features’, in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 836–46, http://www.aclweb.org/anthology/C/C16/C16-1080.bib.

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

Journal of Language Evolution, Oxford University Press

ISSN: 2058-4571
eISSN: 2058-458X
DOI: 10.1093/jole/lzx022
It has been argued that in exceptional cases, typological features can reflect a time span of 10,000 years or more (Nichols 1994).
One of the greatest challenges of typology-based phylogenetic inference is that, in contrast to lexical evidence, which is the default choice of historical–comparative linguistics, typological features are characterized by uncertainty. Assessing whether words grouped together are cognates that exhibit regular sound correspondences requires binary judgments, for which the proficiency of logic-based human inference has already been demonstrated. Because of the arbitrary nature of linguistic signs, it is extremely unlikely that unrelated words coincidentally follow sound laws. Thus, once a sufficient number of cognates are identified within a group of languages, we can reasonably conclude that they share a common ancestor.1 In contrast, the sharing of the same basic word order, for example SOV, is only a weak indicator of shared ancestry because this pattern recurs multiple times in multiple places. Whereas human beings are not adept at handling uncertainties, we believe that computer-intensive statistical methods (Tsunoda et al. 1995; Dunn et al. 2005; Longobardi and Guardiano 2009; Murawaki 2015) can overcome this problem because they provide effective ways of combining weak signals. In this article, we take a step toward developing typology-based phylogenetic inference. Specifically, we engage with the problem of vertical (phylogenetic) signals being obscured by horizontal (spatial or areal) transmission. Like loanwords, typological features can be borrowed from one language into another. In fact, a non-tree-like mode of evolution has emerged as one of the central topics in linguistic typology (Sijthoff 1928; Campbell 2006). However, the tree model, which is the standard statistical model used for phylogenetic inference, only reflects vertical transmission.
The task of incorporating both vertical and horizontal transmission within a statistical model is notoriously challenging because of the excessive flexibility of horizontal transmission. Although the tree model also entails a vast search space, this issue can be handled using computer-intensive methods (Felsenstein 1981; Gray and Atkinson 2003). Whereas the degree of uncertainty can intuitively be expected to increase as we trace the states of languages back into the past, the tree model simultaneously reduces the degrees of freedom by repeatedly merging nodes into a parent. Horizontal transmission, by contrast, introduces degrees of freedom that cannot currently be modeled without imposing some strong assumptions (Daumé 2009; Nelson-Sathi et al. 2010). Consequently, here we pursue a line of research in linguistic typology that draws on information on the current distribution of typological features without explicitly requiring the reconstruction of previous states (Nichols 1992, 1995; Parkvall 2008; Wichmann and Holman 2009). Studies applying this approach have attempted to quantify how stable typological features are. In their original forms, typological features constitute a mosaic exhibiting varying degrees of stability. If we succeed in discovering stable features, these could be regarded as typological equivalents of basic vocabulary, that is, a list of basic concepts. Words within such a list are assumed to be resistant to borrowing (Swadesh 1971), and some words are even considered to be extremely stable (Pagel et al. 2013). Such stable traits can potentially enable deep phylogenetic relations among the world’s languages to be uncovered. How, then, is the notion of stability connected to vertical and horizontal transmission? Answering this question is not as simple as it may seem. Previous studies offer a surprisingly wide variety of perspectives on it (Dediu and Cysouw 2013).
However, it is evident that at least three factors must be considered: (1) vertical stability, (2) horizontal diffusibility, and (3) universality.2 In other words, there are three possible explanations for the sharing of the same feature value by a pair of languages:

1. The two languages may be phylogenetically related, with the feature demonstrating a strong phylogenetic association (i.e., vertical stability).
2. The two languages may be spatially proximate, with the feature demonstrating a strong spatial association (i.e., horizontal diffusibility).
3. The explanation may lie in a bias term: the more often the feature value appears globally, the higher the probability of a language assuming that value.3

Of course, these three explanations are not mutually exclusive. The current distribution of typological features is more or less affected by all three factors. Thus, the first step toward developing typology-based phylogenetic inference is to model all these factors jointly.

1.1 A solution from cultural anthropology

We suggest that a parallel can be drawn between cultural anthropology and linguistics with respect to vertical and horizontal transmission. Within cultural anthropology, these two modes of transmission are known as phylogenesis and ethnogenesis, respectively (Collard and Shennan 2000). Phylogenesis involves the transmission of cultural traits primarily from an ancestral population to its descendants, whereas ethnogenesis entails strong influences resulting from transmission between populations. A model that is potentially capable of disentangling these factors has already been proposed by Towner et al. (2012). This model is a variant of the autologistic model (Besag 1974), which has been widely used to model the spatial distribution of a feature. The model assumes that the value of a random variable depends probabilistically on the values of its neighbors. Towner et al.
(2012)’s extension incorporates two neighbor graphs: one for vertical transmission and the other for horizontal transmission. These graphs are associated with scalar parameters indicating the relative strengths of the two transmission modes. Towner et al. (2012) applied the autologistic model to cultural traits (features) of North American Indian societies, such as the presence or absence of agriculture, the tendency toward exogamy, and types of social structure. They constructed a spatial neighbor graph using longitudinal and latitudinal data on the societies, as well as a phylogenetic neighbor graph created by collapsing known language families. Applying this model, they empirically demonstrated that both transmission modes were non-negligible for the majority of traits. The same model was subsequently applied to folktales of Indo-European societies by da Silva and Tehrani (2016). As an intermediate step toward discovering folktales with deep historical roots, the model was used to filter out tales with strong horizontal signals. Although the key idea of jointly modeling vertical and horizontal transmission was built into the model formulated by Towner et al. (2012), it was not readily applicable to typological data. For this reason, we modified the model and added a new inference algorithm, as described in this article. Our model primarily differs from the above-described model in two ways. First, whereas the original model only deals with binary features, our extended model can handle categorical features. Although the original database of cultural traits was coded categorically, Towner et al. (2012) selected a small subset of it by excluding unsuitable features and merging feature values. Second, because Towner et al. (2012) only focused on high-coverage features, they simply excluded languages with missing values from the neighbor graphs. However, the linguistic typological database is too sparse to ignore missing values.
The situation is unlikely to change in the foreseeable future because there are thousands of languages in the world, and only a handful are amply documented. Consequently, we take a Bayesian approach because it is known for its robustness in modeling uncertainties (Murphy 2012: Chapter 2). We used a Markov chain Monte Carlo (MCMC) method to jointly infer the scalar parameters and missing values. The source code is available from GitHub at: https://github.com/murawaki/bayes-autologistic.

1.2 Brief review of previous studies

Though a thorough review of previous studies is beyond the scope of this article, to the best of our knowledge, no previous study in linguistic typology has jointly modeled all three factors in a straightforward manner. More details on these earlier studies can be found in a review by Wichmann (2015), as well as in a review and empirical comparison conducted by Dediu and Cysouw (2013). Here we briefly describe how three approaches described in the literature have ignored one or more factors or have suffered from contamination. Dediu (2010) used the tree model to infer stability (hereafter, stability D). By design, stability D does not explicitly reflect horizontal diffusibility. He assumed that a feature is passed on from a parent to a child within a family tree and obeys a continuous-time Markov chain model in which the transition rate matrix is constrained to have only one parameter. This parameter is estimated for each of several given trees and is then merged into one index, that is, stability D. In general, a transition rate matrix controls not just the speed of change (i.e., instability) but also the stationary distribution toward which the model converges. We can consider the stationary distribution to reflect universality because it is where vertical influence vanishes.
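To make the role of the stationary distribution concrete, the symmetric one-parameter rate matrix Q = [[-r, r], [r, -r]] admits a closed-form transition matrix exp(Qt). A minimal sketch in plain Python; the rate r and the times t are arbitrary illustrative values, not estimates from any data:

```python
import math

def binary_ctmc_transition(r, t):
    """Closed-form transition matrix P(t) = exp(Qt) for the symmetric
    one-parameter rate matrix Q = [[-r, r], [r, -r]]."""
    decay = math.exp(-2.0 * r * t)
    same = 0.5 * (1.0 + decay)  # probability of remaining in the current state
    diff = 0.5 * (1.0 - decay)  # probability of switching to the other state
    return [[same, diff], [diff, same]]

# As t grows, every row of P(t) approaches (0.5, 0.5) regardless of r:
# the single parameter controls only the speed of change, not the
# distribution toward which the chain converges.
for t in (0.1, 1.0, 100.0):
    print(t, binary_ctmc_transition(r=0.7, t=t)[0])
```

This is exactly the triviality discussed next: with one parameter, the stationary distribution of a binary feature is pinned to (0.5, 0.5).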
However, the stationary distribution in Dediu's model is trivial because only one parameter controls the transition rate matrix. A binary-coded feature, for example, is doomed to take value 0 with probability 0.5 and value 1 with probability 0.5. In short, stability D also ignores universality. Parkvall (2008) presented a series of calculations for deriving a stability index (stability P). First, he computed the Herfindahl–Hirschman index for a given group G, defined either phylogenetically or areally (spatially):

\[ D_G = 1 - \sum_j (p_j)^2, \]

where pj is the relative frequency of the feature value j in G. Because this is a diversity index, he calculated a homogeneity index, HG, as the reciprocal of DG. He then averaged HG across all phylogenetic groups to obtain HFAM. HARE denotes the areal equivalent of HFAM. As a final step, he divided HFAM by HARE to obtain stability P. Stability P takes into account both vertical stability (HFAM) and horizontal diffusibility (HARE). However, DG suffers from noise because G does not reflect purely phylogenetic (or spatial) signals. A model that performs one-step disentanglement would thus be preferable to a combination of noisy estimation and correction. The question of universality remains; it is obscured by the series of calculations. However, HFAM should probably be seen as an amalgam of vertical stability and universality (the same is true of HARE). The stability index developed by Wichmann and Holman (2009) (stability W) is similar to that of Parkvall, but it only considers phylogenetic groups. Whereas Wichmann and Holman introduced three metrics, here we focus on metric C, which has been found to perform the best in simulation experiments. For each phylogenetic group G, they calculated VG, the proportion of language pairs in G that share the same feature value. Next, they averaged VG across all phylogenetic groups, weighting each group by the inverse of the square of the number of languages in G:

\[ R = \sum_G \frac{V_G}{|G|^2}. \]
Similarly, they calculated U as the proportion of pairs of unrelated languages sharing the same value. Stability W was obtained by correcting for U as follows:

\[ \frac{R - U}{1 - U}. \]

Though stability W accounts for vertical stability and universality, it does not explicitly consider horizontal diffusibility. Wichmann and Holman (2009) justified this exclusion by pointing out that diffusibility is largely area-specific. Nevertheless, it is desirable to control for horizontal diffusibility when computing vertical stability. To summarize, the methods reviewed above either simply ignore one or more factors or perform noisy estimation followed by a correction step. In what follows, we will show that all three factors can jointly be modeled in a straightforward manner. We also argue that it is better to keep vertical stability and horizontal diffusibility apart, rather than conflating them into a single stability index.

2. Materials and methods

2.1 Data set

Although the proposed model does not depend on a particular data set, in this article we focus on one database, namely, the online edition of the World Atlas of Language Structures (WALS) (Haspelmath et al. 2005). As of 2017, this atlas covered 2,679 languages and 192 typological features. However, the associated language–feature matrix was very sparse: less than 15% of its items were present. An item of the language–feature matrix is a categorical value. For example, Feature 81A, 'Order of Subject, Object and Verb', had seven possible values: SOV, SVO, VSO, VOS, OVS, OSV, and No dominant order, and each language was assigned one of these seven values. We used these categorical feature values without making any modifications. Although merging some fine-grained feature values appears desirable (Daumé and Campbell 2007; Dediu 2010; Greenhill et al. 2010; Dunn et al. 2011), we leave this for future investigation.
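To make the data representation concrete, the sparse language–feature matrix can be sketched as a mapping from languages to partially observed categorical values. A minimal illustration in Python; the language names and values below are invented, not actual WALS entries:

```python
# None marks a missing item of the language-feature matrix.
matrix = {
    "lang1": {"81A": "SOV", "13A": None},
    "lang2": {"81A": "SVO", "13A": "No tones"},
    "lang3": {"81A": None,  "13A": None},
}

def coverage(matrix, feature):
    """Fraction of languages with an observed value for the given feature."""
    observed = sum(1 for feats in matrix.values()
                   if feats.get(feature) is not None)
    return observed / len(matrix)

print(coverage(matrix, "81A"))  # 2 of the 3 toy languages are observed
```

Coverage computed this way is the quantity used below when filtering features by how many languages have an observed value.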
Some previous studies discarded languages with few observed features (Murawaki 2015; Takamura et al. 2016). To avoid introducing arbitrariness, we included all but 72 languages in the database. The excluded languages were sign languages and pidgin and creole languages, all of which are categorized in WALS as belonging to a dummy phylogenetic group labeled 'other'. Of the 192 features, we selected the 82 features that covered at least 10% of the remaining languages and applied to the entire world. For example, we dropped Feature 144H ('NegSVO Order') because it is only defined for SVO languages. WALS was also the source for constructing the phylogenetic and spatial neighbor graphs. The atlas provides two levels of grouping: family and genus. We used the genus level to construct a phylogenetic neighbor graph in which every pair of languages within a genus was linked. The average number of phylogenetic neighbors was 30.8. Other resources, such as Glottolog (Hammarström et al. 2016) and Ethnologue (Lewis et al. 2014), provide more detailed hierarchical classifications. However, there seemed to be no way to select groups of languages of comparable time depth. Each language is associated with a single-point geographical coordinate (longitude and latitude), as shown in Fig. 1. We constructed a spatial neighbor graph by linking all language pairs located within a distance of R km. Following da Silva and Tehrani (2016), we set the value of R to 1,000. The average number of spatial neighbors was 89.1.4 The phylogenetic and spatial neighbor graphs had substantial overlap: on average, 71.4% of phylogenetic neighbors were simultaneously spatial neighbors, while 20.1% of spatial neighbors were simultaneously phylogenetic neighbors. Figure 1. The geographical distribution of Feature 97A ('Relationship between the Order of Object and Verb and the Order of Adjective and Noun').
Each point denotes a language. There is a non-negligible number of missing values (denoted as N/A).

2.2 Counting functions

We begin our description of the model by introducing three counting functions corresponding to the vertical, horizontal, and universal factors. A counting function receives the feature values of the world's languages and returns a non-negative scalar. Formally, let xi=(x1,i,…,xL,i)∈Xi be a sequence of feature values, where xl,i is the value of the l-th language for feature type i. xl,i takes one of Fi categorical values, and Fi differs across feature types. For now, we assume that there are no missing values. The vertical factor is represented by V(xi), which returns the number of pairs sharing the same value in the phylogenetic neighbor graph. Figure 2 provides an example. The intuition behind this function is that if the feature in question is vertically stable, then a phylogenetically defined group of languages will tend to share the same value. A more direct counting method would check how many changes have occurred in parent-to-child transitions within the trees. However, this alternative would be difficult to implement because the feature values of ancestors remain unknown. Consequently, we chose a counting function that only refers to contemporary languages. Figure 2. The phylogenetic neighbor graph and V(xi). Each node denotes a language. Languages are grouped by common ancestors whose feature values are unknown. The neighbor graph consists of edges, each of which connects a pair of languages within the same group. In this example, two edges indicate the sharing of feature values.
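The counting function V(xi) can be sketched as a single pass over the edge list of the phylogenetic neighbor graph; the same function serves for its spatial counterpart with the spatial edge list. The edges and values below are a toy example in the spirit of Figure 2, not actual data:

```python
def count_agreeing_pairs(edges, values):
    """Number of neighbor pairs whose two languages share the same value."""
    return sum(1 for a, b in edges if values[a] == values[b])

# Toy neighbor graph: three edges, of which two connect same-valued languages.
edges = [("A", "B"), ("B", "C"), ("D", "E")]
values = {"A": "SOV", "B": "SOV", "C": "SVO", "D": "SOV", "E": "SOV"}
print(count_agreeing_pairs(edges, values))  # 2 (the A-B and D-E edges agree)
```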
H(xi) is the spatial equivalent of V(xi), as illustrated in Fig. 3. If the feature in question is horizontally diffusible, then spatially close languages would be expected to frequently share the same feature value. Figure 3. The spatial neighbor graph and H(xi). Of the five edges in the neighbor graph, four indicate feature value-sharing. Last, Uj(xi) is the number of languages that take the value j. If j is universally common, then Uj(xi) must be large.

2.3 Autologistic model

We now introduce the following parameters: vertical stability vi, horizontal diffusibility hi, and universality ui=(ui,1,…,ui,Fi) for each feature i. Applying the autologistic model, the probability of xi, conditioned on vi, hi, and ui, can be expressed as:

\[ P(x_i \mid v_i, h_i, u_i) = \frac{\exp\bigl(v_i V(x_i) + h_i H(x_i) + \sum_j u_{i,j} U_j(x_i)\bigr)}{\sum_{x_i' \in \mathcal{X}_i} \exp\bigl(v_i V(x_i') + h_i H(x_i') + \sum_j u_{i,j} U_j(x_i')\bigr)}. \tag{1} \]

The denominator is a normalization term ensuring that the distribution sums to 1. This term is computationally intractable because there are (Fi)^L possible assignments of xi. However, as we will show, the parameters can be estimated without precisely calculating the normalization term. The autologistic model can be interpreted in terms of competition among possible assignments of xi for the probability mass of 1. If a given assignment xi has a relatively large V(xi), then setting a large value for vi enables it to appropriate fractions of the mass from its weaker rivals.
However, if too large a value is set for vi, then it will be overwhelmed by its stronger rivals. To acquire further insight into the model, let us consider the probability of language l taking the value k, conditioned on the rest of the languages, x−l,i:

\[ P(x_{l,i} = k \mid x_{-l,i}, v_i, h_i, u_i) \propto \exp\bigl(v_i V_{l,i,k} + h_i H_{l,i,k} + u_{i,k}\bigr), \tag{2} \]

where Vl,i,k is the number of language l's phylogenetic neighbors that assume the value k, and Hl,i,k is its spatial counterpart. The derivation of Equation (2) is explained in Supplementary Section S.1. The probability of language l assuming value k is expressed by a weighted linear combination of the three factors in log-space. If the vertical stability parameter vi is positive, then the probability increases with the number of phylogenetic neighbors that assume value k. However, this probability depends not only on the phylogenetic neighbors of language l but also on its spatial neighbors and on universality. How strongly these factors affect the stochastic selection is controlled by vi, hi, and ui. We employed a Bayesian approach because it is known for its robustness in modeling uncertainties. We set prior distributions on vi, hi, and ui,j as follows: vi∼Gamma(κ,θ), hi∼Gamma(κ,θ), and ui,j∼N(0,σ2), where κ, θ, and σ were hyperparameters. Note that by setting Gamma priors, we constrained vi and hi to always be positive. Negative vi and hi would be difficult to justify from a linguistic point of view because they would indicate that languages attempt to differentiate themselves from languages to which they are related. We set shape κ=1.0, scale θ=1.0, and SD σ=10.0. These priors were not non-informative because we wanted to penalize extreme values. However, they were sufficiently gentle in the regions where these parameters typically resided. The fact that the phylogenetic and spatial neighbor graphs had substantial overlap was likely to cause difficulty in disentangling the two factors.
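The conditional in Equation (2) normalizes the exponentiated scores over the Fi possible values. A sketch with illustrative (not estimated) parameter values and neighbor counts:

```python
import math

def conditional_probs(v, h, u, V_counts, H_counts):
    """P(x_l = k | rest) for each value k: a softmax over the weighted
    combination of phylogenetic neighbor counts (V_counts), spatial
    neighbor counts (H_counts), and the universality parameters u."""
    scores = [math.exp(v * V_counts[k] + h * H_counts[k] + u[k])
              for k in range(len(u))]
    total = sum(scores)
    return [s / total for s in scores]

# A binary feature: value 0 dominates among both neighbor sets, so it
# receives most of the conditional probability mass.
probs = conditional_probs(v=0.03, h=0.02, u=[0.1, -0.1],
                          V_counts=[12, 3], H_counts=[40, 10])
print(probs)
```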
As a possible solution to this problem, we considered a model variant with an additional parameter mi:

\[ P(x_i \mid v_i, h_i, m_i, u_i) = \frac{\exp\bigl(v_i \tilde{V}(x_i) + h_i \tilde{H}(x_i) + m_i M(x_i) + \sum_j u_{i,j} U_j(x_i)\bigr)}{\sum_{x_i' \in \mathcal{X}_i} \exp\bigl(v_i \tilde{V}(x_i') + h_i \tilde{H}(x_i') + m_i M(x_i') + \sum_j u_{i,j} U_j(x_i')\bigr)}, \tag{3} \]

where M(xi) counts value-sharing pairs of languages that are simultaneously phylogenetic and spatial neighbors, V˜(xi)=V(xi)−M(xi), and H˜(xi)=H(xi)−M(xi). By tying vi and hi to disjoint sets of language pairs, we expected the autologistic model to detect vertical and horizontal signals more easily.

2.4 Inference

We can now proceed to parameter estimation. Let xi be decomposed into an observed portion xiobs and a missing portion ximis. ximis, in addition to vi, hi, and ui, needed to be inferred. We used a standard MCMC sampling algorithm (Bishop 2006: Chapter 11). Specifically, we employed Gibbs sampling to draw posterior samples of vi, hi, ui, and ximis from

\[ P(v_i, h_i, u_i, x_i^{\mathrm{mis}} \mid x_i^{\mathrm{obs}}) \propto P(v_i, h_i, u_i)\, P(x_i \mid v_i, h_i, u_i) = P(v_i)\, P(h_i)\, P(u_i)\, P(x_i \mid v_i, h_i, u_i), \tag{4} \]

where hyperparameters are omitted for brevity. The algorithm is sketched in Supplementary Section S.2. Note that we needed to sample vi (and likewise hi and ui,j) from P(vi|hi,ui,xi)∝P(vi)P(xi|vi,hi,ui). This belongs to a class of problems known as sampling from doubly intractable distributions (Møller et al. 2006; Murray et al. 2006). While this remains a challenging problem in statistics, it is not difficult to sample the variables approximately if we forgo theoretical rigor (Liang 2010). The details of the algorithm we used can be found in Supplementary Section S.3.5

3. Results and discussion

3.1 Missing value imputation

While most previous studies have relied on a subjective analysis of stability indices, the quality of estimated parameters can be quantitatively measured. This is an advantage of a probabilistic model with predictive ability. Specifically, estimated parameters can be indirectly evaluated by means of missing value imputation.
If the autologistic model predicts missing values better than reasonable baselines, we can say that the estimated parameters are justified. Although no ground truth exists for the missing portion of the database, missing value imputation can be evaluated by hiding some observed values and verifying how well they are recovered. For each feature type, we conducted tenfold cross-validation. We randomly partitioned the observed values into ten blocks of almost equal size. For each of the ten runs, we treated one block as missing values and performed posterior sampling. After 200 burn-in iterations, we ran 2,495 iterations and collected samples at intervals of 5 iterations. Among the collected samples, we chose the most frequent feature value for each language as the final output. The autologistic model had two variants: the basic model (autologistic-basic) defined in Equation (1) and the extension (autologistic-disjoint) defined in Equation (3). They were compared with three baselines:

Universal: A value is derived from the empirical distribution of xiobs.

Global majority: The most frequent value among xiobs is always output.

Neighborhood: The observed values of each language's neighbors are compiled from xiobs, and a value is derived from their empirical distribution. If both graphs have observed neighbors, one graph is chosen at random. If only one of the two graphs has observed neighbors, that graph is chosen. If neither is available, the universal baseline provides a fallback.

Table 1 shows the overall results. The universal baseline performed the worst, followed, in ascending order, by the global majority and the neighborhood baselines. The autologistic model outperformed the three baselines, and the two model variants achieved comparable performance. The estimated parameters were evidently reasonable. Table 1. Results of missing value imputation.
Model                  Macro-average (%)  Micro-average (%)
Universal              42.21              42.05
Global majority        54.85              54.20
Neighborhood           57.87              58.95
Autologistic-basic     60.32              62.03
Autologistic-disjoint  60.92              62.84

Let us now take a closer look at the results. Supplementary Table S1 depicts the accuracy of missing value imputation for each feature. Figure 4 and Table 2 compare the performance of the autologistic model with those of the baselines. Although the autologistic model outperformed the neighborhood baseline for the majority of features, it demonstrated a performance loss for some features. For these features, the autologistic model fell between the neighborhood and global majority baselines, suggesting that the effect of universality was overestimated. Table 2. Features ranked by the performance gain of autologistic-basic over the neighborhood baseline.
Feature                                                  Accuracy (%)  Gain (%)
16A   Weight factors in weight-sensitive stress systems  52.52         +14.69
15A   Weight-sensitive stress                            56.54         +12.68
100A  Alignment of verbal person marking                 56.38         +11.17
116A  Polar questions                                    61.32         +9.85
111A  Nonperiphrastic causative constructions            82.20         +9.71
53A   Ordinal numerals                                   36.77         −6.77
119A  Nominal and locational predication                 69.29         −7.61
17A   Rhythm types                                       44.41         −8.70
9A    The velar nasal                                    54.27         −11.32
118A  Predicative adjectives                             49.61         −17.32

The top five and bottom five ranked features are shown. Figure 4. Accuracy per feature.

3.2 Estimated parameters

We now turn to an examination of the estimated parameters. The model settings were the same as those applied in the preceding experiment, with the exception that all of the observed values were used. Figure 5 depicts the estimated parameters for the autologistic model. Note that comparing the absolute values of a vi and an hi makes no sense because they are tied to different neighbor graphs. Table 3 lists the top five and bottom five features ranked according to vi and hi (see Supplementary Tables S2 and S3 for the complete lists). While Fig. 5 represents each feature as a single point, we can use the posterior samples to measure the uncertainty about the estimated parameters, as visualized in Supplementary Fig. S3. Table 3. Features ranked by vi and hi of autologistic-basic.
(a) Features ranked by the vertical stability parameter vi.

Feature                                                                                    vi
73A   The optative                                                                         0.0281
33A   Coding of nominal plurality                                                          0.0280
6A    Uvular consonants                                                                    0.0278
97A   Relationship between the Order of Object and Verb and the Order of Adjective and Noun  0.0251
86A   Order of genitive and noun                                                           0.0243
101A  Expression of pronominal subjects                                                    0.0035
105A  Ditransitive constructions: the verb 'give'                                          0.0034
130A  Finger and hand                                                                      0.0034
72A   Imperative–hortative systems                                                         0.0030
18A   Absence of common consonants                                                         0.0024

(b) Features ranked by the horizontal diffusibility parameter hi.

Feature                                                                                    hi
144A  Position of negative word with respect to Subject, Object, and Verb                  0.0222
97A   Relationship between the Order of Object and Verb and the Order of Adjective and Noun  0.0182
93A   Position of interrogative phrases in content questions                               0.0166
81A   Order of Subject, Object, and Verb                                                   0.0163
82A   Order of Subject and Verb                                                            0.0163
34A   Occurrence of nominal plurality                                                      0.0024
8A    Lateral consonants                                                                   0.0023
94A   Order of adverbial subordinator and clause                                           0.0022
18A   Absence of common consonants                                                         0.0019
111A  Nonperiphrastic causative constructions                                              0.0013

The top five and bottom five features are shown for vi and hi. See Supplementary Tables S2 and S3 for the complete lists and Supplementary Tables S4 and S5 for autologistic-disjoint. Figure 5. Scatter plots of estimated parameters for the autologistic model. Each point denotes the arithmetic mean of the posterior samples. A larger vi (hi) indicates that feature i is more stable (diffusible). The shapes of the points represent broad categories of features (called Area in WALS). As we stated in Section 2.1, the phylogenetic and spatial neighbor graphs had substantial overlap. In fact, the arithmetic mean of vi was moderately correlated with that of hi for autologistic-basic (Spearman's ρ=0.454). While the disjointness extension reduced the correlation to 0.170, we refrain from concluding that the two neighbor graphs exhibited exactly what we expected them to exhibit, namely, vertical and horizontal signals. We investigated the effects of coverage (the ratio of languages whose feature values are present) on the estimated parameters.
The results are shown in Supplementary Fig. S1. Coverage was moderately positively correlated with vi and hi (Spearman's ρ=0.416 and 0.425, respectively). This can be explained by the fact that word order-related features, which happen to be of high coverage thanks to Matthew Dryer (Haspelmath et al. 2005), tended to be vertically stable and horizontally diffusible. Not surprisingly, coverage showed moderate negative correlations with the widths of the 95% highest posterior density credible intervals (ρ=−0.414 for vi and −0.473 for hi), meaning that the autologistic model was more uncertain about features with larger numbers of missing values. We also checked correlation statistics regarding the number of unique values of each feature, but found only very weak to weak correlations (Supplementary Fig. S2). As a final step, we compared the estimated parameters with statements in the literature. There is no clear-cut methodological solution for comparing previously proposed stability indices with a combination of two parameters, namely, vertical stability vi and horizontal diffusibility hi. We believe that the two distinct concepts of vertical stability and horizontal diffusibility should be kept apart, rather than being conflated into a single stability index. As a tentative solution, we devised our own index by subtracting hi from vi. Table 4 summarizes the findings of the comparison. Whereas we used a list created by Wichmann and Holman (2009), only a subset of it is shown here, for the following reasons. First, some of the statements referred to feature values rather than feature types. Second, some of the corresponding features fell below the coverage threshold we set in Section 2.1. Although this comparison does not provide strong support, our results were largely consistent with previous findings. Somewhat surprisingly, tone was judged only moderately horizontally diffusible by the autologistic model, even though it is known as a typical areal feature.
As Wichmann and Holman (2009) observed, horizontal diffusion appears to exhibit area specificity. A possible solution to this heterogeneity is to introduce a hierarchical Bayesian model for hi, in a way similar to relaxed molecular clock methods in evolutionary biology (Drummond and Bouckaert 2015: Chapter 4.3): instead of letting the single parameter hi control the whole world, we first draw a global parameter and then use it to draw area-specific parameters that are close to the global parameter on average but can be distant from it. Table 4. A comparison of the estimated parameters with statements in the literature compiled by Wichmann and Holman (2009).

i. Dem-N, Num-N, Adj-N less stable than Gen-N and Rel-N (Hawkins, 1983: 93)
   88A: vi=0.0136, hi=0.0134, vi−hi=+0.0002
   89A: vi=0.0147, hi=0.0150, vi−hi=−0.0003
   87A: vi=0.0183, hi=0.0106, vi−hi=+0.0077
   86A: vi=0.0243, hi=0.0140, vi−hi=+0.0102
   90A: vi=0.0168, hi=0.0043, vi−hi=+0.0125
   Agreement: Yes
ii. Place of adposition appears to be stable (Nichols, 1995: 352); adpositions are stable (Croft, 1996: 206–7)
   85A: vi=0.0228, hi=0.0144, vi−hi=+0.0084
   Agreement: Yes
iii. Definite articles are stable (Croft, 1996: 206–7)
   37A: vi=0.0126, hi=0.0102, vi−hi=+0.0024
   Agreement: Moderate
iv. Tones are stable but also areal (Nichols, 1995: 343, 2003: 307)
   13A: vi=0.0091, hi=0.0025, vi−hi=+0.0066
   Agreement: Partial
v. Cases are stable (Nichols, 2003: 286, 307)
   51A: vi=0.0164, hi=0.0110, vi−hi=+0.0054
   Agreement: Yes
vi. Numerical classifiers do not have high probabilities for inheritance (Nichols, 2003: 299)
   55A: vi=0.0051, hi=0.0071, vi−hi=−0.0020
   Agreement: Yes
vii. A-removing inflection (or very regular derivation) on verbs (passive, etc.) is stable (Nichols, 1995: 343)
   107A: vi=0.0070, hi=0.0140, vi−hi=−0.0070
   Agreement: No?

For practical purposes, we labeled vi≥0.01 as vertically stable, 0.01>vi≥0.005 as moderate, and vi<0.005 as unstable. Similarly, we labeled hi≥0.01 as horizontally diffusible, 0.01>hi≥0.005 as moderate, and hi<0.005 as undiffusible. For the tentative stability index vi−hi, we set two thresholds, 0.003 and −0.003, to create a three-way division.

3.3 Possible extensions

The autologistic model is conceptually simple and easy to extend. In this section, we discuss several possible extensions.

3.3.1 Phylogenetically and spatially coherent samples

To account for uncertainty about data, a Bayesian model sometimes incorporates posterior samples generated by another Bayesian model (Pagel et al. 2004). Given that our model yielded reasonably good results on missing value imputation, we consider that the samples of ximis are potentially useful for such a model. It is noteworthy that another approach exists for imputing missing values (Murawaki 2015; Takamura et al. 2016).
This approach is based on the idea that some features depend on others. Greenberg (1963) presents several rules that hold for the world's languages, one of which stipulates the following: in languages with prepositions, the genitive almost always follows the governing noun, whereas in languages with postpositions, it almost always precedes it. Practical applications are not limited to pairs of features; instead, a whole set of features is used to predict a missing feature. As a preprocessing step, Murawaki (2015) used a variant of multiple correspondence analysis (Josse et al. 2012) to impute missing values. Takamura et al. (2016) chose a logistic model to investigate the predictive power of features. In their experiments, the discriminative classifier was given all but one feature of a given language, and the value of the remaining feature was predicted. One language at a time was selected for evaluation, and the classifier was trained on all of the other languages in the typological database.

Compared with the dependency-based approach, the autologistic model can be considered a proximity-based approach. Because of different experimental settings, we cannot directly compare our results with those of dependency-based methods; however, the dependency-based methods appear to achieve larger improvements over their baselines. It is noteworthy that proximity is implicitly exploited by the dependency-based approach because languages with similar feature value combinations often happen to be phylogenetically or spatially close. In fact, Takamura et al. (2016) conducted a type of ablation experiment in which they modified the training data by removing languages that shared a common ancestor with the target language or by excluding languages that were spatially close to the target. They demonstrated that accuracy declined in both settings.
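The leave-one-out evaluation protocol just described can be sketched as follows. Note that a simple nearest-neighbor predictor stands in for the logistic classifier of Takamura et al. (2016): the held-out feature is copied from the language whose remaining features agree most with the target. The tiny feature matrix is invented purely for illustration.

```python
def similarity(a, b, skip):
    """Number of matching feature values between two languages,
    ignoring the feature being predicted."""
    return sum(1 for f in range(len(a)) if f != skip and a[f] == b[f])

def nn_predict(matrix, held_out, feature):
    """Predict the held-out feature from the most similar other language."""
    best = max((l for l in range(len(matrix)) if l != held_out),
               key=lambda l: similarity(matrix[l], matrix[held_out], feature))
    return matrix[best][feature]

def leave_one_out_accuracy(matrix, feature):
    """Hold out each language in turn and predict `feature` from the rest."""
    correct = 0
    for held_out in range(len(matrix)):
        if nn_predict(matrix, held_out, feature) == matrix[held_out][feature]:
            correct += 1
    return correct / len(matrix)

# Toy matrix: rows = languages, columns = binary feature values (invented)
matrix = [
    [0, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
    [1, 0, 1],
]
print(leave_one_out_accuracy(matrix, 0))  # prints 1.0 on this toy matrix
```

An ablation of the kind described above would correspond to deleting phylogenetically or spatially close rows from `matrix` before prediction.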
This finding implies that vertical and horizontal clues are useful for imputing missing values, although these experiments do not control for the tendency of smaller amounts of training data to decrease test accuracy. Nevertheless, the proximity-based approach is largely complementary to the dependency-based one. An advantage of the proximity-based approach over the dependency-based one is that the former produces phylogenetically and spatially coherent samples. Because a dependency-based model imputes the missing features of each language independently of those of other languages, the imputed languages are likely to disrupt vertical and horizontal signals. A combined model, which we would like to work on in the future, would be expected to improve the accuracy of missing value imputation while retaining phylogenetic and spatial coherence.

3.3.2 Parameters for feature values

We defined the parameters of vertical stability vi and horizontal diffusibility hi for each feature type. However, some feature values could be more vertically stable and/or horizontally diffusible than others. Estimating these parameters for each feature value is straightforward: it merely requires replacing the scalar vi with a vector vi = (vi,1, …, vi,Fi), replacing hi with hi = (hi,1, …, hi,Fi), and modifying the model accordingly.

3.3.3 Geography and cultural contacts

The spatial neighbor graph was constructed by connecting all pairs of languages located within 1,000 km of each other. This is a rather crude method because, for example, the Himalayan range is much more difficult to cross than the Eurasian Steppe. A sophisticated landscape-based (Bouckaert et al. 2012) and/or human-centric (Marck 1986) model would be needed to take account of such ecological factors. A much simpler extension is a weighted variant of the spatial neighbor graph in which a spatially distant pair of languages carries a smaller weight.
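The graph construction just described can be sketched as follows. This is a minimal sketch: the three languages and their coordinates are invented, and the optional exponential distance-decay weight merely illustrates one possible weighting scheme of the kind mentioned above.

```python
from math import radians, sin, cos, asin, sqrt, exp

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def spatial_neighbor_graph(coords, threshold_km=1000.0, decay_km=None):
    """Connect every pair of languages within threshold_km of each other.
    If decay_km is given, each edge carries weight exp(-d / decay_km);
    otherwise every edge has weight 1.0."""
    edges = {}
    langs = list(coords)
    for i, a in enumerate(langs):
        for b in langs[i + 1:]:
            d = haversine_km(*coords[a], *coords[b])
            if d <= threshold_km:
                edges[(a, b)] = 1.0 if decay_km is None else exp(-d / decay_km)
    return edges

# Invented coordinates (lat, lon) for three hypothetical languages
coords = {"A": (35.0, 135.0), "B": (37.0, 127.0), "C": (52.0, 13.0)}
graph = spatial_neighbor_graph(coords)
print(sorted(graph))  # only the nearby pair A-B is connected
```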
For the weighted variant, the number of neighboring pairs in H(xi) is replaced with the sum of the edge weights. Preliminary experiments indicated that this simple weighting scheme did not lead to significant differences in the results, only slightly improving performance in terms of missing value imputation (from 62.03 to 62.16% in microaveraged accuracy). We also tested the spatial neighbor graph with different distance thresholds. In terms of missing value imputation, the 500 km threshold yielded the best performance, but the performance gaps were not large (Supplementary Fig. S4). Thus, tuning the distance threshold is of minor importance.

Another factor that needs to be considered is cultural contact, because the duration of contact between a pair of languages is not necessarily proportional to their spatial proximity. If some proxy data for the degree of cultural contact could be obtained, they could be incorporated into H(xi).

3.3.4 Explaining the universality factor

In the present study, universality is another name for a bias term. It was incorporated into the autologistic model only to capture the default distribution that is explained neither by vertical stability nor by horizontal diffusibility. However, uncovering universals is one of the main goals of linguistic typology. In fact, the autologistic model can be viewed from the opposite side: to uncover something truly universal, we need to correct for the fact that languages are not independent (Daumé and Campbell 2007), and the autologistic model does exactly that. Thus, explaining the universality factor by means of extralinguistic features (Lupyan and Dale 2010; Everett et al. 2015), for example, would be interesting future work. If data for such features are available, they can be straightforwardly incorporated into the model.

In conclusion, we have presented a statistical model that jointly models the distribution of typological features in terms of vertical stability, horizontal diffusibility, and universality.
The application of missing value imputation demonstrated that the estimated parameters were meaningful. We believe that the present study will contribute to the broader interdisciplinary research community on language evolution. Although we have emphasized statistical modeling in this article, a future in-depth analysis of the estimated parameters would be fruitful. Although computer-intensive, the autologistic model is conceptually simple and extensible. Extensions to the model are a promising future direction.

Supplementary data

Supplementary data are available at Journal of Language Evolution online.

Footnotes

1. Once the genetic relationship of languages is established, other types of evidence, such as the ordering of regular sound changes and historical changes in inflectional paradigms, would be needed to subgroup them.

2. Like Dediu and Cysouw (2013) and others, we treat each typological feature separately. Modeling the dependencies of a typological feature on other typological features (Murawaki, 2015) or on extralinguistic features such as social structure (Lupyan and Dale, 2010) and climate (Everett et al., 2015) is beyond the scope of the present study.

3. While one of the main goals of linguistic typology is to uncover universals, universality in our framework is nothing more than a factor that can be explained neither phylogenetically nor areally. We revisit this point in Section 3.3.4.

4. da Silva and Tehrani (2016) chose the distance threshold such that the average number of spatial neighbors was roughly the same as that of phylogenetic neighbors. In our case, however, the former is nearly three times as large as the latter.

5. Towner et al. (2012) performed model selection to assess the necessity of the vertical and horizontal factors. In Bayesian statistics, the standard choice for model selection is to use Bayes factors.
However, estimating Bayes factors for the autologistic model is challenging because it entails yet another layer of intractability (this task is thus called a triply intractable problem) (Friel, 2013), not to mention the marginalization of missing values. We leave it for future work.

Acknowledgements

The key ideas and early experimental results were presented at the 26th International Conference on Computational Linguistics (Yamauchi and Murawaki 2016). This article, which presents updated results, is a substantially extended version of the earlier conference paper. The work of the author K.Y. was done when he was a student in the Graduate School of Informatics, Kyoto University.

Funding

This work was partly supported by JSPS KAKENHI (grant number 26730122).

References

Besag J. (1974) 'Spatial Interaction and the Statistical Analysis of Lattice Systems', Journal of the Royal Statistical Society. Series B (Methodological), 36/2: 192–236.
Bishop C. M. (2006) Pattern Recognition and Machine Learning. Springer.
Bouckaert R., et al. (2012) 'Mapping the Origins and Expansion of the Indo-European Language Family', Science, 337/6097: 957–60. doi: 10.1126/science.1219669.
Campbell L. (2006) 'Areal Linguistics', in Encyclopedia of Language and Linguistics, 2nd edn, pp. 454–60. Elsevier.
Collard M., Shennan S. J. (2000) 'Ethnogenesis Versus Phylogenesis in Prehistoric Culture Change: A Case-Study Using European Neolithic Pottery and Biological Phylogenetic Techniques', in Renfrew C., Boyle K. (eds) Archaeogenetics: DNA and the Population Prehistory of Europe. McDonald Institute for Archaeological Research.
Croft W. (1996) Typology and Universals, 1st edn. Cambridge University Press.
da Silva S. G., Tehrani J. J. (2016) 'Comparative Phylogenetic Analyses Uncover the Ancient Roots of Indo-European Folktales', Royal Society Open Science, 3/1: 150645. doi: 10.1098/rsos.150645.
Daumé H. III (2009) 'Non-Parametric Bayesian Areal Linguistics', in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 593–601. http://www.aclweb.org/anthology/N/N09/N09-1067.bib.
Daumé H. III, Campbell L. (2007) 'A Bayesian Model for Discovering Typological Implications', in Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 65–72. http://www.aclweb.org/anthology/P/P07/P07-1009.bib.
Dediu D. (2010) 'A Bayesian Phylogenetic Approach to Estimating the Stability of Linguistic Features and the Genetic Biasing of Tone', Proceedings of the Royal Society of London B: Biological Sciences, 278/1704: 474–9. doi: 10.1098/rspb.2010.1595.
Dediu D., Cysouw M. (2013) 'Some Structural Aspects of Language are More Stable than Others: A Comparison of Seven Methods', PLoS One, 8/1: e55009. doi: 10.1371/journal.pone.0055009.
Drummond A. J., Bouckaert R. R. (2015) Bayesian Evolutionary Analysis with BEAST. Cambridge University Press.
Dunn M., et al. (2005) 'Structural Phylogenetics and the Reconstruction of Ancient Language History', Science, 309/5743: 2072–5. doi: 10.1126/science.1114615.
Dunn M., et al. (2011) 'Evolved Structure of Language Shows Lineage-Specific Trends in Word-Order Universals', Nature, 473/7345: 79–82. doi: 10.1038/nature09923.
Everett C., Blasi D. E., Roberts S. G. (2015) 'Climate, Vocal Folds, and Tonal Languages: Connecting the Physiological and Geographic Dots', Proceedings of the National Academy of Sciences, 112/5: 1322–7. doi: 10.1073/pnas.1417413112.
Felsenstein J. (1981) 'Evolutionary Trees from DNA Sequences: A Maximum Likelihood Approach', Journal of Molecular Evolution, 17/6: 368–76. doi: 10.1007/BF01734359.
Friel N. (2013) 'Evidence and Bayes Factor Estimation for Gibbs Random Fields', Journal of Computational and Graphical Statistics, 22/3: 518–32. doi: 10.1080/10618600.2013.778780.
Gray R. D., Atkinson Q. D. (2003) 'Language-Tree Divergence Times Support the Anatolian Theory of Indo-European Origin', Nature, 426/6965: 435–9. doi: 10.1038/nature02029.
Greenberg J. H., ed. (1963) Universals of Language. MIT Press.
Greenhill S. J., et al. (2010) 'The Shape and Tempo of Language Evolution', Proceedings of the Royal Society B: Biological Sciences, 277/1693: 2443–50. doi: 10.1098/rspb.2010.0051.
Hammarström H., et al., eds. (2016) Glottolog 2.7. Max Planck Institute for the Science of Human History. <http://glottolog.org>
Haspelmath M., et al., eds. (2005) The World Atlas of Language Structures. Oxford University Press.
Hawkins J. A. (1983) Word Order Universals. Academic Press.
Josse J., et al. (2012) 'Handling Missing Values with Regularized Iterative Multiple Correspondence Analysis', Journal of Classification, 29/1: 91–116. doi: 10.1007/s00357-012-9097-0.
Lewis M. P., et al., eds. (2014) Ethnologue: Languages of the World, 17th edn. SIL International. <http://www.ethnologue.com>
Liang F. (2010) 'A Double Metropolis–Hastings Sampler for Spatial Models with Intractable Normalizing Constants', Journal of Statistical Computation and Simulation, 80/9: 1007–22. doi: 10.1080/00949650902882162.
Longobardi G., Guardiano C. (2009) 'Evidence for Syntax as a Signal of Historical Relatedness', Lingua, 119/11: 1679–706. doi: 10.1016/j.lingua.2008.09.012.
Lupyan G., Dale R. (2010) 'Language Structure Is Partly Determined by Social Structure', PLoS One, 5/1: e8559. doi: 10.1371/journal.pone.0008559.
Marck J. C. (1986) 'Micronesian Dialects and the Overnight Voyage', The Journal of the Polynesian Society, 95/2: 253–8.
Møller J., et al. (2006) 'An Efficient Markov Chain Monte Carlo Method for Distributions with Intractable Normalising Constants', Biometrika, 93/2: 451–8.
Murawaki Y. (2015) 'Continuous Space Representations of Linguistic Typology and Their Application to Phylogenetic Inference', in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 324–34. http://www.aclweb.org/anthology/N/N15/N15-1036.bib.
Murphy K. P. (2012) Machine Learning: A Probabilistic Perspective. MIT Press.
Murray I., Ghahramani Z., MacKay D. J. C. (2006) 'MCMC for Doubly-Intractable Distributions', in Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pp. 359–66. https://arxiv.org/html/1208.5160.
Nelson-Sathi S., et al. (2010) 'Networks Uncover Hidden Lexical Borrowing in Indo-European Language Evolution', Proceedings of the Royal Society of London B: Biological Sciences, 278/1713: 1794–803. doi: 10.1098/rspb.2010.1917.
Nichols J. (1992) Linguistic Diversity in Space and Time. University of Chicago Press.
Nichols J. (1994) 'The Spread of Language Around the Pacific Rim', Evolutionary Anthropology: Issues, News, and Reviews, 3/6: 206–15. doi: 10.1002/evan.1360030607.
Nichols J. (1995) 'Diachronically Stable Structural Features', in Andersen H. (ed.) Historical Linguistics 1993. Selected Papers from the 11th International Conference on Historical Linguistics, Los Angeles, 16–20 August 1993. John Benjamins Publishing Company.
Nichols J. (2003) 'Diversity and Stability in Languages', in Joseph B. D., Janda R. D. (eds) The Handbook of Historical Linguistics, pp. 283–310. Oxford University Press.
Pagel M., et al. (2013) 'Ultraconserved Words Point to Deep Language Ancestry across Eurasia', Proceedings of the National Academy of Sciences, 110/21: 8471–6. doi: 10.1073/pnas.1218726110.
Pagel M., Meade A., Barker D. (2004) 'Bayesian Estimation of Ancestral Character States on Phylogenies', Systematic Biology, 53/5: 673–84. doi: 10.1080/10635150490522232.
Parkvall M. (2008) 'Which Parts of Language are the Most Stable?', STUF: Language Typology and Universals / Sprachtypologie und Universalienforschung, 61/3: 234–50.
Sijthoff A. W. (1928) Actes du premier congrès international de linguistes à La Haye, du 10–15 avril 1928.
Swadesh M. (1971) The Origin and Diversification of Language. Aldine Atherton.
Takamura H., Nagata R., Kawasaki Y. (2016) 'Discriminative Analysis of Linguistic Features for Typological Study', in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 69–76. http://www.lrec-conf.org/proceedings/lrec2016/summaries/114.html.
Towner M. C., et al. (2012) 'Cultural Macroevolution on Neighbor Graphs: Vertical and Horizontal Transmission among Western North American Indian Societies', Human Nature, 23/3: 283–305. doi: 10.1007/s12110-012-9142-z.
Tsunoda T., Ueda S., Itoh Y. (1995) 'Adpositions in Word-Order Typology', Linguistics, 33/4: 741–62. doi: 10.1515/ling.1995.33.4.741.
Wichmann S. (2015) 'Diachronic Stability and Typology', in Bowern C., Evans B. (eds) The Routledge Handbook of Historical Linguistics, pp. 212–24. Routledge.
Wichmann S., Holman E. W. (2009) Temporal Stability of Linguistic Typological Features. Lincom Europa.
Yamauchi K., Murawaki Y. (2016) 'Contrasting Vertical and Horizontal Transmission of Typological Features', in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 836–46. http://www.aclweb.org/anthology/C/C16/C16-1080.bib.

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

Journal of Language Evolution, Oxford University Press. Published: Jan 1, 2018.
