Abstract

Spelling variation seems to go hand in hand with grammatical variation in certain historical texts. This article presents a method of quantifying spelling variation as a linguistic variable whose relation with relevant grammatical and contextual variables can be statistically measured. Based on the normalization of the non-standard word forms and the subsequent calculation of edit distance between the normalized and attested word forms, the method is applicable to morphologically tagged historical text corpora and is here tested on an early medieval documentary Latin corpus with notable spelling variation. To justify the proposed method, several methodological issues of both philological and technical nature are discussed. The latter part of the article illustrates the potential of the method by way of a case study on the relationship between spelling variation and the use of non-standard prepositions in documentary Latin and by examining the chronological variation of spelling in its historical context. There appears to be a statistically significant dependence between non-standard spelling and the use of non-standard prepositions. It is also argued that the diachronic spelling distribution may be indicative of a spelling reform, in addition to reflecting an already known administrative change relative to scribal practices. Thus, the proposed method of quantifying spelling variation proves to offer interesting insights into the linguistic and historical reality underlying the text corpus.
1 Introduction

A reader of non-literary historical texts often gets the impression that the fluctuation of spelling is connected to the writer's defective abilities to produce grammatically standard language and to convey his or her message successfully.1 This article presents a method to quantify spelling variation in morphologically tagged historical text corpora to enable statistical analysis of spelling and its relationship to linguistic, discursive, textual, palaeographic, and other relevant factors. The method discussed here, which quantifies spelling variation by way of word form normalization and edit distances, is also an example of how digital text analysis can be used to answer questions that were previously out of reach because of their laboriousness. In the present article, early medieval non-standard Latin documents are used as example data. The method opens a unique window into the language competences of individual Latin scribes. It can be exploited by historical linguistics to draw conclusions on the use of spoken language in diachrony and by historiography to reconstruct details of documentary production, both of which remain subject to heated discussion (e.g. Adams' (2013) Social Variation and the Latin Language and Brown et al.'s (2013) Document Culture and the Laity in the Early Middle Ages).2 The method is fully scalable to any lemmatized and morphologically tagged corpus of any language variant whose spelling varies with respect to the prevailing orthographical norm, provided that a tagger and a sufficient word form lexicon are available for the respective standard variant.

I will begin by presenting the data. After that, I will explain the methodological motivation behind the procedure of spelling quantification applied here.
At the end, I will illustrate the power of the method by a case study on the relationship between spelling and non-standard preposition usage, and by showing how the diachronic distribution of the spelling variable can be argued to reflect the historical development that took place in early medieval Tuscany. It turns out that even a rather superficial quantitative examination of spelling variation can provide interesting inferences both on certain linguistic developments and on the changing conditions of Latin documentary writing in a period which covers the so-called Carolingian reforms (Irvine, 1994, pp. 305–13).

2 Data

The example data are the Late Latin Charter Treebank (LLCT), a lemmatized and morphologically and syntactically annotated corpus of 1,040 documentary texts (ca. 420,000 words) written in early medieval Italy between 714 and 897. The great majority of the documents have been preserved in the archives of the episcopal see of Lucca. Most documents, the so-called charters (chartae), are contracts about selling, buying, and donating landed property, but there are also a few judgements (iudicata) as well as lists and registers (brevia) compiled for various purposes (Bresslau, 1958, pp. 46–8). Since the 220 scribes dated the documents and signed them with their own names, LLCT offers a rare sociolinguistic possibility to observe the linguistic choices of individual medieval language users.

To guarantee diatopic linguistic homogeneity, all the documents of LLCT are from historical Tuscia, a region comprising most of modern Tuscany, large parts of Umbria, and the northern parts of Lazio. During the Lombard era in the 8th century, the region was organized mainly as the Duchy of Lucca. Lucca was home to an influential archbishop, and it was also a cultural centre and scriptorium.
Initially left intact by the Carolingian conquest in 774, the administration was reorganized into a marca of Tuscia beginning from 781, first by mediation of loyal Lombard dukes, with Frankish counts appearing by the end of the century. The marca of Tuscia came to include Lucca, Pisa, Luni, Maremma, Corsica, and later also Florence and Fiesole, but not Siena, Chiusi, and Arezzo, although there are also documents written in the territories of Siena and Chiusi in the Luccan archives (Keller, 1973). The documents included in LLCT have been published in five diplomatic editions between 1833 and 1933.3 In the first decades of LLCT, the number of surviving documents is rather low but, after the 750s, each decade features sixty-nine documents on average (for the numbers, see note 11). Only original documents and contemporary copies (ca. 4% of the material) have been accepted, all on condition that they are not too fragmentary. The documents have been digitized, proofread, converted into TEI XML, and provided with appropriate metadata as well as diplomatic information on abbreviated and fragmentary words, which cannot be included in the spelling analysis (Korkiakangas and Lassila, 2013). The lemmatization and the morphological tagging is based on the Ancient Greek and Latin Dependency Treebank (AGLDT) style developed originally within the Perseus Project (Tufts University, Medford, MA), and the syntactic annotation of dependency relations follows the Guidelines for the Syntactic Annotation of Latin Treebanks (Bamman et al., 2007). Korkiakangas and Passarotti (2011) define a number of additions and modifications to these general guidelines which are designed for Classical Latin. This is necessary because documentary Latin is a non-standard variety which can be described with reason as a merger of (post-)Classical Latin, spoken-language phenomena, and hypercorrections. Moreover, documents are formulaic by default, i.e. they consist primarily of conventional building blocks. 
However, in early medieval Italy, the scribes did not copy documents from formulary books, as they did later in the Middle Ages, but compiled them from memory (Schiaparelli, 1933b, p. 3; Amelotti and Costamagna, 1975, pp. 215–16). This obviously led to a considerable linguistic variation, which is particularly conspicuous in the case-specific ‘improvised’ parts of the documents, usually descriptions of the transferred property. Contrary to the fully formulaic parts, such as the opening and closing clauses, these non-formulaic parts clearly make recourse to the spoken idiom and, thus, reflect a different linguistic reality (Sabatini, 1965; Korkiakangas, 2016, pp. 13–15). Importantly for a study of spelling, LLCT is one of the few Latin corpora which really contain notable spelling variation. By the Early Middle Ages, the spoken language had evolved considerably and no obvious correspondence obtained between pronunciation and the inherited orthography. Likewise, the spoken language, which was already likely to have been easily described as Italo-Romance, had abandoned or replaced several standard-Latin inflexional endings, which however were expected to be written (Zamboni, 2000, p. 101 ff.; Adams, 2013, p. 31 ff.). At this point, it is necessary to define the often-mentioned concept of ‘standard’ the spelling of LLCT is compared to. In this article, I mean by standard Latin roughly the type of written language used, among others, by the Christian authors of the Late Antiquity who were seen as models for literary activity throughout the Early Middle Ages. Their Latin was still essentially Classical Latin as codified in prescriptive grammars, as far as orthography and morphology are concerned. Many non-Classical features originating from the Vulgate were accepted, however. The orthography and morphology of this (post-)Classical variety still seem to serve as the model for the best-written LLCT texts. 
So, a rather clear standard was available, in terms of a substantial consensus about ‘correct’ or ‘accepted’ spelling and morphology (Larson, 2000, pp. 161–2; Auernheimer, 2003, pp. 49–50; Korkiakangas, 2016, p. 36).

3 Quantification of Spelling Variation

Spelling variation is operationalized in this study by normalizing the non-standard word forms of LLCT into standard-Latin forms and subsequently quantified by calculating the edit distance between all the word forms of the corpus and their normalized standard-Latin counterparts, whether these be originally standard or non-standard. Thus, each word form of the corpus receives a value which indicates the word's distance from the respective standard form. Edit distance means the minimum number of single-character changes required to transform one string into the other. I have utilized the simplest edit distance metric, the Levenshtein distance, where the changes can be either insertions, omissions, or substitutions, and all carry equal weight. Although, say, insertions and omissions are admittedly seldom equivalent linguistically, the same treatment of all changes keeps the procedure transparent and, consequently, the results easier to interpret.

In this section, I will discuss first the theoretical aspects involved in the quantification of spelling. After that, I will explain how I have handled certain philological and textual issues, such as abbreviations, and, finally, how I have produced the normalized standard spellings by using the morphological tagging present in the corpus.

Table 1 illustrates the applied procedure. If a word already has a standard spelling, it receives the value 0 (per and mensem). If there is one single-character deviation, like between omne and omnem, the edit distance is 1 and the relative edit distance is 1 of 5 characters of the standard form (omnem), i.e. 0.20, which corresponds to 20%.
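The word-level calculation described above can be sketched in a few lines. This is an illustrative Python sketch, not the R script used in the study; the function names levenshtein and relative_edit_distance are chosen here for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, omissions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # drop ca
                            curr[j - 1] + 1,      # add cb
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def relative_edit_distance(attested: str, normalized: str) -> float:
    """Edits divided by the length of the normalized (standard) form."""
    return levenshtein(attested, normalized) / len(normalized)


if __name__ == "__main__":
    for attested, normalized in [("per", "per"), ("omne", "omnem")]:
        print(attested, normalized,
              levenshtein(attested, normalized),
              relative_edit_distance(attested, normalized))
```

All three edit types cost 1 here, which mirrors the equal treatment of changes adopted in the study.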
In practice, the spelling of single words is only of limited interest, whereas the corpus-linguistically important associations usually obtain on the level of text, writer, location, or period of time. Indeed, the relative edit distance value can be calculated for any coherently defined unit as a percentage of single-character spelling deviations in the total number of characters in that unit. For example, if we wish to calculate the relative edit distance value for the entire phrase per omne mensem octubrio ‘all the month of October’ of Table 1, the four spelling deviations (0 + 1 + 0 + 3) are divided by the total number of characters in the standard forms of the phrase (3 + 5 + 6 + 9 = 23), resulting in a relative edit distance value of 0.17 (17%).

Table 1 Calculating edit distance between attested and normalized forms

Attested form   Normalized form   Edit distance   Characters   Edit distance/characters
per             per               0               3            0
omne            omnem             1               5            0.20
mensem          mensem            0               6            0
octubrio        octobrium         3               9            0.33

For the analyses of Section 4, the edit distance will be averaged over the total number of characters attributed to each scribe, document, or decade. Note that I have calculated the percentage against the number of characters of the standard form and not of the attested form. In theory, nothing would prevent using the attested form as the reference point instead.
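The unit-level value can be sketched as the sum of edits divided by the sum of standard-form characters. The following Python sketch is illustrative only (the study itself used an R script), and unit_relative_edit_distance is a hypothetical name. Note that mensem is six characters long, so the code divides the four edits of the example phrase by 23 characters.

```python
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def unit_relative_edit_distance(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total edits over total standard-form characters for any unit
    (phrase, document, scribe, decade) given as (attested, normalized) pairs."""
    edits = 0
    chars = 0
    for attested, normalized in pairs:
        edits += levenshtein(attested, normalized)
        chars += len(normalized)
    return edits / chars


phrase = [("per", "per"), ("omne", "omnem"),
          ("mensem", "mensem"), ("octubrio", "octobrium")]

if __name__ == "__main__":
    print(round(unit_relative_edit_distance(phrase), 2))
```

Summing before dividing weights every character equally, so long words influence the unit value more than short ones; averaging the per-word ratios instead would give a different figure.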
Yet, the chosen procedure based on the standard form is obvious here because the standard-Latin norm serves as the yardstick throughout the study. An additional motivation is that the phonological erosion of the (particularly word-final) syllables in Late Latin resulted in a situation where most misspellings are shorter, and not longer, forms than the respective standard forms, e.g. stio in place of aestivum in tertia parte lavore vernio et stio ‘third part of the spring and summer crops’ (MED 411). Consequently, it makes better sense to calculate the share of the edits in the characters of the often longer standard form than in the often shorter attested form because in this way the percentage remains between 0 and 100% (62.5% in the case of stio, five edits in eight characters). Instead, using the attested form as the point of reference would sometimes yield percentages over 100%: in the case of stio, five edits in four characters correspond to 125%. Indeed, LLCT has no misspelled word form with more edits than the number of characters in its standard form, while there are thirty-eight word forms with more edits than characters in the attested form. As a matter of fact, the difference between the edit distance percentages calculated in the two ways (based on the normalized or attested forms) is essential only with the most non-standardly spelled words with several single-character omissions. Words of this type are eventually rather few. In fact, only 0.8% of all the LLCT words contain more than two single-character deviations, and the deviations are not necessarily omissions or insertions, which affect the length of the word. This is why the differences between the two calculating methods become largely neutralized when larger units are dealt with. For example, if both the edit distance percentages are calculated for each scribe, the percentage based on the normalized form is smaller than that based on the attested form only by 0.2 percentage points. 
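The effect of the chosen reference point can be checked on the stio/aestivum example. The Python sketch below is illustrative and not part of the study's tooling; share_of_edits and its reference parameter are hypothetical names introduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def share_of_edits(attested: str, normalized: str,
                   reference: str = "normalized") -> float:
    """Edits as a share of the characters of the chosen reference form."""
    if reference not in ("normalized", "attested"):
        raise ValueError("reference must be 'normalized' or 'attested'")
    denominator = len(normalized) if reference == "normalized" else len(attested)
    return levenshtein(attested, normalized) / denominator


if __name__ == "__main__":
    # Five edits: 5/8 against the standard form, 5/4 against the attested form.
    print(share_of_edits("stio", "aestivum", "normalized"))
    print(share_of_edits("stio", "aestivum", "attested"))
```

Because the Levenshtein distance never exceeds the length of the longer string, the share based on the standard form can pass 100% only when the attested form is longer than the standard one.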
With large units, such as the entire number of characters written by a scribe, the difference between the percentages appears to be as good as constant. As a consequence, the associations between edit distance and the examined linguistic and extra-linguistic variables remain virtually the same, whichever calculating method is used.

Leaving this topic aside, it is necessary to highlight that not all words of any corpus can be included in an edit distance analysis, for several reasons. Table 2 presents the excluded categories of LLCT and gives some impression of their magnitude. Fragmentary and abbreviated words are excluded from the analysis if their inflexional endings cannot be restored and expanded with certainty. However, when the fragmentation or abbreviation does not involve the inflexional endings, e.g. with [su]bscripsi ‘I signed’ or with p(er) ‘through/by’, the restored or expanded forms, subscripsi and per, are accepted into the analysis. This partly infelicitous method, which derives from the annotation principles underlying LLCT, is possible because the documentary formulae usually allow the reconstruction of even badly fragmentary passages with a high degree of certainty and because those abbreviations that leave the possible inflexional ending intact are unambiguous. Both of these categories are relatively small in LLCT and, given that the annotation of LLCT does not allow their isolation, this spelling study considers by default these reconstructed and expanded characters as having their standard-Latin form. The corpus-linguistic motivation of this annotation decision is explained in detail in Korkiakangas and Passarotti (2011).

Table 2 Word categories excluded from the analysis

Category                                          Examples                                                               Tokens
Fragmentary words                                 lo[co?], Gaip[…]                                                       3,634
Abbreviated words                                 ind(ictione), s(upra)s(criptus)                                        26,453
Editorial additions                                                                                                      54
Editorial deletions                               rogatus {rogatus}, Deo {Deo}                                           86
Words with undefinable Classical Latin ancestor   havone ‘uncle’ (∼avum), sculdais ‘official’, offeruimus/offersimus     2,535
Personal names                                    Warnegausu, Ghittia, Gausprandus                                       33,553
Toponyms                                          Montenonni, Luca, Cellules                                             6,014

The editorial additions and deletions, which resolve haplographies or other scribal omissions and dittographies, respectively, are treated in a similar manner.
In general, I have analysed all the words in the form in which they appear in the original documents, except when the editorial intervention involves an entire word, such as {rogatus} in ego Toto filio bone memorie Aurili rogatus {rogatus} ab Aliseu clerico (MED 209) ‘I, Toto, son of the late Aurilius, asked {asked} by Aliseus, the cleric’. In this case, the erroneously repeated word is left out of edit distance analysis, whereas, for example, the dittography lavora{vora}verimus is analysed in its full form lavoravoraverimus.

Methodologically, a spelling comparison study always postulates a standard variant for each word form. Yet, not all the words attested in LLCT exist in standard Latin. These words with no self-evident ancestor in standard Latin are mainly of Germanic origin, such as sculdais, a high official under the Lombard reign, but there are also Late Latin neologisms and loans from other languages, such as curte or curtis, which comes from the Greek khórtos ‘courtyard’, but seems to have no established Latin spelling. Typical post-Classical neologisms include, for example, the regularized perfect and participle forms of verbs with stem fero, such as offeruimus or offersimus instead of the suppletive obtulimus ‘we offered’, and offerta instead of oblata ‘donations’. These seem to have been widely accepted morphological variants in documentary Latin judging from their ample and varied use. Even so, their spelling is not fully stable, e.g. oferrui, offerui, offersi ‘I offered’. Although these words are totally functional, they cannot be included because no standard-Latin counterparts can be established for them. On the one hand, Classical and post-Classical Latin knew only the suppletive paradigm and, on the other, it seems misleading to claim that the attested, quite systematic use of the regularized paradigm is just a deviation from the standard forms with suppletive stem.
The same also applies to most names, both anthroponyms and toponyms, which I have excluded from this spelling analysis. Most personal names in Lombard Italy were fashionably of Germanic origin and had no standard spelling, e.g. Warnegausu, Ghittia, and Gausprandus. Therefore, it is impossible to compare, for example, Austripert and Ostripertus to each other or to any norm. Even though mostly of Latin origin, place names must also be excluded from the analysis because they are usually difficult to match with a corresponding standard-Latin name. Capannori probably derives from a Late Latin expansive -ora plural of the standard word capanna, meaning roughly ‘small huts’, but it does not seem sensible to compare Capannori to its centuries-old etymological ancestor. Names were names, and people were most often not interested in whether their spelling was in keeping with the etymological roots. Note that about 1,000 personal and place names are fragmentary, so the categories presented in Table 2 slightly overlap. Consequently, 347,832 words remain for the spelling analysis.

This and the following passage describe how the edit distance values were technically assigned to each word of LLCT. The procedure is possible because each LLCT word is stored with a lemma and a morphological analysis. For example, the non-standard third-person singular indicative form aues is connected to the lemma habeo ‘to have’ and the morphological tag verb-third_person-singular-present-indicative-active-#-#-# in the treebank. It is possible to create a corresponding normalized, standard-Latin, form (habet) on the basis of these two pieces of information if they are used as query terms within a comprehensive list of standard-Latin forms with respective lemmas and morphological analyses. For this purpose, I created a two-column list of two million Classical Latin word form entries with respective lemma + morphological analysis pairs, as in Table 3.
I created this Standard Word Form Lexicon by feeding the publicly available Open Office Latin spelling lexicon into the Whitaker's WORDS (Words 1.97FC) tagger program.4

Table 3 Standard word form lexicon

haberet     habeo-verb-third_person-singular-imperfect-subjunctive-active-#-#-#
haberetis   habeo-verb-second_person-plural-imperfect-subjunctive-active-#-#-#
haberetur   habeo-verb-third_person-singular-imperfect-subjunctive-passive-#-#-#
haberi      habeo-verb-#-#-present-infinitive-passive-#-#-#
haberis     habeo-verb-second_person-singular-present-indicative-passive-#-#-#
habes       habeo-verb-second_person-singular-present-indicative-active-#-#-#
habet       habeo-verb-third_person-singular-present-indicative-active-#-#-#
habete      habeo-verb-second_person-plural-present-imperative-active-#-#-#
habetis     habeo-verb-second_person-plural-present-indicative-active-#-#-#
habeto      habeo-verb-second_person-singular-future-imperative-active-#-#-#
habeto      habeo-verb-third_person-singular-future-imperative-active-#-#-#
habetote    habeo-verb-second_person-plural-future-imperative-active-#-#-#
habetur     habeo-verb-third_person-singular-present-indicative-passive-#-#-#
habita      habeo-participle-#-singular-perfect-participial-passive-feminine-ablative-#
habita      habeo-participle-#-singular-perfect-participial-passive-feminine-nominative-#
habita      habeo-participle-#-singular-perfect-participial-passive-feminine-vocative-#
habita      habeo-participle-#-plural-perfect-participial-passive-neuter-accusative-#
habita      habeo-participle-#-plural-perfect-participial-passive-neuter-nominative-#
habita      habeo-participle-#-plural-perfect-participial-passive-neuter-vocative-#
habitae     habeo-participle-#-singular-perfect-participial-passive-feminine-dative-#
habitae     habeo-participle-#-singular-perfect-participial-passive-feminine-genitive-#
habitae     habeo-participle-#-plural-perfect-participial-passive-feminine-nominative-#
habitae     habeo-participle-#-plural-perfect-participial-passive-feminine-vocative-#
habitam     habeo-participle-#-singular-perfect-participial-passive-feminine-accusative-#
habitarum   habeo-participle-#-plural-perfect-participial-passive-feminine-genitive-#

After that, it was easy to match each word of LLCT with the lexicon entries with an appropriate R script.5 The script picks the standard-Latin word form from the lexicon for each LLCT word form and calculates the edit distance between the originally attested form and the one taken from the lexicon, i.e. the normalized form. When calculating the edit distance, the letters u and v, i and j, as well as k and c were counted as equivalent because they were considered merely alternative graphical representations of the same graphemes. The variation u/v (e.g. uua/uva ‘grape’) is purely conventional and varies across the editions on which LLCT is based. By contrast, pairs such as estimatione/estimatjone are potentially relevant linguistically because there j, also called i longa, sometimes stands for a palatalized glide. This is, however, beyond the scope of the present study but, if needed, the i longae can be taken into account in a separate study. The graph k is rarely used as a learned relic in certain fixed expressions, such as kalendis.

In standard Latin, much grammatical information is encoded in the word-final inflexional morphemes. As stated in Section 2, in Late Latin, this information was increasingly conveyed by other means, such as prepositions and word order, as far as noun declension is concerned. At the same time, the word-final sounds underwent phonetic erosion (Zamboni, 2000, pp. 149–50; Adams, 2013, pp. 128–63). The outcome of this development was the Romance-type inflexional endings, such as -o in presbitero (standard presbyterum ‘priest-ACC’) and -e in tene (standard tenet ‘hold-3SG.PRES’), which occur with varying frequencies in LLCT.
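The matching step described above (look up the standard form by lemma and morphological tag, then measure the distance with u/v, i/j, and k/c treated as equivalent) might be sketched as follows. This is a Python illustration under stated assumptions, not the R script of the study: LEXICON is a hypothetical two-entry miniature of the word form lexicon, fold_graphs and score_token are names invented here, and input is assumed to be lowercase.

```python
from typing import Dict, Optional, Tuple

# Hypothetical miniature of the lexicon: (lemma, tag) -> standard form.
LEXICON: Dict[Tuple[str, str], str] = {
    ("habeo", "verb-third_person-singular-present-indicative-active-#-#-#"): "habet",
    ("habeo", "verb-second_person-singular-present-indicative-active-#-#-#"): "habes",
}

# Graph pairs counted as equivalent: v~u, j~i, k~c (lowercase input assumed).
EQUIVALENT = str.maketrans({"v": "u", "j": "i", "k": "c"})


def fold_graphs(form: str) -> str:
    """Map each equivalent graph to one representative before comparison."""
    return form.translate(EQUIVALENT)


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def score_token(attested: str, lemma: str,
                tag: str) -> Optional[Tuple[str, int, float]]:
    """Return (normalized form, edit distance, relative edit distance),
    or None when the lexicon has no entry for this lemma and tag."""
    normalized = LEXICON.get((lemma, tag))
    if normalized is None:
        return None
    distance = levenshtein(fold_graphs(attested), fold_graphs(normalized))
    return normalized, distance, distance / len(normalized)


if __name__ == "__main__":
    tag = "verb-third_person-singular-present-indicative-active-#-#-#"
    print(score_token("aues", "habeo", tag))
```

Returning None for a missing entry keeps words without a standard counterpart out of the totals instead of silently scoring them as 0.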
Non-standard spellings which occur in inflexional endings often involve deviations from standard morphology/morphosyntax in addition to phonologically motivated deviations, while the non-standard spellings concerning the word stem are principally motivated by underlying phonological change only.6 Linguistically unmotivated spelling deviations, i.e. typos, are presumably distributed evenly over stem and ending. Although phonological change obviously affects stems and affixes alike, it might be useful to measure the edit distance separately in the word stems and in the inflexional morphemes because the morphosyntactic evolution of Late Latin manifests itself in the inflexional affixes. A separate examination would make it possible to assess whether the edit distance values differ essentially between stem and inflexional ending and whether they are differently correlated with the other variables which describe the scribes' linguistic abilities. For the present, however, LLCT does not contain information about the word stems, so this kind of experiment is not possible, unless one is content with a crude solution that concentrates on a fixed number of characters counted from the end of each declinable word. Such a solution is problematic, however, because the inflexional endings of Latin differ greatly in length. Any fixed-number approach is likely to produce rather fuzzy results if it is applied equally to endings as different as, for example, cas-a and fu-isserunt. Therefore, the present study examines the edit distance on entire word forms only. A related theoretical–methodological question is that of weighting different types of edits: for example, omissions of characters could have a different weight in the edit distance value than additions. It was shown above in this section that writers tend to omit characters that the standard requires rather than add characters that it does not, a tendency fostered by the phonological development of Latin.
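The idea of weighting edit types can be made concrete with a small sketch. The following Python function is purely illustrative and is not applied in this study; the function name and the cost values are assumptions chosen for the example, with omissions made cheaper than additions.

```python
# Illustrative sketch; the weights are hypothetical, not values from the study.

def weighted_distance(normalized: str, attested: str,
                      omission: float = 0.5,
                      addition: float = 1.0,
                      substitution: float = 1.0) -> float:
    """Cost of turning the normalized form into the attested form.

    omission: a character of the normalized form is missing in the attested form
    addition: the attested form has a character the normalized form lacks
    """
    rows, cols = len(normalized) + 1, len(attested) + 1
    cost = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * omission
    for j in range(1, cols):
        cost[0][j] = j * addition
    for i in range(1, rows):
        for j in range(1, cols):
            same = normalized[i - 1] == attested[j - 1]
            cost[i][j] = min(
                cost[i - 1][j] + omission,
                cost[i][j - 1] + addition,
                cost[i - 1][j - 1] + (0.0 if same else substitution),
            )
    return cost[-1][-1]


print(weighted_distance("tenet", "tene"))  # final -t omitted by the scribe
print(weighted_distance("tene", "tenet"))  # a -t added by the scribe
```

With these example weights, an omitted final -t counts half as much as an added one, so the measure is no longer symmetric in its two arguments.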
Indeed, weighting edits would be likely to turn out useful in spelling variation studies. In LLCT, the weighting method is, however, not possible because LLCT has been annotated using standard-Latin morphology as the yardstick but allowing for the comprehensive linguistic change that took place in Late Latin. As for phonology, the lenition of word-final sounds as well as other sound changes have been taken into account. By way of example, this means that, regardless of what one thinks of the actual state of the case system, the form casa ‘house’ has been annotated etymologically as an accusative in ad casa ‘to/by the house’ and the form casam etymologically as an ablative in de casam ‘from the house’ and etymologically as a nominative in the subject function, e.g. casam illa regitur a Petro massario ‘that house is governed by Petrus the tenant’.7 The described etymology principle reduces the morphological tags of the attested forms to those standard-Latin forms they etymologically originate from and leaves unheeded the phonological change: the form casa-m is an accusative of casa (nominative) in standard Latin, but as the final -m was not pronounced any longer, it is not counted in the annotation. This annotation principle is parsimonious and optimized for non-standard Latin (Korkiakangas and Passarotti, 2011), but it is easy to understand how difficult it would be to measure different types of edits in LLCT on this annotational basis: the results would hardly be revealing, given that in a corpus study one cannot be aware of the etymological considerations behind each form. 4 Edit Distance in Linguistic and Historical Context In this section, I present first a case study on prepositions and then a somewhat broader examination of the scribes' diachronic spelling variation. These are expected to illustrate the potential of the here developed method for historical linguistics and for historical document studies. 
A full-scale examination of the relationship between spelling variation and relevant linguistic features, however, would require a study of its own. The linguistic case study is based on the assumption that non-standard spelling in historical texts is likely to correlate with the overall grammatical level of the language of those texts. The conservative standard spelling and certain linguistic features were no longer supported by the spoken idiom, i.e. by the scribes' own first-language (L1) grammatical system, to the degree they had been in the previous stages of Latinity. Instead, they needed to be learnt by memorization, by cognitive effort, practically as second-language (L2) phenomena. Therefore, the quality of spelling and the mastery of these linguistic features can be reasonably expected to correlate with each other. As a consequence, spelling variation is supposed to be a practical measure for assessing the writers' learned language skills. However, the correlation between a scribe's overall spelling level and his use of some grammatical features may be determined not only by the quality of the scribe's training but also by the qualities of the feature. A learner's own L1 system obviously affects the use of those features that are outside the L2 learner's conscious control more heavily than it affects spelling, which is always a convention and, as such, primarily a matter of training. This is why I chose to examine prepositions, which are free grammatical morphemes and, consequently, more likely to be under a learner's control than, for example, word order. Building on Goldschneider and DeKeyser's (2001) findings, I propose in Korkiakangas (in preparation) that free grammatical morphemes were likely more salient for an early medieval Italian Latin L2 learner than abstract syntactic rules, like those conditioning word order, and, as such, more easily acquired.
A further motivation of investigating the correlation between spelling level and grammatical features is the wish that this kind of language competence measure may ultimately help detecting learned versus spoken-language-driven variation in historical texts, where linguistic preferences, such as prestige forms, are often assigned only with difficulty. This is to say that the here developed spelling metrics can be later used to examine the linguistic status of features which have so far not allowed categorization according to whether they are to be considered evanescent or still typical of the spoken usage: if the decreasing or hypercorrect use of a certain feature is systematically associated with non-standard spelling, the feature is likely to have been alien to the spoken language of the time. My precise working hypothesis in the following case study is that the more non-standard the spelling is, the more frequent are the novel, Romance-type, linguistic features, and the poorer the command of those standard-Latin features that were in decline in Late Latin. The innovative Romance-type features had obviously crept into the written code from the scribes' L1 system, while the conservative features derived from the centuries-old legal Latin. The latter were learnt (or not) through the scribal training and reproduced with varying success (Sabatini, 1965). The case study investigates whether spelling variation is associated with an increased use of non-standard, innovative prepositions. In Late Latin, Classical prepositions were challenged by new formations, most of which were originally adverbs. An often-cited example is the compound adverb abante (ab + ante ‘from + before’) which developed into a preposition and has various descendants in the Romance languages: e.g. French avant, Italian avanti, both with meaning ‘before’. 
However, not all the prepositions attested in the late imperial and early medieval Latin non-standard texts ended up in the Romance languages.8 Figure 1 shows that, in LLCT, there appears to be a statistically significant, albeit rather weak, positive correlation (Pearson's r = 0.30)9 between the scribes' spelling level (edit distance value) and their use of the following twenty innovative prepositions, some of which were probably temporary formations: anteposito, da, desub, desuper, excepto, fine/fini, foras, foris, hactenus, inante, insimul, insuper, intro, longinquo, recta, rectum, retro, sequenter, subtus, and usque (1,002 occurrences in total, corresponding to 2.3% of the prepositions in LLCT). The relative frequency of the innovative prepositions is calculated against only those Classical prepositions that are in competition with them. This means that the eight LLCT prepositions that were used in Classical Latin and continue to be used in Italian with basically the same meanings, i.e. ad, cum, de, in, per, secundum, sub, and supra, are left out of the calculation.10 These are also the most frequent ones: together they account for 71.7% of all the prepositions in LLCT. It is true that the medieval variants of Italian allow more variation in the use of prepositions than modern standard Italian, but it is often impossible to tell which of these were learned Latinisms. Importantly, the correlation behaves as predicted by the working hypothesis: the more standard-conformant the spelling, i.e. the lower the edit distance, the fewer the innovative prepositions. However, as was suggested above, the weakness of the correlation may be to some extent explained by the fact that the influence of training might be more noticeable in spelling than in preposition use.
Fig. 1 Innovative prepositions as a function of edit distance in LLCT, calculated per scribe
Based on the percentages underlying Fig. 1, it is apparent that the standard-Latin prepositions still dominate in LLCT, the most common prepositions being in (31.7%), ad (12.0%), and ab (11.3%). Indeed, 28% of the 220 LLCT scribes do not utilize the above-listed twenty innovative prepositions at all, while others let innovative prepositions replace the Classical ones in several tens of per cent of the cases, the two least Classical scribes reaching 71 and 92%. Thus, the variation between scribes is huge. This is also a hint that some scribes had a more imperfect training than others. However, it is possible that the relative frequency of innovative prepositions is partly related to the document type (e.g. charter, judgement, breve), given that different textual contents are likely to favour different prepositional expressions. Indeed, it turns out that in the small document group of lists and registers (brevia, only fifteen documents), the variation in the use of innovative prepositions is considerably greater and the innovative prepositions more frequent than in other types of documents: in the brevia, the relative proportion of innovative prepositions is 60%, while the respective percentage for all the other document types (non-brevia) is only 6.9%. To clarify the relationship between innovative prepositions, edit distance, and document type, a linear regression analysis was conducted in IBM SPSS (Table 4). Note that to estimate the influence of the document type, it is necessary to average the edit distance value over the total number of characters in each document instead of the number of characters produced by each scribe. This is because one scribe could write documents of several types. Thus, the statistics of Table 4 are calculated using the document as the counting unit.
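As an illustration of the kind of model summarized in Table 4, the sketch below computes standardized regression coefficients, i.e. the coefficients of a model fitted to z-scored variables, for two explaining variables from their pairwise Pearson correlations. It is written in Python rather than IBM SPSS, the function names are hypothetical, and the six data points are invented toy values, not LLCT figures.

```python
# Illustrative sketch with invented toy data; not the SPSS analysis of the study.
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def standardized_betas(y, x1, x2):
    """Standardized coefficients and R squared of y regressed on x1 and x2."""
    r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    denom = 1 - r_12 ** 2
    beta1 = (r_y1 - r_12 * r_y2) / denom
    beta2 = (r_y2 - r_12 * r_y1) / denom
    r_squared = beta1 * r_y1 + beta2 * r_y2
    return beta1, beta2, r_squared


# Hypothetical per-document values: edit distance (%), document type
# (0 = breve, 1 = non-breve), and share of innovative prepositions (%).
edit_distance = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
non_breve = [1, 1, 1, 0, 1, 0]
innovative_share = [0.0, 5.0, 4.0, 40.0, 10.0, 60.0]
print(standardized_betas(innovative_share, edit_distance, non_breve))
```

The closed-form expressions used here hold only for the two-predictor case; a model with more explaining variables would need a general least-squares solver.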
Table 4 Linear regression model for the relative proportion of innovative prepositions in LLCT, calculated per document
Explaining variable | Standardized coefficient (beta) | Significance
Edit distance | 0.280 | P < 0.001
Document type (reference category: breve) | −0.175 | P < 0.001
R2 = 0.111
The variable to be explained by the regression model is the percentage of the innovative prepositions. The model examines both explaining variables (edit distance and document type) at the same time.11 The standardized coefficient indicates, in standard-deviation units, how much the predicted value of the explained variable increases (positive value) or decreases (negative value) when the explaining variable increases by one standard deviation, the other explaining variable being held constant. Just like the explained variable, edit distance is a continuous variable with values between 0 and 100%, while document type is a binary categorical variable, where breve is the reference category (0) as opposed to non-breve (1). The figures of Table 4 show that when the edit distance value increases by one standard deviation, the predicted percentage of innovative prepositions increases by 0.28 standard deviations. In other words, the more spelling deviations there are in a document, the more frequent are also the innovative prepositions. Likewise, the negative coefficient of the document type variable (−0.175) shows that a shift from 0 to 1, i.e. from breve to non-breve, lowers the predicted percentage of innovative prepositions. That is to say, the document type being other than breve predicts a lower percentage of innovative prepositions. All this means that spelling variation is, indeed, associated in a statistically significant manner with an increased use of innovative prepositions even on the document level, but the document type must also be taken into consideration. The explanation degree of the model is relatively low (R2 = 0.111), so a richer model with more explaining variables is needed to account better for the variation in the use of the innovative prepositions. This will again be a subject for a further study. It is now time to turn to examining the diachronic distribution and the historical dimension of edit distance. Figure 2 presents the edit distance percentage as a function of time, with edit distance averaged per scribe. In other words, each dot of the scatter plot represents the proportion of single-character spelling deviations (i.e. edits) in the sum of all the characters written by a scribe. For example, the productive scribe Austripertus wrote fifteen documents in the 760s and 770s. These 15 documents comprise 18,809 characters and contain only 516 spelling deviations, which makes Austripertus the most standardly spelling scribe of LLCT with an edit distance of 2.7% (see the lowest dot in Fig. 2).
Fig. 2 Edit distance as a function of time in LLCT, calculated per scribe
Note that the dot's position on the time axis represents the chronological middle of the scribe's active career, i.e. the mean of the years of his first and last surviving document in LLCT, as in the case of Austripertus: his first surviving document dates back to 767 and the last one to 773, so the dot is situated at 770.
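The per-scribe figures plotted in Fig. 2 reduce to simple arithmetic, which can be sketched in Python as follows. The function names are hypothetical; the numbers are those given above for Austripertus.

```python
# Illustrative sketch; function names are hypothetical.

def edit_distance_percentage(edits: int, characters: int) -> float:
    """Share of single-character edits in all characters written, in per cent."""
    return 100 * edits / characters


def career_midpoint(first_year: int, last_year: int) -> float:
    """Mean of the years of the first and last surviving document."""
    return (first_year + last_year) / 2


# Austripertus: 516 edits in 18,809 characters, documents dated 767 to 773.
print(round(edit_distance_percentage(516, 18809), 1))  # 2.7
print(career_midpoint(767, 773))                       # 770.0
```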
Only one document survives from 49% of the 220 LLCT scribes, whereas 50% of the documents are written by the 17 most prolific scribes, who were active for several years. Indeed, 27% of the scribes worked for more than 5 years, the longest career spanning an astonishing 42 years. The method proposed here may be criticized as inaccurate because it relies on an average of writing years, over which the documents may be rather unevenly scattered. It may also be argued with reason that the scribes' spelling was likely to vary over their career. Nonetheless, the pattern of Fig. 2 is surprisingly consistent with the diachronic distribution of the edit distance of each document, i.e. where the edit distance percentage is averaged over the total number of characters in each document. Therefore, I consider it justified to utilize the averages of the activity periods. Somewhat surprisingly from the linguistic and historical point of view, Fig. 2 shows that the edit distance decreases over time. In other words, the scribes' spelling seems to improve within the time window of LLCT. However, the data points do not reveal a fully linear relationship between edit distance and time. This and the fact that the dispersion is wide result in a correlation coefficient of −0.47.12 Instead, the diachronic development seems rather to follow a polynomial curve: after the initial considerable variation, the average percentage falls rather drastically around 810, and then the data points remain relatively stable until they rise again a little before the end of the time frame. In the 9th century, especially between the 820s and the 870s, the great majority of the scribes employ more standard spelling than the 220 LLCT scribes on average. The above-mentioned notable variation before 810 and the radical decrease thereafter are likely to be related to a striking homogenization of the scribes' language skills. This becomes obvious when Fig. 2 is compared to the bar chart of Fig. 3, which presents the relative distribution of edit distance per decade: the general trend is decreasing until well into the 9th century, when a slight increase begins roughly after the 830s. Note that since only relatively few documents survive from the first four decades of LLCT, the decades 710–730 and 740–750 have been combined in Fig. 3. Now all the chart bars contain at least 29 documents, the smallest one being the bar 710–730 with 47,807 characters.13
Fig. 3 Edit distance as a function of time in LLCT, calculated per decade
Yet, one could have expected the scribes' spelling to deteriorate consistently over time, given that the spoken language obviously kept drawing away from the written language and, as a consequence, the mastery of the written code became more and more difficult to acquire on the basis of one's mother tongue. On the other hand, a rather common view on the conceptual differentiation between the Romance vernaculars and medieval Latin as a learned lingua franca is that language users became aware of this distance, or at least began to be aware of it, at the time when the reformatory movements, initiated by Carolingian rulers, re-established the (post-)Classical grammar as the ideal for written Latin. Either because of some natural development or in consequence of these reforms, the common people would no longer have understood the reformed, classicizing, written Latin, and it was necessary to create a written form of the spoken vernacular for the time being (Heene, 1991; cf. Van Uytfanghe, 1991; Wright, 1991b). This ex novo development then took place in different ways and at different paces, depending on the needs of local societies.
Most scholars agree on this general line of development, but their views still conflict on the motivations and mechanisms of the conceptual differentiation between Latin and vernacular as well as on the emergence of the vernacular writing systems.14 However, it has been thought that the Carolingian reforms did not greatly affect the Regnum Italiae, a rather detached part of the empire after its conquest in 774. This opinion seems to be based on the slow introduction of the Carolingian minuscule, the reform-induced standard script, into Italy (Bartoli Langeli, 2006, pp. 30–3) and probably also on the late emergence of the extensive use of the Italo-Romance vernaculars. In spite of this, the falling trend visible in Fig. 2 might hint at first sight at a kind of reform in spelling, whether or not it is related to the Carolingian reforms. Orthography played a central role in the Carolingian reforms,15 but the improving spelling in LLCT may well be a parallel, albeit independent, development and, as such, a symptom of a growing awareness of the two vastly different varieties: the spoken and the written form of what was still probably considered Latin (if conceptualized at all).16 It is perhaps worth recalling that the earliest reliably dated surviving fragments of vernacular Italian writing are more or less contemporary with LLCT, the famous placiti cassinesi (960–963) being indeed documentary texts themselves (Castellani, 1976). Another remarkable point is the above-mentioned particularly narrow variation of edit distance in the 820s–870s, where most of the values are between 4 and 6% (Fig. 2). This feature of the spelling variation distribution, too, appears to bear witness to a specific historical change. Under the Lombard reign in the 8th century, most Luccan scribes belonged to the clergy and signed their documents as clerics, priests, and deacons.
Keller (1973) shows that the Frankish counts of Lucca reorganized the legal administration of Tuscia in the early decades of the 9th century, probably according to the model of the royal court of Pavia where the scribes were all laymen (Keller, 1973, pp. 119–24; cf. Costambeys, 2013, p. 235).17 The radical reduction of variation in spelling and its concomitant increasing faithfulness to the standard in Fig. 2 coincides neatly with the ascension to power of the new, energetic count Bonifatius I, in 812 or 813. Keller reports that, under Frankish counts, the ecclesiastical scribes began to add first ‘notarius’ to their ecclesiastical title until, with Bonifatius I, the clergy was practically excluded from public administration. Thus, we should not explain Fig. 2 by claiming that the Tuscan scribes reformed their spelling suddenly, but by suggesting that the documentary evidence after the 810s comes almost in its entirety from a different body of scribes than earlier: ecclesiastical scribes were ousted by lay scribes by the count's order. These lay scribes, who actually are not very many in number, seem to employ a more standardized spelling compared to the ecclesiastical scribes. To explain the reasons behind this, an in-depth study would be needed but, for the present, it can be assumed that the ecclesiastical scribes were a heterogeneous group consisting not only of productive experts close to the Luccan cathedral but also of ordinary country clerics who were sometimes quite inexperienced in (documentary) writing. I suggest that this is reflected by the considerable dispersion of the data points in Fig. 2 before the 810s. Albeit a tempting interpretation, it is improbable that the Frankish rulers would have replaced the former comital lay scribes with their own Frankish or Northern Italian scribes, who would have been more trained spellers. 
First, the lay scribes continue to utilize virtually the same formulae that had been in use in Tuscia, and they seem to have locally widely used names. Besides, the historical sources of the time do not tell about permanent transfers of scribes; on the contrary, the Frankish conquerors rather seem to have had problems in recruiting loyal officials in Italy. This Section 4 has demonstrated that the edit distance distribution does not only superficially reflect historical events, but also tells us interesting details about them—details that would have remained unknown without the quantitative approach. Now we know that the historical fact that certain scribes were replaced by certain other scribes shows as a temporary rise in standard spelling, albeit we do not know the exact spelling conventions which changed. This will again be a case for further study. We also got to know that the general trend, which had already begun before the administrative change of the 810s, was towards a more standardized spelling, and might be a symptom of some kind of reformatory aspirations among the Tuscan scribes. After discussing the ouster of the ecclesiastical scribes, it looks obvious that the improving spelling quality of LLCT is partly related to this change from the 810s onwards. Contrary to ecclesiastical scribes, the education of the lay scribes was perhaps influenced by at least some of the reformatory ideals which were likely to obtain in the Carolingian power centres of Italy. The displacement of the ecclesiastical document writers also meant an end for amateur documentary production, and can be seen as a step in a process towards centralization and increasing professionalism which contributed to the formation of the Italian notaryship (Amelotti and Costamagna, 1975, p. 153 ff.; Costambeys, 2013, p. 246 ff.). 
By combining the edit distance metric with selected linguistic variables, it will probably be possible to sketch a more exhaustive picture of what was really happening and whether the changes of personnel also involved changes in working practices and shared linguistic preferences. 5 Summary and Conclusion This article discussed a method of measuring spelling variation in historical text corpora with non-standard spelling. The method was tested on the LLCT. Spelling variation was operationalized by normalizing the non-standard word forms of LLCT into standard-Latin forms and subsequently quantified by calculating the Levenshtein edit distance between the normalized and attested word forms. The normalized forms were produced by matching the lemma + morphological analysis pairs of LLCT with those of a sufficient standard-Latin word form lexicon. The developed method was tested in Section 4 by a case study, which was linguistic by nature, and by a somewhat broader diachronic discussion, which was meant to contribute to historical document studies. The linguistic case study was based on the hypothesis that spelling in historical texts is likely to correlate with the mastery of certain linguistic features. It was assumed that the more non-standard the spelling is, the more frequent are the novel, Romance-type, linguistic features, and the poorer the command of those standard-Latin features that were in decline in Late Latin. The case study showed that there is, indeed, a statistically significant dependence between non-standard spelling and the use of non-standard, innovative prepositions, although the effect of the document type must also be taken into account. The historical-diachronic examination discussed the chronological variation of edit distance within LLCT and showed that the scribes' spelling becomes more standard-conformant over time within the examined time window.
The particularly drastic decrease of variation around 810 and the low edit distance value thereafter seem to be connected to the historical development of replacing ecclesiastical scribes by lay scribes in consequence of an administrative reorganization in Tuscany. Together with the general decreasing trend, the relatively standard spelling of the lay scribes may hint at an intentional spelling reform, either independent or related to the Carolingian reforms. Thus, the spelling variation quantification turns out to provide interesting insights into the linguistic reality and attitudes. I hope to further examine them in a prospective more detailed linguistic–historical study on documentary Latin. The method can also be applied to other historical text corpora of different languages, such as Latin and Greek inscription corpora, ancient Greek documentary papyri, as well as various European early modern private letter collections, to name but a few. Footnotes 1 Many thanks are due to Professor Dag Haug who commented on the first version of this article. I am also grateful to the two anonymous reviewers for their insightful comments and suggestions. 2 Linguists have begun to exploit orthographic variation as a variable only relatively recently. It has been so far utilized mainly by historical sociolinguistics (see Rutkowska and Rössler, 2012) and, on the other hand, by language acquisition studies (Burt, 2006, Wood et al., 2014). For a corpus-based application of spelling correctness to text classification, see Diderichsen et al. 2015. 3 The editions are Schiaparelli, 1929 (CDL1), Schiaparelli, 1933a (CDL2), Brunetti, 1833 (CDT), Barsocchini, 1837 (MED), Bertini, 1836 (MED2), Barsocchini, 1841 (MED3). Most documents have also been published recently in Cavallo et al. 1997– (ChLA), which has helped in correcting some erroneous readings of the 19th-century editions. 
4 I used a preprocessed version of the Open Office Latin lexicon available at the GitHub repository of the CIS-OCR team of The Center for Information and Language Processing at the University of Munich (https://github.com/cisocrgroup/Resources/tree/master/lexica, accessed 20 May 2017). 5 I cordially thank Matti Lassila (University of Jyväskylä Library) for the design of the R script. 6 For example, the subject presbitero ‘priest’ in si presbitero venerit ‘if the priest comes’ is usually considered a phonologically reduced form of the standard accusative form presbyterum. In standard Latin, the subjects of finite verbs require the nominative (presbyter). Apart from witnessing phonological change, this kind of accusative subject is an indication of the morphosyntactic reorganization of grammatical relations between the arguments of the verb that took place in Late Latin (see Rovai, 2016, pp. 57–86). 7 In standard Latin, the preposition ad requires an accusative (casam), while the preposition de requires an ablative (casa); the case of the subject is the nominative (casa). For more examples of the annotation of LLCT, see Korkiakangas and Passarotti, 2011. 8 There seems to be no overview study on the innovative prepositions in Late Latin. For the various aspects of the Late Latin preposition use, see Adams, 2013, pp. 582–93 (compound adverbs and prepositions); Svennung, 1935, pp. 325–82 (adverbs and prepositions); Löfstedt, 1961, p. 278 ff. (prepositions and preverbs); Rovai, 2013 (deverbal prepositions). ThLL provides important information on the pre-600 AD occurrences of certain Late Latin prepositions: fine/fini (rather dubious attestations in, e.g., Apuleius and Columella, s.v. finis, ThLL VI:1, 798, 43–68), foras/foris (prepositional use condemned by ancient grammarians, s.v. foras, ThLL VI:1, 1039, 76–1040, 23; 1046, 8–56). For da, see Aebischer, 1976, p. 1252. 9 Correlation coefficients lower than 0.30 are conventionally thought to stand for no correlation.
With this kind of non-physical data, the correlation often remains rather low, although it is theory-compatible and seems to describe plausibly the examined phenomenon. 10 In modern standard Italian, these prepositions correspond to a, con, di, in, per, secondo, su, and sopra, respectively. In addition to minor phonologically motivated changes, some semantic shifts are also attested between Latin and Italian, but I do not consider them to be thorough enough to discredit the proposed research setting. For example, in Italian, a (< Lat. ad) can also express recipiency along with the original local/pertinentive meaning, and the semantic scope of the Classical Latin de, originally related to source and partitive meanings, has widened considerably to cover ownership in Italian (di) (Väänänen, 1956; Valentini, 2017). Note that I do not consider the Italian fra and tra to be direct descendants of the Latin infra and intra, respectively, because their complex evolution involves both considerable semantic shift and phonetic reduction in respect of the Latin ancestral forms (on the merger of infra, intra, intro, and inter, see Svennung, 1935, pp. 366–71). 11 The exiguous correlation between the explaining variables (r = −0.032) fulfils the conditions for regression analysis. 12 The dating by the mean of the active years is bound to distort the distribution and decrease the correlation coefficient. 13 The number of documents is 2 in the 710s (4,305 characters), 12 in the 720s (23,154 characters), 15 in the 730s (20,348 characters), 16 in the 740s (25,806 characters), 31 in the 750s (53,151 characters), and 69 on average in each decade thereafter (124,069 characters on average per decade).
14 The decision of the Council of Tours in 813 that the sermons should be preached in rustica Romana lingua instead of Latin in the Romance-speaking parts of Gaul is usually cited in this connection because it is apparently the first mention of a Romance language as distinct from Latin (Heene, 1991, p. 146; Wright, 2002, pp. 140–3). There is a vast and relatively recent literature on the transition between Latin and early Romance, largely as a reaction to Wright's book Late Latin and Early Romance (1982). I refer here to the collection of articles in Wright, 1991a. 15 Alcuin of York is often considered a central figure in carrying out Charlemagne's intellectual renovation projects. He was charged with writing the treatises De orthographia (On Orthography) and Ars grammatica (Art of Grammar) (Irvine, 1994, pp. 313–16; Wright, 2002, pp. 127–46). 16 This article does not enter the discussion on how the written Romance languages came into being or whether that preceded or followed the metalinguistic conceptualization of the distance between the varieties. For this, I refer to Wright, 1991b, and Lloyd, 1991. 17 For a thorough discussion on the (justification of the) terms ‘ecclesiastical’ and ‘lay’ in documentary context, see Costambeys, 2013. References Adams J. N. (2013). Social variation and the Latin language. Cambridge: Cambridge University Press. Aebischer P. (1951). La préposition da dans les chartes latines italiennes du moyen âge. Cultura Neolatina, 11: 5–24. Amelotti M., Costamagna G. (1975). Alle origini del Notariato italiano. Milano: Giuffrè. Auernheimer B. (2003). Die Sprachplanung der karolingischen Bildungsreform im Spiegel von Heiligenviten. München, Leipzig: K.G. Saur. Bamman D., Passarotti M., Crane G., Raynaud S. (2007). Guidelines for the Syntactic Annotation of Latin Treebanks (v. 1.3). http://nlp.perseus.tufts.edu/syntax/treebank/ldt/1.5/docs/guidelines.pdf (accessed 19 May 2017). Barsocchini D. (1837).
Memorie e documenti per servire all'istoria del Ducato di Lucca. Tomo V, parte II. Lucca: Francesco Bertini. Barsocchini D. (1841). Memorie e documenti per servire all'istoria del Ducato di Lucca. Tomo V, parte III. Lucca: Francesco Bertini. Bertini D. (1836). Memorie e documenti per servire all'istoria del Ducato di Lucca. Tomo IV, parte II. Lucca: Francesco Bertini. Brunetti F. (1833). Codice diplomatico toscano. Tomo I, parte II. Firenze: Leopoldo Allegrini e Giovanni Mazzoni. Bartoli Langeli A. (2006). Notai. Scrivere documenti nell'Italia medievale. Roma: Viella. Bresslau H. (1958). Handbuch der Urkundenlehre für Deutschland und Italien. Band 1. Berlin: Walter de Gruyter. Brown W. C., Costambeys M., Innes M., Kosto A. J. (eds) (2013). Documentary Culture and the Laity in the Early Middle Ages. Cambridge: Cambridge University Press. Burt J. (2006). Spelling in adults: the combined influences of language skills and reading experience. Journal of Psycholinguistic Research, 35(5): 447–70. Castellani A. (1976). I più antichi testi italiani: edizione e commento. Bologna: Pàtron. Cavallo G., Nicolaj G. (eds) (1997–). Chartae Latinae Antiquiores. Facsimile-edition of the Latin Charters, 2nd Series, Ninth Century. Dietikon/Zürich: Urs Graf Verlag. Costambeys M. (2013). The laity, the clergy, the scribes and their archives: the documentary record of eighth- and ninth-century Italy. In Brown W., Costambeys M., Innes M., Kosto A. (eds), Documentary Culture and the Laity in the Early Middle Ages, pp. 231–58. Thesaurus Linguae Latinae (ThLL). (1900–). Deutsche Akademie der Wissenschaften zu Berlin. Lipsiae: B. G. Teubner. Diderichsen P., Christensen S., Schack J. (2015). Ranking corpus texts by spelling error rate. In Mambrini F., Passarotti M., Sporleder C.
(eds), Proceedings of the Workshop on Corpus-Based Research in the Humanities (CRH), 10 December 2015, Warsaw, Poland. http://crh4.ipipan.waw.pl/files/9814/4973/5451/CRH4_proceedings.pdf (accessed 19 May 2017). Goldschneider J. M., DeKeyser R. M. (2001). Explaining the “natural order of L2 morpheme acquisition” in English: a meta-analysis of multiple determinants. Language Learning, 51: 1–50. Heene K. (1991). Audire, legere, vulgo: an attempt to define public use and comprehensibility of Carolingian hagiography. In Wright R. (ed.), Latin and the Romance Languages in the Early Middle Ages, pp. 146–63. Irvine M. (1994). The Making of Textual Culture: Grammatica and Literary Theory 350–1100. Cambridge: Cambridge University Press. Keller H. (1973). La marca di Tuscia fino all'anno Mille. In Atti del V Congresso internazionale di studi sull'alto medioevo. Spoleto: Centro studi, pp. 117–36. Korkiakangas T. (2016). Subject Case in the Latin of Tuscan Charters of the 8th and 9th Centuries. Espoo: Societas Scientiarum Fennica. Korkiakangas T. (in press). Spoken Latin behind written texts: formulaicity and salience in medieval documentary texts. Diachronica. Korkiakangas T., Lassila M. (2013). Abbreviations, fragmentary words, formulaic language: treebanking medieval charter material. In Mambrini F., Sporleder C., Passarotti M. (eds), Proceedings of the Third Workshop on Annotation of Corpora for Research in the Humanities (ACRH-3), Sofia, 2013. Sofia: Bulgarian Academy of Sciences, pp. 61–72. Korkiakangas T., Passarotti M. (2011). Challenges in annotating medieval Latin charters. Journal of Language Technology and Computational Linguistics, 26(2): 103–14. Larson P. (2000). Tra linguistica e fonti diplomatiche: quello che le carte dicono e non dicono. In Herman J., Marinetti A. (eds), La preistoria dell'italiano. Tübingen: Niemeyer, pp. 151–66.
Late Latin Charter Treebank (LLCT). (2015). Collection of Tuscan charters from AD 714–869. Currently available upon request from the author. Lloyd P. M. (1991). On the names of languages (and other things). In Wright R. (ed.), Latin and the Romance Languages in the Early Middle Ages, pp. 9–18. Löfstedt B. (1961). Studien über die Sprache der langobardischen Gesetze. Beiträge zur frühmittelalterlichen Latinität. Uppsala: Almqvist & Wiksell. Niermeyer J. F., van de Kieft C. (1976). Mediae Latinitatis Lexicon Minus, vol. II. Revised by Burgers J. W. J. Darmstadt: Wissenschaftliche Buchgesellschaft. Open Office Latin lexicon. Preprocessed version of the CIS-OCR team of The Center for Information and Language Processing at the University of Munich. https://github.com/cisocrgroup/Resources/tree/master/lexica (accessed 20 May 2017). Rovai F. (2012). Sistemi di codifica argomentale: Tipologia ed evoluzione. Pisa: Pacini Editore. Rovai F. (2013). The development of deverbal prepositions in Latin: morpho-syntactic and semantico-pragmatic factors. Archivio Glottologico Italiano, 98(2): 175–213. Rutkowska H., Rössler P. (2012). Orthographic variables. In Hernández-Campoy J. M., Conde-Silvestre J. C. (eds), The Handbook of Historical Sociolinguistics. Chichester: Blackwell Publishing, pp. 213–36. Sabatini F. (1965). Esigenze di realismo e dislocazione morfologica in testi preromanzi. Rivista di Cultura Classica e Medievale, 7: 972–98. Schiaparelli L. (1929). Codice diplomatico longobardo I. Fonti per la storia d'Italia, 62. Roma: Tipografia del Senato. Schiaparelli L. (1933a). Codice diplomatico longobardo II. Fonti per la storia d'Italia, 63. Roma: Tipografia del Senato. Schiaparelli L. (1933b). Note diplomatiche sulle carte longobarde, II: Tracce di antichi formulari nelle carte longobarde. Archivio storico italiano, 19: 3–34. Svennung J. (1935).
Untersuchungen zu Palladius und zur lateinischen Fach- und Volkssprache. Uppsala: Almqvist & Wiksell. Väänänen V. (1956). La préposition de et le génitif: une mise au point. Revue de Linguistique Romane, 20: 1–20. Valentini C. (2017). L'evoluzione della codifica del genitivo dal tipo sintetico al tipo analitico nelle carte del Codice diplomatico longobardo. Ph.D. thesis, University of Florence. Van Uytfanghe M. (1991). The consciousness of a linguistic dichotomy (Latin–Romance) in Carolingian Gaul: the contradictions of the sources and of their interpretation. In Wright R. (ed.), Latin and the Romance Languages in the Early Middle Ages, pp. 114–29. Whitaker's WORDS (Words 1.97FC). http://mk270.github.io/whitakers-words/operational.html (accessed 19 May 2017). Wood C., Kemp N., Waldron S., Hart L. (2014). Grammatical understanding, literacy and text messaging in school children and undergraduate students: A concurrent analysis. Computers and Education, 70: 281–90. Wright R. (1982). Late Latin and Early Romance in Spain and Carolingian France. Liverpool: Cairns. Wright R. (ed.) (1991a). Latin and the Romance Languages in the Early Middle Ages, 2nd edn, 1996. University Park, PA: The Pennsylvania State University Press. Wright R. (1991b). The conceptual distinction between Latin and Romance: invention or evolution. In Wright R. (ed.), Latin and the Romance Languages in the Early Middle Ages, pp. 103–13. Wright R. (2002). A Sociophilological Study of Late Latin. Turnhout: Brepols. Zamboni A. (2000). Alle origini dell'italiano: dinamiche e tipologie della transizione dal latino. Roma: Carocci. © The Author 2017. Published by Oxford University Press on behalf of EADH. All rights reserved.
For Permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model) TI - Spelling variation in historical text corpora: The case of early medieval documentary Latin JF - Digital Scholarship in the Humanities DO - 10.1093/llc/fqx061 DA - 2018-09-01 UR - https://www.deepdyve.com/lp/oxford-university-press/spelling-variation-in-historical-text-corpora-the-case-of-early-n4DiU4cD0X SP - 575 EP - 591 VL - 33 IS - 3 DP - DeepDyve ER -