Previous studies have been hampered by the very incomplete dataset available for analysis. Consequently, this sub-project will take the new corpus of all Sicilian texts as its starting point and transform it into a tool for systematic computational linguistic analysis, as the basis for a new and wide-ranging study of the linguistic history of the island. Such a study will go far beyond existing work, not simply in the systematic nature of its coverage, but also in its cross-linguistic range and in its temporal span, from the Archaic period to late Antiquity. This will be achieved by extending the mark-up of the texts in the TEI corpus through a systematic programme of sentence and word tokenization, part-of-speech tagging, lemmatization and, lastly, syntactic analysis.
A wide range of Natural Language Processing tools already exists to support this work, principally developed for literary texts. A pipeline of such tools, applicable to multiple ancient languages, has been consolidated in the Classical Languages Toolkit (Burns 2019). Current work on the PapyGreek project (PI Marja Vierros, who is on the Advisory Board for this project) is developing the Sematia platform to enable the application of these tools to the more complex corpus of Greek papyrological texts, which are also encoded in EpiDoc (Vierros 2018; Vierros and Henriksson 2017). This provides the model and the tools for the creation of linguistic layers from the EpiDoc corpus of I.Sicily. Work is also underway in the LiLa project to consolidate and unify the tools specifically for Latin (Francesco Mambrini, a researcher on the LiLa project, is on the Advisory Board for this project). Additional support on the Advisory Board is provided by Professor Wolfgang De Melo (University of Oxford), an expert in Latin historical linguistics, and Dr Alex Mullen, PI on the LatinNow project, which studies the sociolinguistics of the northwestern Roman empire through epigraphic evidence.
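The first annotation step named above, word tokenization of an EpiDoc edition, can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the project's pipeline: it uses only the Python standard library rather than the Classical Languages Toolkit or Sematia, the sample XML is a hypothetical and heavily simplified EpiDoc fragment, and the function name `tokenize_edition` and the record fields (`lemma`, `pos`, `head`) are invented here to show where the later layers (lemmatization, part-of-speech tagging, syntactic analysis) would attach.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, simplified EpiDoc fragment (not taken from I.Sicily).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="edition" xml:lang="la">
      <ab><lb n="1"/>dis manibus <lb n="2"/>vixit annis</ab>
    </div>
  </body></text>
</TEI>"""


def tokenize_edition(xml_string):
    """Return one record per word found in the edition division(s).

    The lemma, pos and head fields are left empty: they are placeholders
    for the annotation layers that later tools would fill in.
    """
    root = ET.fromstring(xml_string)
    tokens = []
    for div in root.iterfind(".//tei:div[@type='edition']", TEI_NS):
        lang = div.get(XML_LANG)
        for ab in div.iterfind(".//tei:ab", TEI_NS):
            # Flatten the text block, ignoring line-break elements.
            text = "".join(ab.itertext())
            # Naive tokenization: runs of word characters (Unicode-aware).
            for form in re.findall(r"\w+", text):
                tokens.append({
                    "id": len(tokens) + 1,
                    "form": form,
                    "lang": lang,
                    "lemma": None,  # to be filled by a lemmatizer
                    "pos": None,    # to be filled by a POS tagger
                    "head": None,   # to be filled by syntactic analysis
                })
    return tokens


if __name__ == "__main__":
    for token in tokenize_edition(SAMPLE):
        print(token["id"], token["lang"], token["form"])
```

Real inscriptions would need considerably more care than this sketch takes, for example with abbreviations, restorations, lacunae and words broken across lines, all of which EpiDoc encodes with dedicated elements.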
This sub-project will be led by a post-doctoral researcher in historical linguistics, appointed for three years (years 2-4 of the Crossreads project).