Tools for creating dependency treebanks of epigraphic texts

Background

Treebanks offer a powerful way to bring together a given corpus with our general understanding of how languages work. These are ‘structured corpora that are annotated for the assumptions that go into the research based on them’ (: 192). The last twenty years have seen a blossoming of such treebanked corpora of the world’s languages. These vary in what they cover, but most consist of morphological and syntactic layers. It is the tree-like representation of syntactic and semantic structures that gave rise to the term treebank (https://en.wikipedia.org/wiki/Treebank, accessed 14th June 2022).

        The last decade and a half have seen considerable advances in the syntactic treebanking of ancient Greek and Latin texts. Literary texts have received the greatest attention. Treebanks of literary texts include the PROIEL corpus (http://syntacticus.org/), the Perseus Ancient Greek and Latin Dependency Treebank (AGLDT, https://perseusdl.github.io/treebank_data/) and the Gorman Trees (https://perseids-publications.github.io/gorman-trees/).

        Papyri have also received attention: the PapyGreek project (https://papygreek.hum.helsinki.fi/) has released treebanks of both documentary and literary papyri. Finally, there is ongoing work on the automated syntactic parsing of Greek and Latin texts (https://github.com/perseids-publications/pedalion-trees/).

        Epigraphic texts have only recently started to be analysed in this way. The Corpus of the Epigraphy of the Italian Peninsula in the 1st Millennium BCE (CEIPoM) (Pitts 2022) is one such project, including texts from ‘Oscan, Umbrian, Old Sabellic, Messapic and Venetic languages, as well as epigraphic Latin up to 100 BCE’. The ERC-funded Crossreads project (), whose focus is the ‘interplay of linguistic and textual material culture in ancient Sicily over a period of 1,500 years’, sets out to provide full linguistic analysis (tokenization, lemmatization, morphological and syntactic analysis) of the already-digitised inscriptions from the I.Sicily corpus for Ancient Sicily. A wide range of languages are attested within the corpus:

  • Greek
  • Latin
  • Phoenician-Punic
  • Sikel
  • Elymian
  • Hebrew

        Ultimately, therefore, the linguistically annotated corpus will provide the means for exploring linguistic and sociolinguistic relations on the island. In order to build the treebank, however, it has been necessary to build some of the required tools.

Choices of formalisms

Greek and Latin

There is more than one way of drawing a dependency tree of a given sentence. In recent years dependency treebanks of ancient Indo-European languages have been made according to the following standards:

  • Perseus Ancient Greek and Latin Dependency Treebank (AG[L]DT), based on Prague (Celano, Crane & Almas 2014)
  • Pragmatic Resources in Ancient Indo-European Languages (PROIEL) (Haug & Jøhndal 2008)

        All but one of the treebanked corpora in Ancient Greek and Latin have been annotated according to the AGLDT standard, including CEIPoM, the only treebank concerned with epigraphic sources. (Pragmatic Resources in Ancient Indo-European Languages uses its own formalism.) These formalisms are well suited to the annotation of the aforementioned treebanks, since they were designed with either Greek and Latin specifically (AGLDT) or Old Indo-European languages more generally (PROIEL) in mind.

Hebrew and Phoenician-Punic

The choice of formalism for the I.Sicily corpus is not so straightforward, given the presence of at least two Semitic (i.e. non-Indo-European) languages, namely Hebrew and the closely related Phoenician-Punic. Whereas dependency formalisms have been mainstream for treebanks of Indo-European languages, phrase-structure formalisms have been preferred for treebanks of Hebrew. Both ETCBC (https://github.com/etcbc/trees) and Clear Bible (https://github.com/Clear-Bible/macula-hebrew) have released phrase-structure tree analyses of the sentences in the Hebrew Bible.

        Hebrew (and Phoenician-Punic) lend themselves to phrase-structure analyses more than Latin and Greek, given their higher degree of configurationality. This does not at all preclude dependency analyses of Hebrew and Phoenician-Punic data. Indeed, work has recently started on a dependency treebank for Ancient Hebrew (Swanson & Tyers 2022). The formalism used, in contrast to treebanks of Greek, Latin and other Old Indo-European languages, is that of Universal Dependencies (UD; de Marneffe et al. 2021), which has become a standard for treebanks outside of Classics.

Comparing formalisms

The AGLDT and PROIEL formalisms both set out to represent surface-syntactic structure, whereas UD targets the syntax-semantic layer of a given language. The difference shows in the treatment of function words. In UD, function words are subordinate to content words (Osborne & Maxwell 2015: 241): prepositions are subordinate to their objects, auxiliaries to their content-word counterparts, and copulas to their predicate nominals (for a full list see Osborne & Gerdes 2019: 4–5). By contrast, in PROIEL and AGLDT prepositions head their objects, auxiliaries head their content words, and copulas head their predicate nominals.
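The contrast in headedness can be made concrete with a minimal sketch. The Latin phrase in urbe habitat ('he lives in the city') is an illustrative example, not taken from the I.Sicily corpus, and the Token layout and the function name promote_prepositions are assumptions made for this sketch rather than part of any tool described in this post. The function only flips the attachment of prepositions; it leaves the UD relation labels in place, so its output is not a complete AGLDT or PROIEL annotation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Token:
    id: int        # 1-based position in the sentence
    form: str
    head: int      # id of the governing token, 0 for the root
    relation: str  # label on the edge from the head


# UD-style analysis: the noun attaches to the verb, the preposition to the noun.
UD_TREE = [
    Token(1, "in", 2, "case"),
    Token(2, "urbe", 3, "obl"),
    Token(3, "habitat", 0, "root"),
]


def promote_prepositions(tokens):
    """Return a copy in which each 'case' dependent heads its former governor.

    Hypothetical helper: assumes at most one 'case' dependent per noun.
    """
    by_id = {t.id: t for t in tokens}
    new_heads = {t.id: t.head for t in tokens}
    for t in tokens:
        if t.relation == "case":
            noun = by_id[t.head]
            new_heads[t.id] = noun.head  # preposition takes over the noun's attachment
            new_heads[noun.id] = t.id    # noun now depends on the preposition
    return [replace(t, head=new_heads[t.id]) for t in tokens]


SURFACE_TREE = promote_prepositions(UD_TREE)
for token in SURFACE_TREE:
    print(token.id, token.form, token.head, token.relation)
```

In the UD tree the verb governs the noun, which in turn governs the preposition; after the flip the verb governs the preposition, which governs the noun, as in the surface-syntactic formalisms.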

        Despite the structural similarities between AGLDT and PROIEL, PROIEL and UD share an important characteristic, namely the scope for the annotation of secondary dependencies. Secondary dependencies allow for the annotation of, for example, the subject of an infinitive in a control sentence (see Figure 3 below).
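One way to think about secondary dependencies is as extra labelled edges stored alongside each token's single primary head. The sketch below is a hypothetical data structure, not the Arethusa or PROIEL file format. The toy sentence Marcus vult venire ('Marcus wants to come') is invented for illustration, and the labels sub, pred, xobj and xsub are assumed here to follow PROIEL-style naming.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    form: str
    head: int       # primary head (0 = root)
    relation: str   # label on the primary edge
    # Secondary edges as (target_id, label) pairs; empty for most tokens.
    secondary: list = field(default_factory=list)


SENTENCE = [
    Node(1, "Marcus", 2, "sub"),
    Node(2, "vult", 0, "pred"),
    # The infinitive points back at the subject it shares with the main verb.
    Node(3, "venire", 2, "xobj", secondary=[(1, "xsub")]),
]


def edges(nodes):
    """Yield (source_id, target_id, label, is_secondary) for every edge."""
    for node in nodes:
        if node.head != 0:
            yield (node.head, node.id, node.relation, False)
        for target, label in node.secondary:
            yield (node.id, target, label, True)


ALL_EDGES = list(edges(SENTENCE))
for edge in ALL_EDGES:
    print(edge)
```

Once the secondary edge is included, Marcus has two incoming edges, so the structure is a graph rather than a strict tree.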

Which formalism to use?

No decision has yet been made regarding which formalism to use for the annotation of the I.Sicily corpus. The decision requires the weighing of several factors, notably:

  • The specific research questions the corpus is designed to answer;
  • The needs of the community or communities that are (likely) to be the principal beneficiaries of the treebank.

        In fact, it may be that the likely functions of and audiences for the treebank have conflicting needs. In light of this, the focus of my work to date has been to develop tools that serve as wide a range of purposes and audiences as possible. Foundational to this end is an annotator that is capable of handling all three of the formalisms that might be used.

Annotator

In order to annotate the corpus according to any one of the three dependency formalisms (UD, PROIEL, AGLDT) an annotator is needed that is able to cope with the different formalisms. No existing tool is able to do this for the three formalisms in question. Furthermore, the annotator for PROIEL no longer functions. In consequence, the Arethusa annotator for AGLDT (https://github.com/alpheios-project/arethusa) was modified to be able to annotate trees in all three formalisms. This necessitated the following changes:

  • Adding edge and node labels for the PROIEL and UD formalisms;
  • Adding the capacity to draw secondary dependencies (both in the graphical display and in the Arethusa file format).

        Figure 1 and Figure 2 show the text of ISic001320 being annotated in the modified Arethusa environment.

fig1

Fig. 1. AGLDT annotation of ISic001320 within the modified Arethusa environment.

fig2

Fig. 2. UD annotation of ISic001320 within the modified Arethusa environment.

 

        The Arethusa annotator is written in JavaScript using the AngularJS framework. It turned out not to be possible to use the existing graph-drawing engine, Dagre-D3 (https://github.com/dagrejs/dagre-d3), to draw secondary dependencies. Graphs were instead drawn using D3’s force directed graph engine (https://github.com/d3/d3-force).

        Figure 3 gives an example of a control sentence annotated with secondary dependencies according to the PROIEL formalism (modified PROIEL text).

 

fig3

Fig. 3. Control construction (Matthew 16 v. 22) annotated according to PROIEL dependencies.

Graphviz visualisation

 

While an annotator is vital for generating syntactically annotated sentences, for publication and other purposes it is also necessary to have a static means of presenting the information. Graphviz (https://graphviz.org/) is an open-source tool for graph visualisation. The PROIEL project provides visualisations of its trees using Graphviz.

Graphviz compiles DOT files to an image. DOT files specify the relationships between nodes, the node and edge labels, and styling information. A Python application was written that represents the tree information in a DOT file, which is then compiled to an image using Graphviz.[i] Figure 4 and Figure 5 provide Graphviz representations of ISic001320 according to AGLDT and UD respectively.


[i] While the code for converting the tree information into a DOT file was written by myself, the GraphvizCompiler object from dependency2tree was used to call Graphviz from Python.

 

 

fig4

Fig. 4. AGLDT annotation of ISic001320 represented using Graphviz.

 

 

fig5

Fig. 5. UD annotation of ISic001320 represented using Graphviz.

Conclusion

The foregoing has given a flavour of the considerations that go into creating a treebank of the inscriptions in the I.Sicily project. It has presented a modification of the Arethusa annotation tool that can handle the range of formalisms the project may need, as well as a tool for representing dependency trees for publication via Graphviz. Other tools are also in preparation, including a tool for converting between the three formalisms, to which a future post will be dedicated.

I.Sicily documents

ISic001320: Prag, J. R. W., Cummings, J., Chartrand, J., Vitale, V., Metcalfe, M., Llamazares, A. and Stoyanova, S. ‘I.Sicily 001320.’ Revised 2021-07-12.

References

Celano, G. G. A., Crane, G. and Almas, B. (eds) (2014) The Ancient Greek and Latin Dependency Treebank.

de Marneffe, M. C., Manning, C. D., Nivre, J. and Zeman, D. (2021) ‘Universal Dependencies’, Computational Linguistics 47(2), 255–308. doi: 10.1162/COLI_a_00402

Haug, D. T. T. and Jøhndal, M. L. (2008) ‘Creating a parallel treebank of the old Indo-European Bible translations’, in Sporleder, C. and Ribarov, K. (eds) Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008), 27–34.

Osborne, T. and Gerdes, K. (2019) ‘The status of function words in dependency grammar: A critique of Universal Dependencies (UD)’, Glossa: a journal of general linguistics 4(1), 1–28. doi: 10.5334/gjgl.537

Osborne, T. and Maxwell, D. (2015) ‘A historical overview of the status of function words in Dependency Grammar’, in Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015). Uppsala, Sweden: Uppsala University, 241–250. https://aclanthology.org/W15-2127.pdf. Last accessed 17th June 2022.

Pitts, R. (2022) ‘Corpus of the Epigraphy of the Italian Peninsula in the 1st Millennium BCE (CEIPoM)’, Journal of Open Humanities Data 8(1), 1–4. doi: 10.5334/johd.65

Prag, J. R. W. (ed.) (2022) I.Sicily: . http://sicily.classics.ox.ac.uk; doi: 10.5281/zenodo.4021517

Swanson, D. G. and Tyers, F. M. (2022) ‘A Universal Dependencies Treebank of Ancient Hebrew’, in Proceedings of the 13th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, 2353–2361. http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.252.pdf

 

 
 
The blog of the CROSSREADS project, based at the CSAD in the Faculty of Classics, University of Oxford, between 2020-2025.  We will be adding regular updates on our research and news of our project publications. 
CROSSREADS: text, materiality and multiculturalism at the crossroads of the ancient Mediterranean has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant agreement No. 885040).