

Welcome

Welcome to this online web application tool, developed at the University of Málaga for the automatic part-of-speech (POS) tagging of Middle English (ME) texts.

Introduction

POS tagging lies at the very basis of corpus studies and is one common type of corpus annotation: assigning part-of-speech tags to raw corpora allows the linguist to perform collocation studies, obtain word-frequency lists, or carry out further analyses, such as syntactic parsing.

To date, the state of the art includes just one attempt at creating such a system for the tagging of ME texts, developed at the University of Texas at Austin (Moon & Baldridge, 2007) and based on the alignment of modern parallel texts. This approach has a clear limitation, as it relies on the existence of texts that have contemporary versions. The present tool was therefore considered to be of great interest within historical corpus linguistics for the POS tagging of unique Mediaeval witnesses.

Objectives

The aim is to obtain tagged texts that can be subsequently processed by other tools, such as TexSEn, for the retrieval of linguistic information.

A broader aim is to spread knowledge about Mediaeval texts (mainly those concerning Medicine, Botany, Pharmacopoeia) and to ease the online access and consultation of manuscripts for studies of Codicology, Palaeography, English Historical Linguistics and History of Science, among others.

Functions & options

Not all functions and options are open to all users: three levels of access have been integrated into the system.

  1. Public users are welcome to test the Tag Demo for texts of up to 500 characters without registration.
  2. Registered users are allowed to use the POS tagging and multi-tagging options (max. 12 MB):
    1. With POS tagging, the most likely lemma and category (i.e. word-class) are assigned to each item after a learning-based, context-sensitive disambiguation process. Each lemma appears with its word-class attached, as in <that, d>, which avoids ambiguity when more than one tag could feasibly apply, as in <that, d> (determiner), <that, r> (pronoun), <that, c> (conjunction). The results can be exported to an Excel spreadsheet for further handling. The multi-tagged text can also be requested from this option. See the lemma subscripts, the list of categories, and a sample of POS tagging in appendices one and two.
    2. With multi-tagging, nearly every possible POS tag for each item is displayed in its respective column, including lemma, category and grammatical accidence, as stored in the lexicon database. The results can also be exported to an Excel spreadsheet. See the tag-set and a sample of multi-tagging in appendices three and four.
  3. Administrators have exclusive access to the following options or modules:
    1. Computer-assisted tagging: the most plausible tag for each item can be selected according to its context from the tags generated in the multi-tagging process. For this purpose, a drop-down menu is offered for each item to confirm the tag assigned, select the most appropriate tag from the generated set, or enter a new tag by typing a new lemma if necessary. In the same menu, the linguist can consult the meanings of most lemmas to aid disambiguation. Access to this module can be granted to researchers on request.
    2. Lexicon management: the lexicon database can be updated by programmers at any time. Lemmas and variants can be added, deleted, or edited.
    3. Editing of rules and equivalent character strings (for programmers): this option allows changes to the set of context-driven rules used in disambiguation, as well as to the table of equivalent character strings found in allomorphs.
    4. Uploading, training and evaluation modules (for programmers): the first module uploads new texts and the second trains the system with new tagged texts. The third compares the automatic tagging performed by the system with the original manual tagging of the training texts and highlights the tags that do not coincide, so that errors can be solved in new implementations. The evaluation module also displays the success rate achieved.
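
The comparison carried out by the evaluation module can be illustrated with a minimal sketch. It is not the system's own code (which is written in Java); the function name `evaluate` and the sample data are assumptions made for illustration, and only the category codes d, r and c are taken from the description above.

```python
from typing import List, Tuple

Tag = Tuple[str, str]  # (lemma, category), e.g. ("that", "d")


def evaluate(auto: List[Tag], manual: List[Tag]):
    """Compare automatic tags with manual tags, token by token.

    Returns the success rate as a percentage and the list of
    positions where the two taggings do not coincide.
    """
    if len(auto) != len(manual):
        raise ValueError("both taggings must cover the same tokens")
    mismatches = [
        (i, a, m) for i, (a, m) in enumerate(zip(auto, manual)) if a != m
    ]
    if not auto:
        return 0.0, mismatches
    rate = 100.0 * (len(auto) - len(mismatches)) / len(auto)
    return rate, mismatches


# Hypothetical sample: four occurrences of "that", one tagged wrongly.
manual = [("that", "c"), ("that", "d"), ("that", "r"), ("that", "d")]
auto = [("that", "d"), ("that", "d"), ("that", "r"), ("that", "d")]

rate, mismatches = evaluate(auto, manual)
print(rate)        # 75.0
print(mismatches)  # [(0, ('that', 'd'), ('that', 'c'))]
```

Each mismatch records the token position, the automatic tag and the manual tag, which is the information needed to highlight non-coinciding tags.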

Target text

The system is currently designed to process semi-diplomatic transcriptions of ME texts, wherein word-division (i.e. tokenization) is modernized but the original punctuation remains. The application can nevertheless cope with the tokenization of diplomatic transcriptions rather successfully. For best results, the untagged text to be processed by the system should follow these guidelines. See the text requirements in the next section.

The modernizing of the punctuation of target texts has so far been rejected for two main reasons. Firstly, the placing of punctuation symbols can be somewhat subjective, varying from one linguist to another. Secondly, the tests performed on texts with modernized punctuation (MS Hunter 503) were not favourable enough to warrant the time-consuming task of modernizing the punctuation of both the training texts and the texts to be tagged.

Ultimately, texts in which both tokenization and punctuation have been modernized should be tagged with a higher success rate once transcriptions following critical editions undergo the training process.

Tagging

Process

Three steps are involved in every tagging process:

  1. Tokenization (the text is divided into tokens, which is not a straightforward task in Middle English).
  2. Multitagging (each token is assigned possible POS tags).
  3. Disambiguation (the most plausible POS tag is chosen).
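
The three steps above can be sketched as a small pipeline. This is a simplified illustration under stated assumptions, not the tagger's implementation: tokenization is reduced to splitting on whitespace, the lexicon and the corpus counts are hypothetical toy data, and disambiguation simply picks the most frequent candidate. Only the three tags for "that" come from the description of POS tagging; the unknown word "erbe" is an invented example.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Tag = Tuple[str, str]  # (lemma, category)

# Hypothetical toy lexicon: word form -> candidate tags.
LEXICON: Dict[str, List[Tag]] = {
    "that": [("that", "d"), ("that", "r"), ("that", "c")],
}

# Hypothetical tag counts from a manually tagged corpus.
TAG_COUNTS = Counter({("that", "c"): 5, ("that", "d"): 3, ("that", "r"): 2})


def tokenize(text: str) -> List[str]:
    """Step 1: divide the text into tokens (naive whitespace split)."""
    return text.split()


def multitag(tokens: List[str]) -> List[Tuple[str, List[Tag]]]:
    """Step 2: attach every candidate tag found in the lexicon."""
    return [(token, LEXICON.get(token.lower(), [])) for token in tokens]


def disambiguate(
    multitagged: List[Tuple[str, List[Tag]]]
) -> List[Tuple[str, Optional[Tag]]]:
    """Step 3: keep the most frequent candidate (None if unknown)."""
    result = []
    for token, candidates in multitagged:
        best = max(candidates, key=lambda c: TAG_COUNTS[c]) if candidates else None
        result.append((token, best))
    return result


tagged = disambiguate(multitag(tokenize("that erbe")))
print(tagged)  # [('that', ('that', 'c')), ('erbe', None)]
```

Items missing from the lexicon are returned with no tag, which mirrors the fact that unknown allomorphs cannot be tagged from the database alone.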

Set

See appendices one and three.

Criteria

Most items in the corpus are tagged individually as single words although some compound words, collocations or phrases are tagged under the same (multi-word) lemma for the sake of clarity. This chunking is sometimes subjective and its range of application varies depending on the researcher’s preferences.

Some examples of compounds, phrases and chunks are listed below:

  1. Compound nouns (common or proper): <elena campana>, <Benuemicius Grapheus>, <febre quartane>, <franke encense>
  2. Prepositional/adverbial phrases: <in to>, <out of>, <ther to>, <for to>, <elys where>, <with owte forthe>
  3. Latin phrases and titles: <fiat emplastrum>, <de utilitate particulari>
  4. Relative forms: <the which>, <the which that>
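
One possible way of grouping such multi-word units before tagging is a greedy longest-match lookup, sketched below. The technique and the function name `chunk` are assumptions for illustration, since the text does not say how the system performs chunking. The multi-word lemmas are taken from the list above, whereas the sample sentence is invented.

```python
from typing import List

# Multi-word lemmas taken from the examples listed above.
MULTIWORD_LEMMAS = {
    ("in", "to"),
    ("out", "of"),
    ("for", "to"),
    ("the", "which"),
    ("the", "which", "that"),
}
MAX_LEN = max(len(lemma) for lemma in MULTIWORD_LEMMAS)


def chunk(tokens: List[str]) -> List[str]:
    """Join known multi-word units, preferring the longest match."""
    chunks = []
    i = 0
    while i < len(tokens):
        # Try the longest possible window first, down to two tokens.
        for size in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + size])
            if candidate in MULTIWORD_LEMMAS:
                chunks.append(" ".join(tokens[i:i + size]))
                i += size
                break
        else:
            # No multi-word unit starts here: keep the single token.
            chunks.append(tokens[i])
            i += 1
    return chunks


print(chunk("the which that goth out of the body".split()))
# ['the which that', 'goth', 'out of', 'the', 'body']
```

Trying the longest window first ensures that <the which that> is preferred over the shorter <the which> when both are possible.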

Methods

We provide a hybrid POS tagging system, combining rule-based and probabilistic methods to achieve higher success rates.
  1. The system is trained on the manually tagged transcriptions of The Málaga Corpus of Late ME Scientific Prose, a collection of medical manuscripts from the Hunter Collection housed at Glasgow University Library.
  2. In addition, a database of roots and inflections is introduced into the system. The list of roots is retrieved semi-automatically from dictionaries, whereas the list of inflections is compiled manually, as it requires linguistic knowledge of ME.
  3. Moreover, positive and negative rules are established to capture typical syntactic structures, so that units can be identified from their context. These rules rely on previous research on 3-, 4- and 5-word grams.
  4. The tagging algorithm takes the most frequent existing tags for each item in the tagged corpus and also generates all the other likely tags. It then proposes the most plausible tag, taking into account the number of occurrences within the corpus and the context surrounding the unit to be tagged; this is done by means of genetic algorithms.
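
The interplay of corpus frequency and context rules can be illustrated with a deliberately simple scoring sketch. It does not reproduce the genetic algorithms mentioned above; a plain additive score is swapped in for clarity. The counts, the rules and their weights are all hypothetical, and only the category codes d (determiner), r (pronoun) and c (conjunction) come from the site.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical frequencies of each category for the form "that".
FREQ: Dict[str, int] = {"d": 3, "r": 2, "c": 5}

# Hypothetical context rules: (previous category, candidate) -> weight.
# Positive weights favour a sequence, negative weights penalise it.
RULES: Dict[Tuple[Optional[str], str], float] = {
    ("d", "d"): -1.0,  # negative rule: determiner after determiner
    ("d", "r"): 0.5,   # positive rule: pronoun after determiner
}


def score(candidate: str, prev_category: Optional[str]) -> float:
    """Relative corpus frequency plus the weight of any matching rule."""
    total = sum(FREQ.values())
    frequency = FREQ.get(candidate, 0) / total
    return frequency + RULES.get((prev_category, candidate), 0.0)


def choose(candidates: List[str], prev_category: Optional[str]) -> str:
    """Pick the candidate category with the highest score."""
    return max(candidates, key=lambda c: score(c, prev_category))


print(choose(["d", "r", "c"], None))  # c (frequency alone)
print(choose(["d", "r", "c"], "d"))   # r (context rule outweighs frequency)
```

Without context the most frequent category wins, while a matching rule can override frequency, which is the general idea behind combining probabilistic and rule-based methods.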

Requirements

The system is an online web application that follows the client-server model and can be accessed from any computer with a web browser. It is developed in Java and JSP/Servlets, following the Model-View-Controller design pattern and a multi-tier architecture (presentation tier, logic tier, data tier). Furthermore, for the system to recognize Middle English characters such as <þ> and <ð>, the Unicode character repertoire (UTF-8 encoding), which contains such characters, is required.

The transcriptions that are to be tagged should follow certain editorial conventions to ensure best results. Semi-diplomatic editions (wherein word-division is modernized but original punctuation remains), used throughout the training process, are the ideal input. These transcriptions should always include:

  1. The expansion of abbreviations (otherwise automatic tagging proves impossible).
  2. The removal of symbols marking end-of-line-word-division and the subsequent union of the remaining halves.
  3. The manual union or separation of wrongly cut/joined items to facilitate their identification within the tagger lexicon.
  4. Localization references (page/folio, line, etc.) enclosed in “|”
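
As an illustration of the last convention, the sketch below separates localization references enclosed in "|" from the running text. The function name `split_references`, the format of the references and the sample line are assumptions made for illustration; the text does not specify how the tagger itself treats these references.

```python
import re
from typing import List, Tuple

# A reference is any run of characters enclosed between two "|" symbols.
REFERENCE = re.compile(r"\|([^|]*)\|")


def split_references(text: str) -> Tuple[List[str], str]:
    """Return the localization references and the text without them."""
    references = [ref.strip() for ref in REFERENCE.findall(text)]
    body = REFERENCE.sub(" ", text)
    # Collapse the whitespace left behind by the removed references.
    return references, " ".join(body.split())


# Hypothetical input line; Python strings are Unicode, so <þ> is preserved.
refs, body = split_references("|f. 1r, l. 1| þe erbe |l. 2| þat is hote")
print(refs)  # ['f. 1r, l. 1', 'l. 2']
print(body)  # þe erbe þat is hote
```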

A virtual keyboard can be optionally displayed by clicking on the <KB> icon, offering a whole range of Middle English characters for writing texts that are to be processed by the system.

Success rate

The tagger obtains an average accuracy of 85%, which can be considered fairly acceptable in view of the variability of Middle English morphology and syntax. However, this rate may rise to as much as 95% when similar Late Middle English scientific texts are tagged, as follows:

Treatise Success rate
92.3
90.2
92.1
94.1
93.6
89.6
93.7

As a whole, one third of the mis-taggings are produced by items not contained in the database, as not all the allomorphs have been collected. Another third is attributed to the selection of the wrong lemma when more than one is possible. The rest is mainly due to faulty tokenization. Note, however, that errors caused by the limitations of the database, which has a finite size, or by a wrong splitting of words should not be attributed to the tagger’s efficiency.
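
The seven per-treatise figures listed above can be summarised with a short calculation. The values are copied from the list and are assumed to be percentages; the variable names are illustrative only.

```python
from statistics import mean

# Success rates per treatise, as listed above (assumed to be percentages).
rates = [92.3, 90.2, 92.1, 94.1, 93.6, 89.6, 93.7]

average = round(mean(rates), 1)
print(average)                # 92.2
print(min(rates), max(rates)) # 89.6 94.1
```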

Demo

Try the tagger on this demonstration page.

Acknowledgements

We would like to acknowledge the Council of Economy, Innovation and Science of the Andalusian Autonomous Government (Dirección General de Universidades) for awarding the research grant to the Research Project of Excellence P09-HUM 4790. We would also like to thank the Spanish Ministry of Science and Innovation (Dirección General de Investigación) for the research infrastructure built up with the funding of the research project FFI2008-02336/FILO, as well as the University of Glasgow Library’s Special Collections Department for allowing the use of the digitized images of the Mediaeval Manuscripts and, last but not least, the University of Málaga for all the support received at all times.

Appendices

  1. Lemma subscripts and list of categories
  2. POS tagging sample
  3. Tag-set
  4. Multi-tagging sample


Description

This website holds all the information concerning the research project entitled Middle English POS-tagger (ref. P09-HUM 4790), funded by the Council of Economy, Innovation and Science of the Andalusian Autonomous Government.

Objectives

1) Design and implementation of a POS-Tagger of Middle English.
2) Editing of Middle English scientific treatises belonging to the Hunter Collection of Glasgow University Library.
3) Annotation of the corpus resulting from the editions so that morpho-syntactic information can be retrieved for all sorts of philological studies.
4) Spreading knowledge about mediaeval texts (mainly those concerning Medicine, Botany, Pharmacopoeia) and easing the online access and consultation of the manuscripts for studies of Codicology, Palaeography, English Historical Linguistics and History of Science, among others.

Research team

Dr Antonio Miranda García (main researcher)
Dr Inmaculada de las Peñas Cabrera (researcher)
Ms Melania Evelyn Sánchez Reed (support technician - philological assistant)
Mr Diego Jesús Subires Mancera (support technician - computing assistant)

Acknowledgements

We would like to acknowledge the Council of Economy, Innovation and Science of the Andalusian Autonomous Government (Dirección General de Universidades) for awarding the research grant to the Research Project of Excellence (ref. P09-HUM 4790). We would also like to thank the Ministry of Science and Innovation (Dirección General de Investigación) of Spain for the research infrastructure built up with the funding of the research project FFI2008-02336/FILO, as well as the University of Glasgow Library's Special Collections Department for allowing the use of the digitized images of the Mediaeval Manuscripts and, last but not least, the University of Málaga for all the support received at all times.