Apolda - User Documentation

1 Overview

Apolda (Automated Processing of Ontologies with Lexical Denotations for Annotation) is a plugin (processing resource) for GATE (http://gate.ac.uk/). The Apolda processing resource (PR) annotates a document like a gazetteer, but takes its terms from an (OWL) ontology rather than from a list. Apolda searches the document for the values of OWL annotation properties (owl:AnnotationProperty) of the classes and instances of the ontology. Matches are annotated with the name of the class and the URL of the ontology, as in the OntoGazetteer. The initialisation parameters specify which annotation property is used for annotating the documents. This can be either a predefined property, such as label or comment, or a user-defined one. The example ontology provided defines the properties prefTextualRepresentation and altTexualRepresentation, which are intended for use with Apolda.

2 Compatibility

For proper usage Apolda requires GATE version 4.0, since previous versions do not fully support ontologies with annotation properties. The Apolda PR is based on the Owlim Ontology interface of the Ontology Tools package. Releases 0.2 and 0.3 are compatible with GATE 3.1 and with snapshots of GATE 4.0 from before March 2007. However, GATE 3.1 does not fully support annotation properties, so Apolda can only be used there with restricted functionality. Releases 1.0, 1.1 and 1.2 require (at least) GATE 4.0. In GATE version 6 the interfaces for ontologies were changed, so GATE 6 requires Apolda release 2. Apolda 2.0 has been tested with GATE 6.1.

3 Installation

Unpack the Apolda archive.
In the plugin management console (click the <?>-icon), add the directory "Apolda" to the list of plugins.

4 Usage

Apolda can effectively do the same as the OntoGazetteer. However, in some cases Apolda will be much more convenient. With the OntoGazetteer one has to make lists of terms and assign each list to a concept of the ontology. Apolda uses the ontology directly. This presupposes that the textual representation of the concepts is part of the ontology. For each representation of a concept that is found, Apolda adds an annotation Mention (Concept before release 1.0) to the annotation set. This annotation has three features: identifier, class and ontology. The identifier is the name of the matched concept. The feature class gives the name of the class that was matched. The feature ontology has the URL of the ontology as its value. If the matched concept is a class, class and identifier have the same value. If an individual was found, class has the class this individual belongs to as its value. If the individual belongs to more than one class, only one class is used for annotation.
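
The three features can be pictured as a small map. The following is a minimal sketch in Python, not Apolda's actual code: the function mention_features, the example concept names and the URL http://example.org/art.owl are assumptions made for illustration. How Apolda chooses among several classes of an individual is not documented here; the sketch simply takes the first one.

```python
# Illustrative sketch only: mention_features and all example values are
# hypothetical and not part of Apolda or GATE.

def mention_features(matched_name, is_class, classes_of_individual, ontology_url):
    """Build the three features of a Mention annotation."""
    if is_class:
        # A matched class: class and identifier have the same value.
        class_name = matched_name
    else:
        # A matched individual: class is a class the individual belongs to.
        # Only one class is used; taking the first is an assumption.
        class_name = classes_of_individual[0]
    return {
        "identifier": matched_name,
        "class": class_name,
        "ontology": ontology_url,
    }


painter = mention_features("Painter", True, [], "http://example.org/art.owl")
rembrandt = mention_features(
    "Rembrandt", False, ["Painter", "Person"], "http://example.org/art.owl"
)
```

Here painter carries the same value in identifier and class, while rembrandt has identifier "Rembrandt" and class "Painter".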

Textual representations are language dependent. This can be expressed with the language attribute of annotation properties. In the present implementation of GATE, these language attributes in OWL are ignored, so this feature currently cannot be used. Finally, note that textual representations need not be unique: several concepts might have the same representation. In such cases Apolda adds an annotation for each of these concepts to the annotation set. Subsequent processing resources might try to resolve the resulting ambiguities.
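
The effect of non-unique representations can be sketched as a lookup table that maps each representation to a list of concepts. This is a hedged illustration in Python; build_index and the concept names are hypothetical and do not describe Apolda's internal data structures.

```python
from collections import defaultdict

# Illustrative sketch only: build_index and the concept names are hypothetical.

def build_index(representations):
    """Map each textual representation to all concepts that carry it."""
    index = defaultdict(list)
    for concept, text in representations:
        index[text].append(concept)
    return index


index = build_index([
    ("FinancialInstitution", "bank"),
    ("RiverBank", "bank"),
    ("River", "river"),
])

# One annotation per concept would be produced for a match on "bank".
ambiguous = index["bank"]
```

A later processing resource would then have to choose between the entries in ambiguous.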

Apolda assumes that there is an OWL annotation property that holds the textual representation of concepts. Which property this is can be specified at initialisation. It is possible to specify two annotation properties. If more properties are needed, the ontology should be slightly modified, e.g. by introducing a common super property for all types of annotations, using rdfs:subPropertyOf.

4.1 Dependencies and Position in the pipeline

Unlike most gazetteers, Apolda works on tokens and not on the raw document. Thus you need to run a tokenizer first. If matching with lemmas is to be used, the lemmatizer must be run before Apolda. If you use the ANNIE tokeniser, you should consider using the alternate rule set (ANNIE/resources/tokeniser/AlternateTokeniser.rules) for better recognition of hyphenated words.

4.2 Parameters

There are four initialisation parameters:

ontology
prefRepresentation
altRepresentation
language

Only the first one is obligatory.

The parameter ontology points to the ontology that is used for annotation. The parameters prefRepresentation and altRepresentation are the OWL annotation properties (owl:AnnotationProperty) that are supposed to give the textual representation for terms of the ontology. Apolda offers two parameters since many ontologies are designed in that way. Apolda, however, does not differentiate between the preferred and alternative representations. If neither of these two parameters is set, the names of the classes and instances are used for annotation. The language parameter optionally specifies the language for Apolda. If a language is specified, annotation properties with a language attribute different from the specified language will be ignored.
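
The interplay of these parameters can be sketched as follows. This is a simplified Python illustration, not Apolda's implementation: collect_terms and the in-memory concept format are assumptions, and the sketch reads "a language attribute different from the specified language" as meaning that values without any language attribute are kept.

```python
# Illustrative sketch only: collect_terms and the data layout are hypothetical.

def collect_terms(concepts, pref=None, alt=None, language=None):
    """Return (text, concept name) pairs that would be searched for."""
    properties = [p for p in (pref, alt) if p]
    terms = []
    for concept in concepts:
        if not properties:
            # Neither property is set: fall back to the concept name.
            terms.append((concept["name"], concept["name"]))
            continue
        for prop, text, lang in concept["annotations"]:
            if prop not in properties:
                continue  # not one of the configured annotation properties
            if language and lang and lang != language:
                continue  # tagged with a different language
            # Preferred and alternative values are treated alike.
            terms.append((text, concept["name"]))
    return terms


concepts = [{
    "name": "Painter",
    "annotations": [
        ("prefTextualRepresentation", "painter", "en"),
        ("prefTextualRepresentation", "schilder", "nl"),
        ("prefTextualRepresentation", "artist", None),
        ("comment", "a person who paints", "en"),
    ],
}]

by_name = collect_terms(concepts)
english = collect_terms(concepts, pref="prefTextualRepresentation", language="en")
```

With no property set, by_name contains only the concept name; with the property and language "en" set, english contains "painter" and the untagged "artist", but not "schilder".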

Besides the standard document, input annotation set and output annotation set parameters, there is one runtime parameter, namely lemmaFeature. This parameter specifies the name of the feature under which a lemmatiser (or stemmer) has stored the lemma (or stem) of a token. If this parameter is set, Apolda will look for this feature in the annotations of a token and will produce a match if the lemma corresponds to a textual representation from the ontology. This is extremely useful for languages with rich morphology. Instead of specifying all possible variants as textual representations in the ontology, only the lemma (or stem) has to be specified.
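
For a single token, the effect of lemmaFeature can be sketched like this. The Python below is a hedged illustration: token_matches, the dictionary layout of a token and the feature name "lemma" are assumptions and do not reflect GATE's annotation API.

```python
# Illustrative sketch only: token_matches and the token layout are hypothetical.

def token_matches(token, representation, lemma_feature=None):
    """Check one token against a one-word textual representation."""
    if token["string"] == representation:
        return True  # the token string itself matches
    if lemma_feature is not None:
        # Look up the lemma (or stem) stored by an earlier lemmatiser.
        return token["features"].get(lemma_feature) == representation
    return False


token = {"string": "paintings", "features": {"lemma": "painting"}}

without_lemma = token_matches(token, "painting")
with_lemma = token_matches(token, "painting", lemma_feature="lemma")
```

Without the parameter the plural form is missed; with it, the single entry "painting" in the ontology covers both forms.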

4.3 Matching

If two terms match overlapping regions, e.g. Rijn and Rembrandt van Rijn, and the borders of the shorter match lie within those of the longer match, only the longest match is added to the annotation set. In all other cases (including matches of the same length!), both annotations are added.
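
The rule can be sketched as a filter over candidate matches given as character offsets. This is an illustrative Python sketch under the assumption that a match is a (start, end, concept) triple; filter_matches is a hypothetical name and the code is not taken from Apolda.

```python
# Illustrative sketch only: filter_matches and the triple format are hypothetical.

def filter_matches(matches):
    """Drop every match that lies within a strictly longer match."""
    kept = []
    for start, end, concept in matches:
        covered = any(
            o_start <= start and end <= o_end
            and (o_end - o_start) > (end - start)
            for o_start, o_end, _ in matches
        )
        if not covered:
            kept.append((start, end, concept))
    return kept


candidates = [
    (0, 18, "RembrandtVanRijn"),  # "Rembrandt van Rijn"
    (14, 18, "Rijn"),             # "Rijn", inside the longer match
    (20, 24, "ConceptA"),         # two matches of the same length
    (20, 24, "ConceptB"),
    (30, 40, "ConceptC"),         # partial overlap, neither contains the other
    (35, 45, "ConceptD"),
]

result = filter_matches(candidates)
```

Only the match for Rijn is removed; the equal-length pair and the partially overlapping pair are all kept.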

If matching with lemmas is used, arbitrary sequences of lemmas and token strings can match. If only literal matches should be found, the textual representation can be written between quotation marks.
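
For a multi-word representation this behaviour can be sketched as below. The Python is a rough illustration only: sequence_matches, the token layout, the feature name "lemma" and the assumption that the words of a representation are separated by spaces and correspond one-to-one to tokens are all hypothetical.

```python
# Illustrative sketch only: sequence_matches and the token layout are hypothetical.

def sequence_matches(tokens, representation, lemma_feature="lemma"):
    """Match a token sequence against a (possibly quoted) representation."""
    literal = (
        len(representation) >= 2
        and representation.startswith('"')
        and representation.endswith('"')
    )
    words = (representation[1:-1] if literal else representation).split()
    if len(words) != len(tokens):
        return False
    for token, word in zip(tokens, words):
        if token["string"] == word:
            continue  # the token string matches
        if not literal and token.get(lemma_feature) == word:
            continue  # the lemma matches (not allowed for quoted entries)
        return False
    return True


tokens = [
    {"string": "painted", "lemma": "paint"},
    {"string": "portraits", "lemma": "portrait"},
]

mixed = sequence_matches(tokens, "painted portrait")
quoted_lemma = sequence_matches(tokens, '"painted portrait"')
quoted_exact = sequence_matches(tokens, '"painted portraits"')
```

The unquoted entry matches through a mix of token string and lemma, while a quoted entry matches only when every token string is identical.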

File translated from TeX by TtHgold, version 4.00.
On 15 Nov 2011, 22:11.