2023
19 - 21 April 2023
Symposium: HPI Research Symposium 2023
Location: Potsdam, Germany | Hasso Plattner Institut, University of Potsdam
Presentation: Ontology Transformation to a Language-Specific Viewpoint
17 - 20 July 2023
Towards an Ontology of Viewpoints
Author: Frances Gillis-Webber
Conference: Formal Ontology in Information Systems (FOIS) Conference 2023
Location: Sherbrooke, Quebec, Canada | University of Sherbrooke
Abstract
In a multilingual domain ontology developed using the labels approach, where each ontological entity is labelled with a language-tagged string, two scenarios result: (1) the ontology is 'language-independent', where there is an equal number of labels per natural language, or (2) the ontology is a 'primary-language' ontology, where one natural language takes precedence over the other languages used. In a multilingual ontology, it is assumed there is full equivalence between the different languages; however, each natural language, as an embodiment of a culture, differs in how it interprets and organises the world. The result is that although the viewpoint expressed by the multilingual domain ontology is thought to be universal, one natural language is very often privileged, typically English.
Using the culture-bound concepts of 'dowry' and 'bride price', we demonstrate the differences in perspective when considered for different languages and sub-domains. We propose an ontology, Model of Multiple Viewpoints (MULTI), where both language and culture are considered together, and language is classified as a social norm of a community. MULTI is formalised in OWL and aligned to DOLCE+DnS Ultralite, a foundational ontology suitable for modelling contexts. The evaluation of MULTI is done against the identified use cases. The expected result is that an ontology can be annotated with its viewpoint, thus making the viewpoint of the ontology explicit.
Concept Mismatches Between a Source and Target Natural Language
Author: Frances Gillis-Webber
Workshop: 2nd Workshop on Modular Knowledge 2023
Location: Co-located with FOIS
Abstract
Numerous mismatches have been identified when aligning heterogeneous resources. In this paper, the focus is on the mismatches for a concept between a source and target viewpoint, where each viewpoint is natural language-specific. A concept is first defined as a 6-tuple, comprising its viewpoint, the lexical realisation of the concept, the axiomatisation thereof, as well as asserted individuals. The same concept is then defined as another tuple, this time for a target viewpoint, with each element therein compared to the original. A total of 22 mismatches and correspondences have been identified, with three pertaining to lexical realisations, twelve pertaining to the axiomatisation of a concept, and seven pertaining to individuals and assertions.
12 - 15 September 2023
Refinement of the Classification of Translation Inequivalences - Extension of the vartrans Module in OntoLex-Lemon
Author: Frances Gillis-Webber
Conference: LDK 2023 – 4th Conference on Language, Data and Knowledge
Location: Vienna, Austria | University of Vienna
Abstract
Twenty language examples were identified for translation between a source and target language; however, only eight of these examples can be classified by TRCAT. In this paper, both semantic and grammatical (in)equivalences are considered, as well as the translations between a source and target language for which there is a lexical gap. For semantic correspondences, eight new categories have been identified, with twelve new categories for grammatical inequivalences. The vartrans module was then extended to include these new categories, soft-reusing two of the categories from TRCAT, with classes and object properties added for grammar rules and language features. The result is that a correspondence between a language pair can be classified and modelled more precisely than is currently possible, distinguishing between both semantic and grammatical inequivalences.
Created: 16 May 2018 | Updated: 19 July 2023
2020
22 - 23 June 2020
Towards an ontology based on Hallig-Wartburg’s Begriffssystem for Historical Linguistic Linked Data
Authors: Sabine Tittel, Frances Gillis-Webber and Alessandro A. Nannini
Conference: 7th Workshop on Linked Data in Linguistics: Building Tools and Infrastructures (LDL 2020), co-located with LREC 2020. Presented online.
Abstract
Empowering end users to search for historical linguistic content with a performance that far exceeds the research functions offered by websites of, e.g., historical dictionaries is undoubtedly a major advantage of (Linguistic) Linked Open Data ([L]LOD).
An important aim of lexicography is to enable a language-independent, onomasiological approach, and the modelling of linguistic resources following the LOD paradigm facilitates the semantic mapping to ontologies making this approach possible.
Hallig-Wartburg's Begriffssystem (HW) is a well-known extra-linguistic conceptual system used as an onomasiological framework by many historical lexicographical and lexicological works.
Published in 1952, HW has meanwhile been digitised. With proprietary XML data as the starting point, our goal is the transformation of HW into Linked Open Data in order to facilitate its use by linguistic resources modelled as LOD.
In this paper, we describe the particularities of the HW conceptual model and the method of converting HW: We discuss two approaches, (i) the representation of HW in RDF using SKOS, the SKOS thesaurus extension, and XKOS, and (ii) the creation of a lightweight ontology expressed in OWL, based on the RDF/SKOS model.
The outcome is illustrated with use cases of medieval Gascon and Italian.
11 - 16 May 2020
A Framework for Shared Agreement of Language Tags beyond ISO 639
Authors: Frances Gillis-Webber and Sabine Tittel
Conference: 12th edition of the Language Resources and Evaluation Conference (LREC 2020)
Abstract
The identification and annotation of languages in an unambiguous and standardized way is essential for the description of linguistic data.
It is the prerequisite for machine-based interpretation, aggregation, and re-use of the data with respect to different languages. This makes it a key aspect especially for Linked Data and the multilingual Semantic Web.
The standard for language tags is defined by IETF’s BCP 47, and ISO 639 provides the language codes that are the tags' main constituents.
However, for the identification of lesser-known languages, endangered languages, regional varieties or historical stages of a language, the ISO 639 codes are insufficient.
Also, the optional language sub-tags compliant with BCP 47 do not offer a possibility fine-grained enough to represent linguistic variation.
We propose a versatile pattern that extends the BCP 47 sub-tag privateuse and is, thus, able to overcome the limits of BCP 47 and ISO 639.
Sufficient coverage of the pattern is demonstrated with the use case of linguistic Linked Data of the endangered Gascon language.
We show how to use a URI shortcode for the extended sub-tag, making the length compliant with BCP 47.
We achieve this with a web application and API developed to encode and decode the language tag.
7 February 2020
Ontology-Based Data Access of Animals with Ontop - A Tutorial
Authors: Frances Gillis-Webber and C. Maria Keet
View: https://people.cs.uct.ac.za/~mkeet/OEbook/OBDAtutElephants.pdf
View: An Introduction to Ontology Engineering by C. Maria Keet (Textbook version 1.5, Appendix A.2, Page 256)
Abstract
The aim of this tutorial is to demonstrate the concept of Ontology-Based Data Access (OBDA), where one queries the data residing in a database through the ontology. We use the Ontop framework for this, which is compatible with the Protégé ontology development environment (ODE), and MySQL is used as the relational database.
2019
13 December 2019
MPhil Graduation
Location: Sarah Baartman Hall, University of Cape Town
Dissertation: The Construction of a Linguistic Linked Data Framework for Bilingual Lexicographic Resources
Supervisors: Richard Higgs and Connie Bitso
Department: Department of Knowledge and Information Stewardship (previously Library and Information Studies Centre), University of Cape Town
View: http://hdl.handle.net/11427/31568 | Download: DISSERTATION
Abstract
Little-known lexicographic resources can be of tremendous value to users once digitised. By extending the digitisation efforts for a lexicographic resource, converting the human-readable digital object to a state that is also machine-readable, structured data can be created that is semantically interoperable, thereby enabling the lexicographic resource to access, and be accessed by, other semantically interoperable resources. The purpose of this study is to formulate a process for converting a lexicographic resource in print form to a machine-readable bilingual lexicographic resource applying linguistic linked data principles, using the English-Xhosa Dictionary for Nurses as a case study. This is accomplished by creating a linked data framework, in which data are expressed in the form of RDF triples and URIs, in a manner which allows for extensibility to a multilingual resource. Click languages with characters not typically represented by the Roman alphabet are also considered. The purpose of this linked data framework is to define each lexical entry as “historically dynamic”, instead of “ontologically static” (Rafferty, 2016:5). For a framework which has instances in constant evolution, focus is thus given to the management of provenance and linked data generation thereof. The output is an implementation framework which provides methodological guidelines for similar language resources in the interdisciplinary field of Library and Information Science.
1 - 3 October 2019
Identification of Languages in Linked Open Data: a Case Study of Linguistic Data of French Combining a Diatopic with a Diachronic Perspective
Authors: Sabine Tittel and Frances Gillis-Webber
Conference: eLex 2019: Smart Lexicography
View: Conference Proceedings | Download: PAPER
Abstract
When modelling linguistic resources as Linked Data, the identification of languages using language tags and language codes is a mandatory task. IETF’s BCP 47 defines the standard for tags, and ISO 639 provides the codes. However, these codes are insufficient for the identification of diatopic variation within a language and, also, for different historical language stages. This weakness hampers the accurate identification of data, which in turn leads to ambiguity when extending, aggregating and re-using this data—a key notion of Linked Open Data and the Semantic Web. We show the limitations of language identification with a case study of French linguistic data from both a diachronic and a diatopic perspective. Our exemplary data derives from dictionaries of Old French, Middle French, and of Modern French dialects, and from a Modern French linguistic atlas. For each exemplar, we propose a solution using the privateuse sub-tag of BCP 47’s language tag, staying within the boundaries of existing standards. Using a predefined pattern for the privateuse sub-tag, the solutions enable a dialect, a patois, in combination with a time period, to be defined and identified. This can lead to shared agreement of language tags that will increase interoperability within the context of Linked Data.
23 June 2019
A Model for Language Annotations on the Web
Authors: Frances Gillis-Webber, Sabine Tittel and C. Maria Keet
Conference: 1st Iberoamerican Knowledge Graphs and Semantic Web Conference (KGSWC 2019)
View: https://doi.org/10.1007/978-3-030-21395-4_1 (Communications in Computer and Information Science 2019, 1029) | SUPPLEMENTARY MATERIAL
Abstract
Several annotation models have been proposed to enable a multilingual Semantic Web. Such models home in on the word and its morphology and assume the language tag and URI come from external resources. These resources, such as ISO 639 and Glottolog, have limited coverage of the world’s languages and have a very limited thesaurus-like structure at best, which hampers language annotation, hence constraining research in Digital Humanities and other fields. To resolve this ‘outsourced’ task of the current models, we developed a model for representing information about languages, the Model for Language Annotation (MoLA), such that basic language information can be recorded consistently and therewith queried and analyzed as well. This includes the various types of languages, families, and the relations among them. MoLA is formalized in OWL so that it can integrate with Linguistic Linked Data resources. Sufficient coverage of MoLA is demonstrated with the use case of French.
20 - 23 May 2019
The Shortcomings of Language Tags for Linked Data when Modeling Lesser-Known Languages
Authors: Frances Gillis-Webber and Sabine Tittel
Conference: 2nd Conference on Language, Data and Knowledge (LDK 2019)
Location: Leipzig, Germany | University of Leipzig in the Assembly Hall and University Church of St. Paul
View: https://dx.doi.org/10.4230/OASIcs.LDK.2019.4 (OASIcs 2019, 70) | Download: PRESENTATION
Abstract
In recent years, the modeling of data from linguistic resources with Resource Description Framework (RDF), following the Linked Data paradigm and using the OntoLex-Lemon vocabulary, has become a prevalent method to create datasets for a multilingual web of data. An important aspect of data modeling is the use of language tags to mark lexicons, lexemes, word senses, etc. of a linguistic dataset. However, attempts to model data from lesser-known languages show significant shortcomings with the authoritative list of language codes by ISO 639: for many lesser-known languages spoken by minorities and also for historical stages of languages, language codes, the basis of language tags, are simply not available. This paper discusses these shortcomings based on the examples of three such languages, i.e., two varieties of click languages of Southern Africa together with Old French, and suggests solutions for the issues identified.
2018
6 November 2018
Conversion of the English-Xhosa Dictionary for Nurses to a Linguistic Linked Data Framework
Author: Frances Gillis-Webber
Journal: Special Issue of Information: Towards the Multilingual Web of Data
View: https://doi.org/10.3390/info9110274 (Information 2018, 9(11), 274)
Abstract
The English-Xhosa Dictionary for Nurses (EXDN) is a bilingual, unidirectional printed dictionary in the public domain, with English and isiXhosa as the language pair. By extending the digitisation efforts of EXDN from a human-readable digital object to a machine-readable state, using Resource Description Framework (RDF) as the data model, semantically interoperable structured data can be created, thus enabling EXDN’s data to be reused, aggregated and integrated with other language resources, where it can serve as a potential aid in the development of future language resources for isiXhosa, an under-resourced language in South Africa. The methodological guidelines for the construction of a Linguistic Linked Data framework (LLDF) for a lexicographic resource, as applied to EXDN, are described, where an LLDF can be defined as a framework: (1) which describes data in RDF, (2) using a model designed for the representation of linguistic information, (3) which adheres to Linked Data principles, and (4) which supports versioning, allowing for change. The result is a bidirectional lexicographic resource, previously bounded and static, now unbounded and evolving, with the ability to extend to multilingualism.
2 - 6 July 2018
Converting the English-Xhosa Dictionary for Nurses to Linguistic Linked Data
Author: Frances Gillis-Webber
Conference: International Congress of Linguists (ICL20)
Location: Cape Town, South Africa | Cape Town International Convention Centre (CTICC)
7 - 12 May 2018
Managing Provenance and Versioning for an (Evolving) Dictionary in Linked Data Format
Author: Frances Gillis-Webber
Conference: 6th Workshop on Linked Data in Linguistics: Towards Linguistic Data Science (LDL-2018), 11th edition of the Language Resources and Evaluation Conference (LREC 2018)
Location: Miyazaki, Japan | Phoenix Seagaia Conference Center
Download: PAPER | PRESENTATION
Abstract
The English-Xhosa Dictionary for Nurses is a unidirectional dictionary with English and isiXhosa as the language pair, published in 1935 and recently converted to Linguistic Linked Data. Using the OntoLex-Lemon model, an ontological framework was created, where the purpose was to present each lexical entry as “historically dynamic” instead of “ontologically static” (Veltman, 2006:6, cited in Rafferty, 2016:5); therefore, the provenance information and generation of linked data for an ontological framework with instances constantly evolving were given particular attention. The output is a framework which provides guidelines for similar applications regarding URI patterns, provenance, versioning, and the generation of RDF data.
2017
2016
3 - 4 March 2016
Workshop: OWL & Protégé Tutorial
Location: Manchester, United Kingdom | University of Manchester
Funding: £500 travel grant kindly provided by University of Manchester