Vocabulary and Repository Metadata

Vocabulary and Metadata Repository

Introduction

One major issue in the transition from traditional relational databases to semantically rich data representations, and RDF in particular, is finding the appropriate vocabularies whose terms will represent the various entities, subjects and operations of an organization. In the Semantic Web, a vocabulary is a collection of terms that can be used to present the end user with different logical schemas, containing entities (classes) as well as attributes of these entities and connections between two entities (properties). A complete, up-to-date vocabulary repository can be very helpful in the data transformation process, and it also helps users understand the underlying structure of linked data.

Another major issue with Semantic Web vocabulary repositories, and with vocabulary repositories in general, is overcoming information overload: users need help identifying the best terms to describe real-world entities, judged by semantics, vocabulary popularity and existing stated knowledge in major public databases.

The LinDA Vocabulary and Metadata Repository contains a collection of vocabularies, covering various thematic areas. The structure of the LinDA vocabulary repository presents a set of advantages regarding its usability and scalability in a business environment:

  • The vocabulary repository includes by default vocabularies that cover many areas of expertise and economic activities, such as business to business relations (B2B), trade, various industries and renewable energy.
  • The ability to search inside the vocabulary contents enables non-technical users to find references in specific actions and entities.
  • Vocabulary visualization offers a quick overview of the vocabulary contents and its relations.
  • The Suggest API automates the interconnection between real-world objects and vocabulary terms.
  • Vocabulary information is automatically updated by a central repository, but at the same time new, specialized vocabularies can be created at local repositories.

Requirements and limitations

The Vocabulary and Metadata Repository was designed and implemented using existing components and modules for specific parts of the architecture, so that development effort could focus on the core functionality of the repository. As a result, specific packages are required for its installation and successful deployment. These packages are all well established, thoroughly documented, supported by strong communities and actively maintained.

Django framework

Django is a high-level Python web framework that encourages rapid development and clean, pragmatic design. It is a free and open source web application framework, written in Python, which follows a model-template-view variant of the model-view-controller architectural pattern. Django's primary goal is to ease the creation of complex, database-driven websites. Django emphasizes reusability and "pluggability" of components, rapid development, and the principle of "don't repeat yourself".

Elasticsearch

Elasticsearch is a distributed, RESTful, free and open source search server. It is built on Apache Lucene and aims to make all of Lucene's features available through its JSON and Java APIs. It supports faceting and percolating; the latter can be used to send notifications when new documents match registered queries.

Microsoft Translator API

Microsoft Translator API is a cloud-based automatic translation (aka machine translation) service supporting multiple languages that reach more than 95% of world gross domestic product (GDP). Translator can be used to build applications, websites, and tools, or any solution requiring multilanguage support. Built for business, Microsoft Translator is a proven, customizable, and scalable solution for automatic translation.

jQuery

jQuery is a fast, small, and feature-rich JavaScript library. It makes things like HTML document traversal and manipulation, event handling, animation, and Ajax simpler with an easy-to-use API that works across a multitude of browsers.

Relationship to the LinDA Workbench

The Vocabulary and Metadata Repository is integrated with the LinDA Workbench, but it does not require it: although installing the two together is recommended, the repository can also be installed as a standalone tool. Several measures have been taken to achieve this independence:

  • The Vocabulary Repository is a standalone Django application that can be included in any Django project.
  • There are no hardcoded assumptions about the relationship between the repository and the workbench, and the workbench's location can be adjusted through dynamic configuration.
  • The communication between the Vocabulary Repository and other tools (particularly the suggest feature it offers to the Transformation Engine tool) is not implemented in the workbench. It exists as a web-based API offered by the repository, which can operate between different servers and instances of these tools.

Limitations

The repository can only accept and interpret specific RDF formats as input, namely RDF/XML, N-Triples and Turtle. Vocabularies with documents in other RDF formats will be stored in the repository, but information about classes and properties may not be extracted automatically and the vocabulary visualization could also fail.

Also, the visualization of complex vocabularies (hundreds of classes and properties) might overload the user with information, making it useless, or might even be difficult for the client's browser to render.
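As a rough illustration of the format restriction, the sketch below checks an uploaded file name against the supported input formats before any extraction is attempted. The extension mapping and the function name are assumptions made for this example and are not taken from the repository code; the format names on the right are the ones rdflib uses for its RDF/XML, N-Triples and Turtle parsers.

```python
import os

# Assumed mapping from file extension to rdflib parser name.
SUPPORTED_FORMATS = {
    ".rdf": "xml",
    ".owl": "xml",
    ".xml": "xml",
    ".nt": "nt",
    ".ttl": "turtle",
}


def guess_rdf_format(filename):
    """Return the parser name for a supported file, or None otherwise."""
    extension = os.path.splitext(filename)[1].lower()
    return SUPPORTED_FORMATS.get(extension)
```

A caller could store the document in either case and only run class and property extraction when the result is not None.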

Installation

The Vocabulary and Metadata Repository must be installed on a (physical or virtual) machine that can operate as a business server. The repository may or may not be installed on the same machine as other LinDA tools or the Workbench. The Python 2.7 interpreter must be installed to deploy the Django project containing the repository. The pip tool (the Python package installer) will also typically be needed to install the project requirements. Finally, the project needs a read and write connection to a commonly used database system such as MySQL, PostgreSQL, SQLite or Oracle.

In order to install the Vocabulary Repository, access to the linda-app Django application is required. This application can be then integrated with any Django project running in an environment that satisfies the following library requirements:


| Library | Description | Version | Purpose |
|---|---|---|---|
| django | The Django web application framework. | 1.7 | Develop a web application to offer access to the repository through all major web browsers. |
| django-allauth | Django authentication. | Latest | Automatically include user authentication and authorization. |
| django-haystack | Django search. | Latest | Expose database entities to Elasticsearch. |
| rdflib & rdfextras | Libraries to parse RDF files. | Latest | Read and parse the RDF files describing each vocabulary. |
| xmltodict | Library to transform between XML-structured data and native Python dictionaries. | Latest | Read and write XML data. |
| microsofttranslator | Python library to access the Microsoft Translator web API. | Latest | Enable internationalization of the vocabulary repository and auto-detection of the user input language. |
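The libraries listed above could be wired into a Django 1.7 project roughly as in the following settings fragment. This is only a sketch under stated assumptions: the module name "linda_app", the Elasticsearch URL and the index name are placeholders chosen for illustration, and the actual project settings may differ.

```python
# settings.py fragment (illustrative sketch, not the project's real settings)

INSTALLED_APPS = (
    "django.contrib.admin",
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "django.contrib.messages",
    "django.contrib.sites",   # required by django-allauth
    "allauth",
    "allauth.account",
    "haystack",               # django-haystack search integration
    "linda_app",              # assumed module name of the linda-app application
)

SITE_ID = 1  # used by django.contrib.sites

HAYSTACK_CONNECTIONS = {
    "default": {
        # Elasticsearch backend shipped with django-haystack
        "ENGINE": "haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine",
        "URL": "http://127.0.0.1:9200/",   # assumed local Elasticsearch server
        "INDEX_NAME": "vocabularies",      # assumed index name
    },
}
```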


Immediately after the application is installed and deployed, the vocabulary repository structure exists but the vocabulary database itself is empty. When the project administrator first opens the vocabularies page, a request for updates from the central vocabulary repository is triggered. All vocabularies are downloaded from the central repository server to the local (business) server and subsequently installed. Upon vocabulary installation, the vocabulary graph is created, and class and property entities are stored in the vocabulary database and indexed for search. Vocabulary updates run automatically, on behalf of the repository administrator, once a user-specified period of time has elapsed or the server has been restarted; in both cases the update is triggered when the administrator next visits the vocabularies page.
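The update rule described above can be summarized in a few lines. The following is a minimal sketch, assuming the repository keeps the time of the last update and knows whether the server has restarted since then; the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def update_due(last_update, now, interval, restarted_since_update=False):
    """Decide whether the administrator's page visit should trigger an update."""
    if last_update is None:
        # Fresh installation: the vocabulary database is still empty.
        return True
    if restarted_since_update:
        return True
    return now - last_update >= interval


# Example: a weekly update period, checked nine days after the last update.
due = update_due(datetime(2015, 1, 1), datetime(2015, 1, 10), timedelta(days=7))
```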

Usage

The vocabulary repository presents the end user with various ontologies and supports the transformation of traditional data formats to linked data by suggesting classes and properties. Both direct user activity in the vocabulary interface and the invocation of Suggest API methods count as usage of the repository. Usage can be grouped into the following categories:

  • Navigation: Actions that let the user search for vocabularies and entities inside them, read vocabulary descriptions, download the vocabulary RDF documents in various formats and get access to vocabulary visualizations and best usage practices.
  • Usage feedback: Evaluation of vocabularies, discussions and commenting, that expose the advantages and disadvantages of choosing a vocabulary’s terminology to create transformation plans and guide the user base of an enterprise to vocabularies best representing its structure, operations and needs.
  • Repository enrichment: Authenticated users may create and upload new vocabularies containing ontologies that do not exist in the initial repository or are specific to the enterprise. Vocabulary owners may update their vocabularies at any time. The repository automatically extracts metadata contained in the vocabulary RDF document, such as classes and properties, as well as their relations.
  • Term suggestion: Web API methods that let other LinDA tools, including but not necessarily limited to the Transformation Engine, pick the most prevalent vocabulary terms that describe real world objects and relationships.

Navigation

Because vocabulary indexes are large, assisting term search is crucial for the usability and overall success of a vocabulary repository, so that users can quickly reach the vocabularies they need. When users navigate to the “Vocabularies” page, they are shown a catalogue of all repository entities, which they can choose to view by vocabularies, by classes or by properties:


As the vocabularies page is an object list, only a teaser of each element is shown. A teaser consists of the name or label of the entity, a short description so that users can quickly decide whether it interests them, and some basic links to more detailed information about the entity.

By selecting a vocabulary, users get access to a page with more details about the selected vocabulary, which also allows them to perform actions on it, according to their role in the website.


The vocabulary page contains:

  • Some basic information about the vocabulary, like its namespace URI, the prefix that is commonly used for it, a link to the website where it is defined (like a W3C recommendation document or a website dedicated to the vocabulary) and a short description of its purpose and contents.
  • Links to the source vocabulary document, both in its original version and as automatically created RDF in all supported formats (RDF/XML, N3 and N-Triples), as well as a link to an automatically created vocabulary visualization.
  • Metadata about the vocabulary owner and when it was created.
  • Information about classes and properties that it defines.
  • Feedback controls, including rate and comment capabilities for authenticated users.
  • A usage example that indicates how the major entities defined in the vocabulary are supposed to be used in order to create semantically correct RDF documents (optional).

Although the visualization of a vocabulary is limited by the number of elements that can be shown in a web page without causing information overload, users often find it useful for getting a quick view of the described ontology.


Users can also view the details of both classes and properties that have been extracted from the installed vocabularies.

  • For both classes and properties, the resource URI, a human-readable label, a description and a link to the vocabulary that defines them are provided.
  • Classes show a list of all the properties that they are the rdfs:domain of (properties that they have), as well as a list of all the properties that they are the rdfs:range of (properties that return them).
  • On the other hand, properties present the user with the classes that are their rdfs:domain and rdfs:range.
  • All elements are presented in a way that allows users to navigate between vocabularies, classes and properties seamlessly and without any interruption, something that helps them select the best entities describing real world entities and their connections.
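The class and property views described above amount to indexing rdfs:domain and rdfs:range statements in both directions. The sketch below illustrates the idea on plain (subject, predicate, object) tuples; it is not the repository's extraction code, and the function name is an assumption. The sample triples use well-known FOAF terms.

```python
from collections import defaultdict

RDFS_DOMAIN = "http://www.w3.org/2000/01/rdf-schema#domain"
RDFS_RANGE = "http://www.w3.org/2000/01/rdf-schema#range"


def index_class_properties(triples):
    """Group properties by the class they have as domain and as range."""
    domain_of = defaultdict(list)
    range_of = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate == RDFS_DOMAIN:
            domain_of[obj].append(subject)
        elif predicate == RDFS_RANGE:
            range_of[obj].append(subject)
    return dict(domain_of), dict(range_of)


FOAF = "http://xmlns.com/foaf/0.1/"
triples = [
    (FOAF + "knows", RDFS_DOMAIN, FOAF + "Person"),
    (FOAF + "knows", RDFS_RANGE, FOAF + "Person"),
    (FOAF + "age", RDFS_DOMAIN, FOAF + "Agent"),
]
domain_of, range_of = index_class_properties(triples)
```

Here foaf:Person appears in both indexes (for foaf:knows), while foaf:Agent appears only as a domain.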



Usage feedback

Feedback, evaluation of the presented material and community discussions are all useful tools for promoting the material that best fits the needs of each enterprise community and for resolving questions and problems that end users may face. In the LinDA Vocabulary and Metadata Repository, two main mechanisms let users give feedback and interact with each other:

  • Vocabulary rating: By rating it, users let others know how well a particular vocabulary is suited for a specific business’s needs. Highly rated vocabularies are more likely to contain material that can be used to describe business objects and actions well.
  • Vocabulary discussions: Through comments, users can make statements about a vocabulary and its contents and get their questions answered. While reading a whole conversation is much slower than judging a vocabulary by its rating, it can give the user a lot of extra information from others who have used the vocabulary or tried its terms in data transformations.

Repository enrichment

It is possible that, while offering most of the useful terms and relations, the repository out of the box might not contain objects, actions and restrictions that come from specific business needs. For this reason, specific actions have been taken to enrich the initial repository contents with useful metadata, as well as let users add more content to the repository:

  • Usage examples are a case of vocabulary metadata that was added to facilitate vocabulary usage by end users. Examples were gathered from various online sources, such as publications, standard recommendations, websites devoted to specific ontologies, presentations and online forums. They have been chosen in a way that presents the basic features of every vocabulary, giving the reader an idea of how to compose data sources based on the particular vocabulary and of the vocabulary’s structure in general.
  • An administration panel which lets super users create new vocabularies. After filling in the basic information required, mainly an ontology document, the repository will automatically create information about entities described in the vocabulary without user intervention. Administrators are urged not to edit existing vocabularies but to extend them using RDF constructs like rdf:about. Editing vocabularies will lead to inconsistencies between the local and the central repositories, and changes could be overwritten by future updates.



Term suggestion

Through the repository, the ability to search for term references is exposed both to the end user and to other LinDA tools. To improve search speed, database records are exposed to an Elasticsearch server, which is then used to find information relevant to a query, removing workload from the application server.

Also, using the Microsoft Translator web API, users can search for terms in their own native language or in other languages used in the business, overcoming the fact that metadata for ontology entities is typically in English. When localization is enabled, the language of the search term is detected and, if it is not English, its most common translation is used for the query. End users see the translated term alongside the search results, so they can refine the query if they feel the translation has changed the original meaning. With localization turned off, queries run considerably faster, as there is no need for an extra HTTP request to the translation API.
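The localization step can be sketched as a small pre-processing function placed in front of the search. In the sketch below, detect_language and translate are caller-supplied placeholders standing in for the translation service; they are not real Microsoft Translator calls, and all names are assumptions made for this example.

```python
def prepare_query(term, detect_language, translate, localization=True):
    """Return (query_term, was_translated) for the search backend."""
    if not localization:
        # No extra HTTP request: the term is searched as typed.
        return term, False
    language = detect_language(term)
    if language == "en":
        return term, False
    return translate(term, language, "en"), True


# Stub implementations, used only to show the control flow.
def fake_detect(text):
    return "de" if text == "Freund" else "en"


def fake_translate(text, source, target):
    return {"Freund": "friend"}[text]


result = prepare_query("Freund", fake_detect, fake_translate)
```

Returning the flag alongside the term lets the interface show the translated term next to the results, as described above.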



Apart from searching through the web application, other tools can use the Vocabulary Repository Suggest API. By invoking a RESTful method and passing a search term as a parameter, applications get as a response a JSON document that contains all suggested vocabularies, classes and properties. Also, the search type might be specified, further narrowing down search results. This API enables all other LinDA tools to have a common ontology repository to ask for terms in the process of transforming real world objects to semantically rich data.


Suggest API

One of the core features of the repository is an API that exposes the aforementioned functionality. This enables third-party developers and SMEs to build their products on top of the capabilities of the vocabulary repository and enhance them with more advanced features. One such tool is the Transformation Engine, which relies on this API to complete the data transformation.

This is a public RESTful API, available without restrictions to anyone interested in consuming it. It reuses the database of vocabularies that has already been populated and presents this information in JSON format.

A simple example of a response is shown below:

}

The response contains the label of the vocabulary, its ranking on the platform, the original URI and the vocabulary name. This format was chosen so that an application can easily search through numerous results and quickly pick its match. For more information, the application can always look up the details of a specific result.

The API is located at the path ‘coreapi/recommend/’, appended to the URL of the platform.

So, the full URL is http://platformUrl/coreapi/recommend/ and the parameters it accepts are listed below:
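A client could pick its match from such a response along the following lines. This is a sketch only: the key names ("label", "ranking", "uri", "vocabulary") and the sample values are invented for illustration, since the exact field names of the real response are not shown here.

```python
import json

# Made-up payload with assumed key names; not actual API output.
sample = json.loads("""
[
  {"label": "Person", "ranking": 4.5,
   "uri": "http://xmlns.com/foaf/0.1/Person", "vocabulary": "foaf"},
  {"label": "Person", "ranking": 3.0,
   "uri": "http://schema.org/Person", "vocabulary": "schema"}
]
""")


def best_match(results):
    """Return the highest-ranked suggestion, or None for an empty list."""
    if not results:
        return None
    return max(results, key=lambda item: item["ranking"])


choice = best_match(sample)
```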

  1. class

The class attribute lets applications search for classes, either by a keyword in the class name or by the full class name.

  2. property

This search attribute restricts the search to properties. For example, an application could search for a property called "name" with coreapi/recommend/?property=name, and part of the result would be:



{

},

 {

},

 {

}]


  3. q

The general q parameter lets a developer or an application search all attributes with a single term, without specifying whether it is looking for a class or a property. In effect, the response combines the results of the class and property searches for the same search term.

  4. page

This parameter enables pagination, making it easier for applications to browse numerous results. Each page returns 20 results, and the application can keep requesting the next page until the results are exhausted.

An example request with pagination could be modelled as:

coreapi/recommend/?class=friend&page=3
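The parameters above can be combined in a small client helper. The following is a minimal sketch using only the Python 3 standard library, with several assumptions that are not confirmed by the API description: page numbering is taken to start at 1, a page with fewer than 20 results is treated as the last one, and fetch_page is a caller-supplied function that performs the HTTP request and returns the decoded JSON list.

```python
from urllib.parse import urlencode

RECOMMEND_PATH = "coreapi/recommend/"
PAGE_SIZE = 20  # results per page, as stated above


def recommend_url(platform_url, search_type, term, page=None):
    """Build a Suggest API URL for a class, property or q search."""
    if search_type not in ("class", "property", "q"):
        raise ValueError("unsupported search type: %r" % search_type)
    params = [(search_type, term)]
    if page is not None:
        params.append(("page", page))
    return platform_url.rstrip("/") + "/" + RECOMMEND_PATH + "?" + urlencode(params)


def iter_results(fetch_page, platform_url, search_type, term):
    """Yield results page by page until a short or empty page is returned."""
    page = 1  # assumption: the first page is numbered 1
    while True:
        results = fetch_page(recommend_url(platform_url, search_type, term, page))
        for item in results:
            yield item
        if len(results) < PAGE_SIZE:
            break
        page += 1


url = recommend_url("http://platformUrl", "class", "friend", page=3)
```

Passing fetch_page in as an argument keeps the helper independent of any particular HTTP library and makes it easy to test with a stub.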