Transformation Engine - RDF Conversion Tool
In order to produce an RDF graph that follows the design guidelines for Linked Data, the following sub components of the Transformation Tool (TT) have been designed as part of the Transformation Engine:
- Recommendation Service: The functionality of this sub component allows the Data View (UI) to map data columns to properties and RDF triples to classes of popular Linked Data vocabularies. This sub component eliminates the effort and cost of searching for adequate vocabularies; its design is a response to the lack of such help tools in the Linked Data tool landscape. This component relies entirely on the LinDA Vocabulary and Metadata Repository.
- NER Service: this sub component substitutes literals with URIs in order to ensure disambiguation by integrating well-known Linked Data triple stores (this process utilises tools that implement concepts of Named Entity Recognition). For the Transformation Tool, DBpedia has been selected as an adequate solution for this task. The NER Service calls the DBpedia Lookup Keyword Search API, supplying column names or keywords entered manually by the user. The filtered results are passed to the Data View (UI) for the user’s selection.
- RDF Publication Service: this sub component is in charge of publishing the newly transformed RDF to a triple store (remote or local). The RDF Publication Service provides the means to name the RDF data set to be published and to overwrite an existing one.
- R2RML Generator: this sub component is responsible for generating an R2RML mapping based on the transformation design employed by the user. The generator parses an internal data structure provided by the TT and maps it to the R2RML standard. The output is a Turtle file that can be downloaded or subsequently used for setting up SPARQL endpoints.
- RDF Transformation: based on the semantics of the transformation mapping designed by the user, the original data source (RDB or CSV/Excel) is transformed into RDF by this sub component. This sub component also provides a download functionality for the created RDF. The transformation result is in N3 serialisation.
- Type Guessing Service: this sub component provides the functionality to auto-suggest the data type of each column (e.g. xsd:decimal).
- Transformation Mappings Repository: this sub component mimics a repository that holds all previously created and saved transformation mappings. This functionality allows the user to re-load and edit existing transformation mappings.
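The idea behind the Type Guessing Service can be illustrated with a minimal sketch. It is not the TT's actual implementation: the function name guess_datatype, the set of candidate types, and the rule "the first type that fits every non-empty cell wins" are assumptions made for illustration only.

```python
import re
from datetime import date

# Lexical patterns for a few XSD types (a deliberately small, assumed set).
_INTEGER = re.compile(r"[+-]?[0-9]+")
_DECIMAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")
_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def _is_date(text):
    """True if text is a valid calendar date written as YYYY-MM-DD."""
    if not _DATE.fullmatch(text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


# Ordered from most specific to least specific.
_CHECKS = [
    ("xsd:integer", lambda s: bool(_INTEGER.fullmatch(s))),
    ("xsd:decimal", lambda s: bool(_DECIMAL.fullmatch(s))),
    ("xsd:boolean", lambda s: s in ("true", "false")),
    ("xsd:date", _is_date),
]


def guess_datatype(values):
    """Suggest an XSD datatype for a column given its cell values as strings."""
    cells = [v.strip() for v in values if v and v.strip()]
    if not cells:
        return "xsd:string"
    for datatype, matches in _CHECKS:
        if all(matches(cell) for cell in cells):
            return datatype
    return "xsd:string"
```

Empty cells are ignored, so a column with missing values can still receive a specific suggestion; any column with at least one non-matching cell falls back to xsd:string.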
The implementation of the TT was ported to Python 3 in the second year of the project and realised as a Django App. This reduced the overhead and simplified the integration into the LinDA workbench. The choice of Python and the Django framework was also motivated by the goal of minimizing the number of different technologies in the project.
The Linked Data Services Layer represents the Django App backend and communicates with the Presentation Layer via the Django template system. This layer holds all programming logic for accessing files and databases, RDF transformation and other necessary features and functionality.
In order to publish RDF data to OpenRDF Sesame, the backend communicates over HTTP using the RESTful interface provided by the triple store.
The Data Layer is an abstraction of different storages (local and remote) and the models necessary for the functioning of the TT. While a transformation is being designed, the transformation mapping is stored on the fly in the Data Layer and synchronised with the Mapping designer in the Presentation Layer. It is additionally stored persistently upon the user’s request.
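As a hedged illustration of such a call, the sketch below builds (but does not send) an HTTP request against the statements resource of a Sesame repository. The function name build_publish_request, the example server URL, the repository ID, and the use of a named graph for the data set name are assumptions and not taken from the TT source code; the media type text/turtle is likewise assumed and may need adjusting to the serialisation and Sesame version in use.

```python
import urllib.parse
import urllib.request


def build_publish_request(server_url, repository_id, rdf_data,
                          graph_uri=None, overwrite=False):
    """Build an HTTP request that uploads RDF to a Sesame repository.

    POST adds the statements; PUT replaces the existing ones
    (restricted to the given context, if one is supplied).
    """
    url = "{}/repositories/{}/statements".format(
        server_url.rstrip("/"),
        urllib.parse.quote(repository_id, safe=""),
    )
    if graph_uri is not None:
        # Sesame expects the context as an N-Triples encoded URI: <...>
        url += "?" + urllib.parse.urlencode(
            {"context": "<{}>".format(graph_uri)}
        )
    return urllib.request.Request(
        url,
        data=rdf_data.encode("utf-8"),
        headers={"Content-Type": "text/turtle; charset=utf-8"},
        method="PUT" if overwrite else "POST",
    )
```

The request would then be sent with urllib.request.urlopen(request); error handling and authentication are omitted here.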
Libraries and dependencies
The following is a list of third party libraries used in the TT:
- jQuery: for frontend development
- Pandas Data Analysis Library: Python library for data structures and data analysis
- XLRD: Python library to extract data from Excel files
- Font Awesome: for icons
All source code, together with information about how to set up the Transformation Tool, can be found in the GitHub repositories of the project: https://github.com/LinDA-tools/TransformationTool for the standalone Django App and https://github.com/LinDA-tools/transformation for the version to be integrated into the LinDA workbench.
User Interface and Workflow
This section describes the workflow of data transformations in the Transformation Tool. As the LinDA workbench is an interactive tool, this workflow is described in terms of its user interface, by means of a simple example.
The conversion of a sample CSV file to Linked Data is explained step by step along the corresponding wizard interface, with a series of screenshots. After that, the differences between conversions of relational data (RDB) and CSV conversions are summarized, without screenshots.
From the start page the user selects the transformations section of the LinDA workbench, and the following data-source selection page is shown.
On the data-source selection page, the user can choose to transform a CSV or Excel file or a database, or to reopen a previously stored transformation for inspection, completion, or modification. In this example, the “CSV/Excel” branch is chosen. Clicking the corresponding button leads to the next page, the Data Upload page.
On the Data Upload page, the user (i) selects a CSV or Excel file and (ii) clicks the upload button. As a result, the column headers and the first 10 lines are shown, chiefly in order to allow the user to check that s/he chose the correct file, and that, in the case of a CSV file, the parsing process has returned correct results. If that is not the case, the user can set the characters for the line end, quote, field delimiter, and/or escape character, click the Apply button, and check again.
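The preview step can be sketched as follows. This is an illustrative example using Python's standard csv module, not the TT's own code (which, according to the library list, relies on Pandas); the function name preview_csv and its parameters are assumptions. The line-end setting is not modelled, because Python's CSV reader recognises the common line endings by itself.

```python
import csv
import io
from itertools import islice


def preview_csv(text, delimiter=",", quotechar='"', escapechar=None,
                max_rows=10):
    """Return the column headers and the first max_rows data lines."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=delimiter,
        quotechar=quotechar,
        escapechar=escapechar,
    )
    header = next(reader, [])              # first line holds the column headers
    rows = list(islice(reader, max_rows))  # read no more than max_rows lines
    return header, rows
```

If the preview looks wrong (for instance, everything ends up in a single column), calling the function again with a different delimiter corresponds to changing the settings and clicking the Apply button.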
When the upload of the source file is finished, the user clicks “next step”, and the Column Selection page is shown.
On the Column Selection page, the user selects which data columns are to be included in the transformation, using the checkboxes in the column headers. At least one column must be selected. When the column selection is finished, the user clicks “next step”, and the RDF subject specification page is shown.
LinDA’s transformation approach for tabular data requires that an RDF subject is generated for every line of data. How this is to be done is specified on the RDF subject specification page.
The user has two principal options:
- “Use blank nodes only”: The user need not make any further specifications. The price to be paid is that the automatically generated subjects will be non-descriptive.
- “Provide a subject” (shown in the screenshot above): Here, the subjects are composed from a base URI (the same for all lines) and the contents of selected columns, typically with intermediate textual glue. Typically, those columns are selected which constitute the primary key of the table (in database parlance), so as to make the generated URIs unique. Note that the selection of columns that go into the subject URIs is independent from the selection of columns that are selected for the transformation (in the previous step).
The user has to specify all these settings in the entry boxes. A column reference can be inserted by clicking on the “+” sign in the corresponding column header.
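How a subject URI is composed from a base URI, column contents, and textual glue can be sketched as below. The function name build_subject and the representation of the user's settings as a list of ("column", name) and ("text", glue) pairs are assumptions for illustration; the TT's internal data structure may differ.

```python
from urllib.parse import quote


def build_subject(base_uri, parts, row):
    """Compose a subject URI for one data line.

    parts is a sequence of ("column", column_name) or ("text", glue) pairs;
    row maps column names to the cell values of the current line.
    """
    pieces = []
    for kind, value in parts:
        if kind == "column":
            # Percent-encode cell contents so that the result stays a valid URI.
            pieces.append(quote(str(row[value]), safe=""))
        elif kind == "text":
            pieces.append(value)
        else:
            raise ValueError("unknown part kind: {!r}".format(kind))
    return base_uri + "".join(pieces)
```

For example, with the base URI http://example.org/city/ and the parts country, "-", name, a line with the values GR and Agios Nikolaos yields http://example.org/city/GR-Agios%20Nikolaos.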
From this page on, the RDF view is shown in addition to the data view. In the RDF view, the user sees the transformation result (triples) of the first ten data lines in their current state of completion.
When the specification of the subject URIs is finished, the user clicks “next step”, and the RDF predicate specification page is shown.
LinDA’s transformation approach for tabular data requires that an RDF property is specified for every data column selected for transformation. This is done on the RDF predicate specification page.
In order to make this process as automatic as possible, a dedicated web service is employed, which is fed with terms and provides suggestions of RDF properties for the user to choose from. For every (selected) column, the web service is initially fed with the original column header from the source data file. For those columns where the web service does not find appropriate results, the user can replace the search words in the entry box with something different, upon which the web service is invoked again. If results are found, there is usually more than one, so the user has to choose.
When the specification of the RDF properties is finished, the user clicks “next step”, and the RDF object specification page is shown.
The result of a transformation is a set of RDF statements, each of which consists of a subject, a predicate, and an object. In principle, the objects are RDF literals derived from the table entries.
On this page, one of two things can optionally be done per column:
- The objects remain RDF literals, but are additionally equipped with an RDF data type. There is a small canonical set of datatypes to choose from.
- The literals are replaced by appropriate URIs (e.g., the string “Athens” is replaced by a URI that unambiguously refers to the Greek capital, e.g. http://dbpedia.org/resource/Athens). This procedure is itself prone to ambiguity (a string may match several entities) and needs to be employed with care.
Which of these features is to be used, if any, is selected column-wise by the user, in a drop-down menu at the top of each column in the data view.
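The three possible outcomes per table entry (plain literal, typed literal, or URI) can be illustrated with a small sketch that renders the object of a triple in N-Triples syntax. The function names are hypothetical, and the URI passed in is assumed to have been chosen by the user beforehand; the lookup itself is not shown.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def escape_literal(text):
    """Escape the characters that may not appear raw in an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def object_term(value, datatype=None, uri=None):
    """Render one table entry as the object of an RDF statement.

    uri:      if given, the literal is replaced by this URI.
    datatype: local name of an XSD datatype, e.g. "integer" (assumed form).
    """
    if uri is not None:
        return "<{}>".format(uri)
    literal = '"{}"'.format(escape_literal(value))
    if datatype is not None:
        literal += "^^<{}{}>".format(XSD, datatype)
    return literal
```

With no option selected, object_term("Athens") stays a plain literal; with a data type, "664046" becomes a literal typed as xsd:integer; with URI replacement, the entry is rendered as <http://dbpedia.org/resource/Athens>.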
When the specification of the object actions is finished, the user clicks “next step”, and the data enrichment page is shown.
On the data enrichment page, the user can again employ the web service in order to find one or more subject types. For every generated subject URI, rdf:type links will be established to all subject types specified here.
When the specification of the subject types is finished, the user clicks “next step”, and the Publish page is shown.
On the Publish page, the user specifies a name for the data set just generated, and can carry out one or more of the following actions:
- “Publish to triple store”: The data set is inserted into the triple store of the LinDA server under the specified name.
- “Download RDF”: The data set is downloaded onto the user’s computer in a readable RDF serialization format.
- “Download R2RML”: An R2RML script representing the transformation decisions is downloaded. This is useful only in the case of relational database transformations (see below), for the set-up of a SPARQL endpoint.
After this, the transformation is finished, and the workbench is free for other work.
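To give an impression of what the “Download R2RML” action produces, the following sketch generates a small R2RML mapping in Turtle from a simplified description of the transformation decisions. The function r2rml_mapping and its argument layout are assumptions for illustration and do not reflect the TT's internal data structure; only the R2RML vocabulary terms (rr:TriplesMap, rr:logicalTable, rr:subjectMap, rr:predicateObjectMap, and so on) come from the standard.

```python
def r2rml_mapping(table, subject_template, columns, rdf_class=None):
    """Return an R2RML mapping (Turtle) for one database table.

    subject_template: URI template such as "http://example.org/city/{id}".
    columns: list of (column_name, predicate_iri, xsd_local_name_or_None).
    rdf_class: optional class IRI attached to every generated subject.
    """
    subject = 'rr:template "{}"'.format(subject_template)
    if rdf_class:
        subject += " ; rr:class <{}>".format(rdf_class)

    statements = [
        'rr:logicalTable [ rr:tableName "{}" ]'.format(table),
        "rr:subjectMap [ {} ]".format(subject),
    ]
    for column, predicate, datatype in columns:
        obj = 'rr:column "{}"'.format(column)
        if datatype:
            obj += " ; rr:datatype xsd:{}".format(datatype)
        statements.append(
            "rr:predicateObjectMap [ rr:predicate <{}> ; "
            "rr:objectMap [ {} ] ]".format(predicate, obj)
        )

    body = " ;\n    ".join(statements)
    return (
        "@prefix rr: <http://www.w3.org/ns/r2rml#> .\n"
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n\n"
        "<#{}Map> a rr:TriplesMap ;\n    {} .\n".format(table, body)
    )
```

Each selected column becomes one rr:predicateObjectMap, the subject specification becomes the rr:template, and the subject types chosen on the data enrichment page would appear as rr:class entries.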
Additional features for transformation of relational data
The user interface for transformations of relational data is similar, but not identical, to that of tabular data. The reason behind this is that relational data is richer in structural information than its tabular counterpart. No new set of screenshots is given; instead, the differences are briefly enumerated:
- Instead of specifying a file for upload, database address and credentials have to be supplied.
- As databases usually have more than one table, the table-related steps in the user interface may need to be carried out more than once accordingly.
- The default subject URI generated by the transformation tool also contains the table name, in order to avoid undesired ambiguities.
- The user is given the opportunity to add database views of his/her own to the already existing tables and views. Since the mapping of database tables (and views) is structurally still the Direct Mapping, adding new views is the way to control the structure of the transformation result. An example where this is useful is undoing the Third Normal Form (3NF), which is considered good practice in database design but serves no purpose in Linked Data.
- Tables may be interlinked through Foreign Keys. If a Foreign Key column is also declared as such in the table schema, the Foreign Key entries are automatically substituted by the subject URIs of the referenced records.
(If such a substitution is desired for an undeclared Foreign Key column, the user would currently have to resort to a view declaration whose schema includes the missing Foreign Key declaration. A simpler dedicated procedure is planned as a future extension.)
- Again for reasons related to 3NF, many-to-many relations in databases are typically expressed in dedicated tables with exactly two columns that contain Foreign Keys. The transformation tool recognizes such situations and transforms many-to-many relations into direct relations between the subject URIs involved.
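The treatment of many-to-many link tables described in the last item can be sketched as follows. The function names, the URI layout (base URI, table name, key), and the example predicate are assumptions for illustration; the TT's detection of link tables and its actual default URI pattern are not reproduced here.

```python
from urllib.parse import quote


def subject_uri(base_uri, table, key):
    """Default-style subject URI that includes the table name (assumed layout)."""
    return "{}{}/{}".format(base_uri, table, quote(str(key), safe=""))


def link_table_triples(rows, base_uri, predicate, left, right):
    """Turn the rows of a two-column link table into direct relations.

    left and right are (foreign_key_column, referenced_table) pairs.
    Each row yields one (subject, predicate, object) triple that connects
    the subject URIs of the two referenced records.
    """
    left_column, left_table = left
    right_column, right_table = right
    for row in rows:
        yield (
            subject_uri(base_uri, left_table, row[left_column]),
            predicate,
            subject_uri(base_uri, right_table, row[right_column]),
        )
```

Instead of producing one subject per row of the link table, every row contributes exactly one triple between the two referenced records, which is also how a declared Foreign Key entry ends up as the subject URI of the referenced record.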