In an ideal Linked Data world, all data would be in the Linked Data format from the beginning and could thus be processed by Linked Data procedures. In reality, however, only a small fraction of all data is available as Linked Data; the majority exists in various legacy formats. This is where data transformation towards Linked Data comes in; after all, format uniformity is the pivotal raison d’être of Linked Data. In the LinDA workbench web application, a Transformation Engine handles this step, which is typically the first step in the workflows provided by the workbench. The “visible” part of LinDA's transformation engine is the Mapping Designer, in which the user specifies the details of the transformation process.
In recent years, Linked Data has been a dynamic, emerging field of research, so it is no surprise that a large number of data transformation tools have appeared (cf. the list at http://linda-project.eu/linked-data-tools/). It is also no surprise that these tools do not follow a clearly recognizable evolution path; instead, there are many me-too tools that focus on some particular detail deemed important by their authors. A survival-of-the-fittest consolidation has yet to come about.
Transformation tools towards Linked Data can be assessed by the following classes of properties:
- Handling the idiosyncrasies of the respective source format: For example, some tools concentrate on details such as flexibly specifying multiple tables within a spreadsheet and how each of them is to be transformed.
- Non-functional properties: For example, many tools focus on interoperability with other tools or other formats in certain frameworks. Others focus on efficiency for bulk data or on automating the mapping design.
- Properties of the generated Linked Data: Here, concepts and decisions that are commonly subsumed under semantic enrichment come into play. Relevant topics include:
  - finding a place in the spectrum between simplistic Direct Mapping and full-fledged reverse engineering of the Entity-Relationship structure behind the source data;
  - methods of finding and choosing vocabularies;
  - named-entity recognition (NER).
In the early days of Linked Data, most published transformation tools focused on the first two classes, and their large number merely reflects the variety of opinions about what a tool should focus on. The third class, however, is more difficult to address, for two reasons. First, the question of what constitutes “good” Linked Data has not yet been fully settled, nor is its relevance fully recognized. An initial contribution towards settling it may be the 5-star ranking scheme for Linked Open Data devised by Tim Berners-Lee. Second, devising concepts for, say, a suitable NER or an interactive vocabulary selection is much more difficult than devising concepts for, say, URI generation or spreadsheet slicing.
The LinDA Transformation Engine and the LinDA Mapping Designer have since made some progress, and the purpose of this blog entry is to report on it.
As to the input data formats, the research efforts focus mainly on support for CSV files (comma-separated values, including files with separator characters other than commas) and Excel® files. The hands-on experience gathered from CSV use cases helps to understand and respond to practical application requirements, and this knowledge will be directly applicable to the implementation of other input formats.
The graphical user interface (GUI) has been improved and made more intuitive. As before, the GUI guides the user through the steps needed to define a mapping from an input format to Linked Data (such GUIs are called wizards). Now, at every step, the first few records of the input data and the resulting RDF triples are shown together on the screen. This gives the user immediate feedback on the effects of their settings.
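To illustrate what such a preview could compute, here is a minimal Python sketch that turns the first few CSV records into N-Triples lines. It is not the LinDA implementation: the base URI `http://example.org/`, the row-number subjects, the header-derived predicates, and the function name `preview_triples` are all assumptions made for this example, and every value is emitted as a plain literal.

```python
import csv
import io
from itertools import islice
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical base URI, chosen for this sketch


def escape_literal(value):
    """Escape the characters that N-Triples forbids inside a literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def preview_triples(csv_text, n=3, base=BASE):
    """Return N-Triples lines for the first n records of csv_text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for i, row in enumerate(islice(reader, n), start=1):
        subject = f"<{base}row/{i}>"  # one subject per record
        for header, value in row.items():
            if header is None or value is None:
                continue  # skip cells of ragged rows
            predicate = f"<{base}prop/{quote(header.strip())}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


sample = "name,city\nAlice,Berlin\nBob,Paris\n"
for line in preview_triples(sample, n=2):
    print(line)
```

A real wizard would recompute this preview whenever the user changes a setting, so that the effect of each choice is visible right away.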
For convenience, the CSV dialect, i.e., the actual line and column delimiters and the quote character, is guessed automatically from the input data at hand; the guess can be overridden manually when necessary.
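Dialect guessing with a manual override can be sketched with the `csv.Sniffer` class from Python's standard library. This only illustrates the idea and says nothing about how LinDA implements it; the function name `guess_dialect`, its `override_delimiter` parameter, and the list of candidate delimiters are assumptions made for this example.

```python
import csv
import io
from typing import Optional


def guess_dialect(sample: str, override_delimiter: Optional[str] = None):
    """Guess the CSV dialect from a text sample, with an optional manual override."""
    # Restrict the candidates to a few common separator characters (an assumption).
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    if override_delimiter is not None:
        # The user's explicit choice wins over the automatic guess.
        dialect.delimiter = override_delimiter
    return dialect


sample = "id;name;city\n1;Alice;Berlin\n2;Bob;Paris\n"
dialect = guess_dialect(sample)
rows = list(csv.reader(io.StringIO(sample), dialect))
print(repr(dialect.delimiter), rows[1])
```

Sniffing is a heuristic and can fail or guess wrongly on small or irregular samples (`csv.Sniffer.sniff` raises `csv.Error` when it cannot decide), which is why a manual override remains useful.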
Every data column selected for transformation gets an associated RDF property. The automated derivation of this RDF property from the respective column header is currently being integrated. (In the internal structure of the LinDA workbench, subtasks related to vocabularies are “outsourced” to another component, the Vocabulary and Metadata Repository.)
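As a rough illustration of deriving a property from a column header, the following Python sketch normalizes the header to a camelCase local name and looks it up in a small table of known terms, falling back to a newly minted property. The lookup table, the fallback namespace `http://example.org/prop/`, and the function names are hypothetical; in LinDA this task belongs to the Vocabulary and Metadata Repository, whose actual method is not described here.

```python
import re

FOAF = "http://xmlns.com/foaf/0.1/"
# Tiny stand-in for a vocabulary lookup; a real repository would be far larger.
KNOWN_TERMS = {
    "name": FOAF + "name",
    "homepage": FOAF + "homepage",
}
FALLBACK_NS = "http://example.org/prop/"  # hypothetical namespace for minted properties


def local_name(header):
    """Turn a column header such as 'Date of Birth' into 'dateOfBirth'."""
    words = re.findall(r"[A-Za-z0-9]+", header)
    if not words:
        return "column"  # placeholder for headers without usable characters
    return words[0].lower() + "".join(w.capitalize() for w in words[1:])


def property_for(header):
    """Return a property URI for a header: a known term if one matches, else a minted one."""
    name = local_name(header)
    return KNOWN_TERMS.get(name, FALLBACK_NS + name)


for header in ["Name", "Date of Birth"]:
    print(header, "->", property_for(header))
```

An exact string match like this is the simplest possible strategy; it only shows where a vocabulary lookup would plug in.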
As to the structure of the generated output, it is planned to incorporate some form of named-entity recognition (NER) into LinDA's transformation engine. This will add one more star in the above-mentioned 5-star classification scheme for Linked Data to the data delivered by LinDA's transformation engine. The plan is that, in the end, all five stars will be achieved.
In conclusion, the aim is to provide a no-frills transformation engine for the LinDA workbench, one that combines the essential features with a user-friendly interface and a gentle learning curve.