LinDA Analytics and Data Mining

Github

https://github.com/LinDA-tools/LindaWorkbench/tree/master/linda/analytics

Motivation

Data analytics is concerned with the organization and analysis of data sets in order to discover patterns, classify the available data, produce forecasts and trend analyses, and so on. It has the potential to help SMEs identify the data that matters most to current and future business decisions, provide insights based on the analysis, answer specific business questions and facilitate or guide the decision-making process. However, in order to exploit the full power that analytics can offer to businesses, challenges related to the proper handling of data from heterogeneous sources and formats need to be addressed (a set of challenges for the production of Linked Data analytics is available in ). Moreover, in order to produce meaningful analytics, the input datasets have to contain enough context to allow the extraction of higher-level information.

The LinDA Analytics and Data Mining component has been designed and developed with these challenges in mind, as well as the functionalities provided by the rest of the LinDA components. It aims to provide user-friendly interfaces through which end users can run analyses on Linked Data input, supported by a wide set of algorithms.

Overview

A library of basic and robust data analytic functionality is provided, enabling SMEs to utilise and share analytic methods on Linked Data in order to discover and communicate meaningful new patterns that were unattainable or hidden in the previously isolated data structures. High importance is given to the design and deployment of user-friendly tools. The approach followed is therefore to design and deploy workflows for algorithm execution based on their categorisation. An indicative categorisation includes:

  1. Classifiers, for identifying to which of a set of categories a new observation belongs, based on a training set.
  2. Clusterers (unsupervised learning), for grouping a set of objects in such a way that objects in the same group are more similar to each other than to objects in other groups.
  3. Statistical and Forecasting Analysis, for discovering interesting relations between variables and providing information on future trends.
  4. Attribute Selection (evaluators and search methods), for selecting a subset of relevant features for use in model construction, based on evaluation metrics.

Components and Functionalities

The LinDA Analytics and Data Mining Ecosystem supports the following functionalities: selection of the desired dataset from the LinDA data sources; creation of a specific dataset through the SPARQL querying mechanisms together with the RDF-Any transformation components; selection and configuration of the appropriate analytics algorithm; execution of the analysis; collection of the output in the selected format; and interconnection of the output dataset with both the input dataset used for the analysis and the analytics process that produced it. Authorisation schemes and activity tracking (analyses performed, along with the corresponding input and output) are also supported.

The LinDA Analytics and Data Mining Ecosystem consists of the following components:

  1. Dataset Selection and Processing Component: provides the end-user interfaces for selecting and processing the dataset to be used in the analysis. In practice, it offers interconnection interfaces to other LinDA components, such as the LinDA Data Sources and the Publication and Consumption Framework.
  2. Analytics Algorithm Selection and Configuration Component: coordinates the process of suggesting and configuring an analytics extraction process. Based on the category of the selected algorithm, the end user is guided through a specific workflow for setting up the algorithm's parameters.
  3. Analytics Algorithm Execution Component: carries out the analytics process itself. Interconnections with external tools that execute the various sets of algorithms are handled within the component, keeping the overall process transparent to the end user.
  4. Output Handling Component: provides the output of the analysis in a format that is meaningful to the end user and easily exploitable by other LinDA tools (e.g. the LinDA Visualisation tool). Various output formats are supported, depending on the requirements of each algorithm. For RDF output, the output dataset is interconnected with the input dataset and the executed analytics process. A sketch of how these components might chain together is given after this list.
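
The following Python sketch illustrates, at a purely conceptual level, how these four components could chain together. All class, method and attribute names here are hypothetical and are not part of the actual LinDA codebase:

 # Conceptual sketch of the analytics pipeline; every name below is
 # hypothetical and chosen only to mirror the four components above.
 class AnalyticsPipeline:
     def __init__(self, datasets, algorithms):
         self.datasets = datasets      # LinDA data sources (hypothetical handle)
         self.algorithms = algorithms  # catalogue of supported algorithms

     def run(self, query, algorithm_name, params, output_format="RDF/XML"):
         # 1. Dataset Selection and Processing: prepare input via SPARQL
         input_data = self.datasets.execute_sparql(query)

         # 2. Algorithm Selection and Configuration: merge user parameters
         #    over the algorithm's defaults
         algorithm = self.algorithms[algorithm_name]
         config = {**algorithm.default_params, **params}

         # 3. Algorithm Execution: delegate to the external engine (e.g. Weka)
         result = algorithm.execute(input_data, config)

         # 4. Output Handling: serialise and link output to input and process
         return result.serialise(output_format, link_to=input_data)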


In order to support Linked Data analytics, an ontology has been designed, based on a set of existing ontologies together with specific extensions. It targets the interconnection of the output dataset with the input dataset or datasets, as well as the description of the analytic process followed in each case. Specifically, concepts from the Friend Of A Friend (FOAF) Ontology, the PROV Ontology and the Semantic science Integrated Ontology (SIO) are used in the definition of the LinDA Analytics Ontology. Each analytic process realised in the LinDA Analytics and Data Mining Ecosystem is represented according to this ontology, saved, and remains accessible for future use. Furthermore, if an analytic process needs to be re-executed, the end user can either re-run the algorithm with direct access to the input sources, or re-execute the query that prepares the input dataset and then run the appropriate algorithm.
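
As an illustration of the idea (not the actual LinDA Analytics Ontology, whose exact terms are not listed here), the following Python sketch uses rdflib to record an analytic run with PROV and FOAF concepts; the resource names and the specific property choices are assumptions:

 from rdflib import Graph, Namespace, Literal
 from rdflib.namespace import RDF, FOAF

 PROV = Namespace("http://www.w3.org/ns/prov#")
 EX = Namespace("http://example.org/analytics/")  # hypothetical base URI

 g = Graph()
 run = EX["run-42"]                 # one analytic process execution
 input_ds = EX["input-dataset"]
 output_ds = EX["output-dataset"]

 g.add((run, RDF.type, PROV.Activity))
 g.add((input_ds, RDF.type, PROV.Entity))
 g.add((output_ds, RDF.type, PROV.Entity))

 # Link the output to both the input dataset and the process that produced it
 g.add((run, PROV.used, input_ds))
 g.add((output_ds, PROV.wasGeneratedBy, run))
 g.add((output_ds, PROV.wasDerivedFrom, input_ds))

 # Record who ran the analysis, using FOAF for the agent description
 analyst = EX["analyst-1"]
 g.add((analyst, RDF.type, FOAF.Person))
 g.add((analyst, FOAF.name, Literal("Example Analyst")))
 g.add((run, PROV.wasAssociatedWith, analyst))

 print(g.serialize(format="turtle"))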



The output produced by the analytic process is in RDF format and is automatically interconnected with the input data sources. End users are thus able to access the analytics processes executed in the past, along with the input and output files used or produced. This functionality is critical when algorithms or processes have to be re-evaluated in business scenarios where the input data are updated. The produced output may also be directed to the LinDA Visualisation tool for the production of meaningful graphs by the end user.

Technical Details

The front-end components are based on the Django Python web framework, while the Analytics and Data Mining Ecosystem is developed as a Django module, ready to be plugged into any Django installation. Python, and more specifically the requests HTTP library and the urllib/urllib2 modules from the Python standard library, is used for this purpose. The produced module acts as a RESTful API consumer of a Java RESTful server that is responsible for the actual analytics and data mining work. The Java-based RESTful web services are developed with RESTEasy, an implementation of the JAX-RS specification, and deployed on a JBoss server.
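
A minimal sketch of this client-server interaction, assuming a hypothetical endpoint and payload (the actual routes and JSON fields of the LinDA analytics server are not documented here), might look as follows:

 import requests

 # Hypothetical endpoint of the Java/RESTEasy analytics server on JBoss;
 # the real route and payload schema may differ.
 ANALYTICS_URL = "http://localhost:8080/analytics/run"

 payload = {
     "algorithm": "classification.j48",   # assumed algorithm identifier
     "parameters": {"confidence": 0.25},  # assumed parameter name
     "input": "http://example.org/input-dataset.rdf",
     "output_format": "RDF/XML",
 }

 response = requests.post(ANALYTICS_URL, json=payload, timeout=300)
 response.raise_for_status()  # fail loudly on HTTP errors
 print(response.text)         # serialised analysis result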

The Analytics and Data Mining tool builds on existing open-source platforms for extracting data analytics. Specifically, the open-source Weka toolkit constitutes the basis for the execution of the algorithms, while specific component extensions and interfaces are being developed for proper interconnection with the rest of the LinDA tools, as well as for the simplification of the overall process. In addition to Weka, further open-source tools, such as the R project for statistical computing, are supported.
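
For reference, Weka algorithms can also be driven through Weka's standard command-line interface; a server-side component might wrap an invocation like the one sketched below. The jar and file paths are placeholders, while -C (confidence factor) and -M (minimum instances per leaf) are standard options of Weka's J48 classifier:

 import subprocess

 # Run Weka's J48 decision-tree classifier from the command line.
 result = subprocess.run(
     ["java", "-cp", "weka.jar",
      "weka.classifiers.trees.J48",
      "-t", "train.arff",   # -t: training file (placeholder path)
      "-C", "0.25", "-M", "2"],
     capture_output=True, text=True, check=True,
 )
 print(result.stdout)  # Weka prints the model and evaluation summary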

Limitations

As stated earlier, the analytics and data mining component has been designed and deployed with user-friendliness for SME employees as its main objective. However, even though specific workflows are designed per algorithm category and hide part of the complexity, detailed knowledge of an algorithm's functionality and configuration parameters is still required. Since, in most cases, a series of analyses has to be carried out while also experimenting with the parameters, the end user remains responsible for properly configuring the relevant parameters of each algorithm according to his needs.

Furthermore, the quality of the input Linked Data is another factor that plays an important role during the analysis. As mentioned in the challenges for producing Linked Data analytics, the existence of meaningful and well-interconnected input data is of major importance.

Conclusion and Future Work

Summarising, at the current implementation phase all the desired functionalities are implemented and validated through a set of supported algorithms and the use of test datasets. In the upcoming period, the integration of further algorithms into the analytics and data mining component is envisaged, focusing mainly on the requirements imposed by the analytics pilots within LinDA. By integration we mean the design and deployment of user-friendly interfaces for each algorithm's configuration and execution, as well as the suggestion of default values (in case the end user is not familiar with an algorithm's configuration parameters).

Furthermore, given that the input data sources usually provide different ways of referring to the base URL of each record, and given the need to interconnect the output datasets with the input datasets, we plan to deploy a component that homogenises the base URL description and provides a unique, standardised reference to the input data sources in all cases.
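
As a rough illustration of what such homogenisation could involve (this component does not exist yet, so everything below, including the canonical base URI, is an assumption), one could rewrite record URIs onto a common base:

 from urllib.parse import urlparse

 # Hypothetical normalisation: map every record URI onto one common base,
 # keeping only the path segment that identifies the record. The planned
 # LinDA component may work quite differently.
 CANONICAL_BASE = "http://linda.example.org/resource"

 def homogenise_uri(uri: str) -> str:
     path = urlparse(uri).path.rstrip("/")
     record_id = path.rsplit("/", 1)[-1]
     return f"{CANONICAL_BASE}/{record_id}"

 print(homogenise_uri("http://source-a.org/data/records/42"))
 # -> http://linda.example.org/resource/42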


Usage

On the introductory page the end user selects the algorithm category and the algorithm to be executed, provides input for the configuration of the algorithm's parameters (default values are loaded automatically), provides a description of the analytics process to be realised (for future reference) and either selects the SPARQL queries to be used for the preparation of the input datasets (from a list of available queries) or directly loads a file in CSV or ARFF format. Autocomplete is supported to help the end user locate the appropriate query: when choosing a predefined query, the user just types part of the query's name or content and matching queries appear in an autocomplete list. The end user also selects the desired output format (RDF/XML, NTriples, Turtle, CSV, etc.) and proceeds to the analysis execution part. In the right sidebar, a list of previously executed analytic processes is available and can be loaded. At various points, information buttons provide further descriptions and tips to the end user.


The result of the analytic process is then displayed. At the top, the RDF output is provided and can be added to the existing LindaWorkbench data sources. Furthermore, the user can inspect the model of the chosen algorithm, where one exists. On the right, the user can see more information about the process. By pressing the selected query's info button, the user may also see the SPARQL query and download the query result in CSV format.