LinDA Analytics and Data Mining



The LinDA Analytics tool enables the execution of conventional analytic processes (e.g. regression, clustering, forecasting) against Linked Data.

GitHub

https://github.com/LinDA-tools/LindaWorkbench/tree/master/linda/analytics

Features

  • Select and execute an analytic process against a specific SPARQL query through user-friendly interfaces, with pre-configured parameters for specific algorithms.
  • Integration of Weka and R analytic algorithms. Supported algorithms include J48, M5P, Apriori and LinearRegression (Weka), as well as Arima, Moran's I, Kriging, NCF correlogram, ClustersNumber, KMeans, Ward Hierarchical Agglomerative, Model Based Clustering and LinearRegression (R).
  • Ontology for mapping information relevant to the analytic process and the input and output sources.
  • Support for input and output of analytic results in RDF format (N3, Turtle, RDF/XML); see the sketch after this list.
  • Tracking of the analytic processes executed per user.
  • Interconnection with the visualisation components for visualising the analytics' output.
  • Ability to save/load analytic processes per user.
  • Listing of saved analytic processes.
  • Ability to load datasets/training datasets in CSV or ARFF format.
  • Re-evaluation of an analytic process while keeping the same trained model.
  • Before re-evaluating an analytic process, the user can: a) change the output format, b) change the evaluation query, c) change the parameters of the current algorithm, d) search within his analytics, and e) have the analytic process description refilled automatically.
  • User feedback on tool usage: a) participation of queries in analytics, b) analytics effectiveness, c) data quality and output reuse, d) execution time of each analytic process.
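Since the tool emits RDF in these serialisations, its output can also be consumed programmatically. Below is a minimal sketch, assuming the rdflib Python library; the sample triples and the example.org names are hypothetical, not the LinDA Analytics Ontology.

 from rdflib import Graph
 
 # Hypothetical analytic output in Turtle; real output follows the LinDA Analytics Ontology.
 sample_output = """
 @prefix ex: <http://example.org/analytics/> .
 ex:process1 ex:usedAlgorithm "LinearRegression" ;
             ex:hasResult ex:result1 .
 """
 
 g = Graph()
 g.parse(data=sample_output, format="turtle")  # "n3" and "xml" also work, for N3 and RDF/XML
 for subj, pred, obj in g:
     print(subj, pred, obj)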

Example Tutorial

Before trying the Analytics tool: the user has to log in before running an analysis. Go to http://linda.epu.ntua.gr/ and press Sign in. You can create your own LinDA user via Sign up.

Example 1: How to prepare and interpret an analysis

  1. From the LinDA main page click Queries.
  2. Find the query with id 53 and description “out-of-pocket-health-expenditure, political stability, gov stability, GDP per capita”.
  3. Right-click the query and select Run & Edit.
  4. You are now redirected to http://linda.epu.ntua.gr/query-designer/53/
  5. Press “Run” to execute the query and see the results. The dependent variable (the variable whose behaviour you want to explain through the other variables) should be “OOP exp”. The dependent variable should always be the last column; if it is not, move it to the last position and save the query.
  6. Press “Analyze”
  7. You now see the analytics form, with the trained query 53 preselected.
  8. Select the “Regression Analysis” option from “Select an algorithm category” dropdown list.
  9. Select the “Linear Regression” option from “Select an algorithm” dropdown list.
  10. If you want extra information about the linear regression algorithm or the regression category, press the info (i) button next to the dropdown lists.
  11. Leave the default parameter options empty.
  12. Insert a brief description of the analytic process; use the info button to see what kind of information could be useful. If the field is left empty, it is automatically filled with: “Analytics for” + query description + “with algorithm” + algorithm name.
  13. Check the “same query as evaluation query” option.
  14. Select RDF output; you can keep N-Triples, which is the default.
  15. Press Submit and wait.
  16. Visualize the analytic output.
  17. Observe the Analytics model file.
  18. Government effectiveness seems to be highly correlated with out-of-pocket expenditures in a negative way (the coefficient -0.17770 is below 0).
  19. The result seems to be very accurate, as the p-value is 2.882e-05; the closer the p-value is to 0, the more accurate the analysis result.
  20. The observed variable OOP seems to be moderately correlated with the variables gdp_per_capita, pol_stability and Gov_effect, as the Multiple R-squared is 0.2247; the closer the Multiple R-squared is to 1, the higher the correlation. A sketch of how these statistics are computed follows this list.
  21. Answer the questions of the “My evaluation is:” questionnaire. This step is optional but important for collecting statistics about the data quality of the participating queries and the quality and reusability of the analytic output.
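The LinDA backend runs LinearRegression in R; the following is only a minimal Python sketch, assuming pandas and statsmodels, of how the statistics interpreted above (coefficients, p-value, Multiple R-squared) are computed. The column names mirror query 53, but all figures are made up.

 import pandas as pd
 import statsmodels.api as sm
 
 # Illustrative data shaped like query 53; the dependent variable is the last column.
 df = pd.DataFrame({
     "gdp_per_capita": [18500, 21700, 16200, 30100, 24800, 14900],
     "pol_stability":  [0.35, 0.62, 0.12, 0.88, 0.70, 0.05],
     "gov_effect":     [0.40, 0.71, 0.05, 0.95, 0.78, -0.10],
     "oop_exp":        [28.1, 22.4, 33.0, 15.2, 20.5, 35.8],
 })
 
 X = sm.add_constant(df[["gdp_per_capita", "pol_stability", "gov_effect"]])
 model = sm.OLS(df["oop_exp"], X).fit()
 
 print(model.params["gov_effect"])  # a negative coefficient means an inverse correlation
 print(model.f_pvalue)              # closer to 0 means a more significant fit
 print(model.rsquared)              # closer to 1 means more variance explained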

Example 2: Explore the data quality hints and reuse the analytic output

  1. From the LinDA main page click Analytics.
  2. Select the “Clustering Algorithms” option from “Select an algorithm category” dropdown list.
  3. Select the “KMeans” option from “Select an algorithm” dropdown list.
  4. At the parameters field put “->k 3”, which means that you wish to categorise the query data in 3 clusters (a clustering sketch follows this example).
  5. Insert a brief description of the analytic process; use the info button to see what kind of information could be useful. If the field is left empty, it is automatically filled with: “Analytics for” + query description + “with algorithm” + algorithm name.
  6. At the “Define specific sparql query” option, start typing “out-of-pocket-health-expenditure” and select the first suggestion from the autocomplete field (query with id 53).
  7. Click on the evaluation button. You can see in how many analytic processes the specific query has participated, which types of algorithms have been applied to it so far, and what the quality of the query data is according to the rest of the users.
  8. Select RDF output; you can keep N-Triples, which is the default.
  9. Press Submit and wait.
  10. Visualize the analytic output.
  11. Observe the Analytics CLUSTPLOT image.
  12. You can see three well defined, non-overlapping clusters.
  13. Observe the Analytics model image.
  14. You can observe that:
    1. The first cluster contains the countries with high GDP per capita, high political stability, high government effectiveness and low out-of-pocket expenditures.
    2. The second cluster contains the countries with low GDP per capita, low political stability, low government effectiveness and high out-of-pocket expenditures.
    3. The third cluster looks similar to the second, with the only difference that government effectiveness is less negative than in the second cluster, which leads to lower out-of-pocket expenditures. It seems that government effectiveness strongly affects OOP expenditures in an inverse way (the better the government effectiveness, the lower the OOP becomes). This result is in accordance with the results of Example 1.
  15. Now let's see in which cluster Greece belongs.
  16. Answer the questions of the “My evaluation is:” questionnaire. This step is optional but important for collecting statistics about the data quality of the participating queries and the quality and reusability of the analytic output.
  17. Check “Save as Linda Datasource”.
  18. The new datasource is saved to the LindaWorkbench RDF server.
  19. Based on the new datasource, the query with id 68, “Check in which cluster of OOP belongs Greece depending on gdp, gov_effectiveness and political stability”, has been created.
  20. Run & Edit query 68.
  21. Add the “Ellada” value as a filter to the “RefArea” field.
  22. Re-run the query. It turns out that Greece belongs to cluster2.
  23. Add the “cluster2” value as a filter to the “Cluster” field.
  24. Re-run the query. It turns out that Greece, Eesti, Latvia, Malta and Cyprus belong to the same cluster.
  25. You can continue the analysis on specific clusters instead of analysing on top of all EU28 countries. For example, if you add all the countries of cluster 2 (Greece, Bulgaria, Eesti, Malta, etc.) as filters to query 53 and re-evaluate the linear regression of Example 1, you get a higher Adjusted R-squared value of 0.5708, which reinforces the result of the KMeans algorithm and reveals that this group of countries shows similar economic behaviour. Of the above countries, only Bulgaria has proceeded with the OTC liberalisation in 2012, which led to a reduction of OOP expenditures. We can reasonably make the hypothesis that the same will happen with Greece in case a liberalisation is permitted.
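The backend runs KMeans through its R/Weka integration; the following is only a minimal sketch, assuming scikit-learn, of the clustering step requested with “->k 3”. The country rows and all figures are illustrative, not the query 53 data.

 import numpy as np
 from sklearn.cluster import KMeans
 
 countries = ["Greece", "Bulgaria", "Eesti", "Malta", "Germany", "Sweden"]
 # Columns: gdp_per_capita, pol_stability, gov_effect, oop_exp (scaled, illustrative)
 X = np.array([
     [0.45, 0.30, 0.35, 0.80],
     [0.30, 0.25, 0.20, 0.85],
     [0.50, 0.55, 0.60, 0.55],
     [0.55, 0.60, 0.65, 0.50],
     [0.90, 0.85, 0.95, 0.20],
     [0.95, 0.90, 0.97, 0.15],
 ])
 
 km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # k = 3, as in step 4
 for country, label in zip(countries, km.labels_):
     print(country, "-> cluster", label)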

Example 3: How to re-evaluate an analytics output

  1. Go to the previous analytic output with the KMeans algorithm.
  2. Go to the “Current Analytic Info” lateral column of the page.
  3. Click the edit icon on “Algorithm Parameters”.
  4. Pass new parameters: “->k 4”.
  5. Submit the new parameters.
  6. Click the information icon (i) on “Analytics input Query”.
  7. A query pop-up appears. Click on the Run & Edit Query link.
  8. This is another way to access queries within the analytic process. Click on the grey area to make the pop-up disappear.
  9. Check the Re-Evaluate option and wait.
  10. The clusters are now 4 instead of 3.

Note: For algorithms that need both a training and an evaluation data query, the user can check the option to re-evaluate while keeping the training model, in order to avoid a time-consuming, redundant retraining of the analytics model. A sketch of this caching idea follows.
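A minimal sketch of the idea behind keeping the trained model: cache the fitted model once and reuse it on re-evaluation. The file name and the train_fn callable are hypothetical, not part of the LinDA code base.

 import os
 import pickle
 
 MODEL_PATH = "analytics_model.pkl"  # hypothetical cache location
 
 def get_model(train_fn, training_data):
     """Load a cached model if one exists; otherwise train once and cache it."""
     if os.path.exists(MODEL_PATH):
         with open(MODEL_PATH, "rb") as f:
             return pickle.load(f)      # re-evaluation path: no redundant training
     model = train_fn(training_data)    # the time-consuming training step
     with open(MODEL_PATH, "wb") as f:
         pickle.dump(model, f)
     return model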

Example 4: How to search within user analytics

  1. From the LinDA main page click Analytics.
  2. At the right side column you can see your analytic processes so far.
  3. Go to the “Search My Analytics” field.
  4. You can search by algorithm or type a word included in the analytic process description.
  5. Type “linear”.
  6. You should see all processes that have run the linear regression algorithm.
  7. Hover over the results to see the full description of each process.
  8. Clear the filter and press Enter to see the full list of analytic processes.

Example 5: Explore analytics & performance statistics

  1. From the LinDA main page click Analytics.
  2. Click on the “View Analytic Statistics” button.
  3. You can observe three types of statistical info:
    1. Datasources & queries participation in analytic processes (click on a datasource).
    2. Algorithms participation in analytic processes (click on an algorithm).
    3. How analytic processes are created and updated over time.


Motivation

Data analytics is concerned with the organisation and analysis of data sets for the discovery of patterns, the classification of the available data, the realisation of forecasting and trends analysis, etc. It has the potential to help SMEs identify the data that is most important to current and future business decisions, provide insights based on the analysis, answer specific business questions and facilitate/guide the decision-making process. However, in order to exploit the full power that analytics may provide to businesses, challenges related to the proper handling of data from heterogeneous data sources and formats need to be addressed (a set of challenges for the production of Linked Data analytics is available in ). In order to produce meaningful analytics, the context available in the input datasets has to contain data that may lead to the extraction of advanced information.

Taking into account the existing challenges, as well as the functionalities provided by the rest of the LinDA components, the LinDA Analytics and Data Mining component has been designed and developed. This component aims at providing user-friendly interfaces to end users for the realisation of analyses, through the support of a wide set of algorithms and based on Linked Data input.

Overview

A library of basic and robust data analytic functionality is provided, enabling SMEs to utilise and share analytic methods on Linked Data for the discovery and communication of meaningful new patterns that were unattainable or hidden in the previous isolated data structures. High importance is given to the design and deployment of tools with emphasis on their user-friendliness. Thus, the approach followed regards the design and deployment of workflows for algorithm execution based on their categorisation. An indicative categorisation (see the illustrative mapping after this list) includes:

  • Classifiers, for identifying to which of a set of categories a new observation belongs, based on a training set.
  • Clusterers (unsupervised learning), for grouping a set of objects in such a way that objects in the same group are more similar to each other.
  • Statistical and Forecasting Analysis, for discovering interesting relations between variables and providing information regarding future trends.
  • Attribute Selection algorithms (evaluators and search methods), for selecting a subset of relevant features for use in model construction, based on evaluation metrics.
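As a rough illustration, the categorisation can be thought of as a plain mapping from categories to the algorithms listed under Features; the exact grouping used by the tool may differ from this sketch.

 # Illustrative only: the tool's own category/algorithm grouping may differ.
 ALGORITHM_CATEGORIES = {
     "Classifiers": ["J48", "M5P"],
     "Clusterers": ["KMeans", "Ward Hierarchical Agglomerative", "Model Based Clustering", "ClustersNumber"],
     "Statistical and Forecasting Analysis": ["LinearRegression", "Arima", "Morans I", "Kriging", "NCF correlogram", "Apriori"],
     "Attribute Selection": [],  # evaluators and search methods
 }
 
 for category, algorithms in ALGORITHM_CATEGORIES.items():
     print(category, "->", ", ".join(algorithms) or "(none listed)")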

Components and Functionalities

The LinDA Analytics and Data Mining Ecosystem supports functionalities for the selection of the desired dataset from the LinDA data sources, the creation of a specific dataset through the SPARQL querying mechanisms along with the RDF-Any transformation components, the selection and configuration of the proper analytics extraction algorithm, the realisation of the analysis, the collection of the output in the selected supported format, and the interconnection of the output dataset with both the input dataset used for the analysis and the analytics process specified. Authorisation schemes, as well as tracking of the activities (analyses realised, along with the corresponding input and output), are also supported.

The LinDA Analytics and Data Mining Ecosystem consists of the following components (see Figure 4; a minimal pipeline sketch follows the figure):

  1. Dataset Selection and Processing Component: this component is responsible for providing end-user interfaces for the selection and processing of the dataset to be used in the analysis. It actually provides appropriate interconnection interfaces to other LinDA components, such as the LinDA Data Sources and the Publication and Consumption Framework.
  2. Analytics Algorithm Selection and Configuration Component: this component coordinates the process of suggesting and configuring an analytics extraction process. Based on the category of the selected algorithm, a specific workflow is followed by the end user, with guidance on the appropriate setup of the algorithm parameters.
  3. Analytics Algorithm Execution Component: this component realises the implementation of the analytics process. Interconnections with external tools that support the execution of various sets of algorithms are supported within the component, while the overall process is transparent to the end user.
  4. Output Handling Component: this component is responsible for providing the output of the analysis in the appropriate format in order to be meaningful for the end user, as well as easily exploitable by other LinDA tools (e.g. LinDA Visualisation tool). Various output formats are supported based on the requirements imposed by each algorithm. In case of RDF output, interconnection of the output dataset to the input dataset and the executed analytics process is provided.
A 1.png
Figure: Components view of the LinDA Analytics and Data Mining Ecosystem
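The four components chain naturally into a pipeline. Below is a minimal, runnable sketch of that flow; every function is a hypothetical stub standing in for the corresponding component, not actual LinDA code.

 def select_dataset(query):
     # 1. Dataset Selection and Processing: run the SPARQL query (stubbed)
     return [{"gdp_per_capita": 18500, "oop_exp": 28.1}]
 
 def configure_algorithm(name, params):
     # 2. Algorithm Selection and Configuration: merge user params with defaults
     return {"algorithm": name, **params}
 
 def execute(config, dataset):
     # 3. Algorithm Execution: delegate to the Weka/R backend (stubbed)
     return {"config": config, "rows": len(dataset)}
 
 def format_output(result, output_format):
     # 4. Output Handling: serialise in the requested format, e.g. RDF
     return f"{output_format}: {result}"
 
 dataset = select_dataset("SELECT ...")
 config = configure_algorithm("LinearRegression", {})
 print(format_output(execute(config, dataset), "N-Triples"))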


In order to support Linked Data analytics, an ontology has been designed (Figure 5), based on a set of existing ontologies and specific extensions, targeting the interconnection of the output dataset to the input dataset or datasets, as well as the description of the analytic process followed in each case. Specifically, concepts from the Friend Of A Friend (FOAF) ontology, the PROV ontology and the Semanticscience Integrated Ontology (SIO) are used in the definition of the LinDA Analytics Ontology. Each analytic process realised in the LinDA Analytics and Data Mining Ecosystem is represented based on the defined ontology, saved, and remains accessible for further use in the future. Furthermore, in case the re-execution of an analytic process is required, the end user is able either to re-execute the algorithm based on direct access to the input sources, or to re-execute the query for the preparation of the input dataset and then execute the appropriate algorithm. A sketch of this interconnection, using generic PROV terms, follows the figure below.


A 2.png
Figure: LinDA Analytics and Data Mining Ecosystem Ontology
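As a minimal sketch of the interconnection idea, assuming the rdflib Python library, the triples below use generic PROV terms; the actual LinDA Analytics Ontology terms (and its FOAF and SIO parts) may differ.

 from rdflib import Graph, Namespace, Literal
 from rdflib.namespace import RDF
 
 PROV = Namespace("http://www.w3.org/ns/prov#")
 EX = Namespace("http://example.org/linda/")  # hypothetical namespace
 
 g = Graph()
 g.bind("prov", PROV)
 process = EX.analyticProcess1
 g.add((process, RDF.type, PROV.Activity))                # the analytic process
 g.add((process, PROV.used, EX.inputDataset))             # link to the input dataset
 g.add((EX.outputDataset, PROV.wasGeneratedBy, process))  # link from the output dataset
 g.add((process, EX.algorithm, Literal("KMeans")))        # description of the process
 print(g.serialize(format="turtle"))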


The output produced by the analytic process is in RDF format and is automatically interconnected with the input data sources. Thus, end users have access to the analytic processes executed in the past, along with the input and output files used or produced. This functionality is considered critical in cases where algorithms/processes have to be re-evaluated in the examined business scenarios because the input data have been updated. The produced output may also be directed to the LinDA Visualisation tool for the production of meaningful graphs by the end user. The interconnection interfaces of the LinDA Analytics and Data Mining Ecosystem with the rest of the LinDA components are shown in Figure 6.

A 3.png
Figure: Interconnection of the LinDA Analytics and Data Mining Ecosystem with the rest of the LinDA components

Technical Details

The front-end components are based on the Django Python web framework, while the Analytics and Data Mining Ecosystem is developed as a Django module, ready to be plugged into any Django installation. Python, and more specifically the requests HTTP library and the urllib/urllib2 modules from the Python standard library, is used for this purpose. The produced module is a RESTful API consumer of a Java RESTful server that is responsible for realising the analytics and data mining part. The Java-based RESTful web services are developed with RESTEasy, an implementation of the JAX-RS specification, and are deployed using a JBoss server. A sketch of such a REST call follows.
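A minimal sketch of how a Django module can consume the Java RESTful server with the requests library; the endpoint URL and the payload fields are hypothetical, not the actual LinDA API.

 import requests
 
 # Hypothetical payload mirroring the analytics form (see the Usage section).
 payload = {
     "algorithm": "LinearRegression",
     "parameters": "",
     "trainQuery": 53,
     "outputFormat": "N-Triples",
 }
 
 # Hypothetical endpoint on the JBoss-hosted RESTEasy server.
 response = requests.post("http://localhost:8080/analytics/run", json=payload, timeout=300)
 response.raise_for_status()
 print(response.text)  # e.g. the RDF output of the analytic process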

The Analytics and Data Mining tool is based on existing open-source platforms for extracting data analytics. Specifically, the Weka open-source tool constitutes the basis for the execution of the algorithms, while specific component extensions and interfaces are developed for the proper interconnection with the rest of the LinDA tools, as well as for the simplification of the overall process. In addition to Weka, further open-source tools, such as the R project for statistical computing, are supported.

Limitations

As stated earlier, the design and deployment of the analytics and data mining component is realised with user-friendliness towards SME employees as its main objective. However, even if specific workflows are designed per algorithm category to hide part of the complexity, detailed knowledge of each algorithm's functionality and configuration parameters is still required. Since, in most cases, a set of analyses has to be realised by also experimenting with the parameters, the end user is responsible for properly configuring the considered parameters per algorithm according to his needs.

Furthermore, the quality of the input Linked Data is another factor that plays an important role during the analysis. As mentioned in the challenges for producing Linked Data analytics, the existence of meaningful and well-interconnected input data is of major importance.

Conclusion and Future Work

Summarising, we could argue that, at the current implementation phase, all the desired functionalities are implemented and validated via a set of supported algorithms and the use of testing datasets. In the upcoming period, the integration of further algorithms into the analytics and data mining component is envisaged, focusing mainly on the requirements imposed by the analytics pilots within LinDA. By the term integration, we refer to the design and deployment of user-friendly interfaces for the algorithm's configuration and execution parts, as well as the proposal of default values (in case the end user is not aware of the algorithm's configuration parameters).

Furthermore, given that the input data sources usually provide different ways of referring to the base URL of each record, as well as the need to interconnect the output datasets with the input datasets, we plan to deploy a component that homogenises the base URL description and provides a unique reference to the input data sources in a standardised way for all cases.


Usage

A set of screenshots depicting the process for the realisation of an analysis is presented below, along with a short description of each step. In Figure 7, the introductory page of the LinDA analytics process is depicted. The end user has to select the algorithm category and the algorithm to be executed, provide input for the configuration of the algorithm's parameters (default values are loaded automatically), provide a description of the analytics process to be realised (for further use in the future), and select the SPARQL queries (based on a list of available queries; see Figure 10) to be used for the preparation of the input datasets, or directly load a file in CSV or ARFF format. Autocomplete functionalities are supported, facilitating the selection of the appropriate query: in case of choosing a predefined query, the user just types some of the words or the name of the query and it appears as part of an autocomplete list. The end user also selects the desired output format (RDF/XML, N-Triples, Turtle, CSV, etc.) and proceeds to the analysis execution part. At the right side bar, a list of the analytic processes already executed is available and can be loaded. At various parts, information buttons provide further descriptions and tips to the end user (Figure 8 and Figure 9 show an indicative algorithm description and analytics process description, respectively).

A 4.png

Figure: Initiation of an analytics process


A 5.png

Figure: Algorithm Description

A 6.png

Figure: Analytics Process Description


A 7.png

Figure: Selection of Datasource Queries


A 8.png

Figure: Analytic Output Result Information


In Figure 11, the result of the analytic process is depicted. At the upper side, the RDF output is provided, which can be added to the existing LindaWorkbench datasources. Furthermore, the user can see the model of the chosen algorithm, in case it exists. At the right part, the user can see more info about the process. By pressing the info button of the selected queries, the user may also see the SPARQL query and download the query result in CSV format. This is depicted in Figure 12.


A 9.png

Figure: Analytics Train Query Info