The Toolchain used in the Mapping Development Process

The mapping artefacts, including the various mapping suite packages developed in the process described in this document, are stored and maintained in the OP-TED/ted-rdf-mapping GitHub repository.

To assist the semantic engineers in the development of mapping suites, an internal toolchain has been developed. It consists of a number of command-line tools (CLIs) that can be run on the mapping suites to produce various outputs.

The following CLIs are available (the names of the tools as executed on the command line are given in parentheses).

Resources Injector (resources_injector)

This CLI injects the requested resources, listed on the "Resources" spreadsheet of the Conceptual Mappings, into the mapping suite. Each form has a resources list that represents the controlled values that are needed in the mapping process.

RML Modules Injector (rml_modules_injector)

This CLI injects the technical mapping modules into the mapping suite. Each form has a list of modules that are needed in order to run the mapping_runner. The module names are listed on the "RML_Modules" spreadsheet of the Conceptual Mappings.

SPARQL Test Generator (sparql_generator)

This CLI generates, from the conceptual mapping, a set of SPARQL queries that will be executed by the sparql_runner CLI (described in the next section). Each generated query can be used to test whether the related conceptual mapping correctly generates RDF data.
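For illustration, such a generated query is typically a SPARQL ASK query that checks whether the expected triple pattern is present in the produced RDF. The sketch below is hypothetical: the epo: prefix points at the eProcurement Ontology namespace, but the property path shown is an assumption, not taken from an actual mapping suite.

```sparql
PREFIX epo: <http://data.europa.eu/a4g/ontology#>

# Hypothetical test: does the output contain a notice with a publication date?
ASK WHERE {
  ?notice a epo:Notice ;
          epo:hasPublicationDate ?date .
}
```

If the query returns "True", the corresponding mapping rule is assumed to have produced the expected graph pattern for this notice.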

SPARQL Queries Runner (sparql_runner)

This CLI executes all the SPARQL queries generated by sparql_generator and stores the results in a separate file. The result file is a report, per RDF file, that contains the queries and their associated success indicators (Valid, Invalid, Unverifiable, Warning or Error).

Each indicator below helps semantic engineers and reviewers understand the result:

  • Valid: The XPATH to which the query is associated was found in the XML notice, and the SPARQL query returned "True".

This is the ideal case.

  • Invalid: The XPATH to which the query is associated was found in the XML notice, and the SPARQL query returned "False".

This can occur for various reasons. It could be an error in the mapping rule (which needs to be fixed in the technical mapping), or in the generated SPARQL query (which should be fixed by updating the conceptual mapping).

This indicator can also be a false negative, where the validation and reporting tool is not able to generate a more complex SPARQL query that captures special cases, e.g. when there is an XPATH condition, or when the query is executed in situations where it is not applicable.

  • Unverifiable: The XPATH to which the query is associated was NOT found in the XML notice, and the SPARQL query returned "False".

This is not an error but an expected False result. In this situation, the SPARQL query cannot verify the validity of the mapping on the current input data, as it is not applicable to it.

  • Warning: The XPATH to which the query is associated was NOT found in the XML notice, but the SPARQL query returned "True".

This might be due to an error in the mapping or the query, but in most cases it indicates that the SPARQL query is "incomplete", i.e. too localised, and does not capture the full context in which a specific graph pattern should be matched. Instead, it matches some valid property paths in the output that were created from other XPATHs with a similar ending.

  • Error: The SPARQL query is incorrect. That means that the conceptual mapping has to be amended.
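Setting aside the Error case, the four regular indicators can be summarised as a function of two facts: whether the XPATH is present in the notice, and what the SPARQL query returned. A minimal sketch in Python (the function name is illustrative; the actual sparql_runner implementation may differ):

```python
def classify(xpath_found: bool, query_result: bool) -> str:
    """Map the (XPATH present, SPARQL result) pair to a success indicator.

    Illustrative only: sketches the indicator logic described above.
    """
    if xpath_found and query_result:
        return "Valid"         # mapping fired and produced the expected RDF
    if xpath_found and not query_result:
        return "Invalid"       # mapping should have fired but the RDF is missing or wrong
    if not xpath_found and not query_result:
        return "Unverifiable"  # query is not applicable to this notice
    return "Warning"           # RDF matched although the source XPATH is absent
```

This makes explicit that Unverifiable is the expected outcome when the source data simply does not contain the relevant element, while Warning flags a match that had no corresponding source element.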

Mapping Test Runner (mapping_runner)

This CLI applies the mapping to a certain test notice file, to a batch of notice files (organised in a folder), or to all available test notices, and generates output files representing the corresponding RDF graph for each notice (see RDF output examples).

Metadata generator (metadata_generator)

This tool extracts the relevant metadata from the conceptual mapping file (by default: conceptual_mappings.xlsx) and stores it in a JSON file (by default: metadata.json). This metadata file will be used by various other processes (both CLIs and DAGs), mainly to inform them about the applicability of this mapping to various notices.
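To convey the idea of applicability metadata, the generated metadata.json could look roughly like the following. All field names and values here are hypothetical, shown only for illustration; the actual schema is defined by the toolchain.

```json
{
  "title": "Example mapping suite",
  "version": "1.0.0",
  "notice_subtypes": ["16", "29"],
  "start_date": "2020-01-01",
  "end_date": null
}
```

A downstream process (CLI or DAG) could read such a file to decide whether this mapping suite applies to a given notice before running it.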

YARRRML to RML Converter (yarrrml2rml_converter)

This command-line tool converts a mapping expressed in the more user-friendly YARRRML syntax to RML. It is a very useful tool, especially in the initial phases of mapping development, or for newcomers, as it is easier and faster to write YARRRML rules than RML rules. This tool is no longer used in the current development process because, for technical reasons, the mappings are developed directly in RML, not in YARRRML.
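To give a flavour of the syntax difference, a minimal YARRRML rule might look like the sketch below. The source file, XPath, and fields are invented for illustration; the equivalent RML would require several interlinked TriplesMap resources in Turtle.

```yaml
prefixes:
  ex: http://example.org/

mappings:
  notice:
    sources:
      - ['notice.xml~xpath', '/NOTICE']
    s: http://example.org/notice/$(ID)
    po:
      - [ex:title, $(TITLE)]
```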

XPATH Coverage Runner (xpath_coverage_runner)

This CLI generates reports describing the XPATH coverage of the notices.

SHACL Validation Runner (shacl_runner)

This CLI generates SHACL validation reports for RDF files.
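For context, a SHACL validation report is itself RDF, expressed with the standard SHACL vocabulary. A report for data that conforms to all shapes reduces to the following (standard sh: terms, shown here in Turtle):

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .

[] a sh:ValidationReport ;
   sh:conforms true .
```

When violations are found, sh:conforms is false and the report additionally lists sh:result entries describing each violation.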

Mapping Suite Processor (mapping_suite_processor)

This CLI runs all the CLIs mentioned above in a logical order to fully process a mapping suite, starting with the generation of metadata and finishing with running the mapping on all the (specified) test data and generating all the associated validation artefacts. It can be run for a certain package, for a (set of) notice(s), or for groups of commands.

Other relevant tools and libraries

Other relevant tools used in the mapping process that are worth mentioning are:

Matey: a browser-based application that helps with writing YARRRML rules and converting them to RML rules, which can also be executed online. Matey uses the yarrrml-parser (described next) in the backend.

The RMLio/yarrrml-parser library, available on GitHub, converts YARRRML rules to RML or R2RML rules. Since this is a library, besides powering Matey, it can also be used independently or as an integrated part of our CLI tools.

The RMLio/rmlmapper-java library, available on GitHub, allows the execution of a set of RML mappings on a set of data sources, to generate high quality RDF data.

