Toolchain used in the mapping development process
The mapping artefacts, including the various mapping suite packages that are being developed by the process described in this document, are stored and maintained in the OP-TED/ted-rdf-mapping GitHub repository.
To assist semantic engineers in developing mapping suites, a toolchain has been developed. A number of command line tools (CLIs) are available in the OP-TED/ted-rdf-conversion-pipeline GitHub repository that can be run on these mapping suites to process them in various ways. To run these CLIs, the ted-sws project needs to be installed in the rdf-mapping environment. This can be done by following the installation instructions provided here. The documentation for the usage of these CLI tools can be found here; below we provide more details about the most relevant ones.
After the installation, the following CLIs will be available (the names of the tools to be executed on the command line are specified in parentheses).
This CLI injects the requested resources listed on the "Resources" spreadsheet of the Conceptual Mappings into the mapping suite. Each form has a resource list that represents the controlled values needed in the mapping process.
Consult the authority tables used in the EPO available from the EU Vocabularies.
This CLI injects the technical mapping modules from the src/mappings folder (see modules chapter) into the mapping suite. Each form has a list of modules that are needed in order to run the mapping_runner.
The module names are listed on the "RML_Modules" spreadsheet of the Conceptual Mappings.
This CLI generates a set of SPARQL queries from the conceptual mapping that will be executed by the sparql_runner CLI (described in the next section). Each generated query will be used to test if the related conceptual mapping is correctly generating RDF data or not.
This CLI executes all the SPARQL queries generated by sparql_generator against each RDF result file. The result of this execution is one report per RDF file, in which each query is associated with an indicator (Valid, Invalid, Unverifiable, Warning or Error).
Each indicator gives semantic engineers and reviewers different information, as shown below:
Valid: The XPATH to which the query is associated was found in the XML notice and the SPARQL query returned True. This is the ideal case.
Invalid: The XPATH to which the query is associated was found in the XML notice and the SPARQL query returned False. This can occur for various reasons. It could be a problem in the mapping rule (which needs to be fixed in the technical mapping), or in the generated SPARQL query (which could be fixed by updating the conceptual mapping). It could also be a "false alarm": when the validation and reporting tool is not capable of generating the more advanced SPARQL query that captures the special cases in which the query should be executed (e.g. if there is an XPATH condition), the query is executed even in situations where it is not necessary.
Unverifiable: The XPATH to which the query is associated was NOT found in the XML notice and the SPARQL query returned False. This is not a problem, per se. It is an expected False result. In such a situation the given SPARQL query can’t help us validate the correctness of the mapping on this input data, as it is not applicable to it.
Warning: The XPATH to which the query is associated was NOT found in the XML notice but the SPARQL query returned True. This might be due to an error in the mapping or the query, but in most cases it is due to the SPARQL query being "incomplete", i.e. too localized: it does not capture the full context in which a certain graph pattern should be matched, so it matches valid property paths in the output that were created from other XPATHs (for example, XPATHs with a similar ending).
Error: The SPARQL query is incorrect. That means that the conceptual mapping has to be reviewed.
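The five indicators above amount to a small decision table over two facts: whether the XPATH was found in the XML notice, and what the SPARQL query returned. The following Python sketch illustrates that table only. It is not the sparql_runner implementation; the names Indicator and classify are hypothetical, and representing an incorrect query as a result of None is an assumption made for this example.

```python
from enum import Enum
from typing import Optional


class Indicator(Enum):
    """The five indicators reported for each query (names taken from the text)."""
    VALID = "Valid"
    INVALID = "Invalid"
    UNVERIFIABLE = "Unverifiable"
    WARNING = "Warning"
    ERROR = "Error"


def classify(xpath_found: bool, query_result: Optional[bool]) -> Indicator:
    """Map (XPATH found?, query result) to an indicator.

    Assumption: query_result is None when the SPARQL query is incorrect
    and could not be evaluated; otherwise it is the True/False result.
    """
    if query_result is None:
        return Indicator.ERROR
    if xpath_found:
        # XPATH present in the notice: the query is expected to return True.
        return Indicator.VALID if query_result else Indicator.INVALID
    # XPATH absent from the notice: False is expected, True is suspicious.
    return Indicator.WARNING if query_result else Indicator.UNVERIFIABLE


if __name__ == "__main__":
    for found in (True, False):
        for result in (True, False, None):
            print(found, result, classify(found, result).value)
```

In this sketch the Error case is checked first, because an incorrect query says nothing about the mapping regardless of whether the XPATH is present.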
This CLI applies the mapping on a certain test notice file, a batch of notice files (organised in a folder), or on all available test notices, and generates output files representing the corresponding RDF graph for each notice (see RDF output examples here).
This tool extracts the relevant metadata from the conceptual mapping file (by default: conceptual_mappings.xlsx) and stores it in a JSON file (by default: metadata.json). This metadata file will be used by various other processes (both CLIs and DAGs), mainly to inform them about the applicability of this mapping to various notices.
This command line tool converts a mapping expressed in the more user-friendly YARRRML syntax to RML. It is a very useful tool, especially in the initial phases of mapping development or for newcomers, as it is easier and faster to write YARRRML rules than RML rules. This tool is no longer used in the current development process because, for technical reasons, we develop the mappings directly in RML, not in YARRRML.
Generates reports describing XPATH coverage of the notices.
Generates SHACL Validation Reports for RDF files.
This CLI runs all the necessary CLIs mentioned above in a logical order to fully process a mapping suite, starting with the generation of metadata, continuing up to running the mapping on all the (specified) test data, and generating all the possible associated validation artefacts. It can be run for a certain package, (set of) notice(s), or groups of commands.
Other relevant tools used in the mapping process that are worth mentioning are:
Matey: a browser-based application that helps with writing YARRRML rules and converting them to RML rules, which can also be executed online. Matey uses the yarrrml-parser (described next) in the backend.
The RMLio/yarrrml-parser library, available on GitHub, allows the conversion of YARRRML rules to RML or R2RML rules. Since this is a library, besides powering Matey, it can also be used independently, or as an integrated part of our CLI tools.
The RMLio/rmlmapper-java library, available on GitHub, allows the execution of a set of RML mappings on a set of data sources, to generate high quality RDF data.