Mapping suite package structure

This section describes the structure of a “mapping suite package” in GitHub. Such a package contains everything needed to develop and test a given “mapping suite” that applies to a certain set of notices. Once the package is finalised, a process can apply it to a large number of notices stored in a database, transforming those notices into RDF data.

A package is represented by a well-defined folder structure containing certain files. This folder structure is repeated for every developed mapping. Initially the packages are organised per form number, but this may evolve.

The structure of the package changes through the different phases of the mapping development process. Below we describe how such a package looks in three phases of the mapping development.

Mapping suite package description for Semantic Engineers

In the initial phase, when the Semantic Engineers start working on a new mapping suite, they set up a package folder structure similar to the one described below and work on (or with) the files it contains.

Assumption: Regarding the naming and organisation of the various mapping suites, one package per form number is assumed to be THE way to organise these packages.

Challenge: Are there better ways to deal with sections (or sub-sections) that repeat across multiple forms? Consider Section I, for example, which in forms F03, F06 and F25 contains “almost” the same information; ideally only one mapping would be written for it and reused in the “final” form mapping packages. This problem is also discussed in a dedicated section below.

An example mapping package folder structure is presented below:

/package_Fxx
	/transformation
		conceptual_mappings.xlsx
		/mappings
			*.rml.ttl
		/resources
			*.json, *.xml, *.csv
	/test_data
		*.xml
  • /package_Fxx root folder of the mapping suite

  • /transformation/conceptual_mappings.xlsx manually created (from the Google Sheet template described here)

  • /transformation/resources additional resources possibly needed by the transformation rules;
    The content of this folder should be automatically generated by the mapping package processor, based on the "Resources" sheet of the conceptual_mappings.xlsx, from the "source of truth" ted-rdf-conversion-pipeline/ted-sws/resources.

  • /transformation/mappings/*.rml.ttl the relevant RML transformation rules, organised in module files, which are copied from the "source" mappings folder according to the information specified in the "RML Modules" sheet of the conceptual_mappings.xlsx. Important: in these rules the source XML always refers to data/source.xml, which corresponds to the ../../data/source.xml file that is copied (and renamed) from the test_data folder when the mapping is executed.

  • /test_data manually and carefully selected test data, possibly grouped in subfolders, e.g. /test_data/batch-D1/*.xml

  • technical_mappings.yarrrml.yaml (optional) manually created, and used in earlier days of the mapping development, but currently not used
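The layout described above can be checked mechanically before a package is handed over. The following is a minimal sketch, not part of the actual tooling: the function name `check_initial_package` is hypothetical, and it only checks the three items the Semantic Engineers provide by hand or by copying (the conceptual mappings workbook, at least one RML module, and at least one test notice).

```python
from pathlib import Path


def check_initial_package(package_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means the layout looks complete.

    Hypothetical helper for illustration only, not part of the real tooling.
    """
    problems = []
    transformation = package_dir / "transformation"

    # The conceptual mappings workbook is created manually.
    if not (transformation / "conceptual_mappings.xlsx").is_file():
        problems.append("missing transformation/conceptual_mappings.xlsx")

    # At least one RML module file is expected in the mappings folder.
    mappings = transformation / "mappings"
    if not mappings.is_dir() or not any(mappings.glob("*.rml.ttl")):
        problems.append("no *.rml.ttl files in transformation/mappings")

    # Test data may be grouped in subfolders, so search recursively.
    test_data = package_dir / "test_data"
    if not test_data.is_dir() or not any(test_data.rglob("*.xml")):
        problems.append("no *.xml files under test_data")

    return problems
```

Calling `check_initial_package(Path("package_Fxx"))` returns the list of missing pieces, which makes it easy to print them or to fail a build when the list is not empty.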

Mapping suite package description for the Software Engineers

A package provided by the semantic engineers (SE) is enriched with additional artefacts. These are generated automatically by the package expanding tools, which take the artefacts provided by the SE as input. Some examples of the generated artefacts are:

  • Metadata describing the parameters for selecting the notices that the mappings can be applied to, various version information, etc.

  • SPARQL queries that can be used to validate and/or test the generated outputs

  • SHACL shapes that can be used to validate the structure of the generated outputs

  • Further artefacts may be added; this list reflects the state at the time of writing this document

After the package processing/expansion, the structure of the example mapping package presented in the previous subsection would look like this:

/package_Fxx
	metadata.json
	/transformation
		conceptual_mappings.xlsx
		/mappings
			*.rml.ttl
		/resources
			*.json, *.xml, *.csv
	/data
		source.xml
	/output
		*.rdf
	/validation
		/sparql
			/cm_assertions
				*.rq
		/shacl # this is a constant, when we know what the SHACL is (currently unknown)
			*.shacl.ttl # data shape file(s)
	/test_data # manually and carefully selected test data
		*.xml
  • metadata.json automatically generated from the Metadata sheet of conceptual_mappings.xlsx

  • /data this is a placeholder created at runtime to process the inputs. It serves only when the mapping suite is being tested, or executed by some script.

  • source.xml this file is generated during runtime by copying a given test data file

  • /output this is a placeholder created at runtime to store outputs. It serves only when the mapping suite is being tested, or executed by some script.

  • /validation/sparql/cm_assertions SPARQL queries automatically generated from the conceptual mapping
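The runtime role of /data and source.xml can be illustrated with a short sketch. It assumes nothing about how the real pipeline does this step; the function name `stage_source` is hypothetical. It copies one selected test notice to data/source.xml, the fixed location that the RML rules refer to.

```python
import shutil
from pathlib import Path


def stage_source(package_dir: Path, notice_file: Path) -> Path:
    """Copy a test notice to <package>/data/source.xml and return the new path.

    Hypothetical helper for illustration only, not the pipeline's own code.
    """
    data_dir = package_dir / "data"
    # /data is a runtime placeholder, so create it on demand.
    data_dir.mkdir(parents=True, exist_ok=True)

    target = data_dir / "source.xml"
    # The copy overwrites any source.xml left over from a previous run.
    shutil.copyfile(notice_file, target)
    return target
```

Because the target name is fixed, each notice has to be staged and transformed in turn; the previous source.xml is replaced on every call.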

Mapping suite package description for the Semantic Engineers after the expansion

After the “execution” of a mapping, the mapping package will be further enriched, and will contain additional files, as a result of running the mapping suite on the included test data.

/package_Fxx
	metadata.json
	/transformation
		conceptual_mappings.xlsx
		/mappings
			*.rml.ttl
		/resources
			*.json, *.xml, *.csv
	/data
		source.xml
	/output
		/<notice_file1>
			<notice_file1>.ttl
			/test_suite_report
				*.ttl, *.html, *.json # e.g. sparql_cm_assertions.html, shacl_epo.html, xml_coverage.html
		/<notice_file2>
			...
		/<notice_file3>
			...
	/validation
		/sparql
			/cm_assertions
				*.rq
		/shacl
			/epo
				ePO_shacl_shapes.rdf
			shacl_result_query.rq
	/test_data
		<notice_file1>.xml
		<notice_file2>.xml
		<notice_file3>.xml
		*.xml
  • /output/<notice_file1> for each example file we create a folder that will contain all the generated artefacts for that sample file

  • /output/test_suite_report validation reports summarising all individual reports

  • /output/<notice_file1>/<notice_file1>.ttl the output of the transformation
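The naming convention for the per-notice output folders can be sketched as follows. This is only an illustration derived from the folder tree above; the function name `output_paths` is hypothetical, and it assumes that <notice_file1> is the test file name without its .xml extension.

```python
from pathlib import Path


def output_paths(package_dir: Path, notice_file: Path) -> tuple[Path, Path]:
    """Return (RDF output file, report folder) for one test notice.

    Hypothetical helper; assumes the folder name is the notice file
    name without its extension.
    """
    notice_name = notice_file.stem
    notice_dir = package_dir / "output" / notice_name
    return notice_dir / f"{notice_name}.ttl", notice_dir / "test_suite_report"


# Example: where the artefacts for test_data/notice1.xml would be placed.
rdf_file, report_dir = output_paths(
    Path("package_Fxx"), Path("package_Fxx/test_data/notice1.xml")
)
```

Keeping all artefacts for one notice in a single folder means a failing notice can be inspected, or its results deleted and regenerated, without touching the results of the other notices.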