Preparing Test Data

Representative sample data selection

This section describes TED notice data samples and methods used to generate them. At first a sampling is performed on the notices from 2021, and then on a wider set.

Sample TED notices from 2021

This data sample (test_data/sampling_2021) contains carefully selected TED notices based on the following criteria: maximise representativeness, minimise the number of selected documents. The selected notices are guaranteed to cover all possible XPath configurations available in the data. The sampling was performed automatically.

  • The current sampling exercise was performed on all records collected in 2021;

  • The data space was partitioned twice based on two criteria: the notice form number (e.g. F03, F06) and the eForms Subtype (e.g. cn-standard, can-utilities);

  • The minimal set of the most representative notices have been selected for each form_number and eforms_subtype .

Sampling result statistics

Below are two tables with statistics indicating the total number for each selection criteria, the percentage relative to the total number of considered notices (i.e. total in 2021), and the number of selected most representative notices.

Sampling based on form_number:

form_number nr. of notices percentage of total number of notices nr. of the most representative notices

F03

249618

37,35

20

F02

223430

33,43

18

F14

77564

11,6

11

F20

30244

4,52

7

F06

27746

4,15

21

F05

22452

3,36

29

F01

11483

1,72

23

F15

7788

1,17

43

F21

5844

0,87

20

F18

2035

0,3

25

F08

2034

0,3

10

F24

1764

0,26

19

F17

1680

0,25

27

F12

1546

0,23

16

F25

1115

0,17

19

F13

957

0,14

13

F04

932

0,14

20

F22

93

0,01

13

F16

56

0,01

11

F23

0

0

0

F07

0

0

0

F19

0

0

0

Total

668381

100%

365

Sampling based on eforms_subtype:

eforms_subtype nr. of notices percentage of total number of notices nr. of most representative notices

29

249618

37,35

19

16

223430

33,43

19

41

77564

11,6

11

30

27746

4,15

21

17

22452

3,36

27

38

19151

2,87

8

39

10797

1,62

7

4

7177

1,07

10

25

7006

1,05

17

20

5771

0,86

18

7

3872

0,58

21

31

2035

0,3

24

1

1839

0,28

8

19

1764

0,26

19

18

1680

0,25

26

23

1502

0,22

15

32

1115

0,17

19

36

930

0,14

13

26

499

0,07

15

5

460

0,07

11

10

434

0,06

17

8

411

0,06

14

40

296

0,04

6

27

235

0,04

17

2

157

0,02

4

21

93

0,01

13

12

73

0,01

5

11

61

0,01

12

6

56

0,01

11

28

48

0,01

8

24

44

0,01

10

3

38

0,01

5

37

27

0

4

13

0

0

0

15

0

0

0

22

0

0

0

33

0

0

0

34

0

0

0

35

0

0

0

Total

668381

100%

454

Sampling of TED notices based on XSD versions

A separate sampling on all the TED notices published between 2014 and 2022 was also made. The samples were split up based on the various XSD versions according to which they were created. These samples can be found in the following folder structure

test_data
   sampling_2014_2022
      R2.0.9.S01.E01            (702 notices)
      R2.0.9.S02.E01            (808 notices)
      R2.0.9.S03.E01            (809 notices)
      R2.0.9.S04.E01            (778 notices)
      R2.0.9.S05.E01            (662 notices)

Any comments on the documentation?