PDT 2.0: internal format conversion tools

Table of Contents

1. Conversion of PDT1.0-like analytical annotation to PML
2. Conversion of a PML a-data instance to CSTS
3. Conversion of a PML m-data instance to CSTS
4. Conversion of PDT 2.0 data to FS for Netgraph
5. Conversion of PDT 2.0 data to a binary Perl Storable format (for speed)

Converting between data formats is difficult unless all the formats can carry exactly the same information. Unfortunately, this is not the case for the data formats that have emerged over the history of PDT. The following scripts aim to make at least some of the conversions easier. They may also serve as examples for more complex transformations required for a particular purpose.

In the distribution, the scripts are located in the directory tools/format-conversions/pdt_formats. Most of the scripts also require the btred tool from the TrEd toolkit.

The usual application of the scripts is as follows:

$ btred -m script.btred files ...

With a working ntred configuration, the same task can be parallelized on several machines using:

$ ntred --init files ...
$ ntred -m script.btred
$ ntred --quit

1. Conversion of PDT1.0-like analytical annotation to PML

btred -m old2pml.btred [-e all | a-data | m-data | w-data ] [-o [--with-MM] [--with-MD] -- ] [-s strip-filename-suffix] [-a append-filename-suffix] [-p strip-filename-prefix] [-r prepend-filename-prefix] file...

The script converts data with PDT 1.0 and similar analytical annotation in CSTS, FS or any other format supported by TrEd to PML.

By default, the tool creates PML instances for the a-layer (only if available in the source), m-layer, and w-layer. The conversion can also be restricted to a single layer by calling the appropriate macro of the script, e.g.

$ btred -m old2pml.btred -e m-data files ...

only creates the corresponding m-layer instances.

By default, the output m-file is populated only with the manual morphological annotation. Whether the tagger markup and/or the morphological analysis are also included in the m-file as alternatives to the annotator's markup is controlled by the --with-MD and --with-MM script flags, respectively (script flags are passed to btred between -o and --).

The names of output files are based on the input files as follows: the input filename is first transformed according to command-line options -a|-s|-r|-p passed to btred or ntred. If the resulting filename contains a .gz suffix, it is stripped. Finally, according to the layer of annotation, one of the suffixes .a, .m, and .w is appended.
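
The naming rule above can be sketched in Python. This is only an illustration, not the script's actual code: the function names and the sample filenames are hypothetical, the options are assumed to be literal strings, and the order in which -p, -s, -r and -a are applied is an assumption.

```python
def apply_name_options(name, strip_suffix="", append_suffix="",
                       strip_prefix="", prepend_prefix=""):
    """Mimic the -s, -a, -p and -r filename options (assumed semantics)."""
    if strip_prefix and name.startswith(strip_prefix):
        name = name[len(strip_prefix):]
    if strip_suffix and name.endswith(strip_suffix):
        name = name[:-len(strip_suffix)]
    return prepend_prefix + name + append_suffix


def old2pml_output_name(input_name, layer, **options):
    """Return the output name for one layer: 'a', 'm' or 'w'."""
    # Step 1: apply the command-line filename options.
    name = apply_name_options(input_name, **options)
    # Step 2: strip a .gz suffix if the result still has one.
    if name.endswith(".gz"):
        name = name[:-len(".gz")]
    # Step 3: append the layer suffix.
    return name + "." + layer


print(old2pml_output_name("sample.csts.gz", "m"))
print(old2pml_output_name("sample.csts.gz", "w", strip_suffix=".csts.gz"))
```

Under these assumptions, an input without any options keeps its original suffix before the layer suffix, while passing -s with the full original suffix yields a bare base name followed by the layer suffix.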

2. Conversion of a PML a-data instance to CSTS

btred -m adata2csts.btred [-s strip-filename-suffix] [-a append-filename-suffix] [-p strip-filename-prefix] [-r prepend-filename-prefix] file...

The script converts PDT 2.0 a-data (after knitting them with the corresponding m-data and w-data) to CSTS with the following limitations on the resulting CSTS (especially with respect to the content of PML w-data and m-data):

the output character encoding is iso-8859-2 as required by the CSTS specification
no CSTS header is created (not required by CSTS DTD)
only one of each of doc, c, p elements is created (i.e. paragraph boundaries marked in the w-data are ignored)
the required header a of the doc element is populated with dummy content
tokens in w-data not referenced from m-data are not represented in the resulting CSTS
morphological annotation other than that referenced from a-data is ignored
morphological annotation is dumped as l and t regardless of its true source

The names of the output files are based on the input files as follows: the input filename is first transformed according to command-line options -a|-s|-r|-p passed to btred or ntred. If the resulting filename contains a .gz suffix, it is stripped. Finally, the .csts suffix is appended.

3. Conversion of a PML m-data instance to CSTS

xsltproc [-o output-file] mdata2csts.xsl file

saxon [-o output-file] file mdata2csts.xsl

This conversion is implemented as an XSL transformation and requires an XSLT processor such as xsltproc or saxon. The XSLT stylesheet provides only a very simplistic conversion of PDT 2.0 m-data to CSTS. It suffers from the following limitations:

the output character encoding is iso-8859-2 as required by the CSTS specification
no CSTS header is created (not required by CSTS DTD)
only one of each of doc, c, p elements is created (i.e. paragraph boundaries marked in the w-data are ignored)
the required header a of the doc element is populated with dummy content
information contained on the w-data layer is completely ignored
the script expects no alternatives in the morphological annotation of the data and always translates this annotation as l and t regardless of its true source

4. Conversion of PDT 2.0 data to FS for Netgraph

btred -m pml2netgraph.btred [-s strip-filename-suffix] [-a append-filename-suffix] [-p strip-filename-prefix] [-r prepend-filename-prefix] file...

This script transforms a-data or t-data from PML instances to a corresponding FS file suitable for use with the Netgraph server. In the case of a-data, only minor changes to the naming of attributes are made (m-data and w-data are naturally embedded). In the case of t-data, the conversion is more complex: for every a-node referred to in the element a of a t-node, a phantom copy is created and planted as a hidden child node of the referring t-node. This copy also embeds the complete m-layer and w-layer information of the original a-node. Node attributes of the a-layer start with the prefix a/; those of the m-layer and w-layer start with m/ and w/, respectively.
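
The t-data case can be pictured with a small Python sketch. It is not the script's implementation: nodes are modelled as plain dictionaries, the function name embed_a_nodes and the keys "a", "attrs", "m", "w", "hidden" and "children" are hypothetical, and the reference in the element a is simplified to a list of a-node IDs.

```python
def embed_a_nodes(t_node, a_nodes_by_id):
    """Attach a hidden phantom child to t_node for every a-node it refers to."""
    for a_id in t_node.get("a", []):
        a_node = a_nodes_by_id[a_id]
        # The phantom copy is a new node; the original a-node is not modified.
        phantom = {"hidden": True, "children": []}
        # Prefix attributes by the layer they come from.
        for key, value in a_node["attrs"].items():
            phantom["a/" + key] = value
        for key, value in a_node["m"].items():
            phantom["m/" + key] = value
        for key, value in a_node["w"].items():
            phantom["w/" + key] = value
        t_node.setdefault("children", []).append(phantom)
    return t_node


# Illustrative data only; the IDs and attribute values are made up.
a_nodes = {
    "a1": {
        "attrs": {"afun": "Sb"},
        "m": {"form": "Pes", "lemma": "pes"},
        "w": {"token": "Pes"},
    }
}
t_node = embed_a_nodes({"id": "t1", "a": ["a1"]}, a_nodes)
print(sorted(t_node["children"][0]))
```

Keeping the layer prefix in the attribute name means that a single flat node can carry attributes from three layers without name clashes, which is what a flat FS node requires.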

Unlike the above conversions, this script actually modifies the file loaded in btred, which means that when applied using ntred, an explicit ntred --save-files command must be issued before ntred --quit.

The names of the output files are based on the input files as follows: possible .pls or .gz suffixes are removed, the .fs suffix is appended, and the result is transformed according to the command-line options -a|-s|-r|-p passed to btred or to ntred --save-files.
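
Note that here the filename options are applied last, after the suffix change. A minimal Python sketch of this rule follows; the function names and sample filenames are hypothetical, the options are assumed to be literal strings, and it is an assumption that stacked suffixes such as .pls.gz are both removed.

```python
def apply_name_options(name, strip_suffix="", append_suffix="",
                       strip_prefix="", prepend_prefix=""):
    """Mimic the -s, -a, -p and -r filename options (assumed semantics)."""
    if strip_prefix and name.startswith(strip_prefix):
        name = name[len(strip_prefix):]
    if strip_suffix and name.endswith(strip_suffix):
        name = name[:-len(strip_suffix)]
    return prepend_prefix + name + append_suffix


def netgraph_output_name(input_name, **options):
    """Return the FS output name for a PML input file."""
    name = input_name
    # Assumption: both suffixes are removed when stacked, e.g. ".pls.gz".
    for suffix in (".gz", ".pls"):
        if name.endswith(suffix):
            name = name[:-len(suffix)]
    # The .fs suffix is appended before the options are applied.
    return apply_name_options(name + ".fs", **options)


print(netgraph_output_name("source/sample.t.gz",
                           strip_prefix="source/", prepend_prefix="target/"))
```

Because the options see the name that already ends in .fs, a suffix given with -s would have to match the new suffix rather than the original one.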

5. Conversion of PDT 2.0 data to a binary Perl Storable format (for speed)

btred -m pml2pls.btred [-s strip-filename-suffix] [-a append-filename-suffix] [-p strip-filename-prefix] [-r prepend-filename-prefix] file...

This script transforms given PML files to a binary format based on the Storable Perl module. This format allows applications from the TrEd toolkit to retrieve the data with extreme speed.

Apart from simply saving the data as Storable objects, the script tries to keep references from t-data to a-data internally consistent by changing the internal reference to the a-data file in the t-data file accordingly. For example, a reference to an a-data file ref_filename in a t-data file processed with this script is changed to refer to ref_filename.pls.gz. Hence, the script must be applied to both a-data and t-data files.

This script modifies the file loaded in btred, which means that when it is applied using ntred, an explicit ntred --save-files command must be issued before ntred --quit.

The filenames of the referenced files are internally changed as follows: the directory path is completely stripped off, a possible .gz suffix is removed, the .pls.gz suffix is appended, and the result is transformed according to the command-line options -a|-s|-r|-p (in the case of ntred, these must also be repeated with the ntred -m pml2pls.btred command).

The names of the output files are based on the input files as follows: a possible .gz suffix is removed, the .pls.gz suffix is appended, and the result is transformed according to the command-line options -a|-s|-r|-p (in the case of ntred, these are the options used with ntred --save-files).
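
The two naming rules differ only in whether the directory path is kept. The following Python sketch illustrates both side by side; it is not the script's code, the function names and sample paths are hypothetical, and the options are assumed to be literal strings.

```python
import posixpath


def apply_name_options(name, strip_suffix="", append_suffix="",
                       strip_prefix="", prepend_prefix=""):
    """Mimic the -s, -a, -p and -r filename options (assumed semantics)."""
    if strip_prefix and name.startswith(strip_prefix):
        name = name[len(strip_prefix):]
    if strip_suffix and name.endswith(strip_suffix):
        name = name[:-len(strip_suffix)]
    return prepend_prefix + name + append_suffix


def to_pls(name):
    """Remove a possible .gz suffix and append .pls.gz."""
    if name.endswith(".gz"):
        name = name[:-len(".gz")]
    return name + ".pls.gz"


def pls_reference_name(ref_filename, **options):
    """Name written into the t-data file as the reference to the a-data file."""
    # The directory path of the referenced file is dropped entirely.
    return apply_name_options(to_pls(posixpath.basename(ref_filename)), **options)


def pls_output_name(input_name, **options):
    """Name under which a converted file is saved; the path is kept."""
    return apply_name_options(to_pls(input_name), **options)


print(pls_reference_name("../adata/sample.a.gz"))
print(pls_output_name("source/sample.t.gz",
                      strip_prefix="source/", prepend_prefix="target/"))
```

Since the reference loses its directory path, the converted a-data file is presumably expected to end up where a bare filename reference from the t-data file can find it.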

A typical usage (converting both a-data and t-data files to a gzipped Storable format *.pls.gz in one step):

$ btred -Y -m pml2pls.btred -g '*.t' '*.a'

The -Y flag prevents the a-data from being loaded a second time as secondary files required by the t-data.

The same case using ntred, but without gzipping and with the files moved from the source/ directory to target/:

$ ntred --init -Y -g '*.t' '*.a'
$ ntred -m pml2pls.btred -s .gz
$ ntred --save-files -p source/ -r target/