16. I/O backends

16.1. Treex::PML::IO
16.2. FS backend
16.3. Storable backend
16.4. CSTS_SGML_SP backend
16.5. NTRED backend
16.6. PML backend
16.7. TrXML backend
16.8. TEIXML backend
16.9. Writing a new I/O backend

This section briefly introduces the existing I/O backends and then provides a quick introduction to writing a custom backend.

16.1. Treex::PML::IO

Strictly speaking, this is not an I/O backend. It is a base class and function library used by Treex::PML and by many of the other I/O backends. It provides a simple abstraction layer over common low-level tasks such as pipeline redirection, gzip compression, URL resolution, and fetching and uploading files over remote protocols. Thanks to this layer, TrEd transparently compresses and decompresses files with the .gz extension and transfers remote files over ftp://, http://, ssh:// and other protocols. Note, however, that the availability of remote file transfer depends heavily on the particular setup (currently only UNIX systems are fully supported) and may require external tools such as curl, kioclient, ssh and gzip.

16.2. FS backend

The FS backend deals with a format called FS (feature structure). FS was the first format supported by TrEd, and for a long time the only one. As a result, some methods of the Perl classes used by TrEd and defined in the underlying Perl library Treex::PML still bear the name of the format, even though they now represent data obtained from any I/O backend.

FS format provides a simple and effective way for storing trees. In this format, each node of the tree uses the same set of attributes declared in the FS-file header. FS format supports string values, enumerated values and flat lists of these (i.e. strings consisting of a |-separated list of values of the first two types). There is no direct support for nested AVS structures, complex lists and alternatives.

Recent versions of TrEd and Treex::PML provide a simple emulation of nested AVS (attribute-value structures) for FS attribute values. An attribute whose name contains one or more slashes is represented as a (possibly nested) AVS in which each slash represents one level of nesting. Attributes sharing a common name-part followed by a slash are thus represented as members of the same structure. For example, attributes a, b/u/x, b/v/x and b/v/y result in the following structure:


{
  a => value-of-a,
  b => { u => { x => value-of-b/u/x },
         v => { x => value-of-b/v/x,
                y => value-of-b/v/y
              }
       }
}

If attributes named a and a/b both exist, the nested-AVS emulation is abandoned and all attributes are represented literally (i.e. with slashes in their names).
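The mapping from slash-separated names to nested structures can be illustrated with a short sketch in plain Perl. It is not the Treex::PML implementation; the function name nest_attributes and the use of plain hashes are assumptions made for this example only.

```perl
use strict;
use warnings;

# Build a nested hash from flat FS-style attributes whose names may contain
# slashes. If some name is also a proper prefix of another name (e.g. "a"
# and "a/b"), the emulation is abandoned and a flat copy is returned.
# Hypothetical helper, not part of Treex::PML.
sub nest_attributes {
    my ($flat) = @_;

    # Conflict check: is any proper prefix of a name itself an attribute?
    for my $name (keys %$flat) {
        my @parts = split m{/}, $name;
        pop @parts;
        my $prefix = '';
        for my $part (@parts) {
            $prefix = length($prefix) ? "$prefix/$part" : $part;
            return { %$flat } if exists $flat->{$prefix};
        }
    }

    # No conflict: each slash opens one level of nesting.
    my %nested;
    for my $name (keys %$flat) {
        my @parts = split m{/}, $name;
        my $leaf  = pop @parts;
        my $level = \%nested;
        for my $part (@parts) {
            $level->{$part} ||= {};
            $level = $level->{$part};
        }
        $level->{$leaf} = $flat->{$name};
    }
    return \%nested;
}

my $avs = nest_attributes({
    'a'     => 'A',
    'b/u/x' => 'BUX',
    'b/v/x' => 'BVX',
    'b/v/y' => 'BVY',
});
print $avs->{b}{v}{y}, "\n";
```

Checking for conflicts before building anything keeps the fallback simple: either every attribute is nested or none is.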

Note

Even with the recently added nested-AVS emulation, FS format is not in general fully capable of capturing data originating in other backends such as PML.

FS format is fully described in http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/Doc/fs.html

16.3. Storable backend

This backend is based on the excellent Perl module Storable, which can store and retrieve almost any Perl data structure very quickly. Because of its general nature, this backend can save files originating from any other backend and keep them intact. Because it saves and retrieves data so quickly, it is often useful to convert all data temporarily into this format (e.g. using btred) and to revert to the original format only after all work on the data is done.
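The following sketch shows the underlying Storable round trip on an ordinary nested Perl structure. It illustrates the module only, not the backend itself; the sample data is made up for the example.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempfile);

# A stand-in for in-memory data; real TrEd documents are richer objects.
my $data = {
    trees => [
        { form => 'root', children => [ { form => 'leaf' } ] },
    ],
    meta => { source => 'example' },
};

# Reserve a temporary file name; the file is removed at program exit.
my ($fh, $path) = tempfile(UNLINK => 1);
close $fh;

nstore($data, $path);          # serialize in portable (network) byte order
my $copy = retrieve($path);    # read the whole structure back

print $copy->{trees}[0]{children}[0]{form}, "\n";
```

The retrieved structure is a deep copy, so changes to it do not affect the original until it is stored again.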

16.4. CSTS_SGML_SP backend

This backend provides support for the CSTS format. CSTS stands for Czech sentence tree structure. Since it is an application of SGML, the backend requires an external SGML parser (namely nsgmls) and a document type definition (DTD) file (csts.doctype). The backend represents CSTS data as trees with a fixed set of attributes specific to the purpose for which CSTS was created, namely morphological and syntactic annotation of Czech texts. CSTS was the primary format of the Prague Dependency Treebank 1.0.

16.5. NTRED backend

This backend is used for exchanging data between TrEd and btred servers, or in other words for peeking into the memory of running btred servers. It only accepts filenames (or rather URLs) that start with the ntred:// protocol specification followed by a real file name. When this backend opens a file, it uses the ntred client to fetch the file from any currently running btred server. If some btred server has an in-memory copy of the specified file (possibly edited during previous ntred requests), it sends this copy to the client, which passes it via the NTRED backend to TrEd. Saving works the other way round: the ntred client communicates the (possibly edited) file back to the btred server, which replaces its previous in-memory copy with the file obtained in this way.

Instead of requesting a whole file from the servers, it is also possible (and sometimes faster) to request only a single tree. This is done by appending a suffix of the form @N to the ntred:// URL, where N is the number of the requested tree (counting from 1). In that case, TrEd obtains a file containing only the requested tree. For example, the URL ntred://filename@3##1.4 opens a file containing the 3rd tree in filename (as represented in the memory of a btred server) and positions it on the 4th node of its first and only tree. URLs of this form are produced by the TredMacro function NPosition() (see Section 15.8, “Public API: pre-defined macros”).
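As an illustration of the URL shape described above, the sketch below splits an ntred:// URL into its parts. The function name parse_ntred_url and the key names are hypothetical, and the reading of the ##T.M suffix as "tree T, node M" is inferred from the example in the text rather than taken from the backend's code.

```perl
use strict;
use warnings;

# Split an ntred:// URL into file name, requested tree and position.
# Both the "@N" tree selector and the "##T.M" position suffix are optional.
# Returns undef for anything that is not an ntred:// URL.
sub parse_ntred_url {
    my ($url) = @_;
    return undef
        unless $url =~ m{\Antred://(.+?)(?:\@(\d+))?(?:\#\#(\d+)(?:\.(\d+))?)?\z};
    return {
        file     => $1,
        tree     => $2,   # tree requested from the server, counting from 1
        pos_tree => $3,   # tree to show within the fetched file
        pos_node => $4,   # node to show within that tree
    };
}

my $parts = parse_ntred_url('ntred://filename@3##1.4');
print "$parts->{file}, tree $parts->{tree}, node $parts->{pos_node}\n";
```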

16.6. PML backend

16.6.1. Introduction

This backend provides support for a generic XML-based format called PML (the Prague Markup Language), used for capturing rich linguistic annotation. PML is the basis of the data format of the Prague Dependency Treebank 2.0.

Since all XML data can be easily transformed into PML and back (usually with a few lines of XSLT), this backend also provides a bridge between any other XML data format and TrEd, as described below in Section 16.6.2, “Support for non-PML XML-based formats”.

Each application of PML is described by a special XML file called a PML schema. This schema file defines which elements and attributes constitute the nodes and structure of the tree, declares the value types of node attributes, etc.

An up-to-date version of the full PML specification can be found on the PML project page.

While PML can possibly capture all kinds of structured data, the PML backend of TrEd is limited only to those applications of PML which satisfy the following criteria:

  • The data contain exactly one PML sequence or PML list with the PML role #TREES, consisting of data with the role #NODE. In the case of a PML sequence, the #NODE elements must form a contiguous block in the sequence, but may be preceded and/or followed by non-#NODE elements.

  • Of all PML data types only PML structures and PML containers may bear the role #NODE.

  • A structure with the role #NODE may have a member with the role #CHILDNODES, containing a list or sequence of structures/containers with the role #NODE.

  • A container with the role #NODE may contain a list or sequence with the role #CHILDNODES consisting of structures/containers with the role #NODE.

16.6.2. Support for non-PML XML-based formats

The PML backend has its own configuration file, pmlbackend_conf.xml, which is looked up in the directories listed in ResourcePath. The transform_map section of the configuration file is automatically merged with all files named pmlbackend_conf.inc found in ResourcePath (e.g. in a resource subdirectory of an extension package). The configuration file may specify transformation rules for transparent conversion from a legacy XML format to PML and back. The configuration file is itself a PML file and may look like the following example:

<?xml version="1.0"?>
<pmlbackend xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
  <head>
    <schema href="pmlbackend_conf_schema.xml"/>
  </head>
  <transform_map>

    <!--
      Transformation for Alpino XML format.
     -->
    <transform id="alpino" test="alpino_ds[@version='1.1']">
      <in type="xslt" href="alpino2pml.xsl"/>
      <out type="xslt" href="pml2alpino.xsl"/>
    </transform>

    <!--
      Transformation for TEI P5 XML format.
     -->
    <transform id="tei" test="TEI.2">
      <in type="xslt" href="tei2pml.xsl">
        <param name="only_fLib">'foo bar'</param>
      </in>
      <out type="xslt" href="pml2tei.xsl"/>
    </transform>

  </transform_map>
</pmlbackend>

This example configuration file defines two XSLT-based transformations (type="xslt", i.e. XSLT 1.0, is currently the only type of transformation implemented by the PML backend). Each transformation consists of an XSLT stylesheet declared in the <in> tag, which the PML backend uses to convert documents from the original XML format to PML, and an XSLT stylesheet declared in the <out> tag, which it uses to convert documents from PML back to the original XML format. Each stylesheet can take zero or more parameters specified as <param name="parameter-name">parameter-value</param>, where parameter-value is an XPath expression evaluated by the XSLT processor in the context of the transformed document. Each transformation is further conditioned by an XPath expression in the test attribute, which determines the documents to which the transformation applies.

When the PML backend opens an XML document and detects that the document does not belong to the PML namespace, it evaluates the XPath expression test of every transformation rule, in the order in which the rules appear in the configuration file, until one of the expressions returns a true value (boolean true, a non-zero number, a non-empty node-set, or a non-empty string) or the last expression fails. The input stylesheet of the first transformation whose test succeeds is used to transform the document into PML. The id of this transformation is remembered, and the output stylesheet of the same transformation is used to convert the document back from PML when it is saved.
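The selection logic can be sketched in plain Perl as a first-match search over an ordered list of rules. This is only an illustration: the XPath tests are replaced by ordinary code references, documents are modelled as small hashes, and select_transform is a hypothetical name, none of which reflects the backend's actual code.

```perl
use strict;
use warnings;

# Rules in configuration-file order. Each "test" is a code reference
# standing in for the XPath expression of the real configuration.
my @rules = (
    {
        id   => 'alpino',
        test => sub { $_[0]{root} eq 'alpino_ds' && $_[0]{version} eq '1.1' },
        in   => 'alpino2pml.xsl',
        out  => 'pml2alpino.xsl',
    },
    {
        id   => 'tei',
        test => sub { $_[0]{root} eq 'TEI.2' },
        in   => 'tei2pml.xsl',
        out  => 'pml2tei.xsl',
    },
);

# Return the first rule whose test succeeds, or undef if none applies.
sub select_transform {
    my ($doc, @candidates) = @_;
    for my $rule (@candidates) {
        return $rule if $rule->{test}->($doc);
    }
    return undef;
}

# The chosen rule is kept so that its "out" stylesheet can be used on save.
my $chosen = select_transform({ root => 'TEI.2' }, @rules);
print "$chosen->{id}: open with $chosen->{in}, save with $chosen->{out}\n";
```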

We now summarize the steps necessary for adding support for a new XML-based format to TrEd (via XSLT and PML backend):

  • Write a PML schema for the resulting PML version of the data so that all necessary information stored in the original format is captured.

  • Write an XSLT transformation from the original format to the PML format described by the previously written PML schema.

  • Write an XSLT transformation from the PML format to the original format. This step is not necessary if one only wishes to open the documents in TrEd for reading.

  • Create a pmlbackend_conf.xml in one of the ResourcePath directories unless it already exists and add a transformation rule to it with the input and output XSLT stylesheets and an XPath test approximating documents in the format. If writing an extension, one can create a file pmlbackend_conf.inc in the resources directory of the extension, instead.

    Instead of specifying the output XSLT stylesheet one may also define an identity output transformation which simply writes back the data in PML. In that case the out tag should look as follows:

    <out type="identity"/>

16.6.3. Internal representation of PML data

This section describes how PML data types are represented in TrEd.

PML schema is represented by a Treex::PML::Schema object. This object can be retrieved from the current Treex::PML::Document using the macro MetaData('schema').

PML structures are represented as Treex::PML::Struct objects, and so are PML containers, whose attributes and content value become members of the Treex::PML::Struct object, the content value being represented by a special member named #content. In both cases, if the role is #NODE, a Treex::PML::Node object is used instead of Treex::PML::Struct. PML lists are represented as Treex::PML::List objects and PML alternatives as Treex::PML::Alt objects. PML sequences are represented as Treex::PML::Seq objects and their elements as Treex::PML::Seq::Element objects. There are two exceptions: a PML sequence with the role #TREES is represented by the list of trees of the Treex::PML::Document object, and a PML sequence with the role #CHILDNODES is represented by the child nodes of the node it belongs to. Elements of these sequences are therefore represented by Treex::PML::Node objects (rather than Treex::PML::Seq::Element objects) with a dedicated attribute #name carrying the element's name.

The non-tree data structures contained in the root element of the PML instance can be obtained from MetaData('pml_root'). If the root element contains a sequence with the role #TREES, MetaData('pml_root') is empty; non-tree members of the sequence that precede the trees are stored in the sequence MetaData('pml_prolog'), while non-tree members that follow all trees are stored in the sequence MetaData('pml_epilog').
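The division of a #TREES sequence into prolog, trees and epilog can be sketched as follows. Elements are modelled here as plain hashes with an is_node flag, and split_trees_sequence is a hypothetical helper; the sketch only mirrors the contiguity rule and the prolog/epilog split described in this section.

```perl
use strict;
use warnings;

# Split the elements of a #TREES sequence into prolog, trees and epilog.
# Dies if the #NODE elements do not form one contiguous block.
sub split_trees_sequence {
    my (@elements) = @_;
    my (@prolog, @trees, @epilog);
    for my $el (@elements) {
        if ($el->{is_node}) {
            die "#NODE elements must form a contiguous block\n" if @epilog;
            push @trees, $el;
        }
        elsif (@trees) {
            push @epilog, $el;    # non-node element after the trees
        }
        else {
            push @prolog, $el;    # non-node element before the trees
        }
    }
    return (\@prolog, \@trees, \@epilog);
}

my ($prolog, $trees, $epilog) = split_trees_sequence(
    { name => 'meta' },
    { name => 'tree', is_node => 1 },
    { name => 'tree', is_node => 1 },
    { name => 'note' },
);
printf "%d prolog, %d trees, %d epilog\n",
    scalar @$prolog, scalar @$trees, scalar @$epilog;
```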

PML structures/containers that have a member or attribute with the role #ID are indexed by their IDs in a HASH AppData('id-hash').

If the PML schema declares a reference to an external resource and this declaration has the attribute readas="dom", then the PML backend loads the corresponding PML instance as a DOM (Document Object Model) tree (using the Perl module XML::LibXML) and attaches this DOM tree to the application data section of the in-memory representation of the file.

If the PML schema declares a reference to an external resource and this declaration has the attribute readas="pml", then the PML backend loads the corresponding PML instance as a Treex::PML::Instance object and attaches this object to the application data section of the in-memory representation of the file.

If the PML schema declares a reference to an external resource and this declaration bears the attribute readas="trees", then the PML backend passes the file name of the corresponding resource to TrEd, and TrEd loads it as an ordinary file. This file can be edited and treated like any other file in TrEd. In btred, this file is opened as a so-called secondary file, i.e. a file which is not implicitly processed by the macro specified by the user; since it is loaded in memory, however, the macro may explicitly choose to process it.

The following HASH references carry information about external resources: MetaData('refnames') maps reference names to reference IDs, MetaData('references') maps reference IDs to URLs, and, for DOM and Treex::PML::Instance resources, AppData('ref') maps reference IDs to objects representing the resources in TrEd.

The PML backend supports so-called knitting of PML instances, i.e. replacing certain types of PML references with the content of the referenced entities occurring in another PML instance. Conversely, when a PML instance to which knitting has been applied is saved, the (possibly edited) content replaces the content of the referenced entities in their original PML instance. Knitting only applies to:

  1. members of PML structures containing a PML reference and having PML role #KNIT,

  2. members of PML structures containing a list with PML role #KNIT, with PML references as list members.

If knitting applies to such a member, then a possible trailing .rf part of its name is stripped and its content (either a single PML reference or a list of them) is replaced with the corresponding entities in the referenced PML instance. It is required that all such PML references refer to resources specified in the PML schema either as readas="dom" (in which case Treex::PML representation of the referred data structure is created and transformed back into DOM only at save) or readas="pml" (in which case, after knitting, the referred and referring PML instances share the knitted data structure).
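The effect of knitting on a single member can be sketched as below. The sketch makes strong simplifications that are assumptions of this example: PML references are modelled as plain ID strings, the referenced instance as an ID-to-entity hash, and knit_member is a hypothetical helper rather than a Treex::PML function.

```perl
use strict;
use warnings;

# Replace the reference(s) held in a member with the referenced entities
# and strip a trailing ".rf" from the member's name.
sub knit_member {
    my ($struct, $member, $id_hash) = @_;
    my $value = delete $struct->{$member};
    (my $knit_name = $member) =~ s/\.rf\z//;
    $struct->{$knit_name} = ref($value) eq 'ARRAY'
        ? [ map { $id_hash->{$_} } @$value ]    # list of references
        : $id_hash->{$value};                   # single reference
    return $struct;
}

# Entities of the referenced instance, indexed by ID.
my %words = (
    w1 => { form => 'Hello' },
    w2 => { form => 'world' },
);

my $node = { 'w.rf' => 'w1', 'others.rf' => [ 'w1', 'w2' ] };
knit_member($node, 'w.rf',      \%words);
knit_member($node, 'others.rf', \%words);
print join(' ', map { $_->{form} } @{ $node->{others} }), "\n";
```

Because the knitted members hold the same hash references as the index, edits made through the node are visible in the referenced data, which loosely corresponds to the sharing described for readas="pml".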

16.7. TrXML backend

The TrXML backend was intended as an XML replacement for the FS format. Unfortunately, it was never fully developed or thoroughly tested. It has been made obsolete by PML and is therefore not recommended for any future work.

16.8. TEIXML backend

This backend reads and stores trees represented in a specific subset of the TEI XML format. The format used by this backend was (is?) used in the Slovene Treebank Project.

16.9. Writing a new I/O backend

While for XML-based formats it is recommended to use an XSLT transformation to PML as described in Section 16.6.2, “Support for non-PML XML-based formats”, adding support for other formats requires writing a new I/O backend. An I/O backend is a Perl module defining at least the following five subroutines (listed in the order in which they are typically called by TrEd):

test($filename,$encoding)

This function should only take a quick look at the given file to determine whether it is suitable for the backend. If the function accepts the file by returning a defined non-zero value (e.g. 1), the file is processed by this backend. If the file is not suitable for the backend, the function must reject it by returning 0 or undef, so that the other backends in the list can try their luck.
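A minimal sketch of such a function for an imaginary line-oriented format might look as follows. The package name MyFormatBackend and the magic string #MYFORMAT are hypothetical, and the sketch uses Perl's plain open, so unlike a real backend built on Treex::PML::IO it does not handle compressed or remote files.

```perl
use strict;
use warnings;

{
    package MyFormatBackend;    # hypothetical backend module

    # Accept a file only if its first line starts with "#MYFORMAT".
    # Returns 1 to accept the file and 0 to reject it.
    sub test {
        my ($filename, $encoding) = @_;    # $encoding is not needed here
        open(my $fh, '<', $filename) or return 0;
        my $first = <$fh>;                 # read just one line
        close $fh;
        return (defined $first && $first =~ /^#MYFORMAT\b/) ? 1 : 0;
    }
}
```

Reading only the first line keeps the check cheap, which matters because every registered backend may be asked in turn.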

open_backend($filename,$mode,$encoding)

This function should open the given file and return a filehandle for it. If $mode is r, the filehandle should be open for reading; if $mode is w, it should be open for writing. The third argument, $encoding, contains the encoding specified by the user in the defaultFileEncoding configuration option. This information may be ignored if the data format provides another way to determine the encoding. Most backends do not re-implement this function, but simply import (i.e. inherit) it from the base class Treex::PML::IO.

read($filehandle,$fsfile)

This is the key function: it converts data from the specific data format to the corresponding in-memory representation in TrEd. It receives two arguments: the $filehandle previously obtained by a call to the backend's open_backend, and an empty Treex::PML::Document object (i.e. one with no trees). It is supposed to parse the data format, build a tree representation of the data (usually using functions such as Treex::PML::Factory->createNode() and $child->paste_on($parent,$ordering_attribute)) and populate the Treex::PML::Document with the resulting trees (e.g. using its changeTrees method). It should also set up the Treex::PML::FSFormat object associated with $fsfile ($fsfile->FS). Any additional information related to the file (but not representable as trees or Treex::PML::FSFormat) may be attached to the file, e.g. using $fsfile->changeMetaData($key,$value).

write($filehandle,$fsfile)

This function is the opposite of read. By examining the Treex::PML::Document object $fsfile (especially its trees and meta data), it should write the corresponding representation in the specific data format to the given $filehandle.

close_backend($filehandle)

This function should close the given filehandle, created by a previous call to open_backend. It usually consists only of calling Perl's close function on the filehandle, but any additional cleanup should also be done here. Most backends do not re-implement this function, but simply import (i.e. inherit) it from the base class Treex::PML::IO.

There are several ways to make TrEd know about a user-defined I/O backend, namely: