15. I/O backends

15.1. IOBackend
15.2. FSBackend
15.3. StorableBackend
15.4. CSTS_SGML_SP_Backend
15.5. NTREDBackend
15.6. PMLBackend
15.7. TrXMLBackend
15.8. TEIXMLBackend
15.9. Writing a new I/O backend

This section briefly introduces the existing I/O backends and then provides a quick introduction to writing a custom backend.

15.1. IOBackend

Strictly speaking, IOBackend is not an I/O backend. It is a base class underlying most of the other I/O backends. It provides a simple abstraction layer over common low-level tasks such as pipeline redirection, gzip compression, URL resolution, and fetching and uploading files over remote protocols. Thanks to this class, TrEd transparently handles gzip compression and decompression of files with the .gz extension, as well as remote file transfer over ftp://, http://, ssh:// and other protocols. Note, however, that the availability of remote file transfer depends heavily on the particular setup (currently only UNIX systems are fully supported) and may require external tools such as curl, kioclient, ssh, and gzip.
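
The idea of choosing a compressed or plain filehandle by file extension can be sketched as follows. This is only an illustration, not IOBackend's actual code: the helper name open_maybe_gzipped is hypothetical, and the sketch uses the core Perl modules IO::Compress::Gzip and IO::Uncompress::Gunzip rather than the external tools mentioned above.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw($GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper (not part of TrEd): return a filehandle for $filename,
# compressing or decompressing on the fly when the name ends in .gz.
# $mode is 'r' for reading or 'w' for writing.
sub open_maybe_gzipped {
    my ($filename, $mode) = @_;
    if ($filename =~ /\.gz$/) {
        if ($mode eq 'w') {
            my $z = IO::Compress::Gzip->new($filename)
                or die "Cannot gzip to $filename: $GzipError\n";
            return $z;
        }
        my $z = IO::Uncompress::Gunzip->new($filename)
            or die "Cannot gunzip $filename: $GunzipError\n";
        return $z;
    }
    # Plain file: an ordinary Perl filehandle is enough.
    open(my $fh, ($mode eq 'w' ? '>' : '<'), $filename)
        or die "Cannot open $filename: $!\n";
    return $fh;
}
```

The caller can print to or read from the returned handle in the usual way, without knowing whether the file is compressed.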

15.2. FSBackend

FSBackend deals with a format called FS (feature structure). FS was the first format supported by TrEd, and for a long time the only one. As a result, some of the Perl structures used by TrEd, such as FSNode and FSFile, as well as the underlying Perl library Fslib.pm itself, bear its name (even though they are now used equally to represent data obtained from any other I/O backend).

FS format provides a simple and effective way for storing trees. In this format, each node of the tree uses the same set of attributes declared in the FS-file header. FS format supports string values, enumerated values and flat lists of these (i.e. strings consisting of a |-separated list of values of the first two types). There is no direct support for nested AVS structures, complex lists and alternatives.

Recent versions of TrEd and Fslib provide a simple emulation of nested AVS (attribute-value structures) for FS attribute values. It works as follows: an attribute whose name contains one or more slashes is represented as a (possibly nested) AVS in which each slash represents one level of nesting. Attributes sharing a common name part followed by a slash are thus represented as members of the same structure. For example, the attributes a, b/u/x, b/v/x and b/v/y result in the following structure:

{
  a => value-of-a,
  b => { u => { x => value-of-b/u/x },
         v => { x => value-of-b/v/x,
                y => value-of-b/v/y
              }
       }
}

If attributes named a and a/b both exist, the nested-AVS emulation is abandoned and all attributes are represented literally (i.e. with slashes in their names).
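
The slash rule and its fallback can be illustrated with a short sketch. The function nest_attributes is a hypothetical helper written for this illustration, not the code used by Fslib; in particular, it assumes that the fallback applies whenever any attribute name is also a slash-terminated prefix of another.

```perl
use strict;
use warnings;

# Hypothetical helper (not Fslib code): turn a flat attribute list into a
# nested hash, one nesting level per slash in the attribute name.
sub nest_attributes {
    my (%flat) = @_;

    # Fallback: if a name such as "a" coexists with "a/b", give up nesting
    # and return the attributes literally (slashes kept in the names).
    for my $name (keys %flat) {
        my @parts = split m{/}, $name;
        pop @parts;                       # drop the leaf, keep the prefixes
        my $prefix = '';
        for my $part (@parts) {
            $prefix = length($prefix) ? "$prefix/$part" : $part;
            return { %flat } if exists $flat{$prefix};
        }
    }

    my %nested;
    for my $name (keys %flat) {
        my @parts = split m{/}, $name;
        my $leaf  = pop @parts;
        my $slot  = \%nested;
        # Walk down (creating as needed) one hash per name part.
        $slot = ($slot->{$_} ||= {}) for @parts;
        $slot->{$leaf} = $flat{$name};
    }
    return \%nested;
}

my $avs = nest_attributes(
    'a'     => 'value-of-a',
    'b/u/x' => 'value-of-b/u/x',
    'b/v/x' => 'value-of-b/v/x',
    'b/v/y' => 'value-of-b/v/y',
);
print $avs->{b}{v}{y}, "\n";
```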

Note

Even with the recently added nested-AVS emulation, FS format is not in general fully capable of capturing data originating in other backends such as PML.

FS format is fully described in http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/Doc/fs.html

15.3. StorableBackend

This backend is based on the excellent Perl module Storable, which can store and retrieve almost any Perl data structure very quickly. Because of its general nature, this backend can save and keep intact files originating from any other backend. Because of the speed with which it saves and retrieves data, it is often useful to convert all data temporarily into this format (e.g. using btred) and revert to the original format only after all work on the data is done.
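
As a rough illustration of why Storable suits this purpose, the following sketch saves a nested Perl structure and reads it back unchanged. It uses only the documented Storable functions nstore and retrieve; the helper names, the sample data and the file name are made up, and the sketch does not show the file layout actually used by StorableBackend.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);

# Hypothetical helpers: save any Perl data structure to a file and load it
# back. nstore() writes in network byte order, so files stay portable.
sub cache_structure {
    my ($data, $path) = @_;
    nstore($data, $path) or die "Cannot store to $path\n";
    return $path;
}

sub load_structure {
    my ($path) = @_;
    return retrieve($path);
}
```

A structure of arbitrary depth, for example a list of trees held as nested hashes and arrays, survives the round trip intact.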

15.4. CSTS_SGML_SP_Backend

This backend provides support for the CSTS format. CSTS stands for “Czech sentence tree structure”. Since it is an application of SGML, the backend requires an external SGML parser (namely nsgmls) and a document type definition (DTD), csts.doctype. This backend represents CSTS data as trees with a fixed set of attributes, specific to the purpose for which CSTS was created, namely morphological and syntactic annotation of Czech texts. CSTS was the primary format of the Prague Dependency Treebank 1.0.

15.5. NTREDBackend

This backend is used for exchanging data between TrEd and btred servers, or in other words to “peek” into the memory of running btred servers. It only accepts filenames (or rather URLs) that start with the ntred:// protocol specification followed by a real file name. When this backend opens a file, it uses the ntred client to fetch the file from any currently running btred server. If some btred server has an in-memory copy of the specified file (possibly edited during previous ntred requests), it sends this copy to the client and, via NTREDBackend, to TrEd. Saving works in reverse: the ntred client is used to send the (possibly edited) file back to the btred server, which replaces its previous in-memory copy with the file obtained in this way.

Instead of requesting a whole file from the servers, it is also possible (and sometimes faster) to request only a single tree. This can be achieved by appending a suffix of the form @N to the ntred:// URL, where N is the number of the requested tree (counting from 1). In that case, TrEd obtains a file containing only the requested tree. For example, a URL such as ntred://filename@3##1.4 opens a file containing the 3rd tree in filename (as represented in the memory of a btred server) and positions it on the 4th node of its first and only tree. URLs of this form are produced by the TredMacro function NPosition() (see Section 14.8, “Public API: pre-defined macros”).
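
To make the URL anatomy concrete, here is a small sketch that splits such a URL into its parts. The function parse_ntred_url and the hash keys it returns are hypothetical, and the ##TREE.NODE position suffix is inferred solely from the example above; this is not the parser used by NTREDBackend.

```perl
use strict;
use warnings;

# Hypothetical helper (not TrEd code): split an ntred:// URL into the file
# name, the optional @N tree selector and the optional ##TREE.NODE position.
# Returns undef (or an empty list) for anything that is not an ntred:// URL.
sub parse_ntred_url {
    my ($url) = @_;
    return unless $url =~ m{^ntred://(.+?)(?:\@(\d+))?(?:\#\#(\d+)(?:\.(\d+))?)?$};
    return {
        file         => $1,   # real file name known to the btred server
        tree_in_file => $2,   # N from the @N suffix, undef if absent
        tree         => $3,   # tree number from the ## position suffix
        node         => $4,   # node number from the ## position suffix
    };
}

my $parts = parse_ntred_url('ntred://filename@3##1.4');
printf "file=%s tree_in_file=%d tree=%d node=%d\n",
    @{$parts}{qw(file tree_in_file tree node)};
```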

15.6. PMLBackend

This backend provides partial support for a new generic XML-based format called PML (the Prague Markup Language), used for capturing rich linguistic annotation. PML is the primary format of the Prague Dependency Treebank 2.0. Each application of PML is described by a special XML file called a PML schema. The schema defines which elements and attributes constitute the nodes and structure of the tree, the value types of node attributes, and so on.

An up-to-date version of the full PML specification can be found on the PML project page. Note, however, that PML is an ongoing project, so the specification should be considered a work in progress.

Because of the generic nature of PML, the PMLBackend of TrEd is restricted only to those applications of PML which satisfy the following criteria:

  • There is exactly one PML sequence or PML element of a list type with PML role #TREES, and this element appears under the root element.

  • If the entity with role #TREES is a PML sequence, then all its members are elements with role #NODE. If it is a PML element of a list type, then all members of the list are PML structures with role #NODE.

  • Each PML element with role #NODE may contain a sequence of its child elements. This sequence must have PML role #CHILDNODES.

  • Each PML structure with role #NODE may contain a member of a list type, which constitutes the list of child-nodes. This member must have PML role #CHILDNODES.

TrEd represents PML data types in a straightforward way. PML structures are represented as AVS, PML lists as Fslib::List objects and PML alternatives as Fslib::Alt objects. Other non-atomic PML types, such as PML elements, text data in mixed content, attributes, etc., are represented as AVS with the following four special members: #type (type of the entity, e.g. element, text), #name (name of the entity), #ns (XML namespace), and #content (content of the entity). Attributes of a PML element are represented as additional members of the AVS representing the element.

If the PML schema declares a reference to an external resource and this declaration bears the attribute readas="dom", then PMLBackend loads the corresponding resource for the PML instance as a DOM (Document Object Model) tree (using the Perl module XML::LibXML) and attaches this DOM tree to the application data section of the in-memory representation of the file.

If the PML schema declares a reference to an external resource and this declaration bears the attribute readas="trees", then PMLBackend passes the file name of the corresponding resource to TrEd and TrEd loads it as an ordinary file. This file can be edited and treated like any other file in TrEd. In btred, this file is opened as a so-called secondary file, i.e. a file which is not implicitly processed by the macro specified by the user; since it is loaded in memory, however, the macro may explicitly choose to process it.

PMLBackend also supports so-called “knitting” of PML instances, i.e. replacing certain types of PML references with the content of the referenced entities occurring in another PML instance. Conversely, when a PML instance to which this “knitting” has been applied is saved, the (possibly edited) content replaces the content of the referenced entities in their original PML instance. Knitting only applies to:

  1. members of PML structures containing a PML reference and having PML role #KNIT,

  2. members of PML structures containing a list with PML role #KNIT, with PML references as list members.

If such a member is encountered, then a possible trailing .rf part of its name is removed and its content (one or more PML references) is replaced with the corresponding entities in the referenced PML instance. It is required that all such PML references refer to resources specified in the PML schema as readas="dom".
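
The renaming and replacement step can be pictured with a toy sketch. Everything here is an assumption made for illustration: the function knit_member is hypothetical, references are treated as opaque keys into a plain lookup hash standing in for the referenced PML instance, and the member names lex.rf and aux.rf are merely examples. The real PMLBackend works on DOM trees and PML references, which this sketch does not model.

```perl
use strict;
use warnings;

# Hypothetical helper (not PMLBackend code): replace the reference(s) held
# in $member of $struct with the referenced entities from %$lookup, and
# drop a trailing ".rf" from the member name.
sub knit_member {
    my ($struct, $member, $lookup) = @_;
    my $value = delete $struct->{$member};
    (my $knit_name = $member) =~ s/\.rf$//;
    $struct->{$knit_name} = ref($value) eq 'ARRAY'
        ? [ map { $lookup->{$_} } @$value ]   # list of references (case 2)
        : $lookup->{$value};                  # single reference (case 1)
    return $struct;
}

# Stand-in for the referenced instance: entity ID => entity content.
my %entities = ( 'a-1' => { form => 'dog' }, 'a-2' => { form => 'cat' } );
my $node = { id => 't-1', 'lex.rf' => 'a-1', 'aux.rf' => [ 'a-1', 'a-2' ] };
knit_member($node, 'lex.rf', \%entities);
knit_member($node, 'aux.rf', \%entities);
print $node->{lex}{form}, "\n";
```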

15.7. TrXMLBackend

TrXMLBackend was intended as an XML replacement for the FS format. Unfortunately, it was never fully developed or thoroughly tested. This may still happen in the future, but until then it is not recommended for serious work.

15.8. TEIXMLBackend

This backend reads and stores trees represented in a specific subset of the TEI XML format. This format was (is?) used in the Slovene Treebank Project.

15.9. Writing a new I/O backend

An I/O backend is a Perl module defining at least the following five subroutines (listed in the order in which they are typically called by TrEd):

test($filename,$encoding)

This function should only take a quick peek at the given file in order to determine whether the file is suitable for the backend. If this function accepts the file by returning a defined non-zero value (e.g. 1), then the file is processed by this backend. If the file is not suitable for the backend, this function must reject the file by returning 0 or undef, so that other backends in the list of backends can try their luck.

open_backend($filename,$mode,$encoding)

This function should open and return a filehandle for a given file. If $mode is r, the filehandle should be open for reading; if $mode is w, it should be open for writing. The third argument, $encoding, contains the encoding specified by the user in the defaultFileEncoding configuration option. This information may be ignored if the data format provides another way to determine the encoding. Most backends do not re-implement this function, but simply import (i.e. inherit) it from the base class IOBackend.

read($filehandle,$fsfile)

This is the key function that implements the conversion of data from the specific data format to the corresponding in-memory representation in TrEd. This function obtains two arguments: the $filehandle previously obtained by a call to the backend's open_backend, and an empty FSFile object (i.e. one with no trees). It is supposed to parse the data format, build a tree representation of the data (usually using functions such as FSNode->new() and Fslib::Paste($child,$parent,$fsfile->FS)) and populate the FSFile with the resulting trees (e.g. using its changeTrees method). It should also set up the FSFormat object associated with the $fsfile ($fsfile->FS). Any additional information related to the file (but not representable as trees or FSFormat) may be attached to the file, e.g. using $fsfile->changeMetaData($key,$value).

write($filehandle,$fsfile)

This function is the opposite of read. By examining the FSFile object $fsfile (especially its trees and meta data), it should write the corresponding representation in the specific data format to the given $filehandle.

close_backend($filehandle)

This function should close a given filehandle created by a previous call to open_backend. It usually consists only of calling the Perl function close on the filehandle, but if additional cleanup is necessary, it should be done here. Most backends do not re-implement this function, but simply import (i.e. inherit) it from the base class IOBackend.
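
The five subroutines can be put together in a minimal skeleton such as the one below. It is a sketch under several loudly stated assumptions, not a working TrEd backend: the package name MyFormatBackend and its one-line-per-tree format with a #MYFORMAT signature are invented; StubFSFile is a stand-in that only mimics the changeTrees and changeMetaData methods so the example runs without TrEd (passing the trees to changeTrees as a list is an assumption); trees are built as plain hashes, where a real backend would create FSNode objects; and open_backend and close_backend use plain Perl open and close instead of being inherited from IOBackend.

```perl
use strict;
use warnings;

# Stand-in for TrEd's FSFile, present only so that the sketch is runnable.
package StubFSFile;
sub new            { my $class = shift; return bless { trees => [], meta => {} }, $class }
sub changeTrees    { my $self = shift; $self->{trees} = [@_]; return }
sub changeMetaData { my ($self, $key, $value) = @_; $self->{meta}{$key} = $value; return }

package MyFormatBackend;    # hypothetical backend

# Accept only files whose first line carries the made-up "#MYFORMAT" signature.
sub test {
    my ($filename, $encoding) = @_;
    open(my $fh, '<', $filename) or return 0;
    my $first = <$fh>;
    close($fh);
    return (defined $first && $first =~ /^#MYFORMAT\b/) ? 1 : 0;
}

# Real backends usually inherit this from IOBackend; here a plain open().
sub open_backend {
    my ($filename, $mode, $encoding) = @_;
    open(my $fh, ($mode eq 'w' ? '>' : '<'), $filename) or return;
    return $fh;
}

# Each non-comment line "root child child ..." becomes one two-level tree.
sub read {
    my ($fh, $fsfile) = @_;
    my @trees;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^#/ || $line =~ /^\s*$/;
        my ($root, @children) = split ' ', $line;
        # A real backend would build FSNode objects and link them together.
        push @trees, { form => $root, children => [ map { +{ form => $_ } } @children ] };
    }
    $fsfile->changeTrees(@trees);
    $fsfile->changeMetaData('tree_count', scalar @trees);
    return 1;
}

# The opposite of read: serialize the trees back into the line format.
# (Reaches into the stub's internals; a real FSFile has accessor methods.)
sub write {
    my ($fh, $fsfile) = @_;
    print {$fh} "#MYFORMAT 1\n";
    for my $tree (@{ $fsfile->{trees} }) {
        print {$fh} join(' ', $tree->{form}, map { $_->{form} } @{ $tree->{children} }), "\n";
    }
    return 1;
}

sub close_backend {
    my ($fh) = @_;
    return close($fh);
}
```

The call order mirrors the list above: test decides whether the backend applies, open_backend yields the filehandle, read or write does the conversion, and close_backend releases the handle.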

There are several ways to make TrEd know about a user-defined I/O backend, namely: