This section briefly introduces the existing I/O backends and then provides a quick introduction to writing a custom backend.
Strictly speaking, this is not an I/O backend. It is a base class and a
function library used by Treex::PML and many of the other I/O backends.
It provides a simple abstraction layer over some common low-level tasks,
such as pipeline redirection, gzip compression, URL resolving, and
fetching and uploading files using remote protocols. Thanks to this
backend, TrEd transparently handles gzip compression and decompression
of files with the .gz extension as well as remote file transfer over
ftp://, http://, ssh:// and other protocols. Note, however, that the
availability of remote file transfer depends heavily on the particular
setup (currently only UNIX systems are fully supported) and may require
some external tools, such as curl, kioclient, ssh, and gzip.
The FS backend deals with a format called FS (feature structure). FS
was the first format supported by TrEd, and for a long time also the
only one. As a result, some methods of the Perl classes used by TrEd
and defined in the underlying Perl library Treex::PML still bear the
name of the format (even though they are now used equally to represent
data obtained from any other I/O backend).
The FS format provides a simple and effective way of storing trees.
In this format, each node of the tree uses the same set of attributes
declared in the FS-file header. The FS format supports string values,
enumerated values and flat lists of these (i.e. strings consisting of a
|-separated list of values of the first two types). There is no direct
support for nested AVS structures, complex lists and alternatives.
Recent versions of TrEd and Treex::PML provide a simple emulation of
nested AVS (attribute-value structures) for FS attribute values, meaning
the following: an attribute whose name contains one or more slashes is
represented as a (possibly nested) AVS structure where each slash
represents one level of nesting. Attributes sharing a common name-part
followed by a slash are thus represented as members of the same
structure. For example, attributes a, b/u/x, b/v/x and b/v/y result in
the following structure:
{
  a => value-of-a,
  b => { u => { x => value-of-b/u/x },
         v => { x => value-of-b/v/x,
                y => value-of-b/v/y }
       }
}
If attributes named both a and a/b exist, the nested-AVS emulation is
abandoned and all attributes are represented literally (i.e.
with slashes in their names).
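The mapping from slash-separated attribute names to nested structures can be illustrated with a short Perl sketch. This is not the code used by Treex::PML; the helper name nest_attributes and the plain-hash representation are assumptions made only for this illustration. It nests attributes on each slash and falls back to the literal names when one attribute name is also a prefix of another (such as a and a/b).

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper (not part of Treex::PML): turn a flat hash of
# FS attributes into a nested hash, one level of nesting per slash.
sub nest_attributes {
    my ($flat) = @_;

    # Conflict check: if some name is also a proper prefix of another
    # name (e.g. both "a" and "a/b"), give up and keep literal names.
    for my $name (keys %$flat) {
        my @parts = split m{/}, $name;
        pop @parts;                      # drop the leaf component
        my $prefix = '';
        for my $part (@parts) {
            $prefix = length($prefix) ? "$prefix/$part" : $part;
            return { %$flat } if exists $flat->{$prefix};
        }
    }

    # No conflict: build the nested structure.
    my %nested;
    for my $name (keys %$flat) {
        my @parts = split m{/}, $name;
        my $leaf  = pop @parts;
        my $slot  = \%nested;
        $slot = $slot->{$_} //= {} for @parts;   # descend, creating levels
        $slot->{$leaf} = $flat->{$name};
    }
    return \%nested;
}

$Data::Dumper::Sortkeys = 1;
print Dumper(nest_attributes({
    'a'     => 'value-of-a',
    'b/u/x' => 'value-of-b/u/x',
    'b/v/x' => 'value-of-b/v/x',
    'b/v/y' => 'value-of-b/v/y',
}));
```

Calling the helper with both a and a/b present returns a copy of the flat hash unchanged, which mirrors the fallback to literal attribute names.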
Even with the recently added nested-AVS emulation, the FS format is generally not capable of fully capturing data originating in other backends such as PML.
The FS format is fully described at http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/Doc/fs.html.
This backend is based on the excellent Perl module Storable, which can
store and retrieve almost arbitrary Perl structures in an instant.
Because of its general nature, this backend can save and keep intact
files originating from any other backend. Because it saves and
retrieves data so quickly, it is often useful to convert all data
temporarily into this format (e.g. using btred) and revert to the
original format only after all work on the data is done.
This backend provides support for the CSTS format. CSTS stands for
“Czech sentence tree structure”. Since it is an application of SGML,
the backend requires an external SGML parser (namely nsgmls) and a
document type definition (DTD) (csts.doctype). This backend represents
CSTS data as trees with a fixed set of attributes, specific to the
purpose for which CSTS was created, namely morphological and
syntactical annotation of Czech texts. CSTS was the primary format of
the Prague Dependency Treebank 1.0.
This backend is used for exchanging data between TrEd and btred
servers, or in other words to “peek” into the memory of running btred
servers. This backend only accepts filenames (or rather URLs) starting
with the ntred:// protocol specification followed by a real file name.
When this backend opens a file, it uses the ntred client to fetch the
file from any currently running btred server. If some btred server has
an in-memory copy of the specified file (possibly edited during
previous ntred requests), it sends this copy to the client and via the
NTRED backend to TrEd. Saving is performed in the reverse way, i.e.
the ntred client is used to communicate the (possibly edited) file back
to the btred server, which replaces its previous in-memory copy of the
file with the file obtained in this way.
Instead of requesting a whole file from the servers, it is also
possible (and sometimes faster) to request only a single tree. This can
be achieved by appending a suffix of the form @N to the ntred:// URL,
where N is the number of the requested tree (counting from 1). In that
case, TrEd obtains a file containing only the requested tree. So, a URL
such as ntred://filename@3##1.4 opens a file containing the 3rd tree in
filename (as represented in the memory of a btred server) and opens
that file on the 4th node of its first and only tree. URLs of this form
are produced by the TredMacro function NPosition()
(see Section 15.8, “Public API: pre-defined macros”).
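To make the URL anatomy concrete, the following Perl sketch splits such a URL into its parts. The helper parse_ntred_url and the key names it returns are hypothetical and are not part of TrEd or ntred; the pattern only covers the form shown in the example above (a file name, an optional @N tree selector and an optional ##tree.node position).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of TrEd): split an ntred:// URL of the
# form ntred://FILE[@N][##TREE[.NODE]] into its components.
sub parse_ntred_url {
    my ($url) = @_;
    my ($file, $tree, $pos_tree, $pos_node) =
        $url =~ m{^ntred://(.+?)(?:\@(\d+))?(?:\#\#(\d+)(?:\.(\d+))?)?$}
        or return;    # not an ntred:// URL
    return {
        file     => $file,      # real file name known to the btred server
        tree     => $tree,      # requested tree, from 1 (undef: whole file)
        pos_tree => $pos_tree,  # tree to open within the fetched file
        pos_node => $pos_node,  # node to open within that tree
    };
}

my $parts = parse_ntred_url('ntred://filename@3##1.4');
printf "file=%s tree=%s position=%s.%s\n",
    @{$parts}{qw(file tree pos_tree pos_node)};
```

For the example URL this yields the file name filename, the requested tree 3, and the position 1.4 within the single-tree file that TrEd receives.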
This backend provides support for a generic XML-based format called PML (the Prague Markup Language), used for capturing rich linguistic annotation. PML is the basis of the data format of the Prague Dependency Treebank 2.0.
Since all XML data can be easily transformed into PML and back
(usually with a few lines of XSLT), this backend also provides
a bridge between any other XML data format and TrEd, as described below
in Section 16.6.2, “Support for non-PML XML-based formats”.
Each application of PML is described using a special XML file called a PML schema. This schema file defines which elements and attributes constitute the nodes and structure of the tree, declares value types of node attributes, etc.
An up-to-date version of the full PML specification can be found on the PML project page.
While PML can in principle capture all kinds of structured data, the
PML backend of TrEd is limited to those applications of PML which
satisfy the following criteria:
The data contain exactly one PML sequence or PML list with the PML role #TREES consisting of data with the role #NODE. In the case of a PML sequence, the #NODE elements must form a contiguous block in the sequence, but may be preceded and/or followed by some non-#NODE elements.
Of all PML data types, only PML structures and PML containers may bear the role #NODE.
A structure with the role #NODE may have a member with the role #CHILDNODES, containing a list or sequence of structures/containers with the role #NODE.
A container with the role #NODE may contain a list or sequence with the role #CHILDNODES consisting of structures/containers with the role #NODE.
The PML backend has its own configuration file, pmlbackend_conf.xml,
which is looked for in the directories listed in ResourcePath.
The transform_map section of the configuration file is automatically
merged with all files named pmlbackend_conf.inc found in ResourcePath
(e.g. in a resource subdirectory of an extension package).
The configuration file may specify transformation rules for
transparent conversion from a legacy XML format to PML and
back. The configuration file is a PML file and may look
as in the following example:
<?xml version="1.0"?>
<pmlbackend xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
  <head>
    <schema href="pmlbackend_conf_schema.xml"/>
  </head>
  <transform_map>
    <!-- Transformation for Alpino XML format. -->
    <transform id="alpino" test="alpino_ds[@version='1.1']">
      <in type="xslt" href="alpino2pml.xsl"/>
      <out type="xslt" href="pml2alpino.xsl"/>
    </transform>
    <!-- Transformation for TEI P5 XML format. -->
    <transform id="tei" test="TEI.2">
      <in type="xslt" href="tei2pml.xsl">
        <param name="only_fLib">'foo bar'</param>
      </in>
      <out type="xslt" href="pml2tei.xsl"/>
    </transform>
  </transform_map>
</pmlbackend>
This example configuration file defines two XSLT-based
transformations (type="xslt", i.e. XSLT 1.0, is currently the only type
of transformation implemented by the PML backend), each of which
consists of an XSLT stylesheet declared in the <in> tag, used by the
PML backend to convert documents from the original XML format to PML,
and an XSLT stylesheet declared in the <out> tag, used by the PML
backend to convert documents from PML back to the original XML format.
Each stylesheet can take zero or more parameters specified as
<param name="parameter-name">parameter-value</param>, where
parameter-value is an XPath expression evaluated in the context of the
transformed document by the XSLT processor.
The transformations are further conditioned by an XPath expression in
the test attribute, which selects the documents to which the
transformation is applicable.
When the PML backend opens an XML document and detects that this
document does not belong to the PML namespace, it evaluates
the XPath expression test for every transformation rule in the order
in which the rules appear in the configuration file, until one of the
expressions returns a true value (boolean true, non-zero number,
non-empty node-set, or non-empty string) or the last expression fails.
The input stylesheet of the first transformation whose test succeeded
is used to transform the document into PML. The id of this
transformation is remembered and the output stylesheet of the same
transformation is used to convert back from PML when
the document is saved.
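The rule selection just described can be sketched in Perl using XML::LibXML. This is an illustration only, not the PML backend's actual code: the function select_transform and the in-memory rule list are assumptions, and wrapping each test in the XPath boolean() function is simply one convenient way to apply the truth rules listed above. The rule ids and test expressions are taken from the example configuration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Rules in configuration-file order (ids and tests from the example
# configuration file).
my @rules = (
    { id => 'alpino', test => q{alpino_ds[@version='1.1']} },
    { id => 'tei',    test => q{TEI.2} },
);

# Hypothetical helper: return the id of the first rule whose test is
# true for $doc, or undef if no rule applies.
sub select_transform {
    my ($doc, @candidates) = @_;
    for my $rule (@candidates) {
        # boolean() maps booleans, numbers, node-sets and strings to
        # true/false according to the XPath 1.0 rules.
        my $result = $doc->find("boolean($rule->{test})");
        return $rule->{id} if $result->value;
    }
    return;
}

my $doc = XML::LibXML->load_xml(string => '<alpino_ds version="1.1"/>');
my $id  = select_transform($doc, @rules);
print defined $id ? "use transformation '$id'\n" : "no rule applies\n";
```

The returned id plays the role of the remembered transformation id: on saving, it would identify the rule whose output stylesheet converts the document back from PML.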
We now summarize the steps necessary for adding support for a new XML-based format to TrEd (via XSLT and the PML backend):
Write a PML schema for the resulting PML version of the data so that all necessary information stored in the original format is captured.
Write an XSLT transformation from the original format to the PML format described by the previously written PML schema.
Write an XSLT transformation from the PML format to the original format. This step is not necessary if one only wishes to open the documents in TrEd for reading.
Create a pmlbackend_conf.xml in one of the ResourcePath directories, unless it already exists, and add a transformation rule to it with the input and output XSLT stylesheets and an XPath test that matches documents in the format. If writing an extension, one can instead create a file pmlbackend_conf.inc in the resources directory of the extension.
Instead of specifying an output XSLT stylesheet, one may
also define an identity output transformation,
which simply writes back the data in PML.
In that case the out tag should look as follows:
<out type="identity"/>
This section describes how PML data types are represented in TrEd.
A PML schema is represented by a Treex::PML::Schema object. This object
can be retrieved from the current Treex::PML::Document using the macro
MetaData('schema').
PML structures are represented as Treex::PML::Struct objects,
and so are PML containers, whose attributes and content value become
members of the Treex::PML::Struct object, the content value
being represented by a special member named #content. In both cases,
if the role is #NODE, a Treex::PML::Node object is used instead of
Treex::PML::Struct.
PML lists are represented as Treex::PML::List objects and PML
alternatives as Treex::PML::Alt objects. PML sequences are represented
as Treex::PML::Seq objects and their elements as
Treex::PML::Seq::Element objects. The only exceptions are a PML
sequence with the role #TREES, which is represented by the list of
the trees of the Treex::PML::Document object, and a PML sequence with
the role #CHILDNODES, which is represented by the child nodes of the
node it belongs to. Elements of these sequences are therefore
represented by Treex::PML::Node (rather than Treex::PML::Seq::Element)
objects with a dedicated attribute #name carrying the element's name.
The non-tree data structures contained in the root element of
the PML instance can be obtained from MetaData('pml_root'). If the
root element contains a sequence with the role #TREES,
MetaData('pml_root') is empty, non-tree members of the sequence
preceding the trees are stored in a sequence MetaData('pml_prolog'),
and non-tree members that follow all trees are stored in a sequence
MetaData('pml_epilog').
PML structures/containers that have a member or attribute
with the role #ID are indexed by their IDs in a hash,
AppData('id-hash').
If the PML schema declares a reference to an external resource
and this declaration has the attribute readas="dom", then the
PML backend loads the corresponding PML instance
as a DOM (Document Object Model) tree (using the Perl module
XML::LibXML) and attaches this DOM tree
to the application data section of the in-memory
representation of the file.
If the PML schema declares a reference to an external resource
and this declaration has the attribute readas="pml", then the
PML backend loads the corresponding PML instance
as a Treex::PML::Instance object and attaches this object
to the application data section of the in-memory
representation of the file.
If the PML schema declares a reference to an external resource
and this declaration bears the attribute readas="trees", then the
PML backend passes the file name of the corresponding resource to
TrEd and TrEd loads it as an ordinary file. This file can be edited
and treated as any other file in TrEd. In btred, this file is opened
as a so-called secondary file, i.e. a file which is not implicitly
processed by the macro specified by the user, but since it is loaded
in memory, the macro may explicitly choose to process it.
The following hash references carry information about
external resources: MetaData('refnames') maps reference names to
reference IDs, MetaData('references') maps reference IDs to URLs,
and, for DOM and Treex::PML::Instance resources, AppData('ref') maps
reference IDs to objects representing the resources in TrEd.
The PML backend supports so-called “knitting” of PML instances, i.e.
replacing certain types of PML references with the content of the
referenced entities occurring in another PML instance.
Conversely, when a PML instance to which this “knitting” has been
applied is saved, the (possibly edited) content replaces the content
of the referenced entities in their original PML instance.
Knitting only applies to:
members of PML structures containing a PML reference and having the PML role #KNIT,
members of PML structures containing a list with the PML role #KNIT, with PML references as list members.
If knitting applies to such a member, then a possible trailing
.rf part of its name is stripped and its
content (either a single PML reference or a list of them) is
replaced with the corresponding entities in the referenced PML
instance. It is required that all such PML references refer to
resources specified in the PML schema either as readas="dom"
(in which case a Treex::PML representation of the referred data
structure is created and transformed back into DOM only at save) or
readas="pml" (in which case, after knitting, the referred and
referring PML instances share the knitted data structure).
The TrXML backend was intended as an XML replacement of the FS format.
Unfortunately, it was never fully developed and thoroughly tested. It
is obsoleted by PML and therefore not recommended for any future work.
This backend reads and stores trees represented in a specific subset of the TEI XML format. The format used by this backend was (is?) used in the Slovene Treebank Project.
While for XML-based formats it is recommended to use an XSLT transformation to PML as described in Section 16.6.2, “Support for non-PML XML-based formats”, adding support for other formats requires writing a new I/O backend. An I/O backend is a Perl module defining at least the following five subroutines (listed in the order in which they are typically called by TrEd):
test($filename,$encoding)
This function should only quickly peek into the given file
in order to determine whether the file is suitable for the
backend. If this function accepts the file by returning a
defined non-zero value (e.g. 1), the file is
processed by this backend. If the file is not suitable
for the backend, this function must reject the file
by returning 0 or undef, so that other backends in the list of
backends can try their luck.
open_backend($filename,$mode,$encoding)
This function should open and return a filehandle for a given
file. If $mode is r, this filehandle should be open for reading;
if $mode is w, it should be open for writing.
The third argument, $encoding, contains the encoding specified by the
user in the defaultFileEncoding configuration option. This information
may be ignored if the data format provides another way to determine
the encoding. Most backends do not re-implement this
function, but simply import (i.e. inherit) it from the
base class Treex::PML::IO.
read($filehandle,$fsfile)
This is the key function that implements converting data
from the specific data format to the corresponding
in-memory representation in TrEd.
This function obtains two arguments: the $filehandle previously
obtained by a call to the backend's open_backend,
and an empty Treex::PML::Document object (i.e.
with no trees). It is supposed to parse the data format,
build a tree representation of the data (usually using
functions such as Treex::PML::Factory->createNode() and
$child->paste($parent,$ordering_attribute))
and populate the Treex::PML::Document
with the resulting trees (e.g. using its changeTrees method).
It should also set up the Treex::PML::FSFormat object associated with
the $fsfile ($fsfile->FS).
Any additional information related to the file (but not representable
as trees or Treex::PML::FSFormat) may be attached to the file e.g.
using $fsfile->changeMetaData($key,$value).
write($filehandle,$fsfile)
This function is the opposite of read. By examining the
Treex::PML::Document object $fsfile
(especially its trees and meta data),
it should write the corresponding representation
in the specific data format to the given $filehandle.
close_backend($filehandle)
This function should close a given filehandle
created by a previous call to open_backend. It usually only
consists of applying the Perl function close to the filehandle, but if
additional cleanup is necessary, it should be done here.
Most backends do not re-implement this function, but
simply import (i.e. inherit) it from the base class
Treex::PML::IO.
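As an illustration of the first of these subroutines, here is a minimal sketch of a test function in Perl. The file format is invented for the example: it is assumed that suitable files start with a line beginning with #MYTREES. The sketch also assumes a plain local file; in a real backend the subroutine would live in the backend's own package and might need to handle compressed or remote files as well.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Sketch of a backend's test() for a hypothetical format whose files
# start with a "#MYTREES" header line. Only the first line is read.
sub test {
    my ($filename, $encoding) = @_;          # $encoding is not needed here
    open my $fh, '<', $filename or return 0; # unreadable file: reject
    my $first_line = <$fh>;
    close $fh;
    # Accept with 1, reject with 0 so that other backends may try.
    return (defined $first_line && $first_line =~ /^#MYTREES\b/) ? 1 : 0;
}

# Usage: create a small sample file and ask the backend about it.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "#MYTREES 1.0\n";
close $out;
print test($path, 'utf-8') ? "accepted\n" : "rejected\n";
```

Keeping the check to a single line follows the advice that test should only peek into the file; the full parsing belongs in read.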
There are several ways to make TrEd aware of a user-defined I/O backend, namely:
listing additional backends in the Treex::PML::IOs configuration option,
listing additional backends after -B on the command line (see Section 14, “Command-line options”),
defining a get_backends_hook