Treex::PML::Instance - Perl extension for loading/saving PML data
  use Treex::PML::Instance;
  Treex::PML::AddResourcePath( "$ENV{HOME}/my_pml_schemas" );
  my $pml = Treex::PML::Instance->load({ filename => 'foo.xml' });
  my $schema = $pml->get_schema;
  my $data   = $pml->get_root;
  $pml->save();
This class provides a simple implementation of a PML instance.
None by default.
The following export tags are available:
Imports the following constants:
name of the "<LM>" (list-member) tag
name of the "<AM>" (alt-member) tag
XML namespace URI for PML instances
XML namespace URI for PML schemas
space-separated list of supported PML-schema version numbers
Imports internal _die, _warn, and _debug diagnostics commands.
The option 'config' of the methods load() and save() can provide a parsed configuration file. The configuration file is a PML instance whose PML schema is defined in the file pmlbackend_conf_schema.xml, distributed with Treex::PML in Treex/PML/Backend/pmlbackend_conf_schema.xml.
This file can set defaults for some options of load() and save() and it can also define rules for pre-processing the input documents before parsing them as PML and for post-processing the output documents after serializing them as PML. Currently only XSLT 1.0, Perl and external-command pre-processing and XSLT 1.0 post-processing are implemented.
The PMLTransform backend, when initialized (e.g. by calling AddBackend('PMLTransform')), automatically loads the first configuration file named pmlbackend_conf.xml it finds in the Treex::PML resource paths. Additionally, it searches the resource paths for all configuration files named pmlbackend_conf.inc and merges their transformation rules into the in-memory image of the main configuration file. PMLTransform then uses the resulting configuration for all load/save operations.
IMPORTANT NOTE: it is recommended to add the PMLTransform backend as the last I/O backend, since its test() method automatically accepts any XML file (with the prospect of attempting to transform it during the read() phase)! It must therefore be added to the I/O backends list after all other backends working with XML-based formats.
Here is an example of a configuration file (see the schema for more details).
  <?xml version="1.0" encoding="utf-8"?>
  <pmlbackend xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
    <head>
      <schema href="pmlbackend_conf_schema.xml"/>
    </head>
    <options>
      <load>
        <validate_cdata>1</validate_cdata>
        <use_resources>1</use_resources>
      </load>
      <save>
        <indent>4</indent>
        <validate_cdata>1</validate_cdata>
        <write_single_LM>1</write_single_LM>
      </save>
    </options>
    <transform_map>
      <transform id="alpino" test="alpino_ds[@version='1.1' or @version='1.2']">
        <in type="xslt" href="alpino2pml.xsl"/>
        <out type="xslt" href="pml2alpino.xsl"/>
      </transform>
      <transform id="sdata" root="sdata" ns="http://ufal.mff.cuni.cz/pdt/pml/">
        <in type="perl" command="require SDataMerge; return SDataMerge::transform(@_);"/>
      </transform>
      <transform id="tei" test="*[namespace-uri()='http://www.tei-c.org/ns/1.0']">
        <in type="pipe" command="tei2pml.sh">
          <param name="--stdin" />
          <param name="--stdout" />
        </in>
      </transform>
    </transform_map>
  </pmlbackend>
NOTE: Don't call this constructor directly, use Treex::PML::Factory->createPMLInstance() instead!
Create a new empty PML instance object.
NOTE: Don't call this method as a constructor directly, use Treex::PML::Factory->createPMLInstance() instead!
Read a PML instance from file, filehandle, string, or DOM. This method may be used both on an existing object (in which case it operates on and returns this object) or as a constructor (in which case it creates a new Treex::PML::Instance object and returns it). Possible options are:
  {
    filename => $filename,
    # and/or
    fh => \*FH,
    # or
    string => $xml_string,
    # or
    dom => $document,           # (XML::LibXML::Document)
    config => $cfg_pml,         # (Treex::PML::Instance)
    parser_options => \%opt,    # (XML::LibXML parser options)
    no_trees => $bool,
    no_references => $bool,
    no_knit => $bool,
    selected_references => { name => $bool, ... },
    selected_knits => { name => $bool, ... }
  }
where filename may be used either by itself or in combination with any of fh, string, or dom, which are otherwise mutually exclusive. The config option may be used to pass a Treex::PML::Instance with the parsed PML backend configuration file (see CONFIGURATION). The parser_options option may be used to pass a HASH reference containing options for the XML::LibXML parser (depending on implementation, these will be used to configure either an XML::LibXML::Reader or an XML::LibXML::Parser). If no_trees is true, then the roles #TREES, #NODE and #CHILDNODES are ignored.
The option selected_references determines which reffiles (with non-empty readas attribute) to read: if the value for a given name is true, the reffile with that name is read; if it is false, the reffile is never read; if no value is given for some reffile, the reffile is read unless the no_references flag is on. The options selected_knits and no_knit determine from which reffiles data can be copied into this document following the rules for the role #KNIT. Their meaning is just like that of selected_references and no_references. Moreover, no_references implies no_knit, unless no_knit is explicitly specified.
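As a minimal sketch of how these options combine, the following builds an options hash that skips all reference files except one. The file name foo.xml and the reference name adata are hypothetical; the load is only attempted when Treex::PML is installed and the file exists.

```perl
use strict;
use warnings;

# Options as described above. 'foo.xml' and the reference name 'adata'
# are hypothetical placeholders, not names defined by Treex::PML.
my %load_opts = (
    filename            => 'foo.xml',
    no_trees            => 0,
    no_references       => 1,                # skip reffiles by default ...
    selected_references => { adata => 1 },   # ... except the one named 'adata'
);

# Attempt the load only if the module and the input file are available.
if ( -e $load_opts{filename} && eval { require Treex::PML::Instance; 1 } ) {
    my $pml = Treex::PML::Instance->load( \%load_opts );
    print "root is ", ( defined $pml->get_root ? "defined" : "undefined" ), "\n";
}
else {
    print "skipping load: Treex::PML or $load_opts{filename} not available\n";
}
```

Because no_knit is not given explicitly here, no_references => 1 also implies no_knit, as noted above.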
Returns 1 if the last load() was successful.
Save PML instance to a file or file-handle. Possible options are: filename, fh, config, refs_save, write_single_LM. If both filename and fh are specified, fh is used, but the filename associated with the Treex::PML::Instance object is changed to filename. If neither is given, the filename currently associated with the Treex::PML::Instance object is used. The config option may be used to pass a Treex::PML::Instance representing the parsed PML backend configuration file (see CONFIGURATION).
The refs_save option may be used to specify which reference files should be saved along with the Treex::PML::Instance and where to. The value of refs_save, if given, should be a HASH reference mapping reference IDs to the target URLs (filenames). If refs_save is given, only those references listed in the HASH are saved along with the Treex::PML::Instance. If refs_save is undefined or not given, all references are saved (to their original locations). In both cases, only files declared as readas='dom' or readas='pml' can be saved.
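A hedged sketch of saving an instance together with one selected reference file follows. It assumes that save() takes its options as a HASH reference, in the same way as load(); the file names and the reference ID 'a' are hypothetical.

```perl
use strict;
use warnings;

my $source = 'foo.xml';    # hypothetical input file

# refs_save maps reference IDs to target URLs (filenames).
# The ID 'a' and both target names are made up for illustration.
my %save_opts = (
    filename  => 'foo_copy.xml',
    refs_save => { a => 'foo_copy.a.xml' },
);

# Run only when Treex::PML is installed and the input file exists.
if ( -e $source && eval { require Treex::PML::Instance; 1 } ) {
    my $pml = Treex::PML::Instance->load( { filename => $source } );
    # Assumption: options are passed as a hash reference, as with load().
    $pml->save( \%save_opts );
}
else {
    print "skipping save: Treex::PML or $source not available\n";
}
```

With this hash, only the reference with ID 'a' would be written alongside the instance, and only if it is declared as readas='dom' or readas='pml'.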
Translates the current Treex::PML::Instance object to a Treex::PML::Document object (using Treex::PML::Document MetaData and AppData fields for storage of non-tree data). If fsfile argument is not provided, creates a new Treex::PML::Document object, otherwise operates on a given fsfile. Returns the resulting Treex::PML::Document object.
Translates a Treex::PML::Document object to a Treex::PML::Instance object. Non-tree data are fetched from Treex::PML::Document MetaData and AppData fields. If called on an instance, modifies and returns the instance, otherwise creates and returns a new instance.
Retrieve a possibly nested value from the attribute data structure of $obj. The path argument uses an XPath-like expression of the form
step1/step2/...
where each step (depending on the value retrieved by the preceding part of the expression) can be one of:
name (of a structure member): to retrieve that member
name (of a container attribute): to retrieve that attribute
name (of a sequence element): to retrieve the first element of that name
[n]: to retrieve the n-th element (counting from 1) from a list, sequence, or an alternative
name[n]: to retrieve the n-th element named 'name' from a sequence
[n]name: to retrieve the n-th element of a sequence provided the n-th element's name is 'name'
In the preceding cases, [n] can be negative, in which case the retrieved value is the n-th element from the end of the list or sequence.
If a step of the form [n] is not given for a list or alternative value then [1] is assumed and the next step is processed.
If the value retrieved by some step is undefined or the step does not match the data type of the value retrieved by the preceding steps, the evaluation is stopped and undef is returned.
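The indexing rules above can be illustrated with a toy resolver over plain Perl data. This is only a sketch, not the library's implementation: toy_get_data is a hypothetical helper, HASH references stand in for structures and ARRAY references for lists, and sequences, alternatives and the name[n] and [n]name steps are not modelled.

```perl
use strict;
use warnings;

# Toy stand-in for get_data (hypothetical, plain Perl data only):
# HASH refs play the role of structures, ARRAY refs the role of lists.
sub toy_get_data {
    my ( $value, $path ) = @_;
    for my $step ( split m{/}, $path ) {
        return undef unless defined $value;
        if ( $step =~ /^\[(-?\d+)\]$/ ) {
            # [n]: 1-based index; a negative n counts from the end
            my $n = $1;
            return undef unless ref($value) eq 'ARRAY' && $n != 0;
            my $len = @$value;
            my $i = $n > 0 ? $n - 1 : $n;
            return undef if $i >= $len || $i < -$len;
            $value = $value->[$i];
        }
        else {
            # A name step applied to a list: [1] is assumed first
            $value = $value->[0] if ref($value) eq 'ARRAY';
            return undef unless ref($value) eq 'HASH' && exists $value->{$step};
            $value = $value->{$step};
        }
    }
    return $value;
}

my $obj = { foo => [ { bar => 'first' }, { bar => 'second' } ] };

my $second  = toy_get_data( $obj, 'foo/[2]/bar' );     # explicit index
my $last    = toy_get_data( $obj, 'foo/[-1]/bar' );    # counted from the end
my $first   = toy_get_data( $obj, 'foo/bar' );         # [1] assumed
my $missing = toy_get_data( $obj, 'foo/[3]/bar' );     # out of range: undef

print "$second $last $first ", ( defined $missing ? 'defined' : 'undef' ), "\n";
```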
For example,
my $value = Treex::PML::Instance::get_data($obj,'foo/bar[2]/[-4]/baz/[5]bam');
is roughly equivalent to
  my $el = $obj->{foo}->values('bar')->[1]->[-4]->{baz}->[4];
  my $value = $el->name eq 'bam' ? $el->value : undef;
but without the side effect of creating array or hash structures where there is none. To be more specific, if, say $obj->{x} is not defined, then the Perl expression
if ($obj->{x}[3]{y}) {...}
automatically causes a side-effect of creating an ARRAY reference in $obj->{x} and a HASH reference in the fourth element of this ARRAY. An analogous construct
Treex::PML::Instance::get_data($obj,'foo/[4]/baz');
simply returns undef without either of these side-effects.
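The autovivification side effect described above can be observed with plain Perl data and no Treex::PML objects. The sketch below first triggers it and then shows a guarded lookup that leaves the data untouched; the variable names are illustrative only.

```perl
use strict;
use warnings;

# 1. Merely testing a nested value creates the intermediate structures.
my $obj = {};
if ( $obj->{x}[3]{y} ) { }
my $created_array = ref( $obj->{x} );       # 'ARRAY'
my $created_hash  = ref( $obj->{x}[3] );    # 'HASH'

# 2. A guarded lookup checks each level before dereferencing it,
#    so nothing is created when a level is missing.
my $safe = {};
my $y = ( ref( $safe->{x} ) eq 'ARRAY' && ref( $safe->{x}[3] ) eq 'HASH' )
    ? $safe->{x}[3]{y}
    : undef;

print "autovivified: $created_array, $created_hash\n";
print "guarded lookup left the hash ",
    ( exists $safe->{x} ? "modified" : "untouched" ), "\n";
```

A path-based accessor has to perform the equivalent of the guarded check at every step, which is what makes it safe to call on partially filled data.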
The following behave the same (provided that the path /foo/bar[2] retrieves a list, sequence or an alternative and /foo/bar[2]/[1]/baz retrieves a sequence):
  my $value = Treex::PML::Instance::get_data($obj,'foo/bar[2]/[1]/baz/[1]bam');
  my $value = Treex::PML::Instance::get_data($obj,'foo/bar[2]/baz/bam');
This function returns all matches of a given attribute path on the object. It works just as Treex::PML::Instance::get_data except that it recurses into all values of a list, alt or sequence instead of just the first one on attribute-path steps that do not give an exact index. Furthermore, unlike Treex::PML::Instance::get_data, this function expands trailing Lists and Alts: if the path leads to a List or Alt value, the values of its members are returned instead, and this replacement is applied recursively.
The expansion of trailing Lists and Alts can be prevented by appending a slash followed by a dot to the attribute path ("$path/.").
Store a given value to a possibly nested attribute of $obj specified by path. The path argument uses the XPath-like syntax described above for the method Treex::PML::Instance::get_data. If $strict==0 and a non-index step is to be processed on an alternative or list, then step [1] is assumed and the 1st element of the list or alternative is used for further processing of the path expression (except when this occurs in the last step, in which case the entire list or alternative is overwritten by the given value). If $strict==1 and a non-index step is to be processed on an alternative or list, a warning is issued and undef is returned. If $strict==2, the same approach as with $strict==1 is taken, but croak is used instead of warn.
This function traverses a given PML data structure and dispatches callbacks at all occurrences of given attribute paths.
If called on an object other than a Treex::PML::Instance (i.e. Treex::PML::Struct, Treex::PML::List, etc.), the corresponding data type (Treex::PML::Schema::* object) can be provided in the \%opts argument as
{ type => $type_decl }
The callback gets one argument: a hash reference of the form
{ value => $matched_obj, path => $matched_obj_path, type => $obj_type_decl }
where $matched_obj_path is the full canonical path to the matching object. The type key is present in the hash only if for_each_match was called on a Treex::PML::Instance or if the Treex::PML::Schema type of the initial object was given in \%opts.
The path syntax is as described in Treex::PML::Instance::get_data, with the following differences:
1. Path steps of the form [n] or name[n], where n is a number, are not supported (but steps of the form [n]name work).
2. Additionally, steps can be separated with //. As in XPath, this indicates a descendant axis, which allows arbitrary structures between the steps. That is, a//z matches any data matched by a/z, a/b/z, a/b/c/z, etc. One can also use // at the very beginning of an expression (//a/b) to match an arbitrarily nested occurrence of a/b (e.g. one matching x/y/z/a/b).
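To make the descendant axis concrete, here is a toy matcher over nested plain hashes. It is a sketch under simplifying assumptions, not the traversal used by for_each_match: all_paths and pattern_to_regex are hypothetical helpers, only HASH references are traversed, and each pattern is translated into a regular expression over canonical slash-separated paths.

```perl
use strict;
use warnings;

# Enumerate the canonical paths of all values in nested HASH refs.
sub all_paths {
    my ( $data, $prefix ) = @_;
    my @paths;
    for my $key ( sort keys %$data ) {
        my $path = defined $prefix ? "$prefix/$key" : $key;
        push @paths, $path;
        push @paths, all_paths( $data->{$key}, $path )
            if ref( $data->{$key} ) eq 'HASH';
    }
    return @paths;
}

# Translate a pattern with '//' into a regex: '//' may span any number
# of intermediate steps, including none.
sub pattern_to_regex {
    my ($pattern) = @_;
    my $any    = '(?:[^/]+/)*';
    my $prefix = ( $pattern =~ s{^//}{} ) ? $any : '';
    my $body   = join "/$any", map { quotemeta($_) } split m{//}, $pattern;
    return qr{\A$prefix$body\z};
}

my $data = {
    a => { z => 1, b => { z => 2, c => { z => 3 } } },
    x => { a => { z => 4 } },
};

my @paths     = all_paths($data);
my $inner_re  = pattern_to_regex('a//z');
my $nested_re = pattern_to_regex('//a/z');
my @inner     = grep { $_ =~ $inner_re } @paths;
my @nested    = grep { $_ =~ $nested_re } @paths;

print "a//z  matches: @inner\n";     # a/b/c/z a/b/z a/z
print "//a/z matches: @nested\n";    # a/z x/a/z
```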
This function returns all data matching given path or, if the second argument is an array reference, any of given paths. The path(s), as well as $obj and \%opts argument are as in Treex::PML::Instance::for_each_match. The function returns an array in array context and an array reference in scalar context.
Like Treex::PML::Instance::get_all_matches, but returns only the number of matching objects (without creating any intermediate list).
Hash a given object under a given ID. If warn is true, then a warning is issued if the ID was already hashed with a different object.
Lookup an object by ID.
Return the filename (string) or URL (URI object) of the PML instance.
Return URL of the PML instance as URI object.
Change filename of the PML instance.
Return ID of the XSL-based transformation specification which was used to convert between an original non-PML format and PML (and back).
Set ID of an XSL-transformation specification which is to be used for conversion from PML to an external non-PML format (and back).
Return the Treex::PML::Schema object associated with the PML instance.
Associate a Treex::PML::Schema with the PML instance (this method should not be used for an instance containing data).
Return URL of the PML schema file associated with the PML instance.
Change URL of the PML schema file associated with the PML instance.
Return the root data structure.
Set the root data structure.
Return a Treex::PML::List object containing data structures with role '#NODE' belonging in the first block (list or sequence) with role '#TREES' occurring in the PML instance.
If the PML instance consists of a sequence with role '#TREES', return a Treex::PML::Seq object containing the maximal (but possibly empty) initial segment of this sequence consisting of elements with role other than '#NODE'.
If the PML instance consists of a sequence with role '#TREES', return a Treex::PML::Seq object containing all elements of the sequence following the first maximal contiguous subsequence of elements with role '#NODE'.
Return the type declaration associated with the list of trees.
Returns a HASHref mapping file reference IDs to URLs.
Set a given HASHref as a map between reference IDs and URLs.
Returns a list of reference IDs associated with a given name.
Returns a list of hash references. Each element represents a document referenced from the current instance. The list contains only references that were associated with a name (pre-declared in the PML schema). However, a 'name' can be associated with several document references. The elements in the list returned by this method have the following keys:
the value of the 'readas' attribute of the corresponding PML schema declaration
the symbolic name of the (type of the) reference as declared in the PML schema
a URI of the target document
an ID used in the current PML instance to refer to the target document
Returns a HASHref mapping file reference names to reference IDs. Each value of the hash is either an ID string (if there is just one reference with a given name) or a Treex::PML::Alt containing all IDs associated with a given name.
Set a given HASHref as a map between reference names and reference IDs.
Return a DOM or Treex::PML::Instance object representing the referenced resource with a given ID (applies only to resources declared as readas='dom' or readas='pml').
Use a given DOM or Treex::PML::Instance object as a resource of the current Treex::PML::Instance with a given ID (note that this may break knitting).
Prague Markup Language (PML) format: http://ufal.mff.cuni.cz/jazz/PML/
Tree editor TrEd: http://ufal.mff.cuni.cz/~pajas/tred
Related packages: Treex::PML, Treex::PML::Schema, Treex::PML::Document
Copyright (C) 2006-2010 by Petr Pajas
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.2 or, at your option, any later version of Perl 5 you may have available.