Prague Markup Language (PML)
These pages are obsolete! Please go to https://ufal.mff.cuni.cz/pml instead.
Prague Markup Language (PML) is an XML-based, universally applicable
data format based on abstract data types intended primarily for
interchange of linguistic annotations. It is completely independent
of a particular annotation schema. It can capture simple linear
annotations as well as annotations with one or more richly structured
interconnected annotation layers, dependency or constituency trees. A
concrete PML-based format for a specific annotation is defined by
describing the data layout and XML vocabulary in a special file called
PML Schema and referring to this schema file from individual data
files (instances). The schema can be used to validate the
instances. It is also used by applications to "understand" the
structure of the data and to choose optimal in-memory representation.
The generic nature of PML makes it very easy to convert data
from other formats to PML without loss of information.
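The link between an instance and its PML Schema can be illustrated with a short Python sketch. It assumes that the instance names its schema in a head/schema element with an href attribute and that instance elements live in the namespace http://ufal.mff.cuni.cz/pdt/pml/; both details, the sample document, and the function name schema_href are illustrative assumptions and should be checked against the PML specification.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of PML instance documents (verify against the specification).
PML_NS = "http://ufal.mff.cuni.cz/pdt/pml/"

# Hypothetical, heavily trimmed instance used only for illustration.
SAMPLE_INSTANCE = """<adata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
  <head>
    <schema href="adata_schema.xml"/>
  </head>
  <trees/>
</adata>
"""


def schema_href(instance_xml):
    """Return the href of the schema named in the instance header, or None."""
    root = ET.fromstring(instance_xml)
    # ElementTree addresses namespaced elements as {namespace}localname.
    schema = root.find(f"{{{PML_NS}}}head/{{{PML_NS}}}schema")
    return None if schema is None else schema.get("href")


if __name__ == "__main__":
    print(schema_href(SAMPLE_INSTANCE))
```

An application would typically resolve the returned href relative to the instance file and load the schema before interpreting the rest of the data.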
PML was developed at the Institute of Formal and Applied Linguistics
of Charles University in Prague. It was first used in the
Prague Dependency Treebank 2.0 and has been used in several other
treebanks since.
Conversion tools for various existing treebank formats are available, too.
Specification
Prague Markup Language specification (in HTML)
or the same in PDF
PML Toolkit
pmltk-1.1.5.tar.gz - The package contains:
- Latest version of the PML specification (PDF and HTML)
- pml_validate - PML instance and schema validation tool
- pml_simplify - PML schema simplification tool
- pml_copy - a tool for copying, moving, compressing, and uncompressing related PML instances
without breaking their mutual references.
- Format conversion tools from (and sometimes to) other formats, in particular the CoNLL format,
the formats of the Penn, Tiger, Sinica, Alpino, Arabic, Hyderabad, and Latin treebanks, etc.
- Perl API with support for on-the-fly conversion from/to other formats via pluggable backends and XSLT
- Current PML schemas for PDT 2.0 and some other treebanks
- Auxiliary tools
For detailed content of the package see the README file.
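Related PML instances have to be handled as a group because one file may point to others. The following Python sketch is not how pml_copy is implemented; it only illustrates the underlying idea by listing the files an instance refers to. It assumes that references are declared as head/references/reffile elements with an href attribute, next to the head/schema element, in the namespace http://ufal.mff.cuni.cz/pdt/pml/. The sample document and the function name referenced_files are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of PML instance documents (verify against the specification).
NS = {"pml": "http://ufal.mff.cuni.cz/pdt/pml/"}

# Hypothetical, heavily trimmed instance that refers to a schema and one other file.
SAMPLE = """<tdata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
  <head>
    <schema href="tdata_schema.xml"/>
    <references>
      <reffile id="a" name="adata" href="sample.a.gz"/>
    </references>
  </head>
</tdata>
"""


def referenced_files(instance_xml):
    """List hrefs of the schema and of all files referenced in the header."""
    root = ET.fromstring(instance_xml)
    hrefs = []
    schema = root.find("pml:head/pml:schema", NS)
    if schema is not None and schema.get("href"):
        hrefs.append(schema.get("href"))
    for reffile in root.findall("pml:head/pml:references/pml:reffile", NS):
        if reffile.get("href"):
            hrefs.append(reffile.get("href"))
    return hrefs


if __name__ == "__main__":
    for href in referenced_files(SAMPLE):
        print(href)
```

A copy or move operation that ignored this list would leave the hrefs pointing at files that no longer sit next to the instance.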
Querying over PML
- PML-TQ (PML Tree Query)
- a query system for PML-based treebanks with natural support
for cross-layer queries. The system is powered either by a relational
database or a sequential engine operating directly on PML files. The
system includes a command-line client as well as a full-featured
user interface in TrEd with a graphical query builder and visualizer
of the results. The query language also includes a sub-language for
the generation of listings and statistical reports.
Graphical Tools
- TrEd Toolkit
- a highly-customizable and scriptable tree editor
(annotation tool for syntactic and other tree-based analyses)
and a set of command-line tools for automated data processing.
- MEd
- an annotation tool used for speech reconstruction,
easily adaptable for other linearly-structured annotations
of text or audio data on multiple layers
- LAW (Lexical Annotation Workbench)
- an integrated environment
for morphological annotation. It supports simple morphological
annotation (assigning a lemma and tag to a word), integration and
comparison of different annotations of the same text, searching for
a particular word, tag, etc.
Processing tools
- Validation, format conversion, APIs
- See PML Toolkit above.
- Parallel data processing
- The TrEd Toolkit provides
tools called ntred and jtred which can be used
for parallel processing of PML data on a computer cluster.
NLP tools
- TectoMT
- a highly modular NLP (Natural Language Processing)
software system implemented in the Perl programming language under
Linux. It is primarily aimed at machine translation, making use
of the ideas and technology created during the Prague Dependency
Treebank project. It is also intended to facilitate and accelerate
the development of software for many other NLP tasks, mainly
through reuse of its numerous integrated processing modules
(called blocks), which share uniform object-oriented interfaces.
- Other
- Several taggers and parsers for Czech and other languages developed at UFAL
Other useful tools
Sun Multi-Schema XML Validator (MSV)
- contains a validator for RELAX NG grammars with embedded Schematron rules.
XML Editing Shell
- a scripting language and interactive shell for manipulating XML.
Current PML schemas for PDT 2.0 annotation
wdata_schema.xml,
mdata_schema.xml,
adata_schema.xml,
tdata_schema.xml
Bibliography
Pajas Petr, Štěpánek Jan: System for Querying Syntactically Annotated Corpora,
in Proceedings of the ACL-IJCNLP 2009 Software Demonstrations, Suntec, Singapore, pp. 33-36, 2009
Pajas Petr, Štěpánek Jan: Recent Advances in a Feature-Rich Framework for Treebank Annotation,
in The 22nd International Conference on Computational Linguistics - Proceedings of the Conference, Manchester, pp. 673-680, 2008
Pajas Petr, Štěpánek Jan: XML-Based Representation of Multi-Layered Annotation in the PDT 2.0,
in Proceedings of the LREC 2006 Workshop on Merging and Layering Linguistic Information, ELRA, Genoa, Italy, 2006
Acknowledgment
The development of PML is part of the project "Integration of language
resources for information extraction from natural texts"
(Information Society programme of the Grant Agency of the Academy of
Sciences of the Czech Republic, project no. 1ET101120503).