Prague Markup Language (PML)

These pages are obsolete! Please go to https://ufal.mff.cuni.cz/pml instead.

Prague Markup Language (PML) is an XML-based, general-purpose data format built on abstract data types and intended primarily for the interchange of linguistic annotations. It is completely independent of any particular annotation schema. It can capture simple linear annotations as well as annotations with one or more richly structured, interconnected annotation layers, including dependency or constituency trees. A concrete PML-based format for a specific annotation is defined by describing the data layout and XML vocabulary in a special file called a PML schema and referring to this schema file from the individual data files (instances). The schema can be used to validate the instances. Applications also use it to "understand" the structure of the data and to choose an optimal in-memory representation. Because PML is generic, data in other formats can usually be converted to PML without loss of information.
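The link between an instance and its schema can be illustrated with a short Python sketch. It is not part of the PML Toolkit. It assumes that an instance uses the namespace http://ufal.mff.cuni.cz/pdt/pml/ and names its schema in a head/schema element with an href attribute; check both against the specification before relying on them. The sample document and the function name schema_href are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI of PML instances (verify against the specification).
PML_NS = {"pml": "http://ufal.mff.cuni.cz/pdt/pml/"}

# Hypothetical, heavily reduced instance; real files carry actual annotation data.
SAMPLE_INSTANCE = """\
<adata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
  <head>
    <schema href="adata_schema.xml"/>
  </head>
  <trees/>
</adata>
"""


def schema_href(instance_xml):
    """Return the schema reference found in an instance header, or None."""
    root = ET.fromstring(instance_xml)
    # Assumption: the reference is stored as <head><schema href="..."/></head>.
    schema = root.find("pml:head/pml:schema", PML_NS)
    return None if schema is None else schema.get("href")


if __name__ == "__main__":
    print(schema_href(SAMPLE_INSTANCE))
```

An application would typically resolve the returned reference relative to the instance file, load the schema, and only then interpret the rest of the instance.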

PML was developed at the Institute of Formal and Applied Linguistics of Charles University in Prague. It was first used in the Prague Dependency Treebank 2.0 and has been used in several other treebanks since. Conversion tools for various existing treebank formats are also available.

Specification

Prague Markup Language specification (in HTML)

or the same in PDF

PML Toolkit

pmltk-1.1.5.tar.gz - The package contains:

Latest version of the PML specification (PDF and HTML)
pml_validate - PML instance and schema validation tool
pml_simplify - PML schema simplification tool
pml_copy - a tool for copying, moving, compressing, and uncompressing related PML instances without breaking their mutual references.
Format conversion tools from (and sometimes to) other formats, in particular the CoNLL format and the formats of the Penn, Tiger, Sinica, Alpino, Arabic, Hyderabad, and Latin treebanks
Perl API with support for on-the-fly conversion from/to other formats via pluggable backends and XSLT
Current PML schemas for PDT 2.0 and some other treebanks
Auxiliary tools

For detailed content of the package see the README file.

Querying over PML

PML-TQ (PML Tree Query): a query system for PML-based treebanks with natural support for cross-layer queries. The system is powered either by a relational database or by a sequential engine operating directly on PML files. It includes a command-line client as well as a full-featured user interface in TrEd with a graphical query builder and a visualizer of the results. The query language also includes a sub-language for generating listings and statistical reports.

Graphical Tools

TrEd Toolkit: a highly customizable and scriptable tree editor (an annotation tool for syntactic and other tree-based analyses) and a set of command-line tools for automated data processing.

MEd: an annotation tool used for speech reconstruction, easily adaptable to other linearly structured annotations of text or audio data on multiple layers.

LAW (Lexical Annotation Workbench): an integrated environment for morphological annotation. It supports simple morphological annotation (assigning a lemma and a tag to a word), integration and comparison of different annotations of the same text, and searching for a particular word, tag, etc.

Processing tools

Validation, format conversion, APIs: See PML Toolkit above.
Parallel data processing: The TrEd Toolkit provides two tools, ntred and jtred, for parallel processing of PML data on a computer cluster.

NLP tools

TectoMT: a highly modular NLP (Natural Language Processing) software system implemented in the Perl programming language under Linux. It is aimed primarily at machine translation and builds on the ideas and technology created during the Prague Dependency Treebank project. It is also intended to facilitate and accelerate the development of software for many other NLP tasks, mainly through the re-usability of its numerous integrated processing modules (called blocks), which have uniform object-oriented interfaces.
Other: several taggers and parsers for Czech and other languages developed at UFAL.

Other useful tools

Sun Multi-Schema XML Validator (MSV) - contains a validator for RELAX NG grammars with embedded Schematron rules.

XML Editing Shell - a scripting language and interactive shell for manipulating XML.

Current PML schemas for PDT 2.0 annotation

wdata_schema.xml, mdata_schema.xml, adata_schema.xml, tdata_schema.xml

Bibliography

Pajas Petr, Štěpánek Jan: System for Querying Syntactically Annotated Corpora, in Proceedings of the ACL-IJCNLP 2009 Software Demonstrations, Suntec, Singapore, pp. 33-36, 2009

Pajas Petr, Štěpánek Jan: Recent Advances in a Feature-Rich Framework for Treebank Annotation, in The 22nd International Conference on Computational Linguistics - Proceedings of the Conference, Manchester, pp. 673-680, 2008

Pajas Petr, Štěpánek Jan: XML-Based Representation of Multi-Layered Annotation in the PDT 2.0, in Proceedings of LREC 2006 Workshop on Merging and Layering Linguistic Information, ELRA, Genoa, Italy, 2006

Acknowledgment

The development of PML is part of the project "Integration of language resources for information extraction from natural texts", Information Society, Grant Agency of the Academy of Sciences of the Czech Republic, grant no. 1ET101120503.

Petr Pajas, 2010