Prague Markup Language (PML) is an XML-based, universally applicable data format built on abstract data types and intended primarily for the interchange of linguistic annotations. It is completely independent of any particular annotation schema. It can capture simple linear annotations as well as annotations with one or more richly structured, interconnected annotation layers, including dependency and constituency trees. A concrete PML-based format for a specific annotation is defined by describing the data layout and XML vocabulary in a special file called a PML Schema and referring to this schema file from the individual data files (instances). The schema can be used to validate the instances. Applications also use it to "understand" the structure of the data and to choose an optimal in-memory representation. Because PML is generic, data in other formats can usually be converted to PML without loss of information.
PML was developed at the Institute of Formal and Applied Linguistics of Charles University in Prague. It was first used in the Prague Dependency Treebank 2.0 and has been used in several other treebanks since. Conversion tools for various existing treebank formats are also available.
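The link between an instance and its PML Schema can be illustrated with a minimal Python sketch. The instance below is hypothetical and heavily simplified. The namespace URI, the `head` and `schema` element names, the `href` attribute, and the helper name `schema_href` are all assumptions made for illustration, so consult the PML specification for the authoritative layout.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for PML instances (not stated in the text above).
PML_NS = "http://ufal.mff.cuni.cz/pdt/pml/"

# Hypothetical, simplified instance: a header that points to the schema
# file, followed by an empty placeholder for the annotation data itself.
INSTANCE = """\
<adata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
  <head>
    <schema href="adata_schema.xml"/>
  </head>
  <trees/>
</adata>
"""


def schema_href(instance_xml):
    """Return the schema file referenced in the instance header, or None."""
    root = ET.fromstring(instance_xml)
    # Look for <head>/<schema> directly under the root, in the assumed namespace.
    schema = root.find(f"{{{PML_NS}}}head/{{{PML_NS}}}schema")
    return None if schema is None else schema.get("href")


if __name__ == "__main__":
    print(schema_href(INSTANCE))
```

An application would typically resolve this reference relative to the instance file, load the schema, and only then decide how to interpret and validate the rest of the document.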
Prague Markup Language specification (in HTML or in PDF)
pmltk-1.1.5.tar.gz - for the detailed contents of the package, see its README file.
Sun Multi-Schema XML Validator (MSV) - contains a validator of Relax NG grammars with embedded Schematron rules.
XML Editing Shell - a scripting language and interactive shell for manipulating XML.
wdata_schema.xml, mdata_schema.xml, adata_schema.xml, tdata_schema.xml
Pajas Petr, Štěpánek Jan: System for Querying Syntactically Annotated Corpora, in Proceedings of the ACL-IJCNLP 2009 Software Demonstrations, Suntec, Singapore, pp. 33-36, 2009
Pajas Petr, Štěpánek Jan: Recent Advances in a Feature-Rich Framework for Treebank Annotation, in The 22nd International Conference on Computational Linguistics - Proceedings of the Conference, Manchester, pp. 673-680, 2008
Pajas Petr, Štěpánek Jan: XML-Based Representation of Multi-Layered Annotation in the PDT 2.0, in Proceedings of the LREC 2006 Workshop on Merging and Layering Linguistic Information, ELRA, Genoa, Italy, 2006
Džeroski Sašo, Erjavec Tomaž, Ledinek Nina, Pajas Petr, Žabokrtský Zdeněk, Žele Andreja: Towards a Slovene Dependency Treebank, in Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Paris, France, pp. 1388-1391, 2006
The development of PML is part of the project "Integration of language resources for information extraction from natural texts", Information Society programme of the Grant Agency of the Academy of Sciences of the Czech Republic, grant no. 1ET101120503.