The PAWS data can be found in the data directory, whose five subdirectories hold the data in different export formats.

The treex subdirectory contains 50 gzipped Treex files (*.treex.gz), each one corresponding to a single document. The Treex format is an application of the Prague Markup Language (PML; Pajas and Štěpánek, 2008), an XML-based format designed for linguistic treebank annotations. For the sake of completeness, the PML schemata describing the structure of the Treex files are included in the resources directory.
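Since the Treex files are gzipped XML, they can also be inspected without TrEd. The following is a minimal sketch using only the Python standard library. It makes no assumption about the element names defined by the PML schemata and merely counts whatever elements it finds; the function names (load_treex, local_name, element_counts) are illustrative and not part of any PAWS or Treex tooling.

```python
import gzip
import xml.etree.ElementTree as ET


def load_treex(path):
    """Parse one gzipped Treex (PML/XML) file and return its root element."""
    with gzip.open(path, "rb") as handle:
        return ET.parse(handle).getroot()


def local_name(tag):
    """Strip an XML namespace prefix of the form '{uri}name'."""
    return tag.rsplit("}", 1)[-1]


def element_counts(root):
    """Count how often each element name occurs anywhere in the document."""
    counts = {}
    for element in root.iter():
        name = local_name(element.tag)
        counts[name] = counts.get(name, 0) + 1
    return counts
```

Calling element_counts(load_treex(path)) on one of the *.treex.gz files gives a quick overview of its structure, which can then be compared against the schemata in the resources directory.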

The plain subdirectory contains 50 text files for each language (*.{en,cs,ru,pl}.txt). Each text file consists of surface sentences, one sentence per line. A variant of this format, stored in the plain_zero subdirectory, also includes the zeros that semantically behave as nouns. Such a zero appears in the immediate neighbourhood of its governing verb and is formatted as a concatenation of the tectogrammatical lemma of the zero (#PersPron, #Cor or #Gen) and its semantic role (or functor), delimited by a colon. See the tectogrammatical manual for more information on these attributes.
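The zero notation can be separated from the surface tokens with a short script. The sketch below rests on two assumptions that the description above does not confirm: that each zero stands as its own whitespace-delimited token, and that a functor consists of letters and digits only. The function name split_zeros is illustrative.

```python
import re

# Lemma, colon, functor. The three lemmas are those listed in the
# description; the functor pattern (letters and digits) is an assumption.
ZERO_RE = re.compile(r"^(#PersPron|#Cor|#Gen):([A-Za-z0-9]+)$")


def split_zeros(line):
    """Split one line into surface tokens and zeros.

    Returns (tokens, zeros), where zeros is a list of
    (position_in_line, lemma, functor) tuples.
    Assumes zeros are whitespace-delimited tokens of their own.
    """
    tokens, zeros = [], []
    for position, token in enumerate(line.split()):
        match = ZERO_RE.match(token)
        if match:
            zeros.append((position, match.group(1), match.group(2)))
        else:
            tokens.append(token)
    return tokens, zeros
```

For a made-up line such as "Yesterday #PersPron:ACT came home ." (not taken from the data), the function returns the four surface tokens together with one zero at position 1 with lemma #PersPron and functor ACT.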

The conll subdirectory contains every language variant of every document as a separate text file. The format is derived from the one used for the SemEval 2010 Shared Task and is compatible with the official CoNLL scorer for coreference resolution. Each token is represented on a single line by tab-separated attributes. The format differs from the original SemEval 2010 format in the last-but-one attribute, which holds the Treex ID of the corresponding node. To stay compatible with the official CoNLL coreference scorer, the last column remains reserved for coreference information annotated in open-close notation. As with the plain-text format, the conll_zero subdirectory consists of the same files but with additional records for zeros.
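The last two columns can be read with a few lines of Python. The sketch below assumes the usual open-close conventions: "(7" opens a mention of entity 7, "7)" closes it, "(7)" marks a single-token mention, "|" separates several marks on one token, and "_" or "-" means no mark. It also assumes that blank lines and lines starting with "#" are not token lines. The function names are illustrative, and the exact layout of the remaining columns is not relied upon.

```python
import re
from collections import defaultdict

MARK_RE = re.compile(r"(\()?(\d+)(\))?")


def read_tokens(lines):
    """Yield (treex_id, coref_tag) per token line.

    Skips blank lines and lines starting with '#' (assumed to be
    document delimiters rather than tokens).
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        yield columns[-2], columns[-1]


def parse_coref_column(tags):
    """Turn open-close tags into {entity: [(start, end), ...]}.

    start and end are token indices within the given sequence.
    """
    open_starts = defaultdict(list)
    mentions = defaultdict(list)
    for index, tag in enumerate(tags):
        if tag in ("_", "-", ""):
            continue
        for part in tag.split("|"):
            match = MARK_RE.fullmatch(part)
            if match is None or not (match.group(1) or match.group(3)):
                raise ValueError(f"unexpected coreference mark: {part!r}")
            entity = match.group(2)
            if match.group(1):  # opening bracket
                open_starts[entity].append(index)
            if match.group(3):  # closing bracket
                if not open_starts[entity]:
                    raise ValueError(f"unmatched close for entity {entity}")
                mentions[entity].append((open_starts[entity].pop(), index))
    return dict(mentions)
```

Pairing the mention spans with the Treex IDs yielded by read_tokens makes it possible to look up each mention's tokens in the corresponding Treex file.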

How to browse the data in the Treex format

The tree editor TrEd (Pajas and Štěpánek, 2008) can be used to open and browse the data in the Treex format. The editor can be downloaded for various platforms from its home page. Please follow the installation instructions given on that page for your operating system.

Once TrEd is installed, the EasyTreex extension needs to be added:

  1. Start TrEd.
  2. In the top menu, select Setup -> Manage Extensions...; a dialog window with a list of installed extensions appears.
  3. Click on the button Get New Extensions; a dialog window with a list of available (not yet installed) extensions appears.
  4. Make sure that at least the extension EasyTreex is checked for installation (if it is not in the list, it may have already been installed).
  5. Click on the button Install Selected; the selected extensions get installed.
  6. Close all TrEd windows including the main application window and start TrEd again.

Now, TrEd is able to open the data of PAWS, displaying the analytical and tectogrammatical trees of a single sentence in four languages at once.

If you run into trouble installing TrEd or browsing the data, please contact the authors at tred at