Multiword expressions in the Prague Dependency Treebank 2.0
Annotation of Multiword Expressions and Multiword Named Entities in the PDT 2.0
The Prague Dependency Treebank 2.0 (PDT 2.0) contains a large number of Czech texts with complex, interlinked annotation at three levels: morphological (2 million words), syntactic (1.5 million words) and semantic (0.8 million words). In addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level.
This dataset adds annotation of multiword expressions and multiword named entities to the original PDT 2.0 data. The annotation is stand-off, stored in the same PML format as the original PDT 2.0 data. It is to be used together with the PDT 2.0.
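To illustrate what "stand-off" means here, the following minimal Python sketch keeps the annotation in a separate structure that points into the base data by node ID only. It does not read PML, and the class name, field names, identifiers and lemmas are all hypothetical; the real files follow their own PML schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffMwe:
    """One stand-off annotation (hypothetical, simplified structure)."""
    lexicon_id: str   # made-up identifier of a lexicon entry
    node_ids: tuple   # IDs of nodes in the separately stored base data


def resolve(annotation, nodes_by_id):
    """Return the lemmas of the base-layer nodes that the annotation points to."""
    return [nodes_by_id[node_id] for node_id in annotation.node_ids]


# Base layer, standing in for the PDT 2.0 trees: node ID -> lemma (IDs are made up).
base_nodes = {
    "t-s1-n1": "český",
    "t-s1-n2": "republika",
    "t-s1-n3": "být",
}

# Stand-off layer: stored separately, refers to the base layer by ID only.
annotations = [
    StandoffMwe(lexicon_id="semlex-0001", node_ids=("t-s1-n1", "t-s1-n2")),
]

for mwe in annotations:
    print(mwe.lexicon_id, resolve(mwe, base_nodes))
```

In this sketch, resolving an annotation without the base layer fails with a `KeyError`, which mirrors why the annotation cannot be used without the PDT 2.0 data.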
There is also SemLex, a tectogrammatical lexicon of all the MWEs annotated. The lexicon is a work in progress. It is complete in terms of coverage of the data, and every entry includes the basic form of the expression, a simplified dependency structure, and some other attributes. However, only a few entries have a proper gloss, an example sentence, synonyms (if applicable) and some other attributes.
Authors | Pavel Straňák, Eduard Bejček |
---|---|
Supported by | grant 1ET201120505 of the Academy of Sciences of the Czech Republic and grant MSM0021620838 of the Ministry of Education, Youth and Sports of the Czech Republic |
Status | published within PDT 2.5 (only gold data, not the parallel annotation) |
Annotation (data) | Download all parallel annotations from three annotators (without any corrections). The data are in a stand-off PML format. |
SemLex (lexicon of MWEs) | Download |
License | This work is licensed under a Creative Commons Licence. PDT 2.0 itself is not a part of this dataset. To use the PDT 2.0, a valid PDT License is required. PDT 2.5 (with gold MWE annotation) is, however, licensed under CC. |
Annotation Tool | SemAnn (username and password 'public') |
Visualisation+Search | A TrEd extension is available: install and run TrEd, Setup→Manage Extensions→Get New Extensions→"Display st-data in the tectogrammatical trees". The development repository of the extension is also public. |
Publications | |