2019 | ÚFAL

Year 1 - 2019

Manual checks and additions to CzeDLex entries (WP2) were performed by two project workers, with a partial overlap for measuring inter-annotator agreement. The agreement was very high; the few discrepancies were checked and resolved. The theoretical research resulted in a change in the lexicon structure: primary connectives may now also have two or more second-level entries with the same discourse sense (reflecting the complexity of connectives such as "potom" [afterwards] vs. "potom, co" [after]). Numerous lexicon entries were split and others were merged.

The amount of manual work is in accordance with the project proposal, i.e. approx. 1/3 of the remaining entries were processed. Because the most frequent entries are annotated first (and anaphoric connectives are also given priority, see part DC-publikace), the manually processed part of the lexicon covers over 90% of all discourse relations annotated in the source corpus (PDiT 2.0). The annotations were followed by the publication of CzeDLex 0.6 (204 entries, 76 of them fully manually checked and enriched with additional information). A paper was submitted to LREC 2020, describing the theoretical results of the research on the lexicon structure and giving examples of how to search the lexicon using the PML-TQ search engine.

The lexicon was incorporated into the first version of a discourse parser for Czech (WP3). In this first version, CzeDLex is used for inter-sentential relations only. Automatic annotation of intra-sentential relations (in this first version) uses the procedure proposed by Jínová (now Synková), Mírovský and Poláková in 2012; it relies on tectogrammatical annotation of the texts, which may be either manual or automatic. After all manual work in WP2 was finished, the parser was tested on the PDiT 2.0 development test data (with manual annotation of the tectogrammatical layer), using measures defined in Mírovský et al. (2010). The connective-based F1 measure applied to the automatically annotated inter-sentential relations, indicating the success in recognizing the presence of a discourse relation, reached 58%. The ratio of correctly assigned discourse types to the correctly recognized relations was 80%. These results represent a baseline for subsequent versions of the parser.
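The two figures reported above can be illustrated with a minimal sketch. It is not the evaluation code used in the project, and the exact matching criteria are those defined in Mírovský et al. (2010). The sketch assumes that each relation is identified by the position of its connective (a hypothetical integer key) and that a predicted relation counts as recognized when the gold data contain a relation at the same key. The function name and the data layout are illustrative only.

```python
from typing import Dict, Hashable, Tuple


def connective_scores(
    gold: Dict[Hashable, str], predicted: Dict[Hashable, str]
) -> Tuple[float, float, float, float]:
    """Return (precision, recall, F1, type ratio).

    Both arguments map a connective identifier (assumed here to be a token
    position) to the discourse type assigned to the relation it signals.
    """
    # Relations recognized by the parser that also exist in the gold data.
    matched = set(gold) & set(predicted)

    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0

    # Among the correctly recognized relations, how many got the right type.
    correct_types = sum(1 for key in matched if gold[key] == predicted[key])
    type_ratio = correct_types / len(matched) if matched else 0.0

    return precision, recall, f1, type_ratio


# Toy data, invented for illustration only.
gold = {1: "reason", 2: "conjunction", 3: "opposition", 4: "condition"}
predicted = {1: "reason", 2: "opposition", 5: "conjunction"}
precision, recall, f1, type_ratio = connective_scores(gold, predicted)
```

Keeping the type ratio separate from F1 mirrors the way the results are reported: the first number measures whether a relation was found at all, the second how often a found relation was also labelled correctly.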

Scripts for annotation projection (WP1) were prepared and tested on the Penn Discourse Treebank 2 (PDTB 2), while external resources (other projects) were used for implementing scripts for obtaining Gorn addresses (links from plain text to phrase-structure trees), which are missing in the publication of the Penn Discourse Treebank 3 (in contrast with the PDTB 2). The Gorn-address scripts were finished by the end of 2019. Research was also carried out on the interoperability of the PDTB 3 sense taxonomy and the Czech discourse sense taxonomy used in the PDiT 2.0 data. The projection of the PDTB 3 has been postponed to the first quarter of 2020 so that the externally implemented scripts for the Gorn addresses can be used.
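For readers unfamiliar with the term, a Gorn address identifies a node in a tree by the sequence of child indices on the path from the root. The following is a minimal sketch of how a span of tokens can be linked to the lowest node covering it. It is not the externally implemented script mentioned above; the nested-tuple tree representation, the function names and the example sentence are assumptions made for illustration.

```python
from typing import List, Tuple, Union

# A tree is either a token (str) or a tuple (label, child, child, ...).
Tree = Union[str, tuple]
Address = Tuple[int, ...]


def leaf_addresses(tree: Tree, prefix: Address = ()) -> List[Tuple[str, Address]]:
    """Return (token, Gorn address) pairs in left-to-right order."""
    if isinstance(tree, str):
        return [(tree, prefix)]
    pairs: List[Tuple[str, Address]] = []
    for index, child in enumerate(tree[1:]):
        pairs.extend(leaf_addresses(child, prefix + (index,)))
    return pairs


def covering_address(tree: Tree, start: int, end: int) -> Address:
    """Gorn address of the lowest node covering tokens start..end (inclusive)."""
    paths = [address for _, address in leaf_addresses(tree)[start : end + 1]]
    common: List[int] = []
    # The longest common prefix of the leaf paths is the covering node.
    for steps in zip(*paths):
        if len(set(steps)) != 1:
            break
        common.append(steps[0])
    return tuple(common)


# Toy phrase-structure tree, invented for illustration only.
tree = (
    "S",
    ("NP", ("PRP", "He")),
    ("VP",
     ("VBD", "left"),
     ("SBAR",
      ("IN", "because"),
      ("S", ("NP", ("PRP", "it")), ("VP", ("VBD", "rained"))))),
)

clause = covering_address(tree, 3, 4)       # tokens "it rained"
subordinate = covering_address(tree, 2, 4)  # tokens "because it rained"
```

In this toy tree the tokens "it rained" resolve to the embedded S node and "because it rained" to the SBAR node, which is the kind of text-to-tree link the projection scripts need.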

Shallow discourse parsing in Czech

Automatická analýza diskurzních vztahů v češtině
