CWC | ÚFAL

Syntactic information in KonText

(or making trees linear)

Syntactically annotated corpora are usually displayed as trees. However, users do not always need to view the trees: a linear representation of the text, enriched with a few features (such as dependency relations or information about the parent node), can be enough.

CWC is a large corpus of Czech that was automatically downloaded from the web. The plain-text data can be downloaded from the Lindat repository, while a version automatically tagged with the Featurama tagger and parsed with the MST parser can be searched in KonText. The attributes are:

node: form, lemma, tag, afun
parent: p_form, p_lemma, p_tag, p_afun, parent (distance in tokens to the parent, e.g. -1 = one token to the left, +5 = five tokens to the right)
effective parent: ep_form, ep_lemma, ep_tag, ep_afun, eparent (distance to the effective parent)
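
To make the parent attributes concrete, here is a minimal Python sketch of how a dependency tree could be flattened into per-token attributes of this kind. It is an illustration only, not the pipeline used to build CWC: the Token class, the linearize function, the hand-annotated toy sentence, and the empty values used for the root are all assumptions (the text above does not say how the root is encoded). Only the attribute names p_form, p_afun and parent, and the signed offset format, come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    form: str
    afun: str
    head: Optional[int]  # index of the parent token in the sentence; None for the root


def linearize(sentence: List[Token]) -> List[Dict[str, str]]:
    """Copy parent information onto each token so no tree is needed to read it."""
    rows = []
    for i, tok in enumerate(sentence):
        row = {"form": tok.form, "afun": tok.afun}
        if tok.head is None:
            # Assumption: root encoding is not described in the source; use empty values.
            row.update({"p_form": "", "p_afun": "", "parent": ""})
        else:
            parent = sentence[tok.head]
            offset = tok.head - i  # negative = parent is to the left, positive = to the right
            row.update({
                "p_form": parent.form,
                "p_afun": parent.afun,
                "parent": f"{offset:+d}",  # signed distance in tokens, e.g. "-1" or "+5"
            })
        rows.append(row)
    return rows


# Hand-made toy annotation of "Přišli Petr a Pavel" (not taken from the corpus).
sentence = [
    Token("Přišli", "Pred", None),
    Token("Petr", "Sb", 2),
    Token("a", "Coord", 0),
    Token("Pavel", "Sb", 2),
]

rows = linearize(sentence)
for row in rows:
    print(row)
```

In this toy sentence both nouns point to the conjunction "a", so each carries p_afun "Coord" on its own line, which is the kind of condition a linear query can test without traversing a tree.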

Examples of queries:

Suppose you want to find all the nouns that are coordinated and that follow a verb. Try the query: [tag="Vp.*"][p_afun="Coord" & tag="NN[FM].*"]
Slightly more complicated: to require that those nouns are also subjects, add a condition on afun: [tag="Vp.*"][p_afun="Coord" & tag="NN[FM].*" & afun="Sb"]
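
If queries like the two above are generated by a script rather than typed by hand, they can be assembled from attribute constraints. The following Python sketch shows one possible way; the helper names position and query are hypothetical and not part of KonText, and the code only builds the query string. It does not escape quotation marks inside values and does not contact any server.

```python
def position(**constraints: str) -> str:
    """Render one token position, joining attribute constraints with '&'."""
    parts = [f'{attr}="{value}"' for attr, value in constraints.items()]
    return "[" + " & ".join(parts) + "]"


def query(*positions: str) -> str:
    """Concatenate token positions into a query over adjacent tokens."""
    return "".join(positions)


verb = position(tag="Vp.*")
coordinated_noun = position(p_afun="Coord", tag="NN[FM].*")
coordinated_subject = position(p_afun="Coord", tag="NN[FM].*", afun="Sb")

q1 = query(verb, coordinated_noun)
q2 = query(verb, coordinated_subject)

print(q1)
print(q2)
```

The resulting strings can be pasted into the query field as they are.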

Lindat KonText
