Syntactic information in KonText

(or making trees linear)

Syntactically annotated corpora are usually displayed as trees. However, users do not always need to view trees: a linear representation of the text, enriched with a few features (such as dependency relations and information about the parent node), can be enough.

CWC is a large corpus of Czech that was automatically downloaded from the web. The plain text data can be downloaded from the Lindat repository, while a version automatically tagged with the Featurama tagger and parsed with the MST parser can be searched in KonText. The attributes are:

  • node: form, lemma, tag, afun
  • parent: p_form, p_lemma, p_tag, p_afun, parent (distance in tokens to the parent, e.g. -1 = one token to the left, +5 = five tokens to the right)
  • effective parent: ep_form, ep_lemma, ep_tag, ep_afun, eparent (distance in tokens to the effective parent)
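To make the idea of "making trees linear" concrete, here is a minimal Python sketch of how the parent attributes could be derived from a dependency tree. It is not the pipeline actually used for CWC. The function name `linearize`, the `head` field, the toy sentence and its tags, and the handling of the root (empty parent attributes, offset 0) are all assumptions made for illustration. Effective parents are not computed.

```python
from typing import Any, Dict, List

NODE_ATTRS = ("form", "lemma", "tag", "afun")


def linearize(tokens: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Return one flat row per token: its own attributes, the p_* attributes
    copied from its parent, and 'parent' as a signed distance in tokens."""
    rows = []
    for i, tok in enumerate(tokens):
        row = {attr: tok[attr] for attr in NODE_ATTRS}
        head = tok["head"]  # hypothetical field: list index of the parent, None for the root
        if head is None:
            # Assumption: the root gets empty parent attributes and offset 0.
            for attr in NODE_ATTRS:
                row["p_" + attr] = ""
            row["parent"] = "0"
        else:
            for attr in NODE_ATTRS:
                row["p_" + attr] = tokens[head][attr]
            # Negative = parent is to the left, positive = to the right.
            row["parent"] = f"{head - i:+d}"
        rows.append(row)
    return rows


# Toy sentence "Přišli Petr a Pavel"; tags and afuns are illustrative only.
sentence = [
    {"form": "Přišli", "lemma": "přijít", "tag": "VpMP---XR-AA---", "afun": "Pred", "head": None},
    {"form": "Petr", "lemma": "Petr", "tag": "NNMS1-----A----", "afun": "Sb", "head": 2},
    {"form": "a", "lemma": "a", "tag": "J^-------------", "afun": "Coord", "head": 0},
    {"form": "Pavel", "lemma": "Pavel", "tag": "NNMS1-----A----", "afun": "Sb", "head": 2},
]

rows = linearize(sentence)
for row in rows:
    print(row["form"], row["afun"], row["p_form"], row["p_afun"], row["parent"])
```

Each token thus carries enough information about its parent to be queried without looking at the tree itself.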

Examples of queries:

  • Suppose you want to find all coordinated nouns that immediately follow a verb. Try the query: [tag="Vp.*"][p_afun="Coord" & tag="NN[FM].*"]
  • Slightly more complicated: to require that those nouns are also subjects, add a condition on afun: [tag="Vp.*"][p_afun="Coord" & tag="NN[FM].*" & afun="Sb"]
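The logic of the two queries above can be mimicked on a few hand-made tokens with the following Python sketch. It is not how KonText evaluates queries. The helper `find_matches`, the toy data, and the assumption that each regular expression must match the whole attribute value are illustrative. Each bracketed position of a query is written as a dictionary of attribute conditions, all of which must hold (the `&` in the query).

```python
import re
from typing import Dict, List

Token = Dict[str, str]
Pattern = List[Dict[str, str]]  # one dict of attribute -> regex per query position


def find_matches(tokens: List[Token], pattern: Pattern) -> List[int]:
    """Return start indices where consecutive tokens satisfy the pattern."""
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(
            re.fullmatch(regex, tok.get(attr, ""))  # assumption: whole-value match
            for cond, tok in zip(pattern, window)
            for attr, regex in cond.items()
        ):
            hits.append(start)
    return hits


# Toy data: "Přišli Petr a Pavel" and "Viděli Petra a Pavla" (illustrative tags).
tokens = [
    {"form": "Přišli", "tag": "VpMP---XR-AA---", "afun": "Pred", "p_afun": ""},
    {"form": "Petr", "tag": "NNMS1-----A----", "afun": "Sb", "p_afun": "Coord"},
    {"form": "a", "tag": "J^-------------", "afun": "Coord", "p_afun": "Pred"},
    {"form": "Pavel", "tag": "NNMS1-----A----", "afun": "Sb", "p_afun": "Coord"},
    {"form": "Viděli", "tag": "VpMP---XR-AA---", "afun": "Pred", "p_afun": ""},
    {"form": "Petra", "tag": "NNMS4-----A----", "afun": "Obj", "p_afun": "Coord"},
    {"form": "a", "tag": "J^-------------", "afun": "Coord", "p_afun": "Pred"},
    {"form": "Pavla", "tag": "NNMS4-----A----", "afun": "Obj", "p_afun": "Coord"},
]

# [tag="Vp.*"][p_afun="Coord" & tag="NN[FM].*"]
coordinated_nouns = [{"tag": "Vp.*"}, {"p_afun": "Coord", "tag": "NN[FM].*"}]
# [tag="Vp.*"][p_afun="Coord" & tag="NN[FM].*" & afun="Sb"]
coordinated_subjects = [{"tag": "Vp.*"}, {"p_afun": "Coord", "tag": "NN[FM].*", "afun": "Sb"}]

all_hits = find_matches(tokens, coordinated_nouns)
subject_hits = find_matches(tokens, coordinated_subjects)
print(all_hits, subject_hits)
```

In this toy data the first pattern matches both "Přišli Petr" and "Viděli Petra", while the second keeps only "Přišli Petr", where the noun is a subject.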