Treebank Search (lab)

(Partially based on an older tutorial in Czech, which is available here.)

In this lab session we will play with PML-TQ, an on-line treebank search service. PML (Prague Markup Language) is an XML-based file format used for the Prague treebanks. TQ stands for “tree query”. The service provides access to many treebanks in various annotation schemes, but we will focus on treebanks from the Universal Dependencies (UD) collection.

Go to https://lindat.mff.cuni.cz/services/pmltq/#!/treebanks. Click on “Universal Dependencies”; you should see a non-zero number next to each language for which a UD treebank exists. Click on “Czech”. You will see a list of Czech UD treebanks beneath the language selection. Click on “Universal Dependencies - Czech”, which is the data from the Prague Dependency Treebank. You will be taken to a “Getting Started” page. Click on the “Query” link in the top line of the page. Type your first query in the window provided and press “Execute query”:

a-node [ form="mimo" ]

You will get the first 100 trees that contain the preposition mimo. You can browse the trees using the buttons “Previous” and “Next”.

Every node that corresponds to a syntactic word is referred to as an a-node. In addition, every sentence has an artificial root node that does not correspond to any word; this node is referred to as the a-root. A-nodes have attributes such as form, lemma, tag (the universal part-of-speech tag in UD) and deprel (dependency relation label). Morphological features from UD are encapsulated in a structured attribute called iset, and their names and values sometimes differ slightly from those defined in UD. For example, the UD documentation encodes the nominative case as Case=Nom, but in PML-TQ it currently has to be queried as iset/case="nom".
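To make the attribute layout concrete, here is a minimal Python sketch of a single a-node as a nested dictionary, with a helper that resolves slash-separated paths such as iset/case. The dictionary layout, the helper name get_attr and the example word are assumptions made for illustration only; they do not describe how PML-TQ stores its data internally.

```python
# Toy model of one a-node. The layout is an assumption for illustration,
# not the internal representation used by PML-TQ.
node = {
    "form": "žena",            # hypothetical example word ("woman")
    "lemma": "žena",
    "tag": "NOUN",             # universal part-of-speech tag
    "deprel": "nsubj",         # dependency relation label
    "iset": {"case": "nom"},   # structured attribute with morphological features
}


def get_attr(node, path):
    """Follow a slash-separated path such as 'iset/case'; return None if absent."""
    value = node
    for key in path.split("/"):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


print(get_attr(node, "tag"))        # NOUN
print(get_attr(node, "iset/case"))  # nom
```

A flat attribute such as tag and a nested one such as iset/case are reached the same way, which mirrors how both kinds of attribute appear inside the square brackets of a query.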

If you want to know the total number of occurrences of the word mimo in the treebank, you can append an aggregation function after the query. Now the result will not be trees but a small one-cell table saying that there are 321 occurrences:

a-node [ form="mimo" ]
>> count()
321

The preposition mimo is one of the Czech equivalents of the English expression except for. The other equivalent is kromě. The authoritative Czech grammar (or at least my elementary school teacher) says that kromě requires a noun phrase in the genitive case while the argument of mimo must be in the accusative. However, speakers sometimes extend the genitive to mimo, too. Let's see whether such usage occurs in our treebank. In UD the preposition is attached as a dependent of the noun it introduces, so we query the case feature of the parent node of the preposition:

a-node [ form="mimo", parent a-node [ iset/case="gen" ] ]
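The logic of this query can be mimicked in a few lines of Python on toy data. Everything below is an illustrative assumption: the two hand-made sentences, the token dictionaries with id and head fields, and the function name parents_of. The real service evaluates the query on the server; the sketch only shows that the case is read from the parent of the preposition.

```python
# Toy data: two hand-made sentences, each a list of tokens (assumed layout).
# "head" is the id of the parent token; 0 stands for the artificial root.
# Only the attributes needed here are filled in.
sentences = [
    [  # "Výtah je mimo provoz." (accusative after "mimo")
        {"id": 1, "form": "Výtah", "head": 4, "case": "nom"},
        {"id": 2, "form": "je", "head": 4},
        {"id": 3, "form": "mimo", "head": 4},
        {"id": 4, "form": "provoz", "head": 0, "case": "acc"},
    ],
    [  # "Bydlí mimo města." (genitive after "mimo")
        {"id": 1, "form": "Bydlí", "head": 0},
        {"id": 2, "form": "mimo", "head": 3},
        {"id": 3, "form": "města", "head": 1, "case": "gen"},
    ],
]


def parents_of(sentences, form):
    """Yield the parent token of every token with the given word form."""
    for tokens in sentences:
        by_id = {t["id"]: t for t in tokens}
        for t in tokens:
            if t["form"] == form and t["head"] in by_id:
                yield by_id[t["head"]]


# Counterpart of: a-node [ form="mimo", parent a-node [ iset/case="gen" ] ]
genitive_hits = [p for p in parents_of(sentences, "mimo") if p.get("case") == "gen"]
print(len(genitive_hits))  # 1
```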

Yes, there are 10 occurrences! But what proportion of all occurrences of the preposition is that? Instead of querying the accusative (and possibly other cases) separately, we will use an aggregation filter again, this time to get a table of the different case markings with their counts. We need to name the nodes in order to refer to them in the filter (notice the $a := and $p := that we added after a-node):

a-node $a := [ form="mimo", parent a-node $p := [ ] ]
>> for $a.lemma, $a.tag, $p.iset/case give $1, $2, $3, count() sort by $4 desc, $3
mimo    ADP    acc    300
mimo    ADP    gen    10
mimo    ADP           5
mimo    ADV           5
mimo    ADP    loc    1

Note that the variables $1, $2, $3 in the give part refer to the first, second and third elements of the for part. Similarly, the numbered variables in the sort by part refer to the elements of the give part, so $4 is the count.
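As a rough analogy, the grouping and sorting done by this filter can be reproduced in Python, which also answers the question about proportions. The sketch below takes the counts from the result table, expands them into one record per match, and then groups and sorts them again. The variable names and the use of an empty string for a missing case value are assumptions of the sketch, not part of PML-TQ.

```python
from collections import Counter

# Counts copied from the result table: (lemma, tag, case of parent, count).
# An empty string stands for a parent without a case value (an assumption).
table = [
    ("mimo", "ADP", "loc", 1),
    ("mimo", "ADP", "", 5),
    ("mimo", "ADV", "", 5),
    ("mimo", "ADP", "gen", 10),
    ("mimo", "ADP", "acc", 300),
]

# Expand to one record per match, as the matches look before grouping.
matches = [(lemma, tag, case) for lemma, tag, case, n in table for _ in range(n)]

# Analogue of "for $a.lemma, $a.tag, $p.iset/case give $1, $2, $3, count()".
groups = Counter(matches)

# Analogue of "sort by $4 desc, $3": count descending, then case ascending.
rows = sorted(
    ((lemma, tag, case, n) for (lemma, tag, case), n in groups.items()),
    key=lambda r: (-r[3], r[2]),
)

total = sum(r[3] for r in rows)
for lemma, tag, case, n in rows:
    print(f"{lemma}\t{tag}\t{case or '-'}\t{n}\t{n / total:.1%}")
```

The counts add up to the 321 occurrences found earlier, so the 10 genitive cases make up about 3.1% of all occurrences of mimo.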

Homework (optional, for those who need more points): Valency Dictionary of Verbs

Deadline: February 15, 2017.