Udapi Tutorial

by Martin Popel


Udapi is an API and framework for processing Universal Dependencies (UD) data. It is available for Python, Perl and Java. This tutorial uses the Python version and assumes Linux with Bash and Python 3.3 or higher.

You can download my slides about UD and Udapi.

Step 1: Install Udapi

Follow the instructions at https://github.com/udapi/udapi-python
Solution
pip3 install --user --upgrade git+https://github.com/udapi/udapi-python.git
export PATH="$HOME/.local/bin/:$PATH"

Step 2: Download sample data

Download and extract ud14sample.tgz. It contains just 10 sentences for each language plus one bigger file (dev.conllu) for English. For the full UD v1.4 release, go to Lindat.
Solution
wget http://ufal.mff.cuni.cz/~popel/udapi/ud14sample.tgz
tar -xf ud14sample.tgz
cd sample
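
If you want to check the "10 sentences per language" claim yourself, the following is a minimal sketch in plain Python (no Udapi needed). It assumes only the standard CoNLL-U layout, where sentences are separated by blank lines and comment lines start with #. The function name count_sentences is made up for this illustration.

```python
import glob


def count_sentences(conllu_text):
    """Count sentences in CoNLL-U text.

    A sentence is a run of non-comment lines; sentences are
    separated by blank lines and comments start with '#'.
    """
    count = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            in_sentence = False
        elif not line.startswith("#") and not in_sentence:
            # First token line of a new sentence.
            in_sentence = True
            count += 1
    return count


if __name__ == "__main__":
    # Run from inside the extracted "sample" directory.
    for path in sorted(glob.glob("*/sample.conllu")):
        with open(path, encoding="utf-8") as f:
            print(path, count_sentences(f.read()))
```

Run it from the sample directory; it prints one line per language with the number of sentences found.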

Step 3: Browse your favorite language

Use the udapy commands from my slides.
Solution
cat */sample.conllu | udapy -T | less -R

This concatenates all languages and pipes them to udapy and then to less (type q to exit). You can use e.g. UD_English instead of *. The -R option tells less to display colors (instead of their ANSI codes).

The -T option prints the trees in text mode; it is a shortcut for udapy write.TextModeTrees color=1. Run udapy --help to see other useful shortcuts, e.g.

cat UD_English/sample.conllu | udapy -H > en.html
This creates an HTML version that you can open in any modern browser. -HA also includes all the nodes' attributes in the HTML output.
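
To get a feeling for what "a tree in text mode" means, here is a rough sketch in plain Python that prints one CoNLL-U sentence as an indented dependency tree. It is not Udapi's write.TextModeTrees (which keeps the word order and draws the edges); it only assumes the standard 10-column CoNLL-U format, where column 7 is the head and column 8 the deprel. The name render_tree is hypothetical.

```python
def render_tree(conllu_sentence):
    """Return the lines of an indented dependency tree for one sentence.

    Each output line is "FORM UPOS DEPREL", indented by two spaces
    per level of depth below the technical root (head 0).
    """
    children = {}  # head id -> list of (id, form, upos, deprel)
    for line in conllu_sentence.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip malformed lines, multiword tokens (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        node = (int(cols[0]), cols[1], cols[3], cols[7])
        children.setdefault(int(cols[6]), []).append(node)

    lines = []

    def visit(parent_id, depth):
        # Depth-first traversal: a node first, then its subtree.
        for node_id, form, upos, deprel in children.get(parent_id, []):
            lines.append("  " * depth + f"{form} {upos} {deprel}")
            visit(node_id, depth + 1)

    visit(0, 0)
    return lines


if __name__ == "__main__":
    sentence = (
        "1\tWell\twell\tINTJ\t_\t_\t3\tdiscourse\t_\t_\n"
        "2\t,\t,\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
        "3\tyes\tyes\tINTJ\t_\t_\t0\troot\t_\t_\n"
    )
    print("\n".join(render_tree(sentence)))
```

The toy sentence in the example is invented for illustration and is not taken from the sample data.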

Step 4: Find out what the discourse deprel (dependency relation) means

Option A: search the documentation.

Solution

See the documentation of the discourse deprel.

Option B: browse UD_English/dev.conllu as in the previous step and find the occurrences of discourse.

Solution
udapy -T < UD_English/dev.conllu | less -R
In less, type /discourse and press Enter to search; press n to jump to the next occurrence.

Option C: extract all word forms and UPOS tags of nodes annotated with the discourse deprel in UD_English/dev.conllu. Hints: use udapy util.Eval node='PYTHON_CODE' and replace PYTHON_CODE with code that uses node.deprel, node.form and node.upos. The standard Unix way of frequency analysis is sort | uniq -c | sort -rn.

Solution
udapy util.Eval node='if node.deprel == "discourse": print(node.form, node.upos)' < UD_English/dev.conllu > disc.txt
cat disc.txt | sort | uniq -c | sort -rn | less
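
For comparison, the same frequency analysis can be sketched in plain Python without Udapi. This is only an illustration, not how udapy works internally: it reads the CoNLL-U columns directly (FORM is column 2, UPOS column 4, DEPREL column 8) and counts with collections.Counter. The function name discourse_counts is made up for this example.

```python
import os
from collections import Counter


def discourse_counts(conllu_lines, deprel="discourse"):
    """Count (form, upos) pairs of tokens with the given deprel.

    conllu_lines is any iterable of CoNLL-U lines, e.g. an open file.
    """
    counts = Counter()
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # sentence separator or comment
        cols = line.split("\t")
        # Skip malformed lines, multiword tokens (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        form, upos, rel = cols[1], cols[3], cols[7]
        if rel == deprel:
            counts[(form, upos)] += 1
    return counts


if __name__ == "__main__":
    # Run from inside the extracted "sample" directory.
    path = "UD_English/dev.conllu"
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            # most_common() sorts by descending count, like sort -rn.
            for (form, upos), n in discourse_counts(f).most_common():
                print(n, form, upos)
```

The counts should match the sort | uniq -c output, although pairs with equal counts may appear in a different order.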

TextLink training school, Prague, February 9, 2017