by Martin Popel
Udapi is an API and framework for processing Universal Dependencies, available for Python, Perl and Java. This tutorial uses the Python version and assumes Linux with Bash and Python 3.3 or higher.
You can download my slides about UD and Udapi.
pip3 install --user --upgrade git+https://github.com/udapi/udapi-python.git
export PATH="$HOME/.local/bin/:$PATH"
dev.conllu) for English.
For full UDv1.4 go to Lindat.
wget http://ufal.mff.cuni.cz/~popel/udapi/ud14sample.tgz
tar -xf ud14sample.tgz
cd sample
udapy commands from my slides.
cat */sample.conllu | udapy -T | less -R
This concatenates all languages and pipes them to udapy and then to less (type q to exit).
You can use e.g. UD_English instead of *.
The -R option tells less to display colors (instead of their ANSI codes).
The -T option prints the trees in text mode; it is actually a shortcut for udapy write.TextModeTrees color=1.
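To make concrete what a text-mode view of a dependency tree conveys, here is a minimal sketch in plain Python that reads one CoNLL-U sentence and prints each word indented by its depth below the root. It does not use Udapi and does not reproduce the actual layout of write.TextModeTrees; the sample sentence, the function name render_tree and the indentation style are all assumptions made for illustration.

```python
# Illustrative sketch only: this is NOT the layout produced by udapy -T.
# SAMPLE is a made-up sentence in CoNLL-U format (10 tab-separated columns).
SAMPLE = """\
1\tWell\twell\tINTJ\t_\t_\t3\tdiscourse\t_\t_
2\tI\tI\tPRON\t_\t_\t3\tnsubj\t_\t_
3\tagree\tagree\tVERB\t_\t_\t0\troot\t_\t_
4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""


def render_tree(conllu_sentence):
    """Return one line per word, indented by its depth in the dependency tree."""
    nodes = {}
    for line in conllu_sentence.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword tokens (e.g. 1-2) and empty nodes (e.g. 1.1)
        nodes[int(cols[0])] = {
            "form": cols[1],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        }

    def depth(node_id):
        # Count the steps from this word up to the word attached to the root (head 0).
        steps = 0
        while nodes[node_id]["head"] != 0:
            node_id = nodes[node_id]["head"]
            steps += 1
        return steps

    return [
        "{}{} {} {}".format("    " * depth(i), n["form"], n["upos"], n["deprel"])
        for i, n in sorted(nodes.items())
    ]


tree_lines = render_tree(SAMPLE)
print("\n".join(tree_lines))
```

Words stay in their surface order, so the indentation alone shows which word heads the sentence and which words depend on it.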
Run udapy --help to see other useful shortcuts, e.g.
cat UD_English/sample.conllu | udapy -H > en.html
will create an HTML version, which you can open in any modern browser.
-HA will include all the nodes' attributes in the HTML output.
Option A: search the documentation.
See the documentation of the discourse deprel.
Option B: browse UD_English/dev.conllu as in the previous step and find the occurrences of discourse.
udapy -T < UD_English/dev.conllu | less -R
In the less
Option C: extract all word forms and UPOS tags of nodes annotated with the discourse deprel in UD_English/dev.conllu.
Hints: use udapy util.Eval node='PYTHON_CODE' and substitute PYTHON_CODE with code which should use node.deprel, node.form and node.upos.
The standard Unix way of frequency analysis is sort | uniq -c | sort -rn.
udapy util.Eval node='if node.deprel == "discourse": print(node.form, node.upos)' < dev.conllu > disc.txt
cat disc.txt | sort | uniq -c | sort -rn | less
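The same extraction and frequency count can also be sketched in plain Python without Udapi, by reading the CoNLL-U columns directly (FORM is column 2, UPOS is column 4, DEPREL is column 8) and counting with collections.Counter in place of sort | uniq -c | sort -rn. The inline sample sentences and the function name count_discourse are assumptions made for illustration, not part of Udapi.

```python
from collections import Counter

# Made-up CoNLL-U sample (two sentences); real data would come from a .conllu file.
SAMPLE = """\
# text = Well, yes.
1\tWell\twell\tINTJ\t_\t_\t3\tdiscourse\t_\t_
2\t,\t,\tPUNCT\t_\t_\t1\tpunct\t_\t_
3\tyes\tyes\tINTJ\t_\t_\t0\troot\t_\t_
4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_

# text = Well, please stay.
1\tWell\twell\tINTJ\t_\t_\t4\tdiscourse\t_\t_
2\t,\t,\tPUNCT\t_\t_\t1\tpunct\t_\t_
3\tplease\tplease\tINTJ\t_\t_\t4\tdiscourse\t_\t_
4\tstay\tstay\tVERB\t_\t_\t0\troot\t_\t_
5\t.\t.\tPUNCT\t_\t_\t4\tpunct\t_\t_
"""


def count_discourse(lines):
    """Count (form, upos) pairs of words whose DEPREL column is 'discourse'."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        if cols[7] == "discourse":
            counts[(cols[1], cols[3])] += 1
    return counts


counts = count_discourse(SAMPLE.splitlines())
# most_common() sorts by descending count, like the final sort -rn.
for (form, upos), n in counts.most_common():
    print(n, form, upos)
```

To run it on the tutorial data, pass an open file instead of the sample, e.g. count_discourse(open("UD_English/dev.conllu", encoding="utf-8")).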