Pavel Straňák
stranak@ufal.mff.cuni.cz
Friday 12:30–14:00
Palachovo nám. 2, room S131
21. 4. 2023
Corpus: a large amount of collected text, processed in a uniform way for searching and statistics:
All the following corpus managers use the same query language for advanced queries: CQP / CQL. It is quite useful to learn some basics of CQL.
https://app.sketchengine.eu/#open
https://korpus.cz/kontext
https://wiki.korpus.cz/doku.php/manualy:kontext:index
http://lindat.cz/services/teitok/
http://teitok.org
(docs, other installations in Europe with interesting corpora)
There are many other corpus interfaces, each with its own corpora, but the KonText and TEITOK corpus managers are among the best in the world.
http://bcc.blcu.edu.cn/
9.5 billion words: 3 billion literature, 2 billion “古汉语” (but incl. 文言文, 康熙字典⋯⋯), Weibo, subtitles …
http://aranea.juls.savba.sk/aranea_about/_sinicum.html
At LINDAT we currently host Universal Dependencies and a few treebanks outside it, such as the Penn Chinese Treebank. We are happy to host other corpora; just get in touch.
A for loop is a way to repeat some processing for each item in a list, for example to process a number of files in the same way. To begin with, we will just print some file names to the screen in a few different ways.
for f in *; do echo "My name is: $f"; done
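Once printing works, the same loop shape can do real processing. The following is only a minimal sketch: the directory loop_demo and the files a.txt and b.txt are hypothetical names created for the example, and counting words with wc is just one possible "processing" step.

```shell
# Hypothetical demo directory and files, created only for this example.
mkdir -p loop_demo
printf 'one two three\n' > loop_demo/a.txt
printf 'four five\n' > loop_demo/b.txt

# Run the same commands for every file matched by the pattern.
for f in loop_demo/*.txt; do
    echo "Counting words in: $f"
    # Save the word count of each file next to it, as <name>.count
    wc -w < "$f" > "$f.count"
done
```

Each .txt file gets a companion .count file. Because the pattern ends in .txt, running the loop again does not pick up the .count files.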
Command substitution: $(command) or `command`
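Both forms run a command and paste its output into the surrounding command line. A small sketch of how that looks in practice; the variable names and the path /usr/local/bin are chosen only for illustration.

```shell
# $(...) runs the command and substitutes its output into the line.
today=$(date +%Y-%m-%d)
echo "Today is $today"

# The backtick form does the same, but is harder to read and to nest.
os=`uname`
echo "Running on $os"

# $(...) can be nested: the inner command runs first.
parent=$(basename "$(dirname /usr/local/bin)")
echo "Parent directory name: $parent"
```

The $(...) form is usually the safer habit, because nesting backticks requires escaping the inner ones.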
for f in $( ls | grep '\.txt$' ) ; do echo "Text file: $f" ; done
i=0; for f in $( ls | grep '\.html$' ) ; do (( i = i + 1 )); echo "HTML file $i: $f" ; done
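Looping over the output of ls splits on whitespace, so a file name containing a space turns into two items. One possible alternative, sketched here with a hypothetical directory glob_demo and made-up file names, is to let the shell expand a pattern directly.

```shell
# Hypothetical demo files; one name contains a space on purpose.
mkdir -p glob_demo
touch "glob_demo/first page.html" glob_demo/second.html glob_demo/notes.txt

i=0
for f in glob_demo/*.html; do
    # If nothing matches, the pattern is left unexpanded; skip that case.
    [ -e "$f" ] || continue
    i=$(( i + 1 ))
    # Quoting "$f" keeps a name with spaces together as one item.
    echo "HTML file $i: $f"
done
```

Here the loop visits exactly the two .html files, and "first page.html" stays in one piece.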