Language Technologies for Research in Humanities

NPFL131 / ATKL00349

Pavel Straňák

stranak@ufal.mff.cuni.cz

Friday 12:30–14:00
Palachovo nám. 2, room S131

21. 4. 2023

Language Corpora

Corpus: a large amount of collected text, processed in a uniform way for searching and statistics:

plain text,
- metadata (not always present; poor in many NLP corpora, good in literary and hand-composed corpora like the Czech National Corpus)
- annotation (traditionally PoS, lately syntax, named entities)
historical corpora
- facsimiles linked with digitised pages
- orthography normalisation
- text variants
for poetry: verse structures, rhyming, etc.

Major Corpus Managers 1/2

All the following corpus managers use the same query language for advanced queries: CQP / CQL. It is quite useful to learn some basics of CQL.

CQPWeb: Lancaster
- a free account also allows you to create corpora
- old-school looks, but modern features
- great many English corpora, a selection of others
SketchEngine:
- Czech system, main site commercial
- very strong lexicographic feature: “Word Sketches”
- Czech National Corpus runs an installation (account on request)
- free testing on a few open corpora: https://app.sketchengine.eu/#open

Major Corpus Managers 2/2

Kontext:
- developed at Czech National Corpus, run there, at LINDAT, and many other places
- https://korpus.cz/kontext
- user guide and tutorial: https://wiki.korpus.cz/doku.php/manualy:kontext:index
- the same backend as SketchEngine. No Word Sketches, but more statistical features
TEITOK:
- developed at UFAL MFF UK by Maarten Janssen
- run at LINDAT with many corpora, more can be added on demand: http://lindat.cz/services/teitok/
- http://teitok.org (docs, other installations in Europe with interesting corpora)
- a corpus manager which, unlike the others, can also provide a detailed visualisation of the TEI XML for each document in a corpus

There are many other corpus interfaces, each with its own corpora, but Kontext and TEITOK are among the best corpus managers in the world.

Corpora of Chinese 1/2

ctext.org has some corpus features, but only in the paid version, which is expensive
http://bcc.blcu.edu.cn/ 9.5b words: 3b literature, 2b “古汉语” (but incl. 文言文, 康熙字典⋯⋯), Weibo, subtitles …
- Looks like a good corpus, but the interface has much less functionality than the corpus managers above (and is in Chinese)
CQPWeb, Kontext and TEITOK can all work well with Chinese, but currently they do not seem to have a person dedicated to properly managing Chinese corpora for them
Czech National Corpus:
- Aranea Sinicum: http://aranea.juls.savba.sk/aranea_about/_sinicum.html
  - web corpus annotated with the (obsolete) TreeTagger, many problems
  - **We do have the raw data and an agreement, so we could do a better annotation and publish it!**
- Intercorp includes Chinese
  - good for translation studies (parallel corpus)
  - not very large data, mostly subtitles
  - 202k core, 240k syndicate (news), 2,247k subtitles; sum: 2,689k tokens

Corpora of Chinese 2/2

At LINDAT we currently host Universal Dependencies and a few treebanks outside of it, like the Penn Chinese Treebank. We are happy to host other corpora, just get in touch.

Universal Dependencies in Kontext, TEITOK and PML-TQ
UD corpora are small, but they are treebanks, i.e. they have syntactic annotation and are hand-annotated
- Classical Chinese included
  - see the composition of the corpus (古文 + 文言文)
Other Chinese treebanks in PML-TQ (Penn Chinese Treebank, Academia Sinica): https://lindat.mff.cuni.cz/services/pmltq/
- Tutorial on the PML Tree Query Language

Unix shell – for loop

A way to repeat some processing for each item in a list, for example to process a number of files in the same way. To begin with, we will just print some file names to the screen in a few different ways.

for f in *; do echo "My name is: $f"; done
Command substitution replaces a command with its standard output.
- $(command) or `command`
There can be a sequence of commands in the substitution (the pattern '\.txt$' matches only names that end in .txt):
- for f in $( ls | grep '\.txt$' ) ; do echo "Text file: $f" ; done
We can also add a second variable, like a counter:
- i=0; for f in $( ls | grep '\.html$' ) ; do (( i = i + 1 )); echo "HTML file $i: $f" ; done
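
The loops above take their list from the output of ls, which the shell splits at every space, so a file name containing a space would be treated as two items. A minimal sketch of an alternative, assuming a POSIX-style shell: loop over a glob pattern directly and use command substitution only for the processing step. The temporary directory, the file names and the variable names (workdir, words, report) are made up for illustration.

```shell
# Work in a throw-away directory so the example does not touch real files.
# (The directory and the file names below are hypothetical.)
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one two three\n' > a.txt
printf 'four five\n' > "b c.txt"
printf '<p>hi</p>\n' > index.html

# Loop over a glob instead of the output of ls:
# each matching file name is one item, even if it contains spaces.
i=0
report=""
for f in *.txt; do
    i=$(( i + 1 ))
    # Command substitution: $(...) is replaced by the output of wc -w.
    words=$(wc -w < "$f")
    # Some wc implementations pad the number with spaces; arithmetic
    # expansion reduces it to the bare number.
    words=$(( words ))
    line="Text file $i: $f ($words words)"
    echo "$line"
    # Collect the lines in one variable as well, separated by semicolons.
    report="$report$line;"
done
```

The pattern *.txt is expanded by the shell itself, so index.html is skipped without any call to grep, and "b c.txt" is processed as a single file.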