Language Technologies for Research in Humanities


NPFL131 / ATKL00349

Pavel Straňák

stranak@ufal.mff.cuni.cz

Friday 12:30–14:00
Palachovo nám. 2, room C131

17. 2. 2023

Motivation

Requirements

S131 lab at the Faculty of Arts

Unix

Literature

Course structure

  1. Processing text as a necessary basis for (not only) computational linguistics
  2. Why use the Unix shell; the most basic commands
  3. More commands to manipulate texts
  4. Text editors
  5. Searching with regular expressions
  6. Using regular expressions to edit text
  7. Basic principles of formulating and validating hypotheses, application to data, accuracy, completeness, value of results
  8. Removal of diacritics, segmentation into sentences, tokenization
  9. Rule-based automatic identification of word types
  10. Using current state-of-the-art NLP toolkits (web API or local programmes)
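
To give a flavour of how topics 2, 3 and 8 fit together, here is a minimal sketch of a word-frequency list built from a few standard commands. The sample sentence and the output file name word-freq.txt are made up for illustration, and this naive tokenization only handles unaccented ASCII letters; Czech text with diacritics needs more care.

```shell
#!/usr/bin/env bash
set -eu

# Made-up sample text (an assumption, not course data).
text='The cat sat. The dog sat too.'

# 1. lowercase everything
# 2. turn every run of non-letters into a single newline (one token per line)
# 3. sort, count identical lines, and order by count (highest first)
printf '%s\n' "$text" |
  tr 'A-Z' 'a-z' |
  tr -cs 'a-z' '\n' |
  sort |
  uniq -c |
  sort -rn > word-freq.txt

cat word-freq.txt
```

Each command does one small job and the pipe (|) passes its output to the next one, which is the main reason for using the shell for text processing.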

Text Processing

SSH, PuTTY

Shell

Shell II (bash)

Basic commands – files and directories (folders)

Shell III

Basic commands II – text files and variables

less

Homework

  1. If you want to use your own computer and didn’t finish installing Ubuntu today, do so before the next class and test the installation by completing the rest of these tasks.
  2. Practice the basic commands from today’s lecture. To make it a bit more fun, I have uploaded a play by the famous Czech writer Karel Čapek here: https://ufal.mff.cuni.cz/~stranak/matka.txt
    • Download the file to your desktop.
    • Open the Ubuntu terminal, go to the Desktop, and view the file with less.
    • Use head (and man head) to get only the metadata of the file (the lines with some text before the line “MATKA”). Save this text into a separate file.
    • Use tail to get only the final list of Čapek’s works (including the header “První vydání knih Karla Čapka”) and save it into another file.
    • Copy, move and remove some of these files. Try renaming them too (renaming is a kind of move).
    • Bonus point: use the commands we learned to isolate only the line “1934 - Povětroň”.
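
The pattern behind these tasks can be sketched on a small stand-in file. Everything below is illustrative: the directory shell-practice, the file names and the sample contents are made up, and the line counts are not those of matka.txt, so you still need to find the right numbers yourself (less and man head will help).

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical practice directory, so no real files are touched.
mkdir -p shell-practice

# Made-up stand-in file: 2 metadata lines, a title, 2 body lines, a 3-line closing list.
printf '%s\n' \
  'Author: A. Writer' \
  'Year: 1938' \
  'TITLE' \
  'First line of the play.' \
  'Second line of the play.' \
  'List of works' \
  '1933 - First Book' \
  '1934 - Second Book' > shell-practice/sample.txt

# head -n N prints the first N lines; > saves the output into a file.
head -n 2 shell-practice/sample.txt > shell-practice/metadata.txt

# tail -n N prints the last N lines.
tail -n 3 shell-practice/sample.txt > shell-practice/works.txt

# Copy a file, rename the copy (a rename is just a move), then remove it.
cp shell-practice/metadata.txt shell-practice/metadata-copy.txt
mv shell-practice/metadata-copy.txt shell-practice/renamed.txt
rm shell-practice/renamed.txt

# Isolate a single line: keep the first 7 lines, then only the last of those.
head -n 7 shell-practice/sample.txt | tail -n 1 > shell-practice/one-line.txt
cat shell-practice/one-line.txt
```

The last command combines head and tail with a pipe: head cuts away everything after the wanted line, and tail then keeps just that line.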