Language Technologies for Research in Humanities


NPFL131 / ATKL00349

Pavel Straňák

stranak@ufal.mff.cuni.cz

Friday 12:30–14:00
Palachovo nám. 2, room S131

24.–31. 3. 2023

Setting up your Linux on Windows II – Windows Terminal

Unix warm-up

cut -f1 table-with-numbers.tsv| awk '{s+=$1}END{print s}' 
cut -f1 table-with-numbers.tsv| perl -nlE '$sum+=$_;END{say $sum}'
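To try the warm-up without the original file, here is a minimal sketch on made-up data. The three rows and the temporary directory are assumptions for illustration, not course material.

```shell
# Build a tiny tab-separated table (hypothetical sample data).
tmpdir=$(mktemp -d)
printf '3\tapples\n10\tpears\n7\tplums\n' > "$tmpdir/table-with-numbers.tsv"

# cut keeps only the first column; awk adds every value to s and prints s at the end.
total=$(cut -f1 "$tmpdir/table-with-numbers.tsv" | awk '{s+=$1} END{print s}')
echo "$total"

rm -r "$tmpdir"
```

With these three rows the sum printed is 20.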

Unix warm-up II

There are two ways to make the combination sort | uniq -c | sort -nr available at any time as a single command under a name of your choice, for instance Sort:

1. an application program

We created this program in the previous lecture (see slides 4). It can already be called, but you always have to find it and call it by its full location. To make the command available anytime and anywhere, it must be placed in one of the directories the shell searches when looking for commands. The list of these directories (separated by :) is stored in the environment variable $PATH:

env
echo $PATH
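A minimal sketch of the first approach. It assumes a throwaway directory standing in for something like $HOME/bin, and a stand-in script whose content may differ from the program written in the previous lecture.

```shell
# Hypothetical location: a temporary directory plays the role of e.g. $HOME/bin.
bindir=$(mktemp -d)

# Stand-in for the program from the previous lecture (assumed content).
cat > "$bindir/Sort" <<'EOF'
#!/bin/sh
sort "$@" | uniq -c | sort -nr
EOF
chmod +x "$bindir/Sort"

# Prepend the directory to the search path; the shell can now find Sort by name.
PATH="$bindir:$PATH"
export PATH

command -v Sort              # prints the full path of the command that was found
printf 'b\na\nb\n' | Sort    # most frequent line first
```

The change to PATH lasts only for the current shell session. To keep it, the PATH assignment would have to go into a startup file.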

Unix warm-up II (cont.)

2. a shell function

The file $HOME/.profile contains commands that are executed when a login shell starts. We can add a new one.

pico $HOME/.profile

Now add this line and save the file:

function Sort { sort "$@" | uniq -c | sort -nr; }

The Sort function should be available in every new shell started from now on.
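A quick check of the function on toy input (hypothetical sample data). The definition is repeated here in the portable name() { ...; } form so that the snippet runs on its own. The ; before the closing brace is required by bash when the body is written on one line.

```shell
# Portable spelling of the function; "$@" passes all arguments on to the first sort.
Sort() { sort "$@" | uniq -c | sort -nr; }

# Reading from standard input: NOUN occurs twice, so it is listed first.
printf 'NOUN\nVERB\nNOUN\n' | Sort

# A file name given as an argument works too.
tmpfile=$(mktemp)
printf 'NOUN\nVERB\nNOUN\n' > "$tmpfile"
Sort "$tmpfile"
rm "$tmpfile"
```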

POS tagging (continued)

Look at the gold data (correct POS) to check a hypothesis: “I should probably make a rule for punctuation, i.e. tokens that do not start with a ‘word character’.”

cat pdt.train.3col | perl -nlE 'say if /^\W/' | cut -f3 | Sort | head
172561 PUNCT
   846 SYM
    40 NUM

cat pdt.train.3col| perl -C -plE'
s/$/\tNOUN/;
s/^([.,;:?!].*)NOUN$/$1PUNCT/;

[more rules here]

'

POS Tagging development

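head -10000 pdt.train.3col >small-pdt.train.3col
# count the tokens whose predicted tag (last column) equals the gold tag (the column before it)
perl -ple '[my rules here]' small-pdt.train.3col | \
perl -nle 'print if /\t([^\t]+)\t\g1$/' | \
wc -l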

Calculating the accuracy from shell variables using perl

u-pl0:~$ OK=157
u-pl0:~$ ALL=479
u-pl0:~$ export OK ALL
u-pl0:~$ perl -E 'say "Accuracy: ", 
        $ENV{OK}/($ENV{ALL}/100), 
        "%."'
Accuracy: 32.776617954071%.

POS Statistics per lemma

grep -i 'klaus' pdt.train.3col | Sort
 144 Klaus            Klaus            PROPN
  74 Klause           Klaus            PROPN
  24 Klausem          Klaus            PROPN
  11 Klausovi         Klaus            PROPN
   9 Klausův          Klausův          ADJ
   6 Klausovu         Klausův          ADJ
   6 Klausova         Klausův          ADJ
   3 Klausově         Klausův          ADJ
   2 klausule         klauzule         NOUN
   2 Klausovy         Klausův          ADJ
   2 KLAUS            Klaus            PROPN
   1 Protiklausovská  protiklausovský  ADJ
   1 Klausovým        Klausův          ADJ
   1 Klausovo         Klausův          ADJ
   1 KLAUSE           Klaus            PROPN
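The listing above counts each combination of form, lemma and POS. To get counts per lemma, one possible variation is to drop the form column first. This is a sketch on a hypothetical three-line sample in the same format, not a command from the lecture.

```shell
# Hypothetical sample (form, lemma, POS); the real data is pdt.train.3col.
tmpfile=$(mktemp)
printf 'Klaus\tKlaus\tPROPN\nKlause\tKlaus\tPROPN\nKlausova\tKlausův\tADJ\n' > "$tmpfile"

# Keep only lemma and POS, then count each lemma/POS pair, most frequent first.
stats=$(cut -f2,3 "$tmpfile" | sort | uniq -c | sort -nr)
printf '%s\n' "$stats"
rm "$tmpfile"
```

With the Sort command from the warm-up available, the last three stages can be replaced by Sort.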