Automatic processing of text data


NPFL098 / ATKL00345

Pavel Straňák

stranak@ufal.mff.cuni.cz

Tuesday 10:40–13:50
Malostranské nám. 25, SU1

Unix warm-up

cut -f1 table-with-numbers.tsv | awk '{s+=$1} END{print s}'
cut -f1 table-with-numbers.tsv | perl -nlE '$sum+=$_; END{say $sum}'

POS tagging (continued)

perl -plE '
s/$/\tN/;                                # default guess: tag every token as N
s{ ^ ( \d+ \t \w+ti .* ) .$ }{${1}V}x;   # form matches \w+ti: change the guess to V

# [more rules here]

'
head -1000 simplified-train.conll >small-train.conll
perl -ple '[my rules here]' small-train.conll | \
perl -nle 'print if /(.)\t\1$/' | \
wc -l

Calculating the accuracy from shell variables using perl

u-pl0:~$ OK=157
u-pl0:~$ ALL=479
u-pl0:~$ export OK ALL
u-pl0:~$ perl -E 'say "Accuracy: ",
        $ENV{OK}/($ENV{ALL}/100),
        "%."'
Accuracy: 32.776617954071%.

POS Statistics per lemma

u-pl0:~$  perl -nlE 'print if /\t\w+\tklaus/i' \
pos-tag/simplified-train.conll | \
wc -l
195

POS Statistics per lemma (cont.)

u-pl0:~$ perl -nlE '
print if /\t\w+\tklaus/i' pos-tag/simplified-train.conll | \
cut -f2,3,4 | \
sort |uniq -c |sort -nr
96 Klaus    Klaus   N
60 Klause   Klaus   N
18 Klausem  Klaus   N
 8 Klausovi Klaus   N
 4 Klausovu Klausův A
 4 Klausova Klausův A
 2 Klausovy Klausův A
 1 Klausovo Klausův A
 1 KLAUSE   Klaus   N
 1 KLAUS    Klaus   N