Pavel Straňák
stranak@ufal.mff.cuni.cz
Friday 12:30–14:00
Palachovo nám. 2, room C131
24. 2. – 3. 3. 2023
ssh
scp (see man scp), WinSCP
passwd
lecture_2
matka.txt from a directory Capek in my home directory into lecture_2
iconv
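A minimal sketch of an iconv conversion, using a tiny made-up sample instead of matka.txt. The source encoding of matka.txt is not stated here, so ISO-8859-2 is only an assumption; the file names sample-latin2.txt and sample-utf8.txt are hypothetical.

```shell
# Create a tiny sample in ISO-8859-2: the bytes \276 and \341 are "ž" and "á",
# so the file holds the Czech word "žába" in that legacy encoding.
printf '\276\341ba\n' > sample-latin2.txt

# Convert from the (assumed) source encoding to UTF-8.
iconv -f ISO-8859-2 -t UTF-8 sample-latin2.txt > sample-utf8.txt

# Show the converted text; it should be readable on a UTF-8 terminal.
cat sample-utf8.txt
```

The same pattern with the real encoding of matka.txt would presumably produce the matka-utf8.txt file used in the later commands.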
If you are working on a remote server, you may have only the command line. But it is all you need 😎
file
wget
by default, but all unixes have curl:
curl -O "https://raw.githubusercontent.com/kanripo/KR1h0004/master/KR1h0004_001.txt"
lynx, links
unzip "filename" (if you downloaded an archive)
sed
perl
Address: a line number or a regex. Two addresses give a range of lines: from, to.
find is a regex; replace is the replacement text, which may refer to captured groups such as $1.
perl -ne 'print if /find/'
perl -ple 's/find/replace/g'
… substitute, globally
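To make the addressing and find/replace ideas concrete, here is a small sketch on a throwaway file. The file name demo.txt and its contents are made up for illustration.

```shell
# demo.txt is a hypothetical three-line sample file.
printf 'first line\nsecond line\nthird line\n' > demo.txt

# Print only the lines that match a regex (nothing is changed).
perl -ne 'print if /second/' demo.txt

# Replace every match of the regex "line" with "row", on all lines.
perl -ple 's/line/row/g' demo.txt

# sed with one address (a regex): print only the matching lines.
sed -n '/second/p' demo.txt

# sed with two addresses (from line 2 to line 3): substitute only in that range.
sed '2,3s/line/row/g' demo.txt
```

In the last command the first line stays untouched because it lies outside the 2,3 range.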
-e … “one-liner” mode, i.e. a small program is passed in quotes: -e 'some program'
-n … loop over all lines of input (no automatic print)
-p … loop over all lines of input and print each result to STDOUT (standard output)
-l … handle line endings: strip the newline from each input line and add it back on output (Windows and Unix use different characters to end lines)
-C … use UTF-8 Unicode for all input and output (needed to represent non-Latin characters)
-C must stand separately and first
sort, uniq, occurrences
sort [-n -r -f -k] … Numerical, Reverse, Fold lowercase to uppercase, K – by which [k]olumn
uniq [-c -i] … Count, case Insensitive
sort | uniq -c | sort -nr … the most common and useful combo
The | “pipe” character takes the standard output (STDOUT) of one command and “pipes” it to the standard input (STDIN) of another command. You can chain commands like this almost indefinitely.
do something here | less
perl -C -ple 's/ +/\n/g' matka-utf8.txt | sort | uniq -c | sort -nr | less
perl -C -ple 's/ +/\n/g' matka-utf8.txt | sort | uniq -c | sort -nr >frekvence.slova
Why is this solution not good enough?
Segmentation on spaces does not deal with punctuation. Can you improve it? (You only need the information from this class.)
perl -C -plE 's/(.)/$1\n/g;'