Pavel Straňák
stranak@ufal.mff.cuni.cz
Friday 12:30–14:00
Palachovo nám. 2, room C131
10. 3. 2023
^a … go to line start
^e … go to line end
^k … cut from the cursor to the end of line
^u … cut from the cursor to the start of line
^w … delete word
^t … transpose characters (around cursor)
^r … search history
^c … stop the running process (so the Windows’ Copy can’t use ^c)
^d … stop input

sort, uniq, frequencies (repeated from last week)
sort [-n -r -f -k] … Numerical, Reverse, Fold lowercase to uppercase, K – by which [k]olumn
uniq [-c -i] … Count, case Insensitive (no Unicode support 👎)
sort | uniq -c | sort -nr … the most common and useful combo

Mostly we use perl. You can do all of this in perl, but for the very simplest things, especially with fixed strings (no regexes), grep can be simpler to use.
Whenever you need a regex, use perl.
g/re/p … Globally find by regexp and print results
grep -v … hide matching lines instead of showing them (prints the lines that do not match)
grep -c … count
grep -i 're' <file> … ignore case
grep -r 're' <dir> … recursive. In all files in this directory (and its sub-directories, etc.)

head and tail select lines, cut columns
cut … cuts a column of STDIN, writes to STDOUT
-f … field - which column
-d … delimiter - column delimiter. Default delimiter is \t.
paste file1 file2 … paste content of files. As columns, or lines (interleaved).

Perl is a programming language, but it can also be used like grep, sed, wc, etc.
perl -e 'print "Hello.\n"'
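perl -C -n -e 'print if /XY/'
-n … wraps the code in while(<STDIN>){ … }. Not printing by itself; $_ contains the current line. (-e must come directly before the code, so -n goes in front of it.)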
perl -C -p -e 's/X/Y/g'
-p … Like -n, but Print (every line is printed after the code has run)
perl -C -l -pe
-l … Line endings = auto. The newline is stripped from each input line and added back on print, so the code never has to deal with it. A Windows \r is not removed by -l on unix systems; strip it explicitly, e.g. with s/\r$//.

Regular expressions (regexes) exist in many programming languages and unix tools. Many variants (“standard”, “extended”, etc.). We will only use Perl regular expressions.
Fast, convenient, great Unicode support, many extensions …
perldoc perlretut / perlrequick
\ … adds a special meaning (escape sequences or character classes)
[abd] # not a string, but a set of characters. Any one of the 3 matches.
. … any character but \n
* (0 or more), ? (0 or 1), + (1 or more)
^ … beginning of line (position before the first character)
$line =~ /^\p{Lu}/ # line starts with an upper case letter (two-letter property names need braces; \pLu would mean \pL followed by a literal "u")
$ … end of line (before the newline character)
$line =~ /(\pP)+$/ # matches if line ends with punctuation (and stores the last punctuation character in $1)
\d matches a digit, not just [0-9] but also digits from non-roman scripts
\s matches a whitespace character, the set [\ \t\r\n\f] and others
\w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_], but also digits and characters from non-roman scripts
\D is a negated \d; it represents any other character than a digit, or [^\d]
\S is a negated \s; it represents any non-whitespace character [^\s]
\W is a negated \w; it represents any non-word character [^\w]
'.' matches any character but \n (unless the modifier //s is in effect, as explained below).
\N, like the period, matches any character but \n, but it does so regardless of whether the modifier //s is in effect.

Brackets: (()) ... \1, \2, $1, $2
m/(ab(cd))/ # $1 = "abcd"; $2 = "cd"
character classes (independent of LOCALE):
\w, \W, \d, \D, \s, \S
Blocks, Scripts and Categories of Unicode:
\p{Lu} # Letter, upper case (braces are required for names longer than one letter)
\pP # Punctuation
read perldoc perlunicode and search (/) for General_Category
\p{Cyrillic} # is a Cyrillic character
\P{Latin} # is NOT a Latin character
read perldoc perlunicode and do /Scripts