Bash shortcuts (and Windows specifics)

^a … go to line start
^e … go to line end
^k … cut from the cursor to the end of line
^u … cut from the cursor to the start of line
^w … cut the word before the cursor
^t … transpose characters (around cursor)
^r … search history
^c … stop the running process (which is why the Windows Copy shortcut can’t use ^c in the terminal)
^d … end of input (EOF)
configuring copy & paste in Ubuntu on Windows

`sort, uniq`, frequencies (repeated from last week)

sort [-n -r -f -k] … Numerical, Reverse, Fold lowercase to uppercase, K – by which [k]olumn
uniq [-c -i] … Count, case Insensitive (no Unicode support 👎)
sort | uniq -c | sort -nr … the most common and useful combo
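
A minimal sketch of the frequency combo, assuming a hypothetical word list with one word per line (the sample words and variable names are made up for illustration). uniq only merges adjacent identical lines, which is why the input has to be sorted first.

```shell
# Hypothetical sample data: one word per line.
words=$(printf '%s\n' apple Banana apple cherry banana apple)

# sort -f groups the lines regardless of case,
# uniq -ci counts each group (case-insensitive, ASCII only),
# sort -nr ranks the groups by count, highest first.
freq=$(printf '%s\n' "$words" | sort -f | uniq -ci | sort -nr)
printf '%s\n' "$freq"
```

The first output line is the most frequent word (apple, 3 times). The amount of padding in front of the counts differs between uniq implementations.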

grep

We mostly use perl. Everything here can also be done in perl, but for the very simplest things, especially with fixed strings (no regexes), grep can be simpler to use. Whenever you need a regex, use perl.

find files matching a regex/substring (inside, not in filenames)
- display/count matches
- display context of matches
g/re/p … Globally find by regexp and print results
grep -v … invert the match: hide matching lines and show the non-matching ones instead
grep -c … count
grep -i 're' <file> … ignore case
grep -r 're' <dir> … recursive. In all files in this directory (and its sub-directories, etc.)

[6] alfred:~% ps ax | grep Chrome | grep -v grep |wc -l
       5
[7] alfred:~% ps ax | grep Chrome |wc -l               
       6
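
In the session above, the second count is one higher because the grep process itself contains the string Chrome in its command line; grep -v grep filters it out. The options can also be tried on a small file. This is only a sketch: the file name log.txt and its content are hypothetical, and -l (list file names only) is an extra grep option not covered above.

```shell
# Hypothetical sample file in a temporary directory.
dir=$(mktemp -d)
printf '%s\n' 'Error: disk full' 'ok' 'error: retry' 'OK' > "$dir/log.txt"

matches=$(grep -c 'error' "$dir/log.txt")      # count matching lines (case-sensitive)
matches_i=$(grep -ci 'error' "$dir/log.txt")   # same count, ignoring case
others=$(grep -vi 'error' "$dir/log.txt")      # only the lines that do NOT match
found_in=$(grep -rl 'disk' "$dir")             # -r: search the whole directory tree

echo "$matches $matches_i"   # prints: 1 2
echo "$others"
echo "$found_in"
rm -r "$dir"
```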

cut & paste

head and tail select lines, cut selects columns
cut … cuts a column of STDIN, writes to STDOUT
- -f … field - which column
- -d … delimiter - column delimiter. Default delimiter is \t.

cut -d: -f5 /etc/passwd

paste file1 file2 … paste the content of files together, as columns or as interleaved lines.
most common use: “concatenate” columns into one table

cut -d: -f5 /etc/passwd >c1
cut -d: -f3 /etc/passwd >c2
paste c1 c2
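
The same steps can be tried without touching /etc/passwd. The sketch below assumes a hypothetical file of colon-delimited records in the passwd style (the users and paths are made up). The interleaved variant uses paste -d '\n', which replaces the default tab delimiter with a newline.

```shell
dir=$(mktemp -d)
# Hypothetical records in the style of /etc/passwd.
printf '%s\n' 'alice:x:1001:1001:Alice A.:/home/alice:/bin/bash' \
              'bob:x:1002:1002:Bob B.:/home/bob:/bin/sh' > "$dir/users"

cut -d: -f5 "$dir/users" > "$dir/c1"   # fifth field: full name
cut -d: -f3 "$dir/users" > "$dir/c2"   # third field: numeric user id

table=$(paste "$dir/c1" "$dir/c2")                    # columns, separated by a tab
interleaved=$(paste -d '\n' "$dir/c1" "$dir/c2")      # lines, alternating between files

printf '%s\n' "$table"
printf '%s\n' "$interleaved"
rm -r "$dir"
```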

Perl in command line

Perl is a programming language, but it can be also used like grep, sed, wc, etc.

perl -e 'print "Hello.\n"'
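perl -C -n -e 'print if /XY/' … (the code must follow -e directly; with -e -n, perl would take -n as the code)
- -n wraps the code in a loop like while(<STDIN>){…} and does not print automatically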
- Default variable $_ contains the current line
  - “default variable” means the one that is used when we don’t say which variable to use
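- Parameter -C makes Perl interpret input and output as UTF-8. Used alone it is equivalent to -CSDL, which takes effect only when the locale variables (LC_ALL, LC_CTYPE, LANG) indicate UTF-8; -CSD forces UTF-8 regardless of settings in the user’s shell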
- perl -C -p -e 's/X/Y/g' … Like -n, but Print
perl -C -l -pe … Line endings = auto. The newline is stripped from each input line before the code runs and added back on output. A Windows carriage return (\r) is not removed on unix systems; strip it explicitly with s/\r$//.

Big advantages of Perl

It is a programming language. Anything can be done in a script (=program).

perl -C -ple '$_=reverse()'

Best regular expressions and Unicode support.
- if you don’t believe it, read Unicode Good, Bad and Ugly

Perl Regular Expressions

Regular expressions (regexes) exist in many programming languages and unix tools. There are many variants (“standard”, “extended”, etc.). We will only use Perl regular expressions.

Fast, convenient, great Unicode support, many extensions …

perldoc perlretut / perlrequick
backslash \ adds a special meaning (escape sequences or character classes)
character groups: [abd] # not a string, a group. Any of the 3 matches.
Any character (except newline): .
Quantifiers: * (0 or more), ? (0 or 1), + (1 or more)
^ … beginning of line (position before the first character)
- $line =~ /^\p{Lu}/ # line starts with an upper case letter
$ … end of line (before the newline character)
- $line =~ /(\pP+)$/ # matches if line ends with punctuation (and stores it in $1)

Basic character classes

\d matches a digit, not just [0-9] but also digits from non-roman scripts
\s matches a whitespace character, the set [\ \t\r\n\f] and others
\w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_], but also digits and characters from non-roman scripts
\D is a negated \d; it represents any other character than a digit, or [^\d]
\S is a negated \s; it represents any non-whitespace character [^\s]
\W is a negated \w; it represents any non-word character [^\w]
The period '.' matches any character but \n (unless the modifier //s is in effect).
\N, like the period, matches any character but \n, but it does so regardless of whether the modifier //s is in effect.

Even More Regular Expressions

Parentheses (capture groups): (()) ... \1, \2, $1, $2
- m/(ab(cd))/ # $1 = "abcd"; $2 = "cd"
character classes (independent of LOCALE): \w, \W, \d, \D, \s, \S
Blocks, Scripts and Categories of Unicode:
- \p{Lu} # Letter, upper case (two-letter categories need braces; \pLu would mean \pL followed by a literal u)
- \pP # Punctuation
read perldoc perlunicode and search for General_Category (type /General_Category in the pager)
- \p{Cyrillic} # is a Cyrillic character
- \P{Latin} # is NOT a Latin character
read perldoc perlunicode and do /Scripts

Language Technologies for Research in Humanities

NPFL131 / ATKL00349
