Automatic Processing of Text Data


NPFL098 / ATKL00345

Pavel Straňák

stranak@ufal.mff.cuni.cz

Tuesday 10.40–13.50
Malostranské nám. 25, SU1

14. 3. 2017

Keyboard shortcuts for Bash

sort, uniq, frequencies
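
A minimal sketch of the frequency-list idiom that this heading names, assuming the input has one token per line. The sample words are made up for illustration.

```shell
# Made-up sample input: one word per line.
words='the
cat
the
dog
the
cat'

# sort groups identical lines, uniq -c counts each group,
# sort -rn orders the counts from most to least frequent.
freq=$(printf '%s\n' "$words" | sort | uniq -c | sort -rn)
printf '%s\n' "$freq"

# The most frequent word: second field of the first line.
top=$(printf '%s\n' "$freq" | head -n 1 | awk '{print $2}')
echo "$top"
```

uniq only merges adjacent identical lines, which is why the first sort is needed.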

grep

We mostly use Perl. Everything here can be done in Perl, but for the very simplest tasks, especially searching for fixed strings (no regexes), grep can be simpler to use. Whenever you need a regex, use Perl.

[6] alfred:~% ps ax | grep Chrome | grep -v grep | wc -l
       5
[7] alfred:~% ps ax | grep Chrome | wc -l
       6
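
The two counts above differ by one because the grep process itself contains the string "Chrome" in its command line and so appears in the ps listing. The sketch below reproduces the effect on fixed text, so it does not depend on what is running. The process list is made up, and the -F and -c options are an addition not used in the session above.

```shell
# Made-up stand-in for `ps ax` output, including the grep process itself.
ps_output='  101 ??  S  0:01.00 /Applications/Chrome
  102 ??  S  0:00.50 Chrome Helper
  103 s00 R+ 0:00.00 grep Chrome
  104 ??  S  0:02.00 /usr/sbin/sshd'

# Count every line containing the fixed string "Chrome"
# (-F: fixed string, no regex; -c: print the number of matching lines).
all=$(printf '%s\n' "$ps_output" | grep -F -c Chrome)

# Drop lines that mention grep first (-v inverts the match), then count.
real=$(printf '%s\n' "$ps_output" | grep -v grep | grep -F -c Chrome)

echo "with the grep line: $all, without it: $real"
```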

cut & paste

cut -d: -f5 /etc/passwd
cut -d: -f5 /etc/passwd >c1
cut -d: -f3 /etc/passwd >c2
paste c1 c2
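
The same steps can be tried without touching the real /etc/passwd. This is a sketch on a made-up two-line file in the same colon-separated format; the file contents and the temporary directory are assumptions for illustration.

```shell
# Work in a temporary directory so nothing is left behind.
tmp=$(mktemp -d)

# Made-up file in /etc/passwd format (7 colon-separated fields).
cat > "$tmp/passwd" <<'EOF'
root:x:0:0:System Administrator:/root:/bin/sh
alice:x:1000:1000:Alice Example:/home/alice:/bin/bash
EOF

cut -d: -f5 "$tmp/passwd" > "$tmp/c1"   # field 5: full name
cut -d: -f3 "$tmp/passwd" > "$tmp/c2"   # field 3: numeric user ID

# paste joins the two files line by line, separated by a tab.
joined=$(paste "$tmp/c1" "$tmp/c2")
printf '%s\n' "$joined"

rm -r "$tmp"
```

cut selects columns from one file; paste puts columns from several files side by side, so together they can reorder fields.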

Perl in command line

Perl is a programming language, but it can also be used like grep, sed, wc, etc.

Big advantages of Perl

  1. It is a programming language. Anything can be done in a script (=program).
     perl -C -ple '$_=reverse()'
  2. Best regular expressions and Unicode support.

Perl Regular Expressions

Regular expressions (regexes) exist in many programming languages and Unix tools, in many variants ("standard", "extended", etc.). We will only use Perl regular expressions.

Fast, convenient, great Unicode support, many extensions ...

Basic character classes

\d matches a digit, not just [0-9] but also digits from non-Roman scripts

\s matches a whitespace character, the set [\ \t\r\n\f] and others

\w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_] but also digits and characters from non-Roman scripts

\D is a negated \d; it represents any character other than a digit, or [^\d]

\S is a negated \s; it represents any non-whitespace character [^\s]

\W is a negated \w; it represents any non-word character [^\w]

The period '.' matches any character but "\n" (unless the modifier "//s" is in effect, as explained below).

\N, like the period, matches any character but "\n", but it does so regardless of whether the modifier "//s" is in effect.

More Regular Expressions