Automatic processing of text data

NPFL098 / ATKL00345

Pavel Straňák

stranak@ufal.mff.cuni.cz

Tuesday 10:40–13:50
Malostranské nám. 25, SU1

18. 4. 2017

Text Encodings

A way to encode the characters of a natural language into a simple set of symbols suitable for storing and transmitting. Usually this means a binary encoding.

Binary encodings are quite old: the Yijing (I Ching/易经) already uses one; Leibniz described the binary system we use today.

7-bit encoding – ASCII (1950s-60s) – control characters, punctuation, a-zA-Z0-9
8-bit encodings (1970s-90s, unfortunately even today) – ASCII + 128 more characters; regional, operating system (unix, dos, win, mac), and other varieties
More bits for complex characters (Chinese, Japanese, ...)
- also many varieties
Until Unicode came to solve this mess.

Use iconv to convert between encodings.
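
Inside a Perl script, the core Encode module can do roughly what iconv does on the command line (e.g. iconv -f ISO-8859-2 -t UTF-8). The following is a minimal sketch; the helper name convert_encoding and the sample bytes are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode a byte string by going through
# Perl's internal character string (bytes -> characters -> bytes).
sub convert_encoding {
    my ($bytes, $from, $to) = @_;
    my $text = decode($from, $bytes);   # bytes in $from -> characters
    return encode($to, $text);          # characters -> bytes in $to
}

# "cestina" with caron on c and s, written as ISO-8859-2 bytes
my $latin2 = "\x{E8}e\x{B9}tina";
my $utf8   = convert_encoding($latin2, 'iso-8859-2', 'UTF-8');

# Show the resulting octets in hexadecimal, dot-separated
printf "%vX\n", $utf8;
```

The two accented letters take one octet each in ISO-8859-2 and two octets each in UTF-8, so the converted string is two octets longer.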

Unicode – Terminology

Unicode

A mapping of characters to unique numbers (code points). Independent of platforms, programming languages, natural languages, etc.

Character

A unit of (written) language representation; an abstraction, not the concrete written glyph.

Composition/Decomposition of characters

A character can be equivalent to a sequence of one or more other characters. Composition/decomposition converts between these equivalent representations.

Diacritic

(1) A mark applied or attached to a symbol to create a new symbol that represents a modified or new value. (2) More broadly, any such mark; it need not change the character's value. See the Unicode Glossary.

Glyph

A way to write a character; its visual representation. There may be several glyphs for one character (e.g. oldstyle, standard and tabular digits).

In displaying Unicode character data, one or more glyphs may be selected to depict a particular character. These glyphs are selected by a rendering engine during composition and layout processing.

Font

A collection of glyphs used for the visual depiction of character data.

Unicode – encodings:

UTF-32
- 4 octets per character
UTF-16
- each character in Basic Multilingual Plane: 2 octets
- other characters 4 octets
UTF-8
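- 1–4 octets per character (the original design allowed up to 6; RFC 3629 restricts it to 4)
- in a multi-octet sequence, the number of leading 1 bits in the first octet gives the number of octets; single-octet (ASCII) characters start with a 0 bit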
- “continuation octets” have the form 10xxxxxx
- advantage: ASCII is a subset of UTF-8
- for Czech, 1 or 2 octets are enough for all the letters

(source: http://ufal.mff.cuni.cz/~zabokrtsky/courses/npfl092/html/slides/w3c_slidy/02.html#(9))
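
The octet counts listed above can be checked with the core Encode module. This is only a sketch: the helper octets_in and the four sample code points are chosen for illustration, and the BE variants are used so that no byte order mark is counted.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of octets one code point needs
# in the given encoding.
sub octets_in {
    my ($encoding, $codepoint) = @_;
    return length(encode($encoding, chr($codepoint)));
}

# a, c with caron, euro sign, mathematical italic small n
for my $cp (0x61, 0x10D, 0x20AC, 0x1D45B) {
    printf "U+%04X  UTF-8: %d  UTF-16: %d  UTF-32: %d\n",
        $cp, map { octets_in($_, $cp) } qw(UTF-8 UTF-16BE UTF-32BE);
}
```

The last sample lies outside the Basic Multilingual Plane, so it needs 4 octets in all three encodings.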

More Unicode – Character Names

Use the \N{charname} notation to get the character by that name for use in interpolated literals (double-quoted strings and regexes). Since v5.16 the following is implicit; earlier versions need it stated explicitly:

use charnames qw(:full :short);

"\N{MATHEMATICAL ITALIC SMALL N}" # :full
"\N{GREEK CAPITAL LETTER SIGMA}" # :full

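The :full names are the official Unicode character names; anything else is a Perl-specific convenience abbreviation. Specify one or more scripts by name if you want short names that are script-specific.

use charnames qw(:full :short latin greek);   # scripts must be requested explicitly

"\N{Greek:Sigma}" # :short
"\N{ae}" # latin
"\N{epsilon}" # greek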

(from Perl Unicode Cookbook)
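
A small runnable check of the notations above, assuming the scripts latin and greek are requested as in the cookbook. The variable names are illustrative only.

```perl
use strict;
use warnings;
use charnames qw(:full :short latin greek);

my $sigma_full  = "\N{GREEK CAPITAL LETTER SIGMA}";   # :full name
my $sigma_short = "\N{Greek:Sigma}";                  # :short, script:name
my $ae          = "\N{ae}";                           # found via script "latin"

# Both notations for Sigma should yield the same single character.
printf "Sigma: U+%04X and U+%04X\n", ord($sigma_full), ord($sigma_short);
printf "ae:    U+%04X\n", ord($ae);
```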

What characters are there really?

use Unicode::Normalize 'NFD';
use open qw(:std :utf8);

while (<>) {
    chomp;           # do not report the trailing newline as U+000A
    $_ = NFD($_);    # separate the combining diacritics ...
#    s/\p{Mn}//g;     # ... and strip them
    printf "U+%v04X\n", $_;   # code points of the line, dot-separated
}
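
When the bare code points are hard to read, the official character names can be looked up with charnames::viacode. The following sketch is a variation on the script above, not part of the original slides; the helper name describe is made up.

```perl
use strict;
use warnings;
use charnames ();
use Unicode::Normalize qw(NFD);

# Hypothetical helper: one "U+XXXX NAME" line per character
# of the canonically decomposed input.
sub describe {
    my ($text) = @_;
    return map {
        sprintf "U+%04X %s", ord($_), charnames::viacode(ord($_)) // '<unnamed>'
    } split //, NFD($text);
}

# c with caron (U+010D) decomposes into a base letter and a combining mark
print "$_\n" for describe("\x{10D}");
```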

Unicode Normalization

Form	Description
Normalization Form D (NFD)	Canonical Decomposition
Normalization Form C (NFC)	Canonical Decomposition, followed by Canonical Composition
Normalization Form KD (NFKD)	Compatibility Decomposition
Normalization Form KC (NFKC)	Compatibility Decomposition, followed by Canonical Composition

See http://www.unicode.org/reports/tr15/ and perldoc Unicode::Normalize
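
To see how the four forms differ, one can normalize a string that contains both a precomposed letter and a compatibility character. This is a minimal sketch; the sample string and the helper codepoints are chosen for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKD NFKC);

# Hypothetical helper: render a string as dot-separated code points.
sub codepoints { return sprintf "%v04X", shift }

# LATIN SMALL LIGATURE FI followed by precomposed e with acute
my $sample = "\x{FB01}\x{E9}";

printf "%-5s %s\n", 'NFD',  codepoints(NFD($sample));
printf "%-5s %s\n", 'NFC',  codepoints(NFC($sample));
printf "%-5s %s\n", 'NFKD', codepoints(NFKD($sample));
printf "%-5s %s\n", 'NFKC', codepoints(NFKC($sample));
```

The canonical forms keep the ligature and only split or join the accent; the compatibility forms additionally replace the ligature with the letters f and i.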

Unicode Normalization Ⅱ – Potential Pitfalls

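Singletons – Ångström and Ohm symbols

The ANGSTROM SIGN (U+212B) looks like Å (U+00C5, LATIN CAPITAL LETTER A WITH RING ABOVE) and the OHM SIGN (U+2126) looks like Ω (U+03A9, GREEK CAPITAL LETTER OMEGA). However, after decomposing the signs, you do not get them back by recomposing: you get the letters instead.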
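
The round trip can be tried out directly. A minimal sketch, with a made-up helper name round_trip:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: decompose, then recompose.
sub round_trip { return NFC(NFD(shift)) }

# ANGSTROM SIGN and OHM SIGN
for my $sign ("\x{212B}", "\x{2126}") {
    my $back = round_trip($sign);
    printf "U+%04X -> U+%04X (%s)\n",
        ord($sign), ord($back), ($back eq $sign ? 'unchanged' : 'changed');
}
```

Both signs come back as the ordinary letters U+00C5 and U+03A9, so a string comparison against the original fails.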

Unicode Normalization Ⅲ – Potential Pitfalls

Compatibility (K) Composites

Compatibility (K) forms remove formatting distinctions. This is sometimes needed, e.g. if one insists that 'office' should actually be six characters, not four (with the 'ﬃ' ligature) or five (with 'f' and the 'ﬁ' ligature).

Beware of this when comparing strings.
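
A sketch of the 'office' example, assuming that comparing under NFKC is acceptable for the task at hand. The helper same_text is hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

# Hypothetical helper: compare two strings while ignoring
# compatibility (formatting) distinctions.
sub same_text {
    my ($left, $right) = @_;
    return NFKC($left) eq NFKC($right);
}

my $with_ligature = "o\x{FB03}ce";   # 'o' + LATIN SMALL LIGATURE FFI + 'ce'
my $plain         = "office";

printf "length as stored:  %d\n", length($with_ligature);
printf "length after NFC:  %d\n", length(NFC($with_ligature));
printf "length after NFKC: %d\n", length(NFKC($with_ligature));
printf "equal as stored:   %s\n", ($with_ligature eq $plain ? 'yes' : 'no');
printf "equal under NFKC:  %s\n", (same_text($with_ligature, $plain) ? 'yes' : 'no');
```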

Robust stripping of diacritics

Unicode "compositional characters"
- 2 different ways to represent a "combined" character
- normalisation (everything combined or split) (not that simple, we know)
- decomposed representation allows us to work with diacritics separately

use Unicode::Normalize 'NFD';
use open qw(:std :utf8);

while (<>) {
    $_ = NFD($_);    # separate the combining diacritics ...
    s/\p{Mn}//g;     # ... and strip them
    print;
}
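
The same idea wrapped in a function, as a sketch with a made-up name strip_diacritics. It also shows one limit of the approach: letters that have no canonical decomposition, such as L with stroke (U+0141), are left untouched.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper: decompose, drop nonspacing marks, recompose.
sub strip_diacritics {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return NFC($decomposed);
}

binmode STDOUT, ':encoding(UTF-8)';

# "Stranak" with caron on n and acute on a
print strip_diacritics("Stra\x{148}\x{E1}k"), "\n";
# L with stroke, o with acute, d, z with acute
print strip_diacritics("\x{141}\x{F3}d\x{17A}"), "\n";
```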

Robust stripping of diacritics Ⅱ

Not always so simple: Devanagari

अनुस्वारः [anusvāra] vs. candrabindu (anunāsika)
- common error in Hindi
- normalisation is linguistically wrong, but possible and practical
- DEVANAGARI SIGN CANDRABINDU is not a "diacritic", but it is a "nonspacing mark" (Mn)
- DEVANAGARI SIGN ANUSVARA too, so they can both be stripped like diacritics.
nukta often missing (even though it should be written)
- Remove everywhere?
different order: +vowel+candrabindu, or +candrabindu+vowel
danda (U+0964, not |, U+007C, VERTICAL BAR) vs. full stop after a sentence
"double danda"
Devanagari vs. Arabic numerals
variability of spelling of words of English origin impossible to normalise: no rules
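
Some of the points above can be handled mechanically. The sketch below is one possible choice, not a recommendation: it assumes that mapping candrabindu to anusvara, dropping every nukta, and converting Devanagari digits to ASCII are acceptable for the task, and it leaves danda and spelling variation alone. The function name is made up.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper implementing three lossy normalisations.
sub normalise_devanagari {
    my ($text) = @_;
    $text = NFD($text);                  # split precomposed nukta letters
    $text =~ s/\x{93C}//g;               # drop DEVANAGARI SIGN NUKTA
    $text =~ tr/\x{901}/\x{902}/;        # candrabindu -> anusvara
    $text =~ tr/\x{966}-\x{96F}/0-9/;    # Devanagari digits -> ASCII digits
    return $text;
}

# za (precomposed with nukta), space, "hai" with candrabindu, space, digits 1 2
my $sample = "\x{95B} \x{939}\x{948}\x{901} \x{967}\x{968}";
printf "%vX\n", normalise_devanagari($sample);
```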

Even More Unicode (in Perl)

Perl Unicode Cookbook
The Effective Perler
What is Wrong with Sort and How to Fix It
- correct sorting: Unicode and Locale aware
- German (ö = oe), English and Czech (g, h, ch, i) alphabetical sorts (e.g. for a phonebook) differ.
Always decompose and recompose (and think about what happens)!
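
Locale-aware sorting can be sketched with Unicode::Collate and Unicode::Collate::Locale, both of which ship with recent Perl versions. The word list is made up; it shows the Czech rule that "ch" sorts after "h".

```perl
use strict;
use warnings;
use Unicode::Collate;
use Unicode::Collate::Locale;

my @words = qw(hrad chata cihla);

# Default Unicode Collation Algorithm order
my @by_uca = Unicode::Collate->new->sort(@words);

# Czech tailoring: the digraph "ch" is a letter of its own, after "h"
my @by_czech = Unicode::Collate::Locale->new(locale => 'cs')->sort(@words);

print "default: @by_uca\n";
print "Czech:   @by_czech\n";
```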