Automatic processing of text data

NPFL098 / ATKL00345

Pavel Straňák

Tuesday 10:40–13:50
Malostranské nám. 25, SU1

18. 4. 2017

Text Encodings

A way to encode the characters of a natural language into a simple set of symbols suitable for storing and moving around. Usually this means a binary encoding.

Binary encodings are quite old: the Yijing (I Ching/易经) already uses one; Leibniz created the one we currently use.

Use iconv to convert between encodings.
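Inside a Perl program the same conversion can be sketched with the core Encode module: decode the input bytes into characters, then encode the characters into the target encoding. This is roughly what `iconv -f ISO-8859-2 -t UTF-8` does on the command line. The helper name `recode` and the choice of ISO-8859-2 as the source encoding are illustrative assumptions, not something the slides prescribe.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: recode a byte string from one encoding to another,
# going through Perl's character strings (decode first, then encode).
sub recode {
    my ($bytes, $from, $to) = @_;
    my $chars = decode($from, $bytes);   # bytes -> characters
    return encode($to, $chars);          # characters -> bytes
}

# "č" (U+010D) is the single byte 0xE8 in ISO-8859-2 ...
my $latin2 = "\xE8";
my $utf8   = recode($latin2, 'iso-8859-2', 'UTF-8');

# ... and the two bytes 0xC4 0x8D in UTF-8.
print join(' ', map { sprintf('%02X', ord $_) } split //, $utf8), "\n";   # C4 8D
```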

Unicode – Terminology


Unicode

A mapping of any character to a unique number, independent of platforms, programming languages, natural languages, etc.
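The character-to-number mapping can be inspected directly in Perl with the built-ins `ord` and `chr`. The helper `code_point` below is a hypothetical name, used only to print numbers in the usual U+XXXX notation.

```perl
use strict;
use warnings;

# Hypothetical helper: format a character's code point as U+XXXX.
sub code_point {
    my ($char) = @_;
    return sprintf('U+%04X', ord $char);
}

print code_point('A'), "\n";          # U+0041
print code_point("\x{010D}"), "\n";   # U+010D (LATIN SMALL LETTER C WITH CARON)

# chr goes the other way: from the number back to the character.
print chr(0x41) eq 'A' ? "same character\n" : "different character\n";   # same character
```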


Character

A unit of (written) language representation; an abstraction, not the concrete written glyph.

Composition/Decomposition of characters

A character can be equivalent to a sequence of one or more other characters. Composition/decomposition converts between these equivalent representations.
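A minimal sketch of this equivalence, assuming the Unicode::Normalize module that the later slides use: "é" can be stored as one precomposed character or as "e" followed by a combining acute accent. The two strings differ when compared directly, but composition and decomposition convert one into the other. The helper `code_points` is a hypothetical name introduced for display.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{00E9}";     # é as one precomposed character
my $decomposed = "e\x{0301}";    # e followed by COMBINING ACUTE ACCENT

# Hypothetical helper: list the code points of a string.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord $_) } split //, $str;
}

print code_points(NFD($composed)),   "\n";   # U+0065 U+0301
print code_points(NFC($decomposed)), "\n";   # U+00E9

# A plain comparison sees two different strings ...
print $composed eq $decomposed ? "equal\n" : "different\n";             # different
# ... but they are equal once both are brought to the same form.
print NFC($composed) eq NFC($decomposed) ? "equal\n" : "different\n";   # equal
```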

Diacritic

(1) A mark applied or attached to a symbol to create a new symbol that represents a modified or new value. (2) A mark applied to a symbol that need not change the character's value. See the Unicode Glossary.

Glyph

A way to write a character; its representation. There may be several glyphs for a character (e.g. oldstyle, standard and tabular digits).

In displaying Unicode character data, one or more glyphs may be selected to depict a particular character. These glyphs are selected by a rendering engine during composition and layout processing.


Font

A collection of glyphs used for the visual depiction of character data.

Unicode – encodings:


More Unicode – Character Names

Use the \N{charname} notation to get the character by that name for use in interpolated literals (double-quoted strings and regexes). Since v5.16, there is an implicit

use charnames qw(:full :short);

The :full names are the official Unicode character names, aliases, or sequences. Anything else is a Perl-specific convenience abbreviation. Specify one or more scripts by name if you want short names that are script-specific.

(from Perl Unicode Cookbook)
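A short sketch of the three kinds of names, assuming the `greek` script is requested in addition to `:full` and `:short`. The variable names are illustrative. `charnames::viacode` goes the other way, from a code point to its official name.

```perl
use strict;
use warnings;
use charnames qw(:full :short greek);

my $full   = "\N{GREEK CAPITAL LETTER SIGMA}";   # :full, the official Unicode name
my $short  = "\N{Greek:Sigma}";                  # :short, written as script:name
my $script = "\N{epsilon}";                      # script-specific short name (greek)

printf "U+%04X\n", ord $full;     # U+03A3
printf "U+%04X\n", ord $short;    # U+03A3
printf "U+%04X\n", ord $script;   # U+03B5

# From a code point back to the official name.
print charnames::viacode(0x03A3), "\n";   # GREEK CAPITAL LETTER SIGMA
```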

What characters are there really?

use Unicode::Normalize 'NFD';
use open qw(:std :utf8);

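while (<>) {
    chomp;           # drop the newline so it is not printed as a code point
    $_ = NFD($_);    # separate the combining diacritics ...
#    s/\p{Mn}//g;     # ... and strip them
    printf "U+%v04X\n", $_;    # code points of the line, joined by dots
}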

Unicode Normalization

Form                            Description
Normalization Form D (NFD)      Canonical Decomposition
Normalization Form C (NFC)      Canonical Decomposition, followed by Canonical Composition
Normalization Form KD (NFKD)    Compatibility Decomposition
Normalization Form KC (NFKC)    Compatibility Decomposition, followed by Canonical Composition

See perldoc Unicode::Normalize
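To see the four forms side by side, here is a small sketch that normalizes one test string with Unicode::Normalize. The test string (the "fi" ligature U+FB01 followed by a precomposed é) and the helper `code_points` are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKD NFKC);

# Hypothetical helper: list the code points of a string.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord $_) } split //, $str;
}

# LATIN SMALL LIGATURE FI followed by a precomposed é.
my $text = "\x{FB01}\x{00E9}";

my %forms = (
    NFD  => NFD($text),
    NFC  => NFC($text),
    NFKD => NFKD($text),
    NFKC => NFKC($text),
);

for my $name (qw(NFD NFC NFKD NFKC)) {
    printf "%-4s %s\n", $name, code_points($forms{$name});
}
# NFD  U+FB01 U+0065 U+0301
# NFC  U+FB01 U+00E9
# NFKD U+0066 U+0069 U+0065 U+0301
# NFKC U+0066 U+0069 U+00E9
```

The canonical forms keep the ligature and only split or join the accent; the compatibility (K) forms also replace the ligature with plain "f" and "i".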

Unicode Normalization Ⅱ – Potential Pitfalls

Singletons – Angström and Ohm symbols


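Å (U+212B ANGSTROM SIGN) looks like Å (U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE), and Ω (U+2126 OHM SIGN) looks like Ω (U+03A9 GREEK CAPITAL LETTER OMEGA). However, after decomposing the two signs, you don't get them back by recomposing: NFC yields U+00C5 and U+03A9 instead.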
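The singleton behaviour can be checked with a decompose-then-recompose round trip. The sketch below assumes the two characters in question are U+212B ANGSTROM SIGN and U+2126 OHM SIGN. The helper `round_trip` is a hypothetical name.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: decompose, then recompose.
sub round_trip {
    my ($str) = @_;
    return NFC(NFD($str));
}

my $angstrom_sign = "\x{212B}";   # ANGSTROM SIGN
my $ohm_sign      = "\x{2126}";   # OHM SIGN

printf "U+%04X -> U+%04X\n", ord $angstrom_sign, ord round_trip($angstrom_sign);
# U+212B -> U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE)

printf "U+%04X -> U+%04X\n", ord $ohm_sign, ord round_trip($ohm_sign);
# U+2126 -> U+03A9 (GREEK CAPITAL LETTER OMEGA)
```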

Unicode Normalization Ⅲ – Potential Pitfalls

Compatibility (K) Composites


Compatibility (K) forms remove formatting distinctions. This is sometimes needed, e.g. if one insists that 'office' should actually be six characters, not four (with the 'ffi' ligature) or five (with 'f' and the 'fi' ligature).

Beware of this when comparing strings.
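A sketch of the comparison pitfall, assuming the ligatures are U+FB03 (ffi) and U+FB01 (fi): the three spellings of "office" have different lengths and compare as different strings until they are brought to a compatibility form such as NFKC.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

my $plain    = "office";          # six characters
my $with_ffi = "o\x{FB03}ce";     # four characters, LATIN SMALL LIGATURE FFI
my $with_fi  = "of\x{FB01}ce";    # five characters, LATIN SMALL LIGATURE FI

for my $word ($with_ffi, $with_fi) {
    printf "raw length %d, NFC length %d, NFKC length %d, NFKC equals 'office': %s\n",
        length $word,
        length NFC($word),        # canonical form keeps the ligature
        length NFKC($word),       # compatibility form expands it
        NFKC($word) eq $plain ? 'yes' : 'no';
}
# raw length 4, NFC length 4, NFKC length 6, NFKC equals 'office': yes
# raw length 5, NFC length 5, NFKC length 6, NFKC equals 'office': yes
```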

Robust stripping of diacritics

use Unicode::Normalize 'NFD';
use open qw(:std :utf8);

while (<>) {
    $_ = NFD($_);    # separate the combining diacritics ...
    s/\p{Mn}//g;     # ... and strip them
    print;           # emit the stripped line
}

Robust stripping of diacritics Ⅱ

Not always so simple: Devanagari
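One plausible reading of this point, sketched under the assumption that the concern is the \p{Mn} class itself: in Devanagari the virama (U+094D) is a nonspacing mark, so the stripping recipe removes it, while vowel signs such as U+093F and U+0940 are spacing marks (Mc) and stay. The result is a different spelling, not the same word without accents. The helper `strip_marks` and the sample word are illustrative choices.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: the stripping recipe wrapped in a function.
sub strip_marks {
    my ($str) = @_;
    $str = NFD($str);        # separate the combining marks ...
    $str =~ s/\p{Mn}//g;     # ... and strip the nonspacing ones
    return $str;
}

# Hypothetical helper: list the code points of a string.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord $_) } split //, $str;
}

# Latin: the caron is removed and the base letter "c" remains.
print code_points(strip_marks("\x{010D}")), "\n";   # U+0063

# "Hindi" in Devanagari: HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II
my $hindi = "\x{0939}\x{093F}\x{0928}\x{094D}\x{0926}\x{0940}";

# Only the virama (Mn) is removed; the vowel signs (Mc) are kept.
print code_points(strip_marks($hindi)), "\n";
# U+0939 U+093F U+0928 U+0926 U+0940
```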

Even More Unicode (in Perl)