Pavel Straňák
stranak@ufal.mff.cuni.cz
Tuesday 10.40–13.50
Malostranské nám. 25, SU1
18. 4. 2017
A way to encode the characters of a natural language into a simple set of symbols suited to storage and transfer. Usually this means a binary encoding.
Binary encodings are quite old: see the Yijing (I Ching/易经). Leibniz created the one we currently use.
Use iconv to convert between encodings.
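The same kind of conversion that iconv performs on the command line (for example `iconv -f ISO-8859-2 -t UTF-8`) can be sketched in Perl with the core Encode module. This is only a minimal illustration; the single-byte sample input and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sample input (an assumption for this sketch): the letter "č",
# which is the single byte 0xE8 in ISO-8859-2 (Latin-2).
my $latin2_bytes = "\xE8";

# Step 1: bytes in the source encoding -> Perl character string (U+010D).
my $chars = decode('ISO-8859-2', $latin2_bytes);

# Step 2: character string -> bytes in the target encoding (UTF-8).
my $utf8_bytes = encode('UTF-8', $chars);

# Show the resulting bytes in hex, dot-separated.
printf "%vX\n", $utf8_bytes;
```

Going through a decoded character string in the middle, rather than mapping bytes to bytes directly, is what lets any supported pair of encodings be combined.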
Mapping of any characters to unique numbers. Independent of platforms, programming languages, natural languages, etc.
A unit of (written) language representation; an abstraction, not the concrete written glyph.
A character can be equivalent to a sequence of one or more other characters. Composition/decomposition converts between these equivalent representations.
A way to write a character; its visual representation. There may be several glyphs for one character (e.g. oldstyle, standard and tabular digits).
In displaying Unicode character data, one or more glyphs may be selected to depict a particular character. These glyphs are selected by a rendering engine during composition and layout processing.
A collection of glyphs used for the visual depiction of character data.
(source: http://ufal.mff.cuni.cz/~zabokrtsky/courses/npfl092/html/slides/w3c_slidy/02.html#(9))
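The point about equivalent character sequences can be checked directly. The following is a minimal sketch, assuming the letter é as the sample character; it compares the precomposed form with the decomposed one before and after normalization.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{00E9}";    # é as a single code point
my $decomposed  = "e\x{0301}";   # e followed by COMBINING ACUTE ACCENT

# As raw strings the two differ, although they are canonically equivalent.
my $raw_equal = $precomposed eq $decomposed ? 1 : 0;

# After composing both to the same form they compare equal.
my $nfc_equal = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

# Decomposition turns the single code point into two.
my $nfd_length = length NFD($precomposed);

print "raw: $raw_equal, after NFC: $nfc_equal, NFD length: $nfd_length\n";
```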
Use the \N{charname} notation to get the character by that name for use in interpolated literals (double-quoted strings and regexes). In v5.16 and later, there is an implicit
use charnames qw(:full :short);
"\N{MATHEMATICAL ITALIC SMALL N}" # :full
"\N{GREEK CAPITAL LETTER SIGMA}" # :full
Anything else is a Perl-specific convenience abbreviation. Specify one or more scripts by names if you want short names that are script-specific.
"\N{Greek:Sigma}" # :short
"\N{ae}" # latin
"\N{epsilon}" # greek
(from Perl Unicode Cookbook)
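To see the full, short and script-specific names side by side, here is a small runnable sketch. It assumes the greek script is requested explicitly in the `use charnames` line, and it uses `charnames::viacode` to map each code point back to its official name.

```perl
use strict;
use warnings;
use charnames qw(:full :short greek);

my $sigma_full  = "\N{GREEK CAPITAL LETTER SIGMA}";  # :full
my $sigma_short = "\N{Greek:Sigma}";                 # :short
my $epsilon     = "\N{epsilon}";                     # greek script name

# Print each code point together with its official Unicode name.
for my $char ($sigma_full, $sigma_short, $epsilon) {
    printf "U+%04X %s\n", ord($char), charnames::viacode(ord($char));
}
```

The full and the short notation resolve to the same character, so which one to use is purely a readability choice.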
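use strict;
use warnings;
use Unicode::Normalize 'NFD';
use open qw(:std :utf8);    # read and write UTF-8 on the standard streams
while (<>) {
    chomp;                  # keep the newline out of the code point listing
    $_ = NFD($_);           # separate the combining diacritics ...
    # s/\p{Mn}//g;          # ... and strip them
    printf "U+%v04X\n", $_; # print the code points, dot-separated
}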
Form | Description |
---|---|
Normalization Form D (NFD) | Canonical Decomposition |
Normalization Form C (NFC) | Canonical Decomposition, followed by Canonical Composition |
Normalization Form KD (NFKD) | Compatibility Decomposition |
Normalization Form KC (NFKC) | Compatibility Decomposition, followed by Canonical Composition |
See http://www.unicode.org/reports/tr15/ and perldoc Unicode::Normalize
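The four forms in the table can be compared on one string. A minimal sketch, assuming a two-character sample (é followed by the "fi" ligature) chosen so that canonical and compatibility decomposition give visibly different results:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKD NFKC);

# Sample string (an assumption for this sketch):
# U+00E9 (é) followed by U+FB01 (LATIN SMALL LIGATURE FI).
my $s = "\x{00E9}\x{FB01}";

my %form = (
    NFD  => NFD($s),    # canonical decomposition
    NFC  => NFC($s),    # ... followed by canonical composition
    NFKD => NFKD($s),   # compatibility decomposition
    NFKC => NFKC($s),   # ... followed by canonical composition
);

# List the code points of each form.
for my $name (qw(NFD NFC NFKD NFKC)) {
    printf "%-4s U+%v04X\n", $name, $form{$name};
}
```

Only the K forms split the ligature into separate letters; the canonical forms leave it alone.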
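Singletons – Ångström and Ohm symbols
The ANGSTROM SIGN (U+212B) looks like Å (U+00C5, LATIN CAPITAL LETTER A WITH RING ABOVE), and the OHM SIGN (U+2126) looks like Ω (U+03A9, GREEK CAPITAL LETTER OMEGA). However, after decomposing the two signs you do not get them back by recomposing: the result is the letter, U+00C5 or U+03A9.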
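The singleton behaviour can be reproduced with a decompose-then-recompose round trip. A minimal sketch; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $angstrom_sign = "\x{212B}";   # ANGSTROM SIGN
my $ohm_sign      = "\x{2126}";   # OHM SIGN

# Decompose, then recompose: the signs do not survive the round trip.
my $angstrom_round = NFC(NFD($angstrom_sign));
my $ohm_round      = NFC(NFD($ohm_sign));

printf "U+%04X -> U+%04X\n", ord($angstrom_sign), ord($angstrom_round);
printf "U+%04X -> U+%04X\n", ord($ohm_sign),      ord($ohm_round);
```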
Compatibility (K) Composites
Compatibility (K) forms remove formatting distinctions. This is sometimes needed, e.g. if one insists that 'office' should actually be six characters, not four (with the 'ffi' ligature) or five (with 'f' and the 'fi' ligature).
Beware of this when comparing strings.
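A minimal sketch of the comparison pitfall, using the 'office' example: the three spellings differ as raw strings but agree after NFKC. The helper `same_text` is a hypothetical name introduced only for this illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKC);

my $with_ffi = "o\x{FB03}ce";    # 4 characters, with the 'ffi' ligature
my $with_fi  = "of\x{FB01}ce";   # 5 characters, with the 'fi' ligature
my $plain    = "office";         # 6 characters

# Hypothetical helper: compare two strings after compatibility normalization.
sub same_text {
    my ($left, $right) = @_;
    return NFKC($left) eq NFKC($right) ? 1 : 0;
}

printf "lengths: %d %d %d\n", length($with_ffi), length($with_fi), length($plain);
printf "ffi vs plain: %d, fi vs plain: %d\n",
    same_text($with_ffi, $plain), same_text($with_fi, $plain);
```

Whether folding ligatures like this is appropriate depends on the application, since the K forms discard information that NFC and NFD keep.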
use Unicode::Normalize 'NFD';
use open qw(:std :utf8);
while (<>) {
    $_ = NFD($_);   # separate the combining diacritics ...
    s/\p{Mn}//g;    # ... and strip them
    print;
}
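The same decompose-and-strip step can be wrapped in a function for use on individual strings. A minimal sketch; `strip_marks` is a hypothetical name, and the sample words are assumptions for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose, then delete all nonspacing marks (Mn).
sub strip_marks {
    my ($text) = @_;
    my $bare = NFD($text);
    $bare =~ s/\p{Mn}//g;
    return $bare;
}

# "Straňák" written with escapes: ň is U+0148, á is U+00E1.
my $name = strip_marks("Stra\x{0148}\x{00E1}k");
print "$name\n";

# Caveat: a letter without a canonical decomposition, such as
# ł (U+0142), is left unchanged by this approach.
my $stroke = strip_marks("\x{0142}");
```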
DEVANAGARI SIGN CANDRABINDU is not a "diacritic", but it is a "nonspacing mark" (Mn). DEVANAGARI SIGN ANUSVARA is too, so both can be stripped like diacritics.
+vocal+candrabindu, or +candrabindu+vocal
U+0964 (DEVANAGARI DANDA; not |, U+007C, VERTICAL LINE, i.e. the vertical bar) vs. full stop after a sentence
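The category claims about the Devanagari signs, and the difference between the danda and the ASCII bar, can be checked with Unicode properties in a regex. A minimal sketch; the sample syllable (HA followed by CANDRABINDU) is an assumption for the example.

```perl
use strict;
use warnings;
use charnames ':full';

my $candrabindu = "\N{DEVANAGARI SIGN CANDRABINDU}";   # U+0901
my $anusvara    = "\N{DEVANAGARI SIGN ANUSVARA}";      # U+0902
my $danda       = "\N{DEVANAGARI DANDA}";              # U+0964

# Both signs have the general category Mn (nonspacing mark).
my $both_mn = ($candrabindu =~ /^\p{Mn}$/ && $anusvara =~ /^\p{Mn}$/) ? 1 : 0;

# So the diacritic-stripping regex removes them as well.
my $syllable = "\N{DEVANAGARI LETTER HA}" . $candrabindu;
(my $stripped = $syllable) =~ s/\p{Mn}//g;

# The danda is punctuation (Po) and a different character from "|".
my $danda_is_po  = $danda =~ /^\p{Po}$/ ? 1 : 0;
my $danda_is_bar = $danda eq "|" ? 1 : 0;

printf "Mn: %d, stripped length: %d, danda Po: %d, danda is bar: %d\n",
    $both_mn, length($stripped), $danda_is_po, $danda_is_bar;
```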