Technology for NLP


NPFL092

Zdeněk Žabokrtský & Rudolf Rosa

{zabokrtsky,rosa}@ufal.mff.cuni.cz

Tuesday 9.00–11.20
SU2

Character Encoding

Basic Notions

Character
  • abstract (Platonic) entity
  • neither a numerical representation nor a graphical form
  • e.g. “capital A with grave accent”
Character repertoire
  • Set of characters
  • Question of identity: logically distinct characters can look identical (e.g. A in the Roman, Greek and Cyrillic alphabets)
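
The identity question becomes concrete when one looks at the octets behind three look-alike letters. The sketch below assumes UTF-8 as the encoding and writes the letters as octal escapes, so it does not depend on how the script file itself is saved. The helper name hex_octets is made up for this illustration.

```shell
# Print the octets read from standard input as one hexadecimal string.
hex_octets() {
  od -An -tx1 | tr -d ' \n'
}

# Three letters that usually render identically, given as UTF-8 octets.
latin_a=$(printf '\101' | hex_octets)          # U+0041 LATIN CAPITAL LETTER A
greek_alpha=$(printf '\316\221' | hex_octets)  # U+0391 GREEK CAPITAL LETTER ALPHA
cyrillic_a=$(printf '\320\220' | hex_octets)   # U+0410 CYRILLIC CAPITAL LETTER A

echo "Latin A:     $latin_a"
echo "Greek Alpha: $greek_alpha"
echo "Cyrillic A:  $cyrillic_a"
```

The three results differ, so a search for the Latin letter will not match the Greek or Cyrillic one even though the glyphs may be indistinguishable on screen.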

Basic Notions (2)

Encoding
  • algorithm to convert a sequence of characters to a sequence of octets
Code position
  • numerical representation of a character
Glyph
  • visual representation of a character
Font
  • set of glyphs for a set of characters
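
To see how a character, its code position and its encoding differ in practice, one character can be pushed through several encodings. This is only a sketch: it assumes that iconv and od are installed and that iconv recognises the encoding names used here. The helpers hex_octets and c_caron are invented for the example.

```shell
# Print the octets read from standard input as one hexadecimal string.
hex_octets() {
  od -An -tx1 | tr -d ' \n'
}

# The character č has code position U+010D; in UTF-8 it is the octets C4 8D.
c_caron() {
  printf '\304\215'
}

# Same character, same code position, four different octet sequences.
utf8=$(c_caron | hex_octets)
latin2=$(c_caron | iconv -f UTF-8 -t ISO-8859-2 | hex_octets)
utf16=$(c_caron | iconv -f UTF-8 -t UTF-16BE | hex_octets)
utf32=$(c_caron | iconv -f UTF-8 -t UTF-32BE | hex_octets)

echo "UTF-8:      $utf8"
echo "ISO-8859-2: $latin2"
echo "UTF-16BE:   $utf16"
echo "UTF-32BE:   $utf32"
```

In UTF-16BE and UTF-32BE the octets are simply the code position 010D padded to 2 or 4 octets, while UTF-8 and ISO-8859-2 each map it differently.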

ASCII

There is no 8-bit ASCII!

ASCII (2)

8-bit Encodings

Unicode

Unicode Consortium (1991)

Unicode (2)

Common encodings:

UTF-32
  • 4 octets per character
UTF-16
  • each character in Basic Multilingual Plane: 2 octets
  • other characters: 4 octets (a surrogate pair)
UTF-8
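  • 1–4 octets per character (the original design allowed up to 6; RFC 3629 limits UTF-8 to 4)
  • in a multi-octet sequence, the number of leading 1's in the first octet gives the number of octets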
  • “continuation octets” have the form 10xxxxxx
  • advantage: ASCII is a subset of UTF-8
  • for Czech, 1 or 2 octets is enough for all the characters
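
The UTF-8 rules above can be sketched as a small shell function that turns a code position into its octets. This is an illustration only, not a validating encoder: the name utf8_octets is hypothetical, the function covers code positions up to U+10FFFF (which need at most 4 octets), and it does not reject surrogates or out-of-range values.

```shell
# Print the UTF-8 octets (in hex) for one code position, e.g. utf8_octets 0x10D
utf8_octets() {
  cp=$(( $1 ))
  if [ "$cp" -lt 128 ]; then
    # 0xxxxxxx: plain ASCII, one octet
    printf '%02x\n' "$cp"
  elif [ "$cp" -lt 2048 ]; then
    # 110xxxxx 10xxxxxx
    printf '%02x %02x\n' \
      $(( 0xC0 | (cp >> 6) )) \
      $(( 0x80 | (cp & 0x3F) ))
  elif [ "$cp" -lt 65536 ]; then
    # 1110xxxx 10xxxxxx 10xxxxxx
    printf '%02x %02x %02x\n' \
      $(( 0xE0 | (cp >> 12) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  else
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    printf '%02x %02x %02x %02x\n' \
      $(( 0xF0 | (cp >> 18) )) \
      $(( 0x80 | ((cp >> 12) & 0x3F) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  fi
}

utf8_octets 0x41     # A
utf8_octets 0x10D    # č
utf8_octets 0x20AC   # euro sign
utf8_octets 0x1F600  # an emoji outside the Basic Multilingual Plane
```

Every octet after the first is built as 0x80 plus six payload bits, which is the 10xxxxxx continuation form; the first octet carries as many leading 1's as there are octets in the sequence.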

Unicode (3)

Problems

Other Solutions

Conversion Tools

Locale

Comparing Different Settings

# C locale: sort compares raw octet values
export LC_ALL=C
cat << EOF | sort
ďa
čá
ča
ca
dá
da
EOF
# Czech locale: sort follows Czech collation rules (the cs_CZ.UTF-8 locale must be installed)
export LC_ALL=cs_CZ.UTF-8
cat << EOF | sort
ďa
čá
ča
ca
dá
da
EOF
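
A variant of the comparison above that sets the locale for a single command only and first checks whether the Czech locale is available. It is a sketch under two assumptions: the script file is saved as UTF-8, and the locale command lists installed locales in the usual cs_CZ.utf8 or cs_CZ.UTF-8 spelling. The variable names are illustrative.

```shell
# Word list from the comparison above, one word per line.
words='ďa
čá
ča
ca
dá
da'

# Setting LC_ALL for one command leaves the rest of the shell session untouched.
c_sorted=$(printf '%s\n' "$words" | LC_ALL=C sort)
echo "C locale:"
echo "$c_sorted"

# Only try Czech collation if the locale is actually installed.
if locale -a 2>/dev/null | grep -qiE '^cs_CZ\.utf-?8$'; then
  echo "cs_CZ.UTF-8 locale:"
  printf '%s\n' "$words" | LC_ALL=cs_CZ.UTF-8 sort
else
  echo "cs_CZ.UTF-8 is not installed; sort would typically fall back to C ordering"
fi
```

In the C locale the accented letters sort after all ASCII letters, because their UTF-8 octets have values of 0x80 and above, so the order is ca, da, dá, ča, čá, ďa.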

Mastering Your Text Editor

A modern source code editor should provide:

+ fallback mode for working in a text console (no GUI)

Text Editors in Linux

2 major editors:

Poor man's choice: joe, mcedit, nano