Technology for NLP

NPFL092

Zdeněk Žabokrtský & Rudolf Rosa

{zabokrtsky,rosa}@ufal.mff.cuni.cz

Tuesday 9.00–11.20
SU2

Character Encoding

Today's computers use binary digits
No natural relation between numbers and characters of an alphabet ⇒ convention needed
No convention ⇒ chaos

Basic Notions

Character

abstract (Platonic) entity
no numerical representation nor graphical form
e.g. “capital A with grave accent”

Character repertoire

Set of characters
Question of identity: logically distinct characters can look identical (e.g. A in the Roman, Greek and Cyrillic alphabets)
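
The identity question can be made concrete with a minimal Python sketch. It uses only the standard `unicodedata` module; the variable names are illustrative. The three letters below normally render the same, yet each has its own code position and name.

```python
import unicodedata

# Three letters that usually look identical but are distinct characters.
lookalikes = ["\u0041", "\u0391", "\u0410"]  # Latin A, Greek Alpha, Cyrillic A

for ch in lookalikes:
    # Print the character, its code position and its official Unicode name.
    print(ch, f"U+{ord(ch):04X}", unicodedata.name(ch))

# Comparing the strings compares code positions, not appearance.
all_distinct = len(set(lookalikes)) == 3
```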

Basic Notions (2)

Encoding: Algorithm to convert a sequence of characters to a sequence of octets
Code position: Numerical representation of a character
Glyph: Visual representation of a character
Font: Set of glyphs for a set of characters
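
As a rough illustration of how these notions relate, the Python sketch below takes one character and shows its name, its code position, and the octets produced by two different encodings. Glyphs and fonts are a rendering matter and are not visible at this level. The variable names are illustrative only.

```python
import unicodedata

ch = "\u00c0"  # the character "capital A with grave accent"

name = unicodedata.name(ch)              # name of the abstract character
code_position = ord(ch)                  # its code position (an integer)
octets_utf8 = ch.encode("utf-8")         # octets under one encoding
octets_latin1 = ch.encode("iso-8859-1")  # octets under another encoding

print(name, code_position, octets_utf8.hex(), octets_latin1.hex())
```

The same code position yields two octets in UTF-8 and one octet in ISO 8859-1, which is why the encoding must always be known when reading text.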

ASCII

American Standard Code for Information Interchange (1960's)
7 bits (0–127)
0–31, 127: Control characters (e.g. Escape, Line Feed)

32–126: Space, punctuation, numerals, upper and lower case letters

33: ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @
65: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` 
97: a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~

There is no 8-bit ASCII!

ASCII (2)

Advantages:
- Very simple: one character—one code position
- Minimal volume: 1 character—1 octet
Main drawback:
- No way to represent national alphabets
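
Both the advantage and the drawback can be observed directly. The following Python sketch (example strings chosen only for illustration) encodes an English word into ASCII and then shows that a Czech name cannot be encoded at all.

```python
text_en = "Hello"
text_cs = "Žabokrtský"

ascii_octets = text_en.encode("ascii")   # one octet per character
print(list(ascii_octets))                # code positions 72, 101, 108, 108, 111

try:
    text_cs.encode("ascii")
    encodable = True
except UnicodeEncodeError as err:
    # err.object[err.start:err.end] is the part that has no ASCII code position.
    encodable = False
    print("cannot encode:", err.object[err.start:err.end])
```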

8-bit Encodings

Supersets of ASCII, using octets 128–255 (still keeping the 1 character—1 octet relation)
International Organization for Standardization: ISO 8859 (1980's)

West European: ISO 8859-1 (ISO Latin 1)

161: ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ - ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
192: À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
224: à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

For Czech and other Central/East European languages: anarchy
- ISO 8859-2 (ISO Latin 2)
- Windows 1250
- KOI-8
- Brothers Kamenický
- “standards” of IBM, Apple etc.
⋮
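
A small Python sketch of what this anarchy means in practice: the same three Czech letters get different octets in ISO 8859-2 and Windows 1250, and decoding with the wrong table produces different characters without raising an error. The codec names are Python's; the sample string is only an illustration.

```python
word = "\u0161\u0165\u017e"   # "šťž", three Czech letters outside ASCII

latin2 = word.encode("iso-8859-2")
cp1250 = word.encode("cp1250")
print(latin2.hex(" "), "|", cp1250.hex(" "))

# Reading ISO 8859-2 octets as if they were Windows 1250: no exception,
# just wrong characters.
misread = latin2.decode("cp1250")
print(repr(word), "->", repr(misread))
```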

Unicode

Unicode Consortium (1991)

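Unicode (ISO/IEC 10646)
- early versions: about 30 scripts used in hundreds of languages (approx. 40,000 characters); current versions: well over 100,000 characters in more than 150 scripts
- Arabic, Devanagari (Sanskrit), Chinese, Japanese, Korean…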
- ambition: 250 alphabets for hundreds of languages
- e.g. “LATIN CAPITAL LETTER A WITH ACUTE”

Unicode (2)

Common encodings:

UTF-32

4 octets per character

UTF-16

each character in the Basic Multilingual Plane: 2 octets
other characters: 4 octets (a surrogate pair)

UTF-8

1–4 octets per character (the original design allowed up to 6)
in a multi-octet sequence, the number of leading 1 bits in the first octet gives the number of octets
“continuation octets” have the form 10xxxxxx
advantage: ASCII is a subset of UTF-8
for Czech, 1 or 2 octets are enough for all the characters
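
The sizes and bit patterns above can be checked with a short Python sketch. The helper `utf8_length_from_lead` is a hypothetical name introduced here; it covers only the 1 to 4 octet sequences of current UTF-8 and does not validate the rest of the sequence. The `-le` codec variants are used so that no byte order mark is counted.

```python
def utf8_length_from_lead(octet: int) -> int:
    """Sequence length implied by a UTF-8 lead octet (1 to 4 octets only)."""
    if octet < 0b10000000:      # 0xxxxxxx: ASCII, single octet
        return 1
    if octet >> 5 == 0b110:     # 110xxxxx: 2-octet sequence
        return 2
    if octet >> 4 == 0b1110:    # 1110xxxx: 3-octet sequence
        return 3
    if octet >> 3 == 0b11110:   # 11110xxx: 4-octet sequence
        return 4
    raise ValueError("continuation octet or invalid lead octet")

samples = ["a", "\u010d", "\u20ac", "\U0001F600"]   # a, č, €, an emoji
for ch in samples:
    u8 = ch.encode("utf-8")
    # Octet counts in UTF-8, UTF-16 and UTF-32.
    sizes = (len(u8), len(ch.encode("utf-16-le")), len(ch.encode("utf-32-le")))
    bits = " ".join(f"{b:08b}" for b in u8)
    print(f"U+{ord(ch):04X}", sizes, bits)
```

In the printed bit patterns, every octet after the first starts with `10`, and the first octet alone tells a decoder how many octets to read.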

Unicode (3)

Problems

equivalence of visually identical characters of different alphabets
several ways to encode the same character (á: C3 A1 or 61 CC 81) ⇒ normalization
sorting
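
The normalization problem can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch using the á example from above; the variable names are illustrative.

```python
import unicodedata

composed = "\u00e1"      # á as a single code position
decomposed = "a\u0301"   # a followed by COMBINING ACUTE ACCENT

print(composed.encode("utf-8").hex(" "), "|", decomposed.encode("utf-8").hex(" "))
print(composed == decomposed)   # the raw strings are not equal

# After normalizing both sides to the same form, the comparison succeeds.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed
```

Normalizing all input to one form (commonly NFC) before comparing, searching, or counting avoids treating the two spellings as different text.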

Conversion Tools

Linux

iconv -f WINDOWS-1250 -t UTF-8 text-win > text-utf8

MS Windows:
1. Open the input file in MS Word as “Encoded Text”
2. Save it as “Encoded Text” in a different encoding
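
The same conversion can also be scripted. The Python sketch below is one possible equivalent of the `iconv` call, not a replacement for it; the function name `convert` is hypothetical, and the sample file is created in a temporary directory only so that the example is self-contained.

```python
import tempfile
from pathlib import Path

def convert(src: Path, dst: Path, src_enc: str = "cp1250", dst_enc: str = "utf-8") -> None:
    """Re-encode a text file; raises if the input is not valid in src_enc."""
    text = src.read_text(encoding=src_enc)   # octets -> characters
    dst.write_text(text, encoding=dst_enc)   # characters -> octets

with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "text-win", Path(tmp) / "text-utf8"
    src.write_bytes("Žabokrtský\n".encode("cp1250"))   # sample input file
    convert(src, dst)
    result = dst.read_bytes()

print(result)
```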

Locale

Comparing Different Settings

export LC_ALL=C
cat << EOF | sort
ďa
čá
ča
ca
dá
da
EOF

export LC_ALL=cs_CZ.UTF-8
cat << EOF | sort
ďa
čá
ča
ca
dá
da
EOF
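
The same comparison can be sketched in Python. Plain `sorted` compares code positions, which for UTF-8 input gives the same order as byte-wise sorting under `LC_ALL=C`. Locale-aware sorting goes through `locale.strxfrm`. Whether the `cs_CZ.UTF-8` locale is installed is an assumption about the machine, so the sketch falls back gracefully.

```python
import locale

words = ["ďa", "čá", "ča", "ca", "dá", "da"]

# Default string comparison is by code position.
by_code_position = sorted(words)
print(by_code_position)

try:
    # Assumption: the cs_CZ.UTF-8 locale is available on this system.
    locale.setlocale(locale.LC_COLLATE, "cs_CZ.UTF-8")
    by_czech_rules = sorted(words, key=locale.strxfrm)
    print(by_czech_rules)
except locale.Error:
    by_czech_rules = None
    print("cs_CZ.UTF-8 locale not available")
```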

Mastering Your Text Editor

A modern source code editor should provide:

modes (programming languages, XML, HTML, ...)
syntax highlighting
completion
indentation
templates
support for encodings (UTF-8)
undo
searching
integration with shell, compiler and/or debugger

+ fallback mode for working in a text console (no GUI)

Text Editors in Linux

2 major editors:

Poor man's choice: joe, mcedit, nano