HW2 - Edit distance

Note: There are no homework assignments this year (2017/2018); instead, invest some extra energy into the project. You were, however, supposed to send me pseudocode of an inflectional morphological analyzer during the semester.

Write a program that will create a very rough Russian-Czech dictionary based on the similarity of words. What you should do:

Use edit distance with the 4 basic operations (M, S, I, D: match, substitute, insert, delete) but with variable costs for substitution.
You do not need to tune the substitution costs by iterations (unless you really want to); just set them to some reasonable values based on the facts listed below.
The dictionary can be 1:n (but keep the n small; it can be zero if there is no reliable translation).
Use the Czech corpus you used in the first homework; the Russian corpus can be downloaded here (10M tokens from Russian Wikipedia; you do not need to clean it of things like #REDIRECT).
If you are unable to process all the words in the corpora, use fewer words.
Any parameters can be set directly in the code (in one place); you don't need to use configuration files.
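
A minimal sketch of edit distance with variable substitution costs, with all parameters kept in one place, might look as follows. The constant names and the cost values are placeholders chosen for illustration, not recommended settings, and the two cheap letter pairs are only examples.

```python
# All tunable parameters in one place. The values are assumed placeholders.
INSERT_COST = 1.0
DELETE_COST = 1.0
DEFAULT_SUBSTITUTION_COST = 1.0
# Cheaper substitutions for letter pairs that tend to correspond (examples).
SUBSTITUTION_COSTS = {("g", "h"): 0.2, ("i", "y"): 0.3}


def substitution_cost(a, b):
    """Cost of replacing character a with b; a match costs nothing."""
    if a == b:
        return 0.0
    cost = SUBSTITUTION_COSTS.get((a, b))
    if cost is None:
        # Treat the table as symmetric, then fall back to the default.
        cost = SUBSTITUTION_COSTS.get((b, a), DEFAULT_SUBSTITUTION_COST)
    return cost


def edit_distance(source, target):
    """Weighted edit distance, computed one row of the table at a time."""
    previous = [j * INSERT_COST for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [i * DELETE_COST]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + DELETE_COST,                            # delete
                current[j - 1] + INSERT_COST,                         # insert
                previous[j - 1] + substitution_cost(s_char, t_char),  # match/substitute
            ))
        previous = current
    return previous[-1]
```

Only two rows of the table are kept in memory, which is enough when you need the distance but not the alignment itself.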

Hints:

Do not work with tokens; get types (distinct word forms) first.
You might want to use some modified spelling for each language so that they look more similar (Latin alphabet for Russian, dropping vowel length for Czech, etc.). Translate the words to these modified spellings before running edit distance, but remember the dictionary must use standard spelling conventions.
When transcribing Russian to the Latin alphabet, use the so-called Scientific transliteration:
- see https://en.wikipedia.org/wiki/Scientific_transliteration_of_Cyrillic
- you can import the table into a Google Sheets spreadsheet with
  =IMPORTHTML("https://en.wikipedia.org/wiki/Scientific_transliteration_of_Cyrillic", "table",1)
- then copy and paste as values (Ctrl+C, Ctrl+Shift+V) and select the columns/rows you need
Be smart - use some simple and quick filtering before calculating ED. If you already have a candidate pair "abc" and "abcc", you should see without using ED that the pair "abc" - "trytobesmart" cannot be better, and "abc" - "ccc" is probably not better either. (Remember that the substitution costs are not always 1.)
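
The first and last hints can be sketched as below. This is one possible filter, not the only one: it relies on the fact that the edit distance can never be smaller than the cost of the insertions or deletions needed to equalize the two lengths. The names `MIN_INDEL_COST`, `count_types` and `closest` are made up for this sketch, lowercasing the tokens is an assumption, and `unit_distance` is a plain unit-cost stand-in for your weighted edit distance.

```python
from collections import Counter

MIN_INDEL_COST = 1.0  # assumed: the cheaper of the insert and delete costs


def count_types(tokens):
    """Collapse a token stream into types with their frequencies."""
    return Counter(token.lower() for token in tokens)


def length_lower_bound(a, b):
    """Cheapest possible distance given only the difference in length."""
    return abs(len(a) - len(b)) * MIN_INDEL_COST


def unit_distance(a, b):
    """Plain Levenshtein distance, standing in for the weighted version."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (x != y)))
        previous = current
    return previous[-1]


def closest(word, candidates, distance):
    """Return (best_candidate, best_distance), skipping hopeless candidates."""
    best, best_distance = None, float("inf")
    # Trying candidates of similar length first tightens the bound early.
    for candidate in sorted(candidates, key=lambda c: abs(len(c) - len(word))):
        if length_lower_bound(word, candidate) >= best_distance:
            break  # every remaining candidate differs in length even more
        d = distance(word, candidate)
        if d < best_distance:
            best, best_distance = candidate, d
    return best, best_distance
```

With the candidates "trytobesmart", "ccc" and "abcc" for the word "abc", the distance to "trytobesmart" is never computed: once "abcc" is found at distance 1, a length difference of 9 already rules it out. The length bound alone does not exclude "ccc"; a further heuristic (for example, comparing the sets of characters) would be needed for that.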

Facts about languages to consider when setting edit distance costs and designing pre-processing code:

Old Slavic g changed to h in Czech but not in Russian. Czech imported g later.
Czech pronounces y and i the same way, thus their spelling is more likely to change.
Czech marks vowel length; Russian does not.
Czech ch corresponds to Russian х (x).
Czech marks palatalization with a háček; Russian uses the soft sign following the letter. Some Russian palatalizations do not have equivalents in Czech (l'), some do (n' - ň). See this page for all the Czech letters.
The Czech alphabet uses
There are other things you could consider, but you are allowed to ignore them.
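
Some of these facts can be handled in pre-processing rather than in the cost table. The sketch below is one way to do it, under a few assumptions: words are lowercased, the soft and hard signs are written as an ASCII apostrophe and double quote, and Czech ch is rewritten as x to match the transliterated Russian х. The function and table names are invented for this example. Correspondences such as g/h and i/y are left to the substitution costs.

```python
# Scientific transliteration of Russian Cyrillic (lowercase letters only).
# Writing the soft sign as ' and the hard sign as " is a simplification.
RUSSIAN_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "ë",
    "ж": "ž", "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "x", "ц": "c", "ч": "č", "ш": "š", "щ": "šč", "ъ": '"',
    "ы": "y", "ь": "'", "э": "è", "ю": "ju", "я": "ja",
}

# Czech long vowels mapped to their short counterparts.
CZECH_VOWEL_LENGTH = str.maketrans("áéíóúůý", "aeiouuy")


def normalize_russian(word):
    """Transliterate a Russian word; unknown characters pass through."""
    return "".join(RUSSIAN_TO_LATIN.get(ch, ch) for ch in word.lower())


def normalize_czech(word):
    """Drop vowel length and write ch as x."""
    return word.lower().translate(CZECH_VOWEL_LENGTH).replace("ch", "x")


def index_by_normalized(words, normalize):
    """Map each modified spelling back to the standard spellings behind it."""
    index = {}
    for word in words:
        index.setdefault(normalize(word), []).append(word)
    return index
```

Keeping an index from the modified spelling back to the original words lets you run edit distance on the modified forms and still print standard spelling in the dictionary.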

What you should send me:

A Russian-Czech "dictionary" covering the 5000 most frequent Russian words. Use tab as a separator. The Czech translation part can be empty.
Reasonably clean code with a reasonable amount of comments.
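
For illustration, the output file could be produced along these lines. The layout for entries with several translations is an assumption of this sketch (every field, including each additional Czech translation, is separated by a tab), and the names `TOP_N`, `format_entry` and `write_dictionary` are hypothetical.

```python
import io
from collections import Counter

TOP_N = 5000  # number of most frequent Russian types to cover


def format_entry(russian, czech_translations):
    """One line of the dictionary; the Czech part may be empty."""
    return russian + "\t" + "\t".join(czech_translations)


def write_dictionary(out, russian_counts, translations, top_n=TOP_N):
    """Write the top_n most frequent Russian types, one entry per line."""
    for word, _count in russian_counts.most_common(top_n):
        out.write(format_entry(word, translations.get(word, [])) + "\n")


if __name__ == "__main__":
    # Tiny demonstration with made-up counts and one made-up translation.
    buffer = io.StringIO()
    write_dictionary(buffer, Counter({"и": 3, "гора": 2}), {"гора": ["hora"]})
    assert buffer.getvalue() == "и\t\nгора\thora\n"
```

When writing to a real file, open it with an explicit encoding, for example `open(path, "w", encoding="utf-8")`, so that Cyrillic and Czech letters are stored consistently.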

Institute of Formal and Applied Linguistics

Charles University, Czech Republic
Faculty of Mathematics and Physics
