Lab 11 - Word Alignment with IBM Model 1

The goal of the lab is to implement IBM Model 1 for word alignment. Model 1 considers only the lexical identity of words (the words as they are written) and ignores their positions in the sentence and any other features.

Implement IBM Model 1 as shown in the pseudocode in the slides from MT Marathon 2010 (Patrick Lambert, slides originally by Philipp Koehn).
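As a starting point, here is a minimal sketch of EM training for IBM Model 1 in Python. It follows the standard formulation (uniform initialization, expected counts, renormalization) and is not guaranteed to match the slide pseudocode line by line. The function name train_ibm1, the fixed number of iterations and the omission of the NULL word are assumptions made for brevity.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate t(e|f) with EM.

    pairs: list of (e_tokens, f_tokens) sentence pairs.
    Returns a dict-like object keyed by (e_word, f_word).
    Assumption: no NULL word is added to the f side.
    """
    e_vocab = {w for e_sent, _ in pairs for w in e_sent}
    uniform = 1.0 / len(e_vocab)
    # Uniform initialization: every pair starts with the same probability.
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (e, f) links
        total = defaultdict(float)  # expected count of f being linked at all

        # E-step: distribute each e word over the f words of its sentence.
        for e_sent, f_sent in pairs:
            for e in e_sent:
                s_total = sum(t[(e, f)] for f in f_sent)
                for f in f_sent:
                    c = t[(e, f)] / s_total
                    count[(e, f)] += c
                    total[f] += c

        # M-step: renormalize the expected counts per f word.
        t = defaultdict(
            float, {(e, f): c / total[f] for (e, f), c in count.items()}
        )
    return t
```

On a toy corpus such as ("the house", "das Haus"), ("the book", "das Buch"), ("a book", "ein Buch"), the probability t(the|das) grows with every iteration. For the real data you will probably want to stop based on convergence rather than a fixed iteration count.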
Download manual word alignments: czenali.gz (2501 lines)
- The data originally come from: Czech-English manual word alignments.
- I concatenated all files *.wa from merged_data/.
- I stripped SGML and converted to four tab-delimited columns: English, Czech, sure alignments, possible alignments.
- IMPORTANT: The alignments are provided for reference only; your script must not look at them. They serve as the gold standard against which you evaluate your output.
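One way to make sure the aligner never sees the reference alignments is to drop the last two columns as soon as the file is read. The sketch below assumes the file is UTF-8 encoded and that both text columns are already tokenized by whitespace; the function name read_sentence_pairs is hypothetical.

```python
import gzip


def read_sentence_pairs(path):
    """Yield (english_tokens, czech_tokens) from the gzipped data file.

    Only the first two tab-delimited columns are used. The sure and
    possible alignment columns are deliberately ignored here.
    Assumptions: UTF-8 encoding, whitespace-separated tokens.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            cols = line.rstrip("\n").split("\t")
            english, czech = cols[0], cols[1]
            yield english.split(), czech.split()
```

The evaluation script can then read the same file separately and be the only piece of code that touches columns three and four.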
Evaluate and report alignment error rate, precision and recall of your IBM1 alignments against the manual alignments.
- The alignment error rate is defined in Section 6 of Och and Ney (2000); a Perl implementation is available as alignment-error-rate.pl (sample input).
- Evaluate at various thresholds of the conditional probability, i.e. keep a link between two words only if its probability reaches the threshold.
- To debug, you can print the manual alignments or your own in plain text using this script:
  ./alitextview.pl --indexed-from-one < alitextview.sample-input.txt | less
- Try improving the alignment by various token-level changes (lowercasing, stemming, lemmatization).
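The evaluation steps above can be sketched in Python as follows. With A the predicted links, S the sure links and P the possible links (S is treated as a subset of P), the measures of Och and Ney (2000) are precision = |A∩P| / |A|, recall = |A∩S| / |S| and AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|). The function names, the 1-based (i, j) link representation and the dictionary t keyed by word pairs are assumptions for illustration; for the reported numbers, alignment-error-rate.pl remains the reference.

```python
def links_above_threshold(e_sent, f_sent, t, threshold):
    """Return the set of 1-based (i, j) links whose probability in t
    is at least `threshold`. t maps (e_word, f_word) to a probability."""
    return {
        (i, j)
        for i, e in enumerate(e_sent, start=1)
        for j, f in enumerate(f_sent, start=1)
        if t.get((e, f), 0.0) >= threshold
    }


def alignment_scores(predicted, sure, possible):
    """Return (precision, recall, aer) for sets of (i, j) links.

    Sure links are added to the possible links, so S is a subset of P.
    Empty denominators yield 0.0 instead of raising an error.
    """
    possible = possible | sure
    a_s = len(predicted & sure)
    a_p = len(predicted & possible)
    precision = a_p / len(predicted) if predicted else 0.0
    recall = a_s / len(sure) if sure else 0.0
    denom = len(predicted) + len(sure)
    aer = 1.0 - (a_s + a_p) / denom if denom else 0.0
    return precision, recall, aer
```

To score the whole corpus rather than one sentence, one option is to prefix every link with its sentence number, e.g. (n, i, j), and pass the pooled sets to alignment_scores. Sweeping the threshold over a small grid such as 0.1, 0.2, ..., 0.9 shows the trade-off between precision and recall.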
Your solutions are one part of HW04. From this lab, I will need:
- The implementation.
- The aligned corpus as 2501 lines with three tab-delimited columns:
  1. Original source text.
  2. Original target text.
  3. Your best alignment.
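A possible way to produce one output line is sketched below. The serialization of the alignment as space-separated "i-j" pairs is an assumption, not something this assignment specifies; check it against the sample input of alignment-error-rate.pl before submitting. The function name format_alignment_line is hypothetical.

```python
def format_alignment_line(source, target, links):
    """Build one tab-delimited output line: source, target, alignment.

    links is a set of (i, j) index pairs. Writing them as sorted,
    space-separated "i-j" strings is an assumed format.
    """
    pairs = " ".join(f"{i}-{j}" for i, j in sorted(links))
    return "\t".join([source, target, pairs])
```

Note that the first two columns should contain the original text, even if the alignment itself was computed on lowercased or lemmatized tokens.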