Lab 11 - Word Alignment with IBM Model 1
The goal of the lab is to implement IBM Model 1 for word alignment (a model
that considers only the lexical identity of words, i.e. the words as they are
written, and ignores their positions in the sentence).
- Implement IBM Model 1 as shown in the pseudocode in the slides from MT Marathon 2010 (Patrick Lambert, slides originally by Philipp Koehn).
- Download manual word alignments: czenali.gz (2501 lines)
- The data originally come from: Czech-English manual word alignments.
- I concatenated all *.wa files from merged_data/.
- I stripped SGML and converted to four tab-delimited columns: English, Czech, sure alignments, possible alignments.
- IMPORTANT: The alignments provided are only for reference; your script must not look at them. They serve as the gold standard against which you evaluate your output.
- Evaluate and report alignment error rate, precision and recall of your IBM1 alignments against the manual alignments.
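As a starting point for the implementation step, here is a minimal sketch of EM training for IBM Model 1 followed by a most-probable-link alignment. It is not the pseudocode from the slides verbatim: the function names `train_ibm1` and `align`, the `(source_index-target_index)` link orientation, and the default iteration count are assumptions made for illustration, and the NULL word is left out for brevity.

```python
def train_ibm1(bitext, iterations=10):
    """Estimate t[(f, e)] ~ P(f | e) with EM.

    bitext: list of (source_tokens, target_tokens) pairs.
    This sketch omits the NULL source word (a simplification).
    """
    # Uniform initialisation: any constant works, because the E-step
    # normalises over the source words of each sentence.
    t = {}
    for src, tgt in bitext:
        for f in tgt:
            for e in src:
                t[(f, e)] = 1.0

    for _ in range(iterations):
        count = {}   # expected count of f being generated by e
        total = {}   # expected count of e generating anything
        for src, tgt in bitext:
            for f in tgt:
                # E-step: distribute one unit of mass for f over all e.
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] = count.get((f, e), 0.0) + c
                    total[e] = total.get(e, 0.0) + c
        # M-step: renormalise the expected counts per source word.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t


def align(src, tgt, t):
    """Link every target token to its most probable source token.

    Returns a list of (source_index, target_index) pairs, 0-based.
    Assumes src is non-empty; ties go to the leftmost source token.
    """
    links = []
    for j, f in enumerate(tgt):
        i = max(range(len(src)), key=lambda k: t.get((f, src[k]), 0.0))
        links.append((i, j))
    return links
```

On a toy corpus such as `[("das Haus".split(), "the house".split()), ("das Buch".split(), "the book".split()), ("ein Buch".split(), "a book".split())]`, the table sharpens over the iterations so that "house" prefers "Haus" over "das". For the real data, tokenising on whitespace is an assumption you should check against the file.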
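For the evaluation step, the following sketch computes corpus-level precision, recall, and alignment error rate using the usual definitions with sure links S and possible links P: precision = |A∩P| / |A|, recall = |A∩S| / |S|, AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|). Two things here are assumptions, not facts about czenali.gz: that links are written as space-separated `i-j` index pairs, and that the possible column may not repeat the sure links (so the code takes the union). The helper names `parse_links` and `evaluate` are hypothetical.

```python
def parse_links(field):
    """Parse 'i-j i-j ...' into a set of (i, j) integer pairs.

    The 'i-j' notation is an assumed format; adjust to the real file.
    """
    return {tuple(int(x) for x in pair.split("-")) for pair in field.split()}


def evaluate(hypotheses, sures, possibles):
    """Return (precision, recall, aer) accumulated over the whole corpus.

    Each argument is a list with one set of links per sentence.
    Assumes at least one hypothesis link and one sure link overall.
    """
    a_total = s_total = a_and_s = a_and_p = 0
    for a, s, p in zip(hypotheses, sures, possibles):
        p = p | s  # every sure link also counts as possible
        a_total += len(a)
        s_total += len(s)
        a_and_s += len(a & s)
        a_and_p += len(a & p)
    precision = a_and_p / a_total
    recall = a_and_s / s_total
    aer = 1.0 - (a_and_s + a_and_p) / (a_total + s_total)
    return precision, recall, aer
```

Only the evaluation code should ever read the sure and possible columns; the aligner itself must not see them.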
Your solutions are one part of HW04. From this lab, I will need:
- The implementation.
- The aligned corpus: the same 2501 lines, in three tab-delimited columns:
- Original source text.
- Original target text.
- Your best alignment.
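The three-column deliverable described above could be produced along these lines. This is only a sketch: `read_bitext` and `format_line` are hypothetical helper names, UTF-8 encoding is assumed, and writing links as space-separated `i-j` pairs is an assumption, since the required notation for the alignment column is not stated here.

```python
import gzip


def read_bitext(path):
    """Yield (english, czech) strings from the four-column gzip file.

    Only the first two columns are returned, so the reference
    alignments never reach the aligner.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            cols = line.rstrip("\n").split("\t")
            yield cols[0], cols[1]


def format_line(src_text, tgt_text, links):
    """Build one output line: source text, target text, alignment."""
    align_str = " ".join(f"{i}-{j}" for i, j in sorted(links))
    return f"{src_text}\t{tgt_text}\t{align_str}"
```

Keeping the original, untokenised text in the first two columns means the output lines can be matched one-to-one against the 2501 input lines.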