Homework 02

Download the data. The zip file contains a part of the Prague Dependency Treebank (PDT), namely the directory train-1 with m-files (morphological layer) and w-files (word layer).

The files are gzipped. You can

either gunzip them in advance and then process them (but note that if you gunzip the w-files, the links from the m-files to their respective w-files get broken),
or process them directly using libraries that read gzipped files in your favourite programming language (in Perl, e.g., IO::Uncompress::Gunzip, PerlIO::gzip).
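
For the second option, a minimal sketch in Python using the standard gzip module. The helper name open_text is hypothetical, and the UTF-8 encoding is an assumption about the data, not something stated above. The helper also accepts files that were gunzipped in advance:

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a file as text, decompressing on the fly if its name ends in .gz.

    Assumption: the files are UTF-8 encoded.
    """
    path = Path(path)
    if path.suffix == ".gz":
        # Mode "rt" makes gzip.open return decoded text instead of bytes.
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")
```

With this, the same reading code works for both cmpr9410_001.w.gz and an uncompressed cmpr9410_001.w.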

If you check the beginning of any m-file, you can see the head element, e.g.:

  <head>
    <schema href="mdata_30_schema.xml" />
    <references>
      <reffile id="w" name="wdata" href="cmpr9410_001.w.gz" />
    </references>
  </head>

The important line for today's task is:

      <reffile id="w" name="wdata" href="cmpr9410_001.w.gz" />

... which refers to the name of the respective w-file. (Notice the .gz suffix in the link: strictly speaking, it makes the link broken if you gunzip the w-files in advance.) The m-file and the w-file of a document share the same base name and differ only in the suffix, but do not rely on this. Instead, to find the w-file corresponding to an m-file, use the href attribute of the reffile element whose name attribute is set to wdata (in general, the header may contain several reffile elements with different values of the name attribute).
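
The lookup described above might be sketched as follows in Python, using only the standard xml.etree.ElementTree module. The function names local and wdata_href are hypothetical. Element names are compared without any XML namespace prefix, on the assumption that the real files may declare a default namespace that is not visible in the excerpt:

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Return the element name without a '{namespace}' prefix, if any."""
    return tag.rsplit("}", 1)[-1]


def wdata_href(root):
    """Return the href of the reffile element with name="wdata", or None.

    `root` is the parsed root element of an m-file.
    """
    for el in root.iter():
        if local(el.tag) == "reffile" and el.get("name") == "wdata":
            return el.get("href")
    return None
```

The returned value is a file name relative to the m-file, so it would normally be joined with the directory of the m-file before opening.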

In the homework, we focus on differences between the word layer and the morphological layer. We already know that the w-layer keeps the original text even with all typos and that these are corrected at the m-layer. There are four types of changes made at the m-layer in comparison with the original text, i.e. the w-layer, which are signalled by a value of the element form_change within the respective element m in the m-file. The four possible values for the element form_change are:

num_normalization (normalization of numbers)
insert (insertion of missing tokens in the original text)
spell (correction of typos, splitting of mistakenly joined words)
ctcd (splitting contracted tokens such as 'oč' -> 'o co' [for what])

Your task: Write a script (in Perl, Python, etc.) to generate a list of form changes in the given data, each change described on a single line. (Do not use btred or any Prague Markup Language (PML)-related libraries, even if you already know them.)

The script should read all m-files in the directory and produce output giving the id of the m-element where a change was made, the type of the change (form_change), the original token from the w-layer and the changed form from the m-layer. (Naturally, for cases with the form_change value insert, there will be no original token.)

For example, for the following input m-data:

  <m id="m-cmpr9410-011-p2s2w3">
    <src.rf>manual</src.rf>
    <w.rf>
      <LM>w#w-cmpr9410-011-p2s2w3</LM>
    </w.rf>
    <form_change>spell</form_change>
    <form>podmínkách</form>
    <lemma>podmínka</lemma>
    <tag>NNFP6-----A----</tag>
  </m>

and the corresponding w-data:

  <w id="w-cmpr9410-011-p2s2w3">
    <token>podmínkach</token>
  </w>

the output line should look like:

   m-cmpr9410-011-p2s2w3   spell   podmínkach   podmínkách
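
As an illustration only, the example above can be reproduced with a short Python sketch. All function names are hypothetical. The sketch makes three assumptions not stated in the assignment: the columns are separated by tabs, form_change holds a single plain-text value as in the example, and a w.rf element may hold its link either in LM children (as shown) or directly as text:

```python
import xml.etree.ElementTree as ET

M_XML = """<m id="m-cmpr9410-011-p2s2w3">
  <src.rf>manual</src.rf>
  <w.rf>
    <LM>w#w-cmpr9410-011-p2s2w3</LM>
  </w.rf>
  <form_change>spell</form_change>
  <form>podmínkách</form>
  <lemma>podmínka</lemma>
  <tag>NNFP6-----A----</tag>
</m>"""

W_XML = """<w id="w-cmpr9410-011-p2s2w3">
  <token>podmínkach</token>
</w>"""


def local(tag):
    """Element name without a '{namespace}' prefix, if any."""
    return tag.rsplit("}", 1)[-1]


def child(el, name):
    """First direct child of `el` with the given local name, or None."""
    return next((c for c in el if local(c.tag) == name), None)


def child_text(el, name):
    """Stripped text of the named child, or '' if it is missing."""
    c = child(el, name)
    return (c.text or "").strip() if c is not None else ""


def w_ids(m):
    """Ids of the w-elements that the m-element links to via w.rf."""
    wrf = child(m, "w.rf")
    if wrf is None:
        return []  # e.g. form_change "insert": no original token
    refs = [lm.text for lm in wrf if local(lm.tag) == "LM"] or [wrf.text]
    # "w#w-cmpr9410-011-p2s2w3" -> "w-cmpr9410-011-p2s2w3"
    return [r.strip().split("#", 1)[-1] for r in refs if r and r.strip()]


def change_line(m, tokens):
    """Build one output line; `tokens` maps w-element ids to tokens."""
    original = " ".join(tokens.get(i, "") for i in w_ids(m))
    return "\t".join(
        [m.get("id"), child_text(m, "form_change"), original, child_text(m, "form")]
    )


w = ET.fromstring(W_XML)
tokens = {w.get("id"): child_text(w, "token")}
print(change_line(ET.fromstring(M_XML), tokens))
```

The token is looked up through the link in w.rf, not by rewriting the id of the m-element.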

Notes:

Do not assume that the identifiers (attribute id) of corresponding m-elements and w-elements are identical except for the w- or m- prefix. They are not, especially in cases where one token is split into two m-forms. To identify the corresponding token from the w-layer, always use the link found in the element w.rf.
Similarly, as already mentioned, to find the corresponding w-file, use the element references in the header of the m-file.
Sometimes, two or more tokens from the w-layer may be joined into one token at the m-layer. This happens a few times in the whole PDT but not in the data for this homework.
If a superfluous, redundant token from the w-layer is simply deleted, there is no information about this change at the m-layer. (This is just for your information; do not try to find such cases in this homework.)
Run the script on all m-files from the data and produce a single output text file with the changes.
Submit the script and also the output of the script in a text file to the svn.
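
One possible overall shape of such a script, as a hedged sketch in Python and not a reference solution. All function names are hypothetical. It assumes that m-files are named *.m or *.m.gz, that the href in the header is relative to the directory of the m-file, that the output columns are tab-separated, and that a missing .gz target means the w-file was gunzipped in advance:

```python
import gzip
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def local(tag):
    """Element name without a '{namespace}' prefix, if any."""
    return tag.rsplit("}", 1)[-1]


def parse(path):
    """Parse a gzipped or plain XML file and return its root element."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as fh:
        return ET.parse(fh).getroot()


def child(el, name):
    return next((c for c in el if local(c.tag) == name), None)


def text(el, name):
    c = child(el, name)
    return (c.text or "").strip() if c is not None else ""


def find_w_file(m_path, m_root):
    """Resolve the w-file through the reffile element with name="wdata"."""
    for el in m_root.iter():
        if local(el.tag) == "reffile" and el.get("name") == "wdata":
            w_path = m_path.parent / el.get("href")
            if not w_path.exists() and w_path.suffix == ".gz":
                w_path = w_path.with_suffix("")  # gunzipped in advance
            return w_path
    raise ValueError(f"no wdata reference in {m_path}")


def changes(m_path):
    """Yield one output line per m-element that has a form_change."""
    m_root = parse(m_path)
    w_root = parse(find_w_file(m_path, m_root))
    tokens = {w.get("id"): text(w, "token")
              for w in w_root.iter() if local(w.tag) == "w"}
    for m in m_root.iter():
        if local(m.tag) != "m" or child(m, "form_change") is None:
            continue
        wrf = child(m, "w.rf")
        refs = []
        if wrf is not None:
            refs = [lm.text for lm in wrf if local(lm.tag) == "LM"] or [wrf.text]
        ids = [r.strip().split("#", 1)[-1] for r in refs if r and r.strip()]
        original = " ".join(tokens.get(i, "") for i in ids)
        yield "\t".join([m.get("id"), text(m, "form_change"), original, text(m, "form")])


def main(data_dir, out_path):
    """Process all m-files in data_dir and write a single output file."""
    m_files = sorted(p for p in Path(data_dir).iterdir()
                     if p.name.endswith((".m", ".m.gz")))
    with open(out_path, "w", encoding="utf-8") as out:
        for m_path in m_files:
            for line in changes(m_path):
                out.write(line + "\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

If the sketch were saved as changes.py (a hypothetical name), it could be run as: python changes.py train-1 changes.txt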