Homework 02

Download the data. The zip file contains a part of the Prague Dependency Treebank (PDT), namely the directory train-1 with m-files (morphological layer) and w-files (word layer).

The files are gzipped. You can

If you check the beginning of any m-file, you can see the head element, e.g.:

  <head>
    <schema href="mdata_30_schema.xml" />
    <references>
      <reffile id="w" name="wdata" href="cmpr9410_001.w.gz" />
    </references>
  </head>

The important line for today's task is:

      <reffile id="w" name="wdata" href="cmpr9410_001.w.gz" />
... which refers to the name of the corresponding w-file. (Notice the .gz suffix in the link: strictly speaking, the link breaks if you gunzip the w-files in advance.) The m-file and the w-file of a document share a base name and differ only in the suffix, but do not rely on this convention. Instead, to find the w-file that belongs to an m-file, use the href attribute of the reffile element whose name attribute is set to wdata. (In general, the header may contain several reffile elements with different values of the name attribute.)
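
The lookup described above can be sketched in Python as follows. This is only an illustration, not a reference solution: the function names are made up for this example, and the fallback between the .gz and the plain file name is an assumption about how you may have unpacked the data. Element names are compared without any XML namespace, because the real files may declare one.

```python
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Reduce a tag such as '{uri}reffile' to 'reffile'."""
    return tag.rsplit("}", 1)[-1]


def find_wfile_href(m_root):
    """Return the href of the reffile whose name is 'wdata', or None."""
    for elem in m_root.iter():
        if local_name(elem.tag) == "reffile" and elem.get("name") == "wdata":
            return elem.get("href")
    return None


def resolve_w_path(m_path, href):
    """Locate the w-file next to the m-file (hypothetical helper).

    Tries the name from href first, then the same name with the .gz
    suffix removed or added, in case the files were gunzipped.
    Returns None if neither variant exists.
    """
    base = os.path.join(os.path.dirname(m_path), href)
    if base.endswith(".gz"):
        candidates = [base, base[:-3]]
    else:
        candidates = [base, base + ".gz"]
    for candidate in candidates:
        if os.path.exists(candidate):
            return candidate
    return None
```

Selecting the reffile by its name attribute rather than by position keeps the lookup correct when the header lists more than one reference.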

In this homework, we focus on the differences between the word layer and the morphological layer. We already know that the w-layer keeps the original text, including all typos, and that these are corrected at the m-layer. Four types of changes are made at the m-layer in comparison with the original text, i.e. the w-layer; each is signalled by the value of the element form_change within the respective element m in the m-file. The four possible values for the element form_change are:

Your task: Write a script (in Perl, Python, etc.) to generate a list of form changes in the given data, each change described on a single line. (Do not use btred or any Prague Markup Language (PML)-related libraries, even if you already know them.)

The script should read all m-files in the directory and produce output that indicates the id of the m-element where a change was made, the type of the change (form_change), the original token from the w-layer, and the changed form from the m-layer. (Naturally, for cases with the form_change value insert, there will be no original token.)
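
One possible way to collect and parse the m-files with the standard library only is sketched below. The file name patterns *.m.gz and *.m are an assumption based on the w-file name shown in the header example, and both helper names are invented for this illustration.

```python
import glob
import gzip
import os
import xml.etree.ElementTree as ET


def list_m_files(directory):
    """Return the m-files in directory, gzipped or not, in sorted order."""
    paths = []
    for pattern in ("*.m.gz", "*.m"):  # assumed naming convention
        paths.extend(glob.glob(os.path.join(directory, pattern)))
    return sorted(paths)


def parse_xml(path):
    """Parse an XML file, reading through gzip when the name ends in .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        return ET.parse(handle).getroot()
```

Reading the files in binary mode leaves the character decoding to the XML parser, which takes the encoding from the XML declaration.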

For example, for the following input m-data:

  <m id="m-cmpr9410-011-p2s2w3">
    <src.rf>manual</src.rf>
    <w.rf>
      <LM>w#w-cmpr9410-011-p2s2w3</LM>
    </w.rf>
    <form_change>spell</form_change>
    <form>podmínkách</form>
    <lemma>podmínka</lemma>
    <tag>NNFP6-----A----</tag>
  </m>
and the corresponding w-data:
  <w id="w-cmpr9410-011-p2s2w3">
    <token>podmínkach</token>
  </w>
the output line should look like:
   m-cmpr9410-011-p2s2w3   spell   podmínkach   podmínkách
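
The mapping from the two inputs above to the output line can be sketched as follows. This is a minimal illustration under several assumptions: fields are separated by tabs, form_change holds a single value, an m element that refers to several w elements gets their tokens joined by a space, and the function names are hypothetical. XML namespaces are ignored when comparing element names.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Reduce a tag such as '{uri}form' to 'form'."""
    return tag.rsplit("}", 1)[-1]


def child_text(elem, name):
    """Return the stripped text of the first child called name, or None."""
    for child in elem:
        if local_name(child.tag) == name:
            return (child.text or "").strip()
    return None


def w_tokens(w_root):
    """Map the id of every w element to its token."""
    tokens = {}
    for w in w_root.iter():
        if local_name(w.tag) == "w" and w.get("id"):
            tokens[w.get("id")] = child_text(w, "token") or ""
    return tokens


def w_refs(m):
    """Collect the w ids referenced in w.rf, without the 'w#' prefix."""
    refs = []
    for child in m:
        if local_name(child.tag) == "w.rf":
            # The reference may sit directly in w.rf or in nested LM elements.
            for node in child.iter():
                text = (node.text or "").strip()
                if text:
                    refs.append(text.split("#", 1)[-1])
    return refs


def form_changes(m_root, tokens):
    """Yield one tab-separated line per m element that has a form_change."""
    for m in m_root.iter():
        if local_name(m.tag) != "m":
            continue
        change = child_text(m, "form_change")  # assumes a single value
        if not change:
            continue
        # For 'insert' there is no w.rf, so the original stays empty.
        original = " ".join(tokens.get(ref, "") for ref in w_refs(m))
        form = child_text(m, "form") or ""
        yield "\t".join([m.get("id", ""), change, original, form])
```

Building the id-to-token dictionary once per w-file avoids searching the w-file again for every changed m element.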

Notes: