Download the data. The zip file contains a part of the Prague Dependency Treebank (PDT), namely the directory train-1
with m-files (morphological layer) and w-files (word layer).
The files are gzipped. You can
If you check the beginning of any m-file, you can see the head
element, e.g.:
<head>
  <schema href="mdata_30_schema.xml" />
  <references>
    <reffile id="w" name="wdata" href="cmpr9410_001.w.gz" />
  </references>
</head>
The important line for today's task is:
<reffile id="w" name="wdata" href="cmpr9410_001.w.gz" />
It gives the name of the corresponding w-file. (Notice the .gz suffix in the link: strictly speaking, the link would be broken if you gunzipped the w-files in advance.)
The m-file and the w-file of a document share the same base file name and differ only in the suffix, but do not rely on this. Instead, to find the w-file corresponding to an m-file, use the href attribute of the reffile element whose name attribute is set to wdata (in general, the header may contain several reffile elements with different values of the name attribute).
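The lookup described above can be sketched in Python with the standard library only. This is a minimal sketch, not a reference solution: the helper names local_name and find_wdata_href are hypothetical, and so is the mdata root element wrapped around the sample header. Real m-files may declare an XML namespace, so the sketch compares local tag names only.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a namespace prefix, e.g. '{uri}reffile' -> 'reffile'."""
    return tag.rsplit("}", 1)[-1]


def find_wdata_href(root):
    """Return the href of the reffile whose name is 'wdata', or None."""
    for elem in root.iter():
        if local_name(elem.tag) == "reffile" and elem.get("name") == "wdata":
            return elem.get("href")
    return None


# The header from the text, wrapped in an assumed root element.
SAMPLE = """<mdata>
<head>
  <schema href="mdata_30_schema.xml" />
  <references>
    <reffile id="w" name="wdata" href="cmpr9410_001.w.gz" />
  </references>
</head>
</mdata>"""

print(find_wdata_href(ET.fromstring(SAMPLE)))  # prints cmpr9410_001.w.gz
```

Matching on the name attribute rather than on the position of the element keeps the lookup correct when the header lists several reffile elements.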
This homework focuses on the differences between the word layer and the morphological layer. We already know that the w-layer keeps the original text, including all typos, and that these are corrected at the m-layer. Four types of changes are made at the m-layer relative to the original text (the w-layer). Each is signalled by the value of the form_change element within the respective m element in the m-file. The four possible values of form_change are:
num_normalization (normalization of numbers)
insert (insertion of tokens missing in the original text)
spell (typos, splitting of mistakenly joined words)
ctcd (splitting of contracted tokens such as 'oč' -> 'o co' [for what])
Your task: Write a script (in Perl, Python, etc.) that generates a list of the form changes in the given data, one change per line. (Do not use btred or any Prague Markup Language (PML) libraries, even if you already know them.)
The script should read all m-files in the directory and, for each change, output the id of the m element where the change was made, the type of the change (form_change), the original token from the w-layer, and the changed form from the m-layer. (Naturally, for the form_change value insert there is no original token.)
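Reading every m-file together with the w-file it references might look like the sketch below. It is only one possible approach: the function names parse_xml, wdata_href and iter_layer_pairs are hypothetical, and the glob patterns *.m.gz and *.m are an assumption about how the m-files are named. If the referenced .gz file is missing, the sketch falls back to the same name without the .gz suffix, to cover w-files that were gunzipped in advance.

```python
import gzip
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_xml(path):
    """Parse an XML file, reading through gzip when the name ends in .gz."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as fh:
        return ET.parse(fh).getroot()


def wdata_href(m_root):
    """Return the href of the reffile whose name is 'wdata', or None."""
    for elem in m_root.iter():
        # Compare local names so that a default namespace does not matter.
        if elem.tag.rsplit("}", 1)[-1] == "reffile" and elem.get("name") == "wdata":
            return elem.get("href")
    return None


def iter_layer_pairs(directory):
    """Yield (m_path, m_root, w_root) for every m-file in the directory."""
    directory = Path(directory)
    # Assumed naming: gzipped m-files end in .m.gz, gunzipped ones in .m.
    m_paths = sorted(list(directory.glob("*.m.gz")) + list(directory.glob("*.m")))
    for m_path in m_paths:
        m_root = parse_xml(m_path)
        href = wdata_href(m_root)
        if href is None:
            continue  # no wdata reference in the header; skip this file
        w_path = m_path.parent / href
        if not w_path.exists() and w_path.suffix == ".gz":
            w_path = w_path.with_suffix("")  # w-file was gunzipped in advance
        yield m_path, m_root, parse_xml(w_path)
```

A caller could then loop with `for m_path, m_root, w_root in iter_layer_pairs("train-1"):` and extract the changes from each pair of trees.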
For example, for the following input m-data:
<m id="m-cmpr9410-011-p2s2w3">
  <src.rf>manual</src.rf>
  <w.rf>
    <LM>w#w-cmpr9410-011-p2s2w3</LM>
  </w.rf>
  <form_change>spell</form_change>
  <form>podmínkách</form>
  <lemma>podmínka</lemma>
  <tag>NNFP6-----A----</tag>
</m>
and the corresponding w-data:
<w id="w-cmpr9410-011-p2s2w3">
  <token>podmínkach</token>
</w>
the output line should look like:
m-cmpr9410-011-p2s2w3 spell podmínkach podmínkách
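As a rough illustration, the example above can be reproduced with the Python sketch below. All function names are hypothetical, and the mdata and wdata root elements around the samples are assumptions. The sketch also assumes that a w.rf element may hold either LM children or a bare reference, and that several referenced tokens are joined with a space; it is not a complete solution.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a namespace prefix, e.g. '{uri}form' -> 'form'."""
    return tag.rsplit("}", 1)[-1]


def child(elem, name):
    """Return the first direct child with the given local name, or None."""
    for c in elem:
        if local_name(c.tag) == name:
            return c
    return None


def index_tokens(w_root):
    """Map the id of each w element to its token text."""
    tokens = {}
    for w in w_root.iter():
        if local_name(w.tag) == "w":
            tok = child(w, "token")
            tokens[w.get("id")] = (tok.text or "") if tok is not None else ""
    return tokens


def referenced_w_ids(m):
    """Return the w ids linked from w.rf, without the 'w#' file prefix."""
    rf = child(m, "w.rf")
    if rf is None:
        return []  # e.g. form_change 'insert': nothing to link to
    refs = [t.strip() for t in rf.itertext() if t.strip()]
    return [r.split("#", 1)[-1] for r in refs]


def change_lines(m_root, tokens):
    """Yield one output line per m element that has a form_change."""
    for m in m_root.iter():
        if local_name(m.tag) != "m":
            continue
        change = child(m, "form_change")
        if change is None:
            continue
        kind = " ".join(t.strip() for t in change.itertext() if t.strip())
        original = " ".join(tokens.get(i, "") for i in referenced_w_ids(m))
        form = child(m, "form")
        new_form = (form.text or "") if form is not None else ""
        # Empty fields (no original token) are left out of the line.
        yield " ".join(f for f in (m.get("id"), kind, original, new_form) if f)


M_SAMPLE = """<mdata><m id="m-cmpr9410-011-p2s2w3">
  <src.rf>manual</src.rf>
  <w.rf> <LM>w#w-cmpr9410-011-p2s2w3</LM> </w.rf>
  <form_change>spell</form_change>
  <form>podmínkách</form>
  <lemma>podmínka</lemma>
  <tag>NNFP6-----A----</tag>
</m></mdata>"""

W_SAMPLE = """<wdata><w id="w-cmpr9410-011-p2s2w3">
  <token>podmínkach</token>
</w></wdata>"""

tokens = index_tokens(ET.fromstring(W_SAMPLE))
for line in change_lines(ET.fromstring(M_SAMPLE), tokens):
    print(line)  # prints m-cmpr9410-011-p2s2w3 spell podmínkach podmínkách
```

The token is looked up through the link in w.rf rather than derived from the id of the m element.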
Notes:
Do not expect the identifiers (id) of m-elements and w-elements to be almost identical (i.e., the same except for the w- or m- prefix). They are not, especially where one token is split into two m-forms. To identify the corresponding token from the w-layer, always use the link found in the w.rf element.
To locate the corresponding w-file, use the references element in the header of the m-file.