
MORPHOLOGICAL LAYER



Morphological analysis of an isolated word form (that is, regardless of context) produces a lemma and a combination of values of individual morphological categories. The combination of those values is called a morphological tag (MTag); in other words, the output of the morphological analysis of an input word form is the list of possible MTags together with their corresponding lemmas. In a given context, just one (MTag, lemma) pair "fits in"; the context-sensitive process of selecting the fitting pair is called morphological annotation if it is done manually, or morphological tagging if it is done automatically (see the Czech Language Tagging page). In order to use the tags effectively in applications, and for uniformity, we also follow the usual practice and assign "lemmas" and appropriate "morphological tags" to punctuation.

Thus, on the morphological layer of PDT 1.0, an MTag and a lemma are assigned to each token in the input data. The morphological annotation has been done semi-automatically, in a two-step process:

1, the input text is processed automatically by the morphological analyzer, resulting in a list of possible (lemma, MTag) pairs for each input token.

Example:

  Input sentence:
            Prezident rezignoval na svou funkci.

  Output of the morphological analyzer (SGML markup used):
        <csts>
        <f cap>Prezident<MMl>prezident<MMt>NNMS1-----A----
        <f>rezignoval<MMl>rezignovat_:T<MMt>VpYS---XR-AA---
        <f>na<MMl>na<MMt>RR--4----------<MMt>RR--6----------          # lemma: na; tags: RR--4----------, RR--6----------
        <f>svou<MMl>svůj-1_^(přivlast.)<MMt>P8FS4---------1<MMt>P8FS7---------1
        <f>funkci<MMl>funkce<MMt>NNFS3-----A----<MMt>NNFS4-----A----<MMt>NNFS6-----A----
        <D>
        <d>.<MMl>.<MMt>Z:-------------          # punctuation
        </csts>
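
The analyzer output above can be read back into (lemma, MTag) pairs with a few lines of code. The following is a minimal sketch in Python, not part of the PDT distribution: the function name parse_analysis_line is hypothetical, and the sketch assumes one token per line with exactly the <f>, <d>, <MMl> and <MMt> markup shown in the example.

```python
import re

# One <MMl>lemma followed by one or more <MMt>tag elements.
_LEMMA_GROUP = re.compile(r"<MMl>([^<]*)((?:<MMt>[^<\s]*)+)")
# Opening <f ...> (word form) or <d ...> (punctuation) and the form itself.
_TOKEN = re.compile(r"\s*<([fd])(?:\s[^>]*)?>([^<]*)")


def parse_analysis_line(line):
    """Return (form, [(lemma, tag), ...]) for a token line, else None.

    Lines such as <csts>, </csts> or <D> carry no token and yield None.
    """
    token = _TOKEN.match(line)
    if token is None:
        return None
    form = token.group(2)
    pairs = []
    for lemma, tag_run in _LEMMA_GROUP.findall(line):
        for tag in re.findall(r"<MMt>([^<\s]*)", tag_run):
            pairs.append((lemma, tag))
    return form, pairs


line = "<f>na<MMl>na<MMt>RR--4----------<MMt>RR--6----------"
form, pairs = parse_analysis_line(line)
# A token is ambiguous when the analyzer offers more than one pair.
is_ambiguous = len(pairs) > 1
```

Tags are cut at the first whitespace or "<", so trailing remarks such as "# punctuation" in the listing above are ignored.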
 

2, manual disambiguation selects, for each token, the single (lemma, MTag) pair that fits the context.

Example:

    Input:            # three ambiguous word forms: na, svou, funkci
        <csts>
        <f cap>Prezident<MMl>prezident<MMt>NNMS1-----A----
        <f>rezignoval<MMl>rezignovat_:T<MMt>VpYS---XR-AA---
        <f>na<MMl>na<MMt>RR--4----------<MMt>RR--6----------
        <f>svou<MMl>svůj-1_^(přivlast.)<MMt>P8FS4---------1<MMt>P8FS7---------1
        <f>funkci<MMl>funkce<MMt>NNFS3-----A----<MMt>NNFS4-----A----<MMt>NNFS6-----A----
        <D>
        <d>.<MMl>.<MMt>Z:-------------
        </csts>

    Output of manual annotation:

        <csts>
        <f cap>Prezident<l>prezident<t>NNMS1-----A----
        <f>rezignoval<l>rezignovat_:T<t>VpYS---XR-AA---
        <f>na<l>na<t>RR--4----------
        <f>svou<l>svůj-1_^(přivlast.)<t>P8FS4---------1
        <f>funkci<l>funkce<t>NNFS4-----A----
        <D>
        <d>.<l>.<t>Z:-------------
        </csts>
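
The relation between the two listings can be illustrated as follows. This is only a sketch under stated assumptions: the function name disambiguate is hypothetical, and the check that the chosen pair must come from the analyzer's candidate list is an assumption made for the illustration, not a documented property of the annotation tool.

```python
def disambiguate(open_tag, form, candidates, chosen):
    """Render one disambiguated token line in the <l>/<t> markup.

    candidates: list of (lemma, tag) pairs offered by the analyzer.
    chosen:     the single (lemma, tag) pair selected by the annotator.
    """
    # Assumption: the annotator may only pick one of the offered pairs.
    if chosen not in candidates:
        raise ValueError("chosen pair is not among the analyzer's candidates")
    lemma, tag = chosen
    return f"{open_tag}{form}<l>{lemma}<t>{tag}"


candidates = [("na", "RR--4----------"), ("na", "RR--6----------")]
line = disambiguate("<f>", "na", candidates, ("na", "RR--4----------"))
# line == "<f>na<l>na<t>RR--4----------"
```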
 
 

The morphological layer has been annotated by a group of annotators separate from the group that annotated the other two layers. The group (nine undergraduate students with a background in either computer science or linguistics) proceeded in two phases. In the first phase, two annotators independently chose, for each text to be annotated, the (lemma, MTag) pair for each token from the list suggested by the morphological analyzer. The two versions of the same text were then compared, and in the second phase another annotator resolved the differences between them. Eight of the nine students were "first phase" annotators and only one was the "second phase" annotator-arbiter, in the hope of achieving consistent tag assignment throughout the corpus.
In order to make the annotation of texts more human-friendly, a special-purpose tool has been developed. The tool was first implemented for the Linux platform and then reimplemented for the MS Windows platform.
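
The two-phase procedure described above (two independent annotations, comparison, arbitration) can be sketched as follows. This is an illustration only and does not describe the actual annotation tool: the function names find_disagreements and arbitrate are hypothetical, and each annotation is assumed to be a list with one (lemma, MTag) pair per token.

```python
def find_disagreements(first, second):
    """Return [(token_index, pair_from_first, pair_from_second), ...]."""
    if len(first) != len(second):
        raise ValueError("both annotations must cover the same tokens")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(first, second))
        if a != b
    ]


def arbitrate(first, second, decisions):
    """Merge two first-phase annotations into one final annotation.

    decisions maps a token index to the pair chosen by the arbiter;
    every disagreement must have a decision.
    """
    final = list(first)
    for i, _a, _b in find_disagreements(first, second):
        final[i] = decisions[i]
    return final


first = [("na", "RR--4----------"), ("funkce", "NNFS4-----A----")]
second = [("na", "RR--6----------"), ("funkce", "NNFS4-----A----")]
diffs = find_disagreements(first, second)  # only token 0 differs
final = arbitrate(first, second, {0: ("na", "RR--4----------")})
```

Tokens on which both annotators agree pass through unchanged, so the arbiter only ever sees the differing positions.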
 

A description of the post-annotation checking steps carried out at the morphological layer is available (pdffile, psfile).
 

Currently, Czech MTags are defined as strings of 15 positions, where each position corresponds to precisely one morphological category. See the detailed description (psfile, pdffile) and the quick reference (htmlfile, pdffile).
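
Because every category occupies a fixed position, a tag can be decoded by position alone. The sketch below is illustrative: the function name decode_tag is hypothetical, and the position names are the ones commonly used for the PDT positional tagset, which should be checked against the quick reference before being relied on.

```python
# Assumed names of the 15 positions, in order (verify against the
# quick reference; they are not listed in this section).
POSITIONS = (
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
)


def decode_tag(tag):
    """Map each of the 15 positions of an MTag to its one-character value."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} positions, got {len(tag)}")
    return dict(zip(POSITIONS, tag))


decoded = decode_tag("NNFS4-----A----")
# decoded["Case"] == "4"; a "-" marks a position with no value for this tag
```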