<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" 
                      "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">



<!-- Conversion to HTML: /home/pajas/bin/docbook &hyph;&hyph;html tr18dz.xml -->
<!-- DocBook documentation: http://www.oasis-open.org/docbook/documentation/reference/html/docbook.html -->



<!--
<!doctype book PUBLIC "-//UFAL//DocBk XML V3.1.7-Based Extension V1.0//EN"
  "ufaldb.dtd">
-->
<?xml-stylesheet type="text/css" href="modern.css"?>
<book lang="en">
  <bookinfo>
    <title>Manual for Morphological Annotation</title>
    <subtitle>Revision for the Prague Dependency Treebank 2.0</subtitle>
    <subtitle>ÚFAL Technical Report No. 2005-27</subtitle>
    <authorgroup>
      <author>
    <firstname>Jiří</firstname>
    <surname>Hana</surname>
      </author>
      <author>
        <firstname>Daniel</firstname>
        <surname>Zeman</surname>
      </author>
    </authorgroup>
    <authorgroup>
      <collab><collabname>Jan Hajič</collabname></collab>
      <collab><collabname>Hana Hanová</collabname></collab>
      <collab><collabname>Barbora Hladká</collabname></collab>
      <collab><collabname>Emil Jeřábek</collabname></collab>
    </authorgroup>
    <orgname><ulink url="http://ufal.mff.cuni.cz/">Ústav formální a
    aplikované lingvistiky</ulink>, <ulink
    url="http://www.mff.cuni.cz/">Matematicko-fyzikální
    fakulta</ulink>, <ulink url="http://www.cuni.cz/">Univerzita
    Karlova</ulink>, Praha, Czechia</orgname>
  </bookinfo>

  <preface>
    <title>Preface to Version 2.0</title>
    <para>Although the title of this report inherits the word
    &quot;Manual&quot; from the previous version, it is no longer
    intended to guide the annotators. Rather, it attempts to describe
    the current state of the morphological annotation in PDT
    2.0. Most of the added information resulted from several
    semi-automatic checks performed on the data before their
    release. In some cases it was not feasible to bring the data
    to the desired state; in such cases, both the desired and the
    current state of the data are described.</para>
    <para>PDT 2.0 contains 1,960,657 morphologically annotated
    tokens in 126,831 sentences. There are 168,454 distinct word
    forms, 71,716 distinct lemmas, and 1,740 morphological tags.</para>
    <!-- ntred -qTNe 'print("$this->{lemma}\n") if($this!=$root and
    $this->{TID} eq "")' | sort -u | wc -l -->
    <para>The final checking and analysis of the data as well as the
    work on this manual revision were supported by the Czech Academy
    of Sciences program called "Information Society", projects
    No. 1ET101120503 and 1ET101120413, and the grant No. GA405/03/0913.</para>
  </preface>
  
  <preface><title>Preface to Version 1.0</title>
    <para>We are pleased to publish the first version of the manual
      for morphological annotation of Czech sentences. We believe that
      such guidelines can be of use to the users of the Prague Dependency
      Treebank 1.0 (PDT 1.0), as well as for the preparation of new
      data.</para>
    <para>Let us recall the most important steps we took to
      obtain about two million morphologically annotated words (PDT 1.0).
      At the very beginning, we put together a team of eight
      annotators and introduced them to the system of morphological
      tags we had designed to describe Czech morphological properties. We
      also used a morphological analyzer that processes isolated words
      as a preprocessing step. Last but not least, we relied
      on the knowledge of Czech morphology that the annotators had acquired
      at secondary school, i.e. we did not give them any
      annotation guidelines.</para>
    <para>One might object that this strategy is too risky: how can
      the discrepancies the annotators produce be handled so that
      the annotation stays consistent? First, each text file was annotated
      by two annotators. Then a &quot;blind&quot; automatic procedure
      (which simply compares two strings, no matter what word is processed)
      detected the words annotated differently. Subsequently, a single
      annotator (a member of a two-member team) resolved these
      cases and also checked the morphological annotations against
      the syntactic-analytic annotations. In this way we compensated for the
      absence of annotation guidelines by successively eliminating
      discrepancies across both the morphological and
      syntactic-analytic levels of annotation.</para>
    <para>Along the way we were writing this annotation manual. It is not
      intended as a comprehensive guide to the morphological
      annotation of Czech sentences (in contrast to the manual for
      syntactic-analytic annotations). The authors concentrate
      &quot;only&quot; on those cases which caused the most
      ambiguities and problems while annotating PDT 1.0. The ongoing
      effort is directed at treating the not-yet-solved
      problematic cases in accord with the
      conventions of the automatic morphological analyzer.</para>
    <para>The morphological annotation of PDT 1.0 was carried out in the
      framework of experimental verification of the definition of
      formal representation of the analysis of Czech sentences (the
      project GAČR 405/96/0198, &quot;Formal representation of
      language structures&quot;). The material obtained in this way
      (data) is used in many domains of research in computational
      linguistics, above all as basic (training) data in projects of
      automatic language analysis, the MŠMT research project
      MSM113000006, the &quot;Laboratory for Language Data
      Processing&quot; (the MŠMT project VS961510) and the Center for
      Computational Linguistics (the MŠMT project LN00A063). These
      data have also been used as verification material for various
      partial projects within the complex program GAČR 405/96/K214
      (&quot;Czech Language in Computer Age&quot;). The &quot;Center
      for Computational Linguistics&quot; project financially
      supported work on these morphological annotation guidelines.</para>
  </preface>
  
  <chapter id="ch-intr"><title>Introduction</title>
  
    <para>This manual is not meant to replace a grammar book of Czech, so we do not systematically define word classes and paradigms. All annotators should understand the fundamentals of Czech morphology, as most native Czech speakers do (it is taught in elementary schools). What we describe here are the difficult or unusual phenomena. Most notably, we address the annotation of proper names, foreign words, and abbreviations. Such categories are covered only rarely and sparsely by standard dictionaries. To get an idea of what a foreign word, proper name, etc. means, it is useful to look it up using an internet portal, an encyclopedia, etc. During annotation, we found the following internet links useful:</para>
  
    <formalpara><title>Portals</title>
      <para>
        <itemizedlist spacing="compact" type="vert">
          <listitem><ulink url="http://www.seznam.cz">http://www.seznam.cz/</ulink> - for Czech products and companies</listitem>
          <listitem><ulink url="http://search.seznam.cz/search.cgi?mod=f&amp;hlp=y">http://search.seznam.cz/search.cgi?mod=f&amp;hlp=y</ulink> - for Czech companies</listitem>
          <listitem><ulink url="http://www.google.com">http://www.google.com/</ulink></listitem>
          <listitem><ulink url="http://www.altavista.com">http://www.altavista.com/</ulink> (shop section for searching various products)</listitem>
        </itemizedlist>
      </para>
    </formalpara>
    
    <formalpara><title>Encyclopedias</title>
      <para>
        <itemizedlist spacing="compact" type="vert">
          <listitem><ulink url="http://cs.wikipedia.org/">http://cs.wikipedia.org/</ulink> and <ulink url="http://en.wikipedia.org/">http://en.wikipedia.org/</ulink></listitem>
          <listitem><ulink url="http://www.encyclopedia.com">http://www.encyclopedia.com/</ulink></listitem>
          <listitem><ulink url="http://www.encarta.msn.com">http://www.encarta.msn.com/</ulink></listitem>
        </itemizedlist>
      </para>
    </formalpara>
    
    <formalpara><title>Dictionaries</title>
      <para>
        <itemizedlist spacing="compact" type="vert">
          <listitem><ulink url="http://slovnik.seznam.cz">http://slovnik.seznam.cz/</ulink> - various dictionaries</listitem>
        </itemizedlist>
      </para>
    </formalpara>
    
    <formalpara><title>Maps</title>
      <para>
        <itemizedlist spacing="compact" type="vert">
          <listitem><ulink url="http://mapy.atlas.cz">http://mapy.atlas.cz/</ulink> - Czechia</listitem>
          <listitem><ulink url="http://www.mapquest.com/maps">http://www.mapquest.com/maps/</ulink> - U.S.A. and the world</listitem>
        </itemizedlist>
      </para>
    </formalpara>
  
  </chapter>
  <!-- The end of INTRODUCTION chapter -->
  
  
  
  
  
  <chapter id="ch-lemm-tag"><title>Lemma and tag structure</title>
    
    <sect1 id="lemma"><title>Lemma structure</title>
    
      <para>A lemma in PDT 1.0 has two parts. The first part, the lemma proper,
        has to be a unique identifier of the lexical item. Usually it
        is the base form (e.g. infinitive for a verb) of the word,
        possibly followed by a number distinguishing different lemmas
        with the same base form. The second part (optional) is not part
        of the identifier and contains additional information about
        the lemma, e.g. semantic or derivational information.</para>
      <para>The formal description of the lemma structure follows. Spaces were inserted between nonterminals to improve readability. Note however that no lemma contains any spaces. Capitalized multi-character symbols are nonterminals. All other symbols are terminals.</para>
      <para>
<synopsis>
Lemma       ::= LemmaProper | LemmaProper AddInfo
LemmaProper ::= Word | Word - Number | Number | SpecialChar
Word        ::= Letter | Letter Word
Letter      ::= A | a | Á | á | Ä | ä | ... | Z | z | Ž | ž | '
Number      ::= NonZero | NonZero Number0
Number0     ::= Digit | Digit Number0
NonZero     ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Digit       ::= 0 | NonZero
SpecialChar ::= ! | " | # | $ | % | &amp; | ' | ( | ) | * | + | , |
                - | . | / | : | ; | &lt; | = | &gt; | ? | @ | [ | \ |
                ] | ^ | _ | ` | { | | | } | ~ | § | °
AddInfo     ::= Reference Category Term Style Comment
Reference   ::= &lt;empty&gt; | ` LemmaProper
Category    ::= &lt;empty&gt; | _: Category1 | _: Category1 Category
Term        ::= &lt;empty&gt; | _; Term1     | _; Term1 Term
Style       ::= &lt;empty&gt; | _, Style1    | _, Style1 Style
Comment     ::= &lt;empty&gt; | _^ Comment1
Category1   ::= N | J | A | Z | M | V | T | W | D | P | C | I | F | B | Q | X
Term1       ::= Y | S | E | G | K | R | m | 
                H | U | L | j | g | c | y | b | u | w | p | z | o
Style1      ::= t | n | a | s | h | e | l | v | x
Comment1    ::= ( Explanation ) | ( Derivation ) |
                ( Explanation )_( Derivation )
Explanation ::= CommentChar | CommentChar Explanation
Derivation  ::= * Number | * Number Word | * Number - Number |
                * Number Word - Number | * Word
CommentChar ::= Letter | Digit |
                ! | " | # | $ | % | &amp; | ' | * | + | , | - | . |
                / | : | ; | &lt; | = | &gt; | ? | @ | [ | \ | ] | ^ |
                _ | ` | { | | | } | ~ | § | °
</synopsis>
      </para>
      <para>Notes on characters:</para>
      <orderedlist>
        <listitem><para>Any character that is a letter in the <ulink
      url="http://www.unicode.org/">Unicode standard</ulink> can
      appear in place of the Letter nonterminal. In the non-ASCII area
      this most frequently applies to the Czech accented characters:
      ÁáČčĎďÉéĚěÍíŇňÓóŘřŠšŤťÚúŮůÝýŽž. However, other characters occur
      in names (e.g. German ÄäÖöÜü, Serbo-Croatian Ćć) and in foreign
      words (e.g. Slovak ĽľĹĺÔôŔŕ).</para></listitem>
        <listitem><para>Standard HTML entities (such as
      <literal>&amp;amp;</literal> for &amp; or
      <literal>&amp;agrave;</literal> for  &agrave;) are also
      allowed. PDT 1.0 was encoded in the ISO Latin 2 codepage, so
      representing any West European characters required using
      entities. PDT 2.0 shall be encoded in UTF-8, so few entities will
      be needed.</para></listitem>
        <listitem><para>The single quote (') is considered a Letter in
      some transcriptions of non-Latin alphabets (e.g. in Chinese
      <foreignphrase>Mao C'-tung</foreignphrase>, Hebrew
      <foreignphrase>Be'er Sheva'</foreignphrase>). If it marks
      deleted parts of words (e.g. English
      <foreignphrase>don't</foreignphrase>, French
      <foreignphrase>d'Artagnan</foreignphrase>), it is considered a
      SpecialChar and it splits the string into three tokens
      (<literal>d</literal> <literal>'</literal>
      <literal>Artagnan</literal>). Even in these languages there are
      exceptions (e.g. the surname
      <foreignphrase>Preud'homme</foreignphrase> is one
      token).</para></listitem>
      </orderedlist>
      <table><title>Lemma examples</title>
        <tgroup cols="3">
          <colspec/>
          <colspec/>
          <colspec/>  
          <thead>
            <row>
              <entry>Whole lemma</entry>
              <entry>LemmaProper</entry>
              <entry>AddInfo</entry>
            </row>
          </thead>
          <tbody>
            <row>
              <entry><literal>chemik</literal></entry>
              <entry>chemik</entry>
              <entry/>
            </row>
            <row>
              <entry><literal>maso_^(jídlo_apod.)</literal></entry>
              <entry>maso</entry>
              <entry>_^(jídlo_apod.)</entry>
            </row>
            <row>
              <entry><literal>Bonn_;G</literal></entry>
              <entry>Bonn</entry>
              <entry>_;G</entry>
            </row>
            <row>
              <entry><literal>vazba-1_^(obviněného)</literal></entry>
              <entry>vazba-1</entry>
              <entry>_^(obviněného)</entry>
            </row>
            <row>
              <entry><literal>vazba-2_^(spojení)</literal></entry>
              <entry>vazba-2</entry>
              <entry>_^(spojení)</entry>
            </row>
            <row>
              <entry><literal>Martinův-1_;Y_^(*4-1)</literal></entry>
              <entry>Martinův-1</entry>
              <entry>_;Y_^(*4-1)</entry>
            </row>
          </tbody>
        </tgroup>
      </table>
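      <para>The grammar above can be turned into a small parser. The
      following Python sketch is only an illustration and is not part of
      the PDT tools: the names <literal>parse_lemma</literal> and
      <literal>ParsedLemma</literal> are hypothetical, and the regular
      expression is deliberately lenient (anything it does not recognize
      as AddInfo stays in the lemma proper). It assumes that the lemma
      contains real characters rather than entities.</para>
<programlisting><![CDATA[
```python
import re
from typing import NamedTuple, Optional, Tuple

# Field alphabets copied from the grammar (Category1, Term1, Style1).
LEMMA_RE = re.compile(
    r"^(?P<proper>.+?)"                          # LemmaProper (shortest prefix)
    r"(?:`(?P<reference>.+?))?"                  # Reference
    r"(?P<category>(?:_:[NJAZMVTWDPCIFBQX])*)"   # Category
    r"(?P<term>(?:_;[YSEGKRmHULjgcybuwpzo])*)"   # Term
    r"(?P<style>(?:_,[tnashelvx])*)"             # Style
    r"(?:_\^(?P<comment>\(.*\)))?$"              # Comment
)
# Base form optionally followed by "-" and a number without a leading zero.
PROPER_RE = re.compile(r"^(?P<base>.+?)(?:-(?P<number>[1-9][0-9]*))?$")


class ParsedLemma(NamedTuple):
    proper: str
    base: str
    number: Optional[int]
    reference: Optional[str]
    categories: Tuple[str, ...]
    terms: Tuple[str, ...]
    styles: Tuple[str, ...]
    comment: Optional[str]


def parse_lemma(lemma: str) -> ParsedLemma:
    """Split a lemma string into LemmaProper and the AddInfo fields."""
    m = LEMMA_RE.match(lemma)
    if m is None:
        raise ValueError(f"not a well-formed lemma: {lemma!r}")
    p = PROPER_RE.match(m["proper"])
    number = int(p["number"]) if p["number"] else None
    return ParsedLemma(
        proper=m["proper"],
        base=p["base"],
        number=number,
        reference=m["reference"],
        # Each flag is "_" + marker + letter, so the letters sit at 2, 5, 8, ...
        categories=tuple(m["category"][2::3]),
        terms=tuple(m["term"][2::3]),
        styles=tuple(m["style"][2::3]),
        comment=m["comment"],
    )
```
]]></programlisting>
      <para>Under these assumptions,
      <literal>parse_lemma("Martinův-1_;Y_^(*4-1)")</literal> yields the
      lemma proper <literal>Martinův-1</literal>, the base form
      <literal>Martinův</literal>, the number 1, the term type
      <literal>Y</literal> and the comment
      <literal>(*4-1)</literal>.</para>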
      
      <sect2 id="lemma-number"><title>Base form and number</title>
        <para>The Word in LemmaProper is the base form of the
        respective paradigm. This means nominative singular for nouns,
        the same plus masculine positive for adjectives, similarly for
        pronouns and numerals. Verbs are represented by their
        infinitive forms.</para>
        <para>The Number in LemmaProper helps to distinguish several
        senses of a homonymous base form. It should neither be zero
        nor start with zero. The numbers used need not form a
        continuous sequence. Sometimes a particular number is
        repeatedly used for a special kind of word (e.g. the lemmas
        numbered "-99" are almost invariably authors' signatures and
        their Category/Term part is "_:B_;S"). Conventions of this
        kind exist solely for the convenience of a human reader;
        they are not meant to signal anything to a processing
        program. No conclusions should ever be drawn from the value of
        the lemma number! There is no guarantee that an observed number
        "semantics" holds anywhere else. Other sources of information,
        such as the AddInfo text, should be used instead.</para>
        <para>The following rules shall hold for each group of lemmas
        sharing the same base form.</para>
        <para>
          <itemizedlist spacing="compact" type="vert">
            <listitem><emphasis role="bold">Rule 1:</emphasis> If lemmas
            use numbers to distinguish lexical items with the same
            base form, they all have to use them, i.e. if there is
            the lemma X-2, the unnumbered lemma X should not exist. If
            more than one lemma shares a base form, all of them must be
            numbered.</listitem>
            <listitem><emphasis role="bold">Rule 2:</emphasis> If a
            lemma is numbered, its AddInfo should not be empty. The
            AddInfo must help to distinguish the lemma from other
            lemmas with the same base form but different
            numbers. Exception: if all but one of the lemmas with the
            same base form are foreign words, the domestic one need not
            have a non-empty AddInfo. All the foreign counterparts
            must have it, though.</listitem>
            <listitem><emphasis role="bold">Rule 3:</emphasis> Two
            lemmas with different AddInfo must differ in numbers as
            well. Exception (see below): abbreviations (two lemmas
            may differ in the presence of <literal>_:B</literal> but not
            in their numbers).</listitem>
            <listitem><emphasis role="bold">Rule 4:</emphasis> Two
            lemmas with different numbers must differ in AddInfo as
            well.</listitem>
          </itemizedlist>
        </para>
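        <para>The four rules above can be checked mechanically. The
        following Python sketch shows one possible reading of them; it
        is not the procedure actually used for PDT 2.0. The function
        names are hypothetical, foreign words are assumed to be
        recognizable by the style flag <literal>_,t</literal>, and the
        foreign-word exception to Rule 2 is assumed to require at least
        one foreign counterpart.</para>
<programlisting><![CDATA[
```python
import re
from collections import defaultdict
from itertools import combinations

# AddInfo starts at a backtick reference or at one of the markers _: _; _, _^
# (the first character of the lemma always belongs to the lemma proper).
ADDINFO_START = re.compile(r"`|_[:;,^]")
NUMBER_SUFFIX = re.compile(r"^(.+?)-([1-9][0-9]*)$")


def split_lemma(lemma):
    """Return (base_form, number_or_None, add_info) for one lemma string."""
    m = ADDINFO_START.search(lemma, 1)
    proper, add_info = (lemma[:m.start()], lemma[m.start():]) if m else (lemma, "")
    n = NUMBER_SUFFIX.match(proper)
    return (n.group(1), int(n.group(2)), add_info) if n else (proper, None, add_info)


def check_numbering(lemmas):
    """Return a sorted list of (rule, base_form) violations among the lemmas."""
    groups = defaultdict(set)
    for lemma in set(lemmas):
        base, number, add_info = split_lemma(lemma)
        groups[base].add((number, add_info))   # base forms are case-sensitive
    violations = set()
    for base, entries in groups.items():
        numbered = [e for e in entries if e[0] is not None]
        # Rule 1: lemmas sharing a base form must all be numbered.
        if len(numbered) < len(entries) and (numbered or len(entries) > 1):
            violations.add((1, base))
        # Rule 2: a numbered lemma needs AddInfo, unless all the other
        # lemmas of the group are foreign words (style flag _,t).
        for number, add_info in numbered:
            others = entries - {(number, add_info)}
            all_foreign = bool(others) and all("_,t" in a for _, a in others)
            if not add_info and not all_foreign:
                violations.add((2, base))
        for (n1, a1), (n2, a2) in combinations(numbered, 2):
            # Rule 3: different AddInfo requires different numbers, except
            # when the only difference is the abbreviation flag _:B.
            if n1 == n2 and a1.replace("_:B", "") != a2.replace("_:B", ""):
                violations.add((3, base))
            # Rule 4: different numbers require different AddInfo.
            if n1 != n2 and a1 == a2:
                violations.add((4, base))
    return sorted(violations)
```
]]></programlisting>
        <para>For instance, the pair
        <literal>vazba-1_^(obviněného)</literal> and
        <literal>vazba-2_^(spojení)</literal> passes all four checks,
        whereas adding an unnumbered <literal>vazba</literal> to the
        group is reported as a violation of Rule 1.</para>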
        <para>Unfortunately, many lemmas are not covered by our
        automatic morphological analyzer. Such lemmas were created by
        the annotators, and the administrator of the lexicon should
        later make their numbers and/or suffixes consistent and
        conforming to the above rules. In many cases it was not
        feasible to complete this task for PDT 2.0.</para>
        <para>The base form in a lemma is case-sensitive. Words
        that always have to be capitalized in writing have their
        lemmas capitalized as well. As a consequence,
        <foreignphrase>špaček</foreignphrase> (starling) and
        <foreignphrase>Špaček_;S</foreignphrase> need not be
        distinguished by numbers (or they can both use the same
        number). However, although not required, unique numbering
        of such cases is recommended.</para>
        <para>Sometimes the numbering of lemmas reflects that their
        base form is homonymous with another word form, although that
        other form is not a base form. For instance,
        <foreignphrase>žena</foreignphrase> is a noun (meaning woman)
        but it can also be a transgressive form of the verb
        <foreignphrase>hnát</foreignphrase>. The morphological
        analyzer may assign different numbers to both meanings of
        <foreignphrase>žena</foreignphrase>, although the latter is
        not a base form. As a consequence, there may be a lemma žena-2
        even if there is no other lemma with the same base form. Such
        behavior is allowed but not required.</para>
      </sect2>
      <sect2 id="lemma-reference"><title>Reference</title>
        <para>Some lemmas refer to other lemmas. A lemma can point to
        at most one other lemma. The reference is one of the means of
        explaining the meaning of the source lemma. This mechanism is
        systematically used with spelled-out numbers (jeden`1, oba`2)
        and with abbreviations for various units
        (kWh`kilowatthodina). Occasionally a reference can occur
        elsewhere as well.</para>
      </sect2>
      <sect2 id="lemma-category"><title>Category</title>
        <para>Lemma category is indicated by "_:" followed by a
        letter. Most categories correspond to parts of speech. They
        are rarely used because the part of speech is encoded in
        morphological tags as well (see below; note however that some
        parts of speech are encoded by different characters in the
        lemma than in the morphological tag). They should be used if
        the same lemma behaves as two or more parts of speech. No
        lemma is allowed to appear with morphological tags for two or
        more different parts of speech. For instance,
        <foreignphrase>vedle</foreignphrase> can be either an adverb or
        a preposition. There should be two lemmas,
        <literal>vedle-1_:D</literal> and
        <literal>vedle-2_:P</literal>. Note however that in PDT 2.0
        some lemmas, especially foreign words, occasionally appear
        with tags for different parts of speech, and if there are
        separate lemmas for each part of speech, it is often described
        verbally in the Comment part rather than formally using the
        Category field. In our example it would be
        <literal>vedle-1_^(je_z_toho_vedle)</literal>, and
        <literal>vedle-2_^(vedle_něčeho)</literal>. This will be
        corrected in future versions.</para>
        <para>Three categories are used on a more systematic basis:
        _:T and _:W for verbal aspect, and _:B for
        abbreviations. Aspect currently has no representation in the
        morphological tags. It is treated as a lexical property:
        although there are some morphological implications, many
        irregularities could be expected if it were part of the verbal
        paradigm. The morphological analyzer covers aspect for some
        verbs while lacking the information for many others. If
        available, the aspect is indicated in the lemma. Note that
        there are biaspectual verbs, so
        <literal>analyzovat_:T_:W</literal> would be
        correct.</para>
        <para>Abbreviations are an exception to Rule 3 (which says that
        different AddInfo implies different lemma numbers). There can
        be two lemmas with the same base form and number, if the only
        difference in their AddInfos is that one contains "_:B" and
        the other does not. For more information on abbreviations see
        <xref linkend="abbr"/>.</para>
        <table id="TableLemmaCategories"><title>Lemma categories</title>
          <tgroup cols="2">
            <colspec/>
            <colspec/>
            <thead>
              <row>
                <entry>Category</entry>
                <entry>Explanation</entry>
              </row>
            </thead>
            <tbody>
              <row>
                <entry>N</entry>
                <entry>noun</entry>
              </row>
              <row>
                <entry>A, J</entry>
                <entry>adjective</entry>
              </row>
              <row>
                <entry>Z</entry>
                <entry>pronoun</entry>
              </row>
              <row>
                <entry>M</entry>
                <entry>numeral</entry>
              </row>
              <row>
                <entry>V</entry>
                <entry>verb</entry>
              </row>
              <row>
                <entry>T</entry>
                <entry>imperfective verb</entry>
              </row>
              <row>
                <entry>W</entry>
                <entry>perfective verb</entry>
              </row>
              <row>
                <entry>D</entry>
                <entry>adverb</entry>
              </row>
              <row>
                <entry>P</entry>
                <entry>preposition</entry>
              </row>
              <row>
                <entry>C</entry>
                <entry>conjunction</entry>
              </row>
              <row>
                <entry>I</entry>
                <entry>particle</entry>
              </row>
              <row>
                <entry>F</entry>
                <entry>interjection</entry>
              </row>
              <row>
                <entry>B</entry>
                <entry>abbreviation</entry>
              </row>
              <row>
                <entry>Q</entry>
                <entry>???</entry>
              </row>
              <row>
                <entry>X</entry>
                <entry>do not use</entry>
              </row>
            </tbody>
          </tgroup>
        </table>
      </sect2>
      
      <sect2 id="lemma-term"><title>Term</title>
        <para>Lemmas of terms have categories of their own. The term
        type is indicated by "_;" followed by a letter. More than one
        term type may apply to one lemma. Two groups of term types can
        be distinguished: the named entities and the
        scientific/professional terms. The former are mandatory:
        proper names must be categorized. The latter are optional: it
        is up to the lexicon administrator to decide whether a
        term is so specialized that its field should be
        indicated.</para>
        <table id="TableTermTypes"><title>Term types</title>
          <tgroup cols="2">
            <colspec/>
            <colspec/>
            <thead>
              <row>
                <entry>Type</entry>
                <entry>Explanation, examples</entry>
              </row>
            </thead>
            <tbody>
              <row>
                <entry>Y</entry>
                <entry>given name (formerly used as default):
                <foreignphrase>Petr</foreignphrase>,
                <foreignphrase>John</foreignphrase></entry>
              </row>
              <row>
                <entry>S</entry>
                <entry>surname, family name:
                <foreignphrase>Dvořák</foreignphrase>,
                <foreignphrase>Zelený</foreignphrase>,
                <foreignphrase>Agassi</foreignphrase>,
                <foreignphrase>Bush</foreignphrase></entry>
              </row>
              <row>
                <entry>E</entry>
                <entry>member of a particular nation, inhabitant of a
                particular territory:
                <foreignphrase>Čech</foreignphrase>,
                <foreignphrase>Kolumbijec</foreignphrase>,
                <foreignphrase>Newyorčan</foreignphrase></entry>
              </row>
              <row>
                <entry>G</entry>
                <entry>geographical name:
                <foreignphrase>Praha</foreignphrase>,
                <foreignphrase>Tatry</foreignphrase> (the
                mountains)</entry>
              </row>
              <row>
                <entry>K</entry>
                <entry>company, organization, institution:
                <foreignphrase>Tatra</foreignphrase> (the
                company)</entry>
              </row>
              <row>
                <entry>R</entry>
                <entry>product: <foreignphrase>Tatra</foreignphrase>
                (the car)</entry>
              </row>
              <row>
                <entry>m</entry>
                <entry>other proper name: names of mines, stadiums,
                guerilla bases, etc.</entry>
              </row>
              <row>
                <entry>H</entry>
                <entry>chemistry</entry>
              </row>
              <row>
                <entry>U</entry>
                <entry>medicine</entry>
              </row>
              <row>
                <entry>L</entry>
                <entry>natural sciences</entry>
              </row>
              <row>
                <entry>j</entry>
                <entry>justice</entry>
              </row>
              <row>
                <entry>g</entry>
                <entry>technology in general</entry>
              </row>
              <row>
                <entry>c</entry>
                <entry>computers and electronics</entry>
              </row>
              <row>
                <entry>y</entry>
                <entry>hobby, leisure, travelling</entry>
              </row>
              <row>
                <entry>b</entry>
                <entry>economy, finances</entry>
              </row>
              <row>
                <entry>u</entry>
                <entry>culture, education, arts, other sciences</entry>
              </row>
              <row>
                <entry>w</entry>
                <entry>sports</entry>
              </row>
              <row>
                <entry>p</entry>
                <entry>politics, government, military</entry>
              </row>
              <row>
                <entry>z</entry>
                <entry>ecology, environment</entry>
              </row>
              <row>
                <entry>o</entry>
                <entry>color indication</entry>
              </row>
            </tbody>
          </tgroup>
        </table>
      </sect2>
      
      <sect2 id="lemma-style"><title>Style</title>
        <para>Lemmas can be stylistically classified. The style flag
        is indicated by "_," followed by a letter. Standard lemmas
        have no stylistic flag but any lemma intended for special
        usage (bookish, colloquial language etc.) should be marked as
        such. It is necessary to distinguish between the style of the
        lemma and the style of the word form! For instance,
        <foreignphrase>acht</foreignphrase> is an archaic word meaning
        "anathema"; its less archaic counterpart would be
        <foreignphrase>klatba</foreignphrase>. The lemma of
        <foreignphrase>acht</foreignphrase> should bear
        the archaic flag: <literal>acht_,a</literal>. On the other
        hand, <foreignphrase>lvové</foreignphrase> is just an archaic
        form of the non-archaic lemma <foreignphrase>lev</foreignphrase>
        (lion). In this case the archaic character should only be marked
        in the morphological tag describing the form (the tag would end
        in 3; see below for tag descriptions).</para>
        <table id="TableStyleFlags"><title>Style flags</title>
          <tgroup cols="2">
            <colspec/>
            <colspec/>
            <thead>
              <row>
                <entry>Style</entry>
                <entry>Explanation</entry>
              </row>
            </thead>
            <tbody>
              <row>
                <entry>t</entry>
                <entry>foreign word - see <xref linkend="foreign"/></entry>
              </row>
              <row>
                <entry>n</entry>
                <entry>dialect</entry>
              </row>
              <row>
                <entry>a</entry>
                <entry>archaic</entry>
              </row>
              <row>
                <entry>s</entry>
                <entry>bookish</entry>
              </row>
              <row>
                <entry>h</entry>
                <entry>colloquial</entry>
              </row>
              <row>
                <entry>e</entry>
                <entry>expressive</entry>
              </row>
              <row>
                <entry>l</entry>
                <entry>slang, argot</entry>
              </row>
              <row>
                <entry>v</entry>
                <entry>vulgar</entry>
              </row>
              <row>
                <entry>x</entry>
                <entry>outdated spelling or misspelling</entry>
              </row>
            </tbody>
          </tgroup>
        </table>
      </sect2>

      <sect2 id="lemma-explanation"><title>Explanatory comment</title>
        <para>Any string in parentheses can be used as an explanation of
        the lemma meaning. The string cannot contain spaces or
        parentheses: the underscore character replaces the
        space, and square brackets are used instead of parentheses. The
        meaning is described in Czech. An example of usage, a synonym,
        etc. can also be used, and a verbal description can be combined
        with an example. Hint for English speakers: the word
        "example" can be abbreviated as
        <foreignphrase>př.</foreignphrase> or
        <foreignphrase>např.</foreignphrase> in the
        descriptions.</para>
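        <para>The substitution convention can be undone for display
        purposes. The short Python sketch below is an illustration
        only; the function name
        <literal>readable_explanation</literal> is hypothetical. Note
        that the conversion is lossy: an underscore or a square bracket
        that was meant literally cannot be told apart from a replaced
        space or parenthesis.</para>
<programlisting><![CDATA[
```python
def readable_explanation(comment: str) -> str:
    """Turn an explanation such as "(jídlo_apod.)" into readable text."""
    # Strip the enclosing parentheses if they are present.
    if comment.startswith("(") and comment.endswith(")"):
        comment = comment[1:-1]
    # Underscores stand for spaces, square brackets for parentheses.
    return comment.replace("_", " ").replace("[", "(").replace("]", ")")
```
]]></programlisting>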
      </sect2>

      <sect2 id="deriv-info"><title>Comment on derivation</title>
        <para>The morphological analyzer handles only
          inflection, not derivation, which means that lemmas are rather
          shallow. However, sometimes the lemma contains information
          about the lemma it is derived from. For example, lemmas of possessive
          adjectives contain information about the noun they are
          derived from (otcův &larr; otec). The information is encoded
          as follows: the number of characters to
          remove from the end, followed by the string to append in order
          to get the deeper lemma. Both the input and the output of this
          process are lemma propers only (including the lemma number, if present).</para>
          <example><title>The following examples illustrate this:</title>
            <itemizedlist spacing="compact" type="vert">
          <listitem><literal>kardinálův_^(*2)</literal> - remove 2
          characters: kardinál</listitem>
              <listitem><literal>Karlův_;Y_^(*3el)</literal> - remove 3
          characters, add &quot;el&quot;: Karel</listitem>
              <listitem><literal>přijetí-2_^(např._návrh)_(*5mout-2)</literal> - remove 5 characters, add &quot;mout-2&quot;: přijmout-2</listitem>
              <listitem><literal>Martinův-1_;Y_^(*4-1)</literal> -
          remove 4 characters, add &quot;-1&quot;: Martin-1</listitem>
            </itemizedlist>
          </example>
          <example><title>Other examples:</title>
            <para>
              <itemizedlist spacing="compact">
                <listitem><literal>Sorosův_;S_^(*2)</literal></listitem>
                <listitem><literal>chlapcův_^(*3ec)</literal></listitem>
                <listitem><literal>Máchův_;S_^(*2a)</literal></listitem>
                <listitem><literal>Hlinkův-1_;S_^(*4a-1)</literal></listitem>
                <listitem><literal>podání_^(něco_[někomu]_[někam])_(*3at)</literal></listitem>
                <listitem><literal>prohlášení_^(*4sit)</literal></listitem>
                <listitem><literal>protiprávnost_^(*3ý)</literal></listitem>
              </itemizedlist>
            </para>
          </example>
    <para>Note: Derivational comments of the form
      <literal>barvicí_^(^IC**barvit)</literal> occur occasionally
      in the current data. Compare
      <literal>barvící_^(*3it)</literal>, which uses the encoding
      described above.</para>
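    <para>The remove-and-append rule can be sketched in Python as
      follows. This is an illustration only: the function name
      <literal>deeper_lemma</literal> is hypothetical, the proper lemma
      is assumed to be everything before the first underscore, and
      characters are counted as Unicode code points (precomposed
      accented letters are assumed). Comments in any other form are not
      interpreted.</para>
    <programlisting role="python"><![CDATA[
```python
import re

# "(*<count><suffix>)": remove <count> characters, then append <suffix>.
DERIVATION = re.compile(r"\(\*(\d+)([^()]*)\)")


def deeper_lemma(lemma):
    """Return the lemma that `lemma` is derived from, or None."""
    match = DERIVATION.search(lemma)
    if match is None:
        return None
    # Assumption: the proper lemma is everything before the first '_'.
    proper = lemma.split("_", 1)[0]
    count, suffix = int(match.group(1)), match.group(2)
    if count > len(proper):
        raise ValueError("cannot remove %d characters from %r" % (count, proper))
    return proper[:len(proper) - count] + suffix


for example in ("Karlův_;Y_^(*3el)",
                "přijetí-2_^(např._návrh)_(*5mout-2)",
                "Martinův-1_;Y_^(*4-1)"):
    print(example, "->", deeper_lemma(example))
```
]]></programlisting>
    <para>The slice is written as <literal>len(proper) - count</literal>
      so that a count of zero leaves the proper lemma unchanged.</para>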
      </sect2>
    </sect1>



    <sect1 id="tag"><title>Tag Structure</title>

      <para>Lemma and tag together should uniquely identify the word form. Two different word forms should always differ either in lemmas or in morphological tags.</para>

      <sect2 id="pos-tags"><title>Positional tags</title>
    
        <para>A positional tag is a string of 15 characters. Every
          position encodes one morphological category using one
          character (mostly upper case letters or numbers).</para>
    
        <table id="PositionalTagAttributes"><title>Attributes in positional tags</title>
          <tgroup cols="3">
            <colspec colnum="3"/>  
            <thead>
              <row>
                <entry>Position</entry>
                <entry>Name</entry>
                <entry>Description</entry>
              </row>
            </thead>
            <tbody>
              <row>
                <entry>1</entry>
                <entry>POS</entry>
                <entry>Part of speech</entry>
              </row>
              <row>
                <entry>2</entry>
                <entry>SUBPOS</entry>
                <entry>Detailed part of speech </entry>
              </row>
              <row>
                <entry>3</entry>
                <entry>GENDER</entry>
                <entry>Gender</entry>
              </row>
              <row>
                <entry>4</entry>
                <entry>NUMBER</entry>
                <entry>Number</entry>
              </row>
              <row>
                <entry>5</entry>
                <entry>CASE</entry>
                <entry>Case</entry>
              </row>
              <row>
                <entry>6</entry>
                <entry>POSSGENDER</entry>
                <entry>Possessor's gender </entry>
              </row>
              <row>
                <entry>7</entry>
                <entry>POSSNUMBER</entry>
                <entry>Possessor's number</entry>
              </row>
              <row>
                <entry>8</entry>
                <entry>PERSON</entry>
                <entry>Person</entry>
              </row>
              <row>
                <entry>9</entry>
                <entry>TENSE</entry>
                <entry>Tense</entry>
              </row>
              <row>
                <entry>10</entry>
                <entry>GRADE</entry>
                <entry>Degree of comparison</entry>
              </row>
              <row>
                <entry>11</entry>
                <entry>NEGATION</entry>
                <entry>Negation</entry>
              </row>
              <row>
                <entry>12</entry>
                <entry>VOICE</entry>
                <entry>Voice</entry>
              </row>
              <row>
                <entry>13</entry>
                <entry>RESERVE1</entry>
                <entry>Reserve</entry>
              </row>
              <row>
                <entry>14</entry>
                <entry>RESERVE2</entry>
                <entry>Reserve</entry>
              </row>
              <row>
                <entry>15</entry>
                <entry>VAR</entry>
                <entry>Variant, style</entry>
              </row>
            </tbody>
          </tgroup>
        </table>
        
        <para>Some characters stand for a set of atomic
          values: for example, 'X' means any value, and 'Y' means
          masculine animate (M) or masculine inanimate (I). A dash ('-') means "not applicable"
          (e.g. tense for nouns).</para>
    
        <para>Not all combinations of tag values are possible. There are
          about 4,000 tags.</para>
        <informalexample><!--     <title>Examples:</title> -->
            <itemizedlist spacing="compact" type="vert"><listitem>hraniční:       <literal>AAIS4----1A----</literal>
            standard adjective, masc. inanimate, singular, accusative, positive</listitem>
              <listitem>potok:            <literal>NNIS4-----A----</literal>
            noun, masc. inanimate, singular, accusative, positive
              </listitem>
              <listitem>karikaturistou:       <literal>NNMS7-----A----</literal>
            noun, masc. animate, singular, instrumental, positive
              </listitem>
              <listitem>ODS:          <literal>NNFXX-----A---8</literal>
            noun, feminine, any number, any case, positive, abbreviation
              </listitem>
              <listitem>podle:            <literal>RR--2----------</literal>
            preposition (non vocalized), requiring genitive
              </listitem>
              <listitem>volen:            <literal>VsYS---XX-AP---</literal>
            verb, passive participle, masculine, singular, any person, any tense, positive, 
            passive 
              </listitem>
            </itemizedlist>
        </informalexample>
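        <para>Because every position has a fixed meaning, a tag can be
          unpacked by position alone. The Python fragment below is a
          minimal sketch; the names <literal>decode_tag</literal> and
          <literal>applicable</literal> are hypothetical, and the
          attribute names are taken from the table of positional tag
          attributes.</para>
        <programlisting role="python"><![CDATA[
```python
# Attribute names in the order of the 15 tag positions.
POSITIONS = [
    "POS", "SUBPOS", "GENDER", "NUMBER", "CASE",
    "POSSGENDER", "POSSNUMBER", "PERSON", "TENSE", "GRADE",
    "NEGATION", "VOICE", "RESERVE1", "RESERVE2", "VAR",
]


def decode_tag(tag):
    """Map a 15-character positional tag to {attribute name: character}."""
    if len(tag) != len(POSITIONS):
        raise ValueError("expected 15 characters, got %d" % len(tag))
    return dict(zip(POSITIONS, tag))


def applicable(tag):
    """Keep only attributes whose value is not '-' (not applicable)."""
    return {name: value for name, value in decode_tag(tag).items() if value != "-"}


print(applicable("NNMS7-----A----"))
```
]]></programlisting>
        <para>For <literal>NNMS7-----A----</literal> the sketch keeps
          POS, SUBPOS, GENDER, NUMBER, CASE and NEGATION and drops the
          nine positions marked as not applicable.</para>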
        <para>See also: 
          <ulink       url="http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf">http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf</ulink></para>
    
        <para>Or for quick reference: 
          <ulink       url="http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html">http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html</ulink></para>
    
        <sect3 id="POS"><title>1 - Part of speech</title>
        
        <para>In fact, part of speech is a lexical-syntactic property rather than a morphological one.
        It is practical to keep it in the tags, but it would be more accurate to keep it in the lemmas.
        In any case, no lemma is allowed to occur with two different parts of speech in the accompanying tags.
        If a word behaves syntactically as various parts of speech, several lemmas have to be reserved for it.</para>
        
          <table id="POSTable"><title>POS</title>
            <tgroup cols="2">
              <colspec colnum="2"/>  
              <thead>
                <row>
                  <entry>Value</entry>
                  <entry>Description</entry>
                </row>
              </thead>
              <tbody>
                <row>
                  <entry>A</entry>
                  <entry>Adjective</entry> 
                </row>
                <row>
                  <entry>C</entry>
                  <entry>Numeral</entry> 
                </row>
                <row>
                  <entry>D</entry>
                  <entry>Adverb</entry> 
                </row>
                <row>
                  <entry>I</entry>
                  <entry>Interjection</entry> 
                </row>
                <row>
                  <entry>J</entry>
                  <entry>Conjunction</entry> 
                </row>
                <row>
                  <entry>N</entry>
                  <entry>Noun</entry> 
                </row>
                <row>
                  <entry>P</entry>
                  <entry>Pronoun</entry> 
                </row>
                <row>
                  <entry>V</entry>
                  <entry>Verb</entry> 
                </row>
                <row>
                  <entry>R</entry>
                  <entry>Preposition</entry> 
                </row>
                <row>
                  <entry>T</entry>
                  <entry>Particle</entry> 
                </row>
                <row>
                  <entry>X</entry>
                  <entry>Unknown, Not Determined, Unclassifiable </entry>
                </row>
                <row>
                  <entry>Z</entry>
                  <entry>Punctuation (also used for the Sentence Boundary token)</entry>
                </row>
              </tbody>  
            </tgroup>
          </table>
    
        </sect3>
    
        <sect3 id="SubPOS"><title>2 - Detailed part of speech</title>
    
          <para>SUBPOS further subcategorizes POS. The POS value is uniquely
            determined by the SUBPOS value.</para>
    
          <table id="SUBPOS"><title>SUBPOS</title>
            <tgroup cols="3">
          <colspec colnum="1" colwidth="10%"/>  
              <colspec colnum="2" colwidth="65%"/> 
              <colspec colnum="3" colwidth="15%"/> 
              <thead>
        <row>
          <entry>Value</entry>
          <entry>Description</entry>
          <entry>POS</entry>
        </row>
              </thead>
              <tbody>
        <row>
          <entry>#</entry>
          <entry>Sentence boundary</entry>
          <entry>Z - punctuation</entry>
        </row>
        <row>
          <entry>%</entry>
          <entry>Author's signature,
          e.g. <literal>haš-99_:B_;S</literal></entry>
          <entry>N - noun</entry>
        </row>
        <row>
          <entry>*</entry>
          <entry>Word krát (lit.: times) </entry>
          <entry>C - numeral</entry>
        </row>
        <row>
          <entry>,</entry>
          <entry>Conjunction subordinate (incl. aby, kdyby in all forms) </entry>
          <entry>J - conjunction </entry>
        </row>
            <row><entry>}</entry>
              <entry>Numeral, written using Roman numerals (XIV) </entry>
              <entry>C - numeral</entry>
            </row>
            <row><entry>:</entry>
              <entry>Punctuation (except for the virtual sentence boundary word 
                ###, which uses the <xref linkend="SUBPOS"/> #) </entry>
              <entry>Z - punctuation</entry>
            </row>
            <row><entry>=</entry>
              <entry>Number written using digits </entry>
              <entry>C - numeral</entry>
            </row>
            <row><entry>?</entry>
              <entry>Numeral kolik
                (lit. how many/how much) </entry>
              <entry>C - numeral</entry>
            </row>
            <row><entry>@</entry>
              <entry>Unrecognized word form </entry>
              <entry>X - unknown</entry>
            </row>
            <row><entry>^</entry>
              <entry>Conjunction (connecting main clauses, not subordinate) </entry>
              <entry>J - conjunction</entry>
            </row>
            <row><entry>4</entry>
              <entry>Relative/interrogative pronoun with adjectival declension of 
                both types (soft and hard) 
                (jaký, který, čí, ..., 
                lit. what, which, whose, ...) </entry>
              <entry>P - pronoun</entry>
            </row>
            <row><entry>5</entry>
              <entry>The pronoun he in the forms required after any preposition (with 
                prefix n-: něj, něho, ..., lit. him in 
                various cases) </entry>
              <entry>P - pronoun</entry>
            </row>
            <row><entry>6</entry>
              <entry>Reflexive pronoun se in long forms (sebe, sobě, sebou, 
                lit. myself / yourself / herself / himself in various cases; 
                se is personless) </entry>
              <entry>P - pronoun</entry>
            </row>
            <row><entry>7</entry>
              <entry><para>Reflexive pronouns se (<xref linkend="CASE"/> = 
                4), 
                si (<xref linkend="CASE"/> = 3), plus the same 
                two forms with contracted -s: ses, sis (distinguished by 
                <xref linkend="PERSON"/> = 2; the number is singular only).
                <phrase role="suggestion">This should be handled more consistently, since virtually any 
                word can take this contracted -s (cos,
                polívkus, ...).</phrase></para></entry>
              <entry>P - pronoun</entry>
            </row>
            <row><entry>8</entry>
              <entry>Possessive reflexive pronoun svůj (lit. 
                my/your/her/his when 
                the possessor is the subject of the sentence) </entry>
              <entry>P - pronoun</entry>
            </row>
            <row><entry>9</entry>
              <entry>Relative pronoun jenž, již, ...  after a 
                preposition (n-: něhož, niž, ..., lit. who)
              </entry>
              <entry>P - pronoun</entry>
            </row>
            <row><entry>A</entry>
              <entry>Adjective, general </entry>
              <entry>A - adjective</entry>
            </row>
            <row><entry>B</entry>
              <entry>Verb, present or future form </entry>
              <entry>V - verb</entry>
            </row>
            <row><entry>C</entry>
              <entry>Adjective, nominal (short, participial) form 
                rád, schopen, ... </entry>
              <entry>A - adjective</entry>
            </row>
            <row><entry>D</entry>
              <entry>Pronoun, demonstrative (ten, onen, ..., lit. 
                this, that, that ... over there, ... )</entry> 
              <entry>P - pronoun</entry>
            </row>
            <row><entry>E</entry>
              <entry>Relative pronoun což (corresponding to 
                English which in 
                subordinate clauses referring to a part of the preceding text) </entry>
              <entry>P - pronoun</entry>
            </row>
            <row><entry>F</entry>
              <entry>Preposition, part of; never appears isolated, always in a phrase 
                (nehledě (na), vzhledem (k), ..., lit. 
                regardless, because of) </entry>
              <entry>R - preposition</entry>
            </row>
            <row><entry>G</entry>
              <entry>Adjective derived from present transgressive form of a verb </entry>
              <entry>A - adjective</entry>
            </row>
            <row><entry>H</entry>
              <entry>Personal pronoun, clitic (short) form (mě, mi, ti, mu, ...); 
                these forms are used in the second position in a clause (lit. me, 
                you, her, him), even though some of them (mě) might be 
                regularly used anywhere as well </entry>
              <entry>P - pronoun</entry>
            </row>
            <row><entry>I</entry>
              <entry>Interjections </entry>
              <entry>I - interjection</entry>
            </row>
            <row><entry>J</entry>
              <entry>Relative pronoun jenž, již, ... not after a preposition 
                (lit. who, whom) </entry>
              <entry>P - pronoun</entry>
            </row>
            <row><entry>K</entry>
              <entry>Relative/interrogative pronoun kdo (lit. who), 
                incl. forms with 
                affixes -ž and -s 
                (affixes are distinguished by the category <xref linkend="VAR"/> 
                (for -ž) and <xref linkend="PERSON"/> 
                (for -s)) </entry>
              <entry>P - pronoun</entry>
            </row>
            <row><entry>L</entry>
              <entry>Pronoun, indefinite všechnen, sám (lit. 
                all, alone) </entry>
              <entry>P - pronoun</entry>
            </row>
            <row><entry>M</entry>
              <entry>Adjective derived from verbal past transgressive form </entry>
              <entry>A - adjective</entry>
            </row>
            <row><entry>N</entry>
              <entry>Noun (general) </entry>
              <entry>N - noun</entry>
            </row>
            <row><entry>O</entry>
              <entry>Pronoun svůj, nesvůj, tentam alone
                (lit. own self, not-in-mood, 
                gone) </entry>
              <entry>P - pronoun</entry>
            </row>
            <row><entry>P</entry>
              <entry>Personal pronoun já, ty, on (lit. I, you, he) 
                (incl. forms with the 
                enclitic -s, e.g. tys, 
                lit. you're); gender position is used for third 
                person to distinguish on/ona/ono (lit. he/she/it), 
                and number 
                for all three persons </entry>
              <entry>P - pronoun</entry>
            </row>
            <row><entry>Q</entry>
              <entry>Pronoun relative/interrogative co, copak, cožpak 
                (lit. what, 
                isn't-it-true-that) </entry>
              <entry>P - pronoun</entry>
            </row>
            <row><entry>R</entry>
              <entry>Preposition (general, without vocalization) </entry>
              <entry>R - preposition</entry>
            </row>
            <row><entry>S</entry>
              <entry>Pronoun possessive můj, tvůj, jeho 
                (lit. my, your, his); gender 
                position used for third person to distinguish jeho, její, jeho 
                (lit. 
                his, her, its), and number for all three pronouns </entry>
              <entry>P - pronoun</entry>
            </row>
            <row><entry>T</entry>
              <entry>Particle </entry>
              <entry>T - particle</entry>
            </row>
            <row><entry>U</entry>
              <entry>Adjective possessive (with the masculine ending -ův as well as 
                feminine -in) </entry>
              <entry>A - adjective</entry>
            </row>
            <row><entry>V</entry>
              <entry>Preposition (with vocalization -e or -u): 
                (ve, pode, ku, ..., lit. in, 
                under, to) </entry>
              <entry>R - preposition</entry>
            </row>
            <row><entry>W</entry>
              <entry>Pronoun negative (nic, nikdo, nijaký, žádný, ..., 
                lit. nothing, 
                nobody, not-worth-mentioning, no/none) </entry>
              <entry>P - pronoun</entry>
            </row>
            <row><entry>X</entry>
              <entry>(temporary) Word form recognized, but tag is missing in 
                dictionary due to delays in (asynchronous) dictionary creation </entry>
              <entry/>
            </row>
            <row><entry>Y</entry>
              <entry>Pronoun relative/interrogative co as an enclitic (after a 
                preposition) (oč, nač, zač, lit. about what, on/onto 
                what, 
                after/for what) </entry>
              <entry>P - pronoun</entry>
            </row>
            <row><entry>Z</entry>
              <entry>Pronoun indefinite (nějaký, některý, číkoli, cosi, ..., lit. some, 
                some, anybody's, something) </entry>
              <entry>P - pronoun</entry>
            </row>
            <row><entry>a</entry>
              <entry>Numeral, indefinite (mnoho, málo, tolik, několik, kdovíkolik, 
                ..., lit. much/many, little/few, that much/many, some (number 
                of), who-knows-how-much/many) </entry>
              <entry>C - numeral</entry>
            </row>
            <row><entry>b</entry>
              <entry>Adverb (without a possibility to form negation and degrees of 
                comparison, e.g. pozadu, naplocho, ..., 
                lit. behind, flatly); i.e. 
                both the <xref linkend="NEGATION"/> as well as the <xref linkend="GRADE"/> attributes in the same 
                tag are marked by - (Not applicable) </entry>
              <entry>D - adverb</entry>
            </row>
            <row><entry>c</entry>
              <entry>Conditional (of the verb být (lit. to be) 
                only) (by, bych, bys, 
                bychom, byste, lit. would) </entry>
              <entry>V - verb</entry>
            </row>
            <row><entry>d</entry>
              <entry>Numeral, generic with adjectival declension (dvojí, desaterý, 
                ..., lit. two-kinds/..., ten-...) </entry>
              <entry>C - numeral</entry>
            </row>
            <row><entry>e</entry>
              <entry>Verb, transgressive present (endings -e/-ě, -íc, -íce) </entry>
              <entry>V - verb</entry>
            </row>
            <row><entry>f</entry>
              <entry>Verb, infinitive </entry>
              <entry>V - verb</entry>
            </row>
            <row><entry>g</entry>
              <entry>Adverb (forming negation (<xref linkend="NEGATION"/> set to 
                A/N) and degrees 
                of comparison (<xref linkend="GRADE"/> set to 1/2/3, i.e. positive/comparative/superlative), 
                e.g. velký, zajímavý, ..., 
                lit. big, interesting) </entry>
              <entry>D - adverb</entry>
            </row>
            <row><entry>h</entry>
              <entry>Numeral, generic; only jedny and nejedny 
                (lit. one-kind/sort-of, 
                not-only-one-kind/sort-of) </entry>
              <entry>C - numeral</entry>
            </row>
            <row><entry>i</entry>
              <entry>Verb, imperative form </entry>
              <entry>V - verb</entry>
            </row>
            <row><entry>j</entry>
              <entry>Numeral, generic greater than or equal to 4 used as a syntactic 
                noun (čtvero, desatero, ..., lit. four-kinds/sorts-of, 
                ten-...) </entry>
              <entry>C - numeral</entry>
            </row>
            <row><entry>k</entry>
              <entry>Numeral, generic greater than or equal to 4 used as a syntactic 
                adjective, short form (čtvery, ..., lit. four-kinds/sorts-of)
              </entry>
              <entry>C - numeral</entry>
            </row>
            <row><entry>l</entry>
              <entry>Numeral, cardinal jeden, dva, tři, čtyři, půl, ... 
                (lit. one, two, 
                three, four); also sto and tisíc 
                (lit. hundred, thousand) if noun 
                declension is not used </entry>
              <entry>C - numeral</entry>
            </row>
            <row><entry>m</entry>
              <entry>Verb, past transgressive; also archaic present transgressive of 
                perfective verbs (ex.: udělav, lit. 
                (he-)having-done; arch. also 
                udělaje (<xref linkend="VAR"/> = 4), 
                lit. (he-)having-done) </entry>
              <entry>V - verb</entry>
            </row>
            <row><entry>n</entry>
              <entry>Numeral, cardinal greater than or equal to 5 </entry>
              <entry>C - numeral</entry>
            </row>
            <row><entry>o</entry>
              <entry>Numeral, multiplicative indefinite (-krát, 
                lit. (times): 
                mnohokrát, tolikrát, ..., 
                lit. many times, that many times) </entry>
              <entry>C - numeral</entry>
            </row>
            <row><entry>p</entry>
              <entry>Verb, past participle, active (including forms with the enclitic 
                -s, lit. 're (are))</entry> 
              <entry>V - verb</entry>
            </row>
            <row><entry>q</entry>
              <entry>Verb, past participle, active, with the enclitic -ť, 
                lit. (perhaps) 
                -could-you-imagine-that? or 
                but-because- (both archaic) </entry>
              <entry>V - verb</entry>
            </row>
            <row><entry>r</entry>
              <entry>Numeral, ordinal (adjective declension without degrees of 
                comparison) </entry>
              <entry>C - numeral</entry>
            </row>
            <row><entry>s</entry>
              <entry>Verb, past participle, passive (including forms with the enclitic 
                -s, lit. 're (are)) </entry>
              <entry>V - verb</entry>
            </row>
            <row><entry>t</entry>
              <entry>Verb, present or future tense, with the enclitic -ť, 
                lit. (perhaps) 
                -could-you-imagine-that? or 
                but-because- (both archaic) </entry>
              <entry>V - verb</entry>
            </row>
            <row><entry>u</entry>
              <entry>Numeral, interrogative kolikrát, lit. 
                how many times? </entry>
              <entry>C - numeral</entry>
            </row>
            <row><entry>v</entry>
              <entry>Numeral, multiplicative, definite (-krát, 
                lit. times: pětkrát, ..., 
                lit. five times)</entry> 
              <entry>C - numeral</entry>
            </row>
            <row><entry>w</entry>
              <entry>Numeral, indefinite, adjectival declension (nejeden, tolikátý, 
                ..., lit. not-only-one, so-many-times-repeated) </entry>
              <entry>C - numeral</entry>
            </row>
            <row><entry>y</entry>
              <entry>Numeral, fraction ending in -ina; used as a noun 
                (pětina, lit. 
                one-fifth) </entry>
              <entry>C - numeral</entry>
            </row>
            <row><entry>z</entry>
              <entry>Numeral, interrogative kolikátý, 
                lit. what 
                (at-what-position-place-in-a-sequence) </entry>
              <entry>C - numeral</entry>
            </row>
              </tbody>
            </tgroup>
          </table>
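          <para>Since SUBPOS determines POS, the first two positions of
            a tag can be checked against each other. The Python lookup
            below is a sketch transcribed from the SUBPOS table; the
            names <literal>POS_OF_SUBPOS</literal> and
            <literal>pos_matches_subpos</literal> are hypothetical.
            The value <literal>g</literal> is mapped to D because the
            table describes it as an adverb, and <literal>X</literal> is
            left out because the table gives no POS for it.</para>
          <programlisting role="python"><![CDATA[
```python
# POS value -> all SUBPOS values listed for it in the SUBPOS table.
SUBPOS_BY_POS = {
    "A": "ACGMU",
    "C": "*}=?adhjklnoruvwyz",
    "D": "bg",  # 'g' is described as an adverb, so it is placed under D
    "I": "I",
    "J": ",^",
    "N": "%N",
    "P": "456789DEHJKLOPQSWYZ",
    "R": "FRV",
    "T": "T",
    "V": "Bcefimpqst",
    "X": "@",
    "Z": "#:",
}

# Inverted lookup: SUBPOS character -> its unique POS character.
POS_OF_SUBPOS = {
    subpos: pos for pos, values in SUBPOS_BY_POS.items() for subpos in values
}


def pos_matches_subpos(tag):
    """True if position 1 is the POS implied by position 2 of the tag."""
    expected = POS_OF_SUBPOS.get(tag[1])
    return expected is not None and expected == tag[0]


print(pos_matches_subpos("VsYS---XX-AP---"))
```
]]></programlisting>
          <para>An unlisted SUBPOS value, including the temporary value
            <literal>X</literal> and the obsolete values, makes the
            check fail in this sketch.</para>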
    
          <table id="ObsoleteSUBPOS"><title>Obsolete SUBPOS values</title>
            <tgroup cols="2">
              <colspec colnum="2"/>  
              <thead>
                <row>
                  <entry>Value</entry>
                  <entry>Description</entry>
                </row>
              </thead>
              <tbody>
                <row>
                  <entry>!</entry>
                  <entry>Abbreviation used as an adverb</entry>
                </row>
                <row>
                  <entry>.</entry>
                  <entry>Abbreviation used as an adjective </entry>
                </row>  
                <row>
                  <entry>~</entry>
                  <entry>Abbreviation used as a verb </entry>
                </row>
                <row>
                  <entry>;</entry>
                  <entry>Abbreviation used as a noun </entry>
                </row>
                <row>
                  <entry>3</entry>
                  <entry>Abbreviation used as a numeral</entry>
                </row>
                <row>
                  <entry>x</entry>
                  <entry>Abbreviation, part of speech unknown/indeterminable </entry>
                </row>
              </tbody>
            </tgroup>
          </table>
    
        </sect3>
    
        <sect3 id="gendert"><title>3 - Gender</title>
        
        <para>In fact, gender is a truly morphological attribute only for adjectives, pronouns, numerals and verbs.
        For nouns, it is a lexical property.
        As a consequence, no noun lemma is allowed to occur with two different genders in the accompanying tags.
        If a word allows for more than one gender, several lemmas have to be reserved for it.</para>
        
          <table id="GENDER"><title>GENDER</title>
            <tgroup cols="2"><colspec colwidth="7%"/>
              <colspec colwidth="83%"/>  
              <thead><row><entry>Value</entry>
              <entry>Description</entry>
            </row>
              </thead>
              <tbody><row><entry>F</entry>
              <entry>Feminine </entry>
            </row>
            <row><entry>H</entry>
              <entry>{F, N} - Feminine or Neuter </entry>
            </row>
            <row><entry>I</entry>
              <entry>Masculine inanimate </entry>
            </row>
            <row><entry>M</entry>
              <entry>Masculine animate </entry>
            </row>
            <row><entry>N</entry>
              <entry>Neuter </entry>
            </row>
            <row><entry>Q</entry>
              <entry>Feminine (with singular only) or Neuter (with plural only); used only with 
                participles and nominal forms of adjectives </entry>
            </row>
            <row><entry>T</entry>
              <entry>Masculine inanimate or Feminine (plural only); used only with participles and 
                nominal forms of adjectives </entry>
            </row>
            <row><entry>X</entry>
              <entry>Any </entry>
            </row>
            <row><entry>Y</entry>
              <entry>{M, I} - Masculine (either animate or inanimate)</entry>
            </row>
            <row><entry>Z</entry>
              <entry>{M, I, N} - Not feminine (i.e., Masculine animate/inanimate or Neuter); only for 
                (some) pronoun forms and certain numerals </entry>
            </row>
              </tbody>
            </tgroup>
          </table>
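          <para>The aggregate GENDER values can be read as sets of the
            four atomic genders. The Python fragment below is a sketch
            based on the table above; the names
            <literal>GENDER_SETS</literal> and
            <literal>genders_agree</literal> are hypothetical, and the
            number restrictions attached to Q and T are deliberately
            ignored.</para>
          <programlisting role="python"><![CDATA[
```python
# Atomic genders covered by each GENDER value, following the table.
GENDER_SETS = {
    "F": {"F"},
    "I": {"I"},
    "M": {"M"},
    "N": {"N"},
    "H": {"F", "N"},
    "Q": {"F", "N"},  # F with singular, N with plural (restriction ignored)
    "T": {"I", "F"},  # plural only (restriction ignored)
    "Y": {"M", "I"},
    "Z": {"M", "I", "N"},
    "X": {"F", "I", "M", "N"},
}


def genders_agree(a, b):
    """True if two GENDER values share at least one atomic gender."""
    return bool(GENDER_SETS[a] & GENDER_SETS[b])


print(genders_agree("Y", "I"), genders_agree("Y", "F"))
```
]]></programlisting>
          <para>The dash (not applicable) is not a key of the mapping,
            so passing it raises <literal>KeyError</literal> in this
            sketch.</para>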
    
        </sect3>
    
        <sect3 id="number"><title>4 - Number</title>
    
          <table id="NUMBER"><title>NUMBER</title>
            <tgroup cols="2"><colspec colnum="1" colwidth="7%"/>  
              <colspec colnum="2" colwidth="83%"/>  
              <thead><row><entry>Value</entry>
              <entry>Description</entry>
            </row>
              </thead>
              <tbody><row><entry>D</entry>
              <entry>Dual, e.g. nohama</entry>
            </row>
            <row><entry>P</entry>
              <entry>Plural, e.g.  nohami </entry>
            </row>
            <row><entry>S</entry>
              <entry>Singular, e.g.  noha  </entry>
            </row>
            <row><entry>W</entry>
              <entry>Singular for feminine gender, plural with neuter; can only appear in participle or 
                nominal adjective form with gender value Q </entry>
            </row>
            <row><entry>X</entry>
              <entry>Any </entry>
            </row>
              </tbody>
            </tgroup>
          </table>
    
        </sect3>
    
        <sect3 id="case"><title>5 - Case</title>
          <table id="CASE"><title>CASE</title>
            <tgroup cols="2"><colspec colnum="2"/>  
              <thead><row><entry>Value</entry>
              <entry>Description</entry>
            </row>
              </thead>
              <tbody><row><entry>1</entry>
              <entry>Nominative, e.g.  žena</entry>
            </row>
            <row><entry>2</entry>
              <entry>Genitive, e.g.  ženy </entry>
            </row>
            <row><entry>3</entry>
              <entry>Dative, e.g.  ženě </entry>
            </row>
            <row><entry>4</entry>
              <entry>Accusative, e.g.  ženu</entry>
            </row>
            <row><entry>5</entry>
              <entry>Vocative, e.g.  ženo</entry>
            </row>
            <row><entry>6</entry>
              <entry>Locative, e.g.  ženě </entry>
            </row>
            <row><entry>7</entry>
              <entry>Instrumental, e.g.  ženou </entry>
            </row>
            <row><entry>X</entry>
              <entry>Any </entry>
            </row>
              </tbody>
            </tgroup>
          </table>
    
        </sect3>
    
        <sect3 id="poss-gender"><title>6 - Possessor's Gender</title>
          <table id="POSSGENDER"><title>POSSGENDER</title>
            <tgroup cols="2"><colspec colnum="2"/>  
              <thead><row><entry>Value</entry>
              <entry>Description</entry>
            </row>
              </thead>
              <tbody><row><entry>F</entry>
              <entry>Feminine, e.g.  matčin, její </entry>
            </row>
            <row><entry>M</entry>
              <entry>Masculine animate (adjectives only), e.g.  otců </entry>
            </row>
            <row><entry>X</entry>
              <entry>Any </entry>
            </row>
            <row><entry>Z</entry>
              <entry>{M, I, N} - Not feminine, e.g.  jeho</entry>
            </row>
              </tbody>
            </tgroup>
          </table>
    
        </sect3>
    
        <sect3 id="poss-number"><title>7 - Possessor's Number </title>
          <table id="POSSNUMBER"><title>POSSNUMBER</title>
            <tgroup cols="2"><colspec colnum="2"/>  
              <thead><row><entry>Value</entry>
              <entry>Description</entry>
            </row>
              </thead>
              <tbody><row><entry>P</entry>
              <entry>Plural, e.g.  náš </entry>
            </row>
            <row><entry>S</entry>
              <entry>Singular, e.g.  můj </entry>
            </row>
        <row>
          <entry>X</entry>
          <entry>Any, e.g. your</entry>
        </row>
              </tbody>
            </tgroup>
          </table>
    
        </sect3>
    
        <sect3 id="person"><title>8 - Person</title>
          <table id="PERSON"><title>PERSON</title>
            <tgroup cols="2"><colspec colnum="2"/>  
              <thead><row><entry>Value</entry>
              <entry>Description</entry>
            </row>
              </thead>
              <tbody><row><entry>1</entry>
              <entry>1st person, e.g.  píšu, píšeme </entry>
            </row>
            <row><entry>2</entry>
              <entry>2nd person, e.g.  píšeš, píšete</entry>
            </row>
            <row><entry>3</entry>
              <entry>3rd person, e.g.  píše, píšou </entry>
            </row>
            <row><entry>X</entry>
              <entry>Any person </entry>
            </row>
              </tbody>
            </tgroup>
          </table>
    
        </sect3>
    
        <sect3 id="tense"><title>9 - Tense</title>
          <table id="TENSE"><title>TENSE</title>
            <tgroup cols="2"><colspec colnum="2"/>  
              <thead><row><entry>Value</entry>
              <entry>Description</entry>
            </row>
              </thead>
              <tbody><row><entry>F</entry>
              <entry>Future </entry>
            </row>
            <row><entry>H</entry>
              <entry>{R, P} - Past or Present </entry>
            </row>
            <row><entry>P</entry>
              <entry>Present </entry>
            </row>
            <row><entry>R</entry>
              <entry>Past </entry>
            </row>
            <row><entry>X</entry>
              <entry>Any</entry>
            </row>
              </tbody>
            </tgroup>
          </table>
    
        </sect3>
    
        <sect3 id="comp10"><title>10 - Degree of Comparison</title>
          <table id="GRADE"><title>GRADE</title>
            <tgroup cols="2"><colspec colnum="2"/>  
              <thead><row><entry>Value</entry>
              <entry>Description</entry>
            </row>
              </thead>
              <tbody><row><entry>1</entry>
              <entry>Positive, e.g. velký </entry>
            </row>
            <row><entry>2</entry>
              <entry>Comparative, e.g. větší </entry>
            </row>
            <row><entry>3</entry>
              <entry>Superlative, e.g. největší</entry>
            </row>
              </tbody>
            </tgroup>
          </table>
    
        </sect3>
    
        <sect3 id="negation"><title>11 - Negation</title>
          <table id="NEGATION"><title>NEGATION</title>
            <tgroup cols="2"><colspec colnum="2"/>  
              <thead><row><entry>Value</entry>
              <entry>Description</entry>
            </row>
              </thead>
              <tbody><row><entry>A</entry>
              <entry>Affirmative (not negated), e.g.  možný </entry>
            </row>
            <row><entry>N</entry>
              <entry>Negated, e.g.  nemožný </entry>
            </row>
              </tbody>
            </tgroup>
          </table>
    
        </sect3>
    
        <sect3 id="voice"><title>12 - Voice</title>
          <table id="VOICE"><title>VOICE</title>
            <tgroup cols="2"><colspec colnum="2"/>  
              <thead><row><entry>Value</entry>
              <entry>Description</entry>
            </row>
              </thead>
              <tbody><row><entry>A</entry>
              <entry>Active, e.g.  píšící </entry>
            </row>
            <row><entry>P</entry>
              <entry>Passive, e.g.  psaný </entry>
            </row>
              </tbody>
            </tgroup>
          </table>
    
        </sect3>
    
        <sect3 id="variant"><title>15 - Variant</title>
          <table id="VAR"><title>VAR</title>
            <tgroup cols="2"><colspec colnum="1" colwidth="7%"/>  
              <colspec colnum="2" colwidth="83%"/>  
              <thead><row><entry>Value</entry>
              <entry>Description</entry>
            </row>
              </thead>
              <tbody><row><entry>-</entry>
              <entry>Basic variant, standard contemporary style; also used for standard forms 
                allowed for use in writing by the Czech Standard Orthography Rules despite 
                being marked there as colloquial</entry>
            </row>
            <row><entry>1</entry>
              <entry>Variant, second most used (less frequent), still standard </entry>
            </row>
            <row><entry>2</entry>
              <entry>Variant, rarely used, bookish, or archaic </entry>
            </row>
            <row><entry>3</entry>
              <entry>Very archaic, also archaic + colloquial </entry>
            </row>
            <row><entry>4</entry>
              <entry>Very archaic or bookish, but standard at the time </entry>
            </row>
            <row><entry>5</entry>
              <entry>Colloquial, but (almost) tolerated even in public </entry>
            </row>
            <row><entry>6</entry>
              <entry>Colloquial (standard in spoken Czech) </entry>
            </row>
            <row><entry>7</entry>
              <entry>Colloquial (standard in spoken Czech), less frequent variant </entry>
            </row>
            <row><entry>8</entry>
              <entry>Abbreviations </entry>
            </row>
            <row><entry>9</entry>
              <entry>Special uses, e.g. personal pronouns after prepositions etc. </entry>
            </row>
              </tbody>
            </tgroup>
          </table>
    
        </sect3>
    
      </sect2>

      <sect2 id="comp-tag"><title>Compact tags</title>

    <para>In most (but not all) cases, a compact tag is obtained simply
      by omitting the dashes from the positional tag.</para>

    <para>For more information, see <ulink url="http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/compact_tags.pdf"/></para>

      </sect2>

      <sect2 id="infabbr"><title>Informal abbreviations </title>

    <para>In certain cases (including some places in this manual), the
      following tag abbreviations are used. Most of them are
      self-evident (dashes and rarely used fields dropped), as you
      can see in the following list:</para>

    <itemizedlist spacing="compact" type="vert"><listitem>Ngnc - noun; NFS1 = <literal>NNFS1-----A----</literal>
      </listitem>
      <listitem>Aagnc - adjective;  AAXXX = <literal>AAXXX----1A----</literal>
      </listitem>
      <listitem>Db - adverb;  Db = <literal>Db-------------</literal> 
      </listitem>
      <listitem>Dg - adverb;  Dg = <literal>Dg-------1A----</literal>
      </listitem>
      <listitem>Dgd - adverb;  Dga2 = <literal>Dg-------2A----</literal>
      </listitem>
      <listitem>J^ -  conjunction;  J^ = <literal>J^-------------</literal>
      </listitem>
      <listitem>J, -  conjunction;  J, = <literal>J,-------------</literal>
      </listitem>
      <listitem>Rc, RRc - preposition;  RR7 = <literal>RR--7----------</literal>
      </listitem>
      <listitem>RVc - vocalized preposition;  RV7 = <literal>RV--7----------</literal>
      </listitem>
      <listitem>TT - particle;  TT = <literal>TT-------------</literal>
      </listitem>
      <listitem>Ng-8, NNgXX-8 - noun abbreviation;  NFXX-8 = <literal>NNFXX-----A---8</literal>
      </listitem>
      <listitem>AX-8, AAXXX-8 - adjective abbreviation;  AAXXX-8 = <literal>AAXXX----1A---8</literal>
      </listitem>
      <listitem>Db-8 - adverb abbreviation;  Db-8 = <literal>Db------------8</literal>
      </listitem>
      <listitem>Rc-8, RRc-8 - preposition abbreviation;  RR7-8 = <literal>RR--7---------8</literal>
      </listitem>
    </itemizedlist>
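    <para>The expansions listed above can be reproduced mechanically.
      The following Python sketch is only an illustration, not part of
      the PDT tools: the field names (such as
      <literal>reserve1</literal>) and the functions
      <literal>build_tag</literal> and <literal>decode_tag</literal>
      are hypothetical labels chosen here. It assumes that every
      position not given explicitly is a dash.</para>

    <programlisting><![CDATA[
```python
# Hypothetical labels for the 15 positions of a positional tag.
# Positions 13 and 14 are dashes in every tag shown in this manual.
FIELDS = [
    "pos", "subpos", "gender", "number", "case",
    "possgender", "possnumber", "person", "tense", "grade",
    "negation", "voice", "reserve1", "reserve2", "var",
]


def build_tag(**values):
    """Return a 15-character tag; positions not given become '-'."""
    unknown = set(values) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown field(s): {sorted(unknown)}")
    for name, value in values.items():
        if len(value) != 1:
            raise ValueError(f"{name} must be a single character")
    return "".join(values.get(name, "-") for name in FIELDS)


def decode_tag(tag):
    """Return a dict holding only the non-dash positions of a tag."""
    if len(tag) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} characters, got {len(tag)}")
    return {name: ch for name, ch in zip(FIELDS, tag) if ch != "-"}


# The informal abbreviation NFS1 expands to a full noun tag.
print(build_tag(pos="N", subpos="N", gender="F", number="S",
                case="1", negation="A"))
# The adjective abbreviation AAXXX decoded back into named fields.
print(decode_tag("AAXXX----1A----"))
```
]]></programlisting>

    <para>Keeping the positions in one ordered list makes the length
      check trivial and keeps the builder and the decoder consistent
      with each other.</para>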
      </sect2>

    </sect1>

  </chapter>
  <!-- the end of Lemma and Tag Structure Chapter
  -->





  <chapter id="names"><title>Names</title>
    
    <para>Unlike in version 1.0, it is now preferred to separate named
    entity tagging from morphology. Named entities (often
    multiple-word) should be marked and categorized as special
    <emphasis>phrases</emphasis> on a layer other than morphological;
    this is a separate project that has not been included
    in PDT 2.0. Lemmas of proper names will still bear information on
    the name category. Nevertheless, we respect the original idea that
    the term suffixes shall explain the meaning of the lemma, not the
    context it appears in. Thus for instance
    <foreignphrase>New</foreignphrase> should be lemmatized as
    <literal>new_,t</literal> in <foreignphrase>New
    York</foreignphrase>, not
    <literal>New_;G</literal>. <foreignphrase>York</foreignphrase>
    should be lemmatized <literal>York_;G</literal> even in
    <foreignphrase>New York Times</foreignphrase> where it was
    previously <literal>York_;K</literal>. For details see
    below.</para>
    
    <para>Unfortunately, it was not feasible to enforce the desired
    lemmatization in PDT 2.0, so the annotation is still inconsistent in
    this respect. We plan to correct it in a future version.</para>
    
    <table><title>Name types</title>
      <tgroup cols="2">
    <colspec/>
    <colspec/>
    <thead>
      <row>
        <entry>Type</entry>
        <entry>Explanation, examples</entry>
      </row>
    </thead>
    <tbody>
      <row>
        <entry>Y</entry>
        <entry>given name (formerly used as default):
                <foreignphrase>Petr</foreignphrase>,
                <foreignphrase>John</foreignphrase></entry>
      </row>
      <row>
        <entry>S</entry>
        <entry>surname, family name:
                <foreignphrase>Dvořák</foreignphrase>,
                <foreignphrase>Zelený</foreignphrase>,
                <foreignphrase>Agassi</foreignphrase>,
                <foreignphrase>Bush</foreignphrase></entry>
      </row>
      <row>
        <entry>E</entry>
        <entry>member of a particular nation, inhabitant of a
                particular territory:
                <foreignphrase>Čech</foreignphrase>,
                <foreignphrase>Kolumbijec</foreignphrase>,
                <foreignphrase>Newyorčan</foreignphrase></entry>
      </row>
      <row>
        <entry>G</entry>
        <entry>geographical name:
                <foreignphrase>Praha</foreignphrase>,
                <foreignphrase>Tatry</foreignphrase> (the
                mountains)</entry>
      </row>
      <row>
        <entry>K</entry>
        <entry>company, organization, institution:
                <foreignphrase>Tatra</foreignphrase> (the
                company)</entry>
      </row>
      <row>
        <entry>R</entry>
        <entry>product: <foreignphrase>Tatra</foreignphrase>
                (the car)</entry>
      </row>
      <row>
        <entry>m</entry>
        <entry>other proper name: names of mines, stadiums,
                guerilla bases, etc.</entry>
      </row>
    </tbody>
      </tgroup>
    </table>

    <para>The lemma should start with an uppercase letter if the word is
      always capitalized when used as a name (<literal>Špaček_;S</literal> is always
      capitalized, <literal>špaček</literal> is not).</para>
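
    <para>The lemma notation used throughout this chapter (an optional
      homonym number such as <literal>-1</literal>, name-type terms
      such as <literal>_;G</literal>, flags such as
      <literal>_,t</literal>, and comments such as
      <literal>_^(...)</literal>) can be split apart with a short
      regular expression. The Python sketch below is an illustration
      only. The function name <literal>parse_lemma</literal> and the
      dictionary keys are hypothetical, and the pattern is inferred
      from the examples in this manual rather than from a formal
      grammar.</para>

    <programlisting><![CDATA[
```python
import re

# Name-type codes from the "Name types" table.
NAME_TYPES = {
    "Y": "given name",
    "S": "surname",
    "E": "member of a nation, inhabitant of a territory",
    "G": "geographical name",
    "K": "company, organization, institution",
    "R": "product",
    "m": "other proper name",
}

# Assumed shape: base, optional "-N", then any number of "_" suffixes.
LEMMA = re.compile(
    r"(?P<base>.+?)"
    r"(?:-(?P<homonym>\d+))?"
    r"(?P<suffixes>(?:_(?:;\w|,\w|\^\([^)]*\)))*)"
)
SUFFIX = re.compile(
    r"_(?:;(?P<term>\w)|,(?P<flag>\w)|\^\((?P<comment>[^)]*)\))"
)


def parse_lemma(lemma):
    """Split a lemma into base, homonym number, terms, flags, comments."""
    match = LEMMA.fullmatch(lemma)
    if match is None:
        raise ValueError(f"cannot parse lemma: {lemma!r}")
    homonym = match["homonym"]
    parsed = {
        "base": match["base"],
        "homonym": int(homonym) if homonym is not None else None,
        "terms": [],
        "flags": [],
        "comments": [],
    }
    for suffix in SUFFIX.finditer(match["suffixes"]):
        if suffix["term"] is not None:
            parsed["terms"].append(suffix["term"])
        elif suffix["flag"] is not None:
            parsed["flags"].append(suffix["flag"])
        else:
            parsed["comments"].append(suffix["comment"])
    return parsed


info = parse_lemma("Pavel-1_;Y")
print(info["base"], info["homonym"],
      [NAME_TYPES.get(t, "not a name type") for t in info["terms"]])
```
]]></programlisting>

    <para>A lemma without any suffix, such as
      <literal>zeman</literal>, yields empty lists and no homonym
      number.</para>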

    <sect1 id="personal-name"><title>Personal names</title>

      <para>Given names and surnames are distinguished by the term
      field in their lemmas (<literal>_;Y</literal>
      vs. <literal>_;S</literal>). Note that we do not use the terms
      <emphasis>first name</emphasis> and <emphasis>last
      name</emphasis> because in some cultures the surname (family
      name) comes first and, more importantly, sometimes the original
      order is respected in Czech texts. If a name can serve both as
      a given name and a family name, the preferred solution is to reserve two
      lemmas (for instance, <foreignphrase>Pavel Pavel</foreignphrase>
      would be lemmatized as <literal>Pavel-1_;Y</literal>
      <literal>Pavel-2_;S</literal>). However, in some cases there is
      currently one lemma covering both usages (such as
      <literal>Pavel_;Y_;S</literal>).</para>

      <para>If a person has only one name, it usually is a given name:
      <literal>Aristoteles_;Y</literal> (Aristotle).</para>

      <para>Personal names homonymous with a normal Czech word should
      always have a lemma of their own. Thus
      <foreignphrase>Zeman</foreignphrase> (surname) is lemmatized as
      <literal>Zeman-1_;S</literal>, not <literal>zeman</literal>
      (squire).</para>

      <para>Personal names are always tagged as nouns, even if they
      have an adjectival form (true for many Slavic surnames):
      <literal>Palacký_;S</literal> /
      <literal>NNMS1-----A----</literal>.</para>

      <para>Czech female surnames are usually derived from (but not
      equal to!) a male surname. Their form strongly resembles a
      possessive adjective: <foreignphrase>paní
      Nováková</foreignphrase> (Mrs. Novák) differs from
      <foreignphrase>Novákova žena</foreignphrase> (Novák's wife) just
      in the length of the final
      <foreignphrase>a/á</foreignphrase>. However,
      <foreignphrase>Nováková</foreignphrase> will neither be analyzed
      as <literal>Novákův_;S_^(*2)</literal> /
      <literal>AUFS1M---------</literal> (a surname cannot be an
      adjective), nor as <literal>Novák_;S</literal> /
      <literal>NN<emphasis
      role="bold">M</emphasis>S1-----A----</literal> (this lemma
      implies the
      masculine gender). The correct analysis would be
      <literal>Nováková_;S_^(*3)</literal> /
      <literal>NNFS1-----A----</literal> (but it lacks the
      derivational information in the current data).</para>

      <para>Foreign surnames of women are usually "femalized" in Czech
      texts (<foreignphrase>Condoleezza Riceová</foreignphrase>). In
      such cases they are treated as normal Czech female surnames. If
      they are left intact (<foreignphrase>Condoleezza
      Rice</foreignphrase>), their lemma must indicate their foreign
      origin and their tag must indicate that their gender and case are
      unknown: <literal>Rice_;S_,t</literal> /
      <literal>NNXSX-----A----</literal>.</para>

      <para>Otherwise, foreign personal names are rarely marked as
      foreign words because in Czech texts, they are usually declined
      according to the Czech grammar: <foreignphrase>Bill Clinton, bez
      Billa Clintona, Billu Clintonovi, s&nbsp;Billem
      Clintonem...</foreignphrase> Thus
      <foreignphrase>Bill</foreignphrase> is lemmatized as
      <literal>Bill_;Y</literal>, not
      <literal>Bill_;Y_,t</literal>. (See also <xref
      linkend="foreign"/>.) Even if a name allows for a
      frozen (undeclined) form, there usually is a context in which it
      can be declined: <foreignphrase>kniha o Willie
      Nelsonovi</foreignphrase> vs. <foreignphrase>kniha o Williem
      Nelsonovi</foreignphrase>; <foreignphrase>zvolili Teng
      Siao-pchinga</foreignphrase> vs. <foreignphrase>zvolili pana
      Tenga</foreignphrase>. Some foreign names, such as
      <foreignphrase>Steffi</foreignphrase>, are never
      declined.</para>

      <sect2 id="von"><title>von, van, etc.</title>

        <para>Prepositions, conjunctions and (foreign) determiners
        form parts of personal names that indicate geographical roots
        of the family (<foreignphrase>Ludwig van Beethoven, Jiří
        z&nbsp;Poděbrad, Kryštof Harant z&nbsp;Polžic a Bezdružic,
        Miguel de Cervantes y Saavedra, Hans van den
        Broek...</foreignphrase>). Both Czech and foreign words of that
        kind are lemmatized as <emphasis>normal words</emphasis>, not
        as given or family names: <literal>z-1</literal>,
        <literal>von-2_,t</literal>, <literal>de_,t</literal>.</para>

    <para>It may not always be clear whether the part after the
    preposition should be annotated as a surname or a geographical
    name. If the Czech preposition
    <foreignphrase>z</foreignphrase> is present, the following
    word is a geographical name (even if it is a foreign location,
    as in <foreignphrase>Blanka z&nbsp;Valois</foreignphrase>). In
    the case of <foreignphrase>von</foreignphrase>,
    <foreignphrase>van</foreignphrase> and
    <foreignphrase>de</foreignphrase>, the original geographical
    meaning is usually less obvious to a Czech reader and the
    following word is annotated as a surname.</para>

    <example>
      <title>Personal names with <foreignphrase>von,
      van</foreignphrase> etc.</title>
      <itemizedlist spacing="compact" type="vert">
        <listitem><foreignphrase>Ludwig van
        Beethoven</foreignphrase> - <literal>Ludwig_;Y
        van-2_,t_^(v_hol._jménech) Beethoven_;S</literal></listitem>
        <listitem><foreignphrase>František Lobkovic</foreignphrase>
        - <literal>František_;Y Lobkovic_;S</literal></listitem>
        <listitem><foreignphrase>František
        z&nbsp;Lobkovic</foreignphrase> - <literal>František_;Y
        z-1 Lobkovice_;G</literal></listitem>
        <listitem><foreignphrase>Kryštof Harant z&nbsp;Polžic a
        Bezdružic</foreignphrase> - <literal>Kryštof_;Y  Harant_;S
        z-1 Polžice_;G  a-1 Bezdružice_;G</literal></listitem>
      </itemizedlist>
    </example>  
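
    <para>The rule of thumb described in this section can be written
      down as a small lookup. The Python sketch below is illustrative
      only: the function name
      <literal>term_after_particle</literal> is hypothetical, and the
      two word sets contain just the particles named in the text, so
      they are not exhaustive.</para>

    <programlisting><![CDATA[
```python
# Particles named in this section; both sets are illustrative only.
GEOGRAPHICAL_AFTER = {"z"}
SURNAME_AFTER = {"von", "van", "de"}


def term_after_particle(particle):
    """Return 'G', 'S', or None for the word that follows the particle."""
    word = particle.lower()
    if word in GEOGRAPHICAL_AFTER:
        return "G"  # Czech "z": the next word is a geographical name
    if word in SURNAME_AFTER:
        return "S"  # von, van, de: the next word is usually a surname
    return None  # the text states no rule for other particles


print(term_after_particle("z"))    # as in František z Lobkovic
print(term_after_particle("van"))  # as in Ludwig van Beethoven
```
]]></programlisting>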

      </sect2>
      <sect2 id="chinese-names"><title>Chinese and Korean names</title>

    <formalpara><title>Usage</title>
      <para>The surname precedes the given name. In most cases, the
        whole name is used (not just the family name). Matters are
        complicated by the fact that many Chinese living
        abroad change the order of their names, use their
        given name as a surname, etc. The discussion below can
        help you determine which part of a name is the given
        name and which part is the surname. If you are in doubt,
        annotate all parts as given names (Y).</para>
    </formalpara>

    <formalpara><title>Surnames</title>
      <para>There are relatively few surnames in China (the 200 most
        common surnames account for &gt;96% of all surnames). Most
        of them consist of one syllable (Wang, Li, Chen, etc.).
        Only a few surnames consist of two syllables (Ou-yang,
        Mo-qi, Si-ma, Pu-yang). Married women do not take their
        husband's surname.</para>
    </formalpara>

    <formalpara><title>Given names</title>
      <para>Mostly two syllables, often connected with a dash (however
            sometimes separated by a space).<footnote>Chinese names are
          usually transcribed using a Chinese-Czech transcription
      system (a mutation of Wade-Giles). Pinyin is rarely used. In
      pinyin, the given name would be concatenated to one token
      instead of three (two words and the dash).</footnote>
          Some given names can be widely used, some are unique. Often
      it is impossible (for a non-Chinese speaker) to say whether 
          it is a name of a male or a female. The second syllable is
          usually used in informal addressing. The first syllable
          can be shared by all siblings. In traditional China a
          person had several given names during his/her life.</para>
    </formalpara>

    <formalpara><title>Most common Chinese surnames (in Pinyin /
      Czech transcription):</title>
      <para><foreignphrase>Cai / Cchaj, Chen / Čchen, Deng / Teng,
      Gao / Kao, Guo 
      / Kuo, He / Che, Hu / Chu, Huang / Chuang, Li, Liang, Lin,
      Lü, Ma, She / Še, Sun, Tang / Tchang, Wang, Wu, Xie / Sie,
      Xu / Sü, Yang / Jang, Ye / Jie, Zhang / Čang, Zhao / Čao,
      Zheng / Čeng, Zhu / Ču</foreignphrase></para>
    </formalpara>

    <formalpara><title>Links</title>
      <para>
            <itemizedlist spacing="compact" type="vert">
          <listitem><ulink url="http://www.geocities.com/Tokyo/3919/atoz.html">http://www.geocities.com/Tokyo/3919/atoz.html</ulink> - Alphabetical Index of Chinese Surnames (incl. Pinyin, Anglicized and other versions)</listitem>
        </itemizedlist>
          </para>
    </formalpara>

        <formalpara><title>Korean names</title>
      <para>Most Korean names look and behave similarly to Chinese
      names. The most common Korean surnames (45% of the
      population) are <foreignphrase>Kim, Lee</foreignphrase>
      (often spelled as <foreignphrase>Rhee, Yi,
      Li</foreignphrase>), and
      <foreignphrase>Park</foreignphrase>.</para>
    </formalpara>

    <note>
      <para>Analogous annotation may be suitable for other
      Far-Eastern names as well (e.g. Vietnamese). It does not
      apply to Japanese names. Japanese usage is similar in that the
      surname comes first and the given name
      second, but the order is usually swapped in Czech texts,
      and if it is not, non-Japanese speakers have few clues to
      decide. Both names are usually written with one or two Chinese characters
      each, but they may be pronounced (and transcribed) using many
      more syllables (packed in two words, one for the given name
      and the other for the surname). One clue is that given names
      of Japanese women often take the suffix
      <foreignphrase>-ko</foreignphrase>.</para>
    </note>

    <example><title>Chinese and Korean names</title>
      <itemizedlist spacing="compact" type="vert">
        <listitem><foreignphrase>Teng Siao-pching</foreignphrase> -
      <literal>Teng_;S Siao_;Y - pching_;Y</literal></listitem>
        <listitem><foreignphrase>Kim Ir-sen</foreignphrase> -
      <literal>Kim_;S Ir_;Y - sen-2_;Y</literal></listitem>
      </itemizedlist>
    </example>
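
    <para>The guidelines of this section can be approximated by a few
      lines of Python. The sketch below is only an illustration. The
      function name <literal>annotate_name</literal> is hypothetical,
      the surname set contains just the Czech transcriptions and
      Korean surnames listed above, and homonym numbers (such as
      <literal>sen-2</literal>) and two-syllable surnames are
      deliberately ignored. When the first word is not a listed
      surname, all parts are marked as given names, following the
      advice for doubtful cases.</para>

    <programlisting><![CDATA[
```python
# Surnames listed in this section (Czech transcription), plus the
# Korean surnames and their spelling variants. Not exhaustive.
COMMON_SURNAMES = {
    "Cchaj", "Čchen", "Teng", "Kao", "Kuo", "Che", "Chu", "Chuang",
    "Li", "Liang", "Lin", "Lü", "Ma", "Še", "Sun", "Tchang", "Wang",
    "Wu", "Sie", "Sü", "Jang", "Jie", "Čang", "Čao", "Čeng", "Ču",
    "Kim", "Lee", "Rhee", "Yi", "Park",
}


def annotate_name(name):
    """Return space-separated lemmas for a transcribed name."""
    tokens = []
    for word in name.split():
        # The dash inside a given name is a token of its own.
        for index, part in enumerate(word.split("-")):
            if index > 0:
                tokens.append("-")
            tokens.append(part)
    lemmas = []
    for position, token in enumerate(tokens):
        if token == "-":
            lemmas.append("-")
        elif position == 0 and token in COMMON_SURNAMES:
            lemmas.append(token + "_;S")
        else:
            lemmas.append(token + "_;Y")  # default in case of doubt
    return " ".join(lemmas)


print(annotate_name("Teng Siao-pching"))
```
]]></programlisting>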

      </sect2>
      <sect2 id="foreign312"><title>Foreignized Czech names</title>

    <para>Sometimes you can encounter names that are Czech in
      their origin, but are somehow altered to fit other languages
      (accents omitted, female and male surnames are the same -
      e.g. <foreignphrase>Judy Sedivy</foreignphrase>, from Czech
      <foreignphrase>Šedivý</foreignphrase>).</para>

    <para>Use the following guidelines to decide the lemma and tag
    for such a name:</para>

    <itemizedlist>
      <listitem>
        <para>A name that does not distinguish female and male
        variants should have just one lemma and a tag with the
        <literal>X</literal> (unknown) gender:
        <literal>Sedivy_;S_,t</literal> /
        <literal>NNXXX-----A----</literal></para>
      </listitem>
      <listitem>
        <para>A name that has the same spelling as in Czech
        should use the Czech lemma: <literal>Jane_;Y
        Janda_;S</literal></para>
      </listitem>
      <listitem>
        <para>A name with altered spelling has its own lemma (with
            the <literal>_,t</literal> suffix): <literal>Judy_;Y
            Sedivy_;S_,t</literal></para>
      </listitem>
    </itemizedlist>

      </sect2>

    </sect1>

    <sect1 id="geograph"><title>Geographical names</title>
      <sect2 id="geo-cities"><title>Countries, cities, rivers, mountains</title>
        <formalpara><title>Main noun</title>
          <para>The main word (head) in a multi-word name of a city is
            always a noun; the same holds for a one-word city name.
            If it is homonymous with an adjective, a new noun lemma is
            created for the name. Thus <foreignphrase>Hluboká</foreignphrase>
            is lemmatized as <literal>Hluboká_;G / NNFS1-----A----</literal>
            rather than <literal>hluboký / AAFS1----1A----</literal> (lit. deep). </para>
          <para>Nouns that are frequently used in names (such as
            <foreignphrase>Újezd, Ústí</foreignphrase>) may have their
            own geographical lemmas even if they are homonymous with a
            normal word. For homonymous pairs where the non-geographical
            usage is much more common (such as <foreignphrase>voda</foreignphrase>
            (water), <foreignphrase>ves</foreignphrase> (village),
            <foreignphrase>město</foreignphrase> (city)) it is recommended
            to stick with the non-geographical lemma even in geographical
            usages.</para>
        </formalpara>
        <formalpara><title>Modifiers in multi-word names</title>
          <para>Attributive adjectives, prepositions, conjunctions etc. should be
            lemmatized as normal words. Other nouns may be lemmatized as
            geographical if they are nested geographical names
            (e.g. names of rivers or mountains in names of
            cities).</para>
        </formalpara>
        <formalpara><title>Part of speech of foreign words</title>
          <para>The original part of speech of the word in the source language
            is used unless there is a good reason not to do so. Besides not knowing
            the original part of speech, a very good reason is that the word
            behaves as a different part of speech in Czech texts. For instance,
            <foreignphrase>blanc</foreignphrase> is an adjective in French
            <foreignphrase>Mont Blanc</foreignphrase> but it behaves as a noun in
            <foreignphrase>na Mont Blanku</foreignphrase>. <foreignphrase>Mont</foreignphrase>
            can be annotated as an undeclined noun.
            See <xref linkend="foreign"/> for more information on foreign words.</para>
        </formalpara>
        <table><title>Examples of geographical names</title>
          <tgroup cols="3">
            <thead>
              <row>
                <entry><para>Name</para></entry>
                <entry><para>Type</para></entry>
                <entry><para>Morphological annotation</para></entry>
              </row>
            </thead>
            <tbody>
              <row>
                <entry><para><foreignphrase>Česká republika</foreignphrase></para></entry>
                <entry><para>country</para></entry>
                <entry><para><literal>český</literal> / <literal>AAFS1----1A----</literal> // <literal>republika</literal> / <literal>NNFS1-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Ústí nad Labem</foreignphrase></para></entry>
                <entry><para>city</para></entry>
                <entry><para><literal>Ústí_;G</literal> / <literal>NNNS1-----A----</literal> // <literal>nad-1</literal> / <literal>RR--7----------</literal> // <literal>Labe_;G</literal> / <literal>NNNS7-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Karlovy Vary</foreignphrase></para></entry>
                <entry><para>city</para></entry>
                <entry><para><literal>Karlův_;Y_^(*3el)</literal> / <literal>AUIP1M---------</literal> // <literal>Vary_;G_^(Karlovy_Vary)</literal> / <literal>NNIP1-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Dobrá Voda</foreignphrase></para></entry>
                <entry><para>city</para></entry>
                <entry><para><literal>dobrý</literal> / <literal>AAFS1----1A----</literal> // <literal>voda</literal> / <literal>NNFS1-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Odolena Voda</foreignphrase></para></entry>
                <entry><para>city</para></entry>
                <entry><para><literal>Odolena_;G_^(Odolena_Voda)</literal> / <literal>AAXXX----1A----</literal> // <literal>voda</literal> / <literal>NNFS1-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Černá v&nbsp;Pošumaví</foreignphrase></para></entry>
                <entry><para>city</para></entry>
                <entry><para><literal>Černá_;G</literal> / <literal>NNFS1-----A----</literal> // <literal>v-1</literal> / <literal>RR--6----------</literal> // <literal>Pošumaví_;G</literal> / <literal>NNNS6-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Ohrada u Hluboké</foreignphrase></para></entry>
                <entry><para>city</para></entry>
                <entry><para><literal>ohrada</literal> / <literal>NNFS1-----A----</literal> // <literal>u-1</literal> / <literal>RR--2----------</literal> // <literal>Hluboká_;G</literal> / <literal>NNFS2-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Hradec Králové</foreignphrase></para></entry>
                <entry><para>city</para></entry>
                <entry><para><literal>Hradec_;G</literal> / <literal>NNIS1-----A----</literal> // <literal>králová_^(královna)</literal> / <literal>NNFS2-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Kostelec nad Černými Lesy</foreignphrase></para></entry>
                <entry><para>city</para></entry>
                <entry><para><literal>Kostelec_;G</literal> / <literal>NNIS1-----A----</literal> // <literal>nad-1</literal> / <literal>RR--7----------</literal> // <literal>černý_;o</literal> / <literal>AAIP7----1A----</literal> // <literal>les</literal> / <literal>NNIP7-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>New York</foreignphrase></para></entry>
                <entry><para>city</para></entry>
                <entry><para><literal>new_,t_^(angl._nový)</literal> / <literal>AAXXX----1A----</literal> // <literal>York_;G</literal> / <literal>NNIS1-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>A Coru&ntilde;a</foreignphrase></para></entry>
                <entry><para>city</para></entry>
                <entry><para><literal>o-10_,t_^(port._člen)</literal> / <literal>AAFSX----1A----</literal> // <literal>Coru&ntilde;a_;G</literal> / <literal>NNFS1-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>S&atilde;o Paulo</foreignphrase></para></entry>
                <entry><para>city</para></entry>
                <entry><para><literal>s&atilde;o_,t_^(port._svatý)</literal> / <literal>AAMSX----1A----</literal> // <literal>Paulo_;Y</literal> / <literal>NNMS1-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Rio de Janeiro</foreignphrase></para></entry>
                <entry><para>city</para></entry>
                <entry><para><literal>Rio_;G</literal> / <literal>NNNS1-----A----</literal> // <literal>de_,t</literal> / <literal>RR--X----------</literal> // <literal>Janeiro_;G</literal> / <literal>NNNS1-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Le Havre</foreignphrase></para></entry>
                <entry><para>city</para></entry>
                <entry><para><literal>le_,t_^(fr._člen)</literal> / <literal>AAISX----1A----</literal> // <literal>Havre_;G</literal> / <literal>NNIS1-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Krems an der Donau</foreignphrase></para></entry>
                <entry><para>city</para></entry>
                <entry><para><literal>Krems_;G</literal> / <literal>NNIS1-----A----</literal> // <literal>an_,t</literal> / <literal>RR--3----------</literal> // <literal>der_,t_^(něm._člen)</literal> / <literal>AAFS3----1A----</literal> // <literal>Donau_;G</literal> / <literal>NNFSX-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>San Juan de la Rambla</foreignphrase></para></entry>
                <entry><para>city</para></entry>
                <entry><para><literal>san_,t_^(šp._a_it._svatý)</literal> / <literal>AAMSX----1A----</literal> // <literal>Juan_;Y</literal> / <literal>NNMS1-----A----</literal> // <literal>de_,t</literal> / <literal>RR--X----------</literal> // <literal>el_,t_^(šp._člen)</literal> / <literal>AAFSX----1A----</literal> // <literal>Rambla_;G</literal> / <literal>NNFSX-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Kao-hsiung</foreignphrase></para></entry>
                <entry><para>city</para></entry>
                <entry><para><literal>Kao_;G</literal> / <literal>AAXXX----1A----</literal> // <literal>-</literal> / <literal>Z:-------------</literal> // <literal>hsiung_;G_^(př._Kao-hsiung)</literal> / <literal>NNXXX-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Wu-lu-mu-čchi</foreignphrase></para></entry>
                <entry><para>city</para></entry>
                <entry><para><literal>Wu_;G</literal> / <literal>NNXXX-----A----</literal> // <literal>-</literal> / <literal>Z:-------------</literal> // <literal>lu_;G</literal> / <literal>NNXXX-----A----</literal> // <literal>-</literal> / <literal>Z:-------------</literal> // <literal>mu_;G</literal> / <literal>NNXXX-----A----</literal> // <literal>-</literal> / <literal>Z:-------------</literal> // <literal>čchi_;G</literal> / <literal>NNXXX-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Gerlachovský štít</foreignphrase></para></entry>
                <entry><para>mountain</para></entry>
                <entry><para><literal>gerlachovský</literal> / <literal>AAIS1----1A----</literal> // <literal>štít</literal> / <literal>NNIS1-----A----</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Divoká Orlice</foreignphrase></para></entry>
                <entry><para>river</para></entry>
                <entry><para><literal>divoký</literal> / <literal>AAFS1----1A----</literal> // <literal>Orlice_;G</literal> / <literal>NNFS1-----A----</literal></para></entry>
              </row>
            </tbody>
          </tgroup>
        </table>
      </sect2>
      
      <sect2 id="geo-streets"><title>Streets</title>
        <para>We assume that a word such as <foreignphrase>ulice</foreignphrase> (street),
          <foreignphrase>náměstí</foreignphrase> (square) etc. is always present,
          even if it is elided on the surface. Therefore the tagging of the street name
          is not altered.</para>
        <example><title>Street names</title>
          <itemizedlist spacing="compact" type="vert">
            <listitem><foreignphrase>Dlouhá</foreignphrase> - <literal>dlouhý / AAFS1----1A----</literal></listitem>
            <listitem><foreignphrase>Dlouhá ulice</foreignphrase> -
                    <literal>dlouhý / AAFS1----1A---- // ulice / NNFS1-----A----</literal></listitem>
            <listitem><foreignphrase>Palackého</foreignphrase> -
                    <literal>Palacký_;S / NNMS2-----A----</literal></listitem>
          </itemizedlist>
        </example>
      </sect2>

    </sect1>

    <sect1 id="comp"><title>Companies and institutions</title>

      <para>Companies, foundations, shops, clubs, sport clubs, restaurants, etc.
        can all have lemmas flagged <literal>_;K</literal>. However, "normal words" (those whose usage
        is not limited to the company name) should get their normal lemmas.
        The flag should be used only if a word cannot be explained in any other way or if its
        meaning has nothing to do with the company (e.g. <literal>Škoda_;K</literal>). The border between
        personal and company names is fuzzy: if it is clear that a surname is part of a
        company name (e.g. <foreignphrase>Uzenářství <literal>Novák_;S</literal> a syn</foreignphrase>),
        it should be lemmatized as a surname. On the other hand, <foreignphrase>Škoda</foreignphrase> should
        be lemmatized as a company even though the company was named after a person, because the name
        is mostly known as a company name. Abbreviations and acronyms are frequent company names;
        see also <xref linkend="abbr"/>.</para>
      
      <table><title>Examples of company names</title>
        <tgroup cols="2">
          <thead>
            <row>
              <entry><para>Name</para></entry>
              <entry><para>Annotation</para></entry>
            </row>
          </thead>
          <tbody>
            <row>
              <entry><para><foreignphrase>Škoda auto, a.s.</foreignphrase></para></entry>
              <entry>
                <para>
                  <literal>Škoda_;K / NNFS1-----A---- // auto / NNNS1-----A---- //
                    , / Z:------------- // akciový_:B / AAFXX----1A---8 //
                    . / Z:------------- // společnost_:B / NNFXX-----A---8 //
                    . / Z:-------------</literal>
                </para>
              </entry>
            </row>
          </tbody>
        </tgroup>
      </table>
      
      <sect2 id="comp-rest"><title>Restaurants</title>
        <table><title>Examples of restaurant names</title>
          <tgroup cols="2">
            <thead>
              <row>
                <entry><para>Name</para></entry>
                <entry><para>Annotation</para></entry>
              </row>
            </thead>
            <tbody>
              <row>
                <entry><para><foreignphrase>Bar Viola</foreignphrase></para></entry>
                <entry><para><literal>
                  bar      / NNIS1-----A---- //
                  Viola_;K / NNFS1-----A----
                </literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>U Medvídků</foreignphrase></para></entry>
                <entry><para><literal>
                  u-1      / RR--2---------- //
                  medvídek / NNMS2-----A----
                </literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>La cambusa</foreignphrase></para></entry>
                <entry><para><literal>
                  le_,t_^(fr._člen) / AAFSX----1A---- //
                  cambusa_;K_,t     / NNFS1-----A----
                </literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Restaurant HaPi</foreignphrase></para></entry>
                <entry><para><literal>
                  restaurant        / NNIS1-----A---- //
                  HaPi_;K           / NNXXX-----A----
                </literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Čínská restaurace S'-ČCHUAN</foreignphrase></para></entry>
                <entry><para><literal>
                  čínský            / AAFS1----1A---- //
                  restaurace        / NNFS1-----A---- //
                  S'_;G             / AAXXX----1A---- //
                  -                 / Z:------------- //
                  čchuan_;G         / NNIS1-----A----
                </literal>
                (Note: the restaurant has been named after the Sichuan province in China.)
                </para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Francouzská restaurace v&nbsp;Obecním domě</foreignphrase></para></entry>
                <entry><para><literal>
                  francouzský / AAFS1----1A---- //
                  restaurace  / NNFS1-----A---- //
                  v-1         / RR--6---------- //
                  obecní      / AAIS6----1A---- //
                  dům         / NNIS6-----A----
                </literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Hospůdka U vylitýho mrože</foreignphrase></para></entry>
                <entry><para><literal>
                  hospůdka / NNFS1-----A---- //
                  u-1      / RR--2---------- //
                  vylitý   / AAMS2----1A---6 //
                  mrož     / NNMS2-----A----
                </literal></para></entry>
              </row>
            </tbody>
          </tgroup>
        </table>
      </sect2>
      <sect2 id="sport"><title>Sport clubs</title>
        <para>Names of sport clubs often combine the proper club name
          with the geographical name of the location the club comes from.
          The former should have <literal>_;K</literal>
          in its lemma, the latter should have <literal>_;G</literal>.</para>
        <para>Of course, it may be difficult to tell whether a word in a foreign
          club name is a location. If you do not know, annotate it as a company.
          To determine whether something is the name of a town or of a club,
          you can try to find the name on a map
          (e.g. <ulink url="http://www.expedia.com/pub/agent.dll?qscr=mmfn">http://www.expedia.com/pub/agent.dll?qscr=mmfn</ulink>)
          or to look up the club (e.g. <ulink url="http://www.soccerage.com">http://www.soccerage.com/</ulink>).</para>
        <table><title>Examples of sport club names</title>
          <tgroup cols="2">
            <thead>
              <row>
                <entry><para>Name</para></entry>
                <entry><para>Annotation</para></entry>
              </row>
            </thead>
            <tbody>
              <row>
                <entry><para><foreignphrase>SKP Union Cheb</foreignphrase></para></entry>
                <entry><para><literal>
                  SKP_:B_;K / NNNXX-----A---- //
                  Union_;K  / NNIS1-----A---- //
                  Cheb_;G   / NNIS1-----A----
                </literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Chelsea FC</foreignphrase></para></entry>
                <entry><para><literal>
                  Chelsea_;G / NNFS1-----A----
                </literal>
                (part of London, UK)
                <literal>FC-1_:B_;K_;w_,t_^(football_club)</literal>
                </para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Sparta Praha</foreignphrase></para></entry>
                <entry><para><literal>Sparta-2_;K Praha_;G</literal>
                  (Although there is a town of Sparta in Greece, it has nothing to do
                  with the football club located in Praha, Czechia.)</para>
                </entry>
              </row>
              <row>
                <entry><para><foreignphrase>Viktoria Žižkov</foreignphrase></para></entry>
                <entry><para><literal>Viktoria-2_;K_^(jméno_sportovního_klubu) Žižkov_;G</literal></para></entry>
              </row>
              <row>
                <entry><para><foreignphrase>Udinese</foreignphrase></para></entry>
                <entry><para><literal>Udinese_;K / NNNSX-----A----</literal>
                  It is an adjective derived from <foreignphrase>Udine</foreignphrase>
                  (a city in Italy); the official name of the club is <foreignphrase>Udinese Calcio</foreignphrase>
                  (Football of Udine). However, the name is perceived in Czech as a noun.</para>
                </entry>
              </row>
            </tbody>
          </tgroup>
        </table>
  
        <para>Names of sport clubs often contain abbreviations.
          Some are common and present in the analyzer's lexicon (e.g.
          FC, AC), while others are quite unusual (e.g. EV, ERC, EC, EG, VS,
          AS). If they are not present in the lexicon,
          enter them with the lemma suffixed by <literal>_:B_;K_;w</literal>
          and tag them as <literal>NNNXX-----A---8</literal>.</para>
      </sect2>
    </sect1>

    <sect1 id="horses"><title>Horses, DJ's etc.</title>

      <para>Horses have all kinds of names (e.g. <foreignphrase>Vinná réva</foreignphrase>,
        <foreignphrase>Deprivace</foreignphrase>, <foreignphrase>He
        Shall Reign</foreignphrase>, <foreignphrase>La Paloma</foreignphrase>,
        <foreignphrase>Monitor</foreignphrase>, <foreignphrase>Frýdlant</foreignphrase>,
        <foreignphrase>Gold End</foreignphrase>, <foreignphrase>Lučina</foreignphrase>,
        <foreignphrase>Green Peace</foreignphrase>, <foreignphrase>Areál</foreignphrase>,
        <foreignphrase>First</foreignphrase>, <foreignphrase>Bounty</foreignphrase>).
        Quite often one does not know whether the horse is male or female
        (sometimes even female-like names belong to a male horse).
        One clue is that in an Oak (a type of horse contest), all horses are young mares, i.e. females.</para>

      <para>If any reasonable analysis is possible, it should be used regardless
        of whether the lemma is marked as a name or not. It will be marked as a name within
        a separate project on named entity recognition.
        However, if the name is a word that has no other
        meaning or if it has a different gender, a new lemma with the
        <literal>_;Y</literal> flag should be introduced.</para>

      <example><title>Names of horses</title>
        <itemizedlist spacing="compact" type="vert">
          <listitem><foreignphrase>Vinná réva</foreignphrase> - <literal>vinný
            / AAFS1----1A---- // réva / NNFS1-----A----</literal></listitem>
          <listitem><foreignphrase>Deprivace</foreignphrase> - <literal>Deprivace_;Y
            / NNFS1-----A----</literal></listitem>
          <listitem><foreignphrase>He Shall Reign</foreignphrase> - <literal>he_,t
            / PPYS1--3------- // shall_,t / VB-S---3P-AA--- // reign_,t / Vf--------A----</literal></listitem>
        </itemizedlist>
      </example>

      <para>Most of the horse names were not annotated correctly in
        PDT 1.0 - simply any available name was selected. (Otherwise,
        a new lemma with category Y <remark>would have to be</remark>
        inserted in each case: e.g. Deprivace would be Deprivace_;Y,
        annotated as deprivace, He Shall Reign annotated as a normal
        English phrase: he_,t, shall_,t reign_,t).</para>

      <para>A similar problem arises with the names of musical groups and DJs.
        For famous groups and DJs, enter separate lemmas; for others,
        use the normally available lemmas.</para>

    </sect1>

    <sect1 id="products"><title>Products</title>
      <para>Similarly to companies, only words that are uniquely product names
        (or they have a homonym but its meaning has nothing to do with the product)
        have their lemmas flagged <literal>_;R</literal>.</para>
      <para>If there is a company and a product of the same name, there should be
        two lemmas, e.g. <literal>Tatra-1_;K</literal> in <foreignphrase>Tatra, a.s.</foreignphrase>,
        and <literal>Tatra-2_;R</literal> in <foreignphrase>Tatra 613</foreignphrase>.</para>
    </sect1>

    <sect1 id="sport354"><title>Sporting and other events</title>
      <para>There is no special lemma term flag for events, but the <literal>_;m</literal>
        flag for generic proper names can be used (<literal>_;m_;w</literal> for sporting events).
        Similarly to companies, only words that are uniquely event names
        (or they have a homonym but its meaning has nothing to do with the event)
        have their lemmas flagged <literal>_;m</literal>.</para>
      <para>If there is a company and an event of the same name, there should be
        two different lemmas.</para>
      <table><title>Examples of event names</title>
        <tgroup cols="2">
          <thead>
            <row>
              <entry><para>Name</para></entry>
              <entry><para>Annotation</para></entry>
            </row>
          </thead>
          <tbody>
            <row>
              <entry><para><foreignphrase>Paris Indoor</foreignphrase></para></entry>
              <entry><para><literal>
                Paris_;G_,t  / NNIXX-----A---- //
                Indoor_;m_,t / NNIXX-----A----
              </literal></para></entry>
            </row>
            <row>
              <entry><para><foreignphrase>US Open</foreignphrase></para></entry>
              <entry><para><literal>
                US-2_:B_,t_^(americký) / AAXXX----1A---8 //
                Open-1_;m_;w_,t_^(otevřený_[turnaj],_v_názvu) / NNIXX-----A----
              </literal></para></entry>
            </row>
            <row>
              <entry><para><foreignphrase>akce Stop milión</foreignphrase></para></entry>
              <entry><para><literal>
                akce / NNFS1-----A---- //
                stopit_:W_^(úplně_spotřebovat_topením) / Vi-S---2--A---- //
                milión`1000000 / NNIS4-----A----
              </literal></para></entry>
            </row>
            <row>
              <entry><para><foreignphrase>Pohár mistrů</foreignphrase></para></entry>
              <entry><para><literal>
                pohár / NNIS1-----A---- //
                mistr / NNMP2-----A----
              </literal></para></entry>
            </row>
            <row>
              <entry><para><foreignphrase>Mistrovství světa</foreignphrase></para></entry>
              <entry><para><literal>
                mistrovství / NNNS1-----A---- //
                svět / NNIS2-----A----
              </literal></para></entry>
            </row>
          </tbody>
        </tgroup>
      </table>
    </sect1>

    <sect1 id="other"><title>Other</title>

      <sect2 id="buildings"><title>Buildings</title>

    <para>If the name of a building cannot be analyzed in any other way,
      it should be a geographical name
      (<literal>Parthenón_;G</literal>). However, most building
      names are made of normal words (<literal>tančící_^(*3it)
      dům</literal>, <literal>pražský hrad</literal>,
      <literal>kostel svatý_:B . kříž</literal>) or other names
      (<literal>chrám svatý_:B . Barbora_;Y</literal>).</para>

      </sect2>



      <sect2 id="tv"><title>Televisions</title>
        <para>Television stations are generally annotated as institutions (<literal>_;K</literal>).
          Only when a company runs several channels are the channels annotated as
          products (<literal>_;R</literal>). This is currently applied only
          to the Czech(oslovak) public television (<foreignphrase>ČT1</foreignphrase>,
          <foreignphrase>ČT2</foreignphrase> and <foreignphrase>F1</foreignphrase>).</para>
        <example><title>TV company names</title>
          <itemizedlist spacing="compact" type="vert">
            <listitem><wordasword>ČT</wordasword> - <literal>ČT_:B_;K</literal>
            </listitem>
            <listitem><wordasword>ČT1</wordasword> - <literal>ČT1_:B_;R</literal>
            </listitem>
            <listitem><wordasword>Nova</wordasword> - <literal>Nova_;K</literal>
            </listitem>
            <listitem><wordasword>NBC</wordasword> - <literal>NBC-4_:B_;K</literal>
            </listitem>
            <listitem><wordasword>CNN</wordasword> - <literal>CNN-1_:B_;K_;y_;b_,t</literal>
            </listitem>
          </itemizedlist>
        </example>
      </sect2>
      <sect2 id="news"><title>News and magazines</title>
        <para>All names of periodicals shall be annotated as products (<literal>_;R</literal>)
          even if their publishing company has the same name.</para>
        <example><title>Names of periodicals</title>
          <itemizedlist spacing="compact" type="vert">
            <listitem><wordasword>Sme</wordasword> - <literal>Sme_;R_^(noviny) / NNNSX-----A----</literal></listitem>
            <listitem><wordasword>Zeitung</wordasword> - <literal>Zeitung-1_;R_,t_^(souč._názvu_něm._novin) / NNISX-----A----</literal> (originally feminine gender in German but perceived as masculine inanimate in Czech)</listitem>
          </itemizedlist>
        </example>
      </sect2>
      <sect2 id="song"><title>Song names</title>
        <para>Songs, TV programs etc. are in fact products.
          Their names usually consist of more than one word and the component words
          mostly have meaning of their own (not unique to the song name).
          Thus the <literal>_;R</literal> flag will rarely be used.</para>
      </sect2>

    </sect1>



    <sect1 id="geo-adj"><title>Adjectives derived from names</title>

      <para>Possessive adjectives derived from personal names (or
        names of nation members, territory inhabitants) retain the
        name flags in their lemmas:
        <literal>Karlův_;Y_^(*3el)</literal>,
        <literal>Mariin_;Y_^(*2e)</literal>,
        <literal>Novákův_;S_^(*2)</literal>,
        <literal>Číňanův_;E_^(*2)</literal>.</para>
      <para>Adjectives derived from geographical names are
        <emphasis>not</emphasis> marked as geographical (no
        <literal>_;G</literal> flag in lemma). They do not even show
        the derivational information. These adjectives are not
        capitalized in Czech, while the original nouns are. So if we
        used the usual mechanism to describe derivation we would have
        to replace the whole lemma:
        <literal>africký_^(*7Afrika)</literal>, not
        <literal>africký_^(*3ka)</literal>.</para>
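      <para>The derivation comments above can be read mechanically. The following
        Python sketch is only an illustration and not part of the annotation tools. It assumes
        that a comment of the form <literal>_^(*NS)</literal> means "strip the last N characters
        of the lemma proper and append the string S", which is consistent with the examples in
        this section. The function name <literal>base_lemma</literal> is hypothetical, lemma
        numbers are not handled, and the input is assumed to use precomposed (NFC) characters.</para>
      <programlisting><![CDATA[
```python
import re

# Assumed shape of a derivation comment: _^(*<digits><suffix>)
DERIVATION = re.compile(r"_\^\(\*(\d+)([^)]*)\)")


def base_lemma(lemma):
    """Return the base word encoded in a derivation comment, or None if there is none."""
    match = DERIVATION.search(lemma)
    if match is None:
        return None
    # The lemma proper is everything before the first flag or comment marker.
    proper = re.split(r"_[;:,^]", lemma, maxsplit=1)[0]
    strip, suffix = int(match.group(1)), match.group(2)
    # Slice from the left so that a strip count of 0 keeps the whole word.
    return proper[:len(proper) - strip] + suffix


if __name__ == "__main__":
    for example in ("Karlův_;Y_^(*3el)", "Mariin_;Y_^(*2e)", "africký_^(*3ka)"):
        print(example, "->", base_lemma(example))
```
]]></programlisting>
      <para>Under these assumptions, <literal>africký_^(*3ka)</literal> yields the lowercase
        <literal>afrika</literal> rather than <literal>Afrika</literal>, which illustrates why
        the whole lemma would have to be replaced.</para>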
    </sect1>

  </chapter>

  <!-- the end of Names chapter
  -->





  <chapter id="abbr"><title>Abbreviations</title>

    <para>Abbreviations of a single word should use the lemma of the word,
      augmented with the <literal>_:B</literal> flag. This is the only acceptable
      situation in which two lemmas share LemmaProper, are not distinguished
      by numbers, but differ in their AddInfo. For instance, the three letters
      (separate tokens) in <foreignphrase>s.r.o.</foreignphrase> are lemmatized
      as <literal>společnost_:B</literal> (company), <literal>ručení_:B</literal>
      (liability), <literal>omezený_:B_^(*3it)</literal> (limited).</para>
    
    <para>Abbreviations consisting of a single capital letter represent
      names. Lots of names can be represented by a letter, and we often do not
      know the name. In such cases, the abbreviation uses itself as a lemma
      (augmented with the appropriate flags). For instance, in
      <foreignphrase>G. Bush</foreignphrase> it would be <literal>G_:B_;Y</literal>
      (despite the fact that in this particular case we know that most probably
      the <foreignphrase>G</foreignphrase> stands for <foreignphrase>George</foreignphrase>).</para>
    
    <para>Acronyms and abbreviations of multi-word expressions use themselves
      as lemmas (again, flagged <literal>_:B</literal>).
      If possible, the comment should explain the abbreviation.
      For instance, <foreignphrase>FIDE</foreignphrase> would be
      <literal>FIDE_:B_;K_;w_,t_^(Fédération_Internationale_des_Échecs)</literal>.</para>
    
    <para>Morphological tags of abbreviations should always end in <literal>8</literal>.</para>
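    <para>The rule above lends itself to a simple consistency check. The following Python
      sketch is a hypothetical helper, not an existing PDT tool. It assumes that a lemma is an
      abbreviation exactly when it carries the <literal>_:B</literal> flag and that positional
      tags are 15 characters long, as in the examples in this manual.</para>
    <programlisting><![CDATA[
```python
TAG_LENGTH = 15  # assumed length of a positional tag


def check_abbreviation(lemma, tag):
    """Return a list of problems found in one lemma/tag pair (empty if none)."""
    problems = []
    if len(tag) != TAG_LENGTH:
        problems.append("tag is not %d characters long" % TAG_LENGTH)
    # Abbreviation lemmas carry the _:B flag; their tags should end in 8.
    if "_:B" in lemma and not tag.endswith("8"):
        problems.append("abbreviation lemma but tag does not end in 8")
    return problems


if __name__ == "__main__":
    pairs = [
        ("například_:B", "Db" + "-" * 12 + "8"),
        ("společnost_:B", "NNFXX-----A----"),
    ]
    for lemma, tag in pairs:
        print(lemma, tag, check_abbreviation(lemma, tag))
```
]]></programlisting>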
      
      <table><title>Examples of abbreviations</title>
        <tgroup cols="3">
          <thead>
            <row>
              <entry><para>Abbreviation</para></entry>
              <entry><para>Full expression</para></entry>
              <entry><para>Annotation</para></entry>
            </row>
          </thead>
          <tbody>
            <row>
              <entry><para><foreignphrase>např.</foreignphrase></para></entry>
              <entry><para><foreignphrase>například</foreignphrase></para></entry>
              <entry><para><literal>například_:B / Db------------8</literal></para></entry>
            </row>
            <row>
              <entry><para><foreignphrase>P.S.</foreignphrase></para></entry>
              <entry><para><foreignphrase>post scriptum</foreignphrase></para></entry>
              <entry><para><literal>post-2_:B_,t_^(lat._po,_např._P.S.) / RR--X---------8 //
                                    scriptum_:B_,t_^(lat.,_např._P.S.) / NNNXX-----A---8</literal></para></entry>
            </row>
            <row>
              <entry><para><foreignphrase>n.L.</foreignphrase></para></entry>
              <entry><para><foreignphrase>nad Labem</foreignphrase></para></entry>
              <entry><para><literal>nad-1_:B / RR--7---------8 //
                                    Labe_:B_;G / NNNS7-----A---8</literal></para></entry>
            </row>
            <row>
              <entry><para><foreignphrase>r. 1998</foreignphrase></para></entry>
              <entry><para><foreignphrase>rok/roku/roce 1998</foreignphrase></para></entry>
              <entry><para><literal>rok_:B / NNIXX-----A---8</literal></para></entry>
            </row>
            <row>
              <entry><para><foreignphrase>r.:</foreignphrase></para></entry>
              <entry><para><foreignphrase>režie:</foreignphrase></para></entry>
              <entry><para><literal>režie_:B / NNFXX-----A---8</literal></para></entry>
            </row>
            <row>
              <entry><para><foreignphrase>rež.:</foreignphrase></para></entry>
              <entry><para><foreignphrase>režie:</foreignphrase></para></entry>
              <entry><para><literal>režie_:B / NNFXX-----A---8</literal>
              Note: This and the previous example violate the rule that each lemma/tag pair
              leads to no more than one word form. Numbering the lemmas is not appropriate
              in this case but no suitable solution has been devised so far.</para></entry>
            </row>
          </tbody>
        </tgroup>
      </table>

    <sect1 id="gender"><title>Gender</title>

      <para>Most abbreviations are nouns and can be used with more than one gender.
        Of course, abbreviations have no endings but the surrounding context can
        reveal their underlying gender whenever gender agreement is required by Czech grammar.
        Neuter is always possible. Besides that, the author may use the gender of
        the main word of the abbreviated expression. The matter can become further complicated
        with foreign expressions if their Czech gender does not correspond to the
        gender in the original language.</para>
      
      <para>In order to keep the rule of a noun lemma not having more than one gender,
        tags of abbreviations should use the <literal>X</literal> gender code. This is
        often broken in PDT 2.0 and abbreviations are the most frequent nouns to have
        two different genders.</para>
      
      <para>There is a similar problem with abbreviations of personal names (<literal>J_:B_;Y</literal>
        can mean both <foreignphrase>Jan</foreignphrase> and <foreignphrase>Jana</foreignphrase>).
        The difference is that here the neuter interpretation is not plausible. Nevertheless,
        the tagset does not provide any code for <literal>{M+F}</literal> genders,
        so the best bet is to stick with <literal>X</literal>.</para>
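      <para>Violations of the one-gender-per-lemma rule can be located automatically. The
        following Python sketch is a hypothetical audit, assuming that the annotation is available
        as (lemma, tag) pairs, that noun tags start with <literal>N</literal>, and that the
        gender code is the third character of the tag. The lemma <literal>FBI_:B_;K</literal>
        in the usage example is made up for illustration.</para>
      <programlisting><![CDATA[
```python
from collections import defaultdict


def lemmas_with_mixed_gender(pairs):
    """Map each noun lemma that occurs with several gender codes to those codes."""
    genders = defaultdict(set)
    for lemma, tag in pairs:
        if tag.startswith("N"):
            genders[lemma].add(tag[2])  # third tag position: gender
    return {
        lemma: sorted(codes)
        for lemma, codes in genders.items()
        if len(codes) > 1
    }


if __name__ == "__main__":
    sample = [
        ("FBI_:B_;K", "NNNXX-----A---8"),  # hypothetical lemma, neuter reading
        ("FBI_:B_;K", "NNFXX-----A---8"),  # hypothetical lemma, feminine reading
        ("A-6_:B_;R", "NNXXX-----A---8"),
    ]
    print(lemmas_with_mixed_gender(sample))
```
]]></programlisting>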
      
      <table><title>Gender of abbreviations</title>
        <tgroup cols="3">
          <thead>
            <row>
              <entry><para>Abbreviation</para></entry>
              <entry><para>Full expression</para></entry>
              <entry><para>Possible genders</para></entry>
            </row>
          </thead>
          <tbody>
            <row>
              <entry><para><foreignphrase>UK</foreignphrase></para></entry>
              <entry><para><foreignphrase>Univerzita Karlova</foreignphrase></para></entry>
              <entry><para><literal>FN</literal></para></entry>
            </row>
            <row>
              <entry><para><foreignphrase>FBI</foreignphrase></para></entry>
              <entry><para><foreignphrase>Federal Bureau of Investigation</foreignphrase></para></entry>
              <entry><para><literal>N</literal> (default), 
                <literal>F</literal> (probably &agrave; la <foreignphrase>CIA</foreignphrase>)</para></entry>
            </row>
            <row>
              <entry><para><foreignphrase>CIA</foreignphrase></para></entry>
              <entry><para><foreignphrase>Central Intelligence Agency</foreignphrase></para></entry>
              <entry><para><literal>FN</literal></para></entry>
            </row>
          </tbody>
        </tgroup>
      </table>
    </sect1>



    <sect1 id="isol"><title>Isolated letters</title>

      <para>Most isolated letters (e.g. <foreignphrase>A-konto</foreignphrase>) are handled as abbreviations.
        Only if they do not form part of a name are they lemmatized as
        <literal>_^(označení_pomocí_písmene)</literal>: <foreignphrase>zápas skupiny B</foreignphrase>.</para>
      
      <para>The following is a prototype of lemmas, their numbers and AddInfos for an isolated letter.
        There should be such lemmas for all letters of the Czech alphabet. Note that numbering
        a lemma by zero is not used anywhere else and might be deprecated in the future.
        In any case, no program should ever rely on the numbers being as indicated.
        Lemma numbers serve to distinguish between homonymous lemmas but they are not meant
        to bear any semantic information.</para>
      
      <itemizedlist spacing="compact" type="vert">
        <listitem><literal>K-0_:B_;Y</literal> - given names</listitem>
        <listitem><literal>K-4_:B_;K</literal> - names of institutions</listitem>
        <listitem><literal>K-5_:B_;G</literal> - geographical names</listitem>
        <listitem><literal>K-6_:B_;R</literal> - names of products</listitem>
        <listitem><literal>K-7_:B_;m</literal> - other names (sporting events etc.)</listitem>
        <listitem><literal>K-9_:B_;S</literal> - surnames</listitem>
        <listitem><literal>k-8_:B_^(ost._zkratka)</literal> - other abbreviations (not names)
                - should not be used if the annotator knows the abbreviated word
                - then the <literal>word_:B</literal> lemma should be used instead</listitem>
        <listitem><literal>k-3_^(označení_pomocí_písmene)</literal> - 
                other isolated letters (not abbreviations, not in names)
        </listitem>
      </itemizedlist>
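      <para>The lemma layout used in the list above can be taken apart mechanically.
        The following is only a minimal sketch, not part of any PDT tool: the names
        <literal>Lemma</literal> and <literal>parse_lemma</literal> are hypothetical, and the
        assumed layout (base form, optional <literal>-number</literal>, optional reference
        after <literal>`</literal>, AddInfo flags, and a <literal>_^(...)</literal> comment
        in the last position) is inferred from the examples in this chapter only.</para>
      <programlisting><![CDATA[
```python
import re
from typing import NamedTuple, Optional, Tuple


class Lemma(NamedTuple):
    base: str                 # base form, e.g. "K" or "strana"
    number: Optional[int]     # homonym number after "-", if any
    reference: Optional[str]  # lemma referred to after "`", if any
    flags: Tuple[str, ...]    # AddInfo flags such as ":B" or ";Y"
    comment: Optional[str]    # text of the trailing _^(...) comment


# Assumed layout: base[-number][`reference][AddInfo...]
_LEMMA = re.compile(
    r"^(?P<base>.+?)"
    r"(?:-(?P<number>\d+))?"
    r"(?:`(?P<reference>[^_]+))?"
    r"(?P<info>(?:_[:;,^].*)?)$"
)


def parse_lemma(lemma: str) -> Lemma:
    """Split a lemma string into its parts (hypothetical helper)."""
    match = _LEMMA.match(lemma)
    if match is None:
        raise ValueError(f"cannot parse lemma: {lemma!r}")
    info = match.group("info")
    comment = None
    cut = info.find("_^(")
    if cut != -1:
        # Assumption: the comment, if present, is the last AddInfo item.
        comment = info[cut + 3:]
        if comment.endswith(")"):
            comment = comment[:-1]
        info = info[:cut]
    flags = tuple(re.findall(r"_([:;,]\w)", info))
    number = match.group("number")
    return Lemma(
        base=match.group("base"),
        # The number only separates homonyms; treat it as opaque.
        number=int(number) if number is not None else None,
        reference=match.group("reference"),
        flags=flags,
        comment=comment,
    )
```
]]></programlisting>
      <para>A consumer of such a parser would typically decide by the flags
        (e.g. <literal>;Y</literal> or <literal>;S</literal>) and ignore the number,
        in line with the warning above.</para>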
      
      <table><title>Examples of isolated letters</title>
        <tgroup cols="2">
          <thead>
            <row>
              <entry><para>Expression</para></entry>
              <entry><para>Annotation of the letter</para></entry>
            </row>
          </thead>
          <tbody>
            <row>
              <entry><para><foreignphrase>A-mužstvo</foreignphrase></para></entry>
              <entry><para><literal>a-3_^(označení_pomocí_písmene) / NNXXX-----A----</literal>
              (Note: An adjective would be more appropriate in this particular case but
              a noun is plausible as well and no lemma is allowed to occur with more than
              one part of speech.)</para></entry>
            </row>
            <row>
              <entry><para><foreignphrase>§ 27 odst. 1 písm. d</foreignphrase></para></entry>
              <entry><para><literal>d-3_^(označení_pomocí_písmene) / NNXXX-----A----</literal></para></entry>
            </row>
            <row>
              <entry><para><foreignphrase>16 A</foreignphrase></para></entry>
              <entry><para><literal>A-1`ampér_:B / NNIXX-----A---8</literal></para></entry>
            </row>
            <row>
              <entry><para><foreignphrase>A-konto</foreignphrase></para></entry>
              <entry><para><literal>A-6_:B_;R / NNXXX-----A---8</literal></para></entry>
            </row>
            <row>
              <entry><para><foreignphrase>ABC, a.s.</foreignphrase></para></entry>
              <entry><para><literal>akciový_:B / AAXXX----1A---8</literal></para></entry>
            </row>
            <row>
              <entry><para><foreignphrase>na s. 128</foreignphrase></para></entry>
              <entry><para><literal>strana-4_:B_^(v_knize,_rukopise...) / NNFXX-----A---8</literal></para></entry>
            </row>
          </tbody>
        </tgroup>
      </table>
    </sect1>



    <sect1 id="measur"><title>Units of measurement</title>

      <para>Unlike most abbreviations, standard unit abbreviations are not followed by
        a period in Czech texts. In PDT 2.0, they often use a lemma equal to the abbreviated
        form, referring to the unabbreviated lemma via <literal>`</literal>: <literal>V-1`volt_:B</literal>.
        Unfortunately, this approach is not applied consistently. For instance, <foreignphrase>Celsius</foreignphrase>
        uses the target lemma directly instead of a reference to it: <literal>Celsius_:B</literal>.</para>
      
      <para>Units named after men (<foreignphrase>V - volt, A - ampér</foreignphrase>, etc.)
        have the masculine <emphasis>inanimate</emphasis> gender.
        However, units using degrees (<foreignphrase>C, F</foreignphrase>) have the masculine
        <emphasis>animate</emphasis> gender, because the word
        <foreignphrase>stupeň</foreignphrase> (degree) is always present (even if omitted in the written text),
        so the abbreviation stands for the person's name.
        Absolute temperature uses the unit called <foreignphrase>Kelvin (K)</foreignphrase>,
        not <foreignphrase>degree of Kelvin</foreignphrase>. Therefore
        the unit has the masculine inanimate gender.
        An author may erroneously treat it as degrees, but we cannot correct this,
        because the gender of a noun is determined by its lemma, not by its context.</para>
      
      <table><title>Examples of units</title>
        <tgroup cols="2">
          <thead>
            <row>
              <entry><para>Expression</para></entry>
              <entry><para>Annotation of the unit abbreviation</para></entry>
            </row>
          </thead>
          <tbody>
            <row>
              <entry><para><foreignphrase>Ráno byly 3&deg;C.</foreignphrase></para></entry>
              <entry><para><literal>Celsius_:B / NNMXX-----A---8</literal></para></entry>
            </row>
            <row>
              <entry><para><foreignphrase>Ráno byly 3 C.</foreignphrase>
                (read as <foreignphrase>Ráno byly tři stupně Celsia.</foreignphrase>)</para></entry>
              <entry><para><literal>Celsius_:B / NNMXX-----A---8</literal></para></entry>
            </row>
            <row>
              <entry><para><foreignphrase>teplota 5000 K</foreignphrase>
                (read as <foreignphrase>teplota pět tisíc kelvinů</foreignphrase>)</para></entry>
              <entry><para><literal>K-1`kelvin_:B / NNIXX-----A---8</literal></para></entry>
            </row>
          </tbody>
        </tgroup>
      </table>
      
      <para>If the C character is preceded by a character imitating the degree symbol &deg;
        (e.g. -C, o C, O C), that character should be marked as an error. Its form attribute should be "&deg;",
        while its origf attribute retains the original character.<footnote><para>On Czech keyboards, the degree symbol is usually typed as
          Shift+&lt;key to the left of 1&gt;, followed by Space.
          On any keyboard under MS Windows: Alt+0176.</para></footnote>
        The lemma shall be <literal>stupeň_:B</literal>, the tag <literal>NNIXX-----A---8</literal>.</para>
    </sect1>



    <sect1 id="author"><title>Authors' signatures</title>

      <para>Abbreviations of authors' names used in newspapers
        (e.g. <foreignphrase>ber, mas, jst...</foreignphrase>
        in "sentences" like <foreignphrase>PRAHA (ČTK, ber) -</foreignphrase>)
        have the base form of the lemma equal to the word form;
        they are numbered -99 and carry the AddInfo <literal>_:B_;S</literal>. Their tag has a special
        SUBPOS character, <literal>%</literal>. For instance, <foreignphrase>ber</foreignphrase>
        is annotated as <literal>ber-99_:B_;S / N%XXX-----A---8</literal>.
        Again, no program should rely on the number always being 99.</para>
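      <para>Since the lemma number must not be relied upon, a program that needs to find
        authors' signatures can look at the tag instead. The following is a minimal sketch;
        the function name <literal>is_author_signature</literal> is hypothetical and the
        check assumes the 15-character tags shown in this manual.</para>
      <programlisting><![CDATA[
```python
def is_author_signature(tag: str) -> bool:
    """Return True if the tag marks an author's signature.

    Only the first two tag positions are inspected: POS must be "N"
    and SUBPOS must be "%". The lemma number (currently 99) is
    deliberately not consulted.
    """
    return len(tag) == 15 and tag[0] == "N" and tag[1] == "%"
```
]]></programlisting>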
    </sect1>



    <sect1 id="academic"><title>Academic titles</title>

      <para>The morphological analyzer currently distinguishes genders in titles,
        generating one lemma for men and another for women
        (<literal>JUDr-1_:B_^(doktor_práv) / NNMXX-----A---8</literal> vs.
         <literal>JUDr-2_:B_^(doktorka_práv) / NNFXX-----A---8</literal>).
        It is possible that the lemmas will be merged in the future,
        using an indefinite gender:
        <literal>JUDr_:B_^(doktor_práv) / NNXXX-----A---8</literal>.</para>

    </sect1>

  </chapter>
  <!-- the end of Abbreviations chapter -->





  <chapter id="colloquial"><title>Colloquial Czech</title>

    <para>The annotation should distinguish between colloquial lemmas
      (e.g. <foreignphrase>Rusák</foreignphrase> (Russian) instead of
      the standard <foreignphrase>Rus</foreignphrase>)
      and colloquial forms of standard lemmas (e.g. <foreignphrase>zelenej</foreignphrase>
      (green) instead of the standard <foreignphrase>zelený</foreignphrase>).
      The former should be marked in the AddInfo of the lemma
      (<literal>Rusák_;E_,h</literal>), the latter should be indicated
      by the VAR field of the morphological tag. The values of <literal>6</literal>,
      <literal>5</literal>, <literal>7</literal>, and sometimes also <literal>3</literal>
      may be applicable; in the most common cases, <literal>6</literal> is used
      (<literal>zelený / AAIS1----1A---6</literal>). See also <xref linkend="variant"/>.</para>
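    <para>The distinction between the two kinds of colloquiality can be sketched in code.
      The following is only an illustration under stated assumptions: the function name
      <literal>colloquial_kinds</literal> is hypothetical, the <literal>_,h</literal> flag is
      taken as the lemma-level marker, VAR is read from the last (15th) tag position, and only
      the values <literal>5</literal>, <literal>6</literal> and <literal>7</literal> are
      treated as colloquial (<literal>3</literal> is left out because it applies only sometimes).</para>
    <programlisting><![CDATA[
```python
# VAR values treated as colloquial in this sketch (an assumption;
# the value 3 applies only sometimes and is therefore not included).
COLLOQUIAL_VAR = {"5", "6", "7"}


def colloquial_kinds(lemma: str, tag: str) -> set:
    """Report where colloquiality is marked: in the lemma, the form, or both."""
    kinds = set()
    if "_,h" in lemma:
        # Colloquial lemma, marked in the AddInfo.
        kinds.add("lemma")
    if len(tag) == 15 and tag[14] in COLLOQUIAL_VAR:
        # Colloquial form of a lemma, marked in the VAR position.
        kinds.add("form")
    return kinds
```
]]></programlisting>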



    <sect1 id="cos"><title><foreignphrase>Cos, kdys, jaks...</foreignphrase></title>
    
      <para>A set of Czech words can take the suffix <foreignphrase>-s</foreignphrase>,
        which represents the elided auxiliary verb <foreignphrase>jsi</foreignphrase> (2<superscript>nd</superscript>
        person). For instance, <foreignphrase><quote>To je dobře, že jsi přišel.</quote></foreignphrase>
        (<quote>It is good that you came.</quote>)
        can be shortened to <foreignphrase><quote>To je dobře, žes přišel.</quote></foreignphrase></para>
      
      <para>These words are only slightly colloquial, if at all. Moreover, the reflexive
        pronouns <foreignphrase>ses, sis</foreignphrase> were constructed the same way
        but are perfectly standard, while the alternative <foreignphrase>jsi se, jsi si</foreignphrase>
        is poor style. <foreignphrase>ses</foreignphrase> is distinguished from
        <foreignphrase>se</foreignphrase> by the 2<superscript>nd</superscript> person
        and by the singular number
        in the tag (<literal>P7-S4--2-------</literal> vs. <literal>P7-X4----------</literal>).
        Similarly, <foreignphrase>kdos</foreignphrase> is tagged <literal>PKM-1--2-------</literal>
        while <foreignphrase>kdo</foreignphrase> (who) is tagged <literal>PKM-1----------</literal>.
        <foreignphrase>žes</foreignphrase> is tagged <literal>J,-S---2-------</literal>
        while <foreignphrase>že</foreignphrase> (that) is tagged <literal>J,-------------</literal>.
        It is questionable whether it is a good solution to let tags of various
        classes sometimes indicate the person and sometimes not.
        Nevertheless, the current morphological analyzer behaves this way, and the approach
        should be extended to words not covered by the analyzer (e.g. <foreignphrase>cos</foreignphrase>,
        <foreignphrase>kdys</foreignphrase>).</para>
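      <para>The tag pairs above differ only in a few positions, which a position-by-position
        comparison makes explicit. This is a minimal sketch: the function name
        <literal>diff_tags</literal> is hypothetical, and the position names follow the usual
        naming of the 15 positions of the PDT tag, which is assumed here rather than defined
        in this section.</para>
      <programlisting><![CDATA[
```python
# Names of the 15 tag positions (assumed standard PDT naming).
POSITIONS = (
    "POS", "SUBPOS", "GENDER", "NUMBER", "CASE",
    "POSSGENDER", "POSSNUMBER", "PERSON", "TENSE", "GRADE",
    "NEGATION", "VOICE", "RESERVE1", "RESERVE2", "VAR",
)


def diff_tags(tag_a: str, tag_b: str) -> dict:
    """Map each differing position name to its (tag_a, tag_b) values."""
    if len(tag_a) != len(POSITIONS) or len(tag_b) != len(POSITIONS):
        raise ValueError("both tags must have exactly 15 positions")
    return {
        name: (a, b)
        for name, a, b in zip(POSITIONS, tag_a, tag_b)
        if a != b
    }
```
]]></programlisting>
      <para>For <foreignphrase>ses</foreignphrase> vs. <foreignphrase>se</foreignphrase>,
        the comparison reports the NUMBER and PERSON positions; for
        <foreignphrase>kdos</foreignphrase> vs. <foreignphrase>kdo</foreignphrase>,
        only PERSON.</para>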
    
    </sect1>
    
    
    
    <sect1 id="eee"><title>Suffix <foreignphrase>-é</foreignphrase> in plural of neuter</title>
    
      <para>It is officially ungrammatical to say <foreignphrase>*malé koťata</foreignphrase>
        instead of <foreignphrase>malá koťata</foreignphrase>. However, the number of people
        making this error is constantly growing.</para>
      
      <para>The phenomenon should not be treated as misspelling.
        It should be annotated as a colloquial variant of the
        official <foreignphrase>-á</foreignphrase> form
        (<literal>VAR = 5</literal>).</para>
    
    </sect1>
    
    <table><title>Colloquial examples</title>
      <tgroup cols="2">
        <thead>
          <row>
            <entry><para>Expression</para></entry>
            <entry><para>Annotation</para></entry>
          </row>
        </thead>
        <tbody>
          <row>
            <entry><para><foreignphrase>koťata, které</foreignphrase></para></entry>
            <entry><para><literal>který / P4NP4---------5</literal></para></entry>
          </row>
          <row>
            <entry><para><foreignphrase>Novákovic pes</foreignphrase></para></entry>
            <entry><para><literal>Novákův_;S_^(*2) / AUXXXM--------6</literal>
            (In PDT 2.0, it is sometimes tagged with the obsolete <literal>AUMS1M--------6</literal>.
            If the tag system allowed such tags,
            <literal>AUXXXXP-------6</literal> might be even more appropriate.)</para></entry>
          </row>
          <row>
            <entry><para><foreignphrase>takovejhlema</foreignphrase></para></entry>
            <entry><para><literal>takovýhle / PDFD7---------6</literal>
            (The correct but rarely used form is <foreignphrase>takovýmahle</foreignphrase>.)</para></entry>
          </row>
          <row>
            <entry><para><foreignphrase>hovadinama</foreignphrase></para></entry>
            <entry><para><literal>hovadina_,h / NNFP7-----A---6</literal>
            (Both the lemma and the suffix are colloquial.
            The current morphological analyzer does not mark the lemma
            as colloquial, but it should.)</para></entry>
          </row>
          <row>
            <entry><para><foreignphrase>pro naší atletiku</foreignphrase></para></entry>
            <entry><para><literal>můj_^(přivlast.) / PSFS4-P1------6</literal>
            (The correct accusative form is <foreignphrase>naši</foreignphrase>,
            with a short <foreignphrase>-i</foreignphrase>.)</para></entry>
          </row>
        </tbody>
      </tgroup>
    </table>
  </chapter>
  <!-- the end of Colloquial Czech chapter -->





  <chapter id="foreign"><title>Foreign words and phrases</title>
  
    <para>Foreign words enter Czech texts in three different ways:</para>
    <formalpara><title>Citation use</title>
      <para>Whole phrases in foreign languages can be inserted into Czech
        texts as citations. Besides real citations of something someone
        said or wrote, names of songs and other works also belong to this
        category. If a foreign verb is present, the phrase is most probably
        a citation use. Single words can be cited as well, but the rule
        is that a word in a cited phrase never takes Czech suffixes.</para>
    </formalpara>
    <formalpara><title>Word use</title>
      <para>Single words or short phrases (usually noun phrases)
        that supply a term. This ought to be a rather small category.
        If a foreign word does not take Czech suffixes, it might
        be a citation. If it does, the possibility that the word has been
        domesticated should be considered carefully.</para>
    </formalpara>
    <formalpara><title>Domesticated words of foreign origin</title>
      <para>Foreign words constantly enter the Czech language, take Czech
        endings, adopt Czech declension paradigms and become
        normal Czech words. Words that entered Czech long ago are
        no longer felt to be foreign (e.g. <foreignphrase>kakao</foreignphrase>
        (cocoa)). Nevertheless, even newer words should not be treated
        as foreign if they fit into this category. For instance,
        the current morphological analyzer marks <foreignphrase>management</foreignphrase>
        (Czech <foreignphrase>vedení</foreignphrase>, sometimes also spelled
        in the Czechized form <foreignphrase>manažment</foreignphrase>)
        as a foreign word (<literal>management_,t_^(vedení,_manažment;_angl.)</literal>).
        Given how the word is used, the <literal>_,t</literal>
        flag should be omitted.</para>
    </formalpara>
    
    <para>Despite the uncertainty as to whether some words should be marked <literal>_,t</literal>,
      the following rule also applies to domesticated expressions of foreign origin,
      to some names that do not have a Czech equivalent, etc. (e.g. <foreignphrase>Mont
      Blanc</foreignphrase>).</para>
    
    <para>General rule
      <orderedlist>
        <listitem><para>In citations, the original morphology of the
          source language shall be described to the extent possible with respect
          to our tags, and to the annotator's knowledge about the foreign word.</para>
        </listitem>
        <listitem><para>In word usages and domesticated expressions,
          Czech morphology takes precedence. For instance, the above-mentioned
          <foreignphrase>Mont Blanc</foreignphrase> is a noun + adjective
          according to French morphology but <foreignphrase>Blanc</foreignphrase>
          has to be tagged as a noun because the Czech locative of the phrase
          reads <foreignphrase>na Mont Blanku</foreignphrase> (i.e.,
          <foreignphrase>Blanc</foreignphrase> is declined according to
          a noun paradigm). Unless there is such a conflict between the
          original and the Czech morphology, the original part of speech
          shall be preserved.</para>
        </listitem>
      </orderedlist>
    </para>

    <table><title>Examples of foreign phrases</title>
      <tgroup cols="3">
        <thead>
      <row>
        <entry><para>Expression</para></entry>
            <entry><para>Annotation</para></entry>
            <entry><para>Comments</para></entry>
      </row>
    </thead>
        <tbody>
      <row>
        <entry><para><foreignphrase>V&nbsp;kostele zpívala Musica
        Bohemica.</foreignphrase></para></entry>
            <entry><para><literal>musica_,t_^(lat._hudba) /
        NNFS1-----A---- // bohemica_,t_^(lat._česká) /
        NNFS1-----A----</literal></para></entry>
        <entry><para><foreignphrase>Bohemica</foreignphrase> is
        an adjective in Latin but a noun in Czech. It is declined
        according to the Czech noun pattern
        <foreignphrase>žena</foreignphrase>. For the same reason,
        the base form is not converted to masculine
        gender.</para></entry>
      </row>
          <row>
        <entry><para><foreignphrase>To je trochu ad
        hoc.</foreignphrase></para></entry>
            <entry><para><literal>ad_,t / RR--X---------- // hoc_,t /
        NNXXX-----A----</literal></para></entry>
        <entry><para><foreignphrase>hoc</foreignphrase> is an adverb in
        Latin but it is annotated as a noun in Czech.</para></entry>
          </row>
    </tbody>
      </tgroup>
    </table>
    
    
    
    <sect1 id="articles"><title>Articles</title>
    
      <para>Unlike many other languages, Czech has no articles.
        Articles in foreign phrases are annotated as adjectives.</para>
      
      <para>In some languages, articles distinguish gender, number and case.
        By analogy with Czech, their lemma should reflect the masculine
        singular nominative form, while the morphological tag should encode the
        actual word form in the text. However, sometimes this approach is not
        possible due to a different gender or number in Czech:
        <foreignphrase>La Manche</foreignphrase> is feminine in French,
        masculine inanimate in Czech; <foreignphrase>Los Angeles</foreignphrase>
        is plural in Spanish, singular in Czech (and in English).
        There has to be a special lemma for each such frozen article.
        Thus, <foreignphrase>los</foreignphrase> would be annotated
        <literal>el-1_,t_^(šp._člen) / AAMPX----1A----</literal> in
        <foreignphrase><quote>do Prahy přijeli Los Paraguayos</quote></foreignphrase>
        but <literal>los-3_,t_^(šp._člen) / AAXXX----1A----</literal> in
        <foreignphrase><quote>pracuje v&nbsp;Los Angeles</quote></foreignphrase>.</para>
      <note role="suggestion"><para>The separate lemma reflects the fact that
        the word form has been frozen since it was borrowed into other languages.
        However, it might not be needed. Articles are annotated as adjectives,
        and adjectives (unlike nouns) are not required to stick with one gender.</para></note>
      
      <para>Articles merged with a preposition (e.g. French <foreignphrase>du</foreignphrase>,
        Italian <foreignphrase>della</foreignphrase>, German <foreignphrase>aufs, beim,
        vom, zur, im, am...</foreignphrase>) are treated as prepositions.</para>
      
      <table><title>Articles in common foreign languages</title>
        <tgroup cols="4">
          <thead>
            <row>
              <entry><para>Language</para></entry>
              <entry><para>Form</para></entry>
              <entry><para>Lemma</para></entry>
              <entry><para>Tag</para></entry>
            </row>
          </thead>
          <tbody>
            <row>
              <entry><para>English</para></entry>
              <entry><para><foreignphrase>the</foreignphrase></para></entry>
              <entry><para><literal>the-1_,t_^(angl._urč._člen)</literal></para></entry>
              <entry><para><literal>AAXXX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>English</para></entry>
              <entry><para><foreignphrase>a</foreignphrase></para></entry>
              <entry><para><literal>a-2_,t_^(angl._neurč._člen)</literal></para></entry>
              <entry><para><literal>AAXXX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>English</para></entry>
              <entry><para><foreignphrase>an</foreignphrase></para></entry>
              <entry><para><literal>a-2_,t_^(angl._neurč._člen)</literal></para></entry>
              <entry><para><literal>AAXXX----1A---1</literal></para></entry>
            </row>
            <row>
              <entry><para>German</para></entry>
              <entry><para><foreignphrase>der</foreignphrase></para></entry>
              <entry><para><literal>der-1_,t_^(něm._člen)</literal></para></entry>
              <entry><para><literal>AAMS1----1A----</literal>
                           <literal>AAFS2----1A----</literal>
                           <literal>AAFS3----1A----</literal>
                           <literal>AAXP2----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>German</para></entry>
              <entry><para><foreignphrase>die</foreignphrase></para></entry>
              <entry><para><literal>der-1_,t_^(něm._člen)</literal></para></entry>
              <entry><para><literal>AAFS1----1A----</literal>
                           <literal>AAFS4----1A----</literal>
                           <literal>AAXP1----1A----</literal>
                           <literal>AAXP4----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>German</para></entry>
              <entry><para><foreignphrase>das</foreignphrase></para></entry>
              <entry><para><literal>der-1_,t_^(něm._člen)</literal></para></entry>
              <entry><para><literal>AANS1----1A----</literal>
                           <literal>AANS4----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>German</para></entry>
              <entry><para><foreignphrase>des</foreignphrase></para></entry>
              <entry><para><literal>der-1_,t_^(něm._člen)</literal></para></entry>
              <entry><para><literal>AAMS2----1A----</literal>
                           <literal>AANS2----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>German</para></entry>
              <entry><para><foreignphrase>dem</foreignphrase></para></entry>
              <entry><para><literal>der-1_,t_^(něm._člen)</literal></para></entry>
              <entry><para><literal>AAMS3----1A----</literal>
                           <literal>AANS3----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>German</para></entry>
              <entry><para><foreignphrase>den</foreignphrase></para></entry>
              <entry><para><literal>der-1_,t_^(něm._člen)</literal></para></entry>
              <entry><para><literal>AAMS4----1A----</literal>
                           <literal>AAXP3----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>Dutch</para></entry>
              <entry><para><foreignphrase>de</foreignphrase></para></entry>
              <entry><para><literal>de-2_,t_^(niz._člen)</literal></para></entry>
              <entry><para><literal>AAMSX----1A----</literal>
                           <literal>AAFSX----1A----</literal>
                           <literal>AAXPX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>Dutch</para></entry>
              <entry><para><foreignphrase>het</foreignphrase></para></entry>
              <entry><para><literal>de-2_,t_^(niz._člen)</literal></para></entry>
              <entry><para><literal>AANSX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>Dutch</para></entry>
              <entry><para><foreignphrase>den</foreignphrase></para></entry>
              <entry><para><literal>de-2_,t_^(niz._člen)</literal></para></entry>
              <entry><para><literal>AAMS3----1A---5</literal>
                           <literal>AANS3----1A---5</literal></para></entry>
            </row>
            <row>
              <entry><para>French</para></entry>
              <entry><para><foreignphrase>le</foreignphrase></para></entry>
              <entry><para><literal>le-1_,t_^(fr._člen)</literal></para></entry>
              <entry><para><literal>AAMSX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>French</para></entry>
              <entry><para><foreignphrase>la</foreignphrase></para></entry>
              <entry><para><literal>le-1_,t_^(fr._člen)</literal></para></entry>
              <entry><para><literal>AAFSX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>French</para></entry>
              <entry><para><foreignphrase>l</foreignphrase></para></entry>
              <entry><para><literal>le-1_,t_^(fr._člen)</literal></para></entry>
              <entry><para><literal>AAXSX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>French</para></entry>
              <entry><para><foreignphrase>les</foreignphrase></para></entry>
              <entry><para><literal>le-1_,t_^(fr._člen)</literal></para></entry>
              <entry><para><literal>AAXPX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>Italian</para></entry>
              <entry><para><foreignphrase>il</foreignphrase></para></entry>
              <entry><para><literal>il-1_,t_^(it._člen)</literal></para></entry>
              <entry><para><literal>AAMSX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>Italian</para></entry>
              <entry><para><foreignphrase>la</foreignphrase></para></entry>
              <entry><para><literal>il-1_,t_^(it._člen)</literal></para></entry>
              <entry><para><literal>AAFSX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>Italian</para></entry>
              <entry><para><foreignphrase>gli</foreignphrase></para></entry>
              <entry><para><literal>il-1_,t_^(it._člen)</literal></para></entry>
              <entry><para><literal>AAMPX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>Italian</para></entry>
              <entry><para><foreignphrase>le</foreignphrase></para></entry>
              <entry><para><literal>il-1_,t_^(it._člen)</literal></para></entry>
              <entry><para><literal>AAFPX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>Spanish</para></entry>
              <entry><para><foreignphrase>el</foreignphrase></para></entry>
              <entry><para><literal>el-1_,t_^(šp._člen)</literal></para></entry>
              <entry><para><literal>AAMSX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>Spanish</para></entry>
              <entry><para><foreignphrase>la</foreignphrase></para></entry>
              <entry><para><literal>el-1_,t_^(šp._člen)</literal></para></entry>
              <entry><para><literal>AAFSX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>Spanish</para></entry>
              <entry><para><foreignphrase>los</foreignphrase></para></entry>
              <entry><para><literal>el-1_,t_^(šp._člen)</literal></para></entry>
              <entry><para><literal>AAMPX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>Spanish</para></entry>
              <entry><para><foreignphrase>las</foreignphrase></para></entry>
              <entry><para><literal>el-1_,t_^(šp._člen)</literal></para></entry>
              <entry><para><literal>AAFPX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>Portuguese</para></entry>
              <entry><para><foreignphrase>o</foreignphrase></para></entry>
              <entry><para><literal>o-10_,t_^(port._člen)</literal></para></entry>
              <entry><para><literal>AAMSX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>Portuguese</para></entry>
              <entry><para><foreignphrase>a</foreignphrase></para></entry>
              <entry><para><literal>o-10_,t_^(port._člen)</literal></para></entry>
              <entry><para><literal>AAFSX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>Portuguese</para></entry>
              <entry><para><foreignphrase>os</foreignphrase></para></entry>
              <entry><para><literal>o-10_,t_^(port._člen)</literal></para></entry>
              <entry><para><literal>AAMPX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>Portuguese</para></entry>
              <entry><para><foreignphrase>as</foreignphrase></para></entry>
              <entry><para><literal>o-10_,t_^(port._člen)</literal></para></entry>
              <entry><para><literal>AAFPX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>Arabic</para></entry>
              <entry><para><foreignphrase>al, ad, an, ar, as, az</foreignphrase></para></entry>
              <entry><para><literal>al-5_,t_^(arab._člen)</literal></para></entry>
              <entry><para><literal>AAXXX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>Arabic</para></entry>
              <entry><para><foreignphrase>el, ed, en, er, es, ez</foreignphrase></para></entry>
              <entry><para><literal>el-5_,t_^(arab._člen)</literal></para></entry>
              <entry><para><literal>AAXXX----1A----</literal></para></entry>
            </row>
            <row>
              <entry><para>Hebrew</para></entry>
              <entry><para><foreignphrase>ha</foreignphrase></para></entry>
              <entry><para><literal>ha-2_,t_^(hebr._člen)</literal></para></entry>
              <entry><para><literal>AAXXX----1A----</literal></para></entry>
            </row>
          </tbody>
        </tgroup>
      </table>
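      <para>Because one article form can correspond to several tags, the German rows of the
        table above can be read as a lookup from word form to candidate tags, which are then
        narrowed down by whatever is known about the context. The following is a minimal
        sketch; the names <literal>GERMAN_ARTICLE_TAGS</literal> and
        <literal>candidate_tags</literal> are hypothetical and the data merely repeat the table.</para>
      <programlisting><![CDATA[
```python
# Candidate tags per German article form, copied from the table above.
GERMAN_ARTICLE_TAGS = {
    "der": ["AAMS1----1A----", "AAFS2----1A----",
            "AAFS3----1A----", "AAXP2----1A----"],
    "die": ["AAFS1----1A----", "AAFS4----1A----",
            "AAXP1----1A----", "AAXP4----1A----"],
    "das": ["AANS1----1A----", "AANS4----1A----"],
    "des": ["AAMS2----1A----", "AANS2----1A----"],
    "dem": ["AAMS3----1A----", "AANS3----1A----"],
    "den": ["AAMS4----1A----", "AAXP3----1A----"],
}


def candidate_tags(form, gender=None, number=None, case=None):
    """Return the tags listed for a form, filtered by known features.

    gender, number and case are single tag characters (3rd, 4th and
    5th tag position); None means the feature is unknown.
    """
    def keep(tag):
        return ((gender is None or tag[2] == gender)
                and (number is None or tag[3] == number)
                and (case is None or tag[4] == case))

    return [t for t in GERMAN_ARTICLE_TAGS.get(form.lower(), []) if keep(t)]
```
]]></programlisting>
      <para>In every case the lemma stays <literal>der-1_,t_^(něm._člen)</literal>;
        only the tag varies with the form and its context.</para>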

    </sect1>

    
    
    <sect1 id="engnoun"><title>English noun clusters</title>
    
      <para>The original approach taken in PDT was to annotate all attributively used
        nouns as adjectives. That was quite problematic because
        virtually all English nouns can be used as attributes of other nouns,
        while they never take Czech adjectival suffixes in Czech texts. Now
        it is preferred to tag such words as foreign nouns in unknown case.
        In PDT 2.0, such words are still annotated inconsistently.</para>
      
      <note><para>English-like attributive use of nouns has been imported
        to Czech (<foreignphrase>Staropramen Extraliga</foreignphrase>,
        <foreignphrase>Český Telecom Cup</foreignphrase> etc.)</para></note>
    
    </sect1>



    <sect1 id="nouns"><title>Nouns</title>

      <para>English nouns in plural form are usually still perceived as
      plural in Czech. However, terms that were imported in the
      singular are rarely pluralized according to English grammar when
      the surrounding text requires a plural. If a Czech plural ending
      cannot be or is not added, the singular form is used as
      the plural. Therefore, and for the sake of simplicity, all English
      nouns should be annotated with unknown number
      (<literal>X</literal>), unless they have a Czech ending.</para>

      <para>English (and most other non-Slavic) nouns have unknown
      (<literal>X</literal>) case in citations but they can
      sometimes be declined in word use.</para>

      <table><title>Number and case of English nouns</title>
    <tgroup cols="2">
      <thead>
        <row>
          <entry><para>Expression</para></entry>
          <entry><para>Annotation</para></entry>
        </row>
      </thead>
      <tbody>
        <row>
          <entry><para><foreignphrase>oba dva cash flow (oficiální
          i skutečný)</foreignphrase></para></entry>
          <entry><para><literal>flow_,t /
          NNIXX-----A----</literal></para></entry>
        </row>
        <row>
          <entry><para><foreignphrase>v&nbsp;cash flow
          statementu</foreignphrase></para></entry>
          <entry><para><literal>statement_,t /
          NNIS6-----A----</literal></para></entry>
        </row>
        <row>
          <entry><para><foreignphrase>Beatles:
          Girl</foreignphrase></para></entry>
          <entry><para><literal>girl_,t /
          NNFXX-----A----</literal></para></entry>
        </row>
        <row>
          <entry><para><foreignphrase>A teď zahrajeme písničku
          Girls.</foreignphrase></para></entry>
          <entry><para><literal>girl_,t /
          NNFXX-----A----</literal></para></entry>
        </row>
      </tbody>
    </tgroup>
      </table>
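
      <para>The number and case rule illustrated in the table above can
      be sketched in Python. This is only an illustration, not part of
      any PDT tool: the function name
      <literal>english_noun_tag</literal> and its parameters are
      hypothetical, and the fixed positions of the tag are simply copied
      from the tags shown in the table.</para>

      <programlisting><![CDATA[
```python
def english_noun_tag(gender, czech_number=None, czech_case=None):
    """Build a 15-position tag for an English noun.

    gender: gender code as used in the table, e.g. "I" or "F".
    czech_number, czech_case: pass these only when the word form
    carries a Czech ending; otherwise both default to unknown ("X").
    """
    number = czech_number or "X"
    case = czech_case or "X"
    # NN + gender + number + case, then five unused positions,
    # "A" as in the table's tags, and four more unused positions.
    tag = "NN" + gender + number + case + "-" * 5 + "A" + "-" * 4
    assert len(tag) == 15
    return tag


print(english_noun_tag("I"))            # cash flow (no Czech ending)
print(english_noun_tag("I", "S", "6"))  # statementu (Czech ending)
```
]]></programlisting>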

    </sect1>



    <sect1 id="verbs"><title>Verbs</title>

      <sect2 id="englishverbs"><title>English verbs</title>

    <para>The following tags are applied:</para>

    <itemizedlist>
      <listitem>
        <para>Infinitive <foreignphrase>(go)</foreignphrase>:
        <literal>Vf--------A----</literal></para>
      </listitem>
      <listitem>
        <para>Present other than 3<superscript>rd</superscript>
        person singular <foreignphrase>(go)</foreignphrase>:
            <literal>VB-X---XP-AA---</literal></para>
      </listitem>
      <listitem>
        <para>Present 3<superscript>rd</superscript> person
        singular <foreignphrase>(goes)</foreignphrase>:
            <literal>VB-S---3P-AA---</literal></para>
      </listitem>
      <listitem>
        <para>Imperative <foreignphrase>(go)</foreignphrase>:
            <literal>Vi-X---X--A----</literal></para>
      </listitem>
      <listitem>
        <para>Past tense <foreignphrase>(went)</foreignphrase>:
            <literal>Vp-X---XR-AA---</literal></para>
      </listitem>
      <listitem>
        <para>Perfect / passive participle
        <foreignphrase>(gone)</foreignphrase>:
            <literal>Vs-X---XX-AP---</literal></para>
      </listitem>
    </itemizedlist>

    <para>If it is difficult to determine how a base form is used
        (infinitive, present, or imperative), annotate it as an
        infinitive. If it is difficult to decide
        between past tense and passive participle, use past
        tense.</para>
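
    <para>The tag list and the two fallback rules above can be
        summarized in a short Python sketch. It is an illustration
        only: the category labels (<literal>present_3sg</literal>
        etc.) and the function
        <literal>tag_english_verb</literal> are hypothetical names,
        while the tags themselves are copied from the list
        above.</para>

    <programlisting><![CDATA[
```python
# Tags copied from the list above; the dictionary keys are hypothetical labels.
ENGLISH_VERB_TAGS = {
    "infinitive": "Vf--------A----",
    "present": "VB-X---XP-AA---",
    "present_3sg": "VB-S---3P-AA---",
    "imperative": "Vi-X---X--A----",
    "past": "Vp-X---XR-AA---",
    "participle": "Vs-X---XX-AP---",
}

# Readings that share the English base form (e.g. "go").
BASE_FORM_USES = {"infinitive", "present", "imperative"}


def tag_english_verb(candidates):
    """Return the tag for the set of plausible readings of one verb form."""
    candidates = set(candidates)
    if not candidates or not candidates <= set(ENGLISH_VERB_TAGS):
        raise ValueError(f"unknown or empty categories: {sorted(candidates)}")
    if len(candidates) == 1:
        (only,) = candidates
        return ENGLISH_VERB_TAGS[only]
    # Fallback 1: unclear use of a base form -> infinitive.
    if candidates <= BASE_FORM_USES:
        return ENGLISH_VERB_TAGS["infinitive"]
    # Fallback 2: past tense vs. participle -> past tense.
    if candidates == {"past", "participle"}:
        return ENGLISH_VERB_TAGS["past"]
    raise ValueError(f"no fallback rule covers: {sorted(candidates)}")


print(tag_english_verb({"present", "imperative"}))
print(tag_english_verb({"past", "participle"}))
```
]]></programlisting>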

    <table><title>Examples of English verbs</title>
      <tgroup cols="2">
        <thead>
          <row>
        <entry><para>Expression</para></entry>
        <entry><para>Annotation</para></entry>
          </row>
        </thead>
        <tbody>
          <row>
        <entry><para><foreignphrase>to be or not to
        be</foreignphrase></para></entry>
        <entry><para><literal>be_,t_^(angl._být,_v_názvech_apod.)
        / Vf--------A----</literal></para></entry>
          </row>
          <row>
        <entry><para><foreignphrase>Do it right
        now!</foreignphrase></para></entry>
        <entry><para><literal>do-2_,t /
        Vi-X---X--A----</literal></para></entry>
          </row>
        </tbody>
      </tgroup>
    </table>

      </sect2>

    </sect1>



    <sect1 id="slovaklang"><title>Slavic languages and Czech
    dialects</title>

      <para>Slavic languages (most prominently Slovak) are related to
      Czech. Citations may contain words that are identical to their
      Czech counterparts.</para>

      <para>When a word has a foreign suffix, it must be annotated as a
      foreign word even if its base form is identical to the Czech one.</para>

      <para>If all words in a phrase are identical in their forms and
      meanings to Czech, the phrase should be annotated as Czech, even
      if we know that it is in fact Slovak or another language. For
      instance, if a Slovak
      song was named <foreignphrase>Drahý otec</foreignphrase>, there
      is no need to annotate it as foreign. However, if even a single word
      does not fit Czech grammar or vocabulary, it is best
      to annotate
      the whole citation as foreign. It would be strange if a
      "Czech" word intervened in the middle of a foreign
      phrase. Nevertheless, this is not always observed in PDT 2.0.</para>

      <para>Examples: <foreignphrase>ulica kapitána
      Nálepku</foreignphrase> - <literal>Nálepka_;S_,t /
      NNMS2-----A----</literal>; <foreignphrase>ste
      v&nbsp;Bratislave</foreignphrase> - <literal>byť_,t /
      VB-P---2P-AA--- // v-2_,t / RR--6---------- //
      Bratislava-2_;G_,t / NNFS6-----A----</literal></para>

      <para>Sometimes a Slovak-like phrase is in fact just a Moravian
      dialect of Czech, as in <foreignphrase>Slovácko sa
      nesúdí</foreignphrase>. The lemmas should be flagged
      <literal>_,n</literal> instead of <literal>_,t</literal> in such
      cases.</para>

    </sect1>

  </chapter>
  <!-- the end of Foreign Words and Phrases chapter
  -->





  <chapter id="errors7"><title>Errors</title>

    <para>Sometimes the author of a PDT 2.0 text uses a word
      incorrectly, e.g. a woman's name as a man's name. In
      such cases, the actual usage should be annotated, not the
      usage that would have been correct.</para>

    <para>The texts can contain errors. It is reasonable to correct
      some of them (but the original, erroneous word form should
      always be preserved in the <varname>origf</varname>
      attribute). However, only low-level errors (spelling and
      morphology) should be corrected. We do not want to correct
      Engels' text into Heidegger's. Never replace a colloquial form
      with an official one (e.g. *<foreignphrase>zelené
      města</foreignphrase> &rarr; <foreignphrase>zelená
      města</foreignphrase>, *<foreignphrase>bez noh</foreignphrase>
      &rarr; <foreignphrase>bez nohou</foreignphrase>), even if the
      analyzer does not know the form<footnote><para>You have to
      insert a new lemma and/or tag - see <xref linkend="insertion"/>
      for more details.</para></footnote>.</para>

    <sect1 id="characters7"><title>Characters</title>

      <para>If the author of the text misspelled a foreign name
      (e.g. converted a non-Czech character to a Czech one, say
      <foreignphrase>Milošević</foreignphrase> to
      <foreignphrase>Miloševič</foreignphrase>), it is a low-level
      error that should be corrected.</para>

      <para>Sometimes foreign characters have been garbled
        (e.g. Fran?oise), which may not only produce an unknown word
        but also mislead the tokenizer, resulting in three tokens. Since
        most work until the release of PDT 2.0 has been done in the
        ISO Latin 2 encoding, there is a problem with letters not
        contained in Latin 2. HTML entities should be used, but the
        corresponding accent-free character is also acceptable.</para>

    </sect1>

    <sect1 id="separators"><title>Separators</title>

      <para>Sometimes, the text contains
      <foreignphrase>o</foreignphrase> or
      <foreignphrase>I</foreignphrase> in place of bullets or
      separators. <foreignphrase>o</foreignphrase> should be annotated
      <literal>o-4_^(graf._oddělovač) /
      Z:-------------</literal>.</para>

    </sect1>

  </chapter>
  <!-- the end of Errors chapter -->





  <chapter id="hardtodecide"><title>Hard to decide</title>

    <sect1 id="az"><title>až</title>
      <itemizedlist spacing="compact" type="vert">
    <listitem><wordasword><emphasis role="bold">až-1 +
    J^</emphasis></wordasword></listitem>
    <listitem><wordasword>2 až 3  (but not od 2 až do 3 - see
    až-3)</wordasword></listitem>
    <listitem><wordasword>nabízí přiblížení až
    přijetí</wordasword></listitem>
    <listitem><wordasword><emphasis role="bold">až-2 +
    J,</emphasis></wordasword></listitem>
    <listitem><wordasword>tak .. až: Nabízí se tak okatě, až je to
    hanba.</wordasword></listitem>
    <listitem><wordasword>.. začnou pochybovat, až nakonec uvěří, že
    ..</wordasword></listitem>
    <listitem><wordasword>?? Bylo mi 24, a byl jsem plný touhy se
    pomstít. Až jsem se ocitl před člověkem, který dostal zabrat víc
    než já.  <remark>## Does this belong here?
    Why?</remark></wordasword></listitem>
    <listitem><emphasis role="bold">až-3 + Db</emphasis></listitem>
      </itemizedlist>
      <para>If omitted, the sentence stays grammatical. It is often
      possible to replace it by teprve.</para>
      <itemizedlist spacing="compact" type="vert">
    <listitem><wordasword>Dostanete až 250 mil
    zdarma.</wordasword></listitem>
    <listitem><wordasword>kam až: Kam až
    půjdeš?</wordasword></listitem>
    <listitem><wordasword>Až on mě přesvědčil, že tomu tak
    bude.</wordasword></listitem>
      </itemizedlist>
      <para>Modifies a function word (should probably be TT)</para>
      <itemizedlist spacing="compact" type="vert">
    <listitem>až + conj: Je geolog a až pak filozof</listitem>
    <listitem>až + prep: z Brna až do Prahy (Cf. až-1)</listitem>
      </itemizedlist>
    </sect1>

    <sect1 id="jak"><title>jak</title>
      <itemizedlist spacing="compact">
    <listitem><wordasword>jak-1_;L_^(živočich) + NNMnc-----A----
        Obvious.</wordasword></listitem>
        <listitem><wordasword>jak-2 + J,</wordasword></listitem>
      </itemizedlist>
      <orderedlist>
    <listitem>
      <para>Meaning že (<remark>cannot be replaced by
      jakpak</remark>)</para>
      <itemizedlist spacing="compact" type="vert">
        <listitem>Jak řekl M. Zeman, bude třeba ..</listitem>
        <listitem>Jak ukazuje vývoj posledních let, je to
        ..</listitem>
        <listitem>Jak známo, ...</listitem>
        <listitem>Skutečnost, jak už to bývá, byla trochu
        jiná.</listitem>
      </itemizedlist>
      <para>However, in rare cases it can be Db, depending on the
      interpretation:</para>
      <itemizedlist spacing="compact" type="vert">
        <listitem>Viděl, jak upadla.</listitem>
        <listitem>Meaning Viděl, že upadla. - J,</listitem>
        <listitem>Meaning Viděl, jakým způsobem upadla. - Db</listitem>
        <listitem>Kamera zabírá poslance, jak otvírají krabici</listitem>
      </itemizedlist>
    </listitem>
    <listitem>
      <para>
            <itemizedlist spacing="compact" type="vert">
          <listitem>Time, meaning když, až, jakmile</listitem>
          <listitem>Přijdu, (hned) jak budu
          hotov<superscript>ssč</superscript>.</listitem>
          <listitem>Hned jak budu moct, zavolám.</listitem>
        </itemizedlist>
          </para>
    </listitem>
    <listitem>
      <para>
            <itemizedlist spacing="compact" type="vert">
          <listitem>In comparison, meaning než, jako:</listitem>
          <listitem>Byl větší jak
          on<superscript>ssč</superscript></listitem>
          <listitem>rychlý jak
          vítr<superscript>ssč</superscript></listitem>
        </itemizedlist>
          </para>
    </listitem>
    <listitem>
      <para>
            <itemizedlist spacing="compact" type="vert">
          <listitem>Condition (coll.), having the meaning jestliže,
          když</listitem>
          <listitem>Jak budeš zlobit, nepůjdeš
          nikam<superscript>ssč</superscript></listitem>
        </itemizedlist>
          </para>
    </listitem>
      </orderedlist>
      <blockquote role="remark">
    <para>These probably belong here, but in which category?</para>
    <itemizedlist spacing="compact" type="vert">
      <listitem><wordasword>Japonskému turistovi upadla lžička, jak
      chtěl zmáčknout spoušť foťáku.</wordasword></listitem>
      <listitem><wordasword>Poslední šancí, jak se probojovat do
      finále, bude ...</wordasword></listitem>
      <listitem><wordasword>Stát to měl spravovat zvláštním
      ministerstvem (jak je tomu
      např. v&nbsp;Rakousku)</wordasword></listitem>
    </itemizedlist>
      </blockquote>
      <itemizedlist spacing="compact" type="vert">
    <listitem><emphasis role="bold">jak-2 + J^ </emphasis></listitem>
    <listitem>In the phrase jak ... tak ..., having the meaning of
    i ... i. However, cf. jak-3, item 2.</listitem>
    <listitem>Byli tam jak odborníci, tak amatéři.</listitem>
    <listitem><emphasis role="bold">jak-3 + Db</emphasis></listitem>
    <listitem>Pronominal adverb</listitem>
      </itemizedlist>
      <orderedlist>
    <listitem>
      <para>
            <itemizedlist spacing="compact" type="vert">
          <listitem>Interrogative - manner or extent (expr. jak
          pak).</listitem>
          <listitem><wordasword>Jak se
          jmenuješ?</wordasword></listitem>
          <listitem><wordasword>Jak je to
          možné?</wordasword></listitem>
          <listitem>Sometimes expressing a large extent (often in
          exclamations).</listitem>
          <listitem><wordasword>Jak ten čas
          letí<superscript>ssč</superscript></wordasword></listitem>
          <listitem><wordasword>Jak (pak) by
          ne<superscript>ssč</superscript>. Japa by
          ne.</wordasword></listitem>
          <listitem><wordasword>Líbí se ti to? - A
          jak!</wordasword></listitem>
        </itemizedlist>
          </para>
    </listitem>
    <listitem>
      <para>
            <itemizedlist spacing="compact" type="vert">
          <listitem>Relative - marks a subordinate adverbial clause
          (mostly a clause of manner expressing comparison, often with
          tak - however, cf. jak-2 + J^)</listitem>
          <listitem><wordasword>Jak řekli, tak
          udělali<superscript>ssč</superscript></wordasword></listitem>
          <listitem><wordasword>tak dlouho, jak je možné  (tak ..,
          jak ..)</wordasword></listitem>
          <listitem><wordasword>Jak si kdo ustele, tak si
          lehne</wordasword></listitem>
        </itemizedlist>
          </para>
    </listitem>
    <listitem>
      <para>
            <itemizedlist spacing="compact" type="vert">
          <listitem>Relative (coll.) - meaning co, který</listitem>
          <listitem>ten člověk, jak jsem ti o něm
          říkal<superscript>ssč</superscript></listitem>
        </itemizedlist>
          </para>
    </listitem>
    <listitem>
      <para>
            <itemizedlist spacing="compact" type="vert">
          <listitem>Indefinite</listitem>
          <listitem>buď jak buď (the verb is repeated)</listitem>
          <listitem>jak kdo, jak kde, jak kdy, etc. - <remark>Does this
          belong here?</remark></listitem>
        </itemizedlist>
          </para>
    </listitem>
      </orderedlist>
      <blockquote role="remark">
    <para>Where do these belong? They are probably Db, but why?</para>
    <itemizedlist spacing="compact" type="vert">
      <listitem><wordasword>Jak se kůže sama obnovuje, postupně
          vylučuje ..</wordasword></listitem>
      <listitem><wordasword>?? Jak jsem chodil o berlích, tak jsem
          si zničil i druhé koleno.</wordasword></listitem>
    </itemizedlist>
      </blockquote>

    </sect1>



    <sect1 id="malo"><title>málo</title>
      <itemizedlist spacing="compact" type="vert">
    <listitem>Similar to moc.</listitem>
    <listitem><emphasis
    role="bold">málo-1_^(málo_+_2._p.,_málo_peněz) +
    Ca--c----------</emphasis></listitem>
      </itemizedlist>
      <para>It has to be modified (in the shallow syntax) by a noun in
      the genitive. It has only two forms:</para>
      <itemizedlist spacing="compact" type="vert">
    <listitem>málo and mála (only in genitive).</listitem>
    <listitem><wordasword>Máme málo zájemců.</wordasword></listitem>
    <listitem><wordasword>bez mála peněz</wordasword></listitem>
    <listitem><wordasword>před málo
    lety<superscript>ssč</superscript></wordasword></listitem>
    <listitem><wordasword>Je jen o málo důslednější.</wordasword> -
    but <wordasword>Je málo důsledný.</wordasword> is
    <wordasword>málo-3 (Dg)</wordasword></listitem>
      </itemizedlist>
      <blockquote role="remark">
    <itemizedlist spacing="compact" type="vert">
      <listitem><wordasword>Udělal to jako jeden z mála odborníků,
      ..</wordasword></listitem>
      <listitem><wordasword>Udělal to jako jeden z mála. - ?? not
      modified by anything</wordasword></listitem>
      <listitem>Udělal to jako  jeden z mála, co přišli.</listitem>
    </itemizedlist>
      </blockquote>
      <itemizedlist spacing="compact" type="vert">
    <listitem><emphasis role="bold">málo-2_^(př._to_málo_co_měl) +
    NNNnc-----A----</emphasis></listitem>
    <listitem><wordasword>vystačit s
    málem<superscript>ssč</superscript></wordasword></listitem>
    <listitem><wordasword>vařit z
    mála<superscript>ssč</superscript></wordasword></listitem>
    <listitem><wordasword>Děkuji. - Za
    málo. <superscript>ssč</superscript></wordasword></listitem>
    <listitem><emphasis
    role="bold">málo-3_^(málo_+_příd._jm.,_př._byl_málo_důsledný)
    + Dg-------dA----</emphasis></listitem>
    <listitem><wordasword>Málo mluví, hodně
    dělá.<superscript>ssč</superscript></wordasword></listitem>
    <listitem><wordasword>Je málo důsledný.</wordasword></listitem>
    <listitem><wordasword>Ve srovnání s loňskou sezónou je to velmi
    málo. - you can say méně.</wordasword></listitem>
    <listitem><wordasword>Zdržím se jen
    málo<superscript>ssč</superscript>.</wordasword></listitem>
      </itemizedlist>
    </sect1>

    <sect1 id="moc"><title>moc</title>
      <itemizedlist spacing="compact" type="vert">
    <listitem>Similar to málo.</listitem>
    <listitem><emphasis
    role="bold">moc-1_^(nad_někým;_politická,_vojenská;_plná,...)</emphasis></listitem>
    <listitem>Obvious.</listitem>
    <listitem><wordasword>převzít moc</wordasword></listitem>
    <listitem><wordasword>moc proletariátu</wordasword></listitem>
    <listitem><wordasword>udělám, co je v mé moci</wordasword></listitem>
    <listitem><wordasword>mermo mocí</wordasword></listitem>
    <listitem><emphasis
    role="bold">moc-2_^(mnoho_něčeho_[se_subst._v_gen.]) +
    <literal>Ca--X----------</literal></emphasis></listitem>
    <listitem>Cannot be replaced by velmi. Can mean příliš, but is
    more colloquial. It has to be modified (in the shallow syntax)
    by a noun in genitive.</listitem>
    <listitem><wordasword>Má moc peněz.</wordasword></listitem>
    <listitem><wordasword>Všeho moc škodí.</wordasword></listitem>
    <listitem><emphasis
    role="bold">moc-3_^(velmi,_ve_spojení_s_adj.,_př._moc_hezká) +
    Db</emphasis></listitem>
    <listitem>Can be replaced by velmi (except in ellipses). Modifies
    an adjective, adverb or verb.</listitem>
    <listitem><wordasword>Je moc hezká.</wordasword></listitem>
    <listitem><wordasword>Vím to moc dobře.</wordasword></listitem>
    <listitem><wordasword>Moc se snažil.</wordasword></listitem>
    <listitem>Ve srovnání s loňskem je to moc. - ellipsis.</listitem>
      </itemizedlist>
    </sect1>

    <sect1 id="proto"><title>proto</title>
      <itemizedlist spacing="compact" type="vert">
    <listitem><emphasis
    role="bold">proto-1_^(proto;_a_proto,_ale_proto,...) +
    J^</emphasis></listitem>
    <listitem>Coordinative conjunction expressing consequence
    (implication). Structure: reason &rarr;
    consequence. Replaceable by tedy. Usually a proto or a
    ... proto</listitem>
    <listitem><wordasword>Nesplnil úkol, (a) proto nedostal
    odměnu.</wordasword></listitem>
    <listitem><wordasword>Každé proč má své
    proto.</wordasword></listitem>
    <listitem><wordasword>Německo se začalo dusit, a rozhodlo se
    proto omezit ...</wordasword></listitem>
    <listitem>At the beginning of a sentence, without a (which is implicit there)</listitem>
    <listitem><emphasis
    role="bold">proto-2_^(dal_mu_co_proto,_tak_proto!) +
    Db</emphasis></listitem>
    <listitem>Pronominal adverb. Refers to the subordinate clause.
    Structure: what &rarr; reason</listitem>
    <listitem><wordasword>proto, že: Udělal to proto, že
    musel.</wordasword></listitem>
    <listitem><wordasword>Udělal to proto, aby/že mu
    pomohl.</wordasword></listitem>
    <listitem><wordasword>co proto: dát někomu co proto; dostat co
    proto</wordasword></listitem>
    <listitem>no proto: Říkal, že tam přece jen půjde - No proto!
    (Sometimes classified as a modal particle)</listitem>
      </itemizedlist>
    </sect1>

    <sect1 id="svuj"><title>svůj</title>
      <itemizedlist spacing="compact" type="vert">
    <listitem><emphasis role="bold">svůj-1_^(přivlast.) +
        P8gnc---------v</emphasis></listitem>
    <listitem>Obvious.</listitem>
    <listitem><emphasis role="bold">svůj-2_^(být_svůj) +
        AOgn----------v</emphasis></listitem>
    <listitem><remark>Problem with tags, analyzer probably needs
        update.</remark></listitem>
    <listitem><wordasword>Vzít za své.</wordasword></listitem>
    <listitem><wordasword>Víme své. Víme svoje.</wordasword></listitem>
      </itemizedlist>
    </sect1>

    <sect1 id="tak"><title>tak</title>
      <para>In general:
        <itemizedlist>
      <listitem>
        <para>replaceable by a proto &rArr; J^</para>
      </listitem>
      <listitem>
        <para>replaceable by tím způsobem, stejně, zrovna &rArr; Db</para>
      </listitem>
    </itemizedlist>
      </para>
      <para><emphasis role="bold">tak-2 + J^</emphasis></para>
      <para>Coordinative conjunction. If one of the clauses is
      subordinate, tak has the meaning of an adverb (cf. Bál se,
      tak si pískal. - J^ vs. Kdyby se bál, tak si pískal - Db).</para>
      <orderedlist>
    <listitem>
      <para><remark>## Consecutive</remark> - meaning (a) proto, tedy
            <itemizedlist spacing="compact" type="vert">
          <listitem><wordasword>Bál se, (a) tak si
               pískal.<superscript>ssč</superscript></wordasword></listitem>
          <listitem><wordasword>Neudělali..., příspěvek tak budou
               muset vrátit.</wordasword></listitem>
          <listitem><wordasword>Byly zakázané, a tak
               přitahovaly</wordasword></listitem>
          <listitem><wordasword>Zmizí bariéry, a tak bude možné
               využívat ..</wordasword></listitem>
          <listitem><wordasword>Zpozdila se, a tak musela
               běžet.</wordasword></listitem>
          <listitem><wordasword>Jsou profíci, tak ať se podle toho
               zařídí.</wordasword></listitem>
          <listitem><wordasword>Počítá se s tím, že některé se
               sloučí, i tak bude třeba ..</wordasword></listitem>
        </itemizedlist>
          </para>
    </listitem>
    <listitem>
      <para><remark>## Copulative</remark> - in jak - tak</para>
    </listitem>
      </orderedlist>

      <simpara><emphasis role="bold">tak-3 + Db</emphasis></simpara>

      <orderedlist>
    <listitem>
      <para>
            <itemizedlist spacing="compact" type="vert">
          <listitem>Referring to something known, to another sentence,
              etc.</listitem>
          <listitem><wordasword>tak - jak: Bylo to tak, jak jsem
              myslel.<superscript>ssč</superscript></wordasword></listitem>
          <listitem><wordasword>jak - tak: Jak řekli, tak
              udělali.</wordasword></listitem>
          <listitem><wordasword>Přesně tak.</wordasword></listitem>
          <listitem><wordasword>tak zvaný</wordasword></listitem>
          <listitem><wordasword>Ať je to tak nebo tak
              ...<superscript>ssč</superscript></wordasword></listitem>
          <listitem><wordasword>jen tak: Udělal to jen
              tak.</wordasword></listitem>
          <listitem><wordasword>tak tak: Stihl to (jen) tak
              tak.</wordasword></listitem>
          <listitem><wordasword>&gt; to: Stalo se tak při
              ..</wordasword></listitem>
          <listitem><wordasword>Tak se tehdy
              žilo<superscript>ssč</superscript></wordasword></listitem>
          <listitem>Sub-Clause, tak Main-clause: </listitem>
          <listitem><wordasword>Když - tak:   Když jsem počítal já,
              tak mi vyšlo velké číslo.</wordasword></listitem>
          <listitem><wordasword>Pokud - tak:  Pokud to není
              diskriminace, tak nevidím důvod ..</wordasword></listitem>
          <listitem><wordasword>Dokud se člověk raduje, tak je život
              pěkný.</wordasword></listitem>
          <listitem><wordasword>Kdyby - tak:    Kdyby/Pokud by se
              bál, tak by si pískal.</wordasword></listitem>
          <listitem>(Cf. Bál se, tak si pískal. - J^)</listitem>
        </itemizedlist>
          </para>
    </listitem>
    <listitem>
      <para>
            <itemizedlist spacing="compact" type="vert">
          <listitem>Expressing amount (usually large) of a property,
              etc.</listitem> 
          <listitem><wordasword>Kam tak
              rychle?<superscript>ssč</superscript></wordasword></listitem>
          <listitem><wordasword>tak jako: Je tak velký jako
              já.</wordasword></listitem>
          <listitem><wordasword>Zmizel z povědomí tak jako jeho
              pomník;</wordasword></listitem>
          <listitem><wordasword>Nabízí se tak okatě, až je to
              hanba.</wordasword></listitem>
          <listitem><wordasword>To je ale tak daleko.</wordasword></listitem>
          <listitem><wordasword>tak vysoká; tak oslaben, že
              ...</wordasword></listitem>
          <listitem><wordasword>Buďte tak
              laskav.<superscript>ssč</superscript></wordasword></listitem>
          <listitem><wordasword>ani tak o ..., jako o ...: Nejde ani
              tak o mzdu, jako o ...</wordasword></listitem>
          <listitem><wordasword>&gt; přibližně: Dostane se na burzu
              asi tak třetí den od ..</wordasword></listitem>
          <listitem><wordasword>hned tak: Hned tak
              nepřijde. (koneckonců)</wordasword></listitem>
          <listitem><wordasword>odmítá to, stejně tak jako
              ...</wordasword></listitem>
          <listitem><wordasword>.. a zrovna tak
              hyzdit;</wordasword></listitem>
          <listitem><wordasword>tak jako tak</wordasword></listitem>
        </itemizedlist>
          </para>
    </listitem>
      </orderedlist> 

    </sect1>

  </chapter>
  <!-- the end of Hard to decide chapter -->





  <chapter><title>Selected words</title>

    <formalpara><title><foreignphrase>strana</foreignphrase></title>
      <para><foreignphrase>na jedné straně ..., na druhé straně
      ...</foreignphrase>: <literal>druhý-1_^(jiný)
      strana-1_^(v_prostoru)</literal></para>
      <para><foreignphrase>nerespektované ze strany
      Israele</foreignphrase>:
      <literal>strana-3_^(u_soudu,_na_úřadě,_smluvní_strany;_na_něčí_straně)</literal></para>
    </formalpara>

    <formalpara><title><foreignphrase>stát</foreignphrase></title>
      <para><foreignphrase>stane se ministrem</foreignphrase>:
      <literal>stát-2_^(něco_se_přihodilo)</literal></para>
    </formalpara>

    <formalpara><title><foreignphrase>sto</foreignphrase></title>
      <para><foreignphrase>být sto něco udělat</foreignphrase>:
      <literal>sto-3_^(být_sto) / TT-------------</literal></para>
    </formalpara>

    <formalpara><title><foreignphrase>vážit</foreignphrase></title>
      <para><foreignphrase>vážit cestu</foreignphrase>:
      <literal>vážit-1_:T_^(na_váze)</literal> (similar to
      <foreignphrase>zvažovat něco</foreignphrase>). Besides that, the
      only other possibility would be
      <literal>vážit-2_:T_^(ctít_si_někoho)</literal>, but that verb is
      reflexive.</para>
    </formalpara>

    <formalpara><title><foreignphrase>vedení</foreignphrase></title>
      <para>One of the lemma groups for which the morphological
      analyzer currently violates the rule that each lemma should be
      numbered. There are two variants, one unnumbered, and the other
      <literal>vedení-1_^(*7ést-1)</literal>. The unnumbered lemma is
      used only for <foreignphrase>elektrické vedení</foreignphrase>
      and similar uses. Otherwise the numbered variant should be
      assigned, including but not limited to: <foreignphrase>pod
      vedením kamarádky, vedení podniku, čínské
      vedení</foreignphrase>.</para>
    </formalpara>
  </chapter>





  <chapter id="datetime"><title>Date and time</title>

    <itemizedlist spacing="compact" type="vert">
      <listitem><foreignphrase>v</foreignphrase> + a day: accusative (4)
      (<foreignphrase>v&nbsp;sobotu,
      v&nbsp;neděli</foreignphrase>)</listitem>
      <listitem><foreignphrase>v</foreignphrase> + a month: locative (6)
      (<foreignphrase>v&nbsp;lednu,
      v&nbsp;září</foreignphrase>)</listitem>
      <listitem><foreignphrase>v</foreignphrase> + an hour: accusative
      (4) (<foreignphrase>ve 4 hodiny,
      v&nbsp;6 hodin</foreignphrase>)</listitem>
      <listitem><foreignphrase>ve dne</foreignphrase>: locative (6) -
      <literal>NNIS6-----A---9</literal> - special kind of locative
      that occurs only in this context
      (<foreignphrase>v&nbsp;noci</foreignphrase> is also in
      locative)</listitem>
      <listitem>month in a date: genitive (2)
      (<foreignphrase>25.&nbsp;září,
      2. října</foreignphrase>)</listitem>
    </itemizedlist>
  </chapter>





  <chapter id="numbers"><title>Numbers, numerals and quantifiers</title>
    
    <simpara>An adjective modifying a quantified expression agrees in
    case with the noun, not the numeral.</simpara>

    <example><title>Case agreement in counted phrases</title>
      <itemizedlist spacing="compact">
    <listitem><wordasword>za</wordasword> (acc)</listitem>
    <listitem><wordasword>těch</wordasword> (gen)</listitem>
    <listitem><wordasword>mizerných</wordasword> (gen)</listitem>
    <listitem><wordasword>deset</wordasword> (acc)</listitem>
    <listitem><wordasword>korun</wordasword> (gen)</listitem>
      </itemizedlist>
    </example>
    
    <formalpara><title><foreignphrase>1x</foreignphrase></title>
      <para>The lemma is identical to the form, e.g. <literal>1x</literal>. The tag is
      <literal>Cv-------------</literal>.</para>
    </formalpara>
    
    <formalpara><title><foreignphrase>4x5</foreignphrase></title>
      <para>It should be tokenized into three tokens,
      e.g. <literal>4</literal>,
      <literal>x-5_^(náhr._symbolu_krát)</literal>, and
      <literal>5</literal>.</para>
    </formalpara>
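
    <para>The two cases above (<literal>1x</literal> kept as one
    token, <literal>4x5</literal> split into three) can be sketched in
    Python as follows. The function name
    <literal>split_times</literal> and the regular expression are
    hypothetical and cover only plain digit sequences; the assumption
    that a digit token's lemma equals its form is also ours. The lemma
    of the multiplication sign is copied from the text
    above.</para>

    <programlisting><![CDATA[
```python
import re

# Lemma of the "x" token, copied from the text above.
TIMES_LEMMA = "x-5_^(náhr._symbolu_krát)"


def split_times(token):
    """Return a list of (form, lemma) pairs for one input token.

    "4x5" is split into three tokens; anything else, including
    "1x", is returned unchanged as a single token whose lemma
    equals its form.
    """
    match = re.fullmatch(r"(\d+)x(\d+)", token)
    if match:
        left, right = match.groups()
        # Assumption: the lemma of a digit token is the token itself.
        return [(left, left), ("x", TIMES_LEMMA), (right, right)]
    return [(token, token)]


print(split_times("4x5"))
print(split_times("1x"))
```
]]></programlisting>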

    <simpara><foreignphrase>tři stovky, dvacet tisíc lidí, necelých
      9000</foreignphrase></simpara>
    <itemizedlist>
      <listitem>
    <para><foreignphrase>sto</foreignphrase> and
          <foreignphrase>pětiset</foreignphrase> in
          <foreignphrase>sto-, pětiset- a
          tisícikoruny</foreignphrase></para>
        <para>Not solved. The closest existing tag is the one of first
          parts of hyphenated adjectives
          (<literal>A2--------A----</literal>). But a lemma of a
          numeral should not have an adjectival tag.</para>
      </listitem>
      <listitem>
    <para><foreignphrase>Domníváme se, že <emphasis
          role="bold">poslední</emphasis> půl miliardy let
          udržuje...</foreignphrase></para>
    <para>What case should <foreignphrase>poslední</foreignphrase>
          get? Does it agree with <foreignphrase>půl</foreignphrase>
          (accusative), or with
          <foreignphrase>miliardy</foreignphrase> (genitive)?
          Solution: genitive should be preferred.</para>
    <para><foreignphrase>za těch patnáct let</foreignphrase>:
          <foreignphrase>patnáct</foreignphrase> = accusative,
          <foreignphrase>těch</foreignphrase> = genitive.</para>
      </listitem>
      <listitem>
    <para><foreignphrase>Výsledkem bylo zase jen pár
      marek.</foreignphrase> <foreignphrase>pár</foreignphrase>
      can be a numeral (<literal>C...[2367]</literal>) or a noun
      (<literal>N...[14]</literal>). But in this particular
      context, it should be <literal>C</literal> due to agreement
      with the predicate
      and <literal>N</literal> due to the nominative
      case. Solution: use <literal>ClXP1----------</literal>; the
      morphological analyzer must be adjusted accordingly.</para>
      </listitem>
    </itemizedlist>
  </chapter>





  <chapter id="hyphen"><title>Hyphenated composites</title>
    
    <para>If the first part of a hyphenated word ends in -o, and
    replacing that -o with an adjectival ending yields an adjective
    (normal or possessive), the lemma for that part is that
    adjective (e.g. <foreignphrase>česko-německý - česko &rarr; český,
    Karlo-Ferdinandova - Karlo &rarr; Karlův</foreignphrase>). Some
    first parts cannot be viewed
    as derived from adjectives, but rather from nouns (e.g.
    rap-jazzová - rap &rarr; rap vs. rapovo-jazzová - rapovo &rarr;
    rapový).</para>

    <para>Currently the only tag for first parts of hyphenated
    compounds is <literal>A2--------A----</literal>. The tag set has
    to be extended with a similar tag for nouns. Otherwise, we would
    have to introduce two lemmas for each noun, one tagged normally as
    a noun, the other as an adjective before a hyphen, because one
    lemma must not occur with more than one part of speech. That
    would be extremely inconvenient.</para>

    <example><title>Hyphenated composites</title>
      <itemizedlist spacing="compact" type="vert">
    <listitem><wordasword>srbsko-černohorská</wordasword>:
      <literal>srbský / A2--------A----</literal></listitem>
    <listitem><wordasword>Univerzita
      Karlo-Ferdinandova</wordasword>: <literal>Karlův_;Y_^(*3el)
      / A2--------A----</literal></listitem>
    <listitem><wordasword>Univerzita
      Karel-Ferdinandova</wordasword>: <literal>Karel_;Y /
      A2--------A----</literal></listitem>
    <listitem><wordasword>rap-jazzová</wordasword>: <literal>rap-2 /
      A2--------A----</literal></listitem>
    <listitem><wordasword>rapo-jazzová</wordasword>: <literal>rap-2
      / A2--------A----</literal></listitem>
    <listitem><wordasword>rapovo-jazzová</wordasword>:
      <literal>rapový / A2--------A----</literal></listitem>
      </itemizedlist>
    </example>
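
    <para>The rule for choosing the lemma of the first part can be
    sketched in Python as follows. This is only an illustration under
    stated assumptions: <literal>first_part_lemma</literal>, the list
    of adjective endings and the two toy lexicons are hypothetical and
    stand in for the real morphological dictionary. Lemma numbering
    and technical suffixes are ignored.</para>
    <programlisting><![CDATA[
```python
# Adjective endings tried in place of the final -o (an illustrative choice).
ADJECTIVE_ENDINGS = ("ý", "í", "ův")

# Tag the manual gives for first parts of hyphenated compounds.
FIRST_PART_TAG = "A2--------A----"


def first_part_lemma(part, adjectives, nouns):
    """Guess the lemma proper of the first part of a hyphenated word.

    `adjectives` and `nouns` are toy sets of known lemmas standing in
    for a real lexicon (hypothetical). Returns None if nothing fits.
    """
    if part.endswith("o"):
        stem = part[:-1]
        # Replace -o by an adjective ending and look the result up.
        for ending in ADJECTIVE_ENDINGS:
            if stem + ending in adjectives:
                return stem + ending
    # Otherwise the part may be a noun used as it stands.
    if part in nouns:
        return part
    return None


# Toy lexicon built from the examples in this chapter.
adjectives = {"český", "srbský", "rapový", "Karlův"}
nouns = {"rap", "Karel"}

for part in ["česko", "srbsko", "Karlo", "rapovo", "rap", "Karel"]:
    print(part, first_part_lemma(part, adjectives, nouns), FIRST_PART_TAG)
```
]]></programlisting>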
    
  </chapter>





  <chapter id="insertion"><title>Insertion</title>

    <para>If none of the possibilities offered by the morphological
      analyzer is suitable, you have to insert a new lemma and/or tag.
      If you insert a new lemma, you have to ensure that the lemma
      (lemma proper) you insert is not already in use. That usually
      means adding unique numbers to distinguish lexical items that
      share the same base form.</para>

    <sect1 id="possessiveadj"><title>Possessive adjectives</title>

      <para>Lemmas of possessive adjectives show how to get the noun
        they are derived from (see also <xref
        linkend="deriv-info"/>). For example:</para>

      <itemizedlist spacing="compact" type="vert">
    <listitem><literal>kardinálův_^(*2)</literal> - remove 2
          characters: <literal>kardinál</literal></listitem>
    <listitem><literal>Karlův_;Y_^(*3el)</literal> - remove 3
          characters, add &quot;el&quot;:
          <literal>Karel</literal></listitem>
    <listitem><literal>Martinův-1_;Y_^(*4-1)</literal> - remove 4
          characters, add &quot;-1&quot;:
          <literal>Martin-1</literal></listitem>
      </itemizedlist>
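
      <para>The derivation information in these lemmas can be decoded
        mechanically. The following Python sketch is a minimal
        illustration, not an official PDT tool:
        <literal>base_lemma</literal> is a hypothetical name, and the
        code assumes that the lemma proper is everything before the
        first underscore and that the count applies to the whole lemma
        proper, including any number suffix.</para>
      <programlisting><![CDATA[
```python
import re

# Matches a derivation comment such as ^(*3el):
# strip 3 characters from the lemma proper, then append "el".
DERIVATION = re.compile(r"\^\(\*(\d+)([^)]*)\)")


def base_lemma(lemma: str) -> str:
    """Return the lemma that the given lemma is derived from."""
    # Assumption: technical suffixes start at the first underscore.
    proper = lemma.split("_", 1)[0]
    match = DERIVATION.search(lemma)
    if match is None:
        raise ValueError("no derivation information in " + repr(lemma))
    strip = int(match.group(1))
    append = match.group(2)
    # Slice by length so that a count of 0 keeps the whole lemma proper.
    return proper[:len(proper) - strip] + append


for lemma in ["kardinálův_^(*2)", "Karlův_;Y_^(*3el)", "Martinův-1_;Y_^(*4-1)"]:
    print(lemma, "->", base_lemma(lemma))
```
]]></programlisting>
      <para>For the three lemmas listed above the sketch yields
        <literal>kardinál</literal>, <literal>Karel</literal> and
        <literal>Martin-1</literal>.</para>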

    </sect1>



    <sect1 id="ismus"><title>Words ending with <foreignphrase>-ismus,
    -izmus</foreignphrase></title>

      <para>The base form should use the -ismus ending; the form using
        -izmus is treated as variant '1'. Currently some entries still
        do not follow this convention.</para>
      <example><title><foreignphrase>-ismus,
      -izmus</foreignphrase><footnote><para>The examples show the
      desired state; in the current version of the morphological
      analyzer the two spellings are regarded as separate lexical
      items (they have different lemmas).</para></footnote></title>
    <itemizedlist spacing="compact" type="vert">
      <listitem><wordasword>mechanismus</wordasword>:
        <literal>mechanismus / NNIS1-----A----</literal></listitem>
      <listitem><wordasword>mechanizmus</wordasword>:
        <literal>mechanismus / NNIS1-----A---1</literal></listitem>
      <listitem><wordasword>exhibicionismus</wordasword>:
        <literal>exhibicionismus /
        NNIS1-----A----</literal></listitem>
      <listitem><wordasword>nacionalizmu</wordasword>:
        <literal>nacionalismus /
        NNIS2-----A---1</literal></listitem>
    </itemizedlist>
      </example>
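      <para>The desired state can be reached from the analyzer output
        by a simple rewrite, sketched below in Python. The function
        name <literal>normalize_ismus</literal> is hypothetical, and
        the sketch assumes a bare lemma without technical suffixes and
        a 15-position tag whose last position holds the
        variant.</para>
      <programlisting><![CDATA[
```python
def normalize_ismus(lemma: str, tag: str):
    """Map an -izmus lemma to its -ismus base form, marking variant '1'.

    Assumes a bare lemma (no technical suffixes) and a 15-position tag
    whose last position is the variant.
    """
    if len(tag) != 15:
        raise ValueError("expected a 15-position tag, got " + repr(tag))
    if lemma.endswith("izmus"):
        # Use the -ismus spelling as the base form.
        lemma = lemma[:-len("izmus")] + "ismus"
        # Record the -izmus spelling as variant '1'.
        tag = tag[:14] + "1"
    return lemma, tag


print(normalize_ismus("mechanizmus", "NNIS1-----A----"))
print(normalize_ismus("nacionalizmus", "NNIS2-----A----"))
print(normalize_ismus("mechanismus", "NNIS1-----A----"))
```
]]></programlisting>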
    </sect1>



    <sect1 id="pronunc"><title>Transcription of pronunciation</title>
      <example><title>Transcription of pronunciation</title>
    <para><foreignphrase>vyslovujeme
          &quot;zpjev&quot;</foreignphrase></para>
    <para><foreignphrase>&quot;měly&quot; se čte
          &quot;mňeli&quot;</foreignphrase></para>
      </example>
      <para>The lemma should be identical to the word form. The tag
        should be <literal>NNXXX-----A----</literal> even when the
        transcribed word is not a noun:
        <literal>mňeli_^(přepis_výslovnosti) /
        NNXXX-----A----</literal></para>
    </sect1>



    <sect1 id="crippled"><title>Crippled forms</title>
      <para>Some crippled forms very closely resemble the
        pronunciation category. In
        <foreignphrase>Gaptschikowo</foreignphrase>, pronunciation is
        modeled using German spelling. In <foreignphrase>&quot;řada
        lidí chybuje a píše 'poměnka'&quot;</foreignphrase>, the
        author points out a spelling error that other people make.
        However, the author's intention to use the wrong form must be
        clear; otherwise, the form counts as the author's own error
        and should be corrected.</para>
      <para>If possible, the crippled forms should be tagged as if
        they were spelled the standard way; otherwise, use
        <literal>NNXXX-----A----</literal> or
        <literal>AAXXX----1A----</literal> according to the part of
        speech.</para>
      <example><title>Crippled forms</title>
    <itemizedlist spacing="compact" type="vert">
          <listitem><wordasword>Waklaf Hafel</wordasword>:
            <literal>Waklaf_;Y_,t / NNMS1-----A---- // Hafel_;S_,t /
            NNMS1-----A----</literal></listitem>
          <listitem><wordasword>Gaptschikowo</wordasword>:
            <literal>Gaptschikowo_;G_,t /
            NNNS1-----A----</literal></listitem>
          <listitem><wordasword>v&nbsp;Gaptschikowo</wordasword>:
            <literal>Gaptschikowo_;G_,t /
            NNNXX-----A----</literal></listitem>
        </itemizedlist>
      </example>
    </sect1>



    <sect1 id="isomorph"><title>Isolated morphemes</title>

      <para>The lemma should be identical to the form; the tag should
        be <literal>NNXXX-----A----</literal>.</para>
      <para>Example: <foreignphrase>ve slovech končících na -ství
        píšeme...</foreignphrase>: <literal>ství /
        NNXXX-----A----</literal></para>

    </sect1>



    <sect1 id="geometry"><title>Geometry</title>

      <para>In documents on geometric subjects, many expressions such
        as &quot;triangle ABC&quot; or line segments PQ, RS, AB
        etc. occur. The identifiers of the objects are not
        abbreviations! Instead, a new lemma numbered 98 must be
        created for each. As always, no program should rely on the
        number being 98, but the annotators should keep the rule for
        the sake of human readability.</para>
      <para>Example: <foreignphrase>trojúhelník ABC</foreignphrase>:
        <literal>ABC-98_^(označení_pomocí_písmene)</literal></para>

    </sect1>



    <sect1 id="chess"><title>Chess codes</title>

      <para>Records of chess games appear occasionally in the
        data. They contain move descriptions in chess
        notation. Currently there are errors in tokenization; a whole
        move (piece, target column and target row) should be one
        token. The lemma should consist of the code followed by
        <literal>-1_:B_;w_^(šachový_tah)</literal>. The tag should be
        <literal>NNNXX-----A---8</literal> (the neuter gender
        corresponds to the gender of
        <foreignphrase>pole</foreignphrase> (square)).</para>
      <para>Example: <foreignphrase>Jh8</foreignphrase>:
        <literal>Jh8-1_:B_;w_^(šachový_tah) /
        NNNXX-----A---8</literal></para>
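      <para>The construction of the lemma and tag for a move can be
        sketched in Python as follows. The function name
        <literal>chess_annotation</literal> is hypothetical, and the
        sketch assumes that the move has already been joined into a
        single token; it does not validate the chess notation
        itself.</para>
      <programlisting><![CDATA[
```python
# Suffix and tag prescribed by the manual for chess move codes.
CHESS_LEMMA_SUFFIX = "-1_:B_;w_^(šachový_tah)"
CHESS_TAG = "NNNXX-----A---8"


def chess_annotation(move_code: str):
    """Return the (lemma, tag) pair for a chess move given as one token."""
    # A move must be a single token, so it cannot be empty or contain spaces.
    if not move_code or any(ch.isspace() for ch in move_code):
        raise ValueError("a move must be a single non-empty token")
    return move_code + CHESS_LEMMA_SUFFIX, CHESS_TAG


lemma, tag = chess_annotation("Jh8")
print(lemma, "/", tag)
```
]]></programlisting>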

    </sect1>

  </chapter>
  <!-- the end of Insertion chapter -->

</book>

