w
Is the form of the original token as found in the original source text.
Its text (#PCDATA) is in most cases identical to the initial text
(#PCDATA) of the <f> element, in which case it can be completely omitted.
Otherwise it must immediately precede the corresponding "normalized" <f>
element(s).
It is used in the following cases:
- for automatically processed data:
- normalized numbers; spaces and/or other thousand separators are removed
in <f>, and decimal separators other than periods are replaced by periods.
The kind attribute has the value num.orig in both cases. There is always
exactly one <w> element for an <f> element.
- contracted forms; examples include tys (ty + jsi), nač (na + co),
abys/abychom/abyste, and all words with attached "-s" (jsi), if identified
by the automatic processing at tokenization or tagging time. (In English,
this would normally include isn't, wanna, etc.) The kind attribute has the
value ctcd, and all the case attributes of the corresponding <f> elements
have the (sub)string gen added.
- parts of phrases treated as a single token in the subsequent processing;
for example, fixed multiword names are treated this way, as is the fixed
phrase (být) s to. This also includes peculiar formatting, such as titles
widened by spaces (e.g. P r a g u e). The kind attribute has the value
phrpart at every instance of <w>, and all the case attributes of the
corresponding <f> elements have the (sub)string phrase added.
- for manually annotated data:
- spelling errors; the string containing the error is preserved in this
element, with the kind attribute set to spell.
- missing forms; rarely used, since only obvious omissions (normally
classified as typos) are corrected. This is the only case in which the
element's text (#PCDATA) is empty; the kind attribute is set to ins.
- superfluous "forms"; used e.g. for graphical symbols made up of letters
and punctuation and misidentified by the tokenizer as words; the kind
attribute is set to del.
- any of the cases listed above in the "automatic" list, if the
automatic tokenization procedure got it wrong.
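The number normalization described in the list above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual PDT tokenizer: the function names normalize_number and needs_w are hypothetical, and the sketch assumes that spaces (ordinary and no-break) are the only thousand separators and that the comma is the only non-period decimal separator.

```python
def normalize_number(orig: str) -> str:
    """Return the normalized (<f>) form of an original number token.

    Assumptions made for this sketch only:
    - ordinary and no-break spaces are the only thousand separators;
    - a comma is the only decimal separator other than a period.
    """
    # Remove the assumed thousand separators.
    without_groups = orig.replace(" ", "").replace("\u00a0", "")
    # Replace the assumed non-period decimal separator by a period.
    return without_groups.replace(",", ".")


def needs_w(orig: str) -> bool:
    """A <w> element is needed only if the original form differs from <f>."""
    return normalize_number(orig) != orig


if __name__ == "__main__":
    for token in ["1 234,5", "3,14", "42"]:
        print(token, "->", normalize_number(token), "needs <w>:", needs_w(token))
```

When needs_w returns False, the original and normalized forms coincide, which corresponds to the case in which the <w> element can be omitted.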
In the PDT data, the default value same of the kind attribute is never
used explicitly; in fact, the whole <w> element, although theoretically
correct, is never present in such a case.
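The constraints stated above on the kind attribute can be summarized in a small consistency check. The following Python sketch is hypothetical: the table W_KINDS and the function check_w are illustrative names, the table contains only the values named in this description, and the check covers only the two rules stated here (same is never written explicitly in PDT data, and empty text occurs only with ins).

```python
from typing import Optional

# Values of the kind attribute named in the description of <w>.
W_KINDS = {
    "same": "original form identical to <f> (default, never written in PDT)",
    "num.orig": "original form of a normalized number",
    "ctcd": "contracted form",
    "phrpart": "part of a phrase treated as a single token",
    "spell": "spelling error (manual annotation)",
    "ins": "missing form inserted (manual annotation)",
    "del": "superfluous form (manual annotation)",
}


def check_w(kind: str, text: str) -> Optional[str]:
    """Return a description of the problem, or None if no rule is violated."""
    if kind not in W_KINDS:
        return f"unknown kind value: {kind}"
    if kind == "same":
        # In PDT data the whole <w> element is omitted in this case.
        return "kind 'same' is never used explicitly in PDT data"
    if text == "" and kind != "ins":
        # Empty text is stated to occur only for missing forms.
        return "empty text is allowed only with kind 'ins'"
    return None


if __name__ == "__main__":
    print(check_w("spell", "teh"))  # no problem reported
    print(check_w("del", ""))       # problem: empty text with a kind other than ins
```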
Content
ATTRIBUTES
CONTENT DECLARATION
Tag Minimization
- Open Tag: REQUIRED
- Close Tag: OPTIONAL
csts DTD