s

A sentence. Sentence boundaries are identified at tokenization time, unless they are marked in the source, which is almost never the case. The algorithm for sentence boundary identification used in the CNC is very rudimentary: it is correct only about 95-98% of the time for general texts, and its accuracy depends heavily on the type of text.

Sentences are identified uniquely within the CNC corpus (as they should be in any corpus). The identification consists of the

filename,
document id,
paragraph number, and
sentence number.

The complete sentence identification is typically recorded with each sentence in the data, in the id attribute.
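The four components listed above can be combined into one identifier string. The following is a minimal Python sketch of that idea only. The class name `SentenceId`, the `render` method, the `-` separator, and the `p`/`s` prefixes are all hypothetical; the layout actually used in the CNC id attribute is not specified here and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentenceId:
    """The four components of a sentence identification."""
    filename: str
    document_id: str
    paragraph: int
    sentence: int

    def render(self, sep: str = "-") -> str:
        # Hypothetical layout: the real CNC id format may differ.
        return sep.join([
            self.filename,
            self.document_id,
            f"p{self.paragraph}",
            f"s{self.sentence}",
        ])


# Illustrative values only, not taken from any real corpus file.
example = SentenceId("file01", "doc003", 4, 2)
print(example.render())
```

Because every component takes part in the rendered string, two sentences that differ in any one of them receive different identifiers, which is the uniqueness property described above.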

Content

((i | w | f | d | D | fadd)*, idioms?, salt*)
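Read as a pattern, this content model allows any number of i, w, f, d, D, or fadd elements in any order, then at most one idioms element, then any number of salt elements. The sketch below checks a list of child element names against that pattern with a regular expression. The function name `matches_s_content` is hypothetical, and the check assumes element names are compared case-sensitively (d and D are distinct elements in the model). It is an illustration, not a substitute for validating with an SGML parser.

```python
import re

# Content model of <s>: ((i|w|f|d|D|fadd)*, idioms?, salt*)
# Each child name is encoded as "name " so that names cannot run together.
_S_MODEL = re.compile(r"(?:(?:i|w|f|d|D|fadd) )*(?:idioms )?(?:salt )*")


def matches_s_content(children):
    """Return True if the sequence of child element names fits the model."""
    encoded = "".join(name + " " for name in children)
    return _S_MODEL.fullmatch(encoded) is not None


# A sentence with three tokens, an idioms block, and two salt elements.
print(matches_s_content(["f", "d", "w", "idioms", "salt", "salt"]))
# A token after idioms violates the required order.
print(matches_s_content(["idioms", "w"]))
```

An empty child list also satisfies the model, since every part of it is optional.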


Tag Minimization:
Open Tag: REQUIRED
Close Tag: OPTIONAL

Parent Elements


csts DTD