PADT MorphoTrees Data

Data Set    Tokens   Paras   Docs   Ents/Para   Paras/Doc   Ents/Doc   Explicit Tokens   Entities   Complete Entities   Done/All Entities
+++         148022    3363    393      37.508       8.557    320.967            144871     126140              122989             97.50 %
ALH          73371    1298    154      47.993       8.429    404.513             72267      62295               61191             98.23 %
ANN          25331     426     34      50.340      12.529    630.735             24633      21445               20747             96.75 %
XIA          49320    1639    205      25.869       7.995    206.829             47971      42400               41051             96.82 %

The per-paragraph and per-document ratios are computed on entities, not tokens (47.993 = 62295 / 1298 for ALH). The +++ row is the total over the three data sets.
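The derived columns above are plain ratios of the raw counts. A minimal sketch recomputing them for the ALH row (field names are illustrative, not from the PADT distribution):

```python
# Recompute the derived columns of the summary table from the raw counts.
# Dictionary keys are hypothetical labels for the columns listed above.
rows = {
    "ALH": dict(tokens=73371, paras=1298, docs=154, explicit=72267,
                entities=62295, complete=61191),
}

for name, r in rows.items():
    ents_per_para = r["entities"] / r["paras"]    # 47.993 for ALH
    paras_per_doc = r["paras"] / r["docs"]        # 8.429
    ents_per_doc  = r["entities"] / r["docs"]     # 404.513
    done_ratio    = r["complete"] / r["entities"] # -> 98.23 %
    print(f"{name}: {ents_per_para:.3f} {paras_per_doc:.3f} "
          f"{ents_per_doc:.3f} {done_ratio:.2%}")
```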

ALH Al Hayat News Agency

73371 tokens = syntactic units
62295 entities = input words


1298 paragraphs
154 files


72267 tokens explicitly annotated
61191 entities completely annotated
1104 entities with no annotation
0 tokens in incomplete entities


56.5262 tokens per paragraph
47.9931 entities per paragraph
476.4351 tokens per file
404.5130 entities per file
8.4286 paragraphs per file
98.228 % entities done (done entities / all entities)
98.495 % tokens done (done tokens / all tokens)


1.1778 tokens per entity
1.1810 done tokens per done entity


86514 partitions = options for tokenization
113948 token forms = elements in such options
176004 lemmas in analyses
479992 tokens in analyses


1.3888 partitions per entity
1.3171 elements per partition
1.5446 lemmas per element
2.7272 token analyses per lemma
7.7051 token analyses per entity
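The ratio block above chains the analysis counts for ALH; a short sketch under the source's figures (variable names are illustrative):

```python
# Derived ratios for the ALH morphological-analysis counts.
entities   = 62295   # input words
partitions = 86514   # tokenization options over all entities
elements   = 113948  # token forms appearing in those options
lemmas     = 176004  # lemmas proposed in analyses
analyses   = 479992  # token analyses overall

print(round(partitions / entities, 4))  # 1.3888 partitions per entity
print(round(elements / partitions, 4))  # 1.3171 elements per partition
print(round(lemmas / elements, 4))      # 1.5446 lemmas per element
print(round(analyses / lemmas, 4))      # 2.7272 token analyses per lemma
print(round(analyses / entities, 4))    # 7.7051 token analyses per entity
```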

ANN An Nahar News Agency

25331 tokens = syntactic units
21445 entities = input words


426 paragraphs
34 files


24633 tokens explicitly annotated
20747 entities completely annotated
698 entities with no annotation
0 tokens in incomplete entities


59.4624 tokens per paragraph
50.3404 entities per paragraph
745.0294 tokens per file
630.7353 entities per file
12.5294 paragraphs per file
96.745 % entities done (done entities / all entities)
97.244 % tokens done (done tokens / all tokens)


1.1812 tokens per entity
1.1873 done tokens per done entity


30501 partitions = options for tokenization
40543 token forms = elements in such options
61433 lemmas in analyses
152337 tokens in analyses


1.4223 partitions per entity
1.3292 elements per partition
1.5153 lemmas per element
2.4797 token analyses per lemma
7.1036 token analyses per entity

XIA Xinhua News Agency

49320 tokens = syntactic units
42400 entities = input words


1639 paragraphs
205 files


47971 tokens explicitly annotated
41051 entities completely annotated
1349 entities with no annotation
0 tokens in incomplete entities


30.0915 tokens per paragraph
25.8694 entities per paragraph
240.5854 tokens per file
206.8293 entities per file
7.9951 paragraphs per file
96.818 % entities done (done entities / all entities)
97.265 % tokens done (done tokens / all tokens)


1.1632 tokens per entity
1.1686 done tokens per done entity


60095 partitions = options for tokenization
78978 token forms = elements in such options
121334 lemmas in analyses
333464 tokens in analyses


1.4173 partitions per entity
1.3142 elements per partition
1.5363 lemmas per element
2.7483 token analyses per lemma
7.8647 token analyses per entity