PADT MorphoTrees Data

Data Set    Tokens   Paras   Docs   Ents/Para   Paras/Doc   Ents/Doc   Explicit Tokens   Entities   Complete Entities   Done/All Entities
+++         148022    3363    393      37.508       8.557    320.967            144871     126140              122989             97.50 %
ALH          73371    1298    154      47.993       8.429    404.513             72267      62295               61191             98.23 %
ANN          25331     426     34      50.340      12.529    630.735             24633      21445               20747             96.75 %
XIA          49320    1639    205      25.869       7.995    206.829             47971      42400               41051             96.82 %

The per-paragraph and per-document ratios are computed on entities, not tokens (47.993 = 62295 / 1298 for ALH). The +++ row is the total over the three data sets.
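The derived columns above are plain ratios of the raw counts. A minimal sketch recomputing them for the ALH row (field names are illustrative, not from the PADT distribution):

```python
# Recompute the derived columns of the summary table from the raw counts.
# Dictionary keys are hypothetical labels for the columns listed above.
rows = {
    "ALH": dict(tokens=73371, paras=1298, docs=154, explicit=72267,
                entities=62295, complete=61191),
}

for name, r in rows.items():
    ents_per_para = r["entities"] / r["paras"]    # 47.993 for ALH
    paras_per_doc = r["paras"] / r["docs"]        # 8.429
    ents_per_doc  = r["entities"] / r["docs"]     # 404.513
    done_ratio    = r["complete"] / r["entities"] # -> 98.23 %
    print(f"{name}: {ents_per_para:.3f} {paras_per_doc:.3f} "
          f"{ents_per_doc:.3f} {done_ratio:.2%}")
```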

ALH Al Hayat News Agency

73371 tokens = syntactic units
62295 entities = input words


1298 paragraphs
154 files


72267 tokens explicitly annotated
61191 entities completely annotated
1104 entities with no annotation
0 tokens in incomplete entities


56.5262 tokens per paragraph
47.9931 entities per paragraph
476.4351 tokens per file
404.5130 entities per file
8.4286 paragraphs per file
98.228 % entities done (done entities / all entities)
98.495 % tokens done (done tokens / all tokens)


1.1778 tokens per entity
1.1810 done tokens per done entity


86514 partitions = options for tokenization
113948 token forms = elements in such options
176004 lemmas in analyses
479992 tokens in analyses


1.3888 partitions per entity
1.3171 elements per partition
1.5446 lemmas per element
2.7272 token analyses per lemma
7.7051 token analyses per entity
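The ratio block above chains the analysis counts for ALH; a short sketch under the source's figures (variable names are illustrative):

```python
# Derived ratios for the ALH morphological-analysis counts.
entities   = 62295   # input words
partitions = 86514   # tokenization options over all entities
elements   = 113948  # token forms appearing in those options
lemmas     = 176004  # lemmas proposed in analyses
analyses   = 479992  # token analyses overall

print(round(partitions / entities, 4))  # 1.3888 partitions per entity
print(round(elements / partitions, 4))  # 1.3171 elements per partition
print(round(lemmas / elements, 4))      # 1.5446 lemmas per element
print(round(analyses / lemmas, 4))      # 2.7272 token analyses per lemma
print(round(analyses / entities, 4))    # 7.7051 token analyses per entity
```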

ANN An Nahar News Agency

25331 tokens = syntactic units
21445 entities = input words


426 paragraphs
34 files


24633 tokens explicitly annotated
20747 entities completely annotated
698 entities with no annotation
0 tokens in incomplete entities


59.4624 tokens per paragraph
50.3404 entities per paragraph
745.0294 tokens per file
630.7353 entities per file
12.5294 paragraphs per file
96.745 % entities done (done entities / all entities)
97.244 % tokens done (done tokens / all tokens)


1.1812 tokens per entity
1.1873 done tokens per done entity


30501 partitions = options for tokenization
40543 token forms = elements in such options
61433 lemmas in analyses
152337 tokens in analyses


1.4223 partitions per entity
1.3292 elements per partition
1.5153 lemmas per element
2.4797 token analyses per lemma
7.1036 token analyses per entity

XIA Xinhua News Agency

49320 tokens = syntactic units
42400 entities = input words


1639 paragraphs
205 files


47971 tokens explicitly annotated
41051 entities completely annotated
1349 entities with no annotation
0 tokens in incomplete entities


30.0915 tokens per paragraph
25.8694 entities per paragraph
240.5854 tokens per file
206.8293 entities per file
7.9951 paragraphs per file
96.818 % entities done (done entities / all entities)
97.265 % tokens done (done tokens / all tokens)


1.1632 tokens per entity
1.1686 done tokens per done entity


60095 partitions = options for tokenization
78978 token forms = elements in such options
121334 lemmas in analyses
333464 tokens in analyses


1.4173 partitions per entity
1.3142 elements per partition
1.5363 lemmas per element
2.7483 token analyses per lemma
7.8647 token analyses per entity