Data Set  Tokens  Paras  Docs  Entities/Para  Paras/Doc  Entities/Doc  Explicit Tokens  Entities  Complete Entities  Done/All Entities
+++       148022  3363   393   37.508         8.557      320.967       144871           126140    122989             97.50%
ALH        73371  1298   154   47.993         8.429      404.513        72267            62295     61191             98.23%
ANN        25331   426    34   50.340        12.529      630.735        24633            21445     20747             96.75%
XIA        49320  1639   205   25.869         7.995      206.829        47971            42400     41051             96.82%

(+++ denotes all three data sets combined.)

ALH (Al Hayat News Agency)
73371 tokens (syntactic units); 62295 entities (input words)
1298 paragraphs; 154 files
72267 tokens explicitly annotated; 61191 entities completely annotated; 1104 entities with no annotation; 0 tokens in incomplete entities
56.5262 tokens per paragraph; 47.9931 entities per paragraph; 476.4351 tokens per file; 404.5130 entities per file; 8.4286 paragraphs per file
98.228% of entities done; 98.495% of tokens done
1.1778 tokens per entity; 1.1810 done tokens per done entity
86514 partitions (tokenization options); 113948 token forms (elements of those options); 176004 lemmas in analyses; 479992 tokens in analyses
1.3888 partitions per entity; 1.3171 elements per partition; 1.5446 lemmas per element; 2.7272 token analyses per lemma; 7.7051 token analyses per entity

ANN (An Nahar News Agency)
25331 tokens (syntactic units); 21445 entities (input words)
426 paragraphs; 34 files
24633 tokens explicitly annotated; 20747 entities completely annotated; 698 entities with no annotation; 0 tokens in incomplete entities
59.4624 tokens per paragraph; 50.3404 entities per paragraph; 745.0294 tokens per file; 630.7353 entities per file; 12.5294 paragraphs per file
96.745% of entities done; 97.244% of tokens done
1.1812 tokens per entity; 1.1873 done tokens per done entity
30501 partitions (tokenization options); 40543 token forms (elements of those options); 61433 lemmas in analyses; 152337 tokens in analyses
1.4223 partitions per entity; 1.3292 elements per partition; 1.5153 lemmas per element; 2.4797 token analyses per lemma; 7.1036 token analyses per entity

XIA (Xinhua News Agency)
49320 tokens (syntactic units); 42400 entities (input words)
1639 paragraphs; 205 files
47971 tokens explicitly annotated; 41051 entities completely annotated; 1349 entities with no annotation; 0 tokens in incomplete entities
30.0915 tokens per paragraph; 25.8694 entities per paragraph; 240.5854 tokens per file; 206.8293 entities per file; 7.9951 paragraphs per file
96.818% of entities done; 97.265% of tokens done
1.1632 tokens per entity; 1.1686 done tokens per done entity
60095 partitions (tokenization options); 78978 token forms (elements of those options); 121334 lemmas in analyses; 333464 tokens in analyses
1.4173 partitions per entity; 1.3142 elements per partition; 1.5363 lemmas per element; 2.7483 token analyses per lemma; 7.8647 token analyses per entity
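The derived figures above are plain ratios of the base counts, and the combined row is the sum over the three corpora. A minimal Python sketch that re-derives a few of them for verification (the dictionary keys are my own naming, not from the source):

```python
# Base counts copied from the tables above; keys are my own naming.
corpora = {
    "ALH": {"tokens": 73371, "paras": 1298, "files": 154,
            "entities": 62295, "done_entities": 61191},
    "ANN": {"tokens": 25331, "paras": 426, "files": 34,
            "entities": 21445, "done_entities": 20747},
    "XIA": {"tokens": 49320, "paras": 1639, "files": 205,
            "entities": 42400, "done_entities": 41051},
}

# The combined row is the plain sum over the three corpora.
totals = {key: sum(c[key] for c in corpora.values())
          for key in ("tokens", "paras", "files", "entities", "done_entities")}
assert totals["tokens"] == 148022
assert totals["files"] == 393

# Derived ratios for ALH, matching the figures quoted above.
alh = corpora["ALH"]
print(round(alh["tokens"] / alh["paras"], 4))                  # tokens per paragraph: 56.5262
print(round(100 * alh["done_entities"] / alh["entities"], 3))  # % of entities done: 98.228
print(round(alh["tokens"] / alh["entities"], 4))               # tokens per entity: 1.1778
```

The same arithmetic reproduces every per-paragraph, per-file, and percentage figure in the detail blocks from the corresponding base counts.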