Data Set    Tokens   Paras   Docs   Token / Para   Para / Doc   Token / Doc
+++         113700    2995    420         37.963        7.131       270.714
ALH          10098     215     25         46.967        8.600       403.920
ANN          12613     209     17         60.349       12.294       741.941
XIA          26338     888    111         29.660        8.000       237.279
AFP          12931     374     50         34.575        7.480       258.620
UMH          38378     881    132         43.562        6.674       290.742
XIN          13342     428     85         31.173        5.035       156.965

ALH  Al Hayat News Agency
     10098 non-root nodes = tokens    215 trees = paragraphs    25 files
     46.9674 nodes per tree    8.6000 trees per file    403.9200 nodes per file

ANN  An Nahar News Agency
     12613 non-root nodes = tokens    209 trees = paragraphs    17 files
     60.3493 nodes per tree    12.2941 trees per file    741.9412 nodes per file

XIA  Xinhua News Agency
     26338 non-root nodes = tokens    888 trees = paragraphs    111 files
     29.6599 nodes per tree    8.0000 trees per file    237.2793 nodes per file

AFP  Agence France Presse
     12931 non-root nodes = tokens    374 trees = paragraphs    50 files
     34.5749 nodes per tree    7.4800 trees per file    258.6200 nodes per file

UMH  Ummah Press Service
     38378 non-root nodes = tokens    881 trees = paragraphs    132 files
     43.5619 nodes per tree    6.6742 trees per file    290.7424 nodes per file

XIN  Xinhua News Agency
     13342 non-root nodes = tokens    428 trees = paragraphs    85 files
     31.1729 nodes per tree    5.0353 trees per file    156.9647 nodes per file
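
The ratio columns follow directly from the three raw counts (tokens, paragraphs, files), and the "+++" row is the sum over the six data sets. A minimal sketch of that arithmetic is given here for checking the figures; the names `counts`, `ratios` and `totals` are illustrative assumptions and do not come from the original tooling.

```python
# Raw counts per data set, copied from the statistics above:
# (tokens, paragraphs, documents).
counts = {
    "ALH": (10098, 215, 25),
    "ANN": (12613, 209, 17),
    "XIA": (26338, 888, 111),
    "AFP": (12931, 374, 50),
    "UMH": (38378, 881, 132),
    "XIN": (13342, 428, 85),
}


def ratios(tokens, paras, docs):
    """Return (tokens per paragraph, paragraphs per document, tokens per document)."""
    return tokens / paras, paras / docs, tokens / docs


# Column-wise sums over all data sets, corresponding to the "+++" row.
totals = tuple(sum(row[i] for row in counts.values()) for i in range(3))

# Print the summary table with the totals row first.
print(f"{'Data Set':<10}{'Tokens':>8}{'Paras':>8}{'Docs':>7}"
      f"{'Token / Para':>15}{'Para / Doc':>13}{'Token / Doc':>14}")
for name, row in [("+++", totals)] + list(counts.items()):
    per_para, para_per_doc, per_doc = ratios(*row)
    print(f"{name:<10}{row[0]:>8}{row[1]:>8}{row[2]:>7}"
          f"{per_para:>15.3f}{para_per_doc:>13.3f}{per_doc:>14.3f}")
```

Note that the overall ratios are computed from the summed counts rather than by averaging the per-data-set ratios, which is why the total Token / Doc value differs from the plain mean of the six rows.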