--------------------------------------------------------
Statistics on Readers' Digest corpora sentence alignment
--------------------------------------------------------

words ........ # of tokens
phs .......... # of paragraphs (aligning passages)
snts ......... # of sentences (not aligned)
1-1, 0-1 ..... types of sentence alignments
CZ ........... Czech part
EN ........... English part

----------------------------------------------------------------------
## Version with paragraphs

      words   phs   snts   1-1   0-1   1-0   1-2   2-1   2-2
    ------- ----- ------ ----- ----- ----- ----- ----- -----
CZ   826839  3838  67491 32944 20071 17963  8096  3802   442
    ------- ----- ------ ----- ----- ----- ----- ----- -----
EN   920087  3978  73893 32944 17963 20071  3802  8096   442
    ------- ----- ------ ----- ----- ----- ----- ----- -----

# "usable" sentence pairs: 44,842

----------------------------------------------------------------------
## Version without paragraphs

      words  phs*   snts   1-1   0-1   1-0   1-2   2-1   2-2
    ------- ----- ------ ----- ----- ----- ----- ----- -----
CZ   732927   450  59215 45035   438   473  4753  3914   498
    ------- ----- ------ ----- ----- ----- ----- ----- -----
EN   839597   450  59889 45035   473   438  3914  4753   498
    ------- ----- ------ ----- ----- ----- ----- ----- -----

# "usable" sentence pairs: 53,702

* corresponds to number of files

----------------------------------------------------------------------
## "Definitive" version with sophisticated sentence alignment (by DM tools)

     words^  phs*   snts   1-1   0-1   1-0   1-2   2-1   2-2 others
    ------- ----- ------ ----- ----- ----- ----- ----- ----- ------
CZ   888788   443  59367 44086   402   617  4116  3943   378    726
    ------- ----- ------ ----- ----- ----- ----- ----- ----- ------
EN  1014181   443  58945 44086   617   402  3943  4116   378    726
    ------- ----- ------ ----- ----- ----- ----- ----- ----- ------

^ number of tokens after tokenization and normalization

----------------------------------------------------------------------
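
The file does not say how the "usable" sentence pair figures are derived. In both versions where a figure is given, it coincides with the sum of the 1-1, 1-2 and 2-1 alignment counts. The Python sketch below reproduces that arithmetic; the function name `usable_pairs` and the definition it encodes are assumptions inferred from the numbers, not something stated in the source.

```python
# Alignment counts copied from the CZ rows of the first two tables.
with_paragraphs = {
    "1-1": 32944, "0-1": 20071, "1-0": 17963,
    "1-2": 8096, "2-1": 3802, "2-2": 442,
}
without_paragraphs = {
    "1-1": 45035, "0-1": 438, "1-0": 473,
    "1-2": 4753, "2-1": 3914, "2-2": 498,
}


def usable_pairs(counts):
    """Assumed definition (hypothetical): 1-1, 1-2 and 2-1 alignments count as usable."""
    return counts["1-1"] + counts["1-2"] + counts["2-1"]


print(usable_pairs(with_paragraphs))     # 44842
print(usable_pairs(without_paragraphs))  # 53702
```

Because the EN rows mirror the CZ rows (1-2 and 2-1 swapped), the same sum results from either row.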