PCEDT & Multilingual Corpora

Prague Czech-English Dependency Treebank

The Prague Czech-English Dependency Treebank (PCEDT) is a manually annotated, aligned parallel treebank built on the Wall Street Journal section of the Penn Treebank. It comes in two versions. The current version has over 1.2 million running words in almost 50,000 sentences for each language part. Each language part carries comprehensive manual linguistic annotation in the style of the Prague Dependency Treebank 2.0 (PDT 2.0).

HamleDT: HArmonized Multi-LanguagE Dependency Treebank

HamleDT is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. ... HamleDT currently integrates 29 treebanks. The subset of treebanks whose license terms permit redistribution is available for direct download from us.

 

Other Parallel and/or Multilingual Corpora

Project | Tags
Bengali Visual Genome | Annotations, Corpora, Data, Machine Translation, Multi-modality, Multilingual
CorefUD | Annotations, Coreference, Corpora, Data, Multilingual
Czech Malach Cross-lingual Speech Retrieval Test Collection | Corpora, Data, Information Retrieval, Multilingual, Speech Retrieval
CzEng | Corpora, Data, Machine Translation, Multilingual
CzEngVallex - Czech and English verbal valency | Annotations, Corpora, Data, Lexicons, Machine Translation, Multilingual, Semantics, Taggers
Deep Universal Dependencies | Annotations, Coreference, Corpora, Data, Morphology, Multilingual, Multiword Expressions, Semantics, Syntax, Valency
European Language Grid | Corpora, Data, Lexicons, Machine Translation, Multilingual, Parsers, Tools
HamleDT | Annotations, Corpora, Data, Multilingual, Parsers
Hausa Visual Question Answering Dataset | Corpora, Data, Machine Translation, Multi-modality, Multilingual
HindEnCorp | Corpora, Data, Machine Translation, Monolingual, Multilingual
Hindi Visual Genome | Corpora, Data, Machine Translation, Multi-modality, Multilingual
Interset | Corpora, Data, Morphology, Multilingual, Taggers, Tools
Lindat KonText | Annotations, Corpora, Data, Monolingual, Multilingual, Tools
Malayalam Visual Genome | Corpora, Data, Machine Translation, Multi-modality, Multilingual
Multilingual Corpus Annotation as a Support for Language Technologies | Annotations, Coreference, Corpora, Data, Discourse, Information Structure, Multilingual
PAWS (Parallel Anaphoric Wall Street Journal) | Annotations, Coreference, Corpora, Data, Linked data, Multilingual
Prague Czech-English Dependency Treebank | Annotations, Corpora, Data, Lexicons, Linked data, Multilingual, Valency
Prague Czech-English Dependency Treebank 2.0 Coref | Annotations, Coreference, Corpora, Data, Linked data, Multilingual
Prague Database of Spoken Language 1.0 | Annotations, Corpora, Data, Dialog, Multi-modality, Multilingual, Speech Recognition, Speech Retrieval
PraViDCo | Annotations, Corpora, Data, Discourse, Information Retrieval, Machine Learning, Multi-modality, Multilingual, Speech Recognition, Speech Retrieval, Tools
QT21 | Corpora, Data, Lexicons, Linked data, Machine Learning, Machine Translation, Multilingual, Semantics, Tools
Slovakoczech NLP workshop | Annotations, Coreference, Corpora, Data, Dialog, Discourse, Information Retrieval, Information Structure, Lexicons, Linked data, Machine Learning, Machine Translation, Monolingual, Morphology, Multi-modality, Multilingual, Multiword Expressions, Parsers, Publications, Semantics, Speech Recognition, Speech Retrieval, Spellcheckers, Taggers, Tools, Valency
UFAL Medical Corpus | Corpora, Data, Machine Translation, Multilingual
UniDive | Annotations, Corpora, Data, Lexicons, Machine Learning, Morphology, Multilingual, Multiword Expressions, Parsers, Semantics, Syntax, Taggers, Tools
Universal Dependencies | Annotations, Corpora, Data, Morphology, Multilingual, Parsers
W2C | Corpora, Data, Multilingual