Projects: Corpora

The Prague Dependency Treebank

The Prague Dependency Treebank (PDT) contains a large collection of Czech texts with complex, interlinked morphological, syntactic and semantic annotation; in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level. ... [learn more]

Prague Czech-English Dependency Treebank

The Prague Czech-English Dependency Treebank is a manually annotated, parallel, aligned treebank built on the Penn Treebank - Wall Street Journal text collection. It comes in two versions. The current version has over 1.2 million running words in almost 50,000 sentences for each language part. Each language part is enriched with comprehensive manual linguistic annotation in the PDT 2.0 style (Prague Dependency Treebank 2.0). ... [learn more]

Prague Discourse Treebank

Annotation of discourse relations is a project related to the Prague Dependency Treebank 2.5 (PDT; Bejček et al. 2011), which is a revised, updated and extended version of the Prague Dependency Treebank 2.0 (Hajič et al. 2006). It represents a new manually annotated layer of language description, above the existing layers of the PDT (morphology, surface syntax and underlying syntax) and it portrays linguistic phenomena from the perspective of discourse structure and coherence. ... [learn more]

HamleDT: HArmonized Multi-LanguagE Dependency Treebank

HamleDT is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. ... HamleDT currently integrates 30 treebanks. The subset of treebanks whose license terms permit redistribution can be downloaded directly from us. ... [learn more]

More Corpora

Title | Tags
Abstract Meaning Representation | Annotations, Machine Translation, Semantics
Anotace citačních frází v datech iRozhlas | Annotations
Anotace citačních zdrojů a frází v článcích serveru iRozhlas | Annotations
Anotace citačních zdrojů a frází v článcích mediálního serveru iRozhlas | Annotations
Anotace pro Google | Annotations, Data, Morphology, Semantics
Automatic MWE Identification | Data, Lexicons, Monolingual, Semantics
Bengali Visual Genome | Annotations, Corpora, Data, Machine Translation, Multi-modality, Multilingual
Čapek | Annotations, Morphology
Centrum vizuální historie Malach | Data, Discourse, Machine Translation, Multi-modality, Multilingual, Speech Recognition
CoNLL 2017 Shared Task | Annotations, Machine Learning, Multilingual, Parsers, Tools
CoNLL 2018 Shared Task | Annotations, Machine Learning, Morphology, Multilingual, Parsers, Tools
CorefUD | Annotations, Coreference, Corpora, Data, Multilingual
Czech Academic Corpus | Corpora, Data, Monolingual
Czech Court Decisions Dataset | Annotations, Coreference, Data, Information Retrieval, Information Structure, Linked data, Semantics
Czech Legal Text Treebank | Annotations, Corpora, Data, Information Retrieval, Linked data, Monolingual, Semantics
Czech Malach Cross-lingual Speech Retrieval Test Collection | Corpora, Data, Information Retrieval, Multilingual, Speech Retrieval
Czech Named Entity Corpus | Corpora, Data, Monolingual
Czech RST Discourse Treebank 1.0 | Annotations, Corpora, Data, Discourse, Monolingual
Czech-English Manual Word Alignment | Annotations, Data, Multilingual
Czech–German Lexicon of Anaphoric Connectives | Coreference, Discourse, Lexicons, Multilingual
CzeDLex - A Lexicon of Czech Discourse Connectives | Annotations, Corpora, Data, Discourse, Lexicons, Linked data, Monolingual
CzEng | Corpora, Data, Machine Translation, Multilingual
CzEngVallex - Czech and English verbal valency | Annotations, Corpora, Data, Lexicons, Machine Translation, Multilingual, Semantics, Taggers
CzeSL | Annotations, Corpora
Deep Universal Dependencies | Annotations, Coreference, Corpora, Data, Morphology, Multilingual, Multiword Expressions, Semantics, Syntax, Valency
Deltacorpus | Corpora, Data, Machine Learning, Taggers
DeriNet | Annotations, Data, Lexicons, Monolingual
ELITR Minuting Corpus | Annotations, Corpora, Data, Dialog
EngVallex - English valency lexicon linked to corpora | Annotations, Corpora, Data, Lexicons, Monolingual, Semantics, Valency
European Language Grid | Corpora, Data, Lexicons, Machine Translation, Multilingual, Parsers, Tools
EUROSAI Corpus | Corpora, Data
EVALD 3.0 (Evaluator of Discourse) | Coreference, Corpora, Data, Discourse, Information Structure, Monolingual, Multi-modality, Tools
Eyetracked Multi-Modal Translation | Data, Machine Translation, Multi-modality, Psycholinguistics
Functional Generative Description | Data, Information Structure, Morphology, Semantics, Valency
HamleDT | Annotations, Corpora, Data, Multilingual, Parsers
Hausa Visual Question Answering Dataset | Corpora, Data, Machine Translation, Multi-modality, Multilingual
HindEnCorp | Corpora, Data, Machine Translation, Monolingual, Multilingual
Hindi Visual Genome | Corpora, Data, Machine Translation, Multi-modality, Multilingual
HPLT kick-off | Data
Implicit relations in text coherence | Annotations, Corpora, Data, Discourse, Psycholinguistics
Interset | Corpora, Data, Morphology, Multilingual, Taggers, Tools
JTagger | Data, Information Retrieval, Information Structure, Linked data
Lexical-semantic Annotation / SemLex Lexicon | Annotations, Data, Lexicons, Monolingual, Semantics
Lindat KonText | Annotations, Corpora, Data, Monolingual, Multilingual, Tools
Linguistic Factors of Readability | Annotations, Data, Discourse, Information Structure, Semantics, Syntax
Malach Centre for Visual History | Data, Discourse, Multi-modality, Multilingual
Malayalam Visual Genome | Corpora, Data, Machine Translation, Multi-modality, Multilingual
Medieval Charter Sections Corpus | Corpora, Data, Information Retrieval
Medieval Charter Sections Corpus | Annotations, Data, Information Retrieval
Methods for rapid discourse annotation in selected corpora | Annotations, Corpora, Data, Discourse, Parsers
Modeling of Complexity in Czech Literary Texts | Annotations, Coreference, Corpora, Data, Discourse, Information Structure, Monolingual, Publications, Semantics, Syntax, Teaching
MorfFlex CZ | Corpora, Data, Lexicons, Monolingual, Morphology
Multilingual Corpus Annotation as a Support for Language Technologies | Annotations, Coreference, Corpora, Data, Discourse, Information Structure, Multilingual
MUSCIMA++ | Annotations, Data, Tools
NomVallex: Valency Lexicon of Czech Nouns and Adjectives | Corpora, Data, Lexicons, Monolingual, Semantics, Syntax, Valency
OdiEnCorp | Corpora, Machine Translation, Monolingual
ParCzech | Corpora, Data
PARSEME | Annotations, Corpora, Lexicons, Linked data, Machine Learning, Multiword Expressions, Parsers, Semantics, Valency
PARSEME | Annotations, Corpora, Lexicons, Multilingual, Multiword Expressions, Parsers, Semantics, Valency
PAWS (Parallel Anaphoric Wall Street Journal) | Annotations, Coreference, Corpora, Data, Linked data, Multilingual
PDT-C | Annotations, Coreference, Corpora, Data, Dialog, Discourse, Lexicons, Morphology, Multiword Expressions, Speech Recognition, Syntax, Valency
PDT-Vallex: Valency Lexicon Linked to Czech Corpora | Annotations, Corpora, Data, Lexicons, Monolingual, Semantics, Valency
PDTSC 2.0 | Annotations, Corpora, Data, Linked data, Monolingual, Morphology, Multi-modality, Semantics, Speech Recognition, Speech Retrieval, Valency
PML-Tree Query | Corpora, Tools
Prague Czech-English Dependency Treebank | Annotations, Corpora, Data, Lexicons, Linked data, Multilingual, Valency
Prague Czech-English Dependency Treebank 2.0 Coref | Annotations, Coreference, Corpora, Data, Linked data, Multilingual
Prague Czech-English Dependency Treebank 3.0 | Annotations, Coreference, Corpora, Data, Lexicons, Machine Translation, Morphology, Valency
Prague Database of Spoken Language 1.0 | Annotations, Corpora, Data, Dialog, Multi-modality, Multilingual, Speech Recognition, Speech Retrieval
Prague Dependency Treebank | Annotations, Coreference, Corpora, Data, Discourse, Information Structure, Lexicons, Monolingual, Morphology, Multiword Expressions, Parsers, Semantics, Syntax, Taggers, Tools, Valency
Prague Dependency Treebank 3.0 | Annotations, Coreference, Corpora, Data, Discourse, Information Structure, Monolingual, Morphology, Multiword Expressions, Semantics
Prague Dependency Treebank 3.5 | Annotations, Coreference, Corpora, Data, Discourse, Information Structure, Lexicons, Machine Learning, Monolingual, Morphology, Multiword Expressions, Parsers, Publications, Semantics, Syntax, Taggers, Tools, Valency
Prague Discourse Treebank 1.0 | Annotations, Coreference, Corpora, Data, Discourse, Information Structure, Monolingual, Morphology, Multiword Expressions, Semantics
Prague Discourse Treebank 2.0 | Annotations, Coreference, Corpora, Data, Discourse, Information Structure, Monolingual, Morphology, Multiword Expressions, Semantics
Prague Discourse Treebank 3.0 | Annotations, Coreference, Corpora, Data, Discourse, Information Structure, Monolingual, Morphology, Multiword Expressions, Semantics, Valency
Prague English Dependency Treebank | Annotations, Corpora, Data, Lexicons, Monolingual, Valency
Prague Markup Language (PML) | Annotations, Corpora, Tools
PraViDCo | Data, Discourse, Multi-modality, Multilingual, Speech Recognition, Tools
QT21 | Corpora, Data, Lexicons, Linked data, Machine Learning, Machine Translation, Multilingual, Semantics, Tools
ROMi 1.0 | Corpora, Data, Dialog, Monolingual, Speech Recognition
Selected derivational relations for automatic processing of Czech | Data, Lexicons, Monolingual, Morphology
Semantic Pattern Recognition | Annotations, Corpora, Data, Lexicons, Monolingual, Morphology, Parsers, Publications, Semantics, Taggers, Tools, Valency
Sentiment Analysis in Czech | Annotations, Corpora, Data, Lexicons, Monolingual, Semantics, Tools
Shallow discourse parsing in Czech | Annotations, Corpora, Data, Discourse, Lexicons
Slovakoczech NLP workshop | Annotations, Coreference, Corpora, Data, Dialog, Discourse, Information Retrieval, Information Structure, Lexicons, Linked data, Machine Learning, Machine Translation, Monolingual, Morphology, Multi-modality, Multilingual, Multiword Expressions, Parsers, Publications, Semantics, Speech Recognition, Speech Retrieval, Spellcheckers, Taggers, Tools, Valency
SPAT | Data, Information Structure, Machine Learning, Morphology
Strojový překlad se sémantickou informací | Annotations, Lexicons, Machine Translation, Semantics, Valency
Styx | Annotations, Morphology, Tools
SumeCzech | Corpora, Data, Monolingual
SynSemClass (formerly CzEngClass) | Corpora, Lexicons, Linked data, Semantics, Syntax, Valency
Systematic, economical and corpus-based description of valency properties of Czech deverbal nouns (theory and practice) | Lexicons, Valency
UFAL Medical Corpus | Corpora, Data, Machine Translation, Multilingual
UFAL Parallel Corpus of North Levantine | Corpora, Data, Machine Translation
UniDive | Annotations, Corpora, Data, Lexicons, Machine Learning, Morphology, Multilingual, Multiword Expressions, Parsers, Semantics, Syntax, Taggers, Tools
Universal Dependencies | Annotations, Corpora, Data, Morphology, Multilingual, Parsers
Universal Derivations | Annotations, Data, Lexicons, Morphology, Multilingual
Universal Segmentations | Annotations, Data, Lexicons, Morphology, Multilingual
UrMonoCorp | Corpora, Data, Monolingual
Valency Lexicon of Czech Verbs VALLEX | Data, Lexicons, Monolingual, Semantics, Syntax, Valency
VPS-30-En: Verb Pattern Sample - 30 English | Annotations, Corpora, Data, Lexicons, Monolingual, Semantics, Valency
VPS-GradeUp | Annotations, Corpora, Data, Lexicons, Machine Learning, Monolingual, Semantics, Valency
W2C | Corpora, Data, Multilingual
WordSim353-cs: Evaluation Dataset for Lexical Similarity and Relatedness, based on WordSim353 | Annotations, Data, Information Retrieval, Machine Learning, Multilingual, Semantics
Working with the Penn Discourse Treebank | Annotations, Corpora, Data, Discourse, Linked data, Monolingual, Tools
Working with the RST-DT and the RST-SC | Annotations, Corpora, Data, Discourse, Linked data, Semantics, Tools