Universal Dependencies (UD)

Universal Dependencies (UD) is a project that provides annotation for treebanks in 18 languages. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). A general description of the data and tagsets can be found on the project page.

To search, open the list of monolingual corpora, click Universal Dependencies, and choose a language:

http://lindat.mff.cuni.cz/services/kontext/run.cgi/corplist#lindat-monolingual-corpora

Full information on the user interface and on searching in KonText can be found in the general KonText documentation. Note, however, that the attributes and metadata described there differ from the UD tagset, so use it only as a guide to searching in general. The search attributes for UD correspond to the fields of CoNLL-U.
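To make the correspondence with CoNLL-U more concrete, the following minimal Python sketch splits one CoNLL-U token line into its ten standard fields. The example line is made up for illustration, and the function name parse_conllu_line is hypothetical. The sketch also assumes that the pos, ufeat, lemma and deprel attributes used in the queries on this page correspond to the UPOS, FEATS, LEMMA and DEPREL columns; the exact mapping in KonText is not confirmed here.

```python
# The ten CoNLL-U columns, in the order defined by the CoNLL-U format.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu_line(line):
    """Split one tab-separated CoNLL-U token line into a dict of its fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 columns, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))


# Made-up example line (not taken from any treebank).
example = ("2\twritten\twrite\tVERB\tVBN\t"
           "Tense=Past|VerbForm=Part|Voice=Pass\t0\troot\t_\t_")
token = parse_conllu_line(example)

# Assumed mapping to search attributes: lemma, pos (UPOS), ufeat (FEATS), deprel.
print(token["lemma"], token["upos"], token["feats"], token["deprel"])
```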

Some examples of queries:

  • English: the ufeat attribute is processed as 'multivalue', so if the whole attribute is e.g. Tense=Past|VerbForm=Part|Voice=Pass, a search can target any one of the values; e.g. [ufeat="Voice=Pass"] returns sentences with passive constructions.
  • English: the dependency relation compound:prt marks a particle and can be used to search for phrasal verbs: [pos="VERB"][deprel="compound:prt"]; for the discontinuous construction, with some words between the verb and the particle: [pos="VERB"] [pos!="VERB"]{2,5} [deprel="compound:prt"]
  • French: find adjectives that stand before nouns: [pos="ADJ"][pos="NOUN"]
  • Czech: find reflexive particles that depend on a verb, stand to the right of the verb, and are separated from it by some words: [lemma="se" & ( p_ufeat=".*Tense=Pres.*" & parent="+.*") ]
  • Swedish: [lemma="hålla"][]{0,6} [lemma = "på"] [lemma="att"] [ufeat = "VerbForm=Inf"]
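The 'multivalue' behaviour of ufeat used in the first example can be pictured with a small Python sketch. It only illustrates the idea (split the attribute on "|" and match the pattern against each value separately) and is not KonText's actual implementation; the function name multivalue_match is hypothetical.

```python
import re


def multivalue_match(attribute, pattern):
    """Return True if the regex pattern matches any single value in full."""
    # A multivalue attribute is a "|"-separated list of Feature=Value pairs.
    values = attribute.split("|")
    return any(re.fullmatch(pattern, value) for value in values)


feats = "Tense=Past|VerbForm=Part|Voice=Pass"
print(multivalue_match(feats, "Voice=Pass"))  # True: one value matches in full
print(multivalue_match(feats, "Voice=Act"))   # False: no value matches
print(multivalue_match(feats, "Tense=.*"))    # True: regex matches Tense=Past
```

Under this reading, a query such as [ufeat="Voice=Pass"] does not need leading or trailing .* to skip the other features of the token.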

Challenges for UD in KonText:

  • The Interset features (attribute ufeat) are not straightforward to represent and query; the attribute is treated as a "multivalue" feature.
  • It has not yet been decided how to handle fused tokens.

Queries over the joint corpus:

To allow experiments in comparative linguistics, we compiled a joint corpus from the first 5,000 sentences of each language in UD. Lexical searches make little sense in this corpus, but the grammatical attributes can be used to compare linguistic phenomena across several languages. The frequency distribution over the languages can be viewed with the function Frequency->Doc IDs (the user must be logged in to access this function), where each Doc ID stands for one language.

Examples of queries:

  • [pos="NOUN"][pos="ADJ"] matches adjectives placed after nouns; the frequency of this pattern is especially high in Romance languages.
  • [deprel="nsubj" & pos="VERB"] shows cases where the subject of a sentence is a verb (or a verbal form).
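The kind of per-language comparison that Frequency->Doc IDs gives for such queries can be approximated offline with a short Python sketch. It assumes the data is available as (language, list of POS tags) pairs, one per sentence; the sample sentences and the function name count_bigram_by_language are hypothetical and serve only as an illustration.

```python
from collections import Counter


def count_bigram_by_language(sentences, first_pos, second_pos):
    """Count adjacent (first_pos, second_pos) tag pairs per language."""
    counts = Counter()
    for language, tags in sentences:
        # Pair each tag with its right-hand neighbour.
        for left, right in zip(tags, tags[1:]):
            if left == first_pos and right == second_pos:
                counts[language] += 1
    return counts


# Made-up sample: each item is (language code, POS tags of one sentence).
sample = [
    ("fr", ["DET", "NOUN", "ADJ", "VERB"]),
    ("fr", ["NOUN", "ADJ", "PUNCT"]),
    ("en", ["DET", "ADJ", "NOUN", "VERB"]),
]
result = count_bigram_by_language(sample, "NOUN", "ADJ")
print(result["fr"], result["en"])
```

Because the joint corpus takes the same number of sentences from every language, raw counts are roughly comparable, although sentence length still varies between languages.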