The PARSEME MWE 1.0 data is available from the Lindat repository and can be searched through both KonText and NoSke.

The PARSEME shared task data is a collection of corpora in 18 languages with annotation of verbal multiword expressions (VMWEs) in running texts. VMWEs include idioms (let the cat out of the bag), light verb constructions (make a decision), verb-particle constructions (give up), and inherently reflexive verbs (se suicider 'to suicide' in French; sich freuen 'to be glad' in German). For most languages, parts of speech, lemmas, morphological features and/or syntactic dependencies are also provided.

Here we list a collection of queries that demonstrate some of the abilities of the CQL query type, which is the most expressive query type available in KonText. We concentrate on the annotation of multiword expressions (MWEs). We link directly to ready-made queries in three common European languages, but the same queries are valid for all 15 languages that can be searched through KonText. The remaining three languages (Bulgarian, Hebrew and Lithuanian) were not included because their PARSEME data does not contain syntactic annotation; they may be added later upon request.


Simplest queries
[mwe_order="first"] French German Spanish

Search for the first word of each MWE annotated in the corpus.
If the same word happens to be the first word of several MWEs, it will appear in the KWIC output only once.

[mwe="LVC"] French German Spanish

Find tokens annotated as part of a light verb construction (LVC).

[lemma="faire" & mwe!="_"] French German Spanish

Find a particular verb annotated with any category of VMWE.

[mwe_lemma="faire partie"] French German Spanish

Search by a concatenation of the lemmas of words belonging to the MWE (in the order in which they appear in the text).
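As a rough illustration of how such a value is formed, the following Python sketch joins the lemmas of all tokens that share one MWE id, in text order. The token dictionaries, the example sentence and the `mwe_lemma_for` helper are hypothetical; they only mimic the attribute names used in the queries and are not the conversion code behind the corpus. The sketch also assumes that a token belonging to several MWEs lists their ids separated by `;`.

```python
def mwe_lemma_for(tokens, mwe_id):
    """Join the lemmas of all tokens carrying `mwe_id`, in text order."""
    # Assumption: multiple ids on one token are separated by ";".
    members = [t["lemma"] for t in tokens if mwe_id in t["mwe_id"].split(";")]
    return " ".join(members)

# Hypothetical sentence: "Il fait toujours partie du groupe."
sentence = [
    {"form": "Il", "lemma": "il", "mwe_id": "_"},
    {"form": "fait", "lemma": "faire", "mwe_id": "1"},
    {"form": "toujours", "lemma": "toujours", "mwe_id": "_"},
    {"form": "partie", "lemma": "partie", "mwe_id": "1"},
    {"form": "du", "lemma": "de", "mwe_id": "_"},
    {"form": "groupe", "lemma": "groupe", "mwe_id": "_"},
]

print(mwe_lemma_for(sentence, "1"))  # faire partie
```

Note that the intervening word "toujours" does not appear in the result, because only tokens annotated as part of the MWE contribute their lemma.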

[mwe="LVC" & upostag="VERB"] French German Spanish

Display LVC tokens that are verbs.

Highlighting multiple words
[mwe_order="first"][mwe_order="cont"]{1,} French German Spanish

Display and highlight continuous MWEs (with no intermediate words between their tokens). The output may also contain a discontinuous MWE of three or more words whose first two words directly follow each other; in that case only the initial continuous part is highlighted. Also note that for a continuous MWE of three or more words, the KWIC contains multiple lines: one with the first two tokens highlighted, another with three tokens highlighted, and so on. If you need only one instance in such cases (the one with the maximum number of words highlighted), please use the instance of the data running in NoSke and apply the "Filter > Sub-hits" option.
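The effect of removing sub-hits can be pictured with a small Python sketch. It is only an illustration of the idea, not NoSke's implementation: hits are modelled as hypothetical `(start, end)` token positions (end inclusive), and a hit is dropped when another hit fully covers it.

```python
def drop_sub_hits(hits):
    """Keep only hits that are not contained in another, different hit.

    Each hit is a (start, end) pair of token positions, end inclusive.
    """
    kept = []
    for h in hits:
        contained = any(
            o != h and o[0] <= h[0] and h[1] <= o[1] for o in hits
        )
        if not contained:
            kept.append(h)
    return kept

# A three-word continuous MWE at positions 10-12 yields two hits,
# (10, 11) and (10, 12); a two-word MWE at 20-21 yields one.
hits = [(10, 11), (10, 12), (20, 21)]
print(drop_sub_hits(hits))  # [(10, 12), (20, 21)]
```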

(meet 1:[mwe_order="first"] 2:[mwe_order="cont"] 0 5) & 1.mwe_id=2.mwe_id within <s/> French German Spanish

The first word of the MWE becomes the KWIC; a continuation word is also highlighted, but is not part of the KWIC.

1:[mwe_order="first"] []* 2:[mwe_order="cont"] & 1.mwe_id=2.mwe_id within <s/> French German Spanish

Matches the first two words of the MWE together with any words lying between them.
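To make the matched span concrete, here is a minimal Python sketch over a single hypothetical sentence. The token dictionaries, the `first_two_span` helper and the use of `_` for tokens outside any MWE are assumptions made for illustration. Unlike the CQL query, which returns every possible match, the sketch returns only the shortest span for the first MWE it finds.

```python
def first_two_span(tokens):
    """Return the forms from the first MWE word up to its first continuation."""
    for i, tok in enumerate(tokens):
        if tok["mwe_order"] != "first":
            continue
        # Look to the right for a continuation token of the same MWE.
        for j in range(i + 1, len(tokens)):
            other = tokens[j]
            if other["mwe_order"] == "cont" and other["mwe_id"] == tok["mwe_id"]:
                return [t["form"] for t in tokens[i : j + 1]]
    return []

# Hypothetical annotation of "She made a quick decision".
sentence = [
    {"form": "She", "mwe_order": "_", "mwe_id": "_"},
    {"form": "made", "mwe_order": "first", "mwe_id": "1"},
    {"form": "a", "mwe_order": "_", "mwe_id": "_"},
    {"form": "quick", "mwe_order": "_", "mwe_id": "_"},
    {"form": "decision", "mwe_order": "cont", "mwe_id": "1"},
]

print(first_two_span(sentence))  # ['made', 'a', 'quick', 'decision']
```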

1:[mwe_order="first"] []* 2:[mwe_order="cont"] []* 3:[mwe_order="cont"] & 1.mwe_id=2.mwe_id & 1.mwe_id=3.mwe_id within <s/> French German Spanish

Highlight three tokens in an MWE, including any discontinuities between them.

Queries for overlapping MWEs
[mwe="LVC" & mwe="ID"] French German Spanish

Find words which are LVC and ID at the same time.

[mwe=".*;.*"] French German Spanish

Find tokens that are part of more than one MWE.

[mwe="(.*);\1"] French German Spanish

Find tokens that are part of two or more MWEs of the same type; this is typically a result of coordination, as in "they were going in and out".
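The two regular expressions used for overlapping MWEs can be tried out offline. The sketch below uses Python's `re` module as a stand-in for the corpus engine's regex matching, under the assumption that an attribute value must match the pattern as a whole; the sample values are made up for illustration.

```python
import re

def matches(pattern, value):
    # Assumption: the pattern must match the entire attribute value.
    return re.fullmatch(pattern, value) is not None

overlap = r".*;.*"      # more than one MWE on the token
same_type = r"(.*);\1"  # the same category repeated around the ";"

for value in ["LVC", "LVC;ID", "VPC;VPC"]:
    print(value, matches(overlap, value), matches(same_type, value))
# LVC False False
# LVC;ID True False
# VPC;VPC True True
```

The backreference `\1` requires the text after the semicolon to repeat the text before it exactly, which is why `LVC;ID` satisfies the first pattern but not the second.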

(meet 1:[mwe_id="(.*;.*)"] 2:[] 1 5) & 1.mwe_id=2.mwe_id within <s/> French German Spanish

Find two tokens that simultaneously belong to the same pair of overlapping MWEs.

1:[mwe_order="first" & upostag="VERB" & mwe="LVC"] []* 2:[mwe_order="cont" & upostag="NOUN"] & 1.mwe_id=2.mwe_id within <s/> French German Spanish

Find light verb constructions where the real syntactic head goes first.


Get involved

Post your own queries at the dedicated Google Group forum, or report any issues or suggestions through the issue tracker.

Only the attributes that exist in all of the corpora with a CoNLL-U file were converted; thus, for example, meta-information such as sentence IDs is not yet available through the search interface. Please let us know if you would like to see it added.