PML-TQ web tutorial

Tutorial for the PML-TQ web client

Petr Pajas

Jiří Mírovský

Table of Contents

Introduction

PML-TQ main features
Basic concepts

Tutorial

Getting started
A simple query
Executing the query
A query with two nodes
Disjunctions, regular expressions and set enumerations
Types of relations (links)
Querying labeled references using the member selector
Subqueries (testing existence, non-existence and number of occurrences)
Looking for small result trees?
Functions
Output filters

Introduction

PML Tree Query (PML-TQ) is a query language and search engine targeted for querying multi-layer annotated treebanks stored in the PML data format. It can be used to query all kinds of treebanks: dependency, constituency, multi-layered, parallel treebanks, as well as other kinds of richly structured types of annotation.

The query language is declarative and offers both textual and graphical representation of queries (note: in the current version of the WEB-based interface, only the textual representation of queries is available). There are two implementations of the query engine, one based on a relational database (Oracle or PostgreSQL >= 8.4), the other based on Perl and the TrEd toolkit. Three user interfaces are available: a WEB-based interface for the database-based query engine displaying query results as SVG, a full-featured graphical user interface for both engines available as a plug-in to the tree editor TrEd, and a text-only command-line interface.

This tutorial focuses on the WEB-based interface. For a tutorial dedicated to the TrEd client, as well as for further information on the query language, please refer to the PML-TQ manual.

PML-TQ main features

queries can span over all layers of annotation (including annotation dictionaries)
allows arbitrary logical constraints
supports output filters (generate custom text output, compute statistics, ...)
the WEB-based client works without installation, in a web browser

Basic concepts

A PML-TQ query consists of a selective part that selects nodes from the treebank and an optional sequence of output filters that are used to extract data from the matching nodes, post-process the results, compute statistics, generate tabular output, etc.

The selective part of a PML-TQ query postulates requirements on one or more nodes from the treebank and their mutual relationships (e.g. on the topological configuration in the tree structure). It is formed by one or more node selectors, which form the outermost scope of the query. Inner scopes of the query are given by nested subqueries as described later.

A node selector represents a node in the treebank of a certain type (in the PML data model, the nodes in the treebank annotation can be typed; the query can also refer to several annotation layers with different types of nodes) and postulates constraints on its properties including relationships to nodes represented by other selectors.

Selectors may nest other selectors; a nested selector belongs to the same scope as the containing selector. The nested selector may explicitly specify the relation of its matching node to the node matched by the containing selector; the default relation is child. The nesting of selectors can thus naturally follow the topology of the matching tree.

Selectors can also be named and referred to from other node selectors; however, in many cases, the need for explicitly naming them can be eliminated by nesting.

A match of a query is a mapping which assigns to each outermost-scoped selector a node from a treebank (called a matching node) of the type specified by the selector, in such a way that all the matching nodes are mutually distinct and simultaneously satisfy the constraints postulated by their corresponding selectors (including constraints on their mutual relationships). The match can be represented as a tuple of the matching nodes ordered according to some canonical ordering of the selectors from the outermost scope of the query. There can be zero, one, or more distinct matches of the query in the treebank (two matches are distinct if, as ordered tuples, they differ in at least one node).

Non-identity rule: Two distinct selectors in the same scope of the query always represent two distinct nodes in each match of the query or sub-query (unless explicitly specified otherwise in the query).
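The definition of a match and the non-identity rule can be illustrated with a small Python sketch. This is only a toy model under stated assumptions, not the PML-TQ engine: the treebank `NODES`, the helper `find_matches`, and the constraint functions are all hypothetical names introduced for illustration.

```python
from itertools import permutations

# Hypothetical toy treebank: node id -> (functor, parent id).
NODES = {
    1: ("PRED", None),
    2: ("ACT", 1),
    3: ("PAT", 1),
    4: ("RSTR", 3),
}


def find_matches(selector_count, constraints):
    """Return every match as a tuple with one node id per selector.

    permutations() only yields tuples of mutually distinct ids, which
    models the non-identity rule.
    """
    return [
        combo
        for combo in permutations(sorted(NODES), selector_count)
        if all(check(combo) for check in constraints)
    ]


def first_is_pred(combo):
    # Constraint of the first selector: functor equals "PRED".
    return NODES[combo[0]][0] == "PRED"


def second_is_child_of_first(combo):
    # Relationship constraint: the second node's parent is the first node.
    return NODES[combo[1]][1] == combo[0]


def both_are_children_of_node_1(combo):
    # Two selectors with the same constraint: parent is node 1.
    return all(NODES[node_id][1] == 1 for node_id in combo)


pred_with_child = find_matches(2, [first_is_pred, second_is_child_of_first])
siblings = find_matches(2, [both_are_children_of_node_1])

print(pred_with_child)  # [(1, 2), (1, 3)]
print(siblings)         # [(2, 3), (3, 2)]
```

In this sketch, `(2, 3)` and `(3, 2)` count as two distinct matches because they differ as ordered tuples, while `(2, 2)` never appears because both selectors belong to the same scope.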

Selectors can postulate the following types of constraints:

predicates
references to other selectors
subqueries
boolean combinations of the above

In the following descriptions, we refer to the selector postulating a constraint as the current selector.

Predicate constraints assert equality, inequality, or regular expression match between values computed from terms. An atomic term is a constant (integer, float, or character string), or an attribute of a node matched by the current selector or some other selector in the current or outer scope of the query. A term is either an atomic term or a term obtained from other terms using arithmetical (+, *, -, div, mod) or string (concatenation & ) operators, or functions.
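As a rough illustration of how terms are combined and then compared by a predicate, consider the following Python sketch. It is not PML-TQ syntax: the node dictionary and the attribute names `t_lemma` and `ord` are hypothetical, Python's `//`, `%` and `+` stand in for `div`, `mod` and `&`, and the use of `re.fullmatch` is an assumption, since this section does not say how the engine anchors regular expressions.

```python
import re

# Hypothetical node with three attributes (atomic terms).
node = {"functor": "PRED", "t_lemma": "run", "ord": 7}

# Terms obtained from other terms.
arithmetic_term = node["ord"] * 2 + 1                   # arithmetical operators
div_term = node["ord"] // 2                             # stands in for div
mod_term = node["ord"] % 2                              # stands in for mod
string_term = node["functor"] + "/" + node["t_lemma"]   # stands in for &

# Predicates compare the values computed from terms.
checks = {
    "equality": node["functor"] == "PRED",
    "inequality": arithmetic_term != 0,
    "regex": re.fullmatch(r"P.*", node["functor"]) is not None,
    "computed": string_term == "PRED/run",
}

print(arithmetic_term, div_term, mod_term)  # 15 3 1
print(all(checks.values()))                 # True
```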

A reference is a constraint on the relationship of a node matched by some named selector to the node matched by the current selector. The referred selector must either belong to the same scope as the current selector or to its outer scope.

A subquery is formed by a selector (called the leading selector of the subquery) nested in the current selector and augmented by restrictions on the number of occurrences, computed as the number of distinct nodes matched by the leading selector of the subquery relative to a fixed match of the selectors in the current and outer scope (including the current selector). For example, to postulate a constraint that each node matched by the current selector must have at least two child nodes, we create a subquery in the form of a nested selector in the child relation to the current selector and restrict the number of occurrences to two or more.

The leading selector can nest other selectors. Each subquery starts a new scope whose outer scope is the scope of the containing selector together with the containing selector's outer scope (if any). Unlike selectors from the outermost scope, selectors declared within a subquery do not represent any particular node in the resulting match. They can refer to selectors from the same scope, and also to selectors from the outer scope, but not vice versa (selectors from the outer scope cannot refer to the selectors in the subquery).

A subquery constraint is verified as follows: for each match of the selectors in the current and outer scope, all matches of the subquery are located (these may coincide with nodes matched by the selectors in the outer scope). The number of distinct nodes matched by the leading selector of the subquery is counted and this number is compared with the restrictions on the number of occurrences. The constraint is satisfied if and only if these restrictions are met.
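The verification procedure can be sketched in Python as follows. This is a simplified model, assuming a subquery whose leading selector is in the child relation and has no further constraints; the tree `PARENT` and the helper names are hypothetical and do not come from the PML-TQ implementation.

```python
# Hypothetical toy tree: node id -> parent id (None marks the root).
PARENT = {1: None, 2: 1, 3: 1, 4: 3}


def is_child(node_id, current):
    # Leading selector of the subquery: a node in the child relation.
    return PARENT[node_id] == current


def count_occurrences(current, leading_selector):
    """Count the distinct nodes matched by the leading selector,
    relative to a fixed node matched by the current selector."""
    return len({n for n in PARENT if leading_selector(n, current)})


def satisfies(current, minimum=0, maximum=None):
    """Compare the count with the restrictions on the number of occurrences."""
    count = count_occurrences(current, is_child)
    return count >= minimum and (maximum is None or count <= maximum)


at_least_two_children = [n for n in sorted(PARENT) if satisfies(n, minimum=2)]
no_children = [n for n in sorted(PARENT) if satisfies(n, maximum=0)]

print(at_least_two_children)  # [1]
print(no_children)            # [2, 4]
```

The children counted by the subquery are not part of the resulting match; only the node matched by the current selector is returned.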

A constraint can also be a boolean combination of other constraints; a nested node selector occurring in a boolean combination with other nested node selectors or constraints is considered to be a subquery with at least one occurrence.
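The following Python sketch shows one way to read this rule: inside a disjunction, a nested selector behaves as an existence test (at least one occurrence). The data and the helper `has_child_with_functor` are hypothetical and serve only as an illustration.

```python
# Hypothetical toy treebank: node id -> attributes.
NODES = {
    1: {"functor": "PRED", "parent": None},
    2: {"functor": "ACT", "parent": 1},
    3: {"functor": "PAT", "parent": 1},
    4: {"functor": "RSTR", "parent": 3},
}


def has_child_with_functor(node_id, functor):
    # Subquery with at least one occurrence: some child has this functor.
    return any(
        attrs["parent"] == node_id and attrs["functor"] == functor
        for attrs in NODES.values()
    )


# Current selector: functor equals "ACT" OR it has a child with functor "RSTR".
selected = [
    node_id
    for node_id in sorted(NODES)
    if NODES[node_id]["functor"] == "ACT" or has_child_with_functor(node_id, "RSTR")
]

print(selected)  # [2, 3]
```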

A PML-TQ query can be visualized as a graph consisting of one or more trees whose nodes are the selectors connected by edges according to the nesting of selectors and subqueries. In this sense we may sometimes refer to selectors as query nodes and to the query as query graph or query tree (a technical root can be added above all the trees so that a forest becomes a single tree). The edges can be labeled or colored to represent different relationships between nodes. References to named selectors can be represented by an additional layer of links (edges) in the graph that may go across the basic tree structure of the query tree.

Tutorial

The purpose of this tutorial is to show how to create and run queries from the PML-TQ WEB-based client, searching treebanks hosted at the Lindat/Clarin web pages.

As our examples, we use queries over the Prague Dependency Treebank 3.0; conceptually similar queries can be applied to most other treebanks, although the node types and attributes will probably be different.

The tutorial gradually passes from very simple to complex queries and demonstrates various common syntactic constructions of the PML-TQ language.

Getting started

The PML-TQ provides a client interface in the form of a web application that can be accessed by any web browser capable of combining JavaScript, CSS, and SVG (Scalable Vector Graphics), such as Firefox, Google Chrome, Opera browser, and Safari.

Unlike the TrEd interface, this interface does not require any installation, but it lacks some features, such as a graphical query builder and a graphical representation of the query (queries must be entered in text form), and of course it does not support querying local files.

To access the PML-TQ servers hosted at Lindat/Clarin, go to lindat.cz and click on TreeQuery in the top menu. A starting PML-TQ web page will be displayed (see Figure 1, “The PML-TQ web-service at Lindat/Clarin web pages”).

Figure 1. The PML-TQ web-service at Lindat/Clarin web pages

Below, you can see two lists – a list of recently used treebanks and a list of featured treebanks. Clicking on any of the listed treebanks will connect you directly to the server for the given treebank.

A list of all available treebanks can be accessed by clicking on Browse Treebanks, as demonstrated in Figure 2, “The list of available treebanks at Lindat/Clarin web pages”.

Figure 2. The list of available treebanks at Lindat/Clarin web pages

Here, you can filter the treebanks according to their public availability (i.e. accessibility of the server without login), language and other tags. By clicking on a single treebank from the (filtered) list below, you will get connected to the search server for the respective treebank. For example, if you select the Prague Dependency Treebank 3.0 (PDT 3.0), you will get to the following web page (Figure 3, “The help page for the PDT 3.0 treebank”):

Figure 3. The help page for the PDT 3.0 treebank

It is a very short introduction to the query language that should help you start searching in the treebank if you do not wish to read through this lengthy tutorial. To proceed to the page where you can actually enter a search query, click on .

The following web page will be displayed (Figure 4, “The start page for searching in the PDT 3.0 treebank”):

Figure 4. The start page for searching in the PDT 3.0 treebank

A simple query

Now we may create our first simple query. We shall search for all nodes of the type t-node (tectogrammatical nodes in PDT 3.0) whose attribute functor equals PRED (Predicate). In the web client, the query can be created in two ways:

Method 1: Click on in the toolbar; a list of available nodes for PDT 3.0 is displayed:
Figure 5. A part of the list of available node types for PDT 3.0

Choose t-node. The string t-node will be copied to the text area below the toolbar.
Properties of a node follow its type, enclosed in square brackets: type [ and choose a t-node attribute functor from menu in the toolbar. Next, select operator = from the Comparison group in the menu in the toolbar, and type "PRED". Finish by closing the definition of the t-node by ]. Figure 6, “A simple query searching for Predicates in PDT 3.0” shows what the query should look like (the spaces are optional).
Figure 6. A simple query searching for Predicates in PDT 3.0
Method 2: In the query text area, start typing t-. The popup menu with possible node types will be offered:
Figure 7. List of available node types for PDT 3.0 containing the string t-

Choose t-node and continue typing the query, i.e. [ fu. After you start typing the name of the attribute, another popup window with possible attributes is offered:
Figure 8. List of available attribute names in PDT 3.0 containing the string fu

Choose functor and finish typing the rest of the query, i.e. ="PRED"].
The resulting query should be the same as in Figure 6, “A simple query searching for Predicates in PDT 3.0” (again, the spaces are optional), i.e.:
```
t-node [ functor="PRED" ]
```
try the query

Note

Throughout the tutorial, you can use the try the query button placed below the examples to go directly to the web client and try the example. Such permanent links to queries in the web client can be created using the control on the right side below the query text area. Please note that the permanent link is a link to the given corpus and the textual query; the order of the result trees is non-deterministic, due to properties of the underlying database.

Executing the query

To execute the query, press Execute Query below the query text area. The query gets processed by the server and the result is displayed. Figure 9, “A result tree for the query searching for Predicates in PDT 3.0” shows the first matching tree.

Figure 9. A result tree for the query searching for Predicates in PDT 3.0

The corresponding sentence is displayed just above the tree. Try clicking on the individual words of the sentence and see the animation marking the corresponding nodes in the tree.

Buttons Previous and Next can be used to navigate among the results, and buttons and to see context sentences/trees. To go directly to the N-th result, change the number of the current result ( 1 of 100 ) to a desired number (make sure that the focus is in the result number field) and – in the list of matching nodes – click on 1 t-node . The corresponding matching node in the given result is displayed (and highlighted in the same colour, in this case green).

Note

By default, the search engine returns up to 100 matches (in no particular order), which should be more than sufficient for viewing a few matching examples. This limit can be changed on the right side above the query text area ( Result Limit ), but raising this limit may slow down the search. We shall later see how to compute the number of all matches, using output filters.

A query with two nodes

We shall now make the query more complex by adding another node to it. We shall ask for a t-node with functor "PRED" (Predicate) that has a child with functor "PAT" (Patient).

To add a node to an existing one in the query, you need to specify a type of relation of the new node to the existing one. The list of available relations can be accessed through in the toolbar (see Figure 10, “A part of the list of available standard relations between nodes in PML-TQ”).

Figure 10. A part of the list of available standard relations between nodes in PML-TQ

The default value is child, so the following two queries are equivalent:

t-node [ functor="PRED", t-node [ functor="PAT" ] ]
