Chapter 2 PMLTQ revision on Shakespeare’s dramas

2.1 Corpus: DraCor Shakespeare Drama Corpus

We are going to revise some of the basic concepts of Universal Dependencies and PML-TQ on the DraCor Shakespeare drama corpus, as presented in the TEITOK tool.

Our queries draw on David Crystal’s book Think on My Words and the accompanying website.

2.2 Enter TEITOK

The Shakespeare corpus is here: <>. You don’t need to log in.

In the menu on the left, select the PML-TQ Search option. To get acquainted with the attributes of the nodes in this corpus, click Show Treebank Options below the search field.

There are two types of nodes:

  1. tok (token)

  2. s (sentence)

The tok node has a number of attributes, among which you will recognize column names from the CoNLL-U files output by UDPipe (form, lemma, upos, xpos, head, deprel, and deps):

  1. form

  2. reg: regularized form

  3. upos: universal part of speech

  4. xpos: external part of speech

  5. lemma

  6. deprel: dependency relation to its governing node (head)

  7. head: id of this token’s head

  8. deps: extra dependencies (if available)

  9. bbox

  10. facs

… and a family of universal-feature attributes. As you may remember, CoNLL-U files concatenate the features in a single column, separating them by |. To make searching easier, TEITOK turns each feature into a separate attribute:

  1. feats/Definite

  2. feats/Degree

  3. feats/Foreign

  4. feats/Gender

  5. feats/Mood

  6. feats/NumType

  7. feats/Number

  8. feats/Person

  9. feats/Polarity

  10. feats/Poss

  11. feats/PronType

  12. feats/Tense

  13. feats/VerbForm
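The split from one concatenated feature string into separate attributes can be sketched in Python. This is only an illustration of the idea, not TEITOK's actual implementation; the function name split_feats is hypothetical, and the feature string is a made-up example in CoNLL-U notation.

```python
def split_feats(feats: str) -> dict[str, str]:
    """Turn a CoNLL-U FEATS string into one attribute per feature.

    Hypothetical helper: it mimics the feats/Name naming seen in the
    attribute list, but it is not taken from TEITOK itself.
    """
    # CoNLL-U uses "_" for an empty FEATS column.
    if feats in ("", "_"):
        return {}
    attributes = {}
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        attributes[f"feats/{name}"] = value
    return attributes


# Made-up feature string for a finite third-person singular verb.
example = split_feats("Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin")
for attribute, value in example.items():
    print(attribute, value)
```

With the features stored this way, a query can constrain a single feature (for example feats/Number) without having to match the whole concatenated string.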

To be able to use the attributes in your queries, you can look up the possible values of upos, deprel, and features in the Universal Dependencies documentation.

Both the tok and the s nodes have the following text attributes:

  1. text/title

  2. text/author

  3. text/year

… and an id attribute.

You can look up the text titles in TEITOK by opening a new window with the CQL Search option. The CQL Search has a query builder option right next to the Search button. It contains a drop-down menu with the titles.

2.3 Extract a list of dramas

To start with, let us extract a list of all dramas in the corpus. Extracting a list means applying a summarizing filter to a query. To obtain at least one node from each drama, we have to ask for a node that occurs in every drama. Call this node $title.

How to ask for a token node called $title:

tok $title := [ your constraints on the node ]

  1. tok is the node type.

  2. $title is the name that you gave the node.

  3. := has to come along with the node name.

  4. The square brackets enclose your constraints on the node.

When you have extracted at least one node from each drama, it’s time to filter the data for the titles.

>> for $title.attribute-of-your-node give $1

  1. >> starts a filter.

  2. for introduces your enumeration of nodes and their attributes, separated by commas.

  3. $title.attribute-of-your-node is the name of your node followed by one of its attributes.

  4. give ends the enumeration.

  5. $1 outputs, as a table column, the first node.attribute after >>.

Since you have asked for a text attribute, which has the same value on every node of a drama, you will get one row per drama and do not need to de-duplicate anything.
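As a rough mental model, the summarizing step can be imitated in Python: many matched nodes go in, and each distinct title comes out once. This is a sketch under the assumption that the filter reports each distinct value only once; the node records, the placeholder titles, and the function name distinct_values are all made up for illustration.

```python
# Toy stand-ins for matched tok nodes; forms and titles are placeholders,
# not data from the corpus.
matched_nodes = [
    {"form": "Who", "text/title": "Play A"},
    {"form": "there", "text/title": "Play A"},
    {"form": "When", "text/title": "Play B"},
    {"form": "shall", "text/title": "Play B"},
    {"form": "Now", "text/title": "Play C"},
]


def distinct_values(nodes, attribute):
    """Return each value of `attribute` once, in order of first appearance."""
    seen = {}
    for node in nodes:
        # A dict keeps insertion order and ignores repeated keys.
        seen.setdefault(node[attribute], None)
    return list(seen)


titles = distinct_values(matched_nodes, "text/title")
print(titles)
```

Five nodes collapse into three titles because every node of a drama carries the same text/title value.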

2.4 Extract a list of dramas with their publication years and sort them by year in descending order

Hint: sort by $column-order-number desc

The year attribute is unfortunately implemented as a string, so we cannot extract e.g. dramas published before or after a given year. We can only ask for dramas matching a given year.

This would work if text/year were a numerical value:

>> filter $2 < 1599

but only this works when year is encoded as a string:

>> filter $2 ~ '1599'
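One way around the string limitation is to download the result table and do the numeric comparison outside PML-TQ. The following Python sketch assumes the table has been reduced to (title, year) pairs with the year still stored as a string; the rows are hypothetical placeholders, not real corpus metadata, and the function name with_numeric_year is made up.

```python
# Hypothetical rows of a downloaded result table: (title, year as string).
rows = [
    ("Play A", "1595"),
    ("Play B", "1599"),
    ("Play C", "1606"),
    ("Play D", "1592"),
]


def with_numeric_year(table):
    """Convert the year column from string to integer."""
    return [(title, int(year)) for title, year in table]


numeric_rows = with_numeric_year(rows)

# Numeric comparison now works: dramas dated before 1599.
before_1599 = [row for row in numeric_rows if row[1] < 1599]

# Sort by year in descending order, newest first.
newest_first = sorted(numeric_rows, key=lambda row: row[1], reverse=True)

print(before_1599)
print(newest_first)
```

The conversion with int() fails loudly on any value that is not a plain number, which is a useful check if some texts in a table lack a clean year.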

2.5 A study on the prefix un-

David Crystal says that Shakespeare coined a few new words, notably with the negative prefix un-, most of them occurring in the late plays after 1600, especially in Richard II, Macbeth, Troilus, and Hamlet. PML-TQ gives us no way to identify a neologism, but we can download the resulting table and analyze it in Tableau Public.

Hint: to say that a token starts with something, use form ~ "^[uU]n".
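To see what the pattern from the hint selects, you can try it on a few word forms in Python. This is a sketch with Python's re module, which may differ in details from the regular-expression engine behind PML-TQ; the word list and the function name starts_with_un are made up for illustration.

```python
import re

# The pattern from the hint: "un" or "Un" at the very start of the form.
UN_PREFIX = re.compile(r"^[uU]n")


def starts_with_un(form: str) -> bool:
    """True if the word form begins with un or Un."""
    return UN_PREFIX.search(form) is not None


# Illustrative word forms, not query results.
forms = ["unreal", "Unsex", "under", "uncle", "sun", "run"]
for form in forms:
    print(form, starts_with_un(form))
```

Note that the pattern also accepts forms such as under and uncle, where un is not a negative prefix, so the downloaded table still needs manual checking.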