Principal investigator (ÚFAL): 
Grant id: 
ÚFAL budget: 
1,416,000 CZK
2014 - 2016


Sentence structure induction without annotated corpora

Syntactic analysis of sentences is one of the fundamental problems of computational linguistics. At present, mainly supervised approaches are used; they need large syntactically annotated corpora (treebanks) to learn the syntax of a language. Their disadvantage is the financial and time cost of building such corpora and the need to create a new treebank for each additional language.

In this project, we will work on an alternative method. The syntactic relations will be learned automatically from text corpora with no linguistic annotation. These "unsupervised" methods have recently become very popular, and it turns out that for certain types of tasks they outperform the supervised methods. Their advantages are their simplicity and their linguistic and domain independence. We will test the induced grammar models in applications where simple n-gram models currently outperform syntactic ones, for example in machine translation. Our hypothesis is that syntactic models based solely on data, not on linguistic rules, can improve machine translation results.

LiStr: Linguistic Structure induction toolkit

This toolkit comprises the tool for unsupervised grammar induction and the associated scripts that were used for the experiments in this grant project. You can download it from the LINDAT/CLARIN repository.

1st year progress

The following goals were proposed for the year 2014:
  a) preparation of training and testing treebanks for many different languages,
  b) work on the LiStr toolkit, using gold POS tags for sentence structure induction and testing different grammar models,
  c) finding the best model, i.e. the one that generates linguistic structures most similar to human annotations (treebanks).
Progress in the particular goals:

a) We have participated in the development of the HamleDT collection of dependency treebanks. The aim of HamleDT is to collect treebanks for many languages and harmonize them into standardized annotation styles (the Prague Dependency Treebank style [1] and the Stanford style [2]). The resulting data is useful in many other projects and experiments working with sentence structures. For sentence structure induction in particular, the standardized annotation style allows us to evaluate the induced structures across different languages without being affected by differences between the original annotation styles. Our part of the work on the HamleDT collection was the analysis and normalization of the Dutch and Russian treebanks into the common styles. We have written two publications describing the data preparation, [1] and [2], with dedications to other projects as well, since many people from other projects also participated. We now have more than 30 treebanks of different languages available for testing our LiStr tools.

b) We have developed the software toolkit "LiStr", which comprises several tools for linguistic structure induction. The package contains the following tools:
   1) UDPC, an unsupervised dependency parser: the previous parser written in Perl was reimplemented in C++ and many functions were optimized, so that the result is more than 100 times faster than the previous implementation. Several other functions and methods were added: block sampling using dynamic programming (to sample the whole sentence structure at once) and DMV (Dependency Model with Valence) using the stop model instead of the fertility model,
   2) a script for estimating stop priors from big raw corpora,
   3) a script for estimating fertility priors from big raw corpora,
   4) scripts for estimating fixed stop priors for function words [3],
   5) a script for inducing both the part-of-speech tags and the sentence structure (the experiments on this have not been published yet),
   6) other supporting scripts for data conversion and evaluation.
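The DMV with a stop model mentioned above can be illustrated with a toy generative score. The sketch below is a simplified illustration with made-up parameter values, not the actual UDPC implementation: for every head and direction, each attached child costs (1 - p_stop) * p_attach, and the final decision to generate no further children costs p_stop.

```python
import math

def dmv_log_prob(heads, p_stop=0.7, p_attach=0.1):
    """Log-probability of a dependency tree under a toy DMV with a stop
    model.  heads[i] is the index of the head of token i (-1 = root).
    For each head and direction, every attached child costs
    (1 - p_stop) * p_attach, and the final STOP decision costs p_stop."""
    n = len(heads)
    logp = 0.0
    for h in range(-1, n):                      # -1 is the artificial root
        for left in (True, False):
            children = [d for d in range(n)
                        if heads[d] == h and (d < h if left else d > h)]
            for _ in children:                  # attach each child
                logp += math.log(1 - p_stop) + math.log(p_attach)
            logp += math.log(p_stop)            # finally decide to stop
    return logp
```

In the real model the stop and attach probabilities are conditioned on the head's tag, direction, and adjacency, and are learned by Gibbs sampling; here they are constants purely for readability.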

c) We have tried many different settings of sentence structure induction and compared the results across 20 treebanks against the manually annotated data. The configuration that best resembled the annotated treebanks treats function words specially. Function words (the most frequent words in a particular language, e.g. prepositions, determiners, conjunctions, pronouns, etc.) tend to have a fixed number of dependent words in a dependency structure. For example, prepositions in English almost always have exactly one dependent word on the right side, and pronouns almost always have no dependents. This hypothesis led us to the automatic inference of such features for the most frequent words. We tried to find function words that have a) exactly one left dependent, b) exactly one right dependent, or c) no dependents. When we applied this to our dependency model, we obtained new state-of-the-art results. Compared to [4], where the authors reached 48.5% similarity with human annotations (averaged over 20 different treebanks), we reached an average score of 50.3%. The new model and its results were published in [3] and presented at the CICLing conference, where the paper received the "second best paper award".
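The inference of fixed valence patterns for frequent words can be sketched roughly as follows (a simplified illustration of the idea in [3]; the function name and threshold are ours, and for brevity the patterns are counted over trees, whereas the actual priors are estimated from raw corpora):

```python
from collections import Counter, defaultdict

def function_word_patterns(trees, top_k=10, threshold=0.75):
    """Find the dominant valence pattern of the most frequent words.
    trees: list of (tokens, heads) pairs, heads[i] = head index or -1.
    Returns {word: 'no_dependents' | 'one_left' | 'one_right' | None}."""
    freq = Counter()
    patterns = defaultdict(Counter)
    for tokens, heads in trees:
        for i, w in enumerate(tokens):
            freq[w] += 1
            left = sum(1 for d, h in enumerate(heads) if h == i and d < i)
            right = sum(1 for d, h in enumerate(heads) if h == i and d > i)
            if left == 0 and right == 0:
                patterns[w]["no_dependents"] += 1
            elif left == 1 and right == 0:
                patterns[w]["one_left"] += 1
            elif left == 0 and right == 1:
                patterns[w]["one_right"] += 1
            else:
                patterns[w]["other"] += 1
    result = {}
    for w, total in freq.most_common(top_k):
        pat, count = patterns[w].most_common(1)[0]
        result[w] = pat if pat != "other" and count / total >= threshold else None
    return result
```

A word assigned one of the three patterns then receives a correspondingly fixed stop prior in the dependency model; words with no dominant pattern are left to the general model.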

We also performed experiments comparing fully unsupervised sentence structure induction with semi-supervised methods, namely with the projection of structures from one language to another. The results were published in [5]. The semi-supervised methods turned out to be better for many languages; however, in some cases it is better to induce a sentence structure "from scratch" without any other resources than to use, e.g., an existing English treebank and a parallel corpus.


2nd year progress

In the year 2015, we collected various methods for unsupervised induction of word clusters and part-of-speech tags. In the end, we experimented with:

- Brown's word classes - hierarchical clustering of words based on the n-gram contexts in which they occur [6]
- Clark's word classes - exploiting distributional and morphological information [7]
- Blunsom and Cohn's part-of-speech induction - based on the Pitman-Yor process [8]

We adjusted the UDPC tool for unsupervised dependency parsing so that it can work with word classes instead of part-of-speech tags on the input. The best results were achieved with Alex Clark's word classes, which also use the morphological properties of words. The results were reported on 25 languages from the HamleDT collection of treebanks [9]. We experimented with clustering words into 25, 50, 100, and 200 classes and found that the optimal number of classes differs across languages. For example, Bulgarian, Danish, and Portuguese achieved the best results with 25 classes, whereas German, English, and Greek achieved the best results with 100 classes, measured by the unlabeled attachment score. When taking the best number of classes for each language, we achieved a 42.5% average attachment score over the 25 languages, which is almost the same as with gold part-of-speech tags (43.8%). The dependency structures for some languages were induced very well, e.g. Spanish (60.1%) or Turkish (55.8%); English, Estonian, Romanian, and Swedish also scored better than 50%. See the details in the paper [10].
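The unlabeled attachment score used throughout these comparisons is simply the fraction of tokens whose induced head matches the (harmonized) gold head; a minimal sketch:

```python
def unlabeled_attachment_score(gold_trees, induced_trees):
    """Fraction of tokens whose induced head index equals the gold head.
    Each tree is a list where tree[i] is the head index of token i."""
    correct = total = 0
    for gold, induced in zip(gold_trees, induced_trees):
        for g, p in zip(gold, induced):
            correct += int(g == p)
            total += 1
    return correct / total
```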

We also tried to combine word-suffix induction [11] with unsupervised POS tagging [12]; however, we were not successful: the results were very poor and we were not able to approach the results obtained by the previous word-clustering tools.

However, we developed a new unsupervised part-of-speech guesser, which is based only on general statistics of individual words in the corpus. Such a guesser does not label the words with cluster numbers; it labels them with known tags such as 'NOUN', 'VERB', etc. The statistics, such as word length, word frequency, or the entropy of the preceding and following word, are computed on corpora of languages where the part-of-speech tags are known. Then, using the k-nearest-neighbors (KNN) algorithm, we find the k closest words to a word in a new language for which we do not have the POS tags and assign it the tag that is most frequent among those k closest words. Even though the statistics used were very basic, we achieved quite good results: more than 60% accuracy for Czech, Spanish, Italian, Dutch, and Portuguese when training on seven other languages. This "delexicalized" tagger is also incorporated in the new version of the LiStr toolkit (directory 'posguess'). We also submitted a paper called "If you even don't have a bit of Bible: learning delexicalized POS taggers" to the LREC 2016 conference.
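The delexicalized guessing scheme can be sketched as follows. This is a simplified illustration using only three of the statistics mentioned above (word length, log relative frequency, and right-neighbor entropy) and plain squared Euclidean distance; the feature set and distance in the actual 'posguess' tool differ:

```python
import math
from collections import Counter, defaultdict

def word_features(corpus):
    """Per word type: (length, log relative frequency, entropy of the
    following word) -- all language-independent statistics."""
    freq = Counter(w for sent in corpus for w in sent)
    total = sum(freq.values())
    right = defaultdict(Counter)
    for sent in corpus:
        for a, b in zip(sent, sent[1:]):
            right[a][b] += 1
    feats = {}
    for w in freq:
        n = sum(right[w].values())
        ent = -sum(c / n * math.log(c / n) for c in right[w].values()) if n else 0.0
        feats[w] = (len(w), math.log(freq[w] / total), ent)
    return feats

def knn_tag(word, feats_src, tags_src, feats_tgt, k=3):
    """Tag a target-language word by majority vote of the k source-language
    words whose feature vectors are closest to it."""
    x = feats_tgt[word]
    nearest = sorted(tags_src,
                     key=lambda w: sum((a - b) ** 2
                                       for a, b in zip(x, feats_src[w])))
    return Counter(tags_src[w] for w in nearest[:k]).most_common(1)[0][0]
```

Note that no parallel data is involved: the source side needs tagged text, the target side only raw text large enough for the statistics to be reliable.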

We also started working on an unsupervised parser utilizing the characteristics of individual part-of-speech tags. Because current treebanking tends towards one annotation scheme (HamleDT [9], Universal Dependencies [13]) and one common tagset, it makes little sense to induce structures from an unknown part-of-speech tagset. When we have a tagset for a language, we also have its conversion to the universal tagset, and thus we know which tags describe particular types of words. Therefore, we have prior knowledge about tags that behave similarly in dependency structures across different languages. For instance, in Universal Dependencies [13], all function words (determiners, prepositions, pronouns, conjunctions, auxiliary verbs, etc.) are leaves in the universal dependency structure. We can therefore assign them a very high prior probability of having no dependents in the unsupervised dependency parser; conversely, verbs get a high probability of having dependents. Such prior knowledge proved to improve the unsupervised dependency parser by a wide margin: the unlabeled attachment score averaged over 31 tested languages from the Universal Dependencies collection rose from 43% to 60%. The files with the prior probabilities, which can be used instead of the probabilities based on the reducibility principle [14], were also included in the new version of the LiStr toolkit (directory 'udp_ud'). We plan to submit a paper about these experiments to the ACL 2016 conference.
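The effect of such tag-based priors can be sketched with a toy leaf-prior table. The values below are illustrative only, not the ones shipped in the 'udp_ud' directory; the sketch just shows how the prior favors trees in which function words are leaves:

```python
import math

# Illustrative leaf priors, NOT the values shipped with LiStr:
# function-word tags are almost always leaves, verbs almost never.
LEAF_PRIOR = {"ADP": 0.95, "DET": 0.95, "AUX": 0.90, "CONJ": 0.90,
              "PRON": 0.85, "NOUN": 0.40, "VERB": 0.10}

def tree_prior_logscore(tags, heads, default=0.5):
    """Log-prior of a tree: a leaf token contributes log P(leaf | tag),
    a token with dependents contributes log(1 - P(leaf | tag))."""
    score = 0.0
    for i, tag in enumerate(tags):
        is_leaf = all(h != i for h in heads)
        p = LEAF_PRIOR.get(tag, default)
        score += math.log(p if is_leaf else 1 - p)
    return score
```

For example, for the tags ["DET", "NOUN", "VERB"], a tree with the verb as the root and the determiner as a leaf scores far higher than one rooted in the determiner, which is exactly the bias the parser needs.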

A new version of the LiStr toolkit has been released.


3rd year progress

In the year 2016, we continued experimenting with minimally supervised linguistic structure induction, in which the induced structures (dependency trees) are motivated by the Universal Dependencies annotation [13].

Our paper "If you even don't have a bit of Bible: learning delexicalized POS taggers" from the year 2015 was accepted at the LREC 2016 conference [15]. The background intuition was that the individual POS categories tend to manifest similar statistical properties across languages (e.g., prepositions tend to be short and relatively frequent, show different patterns of conditional entropy to the left versus to the right, and exhibit a certain asymmetry of occurrences along sentence length). We employed language-independent features such as word length, frequency, neighborhood entropy, character classes (alphabetic vs. numeric vs. punctuation), etc. The models trained on annotated corpora in Universal Dependencies can then be used to tag any language for which a sufficient amount of raw text is available. In most cases, the tagging accuracy improved over the most-frequent-tag baseline. The performance is well below the results achieved by contemporary methods based on parallel data; however, it is completely independent of the existence of any parallel or comparable corpora or dictionaries. In the follow-up work [16], we tried to combine delexicalized tagging and parsing; however, the results were rather poor. It seems that the tagging errors were multiplied by the parsing errors: the parsing results were below the completely unsupervised parsing with unsupervised POS tags we reported in the year 2015.

We also continued working on minimally supervised parsing (MSP), in which the unsupervised parser is enhanced by a couple of language-independent rules that guide the structures to resemble the desired Universal Dependencies annotation style. The paper submission to ACL 2016 was rejected, and we decided to extend this work by a comparison with delexicalized transfer parsers (DTP). A delexicalized transfer parser (see e.g. [21]) is trained only on part-of-speech tags and can be transferred to any language using the same tagset, presuming that words with the same POS tags behave similarly in dependency trees across different languages. MSP and DTP are similar approaches: both need a part-of-speech tagset shared across languages, which is true for Universal Dependencies treebanks, and both exploit part-of-speech-based regularities found in languages with annotated dependency structures to parse languages for which no dependency annotation exists. However, while a DTP is trained directly on part-of-speech tags in a supervised way, our MSP approach only uses a few basic hand-written rules to support the otherwise unsupervised parser. We evaluated these two approaches on Universal Dependencies treebanks and described the results in [17]. We found that whereas the delexicalized transfer parser is better on average, our minimally supervised parser performs better on many non-Indo-European languages and can therefore be suitable for parsing low-resourced "exotic" languages, for which treebanks often do not exist. The languages for which MSP outperformed DTP were, for example, Estonian, Basque, Ancient Greek, Hindi, Hungarian, and Tamil. The average unlabeled attachment score over all 32 languages in the Universal Dependencies collection was 60.5% for MSP and 63.8% for DTP.

The third area of research in 2016 was employing unsupervised parsing methods in machine translation:
a) We first experimented with unsupervised segmentation of parallel dependency trees [18]. We introduced a novel unsupervised method for parallel tree segmentation based on Gibbs sampling. Using the data from the Czech-English parallel treebank CzEng [22], we showed that the procedure converges to a dictionary containing reasonably sized treelets. Many of the induced treelets have linguistic substantiation: e.g., collocations like "European Union" tend to constitute treelets of their own, and some treelets capture parallel verb valency, such as the aligned prepositions [for – na] that are attached to their governing verbs [wait – čekat] in a bi-treelet rather than to their children, as is usual. This research was planned to be applied to parallel tree structures obtained in an unsupervised way; however, we have developed a better approach to parallel tree-structure alignment (described in the next paragraph), and therefore this one was not pursued further.
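The segmentation state in such a sampler can be represented by binary boundary flags on tree edges; given fixed flags, extracting the treelets reduces to finding the connected components of the uncut tree. The sketch below shows only this extraction step (the Gibbs resampling of the flags against the treelet dictionary is omitted, and the function name is ours):

```python
def extract_treelets(heads, boundary_edges):
    """Split a dependency tree into treelets.  heads[i] is the head of
    token i (-1 = root); a (child, head) pair in boundary_edges cuts the
    edge, making the child the root of its own treelet."""
    def treelet_root(i):
        while heads[i] != -1 and (i, heads[i]) not in boundary_edges:
            i = heads[i]                 # climb until a cut edge or the root
        return i
    groups = {}
    for i in range(len(heads)):
        groups.setdefault(treelet_root(i), []).append(i)
    return sorted(groups.values())
```

During sampling, each edge's flag is repeatedly resampled in proportion to how well the resulting treelets match the current dictionary counts, which is what drives the convergence reported in [18].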
b) We introduced so-called "merged trees" [19], another approach to the alignment of parallel trees. In this approach, parallel sentences from two languages are represented by a single dependency tree. Each node of the tree consists of two word forms (e.g. "cat-kočka") and two POS tags (e.g. "NOUN-NOUN"). Words that do not have a counterpart in the other sentence are also represented by nodes, and the missing counterpart is marked by a special label. All such nodes representing a single word without any counterpart are leaf nodes, which ensures that the merged tree structure can be simply divided into two monolingual trees that do not include empty nodes. We used this tree merging in English-to-Czech and Czech-to-English machine translation. A treelet-based dictionary is extracted from the merged parallel trees; the source monolingual trees are used to parse the new sentences to be translated, and then the dictionary is applied. Even though the translation model is very simple and uses neither word reordering nor a target-language language model, the results are quite promising. For example, in the Czech-to-English direction, our results (15.6 lowercased BLEU points) are comparable with the established tree-based system TectoMT [23] (14.6 lowercased BLEU points). On the other hand, current state-of-the-art machine translation systems based on recurrent neural networks achieve far better translation quality.
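The division of a merged tree into two monolingual trees can be sketched as follows. The node layout is a simplified assumption of ours (the actual LiStr scripts differ); the key invariant from [19] is visible in the code: a node missing on one side is always a leaf, so dropping it never orphans other nodes.

```python
class MergedNode:
    """One node of a merged bilingual tree: a source and a target word
    (either may be None when the counterpart is missing)."""
    def __init__(self, src, tgt, children=None):
        self.src = src
        self.tgt = tgt
        self.children = children or []

def project(node, side):
    """Extract the monolingual tree for one side ('src' or 'tgt').
    Nodes missing on that side are leaves by construction, so they can
    simply be dropped without reattaching anything."""
    word = getattr(node, side)
    if word is None:
        return None
    kids = [t for t in (project(c, side) for c in node.children)
            if t is not None]
    return (word, kids)
```

For instance, merging "the cat sleeps" with "kočka spí" leaves "the" as a counterpart-less leaf, and projecting the merged tree to the target side yields a Czech tree without it.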

Instead of a final technical report, we published a conference paper [20] comparing different unsupervised parsing approaches and different evaluation methods. We studied various approaches of research teams world-wide from the year 2004 up to now and sorted them into groups according to their motivation, use of resources, and evaluation. We concluded that many methods are unfortunately incomparable due to different evaluation data and different use of resources. We hope that this paper will help new researchers in this field orient themselves in the jungle of different approaches to unsupervised parsing and linguistic structure induction.

We also released the last version of our toolkit, LiStr-16.12, available from the LINDAT/CLARIN repository. The new items in the toolkit are the new universal priors for minimally supervised parsing and the scripts for merging trees and for simple machine translation.



[1] Daniel Zeman, Ondřej Dušek, David Mareček, Martin Popel, Loganathan Ramasamy, Jan Štěpánek, Zdeněk Žabokrtský, Jan Hajič (2014): HamleDT: Harmonized Multi-Language Dependency Treebank. In: Language Resources and Evaluation, ISSN 1574-020X, vol. 48, no. 4, pp. 601-637

[2] Rudolf Rosa, Jan Mašek, David Mareček, Martin Popel, Daniel Zeman, Zdeněk Žabokrtský (2014): HamleDT 2.0: Thirty Dependency Treebanks Stanfordized. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pp. 2334-2341, European Language Resources Association, Reykjavík, Iceland, ISBN 978-2-9517408-8-4

[3] David Mareček, Zdeněk Žabokrtský (2014): Dealing with Function Words in Unsupervised Dependency Parsing. In: 15th International Conference on Computational Linguistics and Intelligent Text Processing, pp. 250-261, Springer, Berlin / Heidelberg, ISBN 978-3-642-54905-2

[4] Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky (2013): Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

[5] Loganathan Ramasamy, David Mareček, Zdeněk Žabokrtský (2014): Multilingual Dependency Parsing: Using Machine Translated Texts instead of Parallel Corpora. In: The Prague Bulletin of Mathematical Linguistics, ISSN 0032-6585, 102, pp. 93-104

[6] Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, Jenifer C. Lai: Class-based n-gram models of natural language. Computational Linguistics 18 (4), 1992.

[7] Alexander Clark: Combining distributional and morphological information for part of speech induction. In Proceedings of the tenth Annual Meeting of the European Association for Computational Linguistics (EACL), pages 59-66, 2003

[8] Phil Blunsom, Trevor Cohn: A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 865–874, Portland, Oregon, June 19-24, 2011

[9] Daniel Zeman, David Mareček, Martin Popel, Loganathan Ramasamy, Jan Štěpánek, Zdeněk Žabokrtský, Jan Hajič: HamleDT: To Parse or Not to Parse? In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), ISBN 978-2-9517408-7-7, pp. 2735-2741. ELDA, Istanbul, Turkey

[10] David Mareček: Multilingual Unsupervised Dependency Parsing with Unsupervised POS Tags. In: Advances in Artificial Intelligence and Soft Computing, Volume 9413 of the series Lecture Notes in Computer Science, pp 72-82, 2015

[11] Jason Naradowsky, Sharon Goldwater: Improving morphology induction by learning spelling rules. In Proceedings of IJCAI. 2009.

[12] Sharon Goldwater, Thomas L. Griffiths: A Fully Bayesian Approach to Unsupervised Part-of-Speech Tagging. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. 2007.


[14] David Mareček and Zdeněk Žabokrtský: Exploiting Reducibility in Unsupervised Dependency Parsing. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 297-307, Jeju Island, Korea, July, 2012

[15] Zhiwei Yu, David Mareček, Zdeněk Žabokrtský, Daniel Zeman (2016): If You Even Don't Have a Bit of Bible: Learning Delexicalized POS Taggers. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pp. 96-103, European Language Resources Association, Paris, France, ISBN 978-2-9517408-9-1

[16] Daniel Zeman, David Mareček, Zhiwei Yu, Zdeněk Žabokrtský (2016): Planting Trees in the Desert: Delexicalized Tagging and Parsing Combined. In: Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation, pp. 199-207, Kyung Hee University, Seoul, Korea, ISBN 978-89-6817-428-5

[17] David Mareček (2016): Delexicalized and Minimally Supervised Parsing on Universal Dependencies. In: Statistical Language and Speech Processing, pp. 30-42, Springer International Publishing, Cham, Switzerland, ISBN 978-3-319-45924-0

[18] David Mareček, Zdeněk Žabokrtský (2016): Gibbs Sampling Segmentation of Parallel Dependency Trees for Tree-Based Machine Translation. In: The Prague Bulletin of Mathematical Linguistics, ISSN 0032-6585, 105, pp. 101-110

[19] David Mareček (2016): Merged bilingual trees based on Universal Dependencies in Machine Translation. In: Proceedings of the First Conference on Machine Translation (WMT). Volume 2: Shared Task Papers, pp. 333-338, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-945626-10-4

[20] David Mareček (2016): Twelve Years of Unsupervised Dependency Parsing. In: Proceedings of the 16th ITAT: Slovenskočeský NLP workshop (SloNLP 2016), pp. 56-62, CreateSpace Independent Publishing Platform, Bratislava, Slovakia, ISBN 978-1537016740

[21] Ryan McDonald, Slav Petrov, Keith Hall (2011): Multi-source Transfer of Delexicalized Dependency Parsers. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pp. 62-72, Association for Computational Linguistics, Stroudsburg, PA, USA


[23] Ondřej Dušek, Zdeněk Žabokrtský, Martin Popel, Martin Majliš, Michal Novák, David Mareček (2012): Formemes in English-Czech Deep Syntactic MT. In: Proceedings of the Seventh Workshop on Statistical Machine Translation (WMT 2012), pp. 267-274, Association for Computational Linguistics, Stroudsburg, PA, USA