CLARA Winter School on New Developments in Computational Linguistics

February 13-17, 2012, Prague, Czech Republic



Invited Speakers

Raffaella Bernardi: Distributional Compositionality

The course is cancelled (due to the illness of the lecturer).

Lecture 1, Lecture 2, Lecture 3

Since Frege, formal semanticists have focused on the interface between syntax and semantics and on the role played by grammatical words in guiding the composition of content words. In brief, the meaning of a word can be an object (e.g. a proper name), a set of objects (e.g. a noun), or a set of sets (e.g. a quantifying determiner); sets are represented as functions, and function application and abstraction are employed to carry out meaning assembly guided by the syntactic structure. The FS view of word meaning has been found unsatisfactory by those who care about empirical analyses: FS models do not handle the richness of lexical meaning and are typically not accompanied by learning methods. The empirical view has led to the development of a framework known as Distributional Semantic Models (DSMs). In brief, the meaning of a content word is approximated by a vector that summarizes its distribution in large text corpora.
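
To make the compositional machinery concrete, here is a minimal Python sketch of this view; the toy domain and lexicon are invented for illustration.

    # Illustrative sketch only: the toy model (domain, lexicon) is invented.
    domain = {"ann", "bob", "carl"}

    ann = "ann"                       # a proper name denotes an object
    student = {"ann", "bob"}          # a noun denotes a set of objects
    sleeps = {"ann", "bob", "carl"}   # an intransitive verb denotes a set too

    # A quantifying determiner denotes a function over sets (a "set of sets",
    # presented here as a characteristic function):
    every = lambda noun: lambda pred: noun <= pred       # noun is a subset of pred
    some = lambda noun: lambda pred: bool(noun & pred)   # intersection non-empty

    # Meaning assembly by function application, guided by the syntax:
    print(ann in sleeps)            # "Ann sleeps" -> True
    print(every(student)(sleeps))   # "Every student sleeps" -> True
    print(some(student)(sleeps))    # "Some student sleeps" -> True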

Recently, these two research trends have converged into Distributional Compositionality: a number of works have investigated how to incorporate compositionality into DSMs, in order to construct vectorial representations for linguistic constituents above the word (e.g. Baroni and Zamparelli, 2010; Clarke et al., 2011; Erk et al., 2010; Grefenstette and Sadrzadeh, 2011; Mitchell and Lapata, 2010; Thater et al., 2010). Moreover, grammatical words, and in particular logical words (quantifying determiners, coordination, negation), which have long been considered part of the formal semantics realm only, have become of interest within the DSM framework too.
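
As a concrete illustration of the simplest composition operations studied in this literature (the additive and multiplicative models of Mitchell and Lapata, 2010), consider the following Python sketch; the toy vectors are invented, whereas real DSMs derive them from corpus co-occurrence counts.

    import numpy as np

    # Toy co-occurrence vectors for two words; invented for illustration.
    red = np.array([2.0, 0.0, 1.0])
    car = np.array([1.0, 3.0, 0.0])

    # Two simple composition functions (Mitchell and Lapata, 2010):
    additive = red + car        # p = u + v
    multiplicative = red * car  # p = u * v, element-wise

    def cosine(u, v):
        """Geometric similarity of two distributional vectors."""
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    # Compare the composed phrase "red car" to another word vector:
    vehicle = np.array([1.0, 2.0, 0.5])
    print(cosine(additive, vehicle), cosine(multiplicative, vehicle))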

DSMs have been evaluated on different semantic tasks. For instance, they have proved very successful at modeling a wide range of lexico-semantic phenomena by geometric methods applied to the distributional space (Turney and Pantel, 2010). Sentential entailment has been the starting point of the logical view on natural language, but it is also a valid task for any theory which aims to capture the semantics of a language. Hence, predicting entailment is a good test-bed for compositional DSMs too.

In the course, after a brief general introduction to FS and DS models, I will introduce various state-of-the-art approaches to composition in DS, discuss the ways in which they have been evaluated, and point to ongoing work in this line of research.

Outline:

First Lecture: Formal Semantics Models

  • Brief introduction: Model, Domain, Function interpretation
  • Meaning of words, sentential entailment
  • Syntax-semantics and meaning of phrases/sentence

Reading:

  • Portner, P. and Partee, B. (eds.) (2002) Formal Semantics: The essential readings. Blackwell

Second Lecture: Distributional Semantics Models

  • Brief introduction to DSMs
  • From content words to grammatical words
  • Composing DS representations: state-of-the-art methods

Reading:

  • P. D. Turney and P. Pantel (2010) From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37, pp. 141-188
  • G. Strang (2009) Introduction to Linear Algebra. Wellesley Cambridge Press
  • S. Evert and A. Lenci (2009) Distributional Semantic Models. Advanced course at ESSLLI 2009
    http://wordspace.collocations.de/doku.php/course:esslli2009:start
  • K. Gimpel (2006) Modeling Topics

Third Lecture: Evaluation Tasks

  • Evaluation of state-of-the-art compositional DSMs

Reading:

  • M. Baroni and R. Zamparelli (2010) Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. Proceedings of EMNLP
  • M. Baroni, R. Bernardi, Q. Do, Ch. Shan (2012) Entailment above the word level in distributional semantics. Proceedings of EACL
  • E. Grefenstette and M. Sadrzadeh (2011) Experimenting with transitive verbs in a DisCoCat. Proceedings of GEMS
  • E. Guevara (2010) A regression model of adjective-noun compositionality in distributional semantics. Proceedings of GEMS
  • W. Kintsch (2001) Predication. Cognitive Science, 25(2): 173-202.
  • J. Mitchell and M. Lapata (2008) Vector-based models of semantic composition. Proceedings of ACL
  • J. Mitchell and M. Lapata (2010). Composition in distributional models of semantics. Cognitive Science 34(8): 1388-1429

Jan Hajič, Jakub Mlynář: The MALACH Project: Research and Access to the Memories of Holocaust Survivors

Lecture, CVHM

The Shoah Foundation Institute's (University of Southern California, Los Angeles, USA) archive of the memories ("testimonies") of Holocaust survivors will be described, together with the technology used for the creation, indexing (cataloguing) and search of the archive. The "Malach" project, which ran from 2002 to 2007, attempted to develop technology for automatic indexing of and access to the archive; its technological achievements will be described. At present, new speech and language translation technologies are being developed to allow for more sophisticated and broader search possibilities in this and similar huge audio and video archives. These technologies will also be presented. The archive itself (or rather, the Access Point to the archive located in the same building) will be shown to interested students later in the week.

Eckhard Bick: DeepDict, a Data-driven Relational Dictionary Tool

DeepDict slides, CG slides

DeepDict is a lexical database with a graphical interface, built from large text corpora annotated with Constraint Grammar dependency links as well as various morphological, syntactic and semantic tags. It allows the user to view collocational and frequency information for typical lexically governed constructions, such as "vote + on + proposal / amendment / report / resolution ...." or "ride / breed / frighten / tame + horse".
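
The underlying idea can be illustrated with a few lines of Python (a toy sketch, not DeepDict's actual pipeline; the dependency triples below are invented, whereas DeepDict extracts them from CG-parsed corpora):

    from collections import Counter

    # Invented (head, relation, dependent) triples; in DeepDict these come
    # from corpora annotated with Constraint Grammar dependency links.
    triples = [
        ("ride", "obj", "horse"), ("ride", "obj", "horse"),
        ("ride", "obj", "bike"), ("tame", "obj", "horse"),
        ("vote", "on", "proposal"), ("vote", "on", "amendment"),
    ]

    counts = Counter(triples)
    total = sum(n for (h, r, d), n in counts.items() if h == "ride")
    # Typical objects of "ride", with frequencies:
    for (head, rel, dep), n in counts.most_common():
        if head == "ride":
            print(f"ride -{rel}-> {dep}: {n}/{total}")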

The lecture series will discuss not only (a) DeepDict's uses and perspectives, but also (b) the annotation system and (c) the Constraint Grammar parsing technology behind it, dedicating one session to each of these issues.

Since DeepDict is available for 10 different languages, it will be possible to accommodate participants' individual language interests to a certain degree. Also, given the modular and language-independent architecture of Constraint Grammar systems, workshop-style discussion of individual project ideas is strongly encouraged.

1. Websites

  • http://gramtrans.com/deepdict/ (The online DeepDict interface)
  • http://gramtrans.com/deepdict/reference/ (Its reference page)
  • http://visl.sdu.dk (SDU university site with the CG parsers etc. that are behind DeepDict),
    e.g. http://beta.visl.sdu.dk/visl/en/parsing/automatic/ (English live parses at various levels).
    For other languages, navigate -> sentence analysis -> machine analysis -> language; or -> language flag -> machine analysis -> flat structure / dependency
  • http://beta.visl.sdu.dk/constraint_grammar.html (Constraint Grammar, short introduction, CG lab and manual, language overview and references)
  • http://corp.hum.sdu.dk (CorpusEye, with corpus search interfaces for the corpora behind DeepDict)

2. Publications

  • Bick, Eckhard (2009). DeepDict - A Graphical Corpus-based Dictionary of Word Relations. Proceedings of NODALIDA 2009. NEALT Proceedings Series Vol. 4. pp. 268-271. Tartu: Tartu University Library. ISSN 1736-6305 (http://beta.visl.sdu.dk/~eckhard/pdf/nodalida2009_deepdict.pdf)
  • Karlsson et al. (1995). "Constraint Grammar - A Language-Independent System for Parsing Unrestricted Text". Mouton de Gruyter (The seminal book on CG, especially the initial chapters, accessible at http://books.google.com)
  • Further, interest-based reading: The Eckhard Bick publications page with articles on various aspects of Constraint Grammar and its applications, targeting a range of languages:
    http://beta.visl.sdu.dk/Artikeloversigt.html

3. Annotation, category definitions

  • http://beta.visl.sdu.dk/tagset_cg_general.html (Cross language Constraint Grammar tags)
  • http://beta.visl.sdu.dk/semantic_prototypes_overview.pdf (Semantic prototypes)
  • Depending on individual interests, language-specific definition overviews from http://beta.visl.sdu.dk/lecture_notes.html,
    e.g. Portuguese: http://beta.visl.sdu.dk/visl/pt/info/symbolset-manual.html

Miles Osborne: Getting stuff done with Big Data

Lecture 1, Lecture 2, Lecture 3

Suppose you have one billion tweets. How do you process and manage this vast amount of information? These three classes will discuss background ideas associated with Big Data and will give an overview of techniques for dealing with it: Map Reduce, randomised algorithms (fingerprinting, Bloom Filters, Locality Sensitive Hashing) and streaming. Examples from natural language processing will be used. Technical aspects will be kept to a minimum and, where possible, everything will be explained from scratch.

Outline:

Lecture One: Big Data, Economics and Obstacles

This class will look at the problems and challenges associated with processing massive amounts of data using commodity machines (i.e. cloud computing). We will touch upon questions of trust and economics, as well as those aspects of Big Data which make it hard to deal with.

Reading:

  • http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
  • http://queue.acm.org/detail.cfm?id=1563874
  • http://en.wikipedia.org/wiki/Power_law
  • http://perspectives.mvdirona.com/2010/07/13/HighPerformanceComputingHitsTheCloud.aspx
  • http://www.webhostingunleashed.com/features/server-meltdowns-millions-020309/
  • http://www.google.com/governmentrequests/
  • http://net.tutsplus.com/articles/general/supercharge-website-performance-with-aws-s3-and-cloudfront/

Lecture Two: Map Reduce and Hadoop

Given the background material presented in Lecture One, Lecture Two will give an overview of one popular way to solve problems using large numbers of unreliable machines. This class will introduce Hadoop (the open-source implementation of Map Reduce) and the Map Reduce programming model, discuss efficiency concerns, and end with a critique.
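
As a minimal illustration of the programming model, here is a word count in the Map Reduce style, simulated in plain Python rather than run on a Hadoop cluster (the two "documents" are invented):

    from itertools import groupby
    from operator import itemgetter

    docs = ["big data big ideas", "data beats ideas"]

    # Map: emit an intermediate (word, 1) pair for every word.
    mapped = [(word, 1) for doc in docs for word in doc.split()]

    # Shuffle: group intermediate pairs by key, as the framework would do
    # across machines.
    mapped.sort(key=itemgetter(0))

    # Reduce: sum the counts for each word.
    for word, pairs in groupby(mapped, key=itemgetter(0)):
        print(word, sum(count for _, count in pairs))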

Reading:

  • http://research.google.com/archive/papers/mapreduce-sigmetrics09-tutorial.pdf
  • http://database.cs.brown.edu/projects/mapreduce-vs-dbms/
  • http://hadoop.apache.org/

Lecture Three: Randomised Algorithms

Sometimes we need to deal with problems that are just too large for our machines. Randomised algorithms allow us to tackle such problems and can be amazingly fast (or compact). However, unlike conventional approaches, they can make mistakes. This class will show how two problems in natural language processing -- representing large language models and finding breaking news in Twitter -- can be solved using Bloom Filters and Locality Sensitive Hashing.
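
The flavour of these techniques can be seen in a minimal Bloom filter sketch in Python (the size and hash scheme below are chosen for illustration, not tuned): it may return false positives, but never false negatives.

    import hashlib

    class BloomFilter:
        def __init__(self, m=1024, k=3):
            self.m, self.k, self.bits = m, k, 0  # m-bit array, k hash functions

        def _positions(self, item):
            # Derive k bit positions from a cryptographic hash; real
            # implementations use cheaper hash families.
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, item):
            for pos in self._positions(item):
                self.bits |= 1 << pos

        def __contains__(self, item):
            return all(self.bits >> pos & 1 for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("breaking news")
    print("breaking news" in bf)  # True
    print("old news" in bf)       # almost surely False (may rarely be True)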

Reading:

  • http://en.wikipedia.org/wiki/Universal_hashing
  • http://en.wikipedia.org/wiki/Randomized_algorithm
  • http://en.wikipedia.org/wiki/Bloom_filter
  • http://en.wikipedia.org/wiki/Locality_sensitive_hashing
  • http://homepages.inf.ed.ac.uk/miles/papers/naacl10a.pdf
  • http://homepages.inf.ed.ac.uk/miles/papers/acl07.pdf

Blaise Thomson: Statistical Spoken Dialogue Systems

Lecture 1, Code, SkLearn installer, Lecture 2, Lecture 3

Speech is an increasingly important medium of interaction with computer systems. With the advent of mobile applications, this is becoming even more important as people find it easier to talk to their phones than to type on them. This lecture series will discuss how to build systems which interact via speech, called spoken dialogue systems, using statistical techniques. Topics covered will include supervised learning of shallow semantics and dialogue acts from text, reinforcement learning of dialogue strategies, Markov Decision Processes, and Partially Observable Markov Decision Processes.
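
At the heart of the POMDP approach is a belief update over hidden dialogue states, b'(s') ∝ P(o|s') Σ_s P(s'|s,a) b(s). Here is a minimal Python sketch with an invented two-state user-goal model and made-up probabilities (the action dependence is omitted for brevity):

    # Invented two-state user-goal model; all probabilities are made up.
    states = ["wants_food", "wants_hotel"]
    belief = {"wants_food": 0.5, "wants_hotel": 0.5}

    # P(s'|s): the user's goal rarely changes within a turn.
    transition = {s: {s2: 0.9 if s2 == s else 0.1 for s2 in states}
                  for s in states}

    # P(o|s'): likelihood of the noisy speech-recognition hypothesis "food".
    obs_likelihood = {"wants_food": 0.7, "wants_hotel": 0.2}

    # Belief update: predict with the transition model, weight by the
    # observation likelihood, then renormalise.
    predicted = {s2: sum(transition[s][s2] * belief[s] for s in states)
                 for s2 in states}
    unnorm = {s2: obs_likelihood[s2] * predicted[s2] for s2 in states}
    norm = sum(unnorm.values())
    belief = {s2: p / norm for s2, p in unnorm.items()}
    print(belief)  # probability mass shifts towards wants_food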

Outline:

Lecture One: Introduction, Inputs & Outputs

Reading:

  • Pieraccini et al. (2009) Are We There Yet? Research in Commercial Spoken Dialog Systems. In Text, Speech and Dialogue, 12th International Conference, TSD 2009, Pilsen, Czech Republic
  • F. Mairesse et al. (2009) Spoken language understanding from unaligned data using discriminative classification models. In ICASSP 2009, Taiwan
  • F. Jurcicek et al. (2009) Transformation-based Learning for Semantic Parsing. In Interspeech 2009, Brighton, UK
  • Luke S. Zettlemoyer and Michael Collins (2009) Learning Context-dependent Mappings from Sentences to Logical Form. In Proceedings of the Joint Conference of the Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP)

Lecture Two: Dialogue belief modeling

Reading:

  • C. Bishop (2006). Pattern Recognition and Machine Learning. Chapter 8.
    http://research.microsoft.com/en-us/um/people/cmbishop/prml/Bishop-PRML-sample.pdf
  • B. Thomson and S. Young (2010) Bayesian update of dialogue state: A POMDP framework for spoken dialogue systems. Computer Speech & Language, vol. 24, no. 4, pp. 562-588
  • B. Thomson et al. (2010) Parameter learning for POMDP spoken dialogue models. In IEEE SLT'10: Spoken Language Technology Workshop, pp. 271-276

Lecture Three: Dialogue policy learning

Reading:

  • Sutton & Barto (1998) Reinforcement Learning
  • E. Levin et al. (1998) Using Markov decision process for learning dialogue strategies. In ICASSP
  • N. Roy, J. Pineau and S. Thrun (2000) Spoken Dialog Management for Robots. In Association for Computational Linguistics. Hong Kong, Oct. 2000
  • S. Young et al. (2010) The Hidden Information State model: A practical framework for POMDP-based spoken dialogue management. Computer Speech & Language, vol. 24, no. 2, pp. 150-174
  • J. Williams and S. Young (2007) Partially Observable Markov Decision Processes for Spoken Dialog Systems. Computer Speech & Language, 21(2): 231-422

Ondřej Bojar, Aleš Tamchyna, Jan Berka: Wild Experimenting in Machine Translation

Lecture Slides, Lab Tutorial, Lab Slides on Eman, Lab Slides on Addicter

Research in machine translation (MT) is far from a stable state. Many competing paradigms are being examined, many specific setups sound plausible, many hybrid methods are possible. The purpose of the course is to get acquainted with yet another experiment management tool (eman) and to apply it to experiments with (factored) phrase-based MT. We will build on top of the outputs created in the labs of 'Natural Language Processing with Treex' (Martin Popel).

The participants can benefit from the course in two ways:

  • General experience with experimenting in a Unix environment. (Unlike e.g. the Experiment Management System distributed with Moses, eman is versatile and task-independent. It is only the specific set of 'seed scripts' that covers the Moses training and evaluation pipeline.)
  • Exposure to phrase-based MT and exploration of some relevant parameters.

Outline:

(1) Lecture:

  • The inherent limitations of phrase-based vs. syntactic machine translation.
  • Factored models to include linguistic annotation in phrase-based MT.
  • Wild experimenting (with eman) and how to make sense of results.

(2) Lab:

  • Working with eman.
  • Using eman for Moses experiments.
  • Baseline and factored phrase-based English->Tamil translation.
  • Collecting and understanding scores.

(3) Lab:

  • Including Treex into eman.
  • Applying Treex pre-processing for English->Tamil MT.
  • (Time-permitting:) Visualisation of MT errors.

Reading:

  • Moses Manual ... especially Section 4 Background
    http://www.statmt.org/moses/manual/manual.pdf
    You may want to get Moses running on your laptop, but you will be provided with Unix servers with Moses pre-installed.
  • Bash
    • (For Windows users: Console Crash-Course
      http://www.ibm.com/developerworks/linux/library/l-roadmap2/index.html)
    • Introductory slides with some examples:
      http://www.cv.nrao.edu/~jmalone/talks/bash.pdf
    • Bash manual page
      http://www.gnu.org/software/bash/manual/html_node/index.html
    • Console text editors to your taste:
      a) nano: http://www.howtogeek.com/howto/42980/the-beginners-guide-to-nano-the-linux-command-line-text-editor/
      b) vim: http://blog.interlinked.org/tutorials/vim_tutorial.html
      c) whichever else you like

Martin Popel: Natural Language Processing with Treex

Lecture 1, Installation guide, First steps

Treex is a highly modular, multi-purpose, multilingual, easily extendable Natural Language Processing framework. A number of NLP tools are already integrated in Treex, such as morphological taggers, lemmatizers, named entity recognizers, dependency parsers, constituency parsers, and various kinds of dictionaries. Treex allows storing all data in a rich XML-based format as well as several other popular formats (CoNLL, Penn MRG), which simplifies data interchange with other frameworks. Treex is tightly coupled with the tree editor TrEd, which allows easy visualization of syntactic structures. Treex is language-universal and supports processing of multilingual parallel data. Treex facilitates distributed processing on a computer cluster. One of the most sophisticated applications developed in Treex is the deep-syntactic machine translation system TectoMT.

Outline:

(1) Lecture: Introduction to Treex and its main features
(2) Lab: Installing Treex, using it for tagging and parsing
(3) Lab: Implementing new Treex blocks for SMT preprocessing

Treex (formerly called TectoMT) is described in Popel and Žabokrtský (2010). If you plan to install it on your notebook, we recommend learning the basics of Perl and following the installation guide.




Program

Time slots: 9:30 -- 11:00, 11:30 -- 13:00, 14:30 -- 16:00, 16:30 -- 18:00

Monday, Feb 13: Eckhard Bick 1 (S3) | Blaise Thomson 1 (S3) | Miles Osborne 1 (S3)
Tuesday, Feb 14: Miles Osborne 2 (S3) | Jan Hajič - CVHM (S3) | Eckhard Bick 2 (S3) | Jakub Mlynář - CVHM (Library)
Wednesday, Feb 15: Miles Osborne 3 (S1) | Blaise Thomson 2 (S1) | Eckhard Bick 3 (S1) | Ondřej Bojar 1 (S1)
Thursday, Feb 16: Blaise Thomson 3 (S1) | Martin Popel 1 (S1) | Martin Popel Lab1 (SU2) | Blaise Thomson Lab (SU2) | workshop dinner
Friday, Feb 17: Martin Popel Lab2 (SU2) | Ondřej Bojar Lab1 (SU2) | Ondřej Bojar Lab2 (SU2)
Saturday, Feb 18: CLARA consortium meeting (S6)
Saturday, Feb 18: CLARA fellows meeting (S7)
