Research Program

The main aim of the scientific program is research and development in modern computational linguistics, building on the high level recently attained through the unique multilevel grammatical analysis of a very large corpus of Czech.

The research in the linguistic foundations of computational linguistics has as its main aim the formulation and refinement of a descriptive framework that meets the methodological requirements of formal linguistics and applies the main results of classical linguistics, especially those of the functional-structural Prague School. The core of language structure is handled as a set of underlying (tectogrammatical) sentence representations in the form of dependency trees, as elaborated in the Functional Generative Description of Czech.
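
As an informal illustration only (not the representation actually used in the Functional Generative Description), a dependency tree can be sketched as a structure in which every node except the root has exactly one governor. The class name `Node`, its fields, the example sentence "Petr čte knihu" and the relation labels attached to it are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """One node of a dependency tree (hypothetical structure, for illustration)."""
    lemma: str
    functor: str  # label of the relation to the governing node (illustrative)
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Node") -> "Node":
        """Attach `child` as a dependent of this node and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def descendants(self) -> Iterator["Node"]:
        """Yield all nodes below this one, depth-first, in insertion order."""
        for child in self.children:
            yield child
            yield from child.descendants()


def build_example() -> Node:
    """Build a toy tree for 'Petr čte knihu' ('Petr reads a book')."""
    root = Node("číst", "PRED")      # the verb governs the whole sentence
    root.add(Node("Petr", "ACT"))    # illustrative label for the actor
    root.add(Node("kniha", "PAT"))   # illustrative label for the patient
    return root


tree = build_example()
# Each edge is (governor lemma, relation label, dependent lemma).
edges = [(n.parent.lemma, n.functor, n.lemma) for n in tree.descendants()]
```

The sketch stores only lemmas and relation labels; a real tectogrammatical annotation carries considerably more information per node.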

The framework describes the system of language by relatively simple means, on the basis of structural properties whose handling corresponds to general (innate) human capacities. The semantic-(pragmatic) interpretation, however, is based on knowledge-based inferencing, and the degree of understanding depends on the individual's intellectual faculties. Such an economical description accounts for the interdisciplinary nature of language, especially for the contextual anchoring of sentences.

Although it uses only simple means to describe the core of language (the large and complex periphery has to be accounted for by more specific rules, limited by contextual restrictions), the framework may serve as input to the semantic-(pragmatic) interpretation, which can then be based on intensional (post-Montagovian) semantics, accounting for the truth conditions of sentence tokens as relativized to contexts. A view of the content of a sentence as an operation on the hearer's memory (its context-change potential) can thus be achieved, with all the explicitness of truth-conditional semantics.
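
The idea of sentence content as an operation on the hearer's memory can be sketched, under strong simplifying assumptions, as a function from contexts to contexts. In the sketch below a context is modelled as the set of possible worlds still compatible with what the hearer knows, and an assertion removes the worlds in which the sentence is false. This is a common textbook simplification, not the formalization used in the framework; in particular, it does not model the relativization of truth conditions to contexts. All names (`World`, `Context`, `assertion`) are hypothetical.

```python
from typing import Callable, FrozenSet

# A possible world is modelled as the set of atomic facts that hold in it.
World = FrozenSet[str]
# A context is the set of worlds compatible with the hearer's current memory.
Context = FrozenSet[World]


def assertion(truth_condition: Callable[[World], bool]) -> Callable[[Context], Context]:
    """Turn a truth condition into a context-change potential (illustrative)."""
    def update(context: Context) -> Context:
        # Keep only the worlds in which the asserted sentence is true.
        return frozenset(w for w in context if truth_condition(w))
    return update


# Initial context: nothing is known about rain or cold.
worlds: Context = frozenset({
    frozenset(),
    frozenset({"rain"}),
    frozenset({"cold"}),
    frozenset({"rain", "cold"}),
})

it_rains = assertion(lambda w: "rain" in w)
it_is_cold = assertion(lambda w: "cold" in w)

# Processing two sentences in sequence narrows the context step by step.
after_first = it_rains(worlds)
after_both = it_is_cold(after_first)
```

Here each sentence keeps its ordinary truth condition, and the update is derived from it, which is one way of reading the claim that the context-change view retains the explicitness of truth-conditional semantics.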

The following issues represent the backbone of the research program:

(a) Theoretical aspects of computational linguistics with special regard to the Czech language, in both its written and spoken form, and with due regard to possible applications; this research is carried out on a qualitatively higher level than was previously possible, thanks to the existence of the Prague Dependency Treebank, which offers a semi-automatic analysis of tens (and soon hundreds) of thousands of Czech sentences;

(b) The use of the exceptionally rich Czech language resources and the Prague Dependency Treebank (henceforth PDT), especially since Czech is the only language with rich morphology that has been analyzed to a comparable degree; PDT has been developed by the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague, for a subtle grammatical, semantic and lexical analysis of Czech;

(c) Processing of multilingual resources helps to make the results of the research on Czech comparable with research results for other languages; attention is paid to the study and use of parallel corpora, aiming at a wide variety of applications, such as information retrieval (data mining) in multilingual texts, machine (assisted) translation and fundamental research on multimodal interaction;

(d) The methodology is based on a deep study, comparison and careful employment of both structural and statistical approaches, including methods of machine learning, bearing in mind the specific typological properties of Czech as a highly inflected language; in this respect an original methodology is being developed and used, since most of the approaches proposed so far have drawn on data from English or similar languages with a low degree of inflection (with 'poor' morphology and a highly rigid word order); due attention is paid to the mathematical and computational foundations of the methods, algorithms and procedures of natural language processing;

(e) Close contacts are maintained and/or established with the Czech and international computer industry in order to supply it with well-founded and useful resources for broader-scale development and application.

Conceptual and Methodological Basis

The research is based on the state-of-the-art results in Computer Science, Theoretical and Computational Linguistics and Speech Recognition, borrowing suitable methods and algorithms from Statistics and Probability Theory, Information Theory, Discrete and Numerical Mathematics and Artificial Intelligence.

The methods used build on both the statistical branch (using mostly statistical methods and machine learning from large annotated and plain text and speech corpora) and the structural ("introspective") branch (using mostly the formally elaborated general linguistic knowledge accumulated in the Functional Generative Description, which has met with a broad positive response). In both cases, the research is mostly experimental in nature and processes large amounts of electronic text.

Both qualitative and quantitative results allow for a rigorous evaluation of the proposed theories and methods. The research follows the current worldwide-accepted evaluation standards used in the field, thus exposing itself to comparison with the most advanced achievements throughout the world.

Every effort is made to publish the results through a variety of channels, from early electronic publication (through web archives and publishing sites) to manuals, workshops, schools and major conferences. Due to the rapid development of the field, journal publication is encouraged only locally (in Czech, mostly for educational purposes and public awareness) and in journals that ensure timely publication of an article.

The research activities provide rich opportunities for PhD students to participate in state-of-the-art research in computational linguistics, which constitutes an important aspect of their study. Both students with a good background in computer science (coming primarily from the Faculty of Mathematics and Physics, UK) and students with a background in theoretical linguistics and Bohemian studies (coming mostly from the Faculty of Arts, UK) cooperate on the research guaranteed by the Institute of Formal and Applied Linguistics.


2007 © Institute of Formal and Applied Linguistics. All Rights Reserved.
