Abstract: Learner corpora, linguistic collections documenting a language as used by learners, provide an important empirical foundation for language acquisition research and teaching practice. This book presents CzeSL, a corpus of non-native Czech, against the background of theoretical and practical issues in current learner corpus research. Languages with rich morphology and relatively free word order, including Czech, are particularly challenging for the analysis of learner language. The authors address both the complexity of learner error annotation, describing three complementary annotation schemes, and the complexity of describing non-native Czech in terms of standard linguistic categories. The book discusses in detail practical aspects of corpus creation: the process of collection and annotation itself, the supporting tools, the resulting data, their formats and search platforms. The chapter on use cases exemplifies the usefulness of learner corpora for teaching, language acquisition research, and computational linguistics. Any researcher developing learner corpora will surely appreciate the concluding chapter listing lessons learned and pitfalls to avoid. |
BibTeX:
@book{rosen-etal-2020-book, title = {Compiling and annotating a learner corpus for a morphologically rich language: CzeSL, a corpus of non-native Czech}, author = {Rosen, Alexandr and Hana, Jiri and Vidová Hladká, Barbora and Jelínek, Tomá\v{s} and \v{S}kodov\'{a}, Svatava and \v{S}tindlov\'{a}, Barbora}, year = {2020}, publisher = {Charles University Press}, organization = {Charles University}, address = {Prague, Czech Republic}, isbn = {978-8024647593}, url = {http://hdl.handle.net/20.500.11956/123103} } |
BibTeX:
@TECHREPORT{mikulova:etal:2020-morphManual, author={Marie Mikulov{\'{a}} and Jan Haji{\v{c}} and Jiri Hana and Hana Hanov{\'{a}} and Jaroslava Hlav{\'{a}}{\v{c}}ov{\'{a}} and Emil Je{\v{r}}{\'{a}}bek and Barbora {\v{S}}t{\v{e}}p{\'{a}}nkov{\'{a}} and Barbora Vidov{\'{a}} Hladk{\'{a}} and Daniel Zeman}, title = {{Manual for Morphological Annotation, Revision for Prague Dependency Treebank – Consolidated}}, institution = {{\'{U}}FAL MFF UK}, year = {2020}, number = {TR-2005-64}, address = {Prague, Czech Rep.}, booktitle = {{}}, issn = {1214-5521}, language = {eng} } |
Abstract: CzeSL is a learner corpus of texts produced by non-native speakers of Czech. Such corpora are a great source of information about specific features of learners’ language, helping language teachers and researchers in the area of second language acquisition. In our project, we have focused on syntactic annotation of the non-native text within the framework of Universal Dependencies. As far as we know, this is the first project annotating a richly inflectional non-native language. Our ideal goal has been to annotate according to the non-native grammar in the mind of the author, not according to the standard grammar. However, this brings many challenges. First, we do not have enough data to get reliable insights into the grammar of each author. Second, many phenomena are far more complicated than they are in native languages. We believe that the most important result of this project is not the actual annotation, but the guidelines and principles that can be used as a basis for other non-native languages. |
BibTeX:
@inproceedings{hana-hladka-2018-oslo, booktitle = {Proceedings of the 17th International Workshop on Treebanks and Linguistic Theories ({TLT} 2018)}, title = {Universal Dependencies and Non-Native Czech}, editor = {Dag Haug and Stephan Oepen and Lilja {\O}vrelid and Marie Candito and Jan Haji{\v{c}}}, author = {Jirka Hana and Barbora Hladk{\'{a}}}, year = {2018}, publisher = {Link{\"{o}}ping University Electronic Press}, organization = {Universitetet i Oslo}, address = {Link{\"{o}}ping, Sweden}, venue = {Universitetet i Oslo}, pages = {105--114}, isbn = {978-91-7685-137-1}, issn = {1650-3740} } |
Abstract: Our goal has been to annotate the CzeSL corpus according to the non-native grammar in the mind of the author, not according to the standard grammar. However, this brings many challenges. First, we do not have enough data to get reliable insights into the grammar of each author. Second, many phenomena are far more complicated than they are in native languages. |
BibTeX:
@inproceedings{hana-hladka-2018-hkg, booktitle = {Proceedings of the International Conference on Bilingual Learning and Teaching}, title = {Syntactic annotation of a second-language learner corpus}, organization = {The Open University of Hong Kong}, author = {Jirka Hana and Barbora Hladká}, year = {to appear} } |
Abstract: Our goal has been to annotate the CzeSL corpus according to the non-native grammar in the mind of the author, not according to the standard grammar. However, this brings many challenges. First, we do not have enough data to get reliable insights into the grammar of each author. Second, many phenomena are far more complicated than they are in native languages. |
BibTeX:
@inproceedings{hana-hladka-2017-taipei, author = "Hana, Jirka and Hladka, Barbora", title = "Understanding Non-Native Writings: Can a Parser Help?", booktitle = "Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017) ", year = "2017", publisher = "Asian Federation of Natural Language Processing", pages = "12--16", location = "Taipei, Taiwan", url = "http://aclweb.org/anthology/W17-5902" } |
Abstract: The MERLIN corpus is a written learner corpus for Czech, German, and Italian that has been designed to illustrate the Common European Framework of Reference for Languages (CEFR) with authentic learner data. The corpus contains 2,290 learner texts produced in standardized language certifications covering CEFR levels A1–C1. The MERLIN annotation scheme includes a wide range of language characteristics that enable research into the empirical foundations of the CEFR scales and provide language teachers, test developers, and Second Language Acquisition researchers with concrete examples of learner performance and progress across multiple proficiency levels. For computational linguistics, it provides a range of authentic learner data for three target languages, supporting a broadening of the scope of research in areas such as automatic proficiency classification or native language identification. The annotated corpus and related information will be freely available as a corpus resource and through a freely accessible, didactically-oriented online platform. |
BibTeX:
@InProceedings{boyd-etal-2014-lrec, author = "Boyd, Adriane and Hana, Jirka and Nicolas, Lionel and Meurers, Detmar and Wisniewski, Katrin and Abel, Andrea and Sch{\"o}ne, Karin and {\v{S}}tindlov{\'a}, Barbora and Vettori, Chiara", title = "The MERLIN corpus: Learner language and the CEFR", booktitle = "Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014)", year = "2014", publisher = "European Language Resources Association (ELRA)", location = "Reykjavik, Iceland", url = "http://www.lrec-conf.org/proceedings/lrec2014/pdf/606_Paper.pdf" } |
Abstract: The goal of this study is to investigate whether learners’ written data in highly inflectional Czech can suggest a consistent set of clues for automatic identification of the learners’ L1 background. For our experiments, we use texts written by learners of Czech, which have been automatically and manually annotated for errors. We define two classes of learners: speakers of Indo-European languages and speakers of non-Indo-European languages. We use an SVM classifier to perform the binary classification. We show that non-content based features perform well on highly inflectional data. In particular, features reflecting errors in orthography are the most useful, yielding about 89% precision and the same recall. A detailed discussion of the best performing features is provided. |
BibTeX:
@inproceedings{aharodnik-etal-2013-nagoya, booktitle = {Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP 2013), Nagoya, Japan, October 2013}, title = {Automatic Identification of Learners’ Language Background based on their Writing in Czech}, author = {Katsiaryna Aharodnik and Marco Chang and Anna Feldman and Jirka Hana}, year = {2013}, pages = {1428--1436}, isbn = {978-4-9907348-0-0} } |
Abstract: The paper describes CzeSL, a learner corpus of Czech as a Second Language, together with its design properties. We start with a brief introduction of the project within the context of AKCES, a programme addressing Acquisition Corpora of Czech; in connection with the programme we are also concerned with the groups of respondents, including differences due to their L1; further we comment on the choice of the sociocultural metadata recorded with each text and related both to the learner and the text production task. Next we describe the intended uses of CzeSL. The core of the paper deals with transcription and annotation. We explain issues involved in the transcription of handwritten texts and present the concept of a multi-level annotation scheme including a taxonomy of captured errors. We conclude by mentioning results from an evaluation of the error annotation and presenting plans for future research. |
BibTeX:
@inproceedings{stindlova-etal-2013-czesl-louvain, address = {Louvain-la-Neuve}, author = {Barbora {\v S}tindlov{\'a} and Svatava {\v S}kodov{\'a} and Jirka Hana and Alexandr Rosen}, booktitle = {Twenty Years of Learner Corpus Research: Looking back, Moving ahead. Proceedings of , 15-17 September 2011}, editor = {Sylviane Granger and Ga{\"e}tanelle Gilquin and Fanny Meunier}, keywords = {learner corpora, error annotation}, note = {In print}, publisher = {Presses Universitaires de Louvain}, series = {Corpora and Language in Use}, title = {A learner corpus of {C}zech: current state and future directions}, year = {2013} } |
Abstract: The paper describes a corpus of texts produced by non-native speakers of Czech. We discuss its annotation scheme, consisting of three interlinked tiers, designed to handle a wide range of error types present in the input. Each tier corrects different types of errors; links between the tiers allow capturing errors in word order and complex discontinuous expressions. Errors are not only corrected, but also classified. The annotation scheme is tested on a data set including approx. 175,000 words with fair inter-annotator agreement results. We also explore the possibility of applying automated linguistic annotation tools (taggers, spell checkers and grammar checkers) to the learner text to support or even substitute manual annotation. |
BibTeX:
@article{rosen-etal-2013-czesl-lre, author = {Rosen, Alexandr and Hana, Jirka and {\v S}tindlov{\'a}, Barbora and Feldman, Anna}, doi = {10.1007/s10579-013-9226-3}, issn = {1574-020X}, journal = {Language Resources and Evaluation}, keywords = {Learner corpus; Error annotation; Second language acquisition; Czech}, language = {English}, month = {April}, pages = {1-28}, publisher = {Springer Netherlands}, title = {Evaluating and automating the annotation of a learner corpus}, url = {http://dx.doi.org/10.1007/s10579-013-9226-3}, year = {2013} } |
Abstract: We present an approach to building a learner corpus of Czech, manually corrected and annotated with error tags using a complex grammar-based taxonomy of errors in spelling, morphology, morphosyntax, lexicon and style. This grammar-based annotation is supplemented by a formal classification of errors based on surface alternations. To supply additional information about non-standard or ill-formed expressions, we aim at a synergy of manual and automatic annotation, deriving information from the original input and from the manual annotation. |
BibTeX:
@inproceedings{jelinek-etal-2012-czesl-tsd, author = {Tomas Jelinek and Barbora Stindlov{\'a} and Alexandr Rosen and Jirka Hana}, title = {Combining Manual and Automatic Annotation of a Learner Corpus}, year = {2012}, pages = {127-134}, ee = {http://dx.doi.org/10.1007/978-3-642-32790-2_15}, editor = {Petr Sojka and Ales Hor{\'a}k and Ivan Kopecek and Karel Pala}, booktitle = {Text, Speech and Dialogue - 15th International Conference, TSD 2012, Brno, Czech Republic, September 3-7, 2012. Proceedings}, publisher = {Springer}, series = {Lecture Notes in Computer Science}, volume = {7499}, isbn = {978-3-642-32789-6} } |
Abstract: The paper presents issues in annotating CzeSL, a Czech learner corpus, the concept of its annotation scheme, and the annotation process. |
BibTeX:
@inproceedings{stindlova-etal-2010-czesl-fdsl, booktitle = {Studies in Formal Slavic Linguistics. Contributions from Formal Description of Slavic Languages 8.5}, title = {Annotating foreign learners’ Czech}, editor = {Mark{\'{e}}ta Zikov{\'{a}} and Mojm{\'{i}}r Do{\v{c}}ekal}, author = {Barbora {\v{S}}tindlov{\'{a}} and Svatava {\v{S}}kodov{\'{a}} and Alexandr Rosen and Jirka Hana}, year = {2012}, publisher = {Peter Lang GmbH}, organization = {Masarykova univerzita v Brn{\v{e}}}, address = {Frankfurt am Main, Germany}, series = {Linguistik International}, pages = {205--219}, isbn = {978-3-631-63609-1} } |
Abstract: The paper describes a corpus of texts produced by non-native speakers of Czech. We discuss its annotation scheme, consisting of three interlinked levels to cope with a wide range of error types present in the input. Each level corrects different types of errors; links between the levels allow capturing errors in word order and complex discontinuous expressions. Errors are not only corrected, but also classified. The annotation scheme is tested on a doubly-annotated sample of approx. 10,000 words with fair inter-annotator agreement results. We also explore options for applying automated linguistic annotation tools (taggers, spell checkers and grammar checkers) to the learner text to support or even substitute manual annotation. |
BibTeX:
@InProceedings{hana-etal-2012-czesl-lrec, author = {Jirka Hana and Alexandr Rosen and Barbora Štindlová and Petr Jäger}, title = {Building a learner corpus}, booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)}, year = {2012}, month = {may}, date = {23-25}, address = {Istanbul, Turkey}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis}, publisher = {European Language Resources Association (ELRA)}, isbn = {978-2-9517408-7-7}, language = {english} } |
Abstract: |
BibTeX:
@inproceedings{Skodova:etal:2011, Author = {Svatava Škodová and Barbora Štindlová and Jirka Hana and Alexandr Rosen}, Title = {Víceúrovňová anotace českého žákovského korpusu}, Pages = {208-225}, Booktitle = {Korpusová lingvistika Praha 2011: 3 - Gramatika a značkování korpusů}, Address = {Praha}, Editor = {Vladimír Petkevič and Alexandr Rosen}, Publisher = {Nakladatelství Lidové noviny}, Series = {Studie z korpusové lingvistiky}, Volume = {16}, Year = {2011} } |
Abstract: Using an error-annotated learner corpus as the basis, the goal of this paper is two-fold: (i) to evaluate the practicality of the annotation scheme by computing inter-annotator agreement on a non-trivial sample of data, and (ii) to find out whether the application of automated linguistic annotation tools (taggers, spell checkers and grammar checkers) on the learner text is viable as a substitute for manual annotation. |
BibTeX:
@InProceedings{stindlova:etal:2011:palc, author = {Barbora Štindlová and Svatava Škodová and Jirka Hana and Alexandr Rosen}, title = {CzeSL – an error tagged corpus of {C}zech as a second language}, booktitle = {PALC 2011 – Practical Applications in Language and Computers, Łódź 13–15 April 2011}, series = {Łódź Studies in Language}, publisher = {Peter Lang}, year = {to appear} } |
Abstract: The paper describes a learner corpus of Czech, currently under development. The corpus captures Czech as used by non-native speakers. We discuss its structure, the layered annotation of errors and the annotation process. |
BibTeX:
@inproceedings{hana:etal:2010:czesl:law, title = {{Error-tagged Learner Corpus of Czech}}, author = {Hana, Jirka and Rosen, Alexandr and \v{S}kodov\'{a}, Svatava and \v{S}tindlov\'{a}, Barbora}, booktitle = {Proceedings of The Fourth Linguistic Annotation Workshop (LAW IV)}, year = {2010}, address = {Uppsala} } |
Abstract: |
BibTeX:
@inproceedings{klic:hana:2015:itat, booktitle = {Proceedings of the 15th conference {ITAT} 2015: Slovak and Czech NLP Workshop (SloNLP 2015)}, title = {Resource-Light Acquisition of Inflectional Paradigms}, editor = {Jakub Yaghob}, author = {Radoslav Kl{\'{\i}}{\v{c}} and Jirka Hana}, year = {2015}, publisher = {CreateSpace Independent Publishing Platform}, organization = {Charles University in Prague}, address = {Praha, Czechia}, venue = {Hotel {\v{C}}ingov}, series = {{CEUR} Workshop Proceedings}, volume = {1422}, pages = {66--72}, isbn = {978-1515120650}, issn = {1613-0073}, } |
Abstract: This article surveys resource-light monolingual approaches to morphological analysis and tagging. While supervised analyzers and taggers are very accurate, they are extremely expensive to create. Therefore, most of the world's languages and dialects have no realistic prospect for morphological tools created in this way. The weakly-supervised approaches aim to minimize the time, expertise and/or financial cost needed for their development. We discuss the algorithms and their performance considering issues such as accuracy, portability, development time and granularity of the output. |
BibTeX:
@article {hana-feldman-2012-compass, author = {Hana, Jirka and Feldman, Anna}, title = {Resource-Light Approaches to Computational Morphology Part 1: Monolingual Approaches}, journal = {Language and Linguistics Compass}, volume = {6}, number = {10}, publisher = {Blackwell Publishing Ltd}, issn = {1749-818X}, url = {http://dx.doi.org/10.1002/lnc3.358}, doi = {10.1002/lnc3.358}, pages = {622--634}, year = {2012}, } |
Abstract: In this paper we describe our efforts to build a corpus of Old Czech. We report on tools, resources and methodologies used during the corpus development as well as discuss the corpus sources and structure, the tagset used, the approach to lemmatization, morphological analysis and tagging. Due to practical restrictions we adapt resources and tools developed for Modern Czech. However, some of the described challenges, such as the non-standardized spelling in early Czech and the form and lemma variability due to language change during the covered time-span, are unique and never arise when building synchronic corpora of Modern Czech. |
BibTeX:
@InProceedings{hana-etal-2012-ocz-lrec, author = {Jirka Hana and Boris Lehečka and Anna Feldman and Alena Černá and Karel Oliva}, title = {Building a Corpus of Old Czech}, booktitle = {Proceedings of the Adaptation of Language Resources and Tools for Processing Cultural Heritage Objects Workshop associated with the 8th International Conference on Language Resources and Evaluation (LREC'12)}, year = {2012}, month = {may}, date = {26}, address = {Istanbul, Turkey}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis}, publisher = {European Language Resources Association (ELRA)}, isbn = {978-2-9517408-7-7}, language = {english} } |
Abstract: The paper describes a tagger for Old Czech (1200-1500 AD), a fusional language with rich morphology. The practical restrictions (no native speakers, limited corpora and lexicons, limited funding) make Old Czech an ideal candidate for a resource-light cross-lingual method that we have been developing (e.g. Hana et al., 2004; Feldman and Hana, 2010). We use a traditional supervised tagger. However, instead of spending years of effort to create a large annotated corpus of Old Czech, we approximate it by a corpus of Modern Czech. We perform a series of simple transformations to make a modern text look more like a text in Old Czech and vice versa. We also use a resource-light morphological analyzer to provide candidate tags. The results are worse than the results of traditional taggers, but the amount of language-specific work needed is minimal. |
BibTeX:
@InProceedings{hana:etal:2011:latech, author = {Hana, Jirka and Feldman, Anna and Aharodnik, Katsiaryna}, title = {A low-budget tagger for Old Czech}, booktitle = {Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities}, month = {June}, year = {2011}, address = {Portland, OR, USA}, publisher = {Association for Computational Linguistics}, pages = {10--18}, url = {http://www.aclweb.org/anthology/W11-1502} } |
Abstract:
While supervised corpus-based methods are highly accurate for different NLP tasks, including morphological tagging, they are difficult to port to other languages because they require resources that are expensive to create. As a result, many languages have no realistic prospect for morpho-syntactic annotation in the foreseeable future. The method presented in this book aims to overcome this problem by significantly limiting the necessary data and instead extrapolating the relevant information from another, related language. The approach has been tested on Catalan, Portuguese, and Russian. Although these languages are only relatively resource-poor, the same method can be in principle applied to any inflected language, as long as there is an annotated corpus of a related language available. Time needed for adjusting the system to a new language constitutes a fraction of the time needed for systems with extensive, manually created resources: days instead of years.
This book touches upon a number of topics: typology, morphology, corpus linguistics, contrastive linguistics, linguistic annotation, computational linguistics and Natural Language Processing (NLP). Researchers and students who are interested in these scientific areas as well as in cross-lingual studies and applications will greatly benefit from this work. Scholars and practitioners in computer science and linguistics are the prospective readers of this book. |
BibTeX:
@book{feldman:hana:2010:rodopi, author = {Anna Feldman and Jirka Hana}, title = {A resource-light approach to morpho-syntactic tagging}, year = {2010}, publisher = {Rodopi}, address = {Amsterdam/New York, NY}, pages = {199}, url = {http://www.rodopi.nl/ntalpha.asp?BookId=LC+70} } |
Abstract: We describe the challenges of resource creation for a resource-light system for morphological tagging of fusional languages (Feldman and Hana, 2010). The constraints on resources (time, expertise, and money) introduce challenges that are not present in development of morphological tools and corpora in the usual, resource-intensive way. |
BibTeX:
@inproceedings{hana:feldman:2010:morph:law, author = {Hana, Jirka and Feldman, Anna}, title = {{Challenges of Cheap Resource Creation for Morphological Tagging}}, booktitle = {Proceedings of The Fourth Linguistic Annotation Workshop (LAW IV)}, year = {2010}, address = {Uppsala}, keywords = {corpus annotation,resource-light morphology,tagset creation} } |
Abstract: Fusional languages have rich inflection. As a consequence, tagsets capturing their morphological features are necessarily large. A natural way to make a tagset manageable is to use a structured system. In this paper, we present a positional tagset for describing morphological properties of Russian. The tagset was inspired by the Czech positional system (Hajic, 2004). We have used preliminary versions of this tagset in our previous work (e.g., Hana et al. (2004, 2006); Feldman (2006); Feldman and Hana (2010)). Here, we both systematize and extend these preliminary versions (by adding information about animacy, aspect and reflexivity); give a more detailed description of the tagset and provide comparisons with the Czech system. |
BibTeX:
@inproceedings{hana:feldman:2010:lrec, title = {A Positional Tagset for Russian}, author = {Jirka Hana and Anna Feldman}, year = {2010}, booktitle = {Proceedings of the 7th International Conference on Language Resources and Evaluation ({LREC} 2010)}, publisher = {European Language Resources Association}, address = {Valletta, Malta}, pages = {1278--1284}, isbn = {2-9517408-6-7}, } |
Abstract: A simple manual for morphological annotation. It is intended to be used for various languages. Examples mostly in Czech, English, and Russian. |
BibTeX:
@unpublished{hana:feldman:2008:manual, author = {Jirka Hana and Anna Feldman}, title = {Manual For Morphological Annotation}, year = {2008}, note={Version 2008-12-07}, url = {http://www.ling.ohio-state.edu/~hana/bib/hana-feldman-2008-manual.odt} } |
Abstract: We describe a knowledge- and labor-light system for morphological analysis of fusional languages, exemplified by analysis of Czech. Our approach takes the middle road between completely unsupervised systems on the one hand and systems with extensive manually-created resources on the other. For the majority of languages and applications neither of these extreme approaches seems warranted. The knowledge-free approach lacks precision and the knowledge-intensive approach is usually too costly. We show that a system using a little knowledge can be effective. This is done by creating an open, flexible, fast, portable system for morphological analysis. Time needed for adjusting the system to a new language constitutes a fraction of the time needed for systems with extensive manually created resources: days instead of years. We tested this for Russian, Portuguese and Catalan. |
BibTeX:
@ARTICLE{hana:2008-wp-morph, author = {Jirka Hana}, title = {Knowledge- and labor-light morphological analysis}, journal = {OSUWPL}, year = {2008}, volume = {58}, pages = {52-84}, url = {http://www.ling.ohio-state.edu/~hana/bib/hana-2008-wp-morph.pdf} } |
Abstract: We describe a knowledge- and resource-light system for automatic morphological analysis and tagging of Brazilian Portuguese. We avoid the use of labor-intensive resources; particularly, large annotated corpora and lexicons. Instead, we use (i) an annotated corpus of Peninsular Spanish, a language related to Portuguese, (ii) an unannotated corpus of Portuguese, (iii) a description of Portuguese morphology on the level of a basic grammar book. We extend the similar work that we have done (Hana et al., 2004; Feldman et al., 2006) by proposing an alternative algorithm for cognate transfer that effectively projects the Spanish emission probabilities into Portuguese. Our experiments use minimal new human effort and show 21% error reduction over even emissions on a fine-grained tagset. |
BibTeX:
@INPROCEEDINGS{hana:etal:2006-eacl, author = {Jirka Hana and Anna Feldman and Luiz Amaral and Chris Brew}, title = {Tagging Portuguese with a Spanish Tagger Using Cognates}, booktitle = {Proceedings of the Workshop on Cross-language Knowledge Induction, 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2006), Trento, Italy.}, year = {2006}, pdf = {http://www.ling.ohio-state.edu/~hana/bib/hanaEtal2006-eacl.pdf} } |
Abstract: We take a novel approach to rapid, low-cost development of morpho-syntactically annotated resources without using parallel corpora or bilingual lexicons. The overall research question is how to exploit language resources and properties to facilitate and automate the creation of morphologically annotated corpora for new languages. This portability issue is especially relevant to minority languages, for which such resources are likely to remain unavailable in the foreseeable future. We compare the performance of our system on languages that belong to different language families (Romance vs. Slavic), as well as different language pairs within the same language family (Portuguese via Spanish vs. Catalan via Spanish). We show that across language families, the most difficult category is the category of nominals (the noun homonymy is challenging for morphological analysis and the order variation of adjectives within a sentence makes it challenging to create a reliable model), whereas different language families present different challenges with respect to their morpho-syntactic descriptions: for the Slavic languages, case is the most challenging category; for the Romance languages, gender is more challenging than case. In addition, we present an alternative evaluation metric for our system, where we measure how much human labor will be needed to convert the result of our tagging to a high precision annotated resource. |
BibTeX:
@INPROCEEDINGS{feldman:etal:2006-lrec, author = {Anna Feldman and Jirka Hana and Chris Brew}, title = {A cross-language approach to rapid creation of new morpho-syntactically annotated resources}, booktitle = {Proceedings of the fifth international conference on Language Resources and Evaluation (LREC 2006). Genoa, Italy}, year = {2006}, pdf = {http://www.ling.ohio-state.edu/~hana/bib/feldmanHanaBrew2006-lrec.pdf} } |
Abstract: Annotated corpora are valuable resources for NLP which are often costly to create. We introduce a method for transferring annotation from a morphologically annotated corpus of a source language to a target language. Our approach assumes only that an unannotated text corpus exists for the target language and a simple textbook which describes the basic morphological properties of that language is available. Our paper describes experiments with Polish, Czech, and Russian. However, the method is not tied in any way to these languages. In all the experiments we use the TnT tagger, a second-order Markov model. Our approach assumes that the information acquired about one language can be used for processing a related language. We have found out that even breathtakingly naive things (such as approximating the Russian transitions by Czech and/or Polish and approximating the Russian emissions by (manually/ automatically derived) Czech cognates) can lead to a significant improvement of the tagger’s performance. |
BibTeX:
@INPROCEEDINGS{feldman:2006-cicling, author = {Anna Feldman and Jirka Hana and Chris Brew}, title = {Experiments in Morphological Annotation Transfer}, booktitle = {Proceedings of Computational Linguistics and Intelligent Text Processing (CICLing)}, year = {2006}, editor = {A. Gelbukh}, series = {Lecture Notes in Computer Science}, publisher = {Springer-Verlag}, pdf = {http://www.ling.ohio-state.edu/~hana/bib/feldmanHanaBrew2006-cicling.pdf} } |
BibTeX:
@TECHREPORT{hanaEtAl:2005-morphManual, author = {Jiri Hana and Daniel Zeman and Jan Haji{\v{c}} and Hana Hanov{\'{a}} and Barbora Hladk{\'{a}} and Emil Je{\v{r}}{\'{a}}bek}, title = {{Manual for Morphological Annotation, Revision for the Prague Dependency Treebank 2.0}}, institution = {{\'{U}}FAL MFF UK}, year = {2005}, number = {TR-2005-27}, address = {Prague, Czech Rep.}, booktitle = {{}}, issn = {1214-5521}, language = {eng}, pageswhole = {55} } |
Abstract: We report on morphological tagging of Russian using very limited Russian resources. We train the TnT tagger (Brants, 2000) on a modified Czech corpus to get the transition probabilities. We believe that the two languages are similar enough for the transitional information to be useful. The Russian emission symbols are obtained using a morphological analyzer that does not rely on a manually created lexicon. Finally, we report on several simple systematic modifications transforming a Czech text into a text with more Russian-like morphological properties. |
BibTeX:
@INPROCEEDINGS{hana:feldman:2004, author = {Jiri Hana and Anna Feldman}, title = {{Portable Language Technology: Russian via Czech}}, booktitle = {{Proceedings from the Midwest Computational Linguistics Colloquium, June 25-26, 2004}}, year = {2004}, address = {Bloomington, Indiana}, pdf = {http://www.ling.ohio-state.edu/~hana/bib/HanaFeldman2004-RusViaCze.pdf} } |
Abstract: In this paper, we describe a resource-light system for the automatic morphological analysis and tagging of Russian. We eschew the use of extensive resources (particularly, large annotated corpora and lexicons), exploiting instead (i) pre-existing annotated corpora of Czech; (ii) an unannotated corpus of Russian. We show that our approach has benefits, and present what we believe to be one of the first full evaluations of a Russian tagger in the openly available literature. |
BibTeX:
@INPROCEEDINGS{hana:etal:2004:emnlp, author = {Jiri Hana and Anna Feldman and Chris Brew}, title = {{A Resource-light Approach to Russian Morphology: Tagging Russian using Czech resources}}, booktitle = {{Proceedings of EMNLP 2004}}, year = {2004}, address = {Barcelona, Spain}, pdf = {http://www.ling.ohio-state.edu/~hana/bib/HanaFeldmanBrew2004-RusMorphLite.pdf} } |
BibTeX:
@TECHREPORT{hana:etal:2002, author = {Jiri Hana and Hana Hanov{\'a} and Jan Hajic and Barbora Vidov{\'a}-Hladk\'a and Emil Jer{\'a}bek}, title = {Manual for Morphological Annotation}, institution = {CKL MFF UK}, year = {2002}, number = {TR-2002-14} } |
Abstract: The thesis describes the morphology of Esperanto by a two-level morphology system. Esperanto is an agglutinating language, therefore the two-level morphology approach is extremely suitable for it. The system is evaluated on a large corpus of Esperanto text. |
BibTeX:
Coming soon. |
Abstract:
This thesis has three interrelated goals:
The main goal is an analysis of Czech clitics, units of grammar on the borderline between morphology and syntax with rather peculiar ordering properties both relative to the whole clause and to each other. We examine the actual set of clitics, their rather rigid ordering properties, and finally the properties of so-called clitic climbing. The analysis evaluates previous research, but it also provides new insights, especially in the position of the clitic cluster and in the constraints on clitic climbing. We show that many of the constraints regarding position of the clitic cluster suggested in previous research do not hold. We also argue that cases when clitics do not follow the first constituent are in fact not exceptions in clitic placement but instead unusual frontings. The second goal is the development of a framework within Higher Order Grammar (HOG) supporting a transparent and modular treatment of word order. Unlike previous versions of HOG, we work with signs (containing phonological, syntactic and potentially other information) as actual objects of the grammar. Apart from that, we build on the simplicity and elegance of the pre-formal part of the linearization framework within Head-driven Phrase Structure Grammar. Finally, the third objective is to test the result of the second goal by applying it to the results of the first goal. |
BibTeX:
@PHDTHESIS{hana:diss, author = {Hana, Jiri}, title = {Czech Clitics in Higher Order Grammar}, school = {The Ohio State University}, year = {2007}, pdf = {http://ling.osu.edu/~hana/bib/hana-diss.pdf} } |
Abstract:
This paper presents an analysis of certain aspects of Czech sentential clitics
in Higher Order Grammar. I focus on the relative order of clitics within the clitic
cluster. The overall aim of the paper is to show that constraints governing Czech
sentential clitics, though quite complex, can be captured relatively easily within a
higher order formalism such as Higher Order Grammar. |
BibTeX:
@INCOLLECTION{hana:2004, author = {Jirka Hana}, title = {{Czech clitics in Higher Order Grammar}}, booktitle = {{Working Papers in Slavic Studies}}, publisher = {Department of Slavic and East European Languages and Literatures}, year = {2004}, address = {Columbus, Ohio}, pdf = {http://www.ling.ohio-state.edu/~hana/bib/Hana2004-Clitics.pdf} } |
Abstract: In this paper we describe the Prague Markup Language (PML), a generic and open XML-based format intended to define format of linguistic resources, mainly annotated corpora. We also provide an overview of existing tools supporting PML, including annotation editors, a corpus query system, software libraries, etc. |
BibTeX:
@InProceedings{hana-stepanek-2012-pml-law, author = {Hana, Jirka and \v{S}těp\'{a}nek, Jan}, title = {Prague Markup Language Framework}, booktitle = {Proceedings of the Sixth Linguistic Annotation Workshop}, month = {July}, year = {2012}, address = {Jeju, Republic of Korea}, publisher = {Association for Computational Linguistics}, pages = {12--21}, url = {http://www.aclweb.org/anthology/W12-3603} } |
Abstract: We present a new way to get more morphologically and syntactically annotated data. We have developed an annotation editor tailored to school children to involve them in text annotation. Using this editor, they practice morphology and dependency-based syntax in the same way as they normally do at (Czech) schools, without any special training. Their annotation is then automatically transformed into the target annotation schema. The editor is designed to be language independent, however the subsequent transformation is driven by the annotation framework we are heading for. In our case, the object language is Czech and the target annotation scheme corresponds to the Prague Dependency Treebank annotation framework. |
BibTeX:
@InProceedings{hana-hladka-2012-capek-lrec, author = {Jirka Hana and Barbora Hladka}, title = {Getting more data -- Schoolkids as annotators}, booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)}, year = {2012}, month = {may}, date = {23-25}, address = {Istanbul, Turkey}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis}, publisher = {European Language Resources Association (ELRA)}, isbn = {978-2-9517408-7-7}, language = {english} } |
Abstract: There are many logical possibilities for marking morphological features. However only some of them are attested in languages of the world, and out of them some are more frequent than others. For example, it has been observed (Sapir 1921; Greenberg 1957; Hawkins & Gilligan 1988) that inflectional morphology tends to overwhelmingly involve suffixation rather than prefixation. This paper proposes an explanation for this asymmetry in terms of acquisition complexity. The complexity measure is based on the Levenshtein edit distance, modified to reflect human memory limitations and the fact that language occurs in time. This measure produces some interesting predictions: for example, it predicts correctly the prefix-suffix asymmetry and shows mirror image morphology to be virtually impossible. |
BibTeX:
@ARTICLE{hana:culicover:2008, author = {Jirka Hana and Peter W. Culicover}, title = {Morphological Complexity Outside of Universal Grammar}, journal = {OSUWPL}, year = {2008}, volume = {58}, pages = {85--109}, url = {http://www.ling.ohio-state.edu/~hana/bib/hana-culicover-2008.pdf} } |
Abstract: We show that the standard account of neutrality and coordination in type-logical grammar is untenable. However, when using as our framework a version of Lambek’s categorical grammar with a type theory based on Lambek and Scott’s higher order intuitionistic logic (the internal language of a topos) rather than the Lambek calculus, the account can largely be salvaged. Because of the difficulty of phonologically interpreting coordinated functors of differing directionality we need to handle both phonology and syntax within a single polymorphically typed lambda calculus. |
BibTeX:
@INPROCEEDINGS{pollard:hana:2003, author = {Carl Pollard and Jiri Hana}, title = {Ambiguity, neutrality, and coordination in higher order grammar}, booktitle = {Proceedings of Formal Grammar}, year = {2003}, editor = {Gerhard Jaeger and Paola Monachesi and Gerald Penn and Shuly Wintner}, pages = {125--136}, address = {Wien}, pdf = {http://www.ling.ohio-state.edu/~hana/bib/pollard-hana2003-fg-vienna.pdf} } |
Abstract: Coming soon. |
BibTeX:
Coming soon. |
Abstract: This paper describes a multilingual text generation system in the domain of CAD/CAM software instructions for Bulgarian, Czech and Russian. Starting from a language-independent semantic representation, the system drafts natural, continuous text as typically found in software manuals. The core modules for strategic and tactical generation are implemented using the KPML platform for linguistic resource development and generation. Prominent characteristics of the approach implemented are a treatment of multilinguality that makes maximal use of the commonalities between languages while also accounting for their differences and a common representational strategy for both text planning and sentence generation. |
BibTeX:
Coming soon. |
Abstract: Coming soon. |
BibTeX:
One day ... |
Abstract: Coming soon. |
BibTeX:
Coming soon. |