Pavel Straňák
Main Research Interests
Lexical semantics, computational lexicography, reliability of annotations, machine translation, application of NLP technology in everyday life
Projects
- I am a scientific secretary of the LINDAT/CLARIAH-CZ research infrastructure
- HPLT (High Performance Language Technologies): a project building very large European language and translation models from internet archive data
- CLARIN Plus: Enhancing CLARIN (H2020-INFRADEV-1-2015-1-676529)
- PARSEME: PARSing and Multi-word Expressions (ICT COST Action)
- Korektor – an open source contextual spell-checker and diacritics generation system
Curriculum Vitae
Education
- 2010 - Ph.D. in Computational Linguistics, Charles University in Prague.
- 2001 - Mgr. (equiv. of M.A.) in Czech Philology, University of Ostrava.
Teaching
- Language Technologies for Research in Humanities – NPFL131 – the record in the IS Studium
- My Proposals of Topics for Students
If you have your own proposal that concerns NLP, feel free to contact me.
Selected Bibliography
- Google Scholar
- ORCID: 0000-0002-6895-8536
- Scopus ID: 15043417100
- Researcher ID: I-3422-2017
- Charles Translator: A Machine Translation System between Ukrainian and Czech. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 3038-3045, European Language Resources Association, Torino, Italy, ISBN 978-2-493814-10-4 (pdf, local PDF, bibtex)
- LINDAT/CLARIAH-CZ: Where We Are and Where We Go. In: CLARIN: The Infrastructure for Language Resources, pp. 61-82, Berlin, Boston: De Gruyter, ISBN 978-3-11-076734-6 (bibtex)
- Corpus Annotation as a Feasible and Scientifically Beneficial Task. In: CLARIN: The Infrastructure for Language Resources, pp. 613-646, Walter de Gruyter GmbH, Berlin/Boston, Mannheim, Germany, ISBN 978-3-11-076734-6 (url, bibtex)
- ParCzech 3.0: A Large Czech Speech Corpus with Rich Metadata. In: 24th International Conference on Text, Speech and Dialogue, pp. 293-304, Springer, Cham, Switzerland, ISBN 978-3-030-83526-2 (pdf, local PDF, bibtex)
- Compiling Czech Parliamentary Stenographic Protocols into a Corpus. In: Proceedings of the LREC 2020 Workshop on Creating, Using and Linking of Parliamentary Corpora with Other Types of Political Discourse (ParlaCLARIN II), pp. 18-22, European Language Resources Association (ELRA), Paris, France, ISBN 979-10-95546-47-4 (url, local PDF, bibtex)
- The Impact of Copyright and Personal Data Laws on the Creation and Use of Language Models. In: Linköping Electronic Conference Proceedings, ISSN 1650-3740, vol. 172, no. 8, pp. 53-65 (url, local PDF, bibtex)
- Processing personal data without the consent of the data subject for the development and use of language resources. In: Linköping Electronic Conference Proceedings, ISSN 1650-3740, vol. 159, no. 8, pp. 72-82 (url, bibtex)
- CLARIN-DSpace repository at LINDAT/CLARIN : LINDAT/CLARIN FAIR repository for language data. In: the grey Journal – International Journal on Grey Literature, ISSN 1574-1796, 16, pp. 52-61 (url, bibtex)
- Bridging the LAPPS Grid and CLARIN. In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pp. 1-10, European Language Resources Association, Miyazaki, Japan, ISBN 979-10-95546-00-9 (url, local PDF, bibtex)
- Implementation of an Open Science Policy in the context of management of CLARIN language resources: a need for changes?. In: Linköping Electronic Conference Proceedings, ISSN 1650-3740, vol. 9, no. 147, pp. 102-111 (url, local PDF, bibtex)
- Diacritics Restoration Using Neural Networks. In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pp. 1-10, European Language Resources Association, Miyazaki, Japan, ISBN 979-10-95546-00-9 (url, local PDF, bibtex)
- Extracting Verbal Multiword Data from Rich Treebank Annotation. In: Proceedings of the 15th International Workshop on Treebanks and Linguistic Theories (TLT 15), pp. 13-24, Indiana University, Bloomington, Bloomington, IN, USA (pdf, local PDF, local PDF, bibtex)
- The Public License Selector: Making Open Licensing Easier. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pp. 1-10, European Language Resources Association, Paris, France, ISBN 978-2-9517408-9-1 (pdf, local PDF, bibtex)
- Improving Corpus Search via Parsing. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pp. 2862-2866, European Language Resources Association, Paris, France, ISBN 978-2-9517408-9-1 (pdf, local PDF, bibtex)
- B2SHARE: An Open eScience Data Sharing Platform. In: 2015 IEEE 11th International Conference on e-Science (e-Science), pp. 448-453, IEEE computer society, Munich, Germany, ISBN 978-1-4673-9325-6 (url, local PDF, bibtex)
- Improvements to Korektor: A case study with native and non-native Czech. In: Proceedings of the 15th conference ITAT 2015: Slovenskočeský NLP workshop (SloNLP 2015), pp. 73-80, CreateSpace Independent Publishing Platform, Praha, Czechia, ISBN 978-1515120650 (bibtex)
- HindEnCorp – Hindi-English and Hindi-only Corpus for Machine Translation. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pp. 3550-3555, European Language Resources Association, Reykjavík, Iceland, ISBN 978-2-9517408-8-4 (pdf, local PDF, local PDF, bibtex)
- Syntactic Identification of Occurrences of Multiword Expressions in Text using a Lexicon with Dependency Structures. In: The 9th Workshop on Multiword Expressions (MWE 2013), pp. 106-115, Association for Computational Linguistics, Atlanta, Georgia, USA, ISBN 978-1-937284-47-3 (pdf, local ZIP, local PDF, local PDF, bibtex)
- From PDT 2.0 to PDT 3.0 (Modifications and Complements) (technical report). (local PDF, bibtex)
- Úpravy a doplňky Pražského závislostního korpusu (Od PDT 2.0 k PDT 3.0) (technical report). (local PDF, bibtex)
- Prague Dependency Treebank 2.5 -- a revisited version of PDT 2.0. In: Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012), pp. 231-246, Coling 2012 Organizing Committee, Mumbai, India (local PDF, local PDF, bibtex)
- Korektor – A System for Contextual Spell-checking and Diacritics Completion. In: Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012), pp. 1-12, Coling 2012 Organizing Committee, Mumbai, India (pdf, local PDF, bibtex)
- Influence of Treebank Design on Representation of Multiword Expressions. In: Lecture Notes in Computer Science, ISSN 0302-9743, 6608, pp. 1-14 (url, local PDF, bibtex)
- Annotation of Multiword Expressions in the Prague Dependency Treebank. In: Language Resources and Evaluation, ISSN 1574-020X, vol. 44, no. 1-2, pp. 7-21 (url, local PDF, bibtex)
- Data Issues in English-to-Hindi Machine Translation. In: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), pp. 1771-1777, European Language Resources Association, Valletta, Malta, ISBN 2-9517408-6-7 (local ODP, local PDF, local PDF, bibtex)
- Annotation of Multiword Expressions in The Prague Dependency Treebank (PhD thesis). (local PDF, local PDF, bibtex)
- Representing Layered and Structured Data in the CoNLL-ST Format. In: Proceedings of the Second International Conference on Global Interoperability for Language Resources, pp. 143-152, City University of Hong Kong, Hong Kong, China, ISBN 978-962-442-323-5 (local PDF, local PDF, bibtex)
- Finalising Multiword Annotations in PDT. In: Proceedings of 8th Treebanks and Linguistic Theories Workshop (TLT), pp. 17-25, Università Cattolica del Sacro Cuore, Milano, Italy, ISBN 978-88-8311-712-1 (local PDF, local PDF, local PDF, bibtex)
- English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models. In: Proceedings of ICON 2009: 7th International Conference on Natural Language Processing, pp. 316-321, Macmillan Publishers, India, Hyderabad, India, ISBN 978-023-032-845-7 (local PDF, local PDF, bibtex)
- The CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL): Shared Task, pp. 1-18, Association for Computational Linguistics, Boulder, CO, USA, ISBN 978-1-932432-29-9 (url, local PDF, bibtex)
- Anotace víceslovných výrazů v Pražském závislostním korpusu. In: Grammar & Corpora / Gramatika a korpus 2007, pp. 143-149, Academia, Praha, ISBN 978-80-200-1634-8 (local PDF, bibtex)
- Annotation of Multiword Expressions in the Prague Dependency Treebank. In: IJCNLP 2008 Proceedings of the Third International Joint Conference on Natural Language Processing, pp. 793-798, International Institute of Information Technology, Hyderabad, India (local PDF, local PDF, bibtex)
- English-Hindi Translation in 21 Days. In: Proceedings of the 6th International Conference On Natural Language Processing (ICON-2008) NLP Tools Contest, International Institute of Information Technologies, Hyderabad, Pune, India (url, local PDF, local PPT, bibtex)
- The lexico-semantic annotation of PDT: Some results, problems and solutions. In: Lecture Notes in Computer Science, ISSN 0302-9743, 4188, pp. 21-28 (url, local PDF, local PDF, bibtex)
- Review of Leonard Talmy: Toward a Cognitive Semantics, Volume I, Concept Structuring Systems (review). In: The Prague Bulletin of Mathematical Linguistics, ISSN 0032-6585, 83, pp. 85-86 (local PDF, bibtex)
- Validating and Improving the Czech WordNet via Lexico-Semantic Annotation of the Prague Dependency Treebank. In: Proceedings of LREC 2004 (bibtex)
- Approaches to Building Semantic Lexicons. In: WDS'03 Proceedings of Contributed Papers, Part I, pp. 173-178, MATFYZPRESS, Prague, ISBN 80-86732-18-5 (bibtex)
Students
Defended
- 2012 – Patrik Černý: Voice command of a Tv for disabled users (Bc.)
- 2010 – Michal Richter: Advanced Czech spellchecker (Mgr., equiv. of M.A.)
- 2009 – Michal Richter: Integration of an n-gram language model with a Czech spellchecker (Bc.)
Other Activities
- I am also on the editorial board of UFAL's Publishing House, which publishes the monograph-oriented series "Studies in Computational and Theoretical Linguistics".
Past Activities
- The Prague Bulletin of Mathematical Linguistics editorial staff
- CoNLL 2009 Shared Task organisation
- Publicity co-chair of the ACL 2007 conference (with Jiří Mírovský)
Data
I have participated in the production of several datasets, all of which are freely available in the LINDAT-Clarin Repository.