Pavel Straňák

office: 424
email: stranak@ufal.mff.cuni.cz
phone: +420 951 554 279
address: Malostranské náměstí 25
118 00 Praha 1
Czech Republic

Main Research Interests

Lexical semantics, computational lexicography, reliability of annotations, machine translation, application of NLP technology in everyday life

Projects

I am a scientific secretary of the LINDAT/CLARIAH-CZ research infrastructure

I am a Charles University PI for these projects:

ATRIUM

EVERSE

FIDELIS (trusted digital repositories)

Recent past projects:

HPLT (High Performance Language Technologies): a project building very large European language and translation models from internet archive data

CLARIN Plus: Enhancing CLARIN (H2020-INFRADEV-1-2015-1-676529)

PARSEME: PARSing and Multi-word Expressions (ICT COST Action)

Korektor – an open source contextual spell-checker and diacritics generation system

Curriculum Vitae

Education

2010 - Ph.D. in Computational Linguistics, Charles University in Prague.

2001 - Mgr. (equiv. of M.A.) in Czech Philology, University of Ostrava.

Teaching

Language Technologies for Research in Humanities – NPFL131 – the record in the IS Studium

My Proposals of Topics for Students
If you have your own proposal that concerns NLP, feel free to contact me.

Selected Bibliography

Google Scholar
ORCID: 0000-0002-6895-8536
Scopus ID: 15043417100
Researcher ID: I-3422-2017

Martin Popel, Lucie Poláková, Michal Novák, Jindřich Helcl, Jindřich Libovický, Pavel Straňák, Tomáš Krabač, Jaroslava Hlaváčová, Mariia Anisimova, Tereza Chlaňová (2024): Charles Translator: A Machine Translation System between Ukrainian and Czech. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 3038-3045, European Language Resources Association, Torino, Italy, ISBN 978-2-493814-10-4 (pdf, local PDF, bibtex)

Jan Hajič, Eva Hajičová, Barbora Hladká, Ondřej Košarko, Jozef Mišutka, Pavel Straňák (2022): LINDAT/CLARIAH-CZ: Where We Are and Where We Go. In: CLARIN: The Infrastructure for Language Resources, pp. 61-82, Berlin, Boston: De Gruyter, ISBN 978-3-11-076734-6 (bibtex)

Eva Hajičová, Jan Hajič, Barbora Hladká, Jiří Mírovský, Lucie Poláková, Kateřina Rysová, Magdaléna Rysová, Pavel Straňák, Barbora Štěpánková, Šárka Zikánová (2022): Corpus Annotation as a Feasible and Scientifically Beneficial Task. In: CLARIN: The Infrastructure for Language Resources, pp. 613-646, Walter de Gruyter GmbH, Berlin/Boston, Mannheim, Germany, ISBN 978-3-11-076734-6 (url, bibtex)

Matyáš Kopp, Vladislav Stankov, Jan Oldřich Krůza, Pavel Straňák, Ondřej Bojar (2021): ParCzech 3.0: A Large Czech Speech Corpus with Rich Metadata. In: 24th International Conference on Text, Speech and Dialogue, pp. 293-304, Springer, Cham, Switzerland, ISBN 978-3-030-83526-2 (pdf, local PDF, bibtex)

Barbora Hladká, Matyáš Kopp, Pavel Straňák (2020): Compiling Czech Parliamentary Stenographic Protocols into a Corpus. In: Proceedings of the LREC 2020 Workshop on Creating, Using and Linking of Parliamentary Corpora with Other Types of Political Discourse (ParlaCLARIN II), pp. 18-22, European Language Resources Association (ELRA), Paris, France, ISBN 979-10-95546-47-4 (url, local PDF, bibtex)

Aleksei Kelli, Arvi Tavast, Krister Lindén, Kadri Vider, Ramūnas Birštonas, Penny Labropoulou, Irene Kull, Gaabriel Tavits, Age Värv, Pavel Straňák, Jan Hajič (2020): The Impact of Copyright and Personal Data Laws on the Creation and Use of Language Models. In: Linköping Electronic Conference Proceedings, ISSN 1650-3740, vol. 172, no. 8, pp. 53-65 (url, local PDF, bibtex)

Aleksei Kelli, Krister Lindén, Kadri Vider, Paweł Kamocki, Ramūnas Birštonas, Silvia Calamai, Penny Labropoulou, Maria Gavrilidou, Pavel Straňák (2019): Processing personal data without the consent of the data subject for the development and use of language resources. In: Linköping Electronic Conference Proceedings, ISSN 1650-3740, vol. 159, no. 8, pp. 72-82 (url, bibtex)

Pavel Straňák, Ondřej Košarko, Jozef Mišutka (2019): CLARIN-DSpace repository at LINDAT/CLARIN: LINDAT/CLARIN FAIR repository for language data. In: the grey Journal – International Journal on Grey Literature, ISSN 1574-1796, 16, pp. 52-61 (url, bibtex)

Erhard Hinrichs, Nancy Ide, James Pustejovsky, Jan Hajič, Marie Hinrichs, Mohammad Fazleh Elahi, Keith Suderman, Marc Verhagen, Kyeongmin Rim, Pavel Straňák, Jozef Mišutka (2018): Bridging the LAPPS Grid and CLARIN. In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pp. 1-10, European Language Resources Association, Miyazaki, Japan, ISBN 979-10-95546-00-9 (url, local PDF, bibtex)

Aleksei Kelli, Krister Lindén, Kadri Vider, Penny Labropoulou, Erik Ketzan, Paweł Kamocki, Pavel Straňák (2018): Implementation of an Open Science Policy in the context of management of CLARIN language resources: a need for changes? In: Linköping Electronic Conference Proceedings, ISSN 1650-3740, vol. 9, no. 147, pp. 102-111 (url, local PDF, bibtex)

Jakub Náplava, Milan Straka, Pavel Straňák, Jan Hajič (2018): Diacritics Restoration Using Neural Networks. In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pp. 1-10, European Language Resources Association, Miyazaki, Japan, ISBN 979-10-95546-00-9 (url, local PDF, bibtex)

Eduard Bejček, Jan Hajič, Pavel Straňák, Zdeňka Urešová (2017): Extracting Verbal Multiword Data from Rich Treebank Annotation. In: Proceedings of the 15th International Workshop on Treebanks and Linguistic Theories (TLT 15), pp. 13-24, Indiana University, Bloomington, Bloomington, IN, USA (pdf, local PDF, local PDF, bibtex)

Paweł Kamocki, Pavel Straňák, Michal Sedlák (2016): The Public License Selector: Making Open Licensing Easier. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pp. 1-10, European Language Resources Association, Paris, France, ISBN 978-2-9517408-9-1 (pdf, local PDF, bibtex)

Natalia Klyueva, Pavel Straňák (2016): Improving Corpus Search via Parsing. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pp. 2862-2866, European Language Resources Association, Paris, France, ISBN 978-2-9517408-9-1 (pdf, local PDF, bibtex)

Sarah Berenji Ardestani, Carl Johan Håkansson, Erwin Laure, Ilja Livenson, Pavel Straňák, Emanuel Dima, Dennis Blommesteijn, Mark van de Sanden (2015): B2SHARE: An Open eScience Data Sharing Platform. In: 2015 IEEE 11th International Conference on e-Science (e-Science), pp. 448-453, IEEE computer society, Munich, Germany, ISBN 978-1-4673-9325-6 (url, local PDF, bibtex)

Loganathan Ramasamy, Alexandr Rosen, Pavel Straňák (2015): Improvements to Korektor: A case study with native and non-native Czech. In: Proceedings of the 15th conference ITAT 2015: Slovenskočeský NLP workshop (SloNLP 2015), pp. 73-80, CreateSpace Independent Publishing Platform, Praha, Czechia, ISBN 978-1515120650 (bibtex)

Ondřej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Straňák, Vít Suchomel, Aleš Tamchyna, Daniel Zeman (2014): HindEnCorp – Hindi-English and Hindi-only Corpus for Machine Translation. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pp. 3550-3555, European Language Resources Association, Reykjavík, Iceland, ISBN 978-2-9517408-8-4 (pdf, local PDF, local PDF, bibtex)

Eduard Bejček, Pavel Straňák, Pavel Pecina (2013): Syntactic Identification of Occurrences of Multiword Expressions in Text using a Lexicon with Dependency Structures. In: The 9th Workshop on Multiword Expressions (MWE 2013), pp. 106-115, Association for Computational Linguistics, Atlanta, Georgia, USA, ISBN 978-1-937284-47-3 (pdf, local ZIP, local PDF, local PDF, bibtex)

Marie Mikulová, Eduard Bejček, Jiří Mírovský, Anna Nedoluzhko, Jarmila Panevová, Lucie Poláková, Pavel Straňák, Magda Ševčíková, Zdeněk Žabokrtský (2013): Úpravy a doplňky Pražského závislostního korpusu (Od PDT 2.0 k PDT 3.0) (technical report). (local PDF, bibtex)

Marie Mikulová, Eduard Bejček, Jiří Mírovský, Anna Nedoluzhko, Jarmila Panevová, Lucie Poláková, Pavel Straňák, Magda Ševčíková, Zdeněk Žabokrtský (2013): From PDT 2.0 to PDT 3.0 (Modifications and Complements) (technical report). (local PDF, bibtex)

Eduard Bejček, Jarmila Panevová, Jan Popelka, Pavel Straňák, Magda Ševčíková, Jan Štěpánek, Zdeněk Žabokrtský (2012): Prague Dependency Treebank 2.5 -- a revisited version of PDT 2.0. In: Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012), pp. 231-246, Coling 2012 Organizing Committee, Mumbai, India (local PDF, local PDF, bibtex)

Michal Richter, Pavel Straňák, Alexandr Rosen (2012): Korektor – A System for Contextual Spell-checking and Diacritics Completion. In: Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012), pp. 1-12, Coling 2012 Organizing Committee, Mumbai, India (pdf, local PDF, bibtex)

Eduard Bejček, Pavel Straňák, Daniel Zeman (2011): Influence of Treebank Design on Representation of Multiword Expressions. In: Lecture Notes in Computer Science, ISSN 0302-9743, 6608, pp. 1-14 (url, local PDF, bibtex)

Eduard Bejček, Pavel Straňák (2010): Annotation of Multiword Expressions in the Prague Dependency Treebank. In: Language Resources and Evaluation, ISSN 1574-020X, vol. 44, no. 1-2, pp. 7-21 (url, local PDF, bibtex)

Ondřej Bojar, Pavel Straňák, Daniel Zeman (2010): Data Issues in English-to-Hindi Machine Translation. In: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), pp. 1771-1777, European Language Resources Association, Valletta, Malta, ISBN 2-9517408-6-7 (local PDF, local PDF, local ODP, bibtex)

Pavel Straňák (2010): Annotation of Multiword Expressions in The Prague Dependency Treebank (PhD thesis). (local PDF, local PDF, bibtex)

Pavel Straňák, Jan Štěpánek (2010): Representing Layered and Structured Data in the CoNLL-ST Format. In: Proceedings of the Second International Conference on Global Interoperability for Language Resources, pp. 143-152, City University of Hong Kong, Hong Kong, China, ISBN 978-962-442-323-5 (local PDF, local PDF, bibtex)

Eduard Bejček, Pavel Straňák, Jan Hajič (2009): Finalising Multiword Annotations in PDT. In: Proceedings of 8th Treebanks and Linguistic Theories Workshop (TLT), pp. 17-25, Università Cattolica del Sacro Cuore, Milano, Italy, ISBN 978-88-8311-712-1 (local PDF, local PDF, local PDF, bibtex)

Ondřej Bojar, Pavel Straňák, Daniel Zeman, Gaurav Jain, Michal Hrušecký, Michal Richter, Jan Hajič (2009): English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models. In: Proceedings of ICON 2009: 7th International Conference on Natural Language Processing, pp. 316-321, Macmillan Publishers, India, Hyderabad, India, ISBN 978-023-032-845-7 (local PDF, local PDF, bibtex)

Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, Yi Zhang (2009): The CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL): Shared Task, pp. 1-18, Association for Computational Linguistics, Boulder, CO, USA, ISBN 978-1-932432-29-9 (url, local PDF, bibtex)

Eduard Bejček, Pavel Straňák (2008): Anotace víceslovných výrazů v Pražském závislostním korpusu. In: Grammar & Corpora / Gramatika a korpus 2007, pp. 143-149, Academia, Praha, ISBN 978-80-200-1634-8 (local PDF, bibtex)

Eduard Bejček, Pavel Straňák, Pavel Schlesinger (2008): Annotation of Multiword Expressions in the Prague Dependency Treebank. In: IJCNLP 2008 Proceedings of the Third International Joint Conference on Natural Language Processing, pp. 793-798, International Institute of Information Technology, Hyderabad, India (local PDF, local PDF, bibtex)

Ondřej Bojar, Pavel Straňák, Daniel Zeman (2008): English-Hindi Translation in 21 Days. In: Proceedings of the 6th International Conference On Natural Language Processing (ICON-2008) NLP Tools Contest, International Institute of Information Technologies, Hyderabad, Pune, India (url, local PDF, local PPT, bibtex)

Eduard Bejček, Petra Möllerová, Pavel Straňák (2006): The lexico-semantic annotation of PDT: Some results, problems and solutions. In: Lecture Notes in Computer Science, ISSN 0302-9743, 4188, pp. 21-28 (url, local PDF, local PDF, bibtex)

Pavel Straňák (2005): Review of Leonard Talmy: Toward a Cognitive Semantics, Volume I, Concept Structuring Systems (review). In: The Prague Bulletin of Mathematical Linguistics, ISSN 0032-6585, 83, pp. 85-86 (local PDF, bibtex)

Jan Hajič, Martin Holub, Marie Hučínová, Martin Pavlík, Pavel Pecina, Pavel Straňák, Pavel Šidák (2004): Validating and Improving the Czech WordNet via Lexico-Semantic Annotation of the Prague Dependency Treebank. In: Proceedings of LREC 2004 (bibtex)

Martin Holub, Pavel Straňák (2003): Approaches to Building Semantic Lexicons. In: WDS'03 Proceedings of Contributed Papers, Part I, pp. 173-178, MATFYZPRESS, Prague, ISBN 80-86732-18-5 (bibtex)

Students

Defended

2012 – Patrik Černý: Voice command of a TV for disabled users (Bc.)

2010 – Michal Richter: Advanced Czech spellchecker (Mgr., equiv. of M.A.)

2009 – Michal Richter: Integration of an n-gram language model with a Czech spellchecker (Bc.)

Other Activities

I am also on the editorial board of UFAL's Publishing House, which publishes the monograph-oriented series "Studies in Computational and Theoretical Linguistics".

Past Activities

The Prague Bulletin of Mathematical Linguistics editorial staff
CoNLL 2009 Shared Task organisation
Publicity co-chair of the ACL 2007 conference (with Jiří Mírovský)

Data

I have participated in the production of several datasets, all of which are freely available in the LINDAT/CLARIN repository.

Institute of Formal and Applied Linguistics

Charles University, Czech Republic
Faculty of Mathematics and Physics
