Milan Straka
Main Research Interests
- Machine Learning
- Artificial Neural Networks
- Deep Learning
- Structured Prediction
- Bayesian Nonparametric Modelling and Unsupervised Learning
- NLP Tools
- POS Tagging
- Dependency Parsing
- Named Entity Recognition and Linking
Projects
- Tools
- Data
- Czech Named Entity Corpus
- MorfFlex CZ (generation of Lindat files and MorphoDiTa dictionaries)
- DeriNet
Curriculum Vitae
- My Curriculum Vitae in English and in Czech.
Teaching
Selected Bibliography
- Google Scholar
- ORCID: 0000-0003-3295-5576
- Scopus ID: 23391086200
- Researcher ID: N-1897-2017
Papers
- Practical End-to-End Optical Music Recognition for Pianoform Music. In: Document Analysis and Recognition -- ICDAR 2024, pp. 55-73, Springer International Publishing, Cham, Switzerland, ISBN 978-3-030-86333-3 (url, local PDF, bibtex)
- Findings of the Third Shared Task on Multilingual Coreference Resolution. In: Proceedings of The Seventh Workshop on Computational Models of Reference, Anaphora and Coreference, pp. 78-96, Association for Computational Linguistics, Kerrville, TX, USA, ISBN 979-8-89176-171-1 (url, local PDF, bibtex)
- CorPipe at CRAC 2024: Predicting Zero Mentions from Raw Text. In: Proceedings of The Seventh Workshop on Computational Models of Reference, Anaphora and Coreference, pp. 97-106, Association for Computational Linguistics, Kerrville, TX, USA, ISBN 979-8-89176-171-1 (url, local PDF, bibtex)
- Open-Source Web Service with Morphological Dictionary--Supplemented Deep Learning for Morphosyntactic Analysis of Czech. In: 27th International Conference on Text, Speech and Dialogue, pp. 279-290, Springer, Cham, Switzerland, ISBN 978-3-031-70563-2 (url, local PDF, bibtex)
- ÚFAL LatinPipe at EvaLatin 2024: Morphosyntactic Analysis of Latin. In: Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024, pp. 207-214, ELRA and ICCL, Torino, Italia, ISBN 978-2-493814-46-3 (pdf, local PDF, bibtex)
- beeFormer: Bridging the Gap Between Semantic and Interaction Similarity in Recommender Systems. In: Proceedings of the 18th ACM Conference on Recommender Systems, pp. 1102-1107, Association for Computing Machinery, New York, NY, United States, ISBN 979-8-4007-0505-2 (url, local PDF, bibtex)
- CWRCzech: 100M Query-Document Czech Click Dataset and Its Application to Web Relevance Ranking. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1221-1231, Association for Computing Machinery, New York, NY, USA, ISBN 9798400704314 (url, local PDF, bibtex)
- Language Technology Tools and Services. In: European Language Grid: A Language Technology Platform for Multilingual Europe, pp. 131-150, Springer Nature Switzerland AG, Cham, Switzerland, ISBN 978-3-031-17257-1 (url, bibtex)
- ÚFAL CorPipe at CRAC 2023: Larger Context Improves Multilingual Coreference Resolution. In: Proceedings of the CRAC 2023 Shared Task on Multilingual Coreference Resolution, pp. 41-51, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-955917-02-5 (url, local PDF, bibtex)
- Quality and Efficiency of Manual Annotation: Pre-annotation Bias. In: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), pp. 2909-2918, European Language Resources Association, Marseille, France, ISBN 979-10-95546-72-6 (url, local PDF, bibtex)
- Czech Grammar Error Correction with a Large and Diverse Corpus. In: Transactions of the Association for Computational Linguistics, ISSN 2307-387X, 10, pp. 452-467 (url, local PDF, bibtex)
- ÚFAL CorPipe at CRAC 2022: Effectivity of Multilingual Models for Coreference Resolution. In: Proceedings of the CRAC 2022 Shared Task on Multilingual Coreference Resolution, pp. 28-37, Association for Computational Linguistics, Gyeongju, Korea (url, local PDF, bibtex)
- Understanding Model Robustness to User-generated Noisy Texts. In: Proceedings of the 7th Workshop on Noisy User-generated Text (W-NUT 2021), pp. 340-350, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-954085-90-9 (url, local PDF, bibtex)
- Diacritics Restoration using BERT with Analysis on Czech language. In: The Prague Bulletin of Mathematical Linguistics, ISSN 0032-6585, 116, pp. 27-42 (pdf, local PDF, bibtex)
- ÚFAL at MultiLexNorm 2021: Improving Multilingual Lexical Normalization by Fine-tuning ByT5. In: Proceedings of the 7th Workshop on Noisy User-generated Text (W-NUT 2021), pp. 483-492, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-954085-90-9 (url, local PDF, bibtex)
- Character Transformations for Non-Autoregressive GEC Tagging. In: Proceedings of the 7th Workshop on Noisy User-generated Text (W-NUT 2021), pp. 417-422, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-954085-90-9 (url, local PDF, bibtex)
- RobeCzech: Czech RoBERTa, a Monolingual Contextualized Language Representation Model. In: 24th International Conference on Text, Speech and Dialogue, pp. 197-209, Springer, Cham, Switzerland, ISBN 978-3-030-83526-2 (url, local PDF, bibtex)
- Prague Dependency Treebank - Consolidated 1.0. In: Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), pp. 5208-5218, European Language Resources Association, Marseille, France, ISBN 979-10-95546-34-4 (url, local PDF, bibtex)
- Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer. In: 23rd International Conference on Text, Speech and Dialogue, pp. 171-179, Springer, Cham, Switzerland, ISBN 978-3-030-58322-4 (url, local PDF, bibtex)
- ÚFAL at MRP 2020: Permutation-invariant Semantic Parsing in PERIN. In: Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing, pp. 53-64, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-952148-64-4 (url, local PDF, bibtex)
- UDPipe at EvaLatin 2020: Contextualized Embeddings and Treebank Embeddings. In: Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages, pp. 124-129, European Language Resources Association (ELRA), Marseille, France, ISBN 979-10-95546-53-5 (url, local PDF, bibtex)
- 75 Languages, 1 Model: Parsing Universal Dependencies Universally. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2779-2795, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-90-1 (url, local PDF, bibtex)
- Grammatical Error Correction in Low-Resource Scenarios. In: Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pp. 346-356, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-84-0 (url, local PDF, bibtex)
- CUNI System for the Building Educational Applications 2019 Shared Task: Grammatical Error Correction. In: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 183-190, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-34-5 (url, local PDF, bibtex)
- MRP 2019: Cross-Framework Meaning Representation Parsing. In: Proceedings of the CoNLL 2019 Shared Task: Cross-Framework Meaning Representation Parsing, pp. 1-27, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-60-4 (url, local PDF, local PDF, bibtex)
- ÚFAL MRPipe at MRP 2019: UDPipe Goes Semantic in the Meaning Representation Parsing Shared Task. In: Proceedings of the CoNLL 2019 Shared Task: Cross-Framework Meaning Representation Parsing, pp. 127-137, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-60-4 (url, local PDF, bibtex)
- Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER. In: Proceedings of the 22nd International Conference on Text, Speech and Dialogue - TSD 2019, Lecture Notes in Computer Science, ISSN 0302-9743, 11697, pp. 137-150, Springer International Publishing, Cham / Heidelberg / New York / Dordrecht / London, ISBN 978-3-030-27946-2 (url, local PDF, bibtex)
- Evaluating Contextualized Embeddings on 54 Languages in POS Tagging, Lemmatization and Dependency Parsing (Electronic). In: ArXiv.org Computing Research Repository, ISSN 2331-8422, 1904.02099 (url, local PDF)
- UDPipe at SIGMORPHON 2019: Contextualized Embeddings, Regularization with Morphological Categories, Corpora Merging. In: Proceedings of the 16th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pp. 95-103, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-36-9 (pdf, local PDF, bibtex)
- Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5326-5331, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2 (pdf, local PDF, bibtex)
- Hluboké učení v automatické analýze českého textu. In: Slovo a slovesnost, ISSN 0037-7031, vol. 80, no. 4, pp. 306-327 (bibtex)
- Using Adversarial Examples in Natural Language Processing. In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pp. 3693-3700, European Language Resources Association, Miyazaki, Japan, ISBN 979-10-95546-00-9 (url, local PDF, bibtex)
- LemmaTag: Jointly Tagging and Lemmatizing for Morphologically Rich Languages with BRNNs. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing EMNLP 2018, pp. 4921-4928, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-948087-84-1 (url, local PDF, bibtex)
- Diacritics Restoration Using Neural Networks. In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pp. 1-10, European Language Resources Association, Miyazaki, Japan, ISBN 979-10-95546-00-9 (url, local PDF, bibtex)
- UDPipe 2.0 Prototype at CoNLL 2018 UD Shared Task. In: Proceedings of CoNLL 2018: The SIGNLL Conference on Computational Natural Language Learning, pp. 197-207, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-948087-72-8 (pdf, local PDF, bibtex)
- SumeCzech: Large Czech News-Based Summarization Dataset. In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pp. 3488-3495, European Language Resources Association, Miyazaki, Japan, ISBN 979-10-95546-00-9 (url, local PDF, bibtex)
- CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 1-21, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-948087-82-7 (pdf, local PDF, bibtex)
- Neural Networks for Multi-Word Expression Detection. In: Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), pp. 60-65, Association for Computational Linguistics (ACL), Stroudsburg, PA, USA, ISBN 978-1-945626-48-7 (pdf, local PDF, bibtex)
- Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 88-99, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-945626-70-8 (pdf, local PDF, bibtex)
- Prague at EPE 2017: The UDPipe System. In: Proceedings of the 2017 Shared Task on Extrinsic Parser Evaluation at the Fourth International Conference on Dependency Linguistics and the 15th International Conference on Parsing Technologies, pp. 65-74, Association for Computational Linguistics (ACL), Stroudsburg, PA, USA, ISBN 978-1-945626-74-6 (pdf, local PDF, bibtex)
- Czech Named Entity Corpus. In: Handbook of Linguistic Annotation, pp. 855-873, Springer Netherlands, Netherlands, ISBN 978-94-024-0879-9 (bibtex)
- CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 1-19, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-945626-70-8 (pdf, local PDF, bibtex)
- UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pp. 4290-4297, European Language Resources Association, Paris, France, ISBN 978-2-9517408-9-1 (pdf, local PDF, bibtex)
- Neural Networks for Featureless Named Entity Recognition in Czech. In: Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Lecture Notes in Computer Science, ISSN 0302-9743, 9924, pp. 173-181, Springer International Publishing, Cham / Heidelberg / New York / Dordrecht / London, ISBN 978-3-319-45509-9 (url, local PDF, bibtex)
- Lexikální síť DeriNet: elektronický zdroj pro výzkum derivace v češtině. In: Časopis pro moderní filologii, ISSN 0008-7386, vol. 98, no. 1, pp. 62-76 (bibtex)
- Merging Data Resources for Inflectional and Derivational Morphology in Czech. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pp. 1307-1314, European Language Resources Association, Paris, France, ISBN 978-2-9517408-9-1 (pdf, local PDF, bibtex)
- Parsing Universal Dependency Treebanks using Neural Networks and Search-Based Oracle. In: 14th International Workshop on Treebanks and Linguistic Theories (TLT 2015), pp. 208-220, IPIPAN, Warszawa, Poland, ISBN 978-83-63159-18-4 (pdf, local PDF, bibtex)
- Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 13-18, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-941643-00-6 (pdf, local PDF, bibtex)
- Stop-probability estimates computed on a large corpus improve Unsupervised Dependency Parsing. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pp. 281-290, Association for Computational Linguistics, Sofija, Bulgaria, ISBN 978-1-937284-50-3 (pdf, local PDF, bibtex)
- A New State-of-The-Art Czech Named Entity Recognizer. In: Text, Speech and Dialogue: 16th International Conference, TSD 2013. Proceedings, Lecture Notes in Computer Science, ISSN 0302-9743, 8082, pp. 68-75, Springer Verlag, Berlin / Heidelberg, ISBN 978-3-642-40584-6 (url, local PDF, bibtex)
- Adams’ Trees Revisited – Correct and Efficient Implementation. In Proceedings of TFP 2011, Symposium on Trends in Functional Programming, Madrid, Spain, May 2011 (local PDF)
- The performance of the Haskell containers package. In Proceedings of Haskell 2010, 3rd ACM Haskell symposium on Haskell, Baltimore, Maryland, September 2010 (local PDF)
- Optimal worst-case fully persistent arrays. In TFP 2009, Symposium on Trends in Functional Programming, Komarno, Slovakia, June 2009 (local PDF)
- Linear-Time Ranking of Permutations. In Proceedings of ESA 2007, 15th Annual European Symposium, Eilat, Israel, October 2007 (local PDF)
Theses
- Doctoral thesis: Functional Data Structures and Algorithms (local PDF)
- Master's thesis: Quadratic Fields Based Cryptography (local PDF), Czech only
- Bachelor's thesis: Factoring Polynomials over Finite Fields (local PDF), Czech only