Daniel Zeman: CV

Daniel Zeman is a research associate at ÚFAL MFF UK (Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University), Malostranské náměstí 25, Praha, CZ-11800, Czechia

Education and Research

2000 – now: Univerzita Karlova, Praha (Czechia)

  • Researcher, Center for Computational Linguistics, since 2004 Institute of Formal and Applied Linguistics. Research interests: natural language processing, especially syntactic dependency parsing of Czech and other languages, morphological analysis, named entities, coreference, semantic relations, data annotation.

  • Co-leader of the Universal Dependencies consortium since 2014.

  • 2023 – 2026 team member of TAČR project TQ01000072 HiČKoK (History of Czech in Corpus Continuum)

  • 2022 – 2026 vice-chair and Czech representative in the management committee of COST Action CA21167: Universality, diversity and idiosyncrasy in language technology (UniDive)

  • 2020 – 2024 team member of GAČR project GX20-16819X LUSyD: Language Understanding: from Syntax to Discourse

  • 2015 – 2017 principal investigator of GAČR project GA15-10472S: Morphologically and Syntactically Annotated Corpora of Many Languages (MANYLA)

  • 2011 – 2013 principal investigator of GAČR project P406/11/1499: Czech in the Machine Translation Era (CzechMaTE) … (statistical machine translation between Czech and English/German/Spanish)

  • 2010 – 2018 team member of European projects KHRESMOI, QTLEAP, and HimL

  • 2004 – 2008 co-PI of project 1ET101470416: Multimodal Human Speech and Sign Language Processing for Human-Machine Communication (MUSSLAP) … consortium led by the University of West Bohemia

  • 2000 – 2004 team member of GAČR project Center for Computational Linguistics

  • Since 2010 member of the ISO TC37 working group (terminology and language resources)

2006: University of Maryland, College Park (Maryland, USA)

  • Fulbright-Masaryk Fellowship (January to July), followed by a post-doc position (July to December). Project with Philip Resnik at the Institute for Advanced Computer Studies (UMIACS) on cross-language parser adaptation and machine translation.

1997 – 2005: Univerzita Karlova, Praha (Czechia)

  • Doctoral study of mathematical linguistics at the Faculty of Mathematics and Physics. Dissertation "Parsing with a Statistical Dependency Model" defended on 2005-01-13; obtained the Ph.D. degree.

1999, May – July: University of Pennsylvania, Philadelphia (Pennsylvania, USA)

  • Research visit at IRCS (Institute for Research in Cognitive Science). Invited by prof. Aravind Joshi, worked together with Anoop Sarkar on automatic extraction of subcategorization frames from the Prague Dependency Treebank.

1998, July – August: Johns Hopkins University, Baltimore (Maryland, USA)

  • Participation in the summer workshop Core NLP Technology Applicable to Multiple Languages at the Center for Language and Speech Processing. Member of a team of 4 senior researchers, 4 doctoral students and 4 undergraduate students. The main topic was dependency parsing of Czech.

1990 – 1997: Univerzita Karlova, Praha (Czechia)

  • Master-level study of computer science. Obtained the Mgr. title (an equivalent of master of science).

Teaching

Supervisor of bachelor, master and PhD theses in mathematical linguistics / language technologies / artificial intelligence. Students graduated so far (as of March 2024): 5 Bc, 13 Mgr, 1 PhD.

Courses taught

  • 2020 – now: Natural Language Processing (Univerzita Karlova / Charles University, Prague) … an introduction for bachelor students, taught in English and Czech

  • 2020 – now: Dependency Grammars and Treebanks (Univerzita Karlova / Charles University, Prague) … core elective course for master students of language technologies, taught in English

  • 2018 – now: Multilingual Natural Language Processing (Univerzita Karlova / Charles University, Prague) … elective course for master and doctoral students, taught in English

  • 2016 – 2019: Introduction to Natural Language Processing (České vysoké učení technické / Czech Technical University, Prague), taught in Czech

  • 2013: Linguistic Software Tools (Palacký University, Olomouc) … two-week course for students of linguistics, taught in Czech

  • 2010 – now: Morphological and Syntactic Analysis (Univerzita Karlova / Charles University, Prague) … core elective course for master students of language technologies, taught mostly in English

  • 2001 – 2002: Programming (Univerzita Karlova / Charles University, Prague) … introductory programming seminar for computer science students

  • 2000 – 2018: Computers and Natural Language (České vysoké učení technické / Czech Technical University, Prague) … introductory course for bachelor students, taught in Czech

  • 1999 – 2009: Computational Processing of Natural Language (Univerzita Karlova / Charles University, Prague) … introductory course, split in 2009, with Morphological and Syntactic Analysis as the main successor

Languages

  • Czech (native)

  • English (fluent)

  • German (good)

  • Russian (fair)

  • Spanish (basic)

Awards

  • Fulbright-Masaryk Fellowship (January – July 2006, extended to January 2007, University of Maryland, College Park)

  • Dean’s award for best monograph at the Faculty of Mathematics and Physics in 2018

Program Committees

  • Senior area chair for LREC-COLING 2024 (CORE B conference)

  • Area chair for ACL 2020 (CORE A* conference)

  • PC member (organizer) of shared task workshops – see the separate section below

  • PC member (reviewer) for 28 international conferences and workshops, some of them in multiple years (e.g. ACL (CORE A*), NAACL (CORE A), EACL (CORE A), COLING (CORE A/B), CoNLL (CORE A/B))

Other Reviews

  • Reviewer for Language Resources and Evaluation (Springer journal, IF 2.7 Q3)

  • Grant reviewer for Rannsóknir Ísland (2019)

  • Opponent of PhD theses: 3 at CUNI, 5 abroad

  • 2014 – 2015 member of the scientific advisory board of the Czech National Corpus

Shared Task Organization

  • Main organizer of 2 large CoNLL shared tasks on multilingual end-to-end dependency parsing

  • Co-organizer of 8 other shared tasks (CoNLL, SemEval, IWPT, CRAC)

Publications and Citations

https://orcid.org/0000-0002-5791-6568
https://scholar.google.com/citations?user=QZsuZ_cAAAAJ
https://www.semanticscholar.org/author/Daniel-Zeman/1771298

 

              Google Scholar 2019 – 2024    Google Scholar total
Citations     4087                          6294
H-index       21                            29


 

Selected Publications

  1. Marie-Catherine de Marneffe, Christopher Manning, Joakim Nivre, Daniel Zeman (2021): Universal Dependencies. In: Computational Linguistics, ISSN 1530-9312, vol. 47, no. 2, pp. 255-308.

  2. Gosse Bouma, Djamé Seddah, Daniel Zeman (2020): Overview of the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies. In: Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies, pp. 151-161, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-952148-11-8.

  3. Zdeněk Žabokrtský, Daniel Zeman, Magda Ševčíková (2020): Sentence Meaning Representations across Languages: What Can We Learn from Existing Frameworks? In: Computational Linguistics, ISSN 1530-9312, vol. 46, no. 3, pp. 605-665.

  4. Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, Daniel Zeman (2020): Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection. In: Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), pp. 4027-4036, European Language Resources Association, Marseille, France, ISBN 979-10-95546-34-4

  5. Daniel Zeman (2018): The World of Tokens, Tags and Trees, ISBN 978-80-88132-09-7, ÚFAL, Praha, 2018

  6. Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, Slav Petrov (2018): CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 1-21, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-948087-82-7

  7. Héctor Martínez Alonso, Daniel Zeman: Universal Dependencies for the AnCora treebanks. In: Procesamiento del Lenguaje Natural, Vol. 57, Copyright © Sociedad Española para el Procesamiento del Lenguaje Natural, Salamanca, Spain, ISSN 1135-5948, pp. 91-98, Sep 2016

  8. Daniel Zeman, Ondřej Dušek, David Mareček, Martin Popel, Loganathan Ramasamy, Jan Štěpánek, Zdeněk Žabokrtský, Jan Hajič: HamleDT: Harmonized Multi-Language Dependency Treebank. In: Language Resources and Evaluation, Vol. 48, No. 4, Copyright © Springer Netherlands, Dordrecht, Netherlands, ISSN 1574-020X, pp. 601-637, Dec 2014

  9. Martin Popel, David Mareček, Jan Štěpánek, Daniel Zeman, Zdeněk Žabokrtský: Coordination Structures in Dependency Treebanks. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Sofia, Bulgaria, ISBN 978-1-937284-50-3, pp. 517-527, 2013

  10. Daniel Zeman: Reusable Tagset Conversion Using Tagset Drivers. In: Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), Copyright © European Language Resources Association, Marrakech, Morocco, ISBN 2-9517408-4-0, pp. 213-218, 2008

  11. Daniel Zeman, Philip Resnik: Cross-Language Parser Adaptation between Related Languages. In: IJCNLP 2008 Workshop on NLP for Less Privileged Languages, Copyright © International Institute of Information Technology, Hyderabad, India, pp. 35-42, 2008

  12. Anoop Sarkar, Daniel Zeman: Automatic Extraction of Subcategorization Frames for Czech. In: Proceedings of the 18th International Conference on Computational Linguistics (COLING), Copyright © Universität des Saarlandes, Saarbrücken, Germany, ISBN 1-55860-717-X, pp. 691-697, 2000

Data and Software

  • Interset (since 2006) – a framework for automatic conversion among morphosyntactic descriptions of words (includes over 60 tagsets)

  • HamleDT (2012 – 2015) – a harmonized collection of treebanks for 30 languages

  • Universal Dependencies (since 2014) – successor of HamleDT; it has become a de facto standard for morphosyntactically annotated data, widely used in research and language technologies, currently covering 148 languages, with 2 releases every year

    • I co-lead the project, handle the release process, and have designed and maintain a large part of the related software tools and infrastructure

  • Deep UD (since 2017), CorefUD (since 2021) … other multilingual datasets related to UD, with my contribution

  • Prague Dependency Treebank and its conversion for use in semantic parsing tasks

  • Hindi and Urdu monolingual and parallel (with English) corpora (2008 – 2010) … have been used especially in machine translation