Jan Hajič

office
Room 420
office hours
Monday 10:00-11:30 (unless traveling)
email
jan.hajic@mff.cuni.cz
phone
+420 951 554 257

Curriculum Vitae

My current CV in English with a list of selected publications.

Selected Bibliography

(Major) News

2021/09/30 The detailed evaluation report from the Research Infrastructure evaluation 2021 is in. The LINDAT/CLARIAH-CZ research infrastructure, which I lead, has done very well, virtually securing continuation after 2022. We received the highest mark overall, and of the 13 detailed criteria, only two are graded at the second highest mark; the rest are "Excellent". Thanks go to all the teams at all the institutions taking part in this complex, cooperative project!

2021/02/24 The Institute has signed, through CUIP (Charles University Innovations Prague Ltd., the IPR and tech transfer company fully owned by Charles University), its first "microlicence" for commercial use of its highly acclaimed and accurate CUBBITT Machine Translation service. The service, developed by my colleagues at UFAL, is run as part of the LINDAT/CLARIAH-CZ Research Infrastructure. I have been working with CUIP on previous contracts as well; now we have a simple, one-page licence agreement that makes commercial use of all the services run by LINDAT/CLARIAH-CZ possible almost instantly, and we hope to close a host of such contracts in the near future.

 

2021/02/01 A "European Language Equality" project, funded by the EP and EC, has started. I am involved as the Co-PI, leading the Charles University group, which is one of the six core partners of the project. I also lead Workpackage 2, which should prepare a host of users', developers' and other stakeholders' surveys, in order to get wide support and relevant information for creating the Horizon Europe-aimed Strategic Research Agenda in Digital Language Equality by 2030.

 

2021/01/01 A "microproject" within the Humane-AI-Net H2020 AI Center of Excellence project started, between my group and a group at DFKI Berlin. We will enrich the SynSemClass event ontology with selected German verbs, creating multilingual synonym classes now for three languages (Czech, English and German).

 

2020/12/19 A number of new and/or massively curated datasets will be released in the coming days and weeks through the LINDAT/CLARIAH-CZ repository and the related services, both richly annotated corpora such as the 4M PDT-C and the associated morphological and valency lexicons. The related heavily interdependent projects, which I had the opportunity to personally lead, lasted for about three years and involved over 20 colleagues and student annotators.

 

2020/10/01 The Humane-AI-Net Call 48 "AI Centre of Excellence" H2020 project, in which I have been substantially involved during its preparation and which I lead as a Co-PI for the Czech partner Charles University, has officially started! We will strive to get involved in as many "microprojects", advancing the SoA in NLP and AI, as possible.

2020/09/01 The Evaluation 2021 process for Research Infrastructures begins. LINDAT/CLARIAH-CZ will do its best to succeed and expand its coverage to oral history and the study of the Holocaust, by adding the EHRI CZ partners. The Evaluation report shall be submitted by December; success will mean the possibility to apply for extension of funding beyond 2022.

 

2020/06/20 The Center for Visual History Malach, as an integral part of UFAL and LINDAT/CLARIAH-CZ, will continue with one more co-coordinator with a focus on the Balkans. This is all despite the fact that CVHM is losing central support from the University - LINDAT/CLARIAH-CZ will help, as well as new grants and externally funded projects, and in fact, further expansion of CVHM activities is being planned.

 

2020/04/09 The web page of ICCL (International Committee for Computational Linguistics, the body that has organized Coling conferences since the 1960s) is now fully hosted at UFAL (taken over from the Univ. of Sheffield - big thanks for having it there for so long!), with a list and an archive of past Coling conferences, all current and former members, and a permanent call for hosting future Colings. Just in time for Coling 2020, even though it will move to December and become partly or fully virtual.

2020/01/27 LINDAT/CLARIAH-CZ and UFAL kicked off a continuation of the Mellon Foundation project, led by James Pustejovsky at Brandeis University, Waltham, MA. Highlight of the kickoff meeting visit: visit to the WGBH Archives in Boston.

2020/01/01 The LINDAT/CLARIAH-CZ project continues as a merged project with the former LINDAT/CLARIN. A new equipment grant has started, and soon the infrastructure will be extended with new computers and software. The LUSyD project has also formally started.

2019/11/10 I have been awarded the LUSyD project (Language Understanding: from Syntax to Discourse) within the EXPRO program of the Grant Agency of the Czech Republic, to run 2020-2024. We will tackle challenging problems in Computational Linguistics - namely, how to represent knowledge in relation to semantic representation of texts, and whether it can be learned by state-of-the-art machine learning approaches.

2019/07/01 I am leaving for the U.S. this summer to teach a Multilingual Natural Language Processing course (full summer semester, Masters/graduate level) at the University of Colorado in Boulder, Computer Science Department (CS 7800). I am excited to take up this (re)new(ed) challenge of teaching in the U.S.!

2019/01/01 The LINDAT/CLARIAH-CZ project has started, for the DARIAH CZ partners. LINDAT/CLARIN still continues until the end of 2019, when both projects merge.

2019/01/01 All four UFAL H2020 projects have started - from the infrastructural SSHOC and ELG through Bergamot to the ELITR automatic interpretation project (by O. Bojar).

2018/07/01 UFAL has been awarded four H2020 grants in the H2020 Call 29 and INFRA Call. Two are Ondrej Bojar's (ELITR, which he coordinates, and Bergamot, coordinated by Edinburgh), and two are mine: SSHOC (INFRA project for SSH cooperation in Research Infrastructures) and European Language Grid, a platform for a Language Technology marketplace, coordinated by DFKI Berlin.

2017/10/31 The last Workprogramme (2018-2020) of the H2020 EU research funding has been published. After 4 years, there is again one EUR 25M call for Language Technology, thanks to the efforts of the whole META-NET community.

2017/09/15 Our proposal for DARIAH CZ infrastructure has passed the 2nd round of evaluation successfully, with the second highest mark possible and only a request to reduce the budget. Funding will start after the next government's budget approval in 2019 or 2020 - jointly with LINDAT/CLARIN, which obtained the highest possible evaluation grade from the international evaluation board. Thanks to all members of the LINDAT/CLARIN team for this achievement!

2017/06/20 The Structural Funds project to improve LINDAT/CLARIN infrastructure has finally been signed, for 2017-2019. It will allow us to expand our computing and data storage capabilities, as well as to fund some infrastructural research.

2017/04/01 Our proposal for DARIAH CZ infrastructure has passed the 1st round of evaluation successfully. We are now working with our 9 partner institutions on a full proposal for the 2nd round.

2017/04/01 Two new NAKI II projects have started, one of which I coordinate (VIADAT). We also participate in another project coordinated by our colleagues at the University of West Bohemia, with the Institute for the Study of Totalitarian Regimes.

2017/01/01 A project to help Prague's Mayor's Office search public documents more effectively has started, funded from Prague's Structural Funds (for 2017-2018).

2016/03/01 I have started my Adjunct Professor position at the University of Colorado in Boulder, Department of Computer Science, working with Martha Palmer and others on Computational Linguistics, NLP and new multilingual language resources.

2016/01/01 The second phase of the LINDAT/CLARIN Research Infrastructure, of which I am the PI, has started. It has been reduced in budget and time by the government decision of 2015/12/21, but it will allow us to continue our work on language resources and services.

 


Research Interests, Grants

My research interests have evolved from morphology and tagging of inflective languages (lexicons, analysis and generation tools - now reimplemented by Milan Straka as MorphoDiTa) to machine translation (French-English while at IBM and Czech-English; also, Czech-Russian and other closely related languages). I am also interested in parsing (see e.g. the CLSP Workshop on parsing Czech) and generation. However, in the past 10 years, I have devoted most of my research time to creating linguistic resources, such as the Prague Dependency Treebank family of projects (Czech, English, Arabic), and to managing new research projects, mainly funded by the EU (see below for a complete list). I am also involved in the Universal Dependencies project, led by Joakim Nivre of Uppsala University and hosted by LINDAT/CLARIAH-CZ as the official UD repository. My management responsibilities include the LINDAT/CLARIAH-CZ Research Infrastructure and also being the Deputy Chair of the Institute.

I am also interested in spoken language understanding. I participated in the now finishing project Malach, both on the language modeling part (for ASR), on thesaurus translation and on the IR Czech test collection.

I work closely not only with my students, but also with other Czech and foreign teams, such as the University of West Bohemia in the Czech Republic, the Center for Speech and Language Processing at JHU, the CLEAR lab at CU Boulder, the Linguistic Data Consortium, the European Language Resources Association (ELRA/ELDA), and several European universities on EU projects (see below).

I am or have been the PI, or the national PI of several major Czech, EU and NSF (US) research projects. The list of current projects (or of those finished within the last 10 years) is below.

 

 Projects 

2021-2022 European Language Equality project, by EP/EC, PPP Action. Preparation of Language-centric AI Strategic Research Agenda for Digital Language Equality by 2030.

2020-2023 Humane-AI-Net / AI Centre of Excellence, Call 48 H2020 project (Co-PI for Charles University)
2019-2022 LINDAT/CLARIAH-CZ, Large infrastructural grant for digital humanities, language resources, data access and distribution and related research, project LM2018101 of the Ministry of Education of the Czech Republic (continuation and extension to digital humanities of LM2010013 and LM2015071); complemented by the Structural Funds project "OP VVV VI 2 LINDAT/CLARIAH-CZ" for equipment and computing facilities extensions and renewal (PI)
2020-2024 Language Understanding: from Syntax to Discourse (LUSyD). Grant Agency of the Czech Republic, large grant from the EXPRO programme (PI).
2019-2022 European Language Grid (ELG), EC Call 27 project for building a Language Technology platform with a host of resources and LT services for both commercial and research use. I am the Co-PI and lead of the Charles University team providing 600+ services and all resource metadata to the ELG. I also supervise the effort to fund ELG Pilot Projects which are selected in Open Calls throughout Europe, distributing almost EUR 2M through the FSTP financial mechanism.
2010-2015, extended to 2019 LINDAT/CLARIN, Large infrastructural grant for language resources, data access and distribution and related research, project LM2010013 and since 2016 as LM2015071 of the Ministry of Education of the Czech Republic; now complemented by the Structural Funds project "OP VVV LINDAT/CLARIN" for equipment and computing facilities extensions and renewal (PI)
2017-2018 "Document Access" (sub)project for Prague's Mayor's Office (PI)
2016-2020 Mellon Foundation grant, with Brandeis Univ. (coordinator), Vassar Univ., Univ. of Tübingen - harmonization of access to language resources and tools between LAPPS Grid and Clarin
2016-2019 VIADAT, Virtual Assistant for Access to Oral History Archives, with the Institute of Contemporary History of the Academy of Sciences of the Czech Republic and the National Film Archive of the Czech Republic (PI)
2015-2017 CRACKER, Cracking the Language Barrier: Coordination, Evaluation and Resources for European MT Research. H2020 CSA, PI of the Czech partner, Charles University in Prague. Under negotiation. Coordinated by Hans Uszkoreit, DFKI Berlin, Germany.
2015-2018 HimL, Health in my Language. H2020 Innovation Action. PI of the Czech partner, Charles University in Prague. Under negotiation. Coordinated by Barry Haddow, University of Edinburgh, Scotland.
2015-2018 QT21, Quality Translation 21. H2020 Research and Innovation Action. PI of the Czech partner, Charles University in Prague. Under negotiation. Coordinated by Barry Haddow, University of Edinburgh, Scotland.
2013-2016 QTLeap, Quality Translation by Deep Language Engineering Approaches. FP7 STREP project. PI of the Czech partner, Charles University in Prague. Coordinated by Antonio Branco, FCUL, Lisbon, Portugal.
2011-2015 AMALACH, Access to multilingual archives, with ZCU in Pilsen and USC, Los Angeles, USA, a Czech Ministry of Culture applied project. PI of the grant.
2011-2014 EUDAT, European Data Infrastructure, Large infrastructural EU project. PI of the Czech partner, Charles University in Prague.
2010-2014 Khresmoi, IP of the 7th FP of the EU (Coordinator: Henning Müller, HES-SO, Switzerland)
2010-2013 META-NET, Network of Excellence of the 7th FP of the EU - Building the Multilingual Europe Technology Alliance (coordinated by DFKI, Berlin, Prof. Hans Uszkoreit)
2010-2013 Faust, STREP of the 7th FP of the EU - Feedback analysis for improved Statistical Machine Translation (Coordinator: William Byrne, University of Cambridge)
2009-2012 EuroMatrixPlus, STREP of the 7th FP of the EU (Coordinator: Hans Uszkoreit, Univ. of Saarland, Germany)
2006-2010 Companions, IP of the 6th FP of the EU - Conversational Dialogue system (Coordinator: Yorick Wilks, Univ. of Sheffield, GB)
2005-2011 Center for Computational Linguistics, a virtual Center for joint research with the University of West Bohemia, Masaryk University of Brno, and the Institute of the Czech Language in Prague
2006-2010 PIRE, a project funded by the NSF to promote U.S. graduate student education in Europe. Topic: Investigation of Meaning Representations in Language Understanding for Speech Reconstruction and Machine Translation Systems.
2002-2007 Malach, project for automatic speech recognition (in many languages) of taped interviews with Holocaust survivors, collected by the Shoah Visual History Foundation. Also, Information Retrieval experiments and resource creation.

Before that, I was the PI or Co-PI of many other projects, such as the highly collaborative, nation-wide Czech National Corpus project supported by the Czech Grant Agency (2003-2006), several collaborative grants for mutual visits to/from U.S. institutions (Johns Hopkins University, University of Pennsylvania, Univ. of Colorado), and several smaller subcontracting grants (such as the U.S.-based GALE project). In the 90s, I was the Czech PI of several collaborative EU projects specifically aimed at the former Soviet Bloc countries (EU project STEEL, EU project CEGLEX).

I have been working on some other grants as a researcher as well, such as the predecessor Center for Computational Linguistics (2000-2004), the Laboratory for Linguistic Data (1996-2000), Czech-English MT project supported by the Czech Grant Agency MATRACE (1993-1995), and many smaller projects.

Several industrial projects have got my attention as well, such as the Czech Grammar Checker project and certain lexicon(s) for Microsoft, morphological databases for companies like IBM, Xerox, Lotus, Morphologic, Zi Corp., Lernout & Hauspie, and cooperation on product development for several Czech companies, such as ASPI (legal information system using NL search), Oracle (the Oracle Context product) and morphological dictionary development for the Czech and Slovak portals centrum.cz and centrum.sk. I continue to be engaged in negotiations with national as well as international companies regarding licensing of language resources and/or providing services, such as secure machine translation and others.



Short Bio

2003- Institute of Formal and Applied Linguistics, School of Computer Science, Faculty of Mathematics and Physics, Charles University in Prague. Vice-director (2012-). LINDAT/CLARIN infrastructural project director/coordinator (2010-). Director (2003-2011). Acting director (2001-2003, 2011-2012).
2017/8 Fellow at Norwegian Academy of Sciences, SymSem group at the Center for Advanced Studies
2016- Department of Computer Science, University of Colorado in Boulder. Adjunct Professor. Teaching 2019 Summer Term b (Multilingual Natural Language Processing, CS/LING 7800)
2008- Full Professor of the Charles University in Prague
2003-2007 Associate Professor of the Charles University in Prague
2002 Team Leader, CLSP JHU Summer Workshop, Generation in the Context of Machine Translation
1999-2000 Visiting Assistant Professor, Computer Science Dept. and Center for Speech and Language Processing, Johns Hopkins University, Baltimore, MD, USA. Teaching "Introduction to NLP" (two semesters) and "Data Structures"
1998 Team Leader, CLSP JHU Summer Workshop, Core Natural Language Processing Technology Applicable to Multiple Languages
1994 PhD ("Dr.") in Computational Linguistics, Faculty of Mathematics and Physics, Charles University in Prague. Topic: Computational Morphology of Czech.
1993-2003 Researcher, Assistant Professor, Institute of Formal and Applied Linguistics, School of Computer Science, Faculty of Mathematics and Physics, Charles University in Prague.
1991-1993 Visiting Scientist, IBM T.J.Watson Research Center, Yorktown Heights, NY, USA. Project: Candide (Statistical Machine Translation French -> English, project head(s): Robert Mercer, Peter Brown)
1990,1991 Visiting Scientist, ISSCO, Univ. of Geneva, Switzerland. Project: Multilingual Morphological Analysis.
1984-1991 Researcher, Research Institute of Mathematical Machines, Prague. Project: Machine Translation Czech -> Russian (software documentation).
1979-1984 Bc. & Master Degree study, Faculty of Mathematics and Physics, Charles University in Prague (high honors, RNDr. 1984, thesis topic: Natural Language Robot Control).



Publications

My complete list of publications as recorded in our Institute's bibliography system is here. The full publication database of our Institute can be found here. 

Some pre-2000 publications can be missing from the above system. For a complete list of my publications published before 2008 please see this PDF.



Teaching

I am now teaching an adapted version of the "Introduction to (statistical) NLP" course which I developed while at JHU. The current course is divided into two parts: NPFL067 and NPFL068. Please see also my Hopkins' archive web pages for more information and the complete set of foils in html form.


Service

General Conference Chair

2010 ACL'10, Uppsala, Sweden

Program Committee Chair, Co-chair

2018 Treebanks and Linguistic Theories 17, Oslo, Norway, with S. Oepen, M. Candito, K. Gerdes, S. Kübler
2018 Treebanks and Linguistic Theories 16, Prague, Czech Republic, with S. Oepen, S. Kübler
2017 Treebanks and Linguistic Theories 15, Bloomington, Indiana, USA, with S. Kübler, M. Dickinson and A. Przepiórkowski
2014 Coling 2014, Dublin, Ireland; Programme Committee co-chair, with Jun-ichi Tsujii.
2012 META-RESEARCH Workshop on Advanced Treebanking, LREC 2012, Istanbul, Turkey (with Koenraad De Smedt, Antonio Branco and Marko Tadic).
2007 TLT'07 (Treebanks and Linguistic Theories), Bergen, Norway
2006 TLT'06 (Treebanks and Linguistic Theories), Prague, Czech Rep.
2003 EACL'03 (European ACL Conference), Budapest, Hungary
2002 EMNLP'02 (Empirical Methods in NLP), Philadelphia, PA, USA
1999 Thematic Session on "Parsing inflective and free word order languages" ACL '99, June 1999, College Park, MD, USA

Program Committee Area Chair, Full PC Member

2004 EMNLP'04, Barcelona, Spain
2004 EAMT Workshop, La Valetta, Malta
2002 ACL'02, Philadelphia, PA, USA
1995 EACL'95, Dublin, Ireland
2003- Text, Speech and Dialog Conference, Czech Rep., (standing) PC (SC) Member

I have also served as a reviewer at an additional 43 conferences or workshops (between 1994 and 2013).

Organization or co-organization of conferences and workshops

2020 The Second International Workshop on Designing Meaning Representations (DMR 2020), at Coling 2020
2020 Cross-Framework Meaning Representation Parsing 2nd Shared Task 2020, at EMNLP/CoNLL 2020
2019 Cross-Framework Meaning Representation Parsing 1st Shared Task 2019, co-organization and Publication Chair
2018 Treebanks and Linguistic Theories 16, Prague, Czech Republic, with an associated Data Provenance Workshop by M. Butt
2018 CoNLL 2018 Second Shared Task on Multilingual Parsing Universal Dependencies, at CoNLL 2018, Brussels, Belgium
2017 CoNLL 2017 First Shared Task on Multilingual Parsing Universal Dependencies, at ACL/CoNLL 2017, Vancouver, Canada
2014 Fred Jelinek JHU Summer Workshop for Speech and Language Processing, July 2014, Prague, Czech Rep. (in cooperation with Johns Hopkins Univ., Baltimore, MD, USA)
2012 META-RESEARCH Workshop on Advanced Treebanking, LREC 2012, Istanbul, Turkey.
2007 ACL'07 and EMNLP'07, Prague, Czech Republic (Local Coordinator)
2006 TLT'06, Prague, Czech Republic
2006-2010 Vilem Mathesius Courses (Schools), Prague, Czech Republic

Committees, Boards

2023-2026 Member of the Scientific Council of the Czech Science Foundation (Grantová agentura ČR)
2018- Member of the Scientific Council of Charles University in Prague
2015- Executive Board of META-NET, chair.
2015-2021 Member (external) of the Scientific Council of the Faculty of Electrical Engineering, Czech Technical University
2015- Member (external) of the Scientific Council of the Czech Institute for Informatics, Robotics and Cybernetics, Czech Technical University
2013- Member of the joint Clarin DE / Dariah DE Technical Advisory Board (Germany).
2012-2014 Member of the International Advisory Board, Clarin NL (Netherlands).
2012- Member of the International Committee for Computational Linguistics.
2013-2017 Member of the Management Committee (for Czech Republic) for the COST IC1207 Action of the ESF, within the 7th FP EU (PARSEME, IC1207).
2012-2018 Member of the Standing Committee for CLARIN Technical Centres (SCCTC), of the EU-wide language resource infrastructure Clarin ERIC (1st and 2nd term)
2012-2024 Member of the Scientific Council of the Faculty of Mathematics and Physics, Charles University in Prague
2012-2021 Member of the Council of the core research PRVOUK project, awarded to the Computer Science School by the Charles University in Prague; continuing in the PROGRESS Q48 and Q18 project boards
2011-2019 Research Council of the Technology Agency of the Czech Republic, member (2 terms)
2011-2012 Expert panel of the Coordinating Committee on the strategy of applied research in the Czech Republic ("Priorities 2030") of the Council for Science, Research and Innovations of the Czech Republic
2011- Steering Committee for the establishment of the Transactions of the Association for Computational Linguistics journal; head of search committee
2011-2012 Scientific Council of the Faculty of Mathematics and Physics, Charles University in Prague (1st term)
2010-2014 Subcommittee for social sciences and humanities, Council for Science, Research and Innovations of the government of the Czech Republic
2008-2012 Computational Linguistics, Editorial Board Member
2003- NSF Panels (ITR, HLT)
2002- ACL SIGDAT Advisory Board member
1999-2002 TEI Consortium Board of Directors Member, ACL Representative
1998-1999 TEI Steering Committee Member, ACL Representative
1997- EU Evaluation Committee(s), Research Projects
1996- Grant Agency of the Czech Republic, reviewer (Linguistic and Computer Science Programs)
1995-1996 European Chapter of the ACL Advisory Board Member
1990- Czech National Corpus Founding Member, member of CNC Advisory Board (2016-)

Awards

2022 Donatio Prize of Charles University
2020 Silver Medal of the Charles University in Prague
2012 Silver Medal I of the Faculty of Mathematics and Physics, Charles University in Prague.
2009 Award of the Academy of Sciences of the Czech Republic for the best research project in the programme "Information Society" 2005-2009 (Project: "From natural language to the semantic web")
2005 Co-author of a best student paper at EMNLP 2005, Vancouver, with Ryan McDonald, Fernando Pereira and Kiril Ribarov: "Non-projective Dependency Parsing using Spanning Tree Algorithms"
2001 Silver Medal of the Charles University in Prague (as a member of the Czech National Corpus team)

Membership

I am a member of the ACL, ISCA, ACM, IEEE, Czech Cybernetics Society and the Prague Linguistic Circle.


Former Web Page(s)

You might want to visit my previous page(s) and teaching pages at http://www.cs.jhu.edu/~hajic.

You might also want to visit our Institute's pages at http://ufal.mff.cuni.cz

A little personal project related to the COVID-19 pandemic (in Czech only): http://ufallab.ms.mff.cuni.cz/~hajic/covid-stats-current.php