CZECH ACADEMIC CORPUS 1.0 GUIDE

1. Preface

The guide to the Czech Academic Corpus version 1.0 is a roadmap to the CD-ROM. Although we can physically touch the silver disc, this manual provides a means of exploring the abstract content digitized underneath the plastic coating and describes the steps needed to view, experiment with, or modify that content.

On the CD-ROM you will find Czech texts consisting of nearly 600,000 words, along with tools for viewing and modifying them. Every word is assigned a part of speech and is classified into the categories of that part of speech. In other words, the content of the CD is a corpus with a manual annotation of morphology. As the title of the guide indicates, it is called the Czech Academic Corpus. The tools offered treat the texts from the standpoint of morphology.

Because the content of the CD is diverse, we expect users to search for information selectively, based on their particular needs. For example, the user-theoretician might be mainly interested in the corpus, whereas the user-practician will primarily be interested in the tools. That is why we recommend that all users take a guided tour of the guide itself. The guide is made up of three thematic parts covering the introduction of the material, the technical details of the data representation and the tools, and finally the installation of the CD itself.

The first part, Chapter 2, is a key part for all users. It presents the fundamental characteristics of the academic corpus, describing it as a project that has been in progress for the past twenty years, and documents the evolution of the corpus over that period. It also explains the motivation for the current edition and the seeming paradox of why this version is labeled version 1.0. The chapter is complemented with quantitative data about the corpus.

The second part, Chapter 3, is more technically oriented, as it concentrates on the structure of the CD itself. In this chapter, the corpus is described as a data file with its inner representation (Section 3.2), and attention is also paid to the tools (Section 3.3). This chapter is primarily aimed at the user-practician. Although the user-theoretician may benefit from some of the information presented, he or she may skip this part, with the exception of Section 3.3.1, which is devoted to searching through the corpus effectively.

The third part, from Chapter 5 to Chapter 8 and including four appendixes (Appendix A, Appendix B, Appendix C and Appendix D), is again intended for all users. Chapter 5 leads the user through the installation process of the CD. Chapter 6 and Chapter 7 provide information on the researchers who contributed to the Czech Academic Corpus version 1.0 and on the foundations and organizations that financially supported or still support this project. The bonus Chapter 4 is a teaching aid for practice in parsing. In the printed version, Appendix D is included in the guide in the form of a solid list to help users become more familiar with the morphological annotations. Appendix C describes the structure of lemmas, which further helps users become familiar with the annotations. Appendix A enumerates the sources that were used for the academic corpus. Appendix B lists Internet sources that complement the guide.

This edition of the Czech Academic Corpus would not have come to fruition without the results of the Prague Dependency Treebank project. We are grateful to all who contributed and hope the others will pardon us for naming just four (in alphabetical order): Jan Hajič, Eva Hajičová, Jarmila Panevová and Petr Sgall. There would be no Prague Dependency Treebank if it were not for them.

Both the Czech Academic Corpus and the Prague Dependency Treebank are annotated corpora of the Czech language. Czech computational linguistics is further strengthened by the publication of the first version of the Czech Academic Corpus. The CD is published within the framework of the project Resources and Tools for Information Systems, No. 1ET101120413, financed by the Grant Agency of the Academy of Sciences of the Czech Republic.