KUK 0.0
KUK 0.0 is a corpus of Czech legal and administrative texts built at the initial stage of the PONK project as a pilot release of training data for a web application to automatically assess the accessibility (comprehensibility or clarity) of Czech legal texts. KUK is designed to provide suitable data for training, development, and evaluation of the PONK application; the intended audience includes clerks and lawyers responsible for crafting authoritative decisions, findings, recommendations and public education materials related to subjects involving or relevant to individuals without legal expertise.
KUK 0.0 can be downloaded from the LINDAT repository, see Licence below.
Authors
Barbora Hladká (Charles University, Faculty of Mathematics and Physics),
Silvie Cinková (Charles University, Faculty of Mathematics and Physics),
Michal Kuk (Frank Bold Society),
Jiří Mírovský (Charles University, Faculty of Mathematics and Physics),
Tereza Novotná (Charles University, Faculty of Mathematics and Physics),
Kristýna Nguyen Zahálková (Frank Bold Society)
Resources
KUK 0.0 contains:
- data and metadata from/for three sources:
  - Public materials of Frank Bold Society (FrBo)
  - Statements of the Public Defender of Rights (ESO - Evidence stanovisek ombudsmana)
  - Information flyers by the Public Defender of Rights (OmbuFlyers)
- metadata for two external corpora:
  - The Czech Court Decisions Corpus 1.0 (CzCDC 1.0)
  - LiFR-Law
Size of the three internal subcorpora (FrBo + ESO + OmbuFlyers):
- documents: 6,051
- words: 19,196,037
Frank Bold texts
Frank Bold texts (FrBo) are divided into two categories, Articles (FrBo_articles) and Analyses (FrBo_analyses). Both types of texts are used in Frank Bold, a non-profit online legal advice service focusing on topics in the public interest: specifically, environmental protection, corruption or civic participation. The texts are addressed to the public in order to help them navigate legal situations and support them in self-help solutions.
The Articles are more comprehensive information materials, each covering a specific topic or life situation. They are the main type of content and are divided into topic categories. The texts are written as comprehensive help for legal laymen among active citizens; they are not meant to be educational but to help in a specific life situation. In many cases, the database contains two versions of the same article. Between 2019 and 2022, the articles underwent a comprehensive redesign that followed a uniform methodology and aimed to unify the texts and improve their clarity and readability. Where both the original text and the redesigned text are available, both are listed in the database.
The Analyses are detailed legal analyses prepared by the Frank Bold Legal Advisory Service in response to specific inquiries from the public. These analyses examine a client's particular life situation in a more comprehensive manner and provide information tailored to that situation. They are still documents intended to provide signposted information; they are not prepared submissions or a specific list of recommended steps. The analyses contained in the KUK 0.0 corpus were previously selected as typical and as covering recurring themes, and were therefore published in anonymized form with the consent of the clients.
ESO Texts
The office of the Czech Public Defender of Rights (henceforth Ombudsman) continuously releases its findings and reports in the ESO public database (ESO - Evidence stanovisek ombudsmana, n.d.). ESO contains the following basic document types:
- Quarterly activity reports for the Czech Parliament, divided into separate documents, each describing one agenda. Each document briefly summarizes important or interesting individual cases, gives overall statistics, and appends a separate list of unsuccessful cases to request the lawmakers' attention.
- Ombudsman's recommendations, which cover broader topics synthesized from recurring cases that follow a pattern. These recommendations address the national authorities to inform them about a problem and to provide a guideline for improvement. Very often they refer to citizens with disabilities or to discrimination cases and are backed by the Ombudsman's own surveys.
- Comments to law bills
- Statements for international authorities (e.g. about the status of disabled people for the United Nations)
- Statements for the Constitutional Court - typically suggestions to cancel specific local or ministerial regulations.
- Surveys - the ombudsman conducts surveys on different societal issues, for instance discrimination against ethnic minorities in the housing market, or extensive testing of standard routines at various authorities. The results of these surveys often back the ombudsman's recommendations.
- Reports from detention facilities - the ombudsman regularly examines the status of inmates' rights by on-site visits without prior notice and issues records.
The Ombudsman's office has developed its own taxonomy of statements with their specific guidelines regarding structure and content. Their typology will be explained in more detail in Section Metadata Structure.
OmbuFlyers
The website of the Ombudsman's Office publishes PDF flyers with advice and instructions for people in diverse challenging situations (https://www.ochrance.cz/letaky/, https://www.ochrance.cz/situace/). The staff kindly shared with us their current as well as outdated versions.
The Czech Court Decisions Corpus 1.0 (external resource)
The Czech Court Decisions Corpus 1.0 (CzCDC 1.0) is an external part of KUK 0.0, meaning the data has been published separately and KUK 0.0 only provides meta-data for the corpus.
The Czech Court Decisions Corpus 1.0 is a corpus containing decisions of the three highest Czech courts: the Supreme Court, the Supreme Administrative Court and the Constitutional Court (Harašta et al., 2020; Novotná and Harašta, 2019). The Supreme Court is the court of last instance for civil and criminal law disputes. The Supreme Administrative Court rules in the last instance on administrative law disputes. The Constitutional Court is a special instance that decides whether a constitutional right has been violated. The texts of the decisions are accompanied by basic identifying information extracted from the texts, i.e. the file number, the date of publication of the decision and the deciding court. The decisions were either collected from the website of the respective court or obtained via a request for public information. They represent all decisions of these courts published from 1993 (from 2003 in the case of the Supreme Administrative Court) until September 2018. All the decisions were anonymized by the employees of the respective courts.
The Czech Court Citations Dataset is a dataset containing citations extracted from the Czech Court Decisions Corpus 1.0. It includes citations of decisions of the three courts themselves as well as citations of other courts or institutions. In KUK 0.0, only reciprocal citations between the three courts are used.
LiFR-Law (external resource)
LiFR-Law (Corpus of Paraphrased Czech Administrative Texts with Reading Comprehension for Readability Studies, Cinková et al., 2023) is an external part of KUK 0.0, meaning the data has been published separately and KUK 0.0 only provides meta-data for the corpus.
LiFR-Law is a corpus of eighteen Czech legal and administrative texts with measured reading comprehension and a subjective expert annotation of diverse textual properties. These texts are six triples, each composed of an original document and two paraphrases with identical information content. The original documents are authentic administrative documents concerning common civil agendas, whereas the paraphrases were written by lawyers committed to plain legal writing.
Data
Documents can be found in directory data/ and are organized in subdirectories according to their origin and format:
- FrBo/ - Frank Bold texts, further divided into the following subdirectories:
  - articles/
    - DOC/ - documents in their original format (DOC or DOCX)
    - TXT/ - documents from directory DOC/ transformed to TXT format
  - analyses/
    - PDF/ - documents in their original format (PDF)
    - TXT/ - documents from directory PDF/ transformed to TXT format
- ESO/ - Statements of the Public Defender of Rights
  - HTML/ - documents in their original format (HTML)
  - TXT/ - documents from directory HTML/ transformed to TXT format
- OmbuFlyers/ - Flyers from the Ombudsman’s Office’s web pages
  - originals/ - outdated versions of the flyers
    - PDF_DOC/ - documents in their original format (PDF or DOCX)
    - TXT/ - documents from directory PDF_DOC/ transformed to TXT format
  - redesigns/ - current versions of the flyers
    - DOC_PDF/ - documents in their original format (DOCX or PDF)
    - TXT/ - documents from directory DOC_PDF/ transformed to TXT format
Metadata structure
Each document (physically a file) is represented separately. Its metadata is stored in several tables (typically four), which can be joined on the unique document ID (KUK_ID). So far, it has made sense to create the quadruple of tables for each source individually, but the column names and their order for a given table type are identical across the sources, so tables of the same type are easy to merge by rows. The four table types are:
- DocumentIdentificationGenreProperties
- DocumentFileFormat
- DocumentVersion
- ContentLinks
The tables are stored as TSV (tab-separated values) files. The file names encode the identification of the source and the table type, e.g. FrankBold_DocumentVersion.tsv. The following sections describe in detail the structure of each table type. Whenever relevant, a source has a fifth table that captures source-specific information that did not fit into the common scheme. These tables are also TSV files named after their source and their type (e.g. ESOSpecificColumns.tsv).
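As an illustration of joining two table types on KUK_ID, here is a minimal sketch that uses only the Python standard library. The in-memory sample rows are invented for the example, and the column header spelling FolderPath is an assumption. With the real files one would pass an opened TSV file instead of the StringIO objects.

```python
import csv
import io
from collections import defaultdict


def read_tsv(handle):
    """Read a TSV file-like object into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle, delimiter="\t"))


def join_on_kuk_id(left_rows, right_rows, key="KUK_ID"):
    """Inner join: one output row per matching (left, right) pair."""
    by_id = defaultdict(list)
    for row in right_rows:
        by_id[row[key]].append(row)
    joined = []
    for row in left_rows:
        for match in by_id.get(row[key], []):
            joined.append({**row, **match})
    return joined


# Invented stand-ins for two tables of one hypothetical source.
genre_tsv = "KUK_ID\tRecipientType\nsomesource_123\tnatural person\n"
format_tsv = (
    "KUK_ID\tFileFormat\tFileName\tFolderPath\n"
    "somesource_123\tHTML\tmyfile\tdata/SomeSource/HTML/\n"
    "somesource_123\tTXT\tmyfile\tdata/SomeSource/TXT/\n"
)

genre = read_tsv(io.StringIO(genre_tsv))
formats = read_tsv(io.StringIO(format_tsv))
joined = join_on_kuk_id(genre, formats)
```

Because a document can occupy several rows in some tables, the join yields one row per combination. Merging tables of the same type across sources by rows amounts to concatenating the lists returned by read_tsv.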
DocumentIdentificationGenreProperties
KUK_ID
The KUK_ID is the unique identifier of each document in this corpus. So far, the documents have come in bulk – from regular corpora (CzCDC), public databases (ESO) or web archives (materials from Frank Bold). Since each source has its persistent agenda and produces slightly different documents than the others, we encode the source in the KUK_ID to keep this information readily available in any pipeline. The rest of the ID is not meant to encode any additional information. It is either a random string of alphanumeric characters, or it is derived from the document's SourceID. The KUK_IDs do not necessarily have the same length, even within one source.
SourceDB
The source of the documents. We provide a link to it whenever it is publicly available online.
SourceID
Whenever the document was accompanied by metadata containing its ID in the source database, we preserved this original ID in this column.
DocumentTitle
The document title is source-specific. It can be the case index under which it was filed, or a regular title, or a combination of those. In ESO, we generated the title from the original metadata, merging some descriptive columns with the source-specific categorization, such as the law domain, matter/topic, form of the finding, and keywords.
- Some examples from CzCDC: Na 52/2009-13, Vol 12/2010
- Some examples from ESO (obtained by merging some fields of the original metadata and divided by semicolons): poplatek za komunální odpad;Spolupráce se státními orgány, NGO, soukromým sektorem, zahraničními subjekty;Místní poplatky a řízení o nich;starobní důchod;Odložení;Důchody;příspěvek na bydlení;Zpráva o šetření - § 18;Dávky státní sociální podpory
- Some examples from FrBo: Obrana proti vysušování toků;Životní prostředí;Kterých řízení se může váš spolek účastnit?;Životní prostředí, Spolky a účast veřejnosti, Správní řízení;Účast veřejnosti, Spolky, Účastníci správního řízení
ClarityPursuit
Some documents have been written with an explicit ambition for clarity, by a source that has been consistently applying a clarity management policy. These documents are marked with TRUE; others with FALSE. Note that no additional readability/clarity checks on the individual documents have been applied, and therefore this is at most a preliminary assessment.
Anonymized
To avoid ethical concerns, this column states whether or not the given document has been anonymized and where the anonymization happened. For some documents, anonymization is irrelevant (e.g., flyers or guidelines). Whenever it was relevant, the documents were anonymized either by their sources (typically CzCDC, ESO) or by us. The accepted values are "Anonymized by source", "On-site anonymization", "No".
RecipientType
The accepted values are "natural person", "legal person", and "combined". Value "legal person" means that the intended recipient regularly uses or even provides legal services; it is typically an institution. Value "natural person" denotes a legal layman who does not hire legal services on a regular basis. Value "combined" means that the document has both kinds of recipients (e.g. a court verdict addresses the parties, with their lawyers, as well as other institutions in the justice system or administration, such as lower court instances). This criterion roughly stratifies the documents according to the expected amount of legal terminology and required familiarity with legal procedures. Note that the documents were not classified individually but in bulk, based on their genre and source, so the actual amount of legal jargon may differ from what is expected.
RecipientIndividuation
This criterion roughly stratifies the documents according to the expected familiarity with the matter beyond the legal context. The accepted values are "individual", "bulk", and "public". Public documents are expected to contain more explanation than the individual- and bulk-addressed documents. The explanation can either concern the key concepts, events, and procedures within a specific domain (e.g., agriculture, elections, or tax record submission), or recount the events whose legal interpretation led to the presented legal conclusion in more detail than would be necessary for their direct participants. Value "bulk" denotes a distinguished group of recipients (e.g. mining experts and geologists as addressees of a relevant legally binding technical guideline, or a group of creditors in a debt lawsuit). Note that this value results from a bulk-wise assessment and may not be correct in individual cases.
AuthorType
The accepted values are "authority" or "individual".
Objectivity
The accepted values are "quasiobjective" and "persuasive". Most genres are "quasiobjective": guidelines, information flyers, court verdicts. Persuasive documents are typically recommendations or appeals.
LegalActType
The accepted values are "individual" and "normative". Normative documents are general laws and guidelines, whereas individual legal acts are concrete cases where normative documents are applied.
Bindingness
This criterion describes whether or not the document is legally binding. The accepted values are TRUE and FALSE.
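The accepted values listed for the columns above can be collected into a simple lookup and used as a sanity check after loading a DocumentIdentificationGenreProperties table. This is a sketch, not part of the corpus tooling: the sample rows are invented, and it assumes that every cell carries one of the listed values (real tables might also contain NA or similar placeholders, which this check would report).

```python
# Accepted values per column, as listed in the column descriptions.
ACCEPTED_VALUES = {
    "ClarityPursuit": {"TRUE", "FALSE"},
    "Anonymized": {"Anonymized by source", "On-site anonymization", "No"},
    "RecipientType": {"natural person", "legal person", "combined"},
    "RecipientIndividuation": {"individual", "bulk", "public"},
    "AuthorType": {"authority", "individual"},
    "Objectivity": {"quasiobjective", "persuasive"},
    "LegalActType": {"individual", "normative"},
    "Bindingness": {"TRUE", "FALSE"},
}


def find_violations(rows):
    """Return (KUK_ID, column, value) for every cell outside the accepted set."""
    violations = []
    for row in rows:
        for column, accepted in ACCEPTED_VALUES.items():
            value = row.get(column)
            if value not in accepted:
                violations.append((row.get("KUK_ID"), column, value))
    return violations


# Invented example rows: the second one carries an unexpected value.
valid_row = {
    "KUK_ID": "somesource_123",
    "ClarityPursuit": "TRUE",
    "Anonymized": "No",
    "RecipientType": "natural person",
    "RecipientIndividuation": "public",
    "AuthorType": "authority",
    "Objectivity": "quasiobjective",
    "LegalActType": "normative",
    "Bindingness": "FALSE",
}
invalid_row = dict(valid_row, KUK_ID="somesource_124", Objectivity="objective")

violations = find_violations([valid_row, invalid_row])
```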
DocumentFileFormat
One document can occur in several formats. All these formats share the same KUK_ID. For instance, when documents from the source SomeSource come in two different formats, HTML and TXT, then a document saved in the files myfile.html and myfile.txt will have a single KUK_ID, say "somesource_123". In this table, the ID would occur on two rows. In the first row, the FileFormat value would be "HTML" and the FolderPath value would be "data/SomeSource/HTML/". In the second row, this KUK_ID and FileName combination would have "TXT" as the FileFormat value and "data/SomeSource/TXT/" as the Folder Path.
FileFormat
The file format is the file suffix written in upper case, without the leading period. A document takes as many rows in this table as there are formats it occurs in, each with the appropriate folder path.
FileName
The file name is the file name without the format suffix, without the period introducing the format suffix, and without the file path.
Folder Path
The folder path ends with a slash.
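The three columns of this table can be recombined into a file path. The following is a minimal sketch under two assumptions: the column header is spelled FolderPath (the text uses both spellings), and the suffix on disk is the lower-cased FileFormat, as in the myfile.html example. Both should be checked against the actual files.

```python
def document_path(row, lowercase_suffix=True):
    """Rebuild a file path from one DocumentFileFormat row.

    FolderPath already ends with a slash, and FileName carries neither
    the path nor the suffix, so plain concatenation is enough.
    """
    suffix = row["FileFormat"]
    if lowercase_suffix:
        # Assumption: files on disk use a lower-case suffix.
        suffix = suffix.lower()
    return f'{row["FolderPath"]}{row["FileName"]}.{suffix}'


# Invented example row, mirroring the SomeSource example.
example_row = {
    "KUK_ID": "somesource_123",
    "FileFormat": "HTML",
    "FileName": "myfile",
    "FolderPath": "data/SomeSource/HTML/",
}
example_path = document_path(example_row)
```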
DocumentVersion
Version
Some sources provide documents in several versions. Each version is a different document with a different KUK_ID (unlike different formats of the same document). The accepted values are "Original", "Translation", "Partial Redesign", and "Redesign". This classification is based on the most typical case, namely milestones in the editing history, as they were preserved in a source database. "Original" is simply the first version. When a later version is a "Translation", it underwent minor stylistic changes but the content remained the same. "Redesign" is a completely new document that replaces the earlier ones; its content might have changed, that is, information may have been updated, substantially extended, or substantially cut. "Partial Redesign" is a milder redesign (as captured by the metadata delivered by each source).
CreationDate
The creation date is formatted yyyy-mm-dd. When the exact day or month is not known, we set it to 01. Note that in most cases the date is the date of the last change of the document and may differ from the date stated inside the document.
SourceOriginalID
When the given document is an "Original", it has no source original and the value is NA. When not, this cell contains the KUK_ID of the related original (that is, the very first version of the document, even if there are more than two versions).
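Since every non-original version points to the very first version, all versions of a document can be grouped with a single pass over the DocumentVersion table. The sketch below assumes that missing originals are stored as the literal string "NA" and that every CreationDate is a full yyyy-mm-dd date. The sample rows and IDs are invented.

```python
from collections import defaultdict
from datetime import date


def version_families(rows):
    """Group DocumentVersion rows by the KUK_ID of their original.

    An "Original" row (SourceOriginalID == "NA") is the root of its own
    family. Each family is sorted by CreationDate, oldest first.
    """
    families = defaultdict(list)
    for row in rows:
        original = row["SourceOriginalID"]
        root = row["KUK_ID"] if original == "NA" else original
        families[root].append(row)
    for members in families.values():
        members.sort(key=lambda r: date.fromisoformat(r["CreationDate"]))
    return dict(families)


# Invented example rows.
version_rows = [
    {"KUK_ID": "somesource_b", "Version": "Redesign",
     "CreationDate": "2021-06-01", "SourceOriginalID": "somesource_a"},
    {"KUK_ID": "somesource_a", "Version": "Original",
     "CreationDate": "2015-01-01", "SourceOriginalID": "NA"},
    {"KUK_ID": "somesource_c", "Version": "Original",
     "CreationDate": "2018-03-01", "SourceOriginalID": "NA"},
]
families = version_families(version_rows)
```

Keep in mind that a day or month set to 01 may stand for an unknown value, so the ordering within a family is only as precise as the dates themselves.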
ContentLinks
RefersTo
This column lists KUK_IDs of documents that a given document refers to (quotes etc.). Unlike SourceOriginalID, it does not capture earlier versions of the given document. Each row contains exactly one document in RefersTo, so each KUK_ID occurs on as many rows as there are other documents that the given document refers to.
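The one-link-per-row layout maps naturally onto an adjacency structure. The following sketch collects outgoing references per document and counts incoming ones. It assumes the columns are named KUK_ID and RefersTo and that an empty link is stored as the literal string "NA"; the sample rows are invented.

```python
from collections import Counter, defaultdict


def build_reference_map(rows):
    """Map each KUK_ID to the set of KUK_IDs it refers to."""
    refers_to = defaultdict(set)
    for row in rows:
        target = row["RefersTo"]
        if target and target != "NA":
            refers_to[row["KUK_ID"]].add(target)
    return dict(refers_to)


def count_incoming(refers_to):
    """Count how many documents refer to each target document."""
    return Counter(
        target for targets in refers_to.values() for target in targets
    )


# Invented example rows: one row per (document, referenced document) pair.
link_rows = [
    {"KUK_ID": "somesource_1", "RefersTo": "somesource_2"},
    {"KUK_ID": "somesource_1", "RefersTo": "somesource_3"},
    {"KUK_ID": "somesource_4", "RefersTo": "somesource_2"},
    {"KUK_ID": "somesource_5", "RefersTo": "NA"},
]
reference_map = build_reference_map(link_rows)
incoming = count_incoming(reference_map)
```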
<Source Name>SpecificColumns
Documents from some sources were accompanied by rich metadata. We did our best to merge the metadata information across the sources, but some of it was too source-specific. Tables whose names are composed of the name of the source corpus and the suffix SpecificColumns (e.g. ESOSpecificColumns) capture this source-specific metadata for each source individually.
Notes about sources
FrBo
Frank Bold's documents come in two folders: analyses and articles. The KUK_ID distinguishes them with the prefixes Fana_ and Fart_. These are followed by one of these three possible strings: "orig_", "partred_", or "red_" (which served to internally mark their editing status at FrBo) and a sequence of alphanumeric characters.
Frank Bold has two SpecificColumns tables, one for analyses and one for articles. They contain a classification of the documents according to a Frank Bold internal taxonomy.
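The FrBo KUK_ID layout described above can be unpacked with a regular expression. This is a sketch that assumes the IDs consist of exactly the three parts named in the description, separated as shown. The example IDs are invented, and the status marker is returned as-is rather than mapped to a Version value.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: <Fana|Fart>_<orig|partred|red>_<alphanumeric sequence>
FRBO_ID = re.compile(r"^(Fana|Fart)_(orig|partred|red)_([A-Za-z0-9]+)$")

SUBCORPUS = {"Fana": "analyses", "Fart": "articles"}


class FrBoId(NamedTuple):
    subcorpus: str   # "analyses" or "articles"
    status: str      # internal editing status marker at FrBo
    rest: str        # remaining alphanumeric sequence


def parse_frbo_id(kuk_id: str) -> Optional[FrBoId]:
    """Split a FrBo KUK_ID into its parts, or return None if it does not match."""
    match = FRBO_ID.match(kuk_id)
    if match is None:
        return None
    prefix, status, rest = match.groups()
    return FrBoId(SUBCORPUS[prefix], status, rest)


# Invented example IDs.
parsed = parse_frbo_id("Fart_partred_a1b2")
not_frbo = parse_frbo_id("somesource_123")
```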
ESO
The Ombudsman's database ESO also comes with a Specific-Columns table. It contains the following columns (all in Czech):
- Spisová značka (file index)
- Oblast práva (legal domain, according to Ombudsman's internal classification)
- Věc (matter/case/topic)
- Forma zjištění (genre/template/document type)
- Právní věty (legal summary) – this is often a longer running text. In the original database it is divided into paragraphs. We have replaced the paragraph breaks with two spaces to prevent import and export issues with the plain-text tabular format.
- Heslář (key words) – internal at the Ombudsman's office
- Agenda – internal codes at the Ombudsman's office
The original metadata is thoroughly documented at https://eso.ochrance.cz/Napoveda.
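Because paragraph breaks in the Právní věty column were replaced with two spaces, the paragraphs can be approximately recovered by splitting on that separator. This is a sketch only: any double space that already occurred inside a paragraph would also be treated as a break. The sample text is invented.

```python
def split_legal_summary(cell):
    """Split a "Právní věty" cell back into paragraphs.

    Paragraph breaks were stored as two spaces, so splitting on a double
    space and dropping empty pieces restores the paragraph list.
    """
    return [part.strip() for part in cell.split("  ") if part.strip()]


# Invented example cell with two paragraphs.
summary_cell = "First paragraph of the summary.  Second paragraph."
paragraphs = split_legal_summary(summary_cell)
```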
OmbuFlyers
One original typically has one redesigned version, but a few were elaborated into a series of self-sufficient flyers. When this was the case, we created links between the re-designed documents to make it clear that they belong together (column RefersTo). At the same time, all re-designed documents are linked to the original (column SourceOriginalID). Whenever only one re-designed document refers to the original, the RefersTo column has the NA value and only the SourceOriginalID is filled out.
Citation
Please cite the data when using the corpus for your research:
Barbora Hladká, Silvie Cinková, Michal Kuk, Jiří Mírovský, Tereza Novotná and Kristýna Nguyen Zahálková: KUK 0.0. Data/software, ÚFAL MFF UK, Prague, Czech Republic, 2023, LINDAT http://hdl.handle.net/11234/1-5363.
Licence
The corpus KUK 0.0 is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) licence.
Acknowledgement
The work on the corpus was financed by the TAČR SIGMA project TQ01000526: PONK - Asistent přístupné úřední komunikace.
References
- Cinková, Silvie; et al., 2023. LiFR-Law. Corpus of Paraphrased Czech Administrative Texts with Reading Comprehension for Readability Studies (2023-10-08), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-5225
- Harašta, Jakub, Tereza Novotná and Jaromír Šavelka, 2020. Citation Data of Czech Apex Courts. arXiv:2002.02224 [cs] [online]. [accessed 2021-09-15]. Available at: http://arxiv.org/abs/2002.02224
- Novotná, Tereza and Jakub Harašta, 2019. The Czech Court Decisions Corpus (CzCDC): Availability as the First Step. arXiv:1910.09513 [cs] [online]. [accessed 2020-01-20]. Available at: http://arxiv.org/abs/1910.09513