ELITR Minuting Corpus v1.0 consists of transcripts of meetings in Czech and English, their manually created summaries ("minutes") and manual alignments between the two.
More details about the corpus structure and creation are available in the paper referenced below.
The data is separated into the following directories:
Both directories are further split into train, dev, test and test2.
Each meeting has its own directory containing the following files:
the full manually revised transcript of the meeting
X - number of manual revisions, if 2 or more
YY - ID of annotator who did them
the original agenda or minutes, written by meeting organizer
zero or one file
the minutes files, i.e. summaries written by our annotators
YY - ID of annotator who wrote it
one or more files
the alignment between the transcript and minutes
zero or more files, at most one per minutes file
The files have the following formats:
Each line contains one utterance and has one of the following forms:
(SPEAKER) utterance text
... an utterance spoken by SPEAKER
... an utterance spoken by the same speaker as the immediately preceding utterance
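The transcript format described above can be read with a few lines of Python. This is a minimal sketch, not part of the corpus tooling: the function name `parse_transcript` is hypothetical, and treating a line without a leading "(SPEAKER)" as a continuation of the previous speaker is an assumption based on the description above.

```python
import re

# Assumed line shape: "(SPEAKER) utterance text".
SPEAKER_RE = re.compile(r"^\((?P<speaker>[^()\s]+)\)\s*(?P<text>.*)$")


def parse_transcript(lines):
    """Return (line_number, speaker, text) tuples, numbering lines from 1."""
    utterances = []
    current_speaker = None
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            # Blank lines are skipped but still counted in the numbering.
            continue
        match = SPEAKER_RE.match(line)
        if match:
            current_speaker = match.group("speaker")
            text = match.group("text")
        else:
            # No speaker prefix: assume the previous speaker continues.
            text = line
        utterances.append((number, current_speaker, text))
    return utterances
```

Line numbers are kept because the alignment files refer to transcript lines by number.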
Textual summary written by annotators or meeting participants.
The format is fairly free, but it always consists of bullet points rather than a coherent running text.
Space-separated data in three columns with the following meaning:
- transcript DA line number
- minutes line number to which it is aligned or "None" if unaligned
- ID of a "problem label" with this DA (see below) or "None"
DAs with neither alignment nor any problem label are not mentioned.
Indices start at 1.
The alignment maps each line of the transcript to the line of the minutes in which it is summarized, to a problem label (see below), to both, or to neither. A longer stretch of conversation is aligned as a whole to the single minutes line that summarizes it.
Alignments are only provided for a portion of the data.
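As an illustration of the three-column alignment format, the sketch below parses alignment lines into tuples. The function name `parse_alignment` is hypothetical, and the code assumes that every non-empty line has exactly three whitespace-separated fields, as described above.

```python
def parse_alignment(lines):
    """Return (da_line, minutes_line, problem_id) tuples.

    The literal string "None" in the second or third column becomes
    Python None. All line numbers are 1-based, as in the files.
    """
    rows = []
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {raw!r}")
        da_line = int(fields[0])
        minutes_line = None if fields[1] == "None" else int(fields[1])
        problem_id = None if fields[2] == "None" else int(fields[2])
        rows.append((da_line, minutes_line, problem_id))
    return rows
```

Transcript lines that do not appear in the result have neither an alignment nor a problem label.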
Some DAs have one of these problematic or interesting properties, signified by the following "problem labels" (the alignment file uses the numbers 1..5 to indicate the problem):
1 - Organizational
Organizational talk not directly related to the subject of the meeting
(e.g. discussing technical issues with the video call).
2 - Speech incomprehensible
It is not clear what the speaker is saying.
3 - Other issue
4 - Small talk
Small talk or conversation unrelated to the subject of the meeting
(e.g. discussing the weather).
5 - Censored
This part of the transcript had to be removed for privacy reasons.
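For quick statistics, the numeric problem IDs can be mapped back to the label names listed above. The following is a small sketch; `PROBLEM_LABELS` and `count_problems` are hypothetical names, and the input is assumed to be the values of the third alignment column with "None" already converted to Python None.

```python
from collections import Counter

# Label names copied from the list above.
PROBLEM_LABELS = {
    1: "Organizational",
    2: "Speech incomprehensible",
    3: "Other issue",
    4: "Small talk",
    5: "Censored",
}


def count_problems(problem_ids):
    """Count how often each problem label occurs, ignoring None."""
    counts = Counter(pid for pid in problem_ids if pid is not None)
    return {PROBLEM_LABELS[pid]: n for pid, n in sorted(counts.items())}
```

An ID outside 1..5 raises a KeyError, which makes malformed alignment files easy to spot.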
The data is deidentified. Speakers and other named entities are not identified by names, but rather by IDs in the format ENTITYNUMBER (e.g. PERSON1 or PROJECT3) or just ENTITY (e.g. PATH). Speaker IDs at the beginning of transcript lines are enclosed in round brackets, all other deidentified entities in square brackets.
The ID numbers are shuffled and unique for each meeting, i.e. PERSON1 denotes the same person across all the files of one meeting but a different person in the files of another meeting.
We use these entity types:
All other instances of square brackets are regular parts of the text, not our deidentified named entities.
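Deidentified entities inside utterance text can be located with a regular expression. This is only a sketch: the pattern assumes that an entity is an uppercase type name followed by an optional number, and `find_entities` with its `types` parameter is a hypothetical helper. Because other square brackets are regular text, passing the set of entity types actually used in the corpus is advisable; without it, any bracketed uppercase word matches.

```python
import re

# Assumed shape: [TYPE] or [TYPE<number>], e.g. [PATH] or [PERSON1].
ENTITY_RE = re.compile(r"\[([A-Z]+)(\d*)\]")


def find_entities(text, types=None):
    """Return (type, number) pairs; number is None for bare entities.

    If `types` is given, only entities whose type is in it are kept.
    """
    found = []
    for match in ENTITY_RE.finditer(text):
        entity_type, digits = match.group(1), match.group(2)
        if types is not None and entity_type not in types:
            continue
        found.append((entity_type, int(digits) if digits else None))
    return found
```

Since IDs are shuffled per meeting, pairs such as ("PERSON", 1) should only be compared within one meeting.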
The transcript data also contains the following tags:
<another_language/> or <another_language>...</another_language>
speech in a different language than the rest of the transcript
<typing/> sounds of typing
<parallel_talk>...</parallel_talk> or <parallel_talk/>
speakers talking over each other
section of the transcript has been censored for privacy or ethical reasons
<unintelligible/> speech is not comprehensible
<talking_to_self/> speaker talking to themselves
<other_noise/> other noise, not further specified
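When only the spoken words are needed, the tags listed above can be stripped before further processing. The sketch below is one possible approach, not the corpus authors' method: `strip_tags` is a hypothetical helper, the pattern assumes tag names consist of lowercase letters and underscores, and text enclosed in paired tags is kept.

```python
import re

# Matches <tag/>, <tag> and </tag> for lowercase tag names with underscores.
TAG_RE = re.compile(r"</?[a-z_]+\s*/?>")


def strip_tags(text):
    """Remove transcript tags and collapse the leftover whitespace."""
    without_tags = TAG_RE.sub(" ", text)
    return " ".join(without_tags.split())
```

To drop the enclosed text as well (for example speech in another language), the paired form would need a separate pattern.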
ELITR Minuting Corpus is available at:
If you use this corpus, please cite:
CITATION IN PLAIN TEXT
CITATION IN BIBTEX