ELITR Minuting Corpus v1.0 consists of transcripts of meetings in Czech and English, their manually created summaries ("minutes") and manual alignments between the two.
More details about the corpus structure and creation are available in the paper referenced below.
The data is separated into the following directories:
Both directories are further split into train, dev, test and test2.
Each meeting has its directory containing the following files:
transcript_MANX_annotYY.txt
the full manually revised transcript of the meeting
X - number of manual revisions, if 2 or more
YY - ID of annotator who did them
one file
minutes_ORIG.txt
the original agenda or minutes, written by meeting organizer
zero or one file
minutes_GENER_annotYY.txt
the minutes files, i.e. summaries written by our annotators
YY - ID of annotator who wrote it
one or more files
alignment+<transcript_filename>+<minutes_filename>
the alignment between the transcript and minutes
zero or more files, at most one per each minutes file
The files have the following formats:
transcript:
Each line contains one utterance and has one of the
following formats:
(SPEAKER) utterance text
... an utterance spoken by SPEAKER
utterance text
... an utterance spoken by the same speaker as the immediately
preceding one
minutes:
Textual summary written by annotators or meeting participants.
The format is somewhat free form but is always in the form of bullet points rather than a coherent text summary.
alignment:
Space separated data in three columns with the following meaning:
- transcript DA line number
- minutes line number to which it is aligned or "None" if unaligned
- ID of a "problem label" with this DA (see below) or "None"
DAs with neither alignment nor any problem label are not mentioned.
Indices start at 1.
The alignment maps each line of the transcript to either the line of the minutes in which it is summarized, a problem label (see below), both or neither. The alignments are done in such a way that the whole longer piece of conversation is aligned to the same minutes line which summarizes it.
Alignments are only provided for a portion of the data.
Some DAs have one of these problematic or interesting properties, signified by the following "problem labels" (the alignment file uses the number 1..5 to indicate the problem):
1 - Organizational
Organizational talk not directly related to the subject of the meeting
(e.g. discussing technical issues with the video call).
2 - Speech incomprehensible
It is not clear what the speaker is saying.
3 - Other issue
4 - Small talk
Small talk or conversation unrelated to the subject of the meeting
(e.g. discussing the weather).
5 - Censored
This part of the transcript had to be removed for privacy reasons.
The data is deidentified. Speakers and other named entities are not identified by names, but rather by IDs in the format ENTITYNUMBER (e.g. PERSON1 or PROJECT3) or just ENTITY (e.g. PATH). Speaker IDs at the beginning of transcript lines are enclosed in round brackets, all other deidentified entities in square brackets.
The ID numbers are shuffled and unique for each meeting, i.e. PERSON1 denotes the same person across all the files of one meeting but a different person in the files of another meeting.
We use these entity types:
PERSON
ORGANIZATION
PROJECT
LOCATION
ANNOTATOR
URL
NUMBER
PASSWORD
PHONE
PATH
EMAIL
OTHER
All other instances of square brackets are regular parts of the text, not our deidentified named entities.
The transcript data also contains the following tags:
<another_language/> or <another_language>...</another_language>
speech in a different language than the rest of the transcript
<typing/>sounds of typing
<parallel_talk>...</parallel_talk> or <parallel_talk/>
speakers talking over each other
<cough/> coughing
<other_yawn/> yawning
<censored/>
section of the transcript has been censored for privacy or ethical reasons
<laugh/> laughter
<unintelligible/> speech is not comprehensible
<other_sigh/> sighing
<talking_to_self/> speaker talking to themselves
<other_noise/> another further unspecified noise
ELITR Minuting Corpus is available at:
http://hdl.handle.net/11234/1-4692
If you use this corpus, please cite:
Nedoluzhko, A., Singh, M., Hledíková, M., Ghosal, T., and Bojar, O. (2022). ELITR Minuting Corpus: A novel dataset for automatic minuting from multi-party meetings in English and Czech. In Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC-2022), Marseille, France, June. European Language Resources Association (ELRA). In print.
@inproceedings{elitr-minuting-corpus:2022, author = {Anna Nedoluzhko and Muskaan Singh and Marie Hled{\'{\i}}kov{\'{a}} and Tirthankar Ghosal and Ond{\v{r}}ej Bojar}, title = {{ELITR} {M}inuting {C}orpus: {A} Novel Dataset for Automatic Minuting from Multi-Party Meetings in {E}nglish and {C}zech}, booktitle = {Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC-2022)}, year = 2022, month = {June}, address = {Marseille, France}, publisher = {European Language Resources Association (ELRA)}, note = {In print.} }