The Phonetic Corpus of Estonian Spontaneous Speech consists of recordings annotated on several linguistic tiers, including word and segment boundaries in the speech signal. The corpus mainly contains dialogues. It can be used for studying a wide range of phonetic and linguistic research questions and for training language technology applications (e.g. speech recognition, dialogue systems). In addition to the detailed phonetic segmentation, the corpus has word-level annotation in standard orthography, so it can be used with most NLP tools built for written language.

The corpus includes:

  • Studio quality sound recordings, separate channels for each speaker
  • Spontaneous conversations between 2–3 speakers, approximately 30 minutes per recording
  • Manual transcription of words and phonemes
  • 205 individual speakers in the age range of 20–85 years
  • A total of 134 hours of speech recordings
  • Word- and phoneme-level annotation for 106 hours / 914 thousand word-level intervals

Recordings

Most of the recordings were made in a sound-proof booth at the phonetics lab, or in a quiet room in the case of the fieldwork recordings in the sub-corpus SKK2. The recordings were made with high-quality equipment: each speaker wore a head-set microphone, and each signal was recorded on a separate channel. The recordings are saved in 16-bit 44.1 kHz PCM WAV format.
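
As a quick check, the recording format can be verified programmatically. The following sketch uses only Python's standard-library wave module; the script and the example file name are illustrative and not part of the corpus tools.

    # Minimal sketch: check that a corpus WAV file matches the documented format
    # (16-bit, 44.1 kHz PCM). The file name is an illustrative example following
    # the naming convention described under Annotation.
    import wave

    def check_wav_format(path: str) -> None:
        with wave.open(path, "rb") as wav:
            assert wav.getsampwidth() == 2, "expected 16-bit samples"
            assert wav.getframerate() == 44100, "expected 44.1 kHz sampling rate"
            duration = wav.getnframes() / wav.getframerate()
            print(path, wav.getnchannels(), "channel(s),", round(duration, 2), "seconds")

    check_wav_format("SKK001-003_M.wav")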

The corpus consists of four sub-corpora:

  • SKK0 – Dialogues recorded in the sound-proof booth at the phonetics lab
  • SKK1 – Monologues recorded in a lecture hall
  • SKK2 – Dialogues recorded in quiet settings during fieldwork
  • SKK3 – Conversations between three participants recorded in the Stockholm University breathing lab

The sub-corpus SKK3 contains conversations between three participants, recorded by Kätlin Aare in the Stockholm University breathing lab. The recordings include sound, video and respiratory data. More information about the methods is available here.

Since 2018, all dialogues have also been recorded on video using GoPro cameras, with a separate camera for each speaker. Face, hand and pose positions have been detected in each video file using the OpenPose software. Here is an example of a video together with an animation of the OpenPose data:

Participants

All speakers have given their informed consent for participating in the corpus recordings. The purpose of the corpus was explained to them, and they were instructed to speak freely about any preferred topic for half an hour. In the case of the monologues, the recordings were made at a conference or in a lecture.

Currently there are 205 speakers in the corpus. The distribution of the speakers’ age, gender and regional background is illustrated in the following figures. Most of the speakers have received higher education; many of them are students or faculty members.

Figure 1: Regional background of the participants.

Figure 2: The speakers’ gender and year of birth.

Figure 3: The distribution of age and gender in the dialogue pairs.

Annotation

The detailed annotation principles can be found in the corpus annotation guide (in Estonian).

The corpus is annotated using Praat software. Segmentation and annotations are available in TextGrid files.

The files are named as follows: the file name begins with the sub-corpus ID (4 characters: SKK + number), followed by the conversation ID (2 digits), followed by a hyphen and the speaker’s ID. The speaker ID consists of three digits followed by an underscore and M/N indicating the speaker’s gender (M for male, N for female). For example, the file name “SKK001-003_M” indicates that the recording belongs to the SKK0 sub-corpus of studio dialogues, that this is recording 01, and that the speaker is male speaker 003. The file with the recording of the other participant in the same conversation would be named “SKK001-005_N”, where the first part of the file name is the same, but the speaker ID indicates that the speaker is female speaker 005. All files related to the same recording carry the same name and differ only by the file extension (wav, mp4, TextGrid).
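
For scripted processing, the naming convention can be parsed mechanically. The following sketch illustrates the scheme described above; the regular expression and the field names (subcorpus, conversation, speaker, gender) are our own labels and not part of the corpus.

    # Minimal sketch: split a corpus file name stem into its components,
    # following the naming convention described above.
    import re

    FILENAME_PATTERN = re.compile(
        r"^(?P<subcorpus>SKK\d)"    # sub-corpus ID, e.g. SKK0
        r"(?P<conversation>\d{2})"  # conversation ID, two digits
        r"-(?P<speaker>\d{3})"      # speaker ID, three digits
        r"_(?P<gender>[MN])$"       # M = male, N = female
    )

    def parse_corpus_filename(stem: str) -> dict:
        match = FILENAME_PATTERN.match(stem)
        if match is None:
            raise ValueError(f"not a corpus file name: {stem!r}")
        return match.groupdict()

    print(parse_corpus_filename("SKK001-003_M"))
    # {'subcorpus': 'SKK0', 'conversation': '01', 'speaker': '003', 'gender': 'M'}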

Figure 4: Example of segmentation.

The following annotations are provided in separate tiers:

  1. sõnad – The word tier marks word boundaries and contains the words, written in standard orthography. Pauses and fillers are also labelled: silent pauses are marked with #, and filled pauses are marked with a dot, e.g. “.ee”, “.sisse” (‘.inhale’). Additional information about a word is appended to the orthographic form after a slash and starts with a dot, e.g. “midagi/.naerdes” (‘something/.laughter’); see the sketch after this list for how these conventions can be handled programmatically.
  2. häälikud – The phoneme tier marks the sound segmentation. The phonemes are labelled using the SAMPA transcription.
  3. CV – Phoneme classes: C = consonant, V = vowel. This tier is derived automatically from the phoneme tier using a Praat script.
  4. silbid – Syllable boundaries and types: LL – short open, PL – long open, PK – long closed. The number before these letters indicates the position of the syllable, counting from the beginning of the word stem.
  5. taktid – Metric feet. In Estonian, feet are left-headed, i.e. a foot consists of a stressed syllable followed by one or two unstressed syllables. These annotations indicate primary and secondary stress as well as the three quantity degrees.
  6. morf – Morphological annotations are generated automatically using the Filosoft Vabamorf analyser; see the documentation on the Filosoft page.
  7. häälelaad – The voice quality tier currently only marks creaky voice. In the future we plan to also annotate other non-modal voice qualities (e.g. whisper, breathy voice, falsetto).
  8. IP-piirid – Speech is divided into intonation phrases (ip), feedback (ts) and hesitations (he).
  9. lausungid – Interpausal Units (IPU) have been detected automatically using a Praat script.
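
The tiers can be read with any tool that handles Praat TextGrid files. The sketch below uses the third-party Python package textgrid (an assumption; it is not part of the corpus tools) to read the sõnad tier and apply the labelling conventions described above, skipping silent and filled pauses and dropping the additional information after the slash.

    # Minimal sketch: read the word tier ("sõnad") from a corpus TextGrid using
    # the third-party `textgrid` package (pip install textgrid).
    import textgrid

    tg = textgrid.TextGrid.fromFile("SKK001-003_M.TextGrid")
    word_tier = tg.getFirst("sõnad")

    for interval in word_tier:
        label = interval.mark
        if not label or label == "#":   # skip empty intervals and silent pauses
            continue
        if label.startswith("."):       # skip filled pauses such as ".ee", ".sisse"
            continue
        word = label.split("/")[0]      # drop additional info, e.g. "midagi/.naerdes"
        print(f"{interval.minTime:.3f}\t{interval.maxTime:.3f}\t{word}")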

The workflow is as follows:

  • In the first stage, word-level annotation is created with automatic speech recognition. This annotation is then manually corrected, and manual phoneme-level segmentation is added. For the monologues, a forced aligner has also been used to create the phoneme segmentation, which is then manually corrected. Currently, the total duration of files that have been hand-labelled at the word and phoneme level is 106:19:54.

  • After a file has been manually annotated at the word and phoneme level, the following tiers are added automatically using rule-based Praat scripts: C/V, syllables and interpausal units (see the sketch after this list); morphological tagging uses Filosoft’s morphological analyser.

  • Part of the corpus has been manually annotated at foot level. This annotation includes information about lexical stress and quantity. Currently, the total duration of files with foot level annotation is 24:31:06.

  • The voice quality tier currently includes labels for creaky voice. This is initially detected using automatic voice quality detection, which is then manually corrected. Currently, the total duration of files with manually corrected creaky voice labels is 69:46:22.

  • Part of the corpus has been manually annotated for intonation phrases. Currently, the total duration of files with intonation phrase annotation is 17:05:52.
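
The interpausal-unit tier itself is produced with a Praat script, as noted above. Purely as an illustration of the idea, the sketch below derives IPUs from the word tier by treating silent pauses (#) longer than a threshold as unit boundaries; the 200 ms threshold and the textgrid package are assumptions, not the settings used for the corpus.

    # Illustrative sketch: derive interpausal units (IPUs) from the word tier as
    # stretches of speech separated by silent pauses ("#") longer than a threshold.
    # The threshold below is an assumption, not the value used in the corpus.
    import textgrid

    PAUSE_THRESHOLD = 0.2  # seconds (assumed)

    def interpausal_units(word_tier):
        units, current = [], None
        for interval in word_tier:
            is_pause = (not interval.mark) or interval.mark == "#"
            long_pause = is_pause and (interval.maxTime - interval.minTime) >= PAUSE_THRESHOLD
            if long_pause and current is not None:
                units.append(tuple(current))    # close the current unit at a long pause
                current = None
            elif not is_pause:
                if current is None:
                    current = [interval.minTime, interval.maxTime]
                else:
                    current[1] = interval.maxTime
        if current is not None:
            units.append(tuple(current))
        return units

    tg = textgrid.TextGrid.fromFile("SKK001-003_M.TextGrid")
    print(interpausal_units(tg.getFirst("sõnad")))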

Size of the corpus

The current version of the corpus is v 1.2 compiled 8.09.2021 (available on DataDOI).

Size of the corpus: duration of recordings (h:mm:ss) and number of word intervals
          Sound      Video     Words   Phonemes   Syllables  Feet      IP boundaries  Creak
SKK0      83:43:31   24:35:35  565314  75:54:31   23:45:40   23:45:40  17:05:52       48:12:01
SKK1      12:51:39   NA        124170  12:51:39   02:34:08   00:45:26  NA             12:16:10
SKK2      17:33:44   NA        134287  17:33:44   04:50:03   NA        NA             09:18:11
SKK3      19:41:04   17:17:14  89772   NA         NA         NA        NA             NA
Total     133:49:57  41:52:49  913543  106:19:54  31:09:51   24:31:06  17:05:52       69:46:22

Using the corpus

The web-based search engine allows users to search within a word; matches are returned with a context of two seconds, and the corresponding wav and TextGrid files can be downloaded.

The full corpus can be downloaded for the purposes of linguistic research and developing NLP tools. In order to get access to the full corpus please contact Pärtel Lippus ().

Citing

When using the corpus in your research please cite:

Lippus, Pärtel, Kätlin Aare, Anton Malmi, Tuuli Tuisk & Pire Teras. 2021. Phonetic Corpus of Estonian Spontaneous Speech v1.2. Institute of Estonian and General Linguistics, University of Tartu. https://doi.org/10.23673/RE-293.

Copy the citation in BibTeX format:

@misc{ekskfk_2021,
    title = {Phonetic {Corpus} of {Estonian} {Spontaneous} {Speech} v1.2},
    url = {https://datadoi.ee/handle/33/351},
    doi = {10.23673/RE-293},
    language = {et},
    author = {Lippus, Pärtel and Aare, Kätlin and Malmi, Anton and Tuisk, Tuuli and Teras, Pire},
    month = {Sep. 8},
    year = {2021},
    organization = {Institute of Estonian and General Linguistics, University of Tartu},
}

People and funding

The Phonetic Corpus of Estonian Spontaneous Speech has been funded by the national project “Estonian language technology”.

Over the years the following people have contributed to the corpus by doing manual annotations: Anette Ross, Ann Siiman, Anneliis Klaus, Annika Pant, Anton Malmi, Enel Põld, Hannabel Aria, Helen Türk, Helena Joachim, Helmi Lindström, Joel Kannukene, Käbi Suvi, Kätlin Aare, Katrin Leppik, Leena Karin Toots, Liis Raasik, Lotta Saadla, Maarja-Liisa Pilvik, Maia Bubnov, Margit Tätte, Margot Möller, Merike Parve, Merle Põdra, Nele Ots, Pärtel Lippus, Pille Jahisoo, Pille Pipar, Pire Teras, Sander Pajusalu, Sille Midt, Tjorven Siiboja, Tuuli Tuisk.

Word frequencies

The word frequency table has been created from the corpus version of 20.06.2019 (v1.0.5). The corpus was lemmatized using the Filosoft morphological analyser. The table gives the 1000 most frequent lemmas with their morphological class and frequency. (More information about the Estmorf word classes can be found in the Filosoft documentation.)
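
As a rough illustration of how such a table can be built, the sketch below counts surface word forms from the word tiers of a set of TextGrid files. Unlike the published table it performs no lemmatization, and the corpus directory and the textgrid package are assumptions.

    # Simplified sketch: count surface word forms across the word tiers of a set
    # of TextGrid files. The published table counts lemmas from the Filosoft
    # Vabamorf analyser; lemmatization is omitted here for brevity.
    from collections import Counter
    from pathlib import Path
    import textgrid

    counts = Counter()
    for path in Path("corpus").glob("*.TextGrid"):     # hypothetical corpus directory
        tier = textgrid.TextGrid.fromFile(str(path)).getFirst("sõnad")
        for interval in tier:
            label = interval.mark
            if not label or label == "#" or label.startswith("."):
                continue                               # skip pauses and fillers
            counts[label.split("/")[0].lower()] += 1   # drop the "/..." comments

    for word, freq in counts.most_common(20):
        print(f"{freq}\t{word}")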