The Phonetic Corpus of Estonian Spontaneous Speech consists of recordings that have been annotated on different linguistic tiers including words and segments and their boundaries in the speech signal. The corpus mainly contains dialogues. The corpus can be used for studying different phonetic and linguistic research questions and for training various language technological applications (e.g. speech recognition, dialogue systems). In addition to the detailed phonetic segmentation the corpus has word-level annotation in standard orthography so the corpus can be used with most NLP tools built for written language.
The corpus includes:
Most of the recordings were done in a sound-proof booth at the phonetics lab or in a quiet room in the case of fieldwork recordings in the sub-corpus SKK2. The recordings were done using high quality equipment, with each speaker wearing a head-set microphone, and each signal has been recorded on a separate channel. The recordings are saved in 16 bit 44.1 kHz PCM wave format.
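As a quick sanity check after downloading, the stated audio format (16 bit, 44.1 kHz PCM) can be verified with Python's standard `wave` module. This is only a sketch: the helper names `describe_wav` and `matches_corpus_format` are made up for illustration, and the demo uses a silent file generated in memory rather than a corpus recording.

```python
import io
import wave


def describe_wav(source):
    """Return (channels, bits per sample, sample rate in Hz, duration in s)."""
    with wave.open(source, "rb") as wav:
        channels = wav.getnchannels()
        bits = wav.getsampwidth() * 8
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, bits, rate, duration


def matches_corpus_format(source):
    """True if the file is 16 bit, 44.1 kHz PCM as described above."""
    _, bits, rate, _ = describe_wav(source)
    return bits == 16 and rate == 44100


# Demo input: one second of mono silence written to an in-memory buffer.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)       # 2 bytes = 16 bit
    out.setframerate(44100)
    out.writeframes(b"\x00\x00" * 44100)

buffer.seek(0)
print(describe_wav(buffer))
```

For a real file, a path such as `"SKK001-003_M.wav"` can be passed instead of the buffer.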
The corpus consists of four sub-corpora:
The sub-corpus SKK3 contains conversations between three participants recorded by Kätlin Aare in the Stockholm University breathing lab. The recordings include sound, video and respiratory data. More information about the methods is available from here.
Since 2018 all dialogues have been recorded with video. Video is recorded with GoPro cameras. Each speaker is recorded with a separate camera. Face, hand and pose positions have been detected in each video file using OpenPose software. Here is an example of a video and an animation of the OpenPose data:
All speakers have given their informed consent for participating in the corpus recording. The purpose of this corpus was explained to them, and they were instructed to speak freely about any preferred topic for half an hour. In the case of monologues, the recordings were carried out at a conference or in a lecture.
Currently there are 207 speakers in the corpus. The distribution of the speakers’ age, gender and regional background is illustrated in the following figures. Most of the speakers have received higher education; many of them are students or faculty members.
The detailed annotation principles can be found in the corpus annotation guide (in Estonian).
The corpus is annotated using Praat software. Segmentation and annotations are available in TextGrid files.
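For readers who want to inspect the annotations outside Praat, the sketch below pulls interval tiers out of a TextGrid saved in Praat's long text format, using only regular expressions. It rests on several assumptions: the tier name `words` and the sample labels are invented for the demo, the file is assumed to be in the long (not short or binary) format, and the text is assumed to be already decoded (Praat can write UTF-8 or UTF-16). For real work a dedicated TextGrid parser is the safer choice.

```python
import re

# Header of each tier inside "item [n]:" (long text format).
TIER = re.compile(r'item \[\d+\]:\s*class = "(\w+)"\s*name = "([^"]*)"')

# One interval: start time, end time, and label (Praat doubles inner quotes).
INTERVAL = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = ([-\d.eE+]+)\s*'
    r'xmax = ([-\d.eE+]+)\s*'
    r'text = "((?:[^"]|"")*)"'
)


def read_interval_tiers(text):
    """Map tier name -> list of (start, end, label) for every IntervalTier."""
    tiers = {}
    headers = list(TIER.finditer(text))
    for i, header in enumerate(headers):
        if header.group(1) != "IntervalTier":
            continue  # point tiers are skipped in this sketch
        stop = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[header.end():stop]
        tiers[header.group(2)] = [
            (float(start), float(end), label.replace('""', '"'))
            for start, end, label in INTERVAL.findall(body)
        ]
    return tiers


# Hypothetical miniature TextGrid, not taken from the corpus.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 1.0
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "words"
        xmin = 0
        xmax = 1.0
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.4
            text = "tere"
        intervals [2]:
            xmin = 0.4
            xmax = 1.0
            text = ""
'''

print(read_interval_tiers(SAMPLE))
```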
The files are named in this manner: the file name begins with the sub-corpus ID (4 characters: SKK+number), followed by the conversation ID (2 digits), followed by a hyphen and the speaker’s ID. The speaker ID consists of three digits followed by an underscore and M/N indicating the speaker’s gender (M for males, N for females). For example, the file name “SKK001-003_M” indicates that the recording belongs to the SKK0 sub-corpus of studio dialogues, that this is recording 01 and that the speaker is male 003. The file with the recording of the other participant from the same conversation would be named “SKK001-005_N”, where the first part of the file name is the same, but the speaker ID indicates that the speaker is female 005. All files related to the same recording carry the same name and differ only by the file extension (wav, mp4, TextGrid).
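The naming scheme described above can be decoded with a short regular expression. The following is a minimal sketch; the names `RecordingName` and `parse_recording_name` are illustrative and not part of any corpus tooling.

```python
import re
from pathlib import PurePath
from typing import NamedTuple

# SKK + digit, two-digit conversation ID, hyphen, three-digit speaker, _M or _N.
NAME_PATTERN = re.compile(
    r"(?P<subcorpus>SKK\d)"
    r"(?P<conversation>\d{2})-"
    r"(?P<speaker>\d{3})_"
    r"(?P<gender>[MN])"
)


class RecordingName(NamedTuple):
    subcorpus: str      # e.g. "SKK0"
    conversation: str   # e.g. "01"
    speaker: str        # e.g. "003"
    gender: str         # "M" for male, "N" for female


def parse_recording_name(file_name):
    """Split a corpus file name (with or without extension) into its parts."""
    stem = PurePath(file_name).stem  # drops .wav, .mp4 or .TextGrid if present
    match = NAME_PATTERN.fullmatch(stem)
    if match is None:
        raise ValueError(f"not a corpus file name: {file_name!r}")
    return RecordingName(**match.groupdict())


first = parse_recording_name("SKK001-003_M.wav")
second = parse_recording_name("SKK001-005_N.TextGrid")
print(first)
# Two files belong to the same conversation if sub-corpus and conversation ID agree.
print(first[:2] == second[:2])
```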
The following annotations are provided in separate tiers:
The workflow is as follows:
In the first stage, word-level annotation is created with automatic speech recognition. This annotation is then manually corrected, and manual phoneme-level segmentation is added. For monologues, a forced aligner has also been used to create the phoneme segmentation, which is then manually corrected. Currently, the total duration of files that have been hand-labelled on word and phoneme level is NA.
After the file has been manually annotated on word and phoneme level, the following tiers are added automatically using rule-based Praat scripts: C/V, syllables, interpausal units; morphological tagging uses Filosoft’s morphological analyser.
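To give a feel for what such a rule-based derivation can look like, here is a sketch (in Python rather than Praat script) that builds interpausal units from a word tier. It is not the corpus's actual script: the assumptions that pauses are empty-labelled intervals, and that only pauses of at least 0.2 s split a unit, are placeholders chosen for the demo, as are the sample words.

```python
def interpausal_units(words, min_pause=0.2):
    """Group word intervals into units separated by pauses.

    words: list of (start, end, label); an empty label marks silence.
    min_pause: assumed threshold in seconds, NOT a documented corpus value.
    """
    units = []
    current = []

    def flush():
        # Close the unit in progress, spanning its first to its last word.
        if current:
            text = " ".join(label for _, _, label in current)
            units.append((current[0][0], current[-1][1], text))
            current.clear()

    for start, end, label in words:
        if label.strip():
            current.append((start, end, label.strip()))
        elif end - start >= min_pause:
            flush()  # a long enough silence ends the unit
        # shorter silences are ignored and stay inside the unit

    flush()
    return units


# Hypothetical word tier, not taken from the corpus.
words = [
    (0.0, 0.3, "tere"),
    (0.3, 0.35, ""),       # short gap, below the threshold
    (0.35, 0.8, "kuidas"),
    (0.8, 1.5, ""),        # pause that splits the units
    (1.5, 1.9, "läheb"),
]
print(interpausal_units(words))
```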
Part of the corpus has been manually annotated at foot level. This annotation includes information about lexical stress and quantity. Currently, the total duration of files with foot level annotation is NA.
The voice quality tier currently includes labels for creaky voice. This is initially detected using automatic voice quality detection, which is then manually corrected. Currently, the total duration of files with manually corrected creaky voice labels is NA.
Part of the corpus has been manually annotated for IP-phrases. Currently, the total duration of files with IP-phrases annotation is NA.
The current version of the corpus is v 1.3 compiled 20.10.2023 (available on DataDOI).
Sub-corpus | Sound | Video | Word segments | Lexical words | Phonemes | Syllables | Feet | IP boundaries | Creak |
---|---|---|---|---|---|---|---|---|---|
SKK0 | 84:48:21 | 25:40:25 | 641301 | 394668 | 84:15:04 | 25:52:46 | 25:52:46 | 17:05:52 | 51:16:44 |
SKK1 | 12:51:39 | NA | 124170 | 73920 | 12:51:39 | 02:34:08 | 00:45:26 | NA | 12:16:10 |
SKK2 | 17:33:44 | NA | 134295 | 88580 | 17:33:44 | 04:50:03 | NA | NA | 09:18:11 |
SKK3 | 19:41:04 | 17:17:14 | 109800 | 72531 | 19:41:04 | NA | NA | NA | 16:44:56 |
Total | 134:54:48 | 42:57:40 | 1009566 | 629699 | 134:21:30 | 33:16:57 | 26:38:12 | 17:05:52 | 89:36:01 |
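The duration columns are given as hours:minutes:seconds, so the Total row can be re-derived by summing seconds. The sketch below does this for the Sound column; the helper names are illustrative.

```python
def to_seconds(hms):
    """Convert 'HH:MM:SS' (hours may exceed 99) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert seconds back to 'HH:MM:SS'."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Sound durations per sub-corpus, copied from the table above.
sound = {
    "SKK0": "84:48:21",
    "SKK1": "12:51:39",
    "SKK2": "17:33:44",
    "SKK3": "19:41:04",
}

total = sum(to_seconds(value) for value in sound.values())
print(to_hms(total))  # 134:54:48
```

Summing the rounded row values does not always reproduce the printed total exactly: the Video column, for instance, adds up to 42:57:39 against a listed 42:57:40, presumably because the total was computed before rounding.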
The web-based search engine allows users to search within a word; matches are returned with a context of two seconds. The corresponding wav and TextGrid can be downloaded.
The full corpus can be downloaded for the purposes of linguistic research and developing NLP tools. In order to get access to the full corpus please contact Pärtel Lippus (partel.lippus@ut.ee).
When using the corpus in your research please cite:
Lippus, Pärtel, Kätlin Aare, Anton Malmi, Tuuli Tuisk & Pire Teras. 2023. Phonetic Corpus of Estonian Spontaneous Speech v1.3. Institute of Estonian and General Linguistics, University of Tartu. https://doi.org/10.23673/re-438.
Copy the citation in BibTeX format:
@misc{ekskfk_2023,
title = {Phonetic {Corpus} of {Estonian} {Spontaneous} {Speech} v1.3},
url = {https://doi.org/10.23673/re-438},
doi = {10.23673/re-438},
language = {et},
author = {Lippus, Pärtel and Aare, Kätlin and Malmi, Anton and Tuisk, Tuuli and Teras, Pire},
month = {oct. 20},
year = {2023},
organization = {Institute of Estonian and General Linguistics, University of Tartu},
}
The Phonetic Corpus of Estonian Spontaneous Speech has been funded by the national project “Estonian language technology”:
Over the years the following people have contributed to the corpus by doing manual annotations: Anette Ross, Ann Siiman, Anneliis Klaus, Annika Pant, Anton Malmi, Enel Põld, Hannabel Aria, Helen Türk, Helena Joachim, Helmi Lindström, Joel Kannukene, Käbi Suvi, Kätlin Aare, Katrin Leppik, Leena Karin Toots, Liis Raasik, Lotta Saadla, Maarja-Liisa Pilvik, Maia Bubnov, Margit Tätte, Margot Möller, Merike Parve, Merle Põdra, Nele Ots, Pärtel Lippus, Pille Jahisoo, Pille Pipar, Pire Teras, Sander Pajusalu, Sille Midt, Tjorven Siiboja, Tuuli Tuisk.
This word frequency table has been created from the corpus (version from 20.06.2019 (v1.0.5)). The corpus was lemmatized using the Filosoft morphological analyser. The table gives the 1000 most frequent lemmas with their morphological class and frequency. (Find more about Estmorf word classes).
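A table of this kind can be produced by counting (lemma, word class) pairs once the text has been lemmatized. The sketch below assumes the lemmatizer output is already available as a list of such pairs; the tokens and class tags in the demo are made up for illustration and are not corpus counts.

```python
from collections import Counter


def top_lemmas(tokens, n=1000):
    """Return the n most frequent (lemma, word_class, frequency) rows.

    tokens: iterable of (lemma, word_class) pairs from a lemmatizer.
    """
    counts = Counter(tokens)
    return [
        (lemma, word_class, frequency)
        for (lemma, word_class), frequency in counts.most_common(n)
    ]


# Toy input standing in for lemmatizer output.
tokens = [
    ("olema", "V"), ("ja", "J"), ("olema", "V"),
    ("see", "P"), ("ja", "J"), ("olema", "V"),
]
for row in top_lemmas(tokens, n=2):
    print(row)
```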