Estonian Teen Language Corpus
Vihman, Virve-Anneli; Pilvik, Maarja-Liisa; Mandel, Aive; Kängsepp, Annika; Aigro, Mari; Koreinik, Kadri; Praakli, Kristiina; Lindström, Liina
Name | Size | Description |
---|---|---|
README.txt | 3.333Kb | Overall description |
teke_chat_metadata.txt | 5.425Kb | Metadata on chat files |
teke_spoken_metadata.txt | 7.089Kb | Metadata on spoken language files |
teke_labellers.txt | 364 bytes | Labellers of the corpus |
chat_html.zip | 363.3Kb | Chat corpus in HTML form |
chat_pictures.zip | 26.03Mb | Chat corpus pictures |
chat_tsv.zip | 336.3Kb | Chat corpus in TSV form |
spoken_eaf.zip | 65.78Mb | Spoken corpus in EAF form |
spoken_tsv.zip | 3.707Mb | Spoken corpus in TSV form |
teke_participants.csv | 48.66Kb | Metadata on participants |
teke_recordings.csv | 9.208Kb | Metadata on recordings |
Abstract
Estonian Teen Language Corpus (Eesti teismeliste keele korpus) is a corpus representing spoken and written language data, collected from Estonian teenagers (ages 9-18) between 2019 and 2023. The corpus consists of four types of files. Spoken language data is represented by .eaf and .tsv files (spoken_eaf.zip, spoken_tsv.zip), which contain transcriptions of recordings of teenagers' spontaneous speech, in which one participant recorded a conversation between themselves and one or more other people. Transcriptions are annotated on several linguistic tiers, such as words, morphology, and language (see teke_spoken_metadata.txt). Version 1.0 of the corpus contains transcriptions of 116 conversations, most around one hour in length. The corpus can be used to address various linguistic research questions, as well as to train language technology applications (e.g. speech recognition, dialogue systems).
Written language data is made up of online chats, each between two teenagers (ages 10-17). Chats are represented by .tsv and .html files (chat_html.zip, chat_tsv.zip). Version 1.0 of the corpus includes 110 chats. Annotation includes language tags and abbreviations. All personal information has been anonymised.
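Once the archives have been obtained, the TSV files can be read with standard tools. The following is a minimal Python sketch, not an official loader: the function name `read_tsv_archive` is hypothetical, and it assumes the files are UTF-8 encoded, tab-separated, and unquoted. It makes no assumption about column names; the actual layout is described in teke_chat_metadata.txt and teke_spoken_metadata.txt.

```python
import csv
import io
import zipfile


def read_tsv_archive(archive):
    """Return {member_name: list of rows} for every .tsv file in a zip archive.

    `archive` is a path or a binary file-like object. Rows are returned as
    plain lists of strings, because the column layout is not assumed here.
    Assumption: files are UTF-8 encoded and tab-separated without quoting.
    """
    tables = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            # Skip directories and any non-TSV members.
            if not name.lower().endswith(".tsv"):
                continue
            with zf.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
                tables[name] = list(reader)
    return tables
```

For example, `read_tsv_archive("chat_tsv.zip")` would return one list of rows per chat file, keyed by the file's name inside the archive.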
Estonian Teen Language Corpus is a product of several consecutive projects, which are further described here: https://teismelistekeel.ee/....
To access the corpus, please write to Virve Vihman (virve.vihman@ut.ee).
Keyword
speech corpus; chat corpus; internet speech; transcriptions; morphological analysis; teenager language
Item type
info:eu-repo/semantics/dataset
Collections
The following license files are associated with this item: