Estonian Teen Language Corpus

Vihman, Virve-Anneli; Pilvik, Maarja-Liisa; Mandel, Aive; Kängsepp, Annika; Aigro, Mari; Koreinik, Kadri; Praakli, Kristiina; Lindström, Liina

Name	Size	Description
README.txt	3.333Kb	Overall description
teke_chat_metadata.txt	5.425Kb	Metadata on chat files
teke_spoken_metadata.txt	7.089Kb	Metadata on spoken language files
teke_labellers.txt	364 bytes	Labellers of the corpus
chat_html.zip	363.3Kb	Chat corpus in HTML form
chat_pictures.zip	26.03Mb	Chat corpus pictures
chat_tsv.zip	336.3Kb	Chat corpus in TSV form
spoken_eaf.zip	65.78Mb	Spoken corpus in EAF form
spoken_tsv.zip	3.707Mb	Spoken corpus in TSV form
teke_participants.csv	48.66Kb	Metadata on participants
teke_recordings.csv	9.208Kb	Metadata on recordings

Date

2023

URI

https://datadoi.ee/handle/33/596
https://doi.org/10.23673/re-455

Abstract

The Estonian Teen Language Corpus (Eesti teismeliste keele korpus) contains spoken and written language data collected from Estonian teenagers (ages 9-18) between 2019 and 2023. The corpus consists of four types of files. Spoken language data is represented by .eaf and .tsv files (spoken_eaf.zip, spoken_tsv.zip), which contain transcriptions of recordings of teenagers' spontaneous speech; in each recording, one participant recorded a conversation between themselves and one or more other people. Transcriptions are annotated on different linguistic tiers, including words, morphology, language, etc. (see teke_spoken_metadata.txt). Version 1.0 of the corpus contains transcriptions of 116 conversations, most around one hour in length. The corpus can be used to address various linguistic research questions and to train language technology applications (e.g. speech recognition, dialogue systems). Written language data consists of online chats between two teenagers (ages 10-17). Chats are represented by .tsv and .html files (chat_html.zip, chat_tsv.zip). Version 1.0 of the corpus includes 110 chats. Annotation includes language tags and abbreviations. All personal information has been anonymised. The Estonian Teen Language Corpus is the product of several consecutive projects, which are further described here: https://teismelistekeel.ee/....

To access the corpus, please write to Virve Vihman (virve.vihman@ut.ee).
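
Once access has been granted, the TSV archives (spoken_tsv.zip, chat_tsv.zip) can be read without unpacking them. The following is a minimal sketch, not part of the corpus distribution: the function name read_tsv_archive is hypothetical, and it assumes that each .tsv file is UTF-8 encoded and begins with a header row. Neither assumption is confirmed by this record, and the actual column layout is documented in teke_spoken_metadata.txt and teke_chat_metadata.txt.

```python
import csv
import io
import zipfile


def read_tsv_archive(archive):
    """Return {member name: list of row dicts} for every .tsv file in a zip.

    `archive` is a path or a binary file-like object.
    Assumptions (not stated in the dataset record): members are UTF-8
    encoded and their first line is a header row.
    """
    tables = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            # Skip directories and any non-TSV members.
            if not name.lower().endswith(".tsv"):
                continue
            with zf.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                # QUOTE_NONE keeps quotation marks in transcribed text literal.
                reader = csv.DictReader(
                    text, delimiter="\t", quoting=csv.QUOTE_NONE
                )
                tables[name] = list(reader)
    return tables
```

Calling read_tsv_archive on a local copy of spoken_tsv.zip would return one list of rows per transcription file, keyed by the file name inside the archive. The column names come from whatever header each file carries.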

Keyword

speech corpus; chat corpus; internet speech; transcriptions; morphological analysis; teenager language

Item type

info:eu-repo/semantics/dataset

Collections

Eesti ja üldkeeleteaduse andmed (Estonian and General Linguistics Data)

The following license files are associated with this item:

Creative Commons

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/restrictedAccess