###################################
######### SPOKEN DATA #########
###################################

Transcriptions of dialogues and trialogues between participants are provided in two file types. First, each conversation has its own .eaf file, which can be opened in the transcription software ELAN (https://archive.mpi.nl/tla/elan) or processed as an .xml file. Second, plain-text versions of all transcriptions are included in the repository as .tsv files, readable with Notepad variants and spreadsheet software such as MS Excel or LibreOffice Calc.

- All names (people, locations, institutions, usernames, or anything else linked to actual people involved) have been pseudonymised and substituted.
- Audio recordings (.wav and .mp3) are not included in the repository.

--------------------------
EAF FILES
--------------------------

Each file can contain up to 6 tiers.

1) Subtitle-Tier: automatically annotated speech, produced with the tool at https://tekstiks.ee/. This tier is not present in all files.
2) ref@: time-aligned turn segments
3) orto@: speech as transcribed by the transcriber (see notes below)
4) sna@: the transcribed speech on the orto@ tier, automatically divided into words. This tier has 8 subtiers:
   - lemma@: reference forms of lexemes (nominative singular for nouns, ma-infinitive for verbs, the only uninflected form for other words)
   - root@: roots of lexemes (reference forms without form morphology, e.g. -ma for verbs)
   - root_token@: same as root@, except that initial components of compound words are not analysed ('noorte_seriaal' on this tier, 'noor+te_seriaal' on the root@ tier)
   - POS@: part of speech as defined in the EstNLTK system (see below)
   - form@: inflectional form as defined in the EstNLTK system (see below)
   - ending@: inflectional morpheme as defined in the EstNLTK system (see below)
   - clitic@: presence or absence of clitics ('gi' or 'ki'), as defined in the EstNLTK system (see below)
   - other@: mostly empty; holds tags for uttered words that belong to titles in different languages (e.g., 'ee_pealkiri', 'ingl_pealkiri', 'hisp_pealkiri') or to song lyrics ('laul')
5) keel@: time-aligned language segments for languages other than Estonian (e.g., 'inglise keeles', 'vene keeles'). This tier also marks dialect words ('murdes'), regional accents ('aktsendiga'), names ('nimi'), and non-standard colloquialisms ('slängis'). A single segment on the keel@ tier can contain multiple words. Brand names were exempt from language annotation because they were categorised as names.
6) kommentaar@: free-text comments specifying situational conditions, speech peculiarities (e.g., imitation, strong stress on a word, voice quality), the use of less frequent languages not specified on the keel@ tier, etc.

orto@ tier
----------

The main transcribed text is found on the orto@ tier.

- Text follows standard orthographic conventions, with the following exceptions:
  1) initial words in sentences are not capitalised (only names are capitalised; acronyms are fully in caps),
  2) no punctuation is used, except for hyphens,
  3) well-known colloquialisms are presented together with their corresponding standard variants, e.g. 'absull[absoluutselt]', 'õps[õpetaja]', 'nv[nädalavahetus]', 'kakskend[kakskümmend]', 'vä[või]', etc. This does not apply to phonological reduction where speakers cut a single phoneme (e.g. 'käind' -> 'käinud', 'teind' -> 'teinud', 'mai tea' -> 'ma ei tea') or to regional accents (e.g. 'mötlesin' -> 'mõtlesin').
- Titles are given in quotation marks (e.g. "Tõde ja õigus").
- Words in other languages are written phonologically ('kam_oon' for 'come on'), followed by the orthographic form in their original language in square brackets and underscores ('kam_oon[_come_on_]'). If foreign words are used with Estonian morphology, the morphemes are kept in the brackets and separated by an apostrophe, e.g. 'laavin[_love'in_]'. An inclusive approach was used for categorising lexical borrowings and code-switches, so many words that are more adapted to Estonian were also annotated as non-Estonian (e.g. 'vau[_wow_]', 'laikis[_like'is_]', 'sorri[_sorry_]'). The project team discussed specific cases and checked the data for consistency.
- Pauses within turns are marked with (.) and (...), depending on pause length.
- Undeciphered text is marked with (-) and (---), depending on text length.
- Words and phrases where the transcriber was not entirely confident in what was said are enclosed in parentheses (e.g. 'ja siis (ma) tulin', 'ta oli (siis mingi) miks sa nii ütlesid').
- Unfinished or interrupted words are marked with final or initial hyphens (e.g. 'selli-', 'taht- (...) ee -sid').
- Text spoken while laughing is enclosed in @-tags (e.g. '@tegelt[tegelikult]@', 'ema @ei saanud midagi aru@'). Laughter on its own is annotated as %naer%.
- Sound imitation and humming are annotated as %imit%.
- Filled pauses (e.g., 'mm', 'ee') are also transcribed as words.
- Elongation of sounds within words is marked with : after the corresponding grapheme (e.g., 'ma üldse: ei tahtnud').
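
For readers who process orto@ text programmatically, the following is a minimal sketch of how the markup described above could be reduced to plain words. It is not part of the corpus tooling: the function name normalise_turn and the regular expressions are assumptions derived only from the conventions listed in this section. Note one limitation: everything after the first apostrophe inside [_..._] is treated as Estonian morphology, so an original-language form that itself contains an apostrophe would be cut short.

```python
import re

# Patterns are assumptions based on the orto@ conventions described above.
FOREIGN = re.compile(r"[^\s\[\]]+\[_([^\]]+?)_\]")   # kam_oon[_come_on_]
VARIANT = re.compile(r"[^\s\[\]]+\[([^\]]+)\]")      # vä[või]
NONWORD = re.compile(r"\((?:\.\.\.|\.|---|-)\)|%naer%|%imit%")


def normalise_turn(text):
    """Return a plain-word version of one orto@ turn (hypothetical helper)."""
    # Foreign words: keep the original-language spelling, drop Estonian
    # morphemes after the apostrophe, and turn underscores into spaces.
    text = FOREIGN.sub(
        lambda m: m.group(1).split("'")[0].replace("_", " "), text
    )
    # Colloquialisms: keep only the standard variant given in brackets.
    text = VARIANT.sub(lambda m: m.group(1), text)
    # Pauses, undeciphered stretches, laughter and imitation carry no words.
    text = NONWORD.sub(" ", text)
    # Laugh tags, elongation marks, uncertainty parentheses, title quotes.
    text = re.sub(r"[@:()\"]", "", text)
    # Drop unfinished or interrupted words (leading or trailing hyphen).
    words = [w for w in text.split()
             if not (w.startswith("-") or w.endswith("-"))]
    return " ".join(words)


example = ("vä[või] (.) kam_oon[_come_on_] @tegelt[tegelikult]@ "
           "%naer% ma üldse: ei tahtnud selli-")
print(normalise_turn(example))
```

Whether to keep the colloquial or the standard variant, and whether to drop interrupted words, depends on the analysis; the choices here are only one option.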

sna@ tier
----------

- The EstNLTK tags used for morphological annotation (Python 3.8, EstNLTK 1.7.1) are described here: https://github.com/estnltk/estnltk/blob/main/tutorials/nlp_pipeline/B_morphology/00_tables_of_morphological_categories.ipynb
- Unfinished or interrupted words, pauses, undeciphered speech marked (-) or (---), laughter (%naer%) and imitation (%imit%) have no morphological annotation; their corresponding sna-related tiers (sna, lemma, root etc.) contain empty values ('').
- Filled pauses such as 'ee', 'mm', 'aa' usually receive the abbreviation tag 'Y'.
- For nonstandard units, annotation is provided only for the standard variant ('ainult' in 'aint[ainult]').
- Morphological analysis of non-Estonian units was done with the Python package spaCy, which uses the Universal Dependencies annotation (https://universaldependencies.org/u/pos/). Any Estonian morphology added to a non-Estonian unit (e.g., 'subskraibisin[_subscribe'isin_]') is not considered in the annotation. For non-Estonian units, the sna@ subtiers are meaningful only for English, because they were annotated with an English model (en_core_web_trf, ver 3.7.3). English, however, accounts for more than 99% of foreign words in this corpus.
- None of the automatically added morphological annotation has been manually checked, so it contains errors!

--------------------------
TSV FILES
--------------------------

These files present the transcriptions of the sound files in simple text format (tab-delimited), readable in Excel, Notepad, etc.
Each file represents one conversation and presents information in 9 columns:

1) Osaleja_kood: the unique participant ID described in the table teke_participants.csv
2) Vooru_tekst: transcribed speech, including annotation markup (see below)
3) Ref_id: ID for the ref@ tier
4) Vooru_id: ID for the orto@ tier
5) Algusaeg_sek: start time of the utterance (in seconds)
6) Lõpuaeg_sek: end time of the utterance (in seconds)
7) Kestus_sek: duration of the utterance (column 6 value minus column 5 value)
8) Sugu: gender of the participant
9) Vanus: age of the participant

Annotation in Vooru_tekst follows the same principles as described for the orto@ tier above.
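
As an illustration of the column layout, here is a minimal sketch that reads one conversation with Python's standard csv module and checks that Kestus_sek equals the end time minus the start time. Several things are assumptions rather than documented facts: that each file starts with a header row carrying the column names above, that the encoding is UTF-8, that times use a decimal point, and that fields are not quoted (quoting is switched off because titles in Vooru_tekst contain quotation marks). The function name read_turns and the sample row (participant ID, gender value, times) are made up for the example.

```python
import csv
import io

COLUMNS = ["Osaleja_kood", "Vooru_tekst", "Ref_id", "Vooru_id",
           "Algusaeg_sek", "Lõpuaeg_sek", "Kestus_sek", "Sugu", "Vanus"]


def read_turns(handle):
    """Yield one dict per turn from an open tab-delimited file (sketch)."""
    # Assumption: the first row is a header with the nine column names.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for row in reader:
        # Convert the three time columns to numbers (assumes decimal points).
        for name in ("Algusaeg_sek", "Lõpuaeg_sek", "Kestus_sek"):
            row[name] = float(row[name])
        yield row


# Made-up sample standing in for a real file; for a real one, use e.g.
# open(path, encoding="utf-8", newline="") with a path of your own.
sample = (
    "\t".join(COLUMNS) + "\n"
    "P01\tvä[või] (.) ma ei tea\tref1\tturn1\t0.50\t2.00\t1.50\tN\t17\n"
)
turns = list(read_turns(io.StringIO(sample)))

for turn in turns:
    expected = turn["Lõpuaeg_sek"] - turn["Algusaeg_sek"]
    # Allow a small tolerance for rounding in the stored duration.
    assert abs(turn["Kestus_sek"] - expected) < 1e-3
    print(turn["Osaleja_kood"], turn["Kestus_sek"], turn["Vooru_tekst"])
```

If the real files turn out to lack a header row, passing fieldnames=COLUMNS to csv.DictReader would be the corresponding adjustment.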