###################################
############ CHATS ############
###################################

Chat data is represented by two file types. First, each conversation has its own .tsv file, which may be opened with any plain text viewer (Notepad, Notepad++, etc.) or spreadsheet software (MS Excel, LibreOffice Calc). In addition, each conversation is also included as an .html file, which is more easily human-readable, with emojis presented graphically.

--------------------------
TSV FILES
--------------------------

These are tab-delimited plain text files, one per chat conversation, with each turn (more specifically, each sent message) on a separate line. Text is annotated for certain types of abbreviations and language use (see below).

Each tsv file contains 5 columns:
1) Jutt (Text): 'jah' if the row includes a turn typed by a participant, 'no' if the row includes metatext by the messaging app, e.g. declaring a voice message and its duration (contents of voice messages are not included in this repository), a video chat and its duration, forwarded messages, replying to messages, unsending messages.
2) Aeg (Time): time information pertaining to each message. Sometimes this is expressed as a date, sometimes as the time of day. This depends on the format copied by participants into a text processor, so the column does not follow a single consistent format.
3) Osaleja (Participant): participant IDs, the metadata of which is found in the file teke_participants.csv (it includes both spoken conversation and chat participants, distinguished in the second column).
4) Puhas_tekst (Plain_text): the text as written by a participant. The following information types have been replaced for pseudonymisation:
   - first and last names of people,
   - pet names,
   - institution names,
   - titles of projects,
   - usernames used online,
   - nicknames,
   - passwords,
   - dates and locations.
   Note that all personal pictures have been replaced with a brief description of their contents, e.g.
   [pilt_kodutööst] ('a picture of homework'). All non-personal pictures are retained with file names and may be accessed via the folder "chat_pictures" in this repository.
5) Margendatud_tekst (Annotated_text): the content of the "Puhas_tekst" column with added annotation. The following types are included:
   - confidential info, annotated as ... where xxx stands for different subcategories of confidentiality:
      - "name" (...): names of people and animals
      - "other" (...): all other information pertaining to relevant people, e.g. institutions, locations, etc. (see above).
     All information between the confidential info tags is already pseudonymised.
   - language, annotated as ... where xxx stands for a particular language, with the following subcategories:
      - "eng": English (e.g. whatever)
      - "rus": Russian
      - "fin": Finnish
      - "fre": French
      - "ger": German
      - "unk": unknown language
     For instance, the following phrases were all coded as English: "Don't worry", "Dõunt vörri", "Sorry", "sorri", "bro", "bruh", "tänks", "thanks". Names were not coded for language ("McDonald's").
   - abbreviations, annotated with a tag in which xxx stands for the type of abbreviation, with the following levels:
      - "CUT": a simple shortening, either by dropping the final part of lexemes or by deleting whitespace or characters from the middle (e.g. Lic as an abbreviation of "lihtsalt" ("simply")).
      - "INIT": a phrase shortened to its initial letters only (e.g. v for "või", the phrase-final polar question marker, gg for "good game")
      - "PRON": a phrase shortened by matching its orthography more closely to pronunciation (e.g. sen for "see on" ("this is"))
      - "KON": a phrase shortened to a widely used colloquial form (e.g. siuke for "selline" ("this kind of"/"in this way"))

--------------------------
HTML FILES
--------------------------

HTML files present information in four columns:
1) participant: the unique ID described in the table teke_participants.csv.
2) time: time of the conversation as copied by the participant sending the file to the project. Sometimes this specifies the time of day, sometimes only the date.
3) text_role: this column specifies whether the row contains text written by the participant or metatext by the messaging application. For instance, replies to other messages, voice messages and video messages are reflected in this row when text_role = "ei".
4) text: text written by the participant or metatext added by the application (e.g. replies, voice messages, etc.).

Emojis are presented as pictures, except for complex emojis which are made up of several Unicode values (U+...) at once. If a conversation contains such emojis, the header specifies their codes under the subcategory "Unconverted emojis" (e.g. Ant6m_397241_Messenger_1.html).

Pictures are presented in the same way as in the tsv files: file names are provided for non-confidential pictures and picture descriptions are provided for confidential pictures. Descriptions vary, some being more ambiguous than others. [peidetud_pilt] refers to a picture that was included in the text, the contents of which went missing during the copying and pasting of chats by participants. All picture information is given in square brackets.
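
As a rough illustration of how the tsv files described above might be read programmatically, the following Python sketch loads one chat, keeps only the rows typed by participants (Jutt = 'jah'), and lists the square-bracketed picture information in a message. It rests on several assumptions that this README does not confirm: that each tsv file starts with a header row, that the five columns appear in the order listed above, and that files are UTF-8 encoded. The function names, the sample rows and the participant IDs "P01" and "P02" are hypothetical and not taken from the corpus.

```python
import csv
import io
import re

# Column names as listed in the TSV FILES section. That the files carry a
# header row and use exactly this column order is an assumption of this sketch.
COLUMNS = ["Jutt", "Aeg", "Osaleja", "Puhas_tekst", "Margendatud_tekst"]

# Picture information is given in square brackets, e.g. [pilt_kodutööst].
BRACKETED = re.compile(r"\[([^\[\]]+)\]")


def read_chat(handle, has_header=True):
    """Return the rows of one chat as a list of dicts keyed by COLUMNS."""
    # QUOTE_NONE keeps quotation marks inside messages as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]
    if has_header and rows:
        rows = rows[1:]
    return [dict(zip(COLUMNS, row)) for row in rows]


def participant_turns(rows):
    """Keep only rows typed by a participant (Jutt == 'jah')."""
    return [row for row in rows if row.get("Jutt") == "jah"]


def bracketed_items(text):
    """List the contents of all square-bracketed spans in a message."""
    return BRACKETED.findall(text)


# Made-up sample data for illustration only; not taken from the corpus.
sample = (
    "Jutt\tAeg\tOsaleja\tPuhas_tekst\tMargendatud_tekst\n"
    "jah\t12:01\tP01\ttere [pilt_kodutööst]\ttere [pilt_kodutööst]\n"
    "no\t12:02\tP02\tmetatext placeholder\tmetatext placeholder\n"
)
rows = read_chat(io.StringIO(sample))
turns = participant_turns(rows)
print(len(rows), len(turns), bracketed_items(turns[0]["Puhas_tekst"]))
```

To read a real file, the same function can be given an open file handle, for example open(path, encoding="utf-8", newline=""), where path is a placeholder for one of the .tsv files. If the files turn out to have no header row, pass has_header=False.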