Author of README file: Virve Vihman, Merilin Miljan
Last updated: 19.10.2023

-------------------
GENERAL INFORMATION
-------------------

Title of the dataset: Data for "A corpus study of grammatical case forms in written and spoken Estonian: Frequency, distribution and grammatical role"
DOI: https://doi.org/10.23673/re-429
URL: https://datadoi.ee/handle/33/567
Description: Coded corpus data of Nominative, Genitive, Partitive Nouns used in the paper "A corpus study of grammatical case forms in written and spoken Estonian: Frequency, distribution and grammatical role"
Authors of the paper: Merilin Miljan, Virve-Anneli Vihman
Licence: CC-BY
Acknowledgements: This study was supported by funding from the University of Tartu's base funding grants in humanities, awarded to each author, which supported the data coding and analysis for this study. Initial manual coding of the data was performed by Carl Eric Simmul and Merilyn Muru.
Aims of the study: To probe the relationship in distributional frequency between morphological case (nominative, genitive, partitive) and the various factors which affect its use.
Contact: Virve Vihman, University of Tartu, virve.vihman@ut.ee

--------------------
DATA & FILE OVERVIEW
--------------------

The dataset consists of the following files:
- Documentation: 'README.txt'. This file (= the current document) contains the documentation of the dataset.
- Coded data: 'Nouns_23-08-20.csv' and 'Nouns_23-10-19.csv'

--------------------------------------------------------------------------------
DATA-SPECIFIC INFORMATION
--------------------------------------------------------------------------------

Files: 'Nouns_23-08-20.csv' and 'Nouns_23-10-19.csv'
* encoding: UTF-8 Unicode
* delimiter: ,
* All text is in quotation marks

The 'Nouns_23-08-20.csv' file contains coded data analysed and reported on in the paper "A corpus study of grammatical case forms in written and spoken Estonian: Frequency, distribution and grammatical role".
The 'Nouns_23-10-19.csv' file contains the same data with minor corrections and an additional column to identify the analysed noun.

Both files contain a total of 2370 items. These items represent all nominative-, partitive-, and genitive-marked nominals found in a random sample of 1509 clauses. The sample consists of 751 clauses extracted randomly from the Fiction subcorpus of the University of Tartu's Balanced Corpus of Written Estonian (cl.ut.ee/korpused) and 758 clauses of spoken data, randomly drawn from the University of Tartu's Corpus of Spoken Estonian, maintained by the research group of Spoken Estonian (not publicly available at the time of coding). The data selection and coding process is described in the paper.

Clauses are coded for the variables listed below, which are defined and described in the paper, along with the levels used in the coding.

Columns in the data file (with levels where appropriate):

[1] 'source': WRI (written corpus), SPO (spoken corpus)
[2] 'number': item number
[3] 'clause': context of the nominal
[4] 'word_no': the number of the word analysed in the clause
    Note: This column only exists in the 'Nouns_23-10-19.csv' file.
    When counting the word number, the following applies:
    * punctuation marks are not counted
    * hyphenated words count as one word
    * some words are represented as a range (e.g. 4-5 for the name "Balti Kett")
    * in the SPO corpus, the sign "=" acts as a separator
    * in the SPO corpus, the symbol "$" counts as one word
    * in the SPO corpus, (0.2)/(0.6) etc. count as one word
[5] 'clause_type': clause type (decl, excl, imper, inter)
[6] 'const_order': constituent order in the clause:
    S (subject), V (finite verb), B (copula; in the SPO corpus, all verbs are marked as V), O (object), X (adverbial, subordinate clause, etc.)
[7] 'Vposition': position of verb in the linear order of constituents (first, second, third, final, other)
[8] 'polarity': negative, affirmative
[9] 'NPform': NP (lexical nouns and noun phrases), pron (pronoun)
[10] 'NPcase': nom (nominative), gen (genitive), par (partitive)
[11] 'NPnumber': sg (singular), pl (plural)
[12] 'NPcount': countability (mass, count, NA)
[13] 'NPanimacy': an (animate), in (inanimate)
[14] 'GramRole': subject, object, compl (predicate complement), possessor, obj-pp (object of pre- or postposition), obj-inf (infinitival complement), noun-compl (noun complement), other (various less frequent categories)
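
Example of reading the coded data: the following is a minimal sketch, not part of the original analysis, showing one way the files could be read with Python's standard library. It assumes that the first row of each CSV file holds the column names exactly as listed above (e.g. 'source', 'NPcase'); this has not been checked against the files. The helper names parse_word_no and count_cases are hypothetical, and the two sample rows are invented for illustration only.

```python
import csv
import io
from collections import Counter


def parse_word_no(value):
    """Turn a 'word_no' entry such as '7' or '4-5' into a (first, last) pair."""
    first, _, last = value.strip().partition("-")
    # A single number has no "-", so the range starts and ends on the same word.
    return int(first), int(last or first)


def count_cases(handle):
    """Count rows per (source, NPcase) pair in an open CSV file.

    Assumes a header row with the column names 'source' and 'NPcase'.
    """
    # csv defaults match the documented format: "," delimiter, quoted text.
    reader = csv.DictReader(handle)
    return Counter((row["source"], row["NPcase"]) for row in reader)


# Hypothetical two-row sample, invented for illustration (not real data).
sample = io.StringIO(
    '"source","number","NPcase"\n'
    '"WRI","1","nom"\n'
    '"SPO","2","par"\n'
)
counts = count_cases(sample)
for (source, case), n in sorted(counts.items()):
    print(source, case, n)
```

To apply the same counting to the real data, the sample could be replaced with a file opened as open('Nouns_23-10-19.csv', encoding='utf-8', newline=''). Note that parse_word_no only applies to that file, since the 'word_no' column is absent from 'Nouns_23-08-20.csv'.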