# Utterance Final Weakening (UFW) in Pite Saami (v1.0)

---

## Basic overview information

|category|details|
|---|---|
|TITLE OF THE DATASET| Utterance Final Weakening (UFW) in Pite Saami (v1.0)|
|SHORT DESCRIPTION| dataset of UFW instances from a subset of Joshua Wilbur's Pite Saami corpus|
|CREATOR| Joshua Wilbur|
|CONTACT| joshua.wilbur@ut.ee; jeutzsch@gmail.com |
|VERSION| 1.0|
|VERSION DATE| 2025-02-19|
|FILE(S) INCLUDED| UFW_rawdata.json|
|FILE FORMAT| .json|
|DataDOI ABSTRACT | A .json file containing data for 182 instances of Utterance Final Weakening (UFW) in Pite Saami, a Uralic language spoken in and around the Arjeplog municipality in northern Sweden and adjacent areas in Norway. The dataset was extracted from a subset of J. Wilbur's Pite Saami corpus. More detailed metadata is found in the files. |
|KEYWORDS |Pite Saami, utterance-final weakening, prosody, devoicing, sociolinguistic variance |


## Methodological information

The data in this file was extracted from a sample of 13 recordings of spontaneous Pite Saami speech featuring 11 different native speakers. The natural flow of speech in these recordings was divided into utterance units roughly corresponding to sentences and/or intonational units, using the software [ELAN](https://archive.mpi.nl/tla/elan) to create and manage annotations. The utterances were then transcribed manually by J. Wilbur with the assistance of native speakers, using the current Pite Saami orthography standard (as of 2019), and then tokenized automatically. For each token, annotations for lemma, word class, relevant morphological values and an English gloss were added automatically using natural language processing (more specifically, a Python script applying a finite state transducer and constraint grammar, following the method outlined in Gerstenberger et al. (2017)); any remaining ambiguities were manually resolved by J. Wilbur whenever possible.

These recordings consist of 2368 utterances by Pite Saami native speakers, 228 of which are instances of UFW. A subset of 182 instances was chosen for this dataset; the other 46 were removed: 34 because no reliable transcription was available, 7 because the utterances were predominantly in Swedish, and 5 because the extant transcription was insufficient for an unambiguous analysis. Each instance was evaluated by hand and with Python scripts to compile the current dataset.
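
The counts reported above can be cross-checked with a few lines of Python. This is only an illustrative sketch: the variable names and the short labels for the exclusion reasons are made up here and are not part of the dataset.

```python
# Counts taken from the description above; names and labels are illustrative only.
total_utterances = 2368
ufw_instances = 228
excluded = {
    "no reliable transcription": 34,
    "predominantly in Swedish": 7,
    "transcription insufficient for unambiguous analysis": 5,
}

n_excluded = sum(excluded.values())
retained = ufw_instances - n_excluded

# The three exclusion reasons account for all removed instances.
assert n_excluded == 46
assert retained == 182

# Share of all utterances in the sample that show UFW (before exclusions).
ufw_share = ufw_instances / total_utterances
print(f"{n_excluded} excluded, {retained} retained; UFW in {ufw_share:.1%} of utterances")
# prints: 46 excluded, 182 retained; UFW in 9.6% of utterances
```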

**Reference:**
Gerstenberger, Ciprian, Niko Partanen, Michael Rießler, & Joshua Wilbur (2017). “Instant annotations. Applying NLP methods to the annotation of spoken language documentation corpora”. In: *Proceedings of the 3rd International Workshop on Computational Linguistics for Uralic languages*. Ed. by Tommi A. Pirinen, Michael Rießler, Trond Trosterud, & Francis M. Tyers. ACL Anthology. St. Petersburg: Association for Computational Linguistics, pp. 25–36. DOI: [10.18653/v1/W17-0604](https://doi.org/10.18653/v1/W17-0604).


## Data specific information

Metadata about the dataset is found in the initial and highest node of the .json file itself, and repeated here for convenience:

|category|explanation|
|---|---|
| ID | full session name, '(final) dash', three-digit utterance reference; data type: string |
| ID_link | short ID for use in publication; data type: string |
| utterance | utterance in standard Pite Saami orthography; data type: string |
| speaker | ID code for speaker; data type: string |
| sylNo | number of syllables at least partially subject to the instance of UFW; data type: integer |
| pos | part-of-speech for the word(s) affected by the instance of UFW; data type: string |
| pause | duration of the pause between the end of the instance of UFW and the following utterance (in milliseconds); data type: integer |
| nextSameSp | 'true' if the utterance following the instance of UFW is by the same speaker; 'false' if not; 'null' if the instance is the final utterance in a session; data type: pseudo-boolean (boolean or 'null') |
| notes | notes about the instance of UFW; data type: string |
| cgTail | the Constraint Grammar (CG) analyses for the ultimate or penultimate tokens of the instance of UFW (see ) when available in the source ELAN file (otherwise 'null'); data type: string |
| gender | the biological gender of the speaker of the instance of UFW ('m' for male; 'f' for female); data type: string |
| syllCats | two-digit code used to classify the syllable preceding the onset of the instance of UFW (represented by the first digit) and the initial syllable affected by the instance of UFW (represented by the second digit); values: 1 = initial stressed syllable; 2 = second unstressed syllable; 3 = third unfooted/unstressed syllable; 4 = unstressed monosyllabic form; data type: string |
| syntaxBoundary | code used to classify whether the syllable containing the onset of the instance of UFW aligns with a word-level syntactic constituent; values: 0 = no alignment; 1 = UFW-onset aligns with a syntactic constituent boundary; 2 = UFW-onset aligns with a word-internal compound boundary; data type: string |
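
As a rough illustration of how the fields above might be handled in Python, the sketch below checks one record against the listed data types and decodes the `syllCats` and `syntaxBoundary` codes. The example record is entirely hypothetical (all values are invented placeholders, not taken from the dataset), the helper names are made up for this sketch, and the layout of `UFW_rawdata.json` around each record is not assumed here. The sketch also assumes that 'null' values appear as JSON `null`.

```python
import json

# Hypothetical record: field names follow the table above, values are invented.
EXAMPLE_RECORD = json.loads("""
{
  "ID": "example_session-001",
  "ID_link": "ex001",
  "utterance": "<utterance text>",
  "speaker": "X",
  "sylNo": 2,
  "pos": "N",
  "pause": 350,
  "nextSameSp": null,
  "notes": "",
  "cgTail": null,
  "gender": "f",
  "syllCats": "21",
  "syntaxBoundary": "1"
}
""")

# Expected Python types after JSON parsing (None stands for JSON null; assumption).
EXPECTED_TYPES = {
    "ID": str, "ID_link": str, "utterance": str, "speaker": str,
    "sylNo": int, "pos": str, "pause": int,
    "nextSameSp": (bool, type(None)),
    "notes": str, "cgTail": (str, type(None)),
    "gender": str, "syllCats": str, "syntaxBoundary": str,
}

SYLLABLE_CATEGORIES = {
    "1": "initial stressed syllable",
    "2": "second unstressed syllable",
    "3": "third unfooted/unstressed syllable",
    "4": "unstressed monosyllabic form",
}

SYNTAX_BOUNDARY = {
    "0": "no alignment",
    "1": "UFW-onset aligns with a syntactic constituent boundary",
    "2": "UFW-onset aligns with a word-internal compound boundary",
}


def type_problems(record):
    """Return a list of fields that are missing or have an unexpected type."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it for integer fields.
        bool_as_int = expected is int and isinstance(value, bool)
        if bool_as_int or not isinstance(value, expected):
            problems.append(f"{field}: unexpected type {type(value).__name__}")
    return problems


def decode_syll_cats(code):
    """Split a two-digit syllCats code into (preceding syllable, first affected syllable)."""
    if len(code) != 2:
        raise ValueError(f"syllCats must have exactly two digits, got {code!r}")
    preceding, first_affected = code
    return SYLLABLE_CATEGORIES[preceding], SYLLABLE_CATEGORIES[first_affected]


print(type_problems(EXAMPLE_RECORD))
print(decode_syll_cats(EXAMPLE_RECORD["syllCats"]))
print(SYNTAX_BOUNDARY[EXAMPLE_RECORD["syntaxBoundary"]])
```

Because `syllCats` and `syntaxBoundary` are stored as strings rather than integers, the lookup tables in this sketch are keyed by string digits.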


## Sharing and Access information

Use of this file is governed by a **CC BY-NC-SA** license (Attribution-NonCommercial-ShareAlike). This license allows you to remix, tweak, and build upon this work non-commercially, provided you acknowledge the source (see above) and license any derivative works on the same terms. [See here for full details](https://creativecommons.org/licenses/by-nc-sa/4.0/).


---

v20250613 - J. Wilbur