*********************************************************************************
***  Native Language Background Affects the Perception of Duration and Pitch  ***
*********************************************************************************

Authors: Siqi Lyu, Nele Põldver, Liis Kask, Luming Wang, Kairi Kreegipuu

Corresponding author: Kairi Kreegipuu
Contact Information: kairi.kreegipuu@ut.ee
Näituse 2, Tartu 50409, Tartu, Estonia



****** General Introduction ******
This project includes the raw EEG data and the pre-processed datasets exported from BrainVision Analyzer.
The R/Matlab scripts used for statistical analyses and plotting are available separately on OSF: https://osf.io/acxq9/.

The data were collected at the Cross-linguistic and Brain Sciences Lab at Zhejiang University of Technology, China, between May and June 2021, and at the Laboratory of Experimental Psychology at the University of Tartu, Estonia, between November 2021 and April 2022.



****** 0_raw_EEG_data ******
*** Equipment ***
EST: BioSemi 64-channel system, recorded with ActiView
CHN: Brain Products actiCHamp 64-channel system, recorded with BrainVision Recorder

*** Data format ***
EST: each recording session generated one .bdf data file
CHN: each recording session generated one .eeg, one .vhdr, and one .vmrk data file

*** Naming convention ***
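Group number _ Subject number + Native language + Age + Gender _ series order _ series name
e.g., G1_01CN22F_1_JIDI1.eeg (group G1; subject 01, native language CN, age 22, gender F; series order 1; series name JIDI1)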
NB! 
1. The data for this paper include group numbers G1, G3, and G5. The group numbers were assigned for purposes unrelated to the current paper.
2. The series "JIDI2" was for our other project and was not analyzed in the current paper.
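The naming convention above can be parsed programmatically. The following is a minimal Python sketch, not part of the original analysis scripts. The field shapes in the pattern (a two-letter language code, a one-letter gender code, digits for subject number and age) are assumptions inferred from the single example file name, and the names RecordingName, parse_recording_name, and used_in_current_paper are hypothetical.

```python
import re
from typing import NamedTuple


class RecordingName(NamedTuple):
    group: str         # e.g. "G1"
    subject: str       # e.g. "01" (kept as text to preserve the leading zero)
    language: str      # e.g. "CN"
    age: int
    gender: str
    series_order: int
    series_name: str   # e.g. "JIDI1"


# Assumed field shapes, inferred only from the example "G1_01CN22F_1_JIDI1.eeg".
NAME_PATTERN = re.compile(
    r"^(?P<group>G\d+)_"
    r"(?P<subject>\d+)(?P<language>[A-Z]{2})(?P<age>\d+)(?P<gender>[A-Z])_"
    r"(?P<series_order>\d+)_"
    r"(?P<series_name>[^.]+)"
    r"(?:\.\w+)?$"
)


def parse_recording_name(filename: str) -> RecordingName:
    """Split a raw EEG file name into its fields; raise if it does not fit."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return RecordingName(
        group=match["group"],
        subject=match["subject"],
        language=match["language"],
        age=int(match["age"]),
        gender=match["gender"],
        series_order=int(match["series_order"]),
        series_name=match["series_name"],
    )


def used_in_current_paper(name: RecordingName) -> bool:
    """Apply the two NB notes: groups G1/G3/G5 only, and no JIDI2 series."""
    return name.group in {"G1", "G3", "G5"} and name.series_name != "JIDI2"
```

For instance, parse_recording_name("G1_01CN22F_1_JIDI1.eeg") yields group "G1", subject "01", language "CN", age 22, gender "F", series order 1, and series name "JIDI1". File names that deviate from the assumed shapes raise a ValueError rather than being parsed silently.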

*** Event markers/triggers in the raw EEG data ***
Sada word: 128 (Q2-170), 64 (Q2-290), 32 (Q3-170), 16 (Q3-290)
Sada pure tone: 128 (Q2-170), 64 (Q2-290), 32 (Q3-170), 16 (Q3-290)
Jidi word: 128 (T1-150), 64 (T1-250), 32 (T2-150), 16 (T2-250)
Jidi pure tone: 128 (T1-150), 64 (T1-250), 32 (T2-150), 16 (T2-250)
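Because the same four marker values are reused across series, a marker can only be interpreted together with the stimulus set (Sada or Jidi) it was recorded in. The lookup below is a small Python sketch of the table above. The names TRIGGER_LABELS and decode_trigger are hypothetical, and the sketch assumes the marker values have already been read from the raw files as plain integers.

```python
# Marker value -> stimulus label, copied from the list above.
# Word and pure-tone series share the same mapping within each stimulus set.
TRIGGER_LABELS = {
    "sada": {128: "Q2-170", 64: "Q2-290", 32: "Q3-170", 16: "Q3-290"},
    "jidi": {128: "T1-150", 64: "T1-250", 32: "T2-150", 16: "T2-250"},
}


def decode_trigger(stimulus_set: str, code: int) -> str:
    """Return the stimulus label for a marker value in a Sada or Jidi series."""
    try:
        return TRIGGER_LABELS[stimulus_set.lower()][code]
    except KeyError:
        raise ValueError(
            f"unknown stimulus set or marker: {stimulus_set!r}, {code!r}"
        ) from None
```

For example, decode_trigger("Jidi", 64) returns "T1-250", while the same marker in a Sada series, decode_trigger("Sada", 64), returns "Q2-290".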



****** 0_raw_AX_data ******
*** Equipment ***
EST: E-Prime 2
CHN: E-Prime 3

*** Naming convention ***
EST: G1EE, G3EE, G5EE
CHN: G1CN, G3CN, G5CN
NB! Each file contains 20-21 subjects.

*** Data structure ***
Each row is one trial
Columns:
Subject: subject number
audioA: name of the first audio in that trial
audioX: name of the second audio in that trial
Comp.CRESP: correct response for that trial
Comp.RESP: subject's response for that trial
Comp.RT: subject's response time (0 means the subject missed that trial)
condition: "same" for same pairs and "different" for different pairs
item: D1, D6, D7, D12 - duration difference; D2, D5, D8, D11 - pitch difference; D3, D4, D9, D10 - duration plus pitch difference; S1, S2, S3, S4 - same pairs
Running: "pra" for practice trials and "Trials" for experimental trials
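The column descriptions above are enough to tabulate responses per subject and item type. The following Python sketch is an illustration only, not the authors' analysis (their scripts are on OSF). It assumes the rows have already been loaded as dictionaries keyed by the column names above, that a response is correct when Comp.RESP equals Comp.CRESP, and that missed trials (Comp.RT of 0) are counted separately rather than as errors. The names ITEM_TYPES and summarize_ax are hypothetical.

```python
from collections import defaultdict

# Item code -> type of difference, as listed under "item" above.
ITEM_TYPES = {
    **dict.fromkeys(["D1", "D6", "D7", "D12"], "duration"),
    **dict.fromkeys(["D2", "D5", "D8", "D11"], "pitch"),
    **dict.fromkeys(["D3", "D4", "D9", "D10"], "duration+pitch"),
    **dict.fromkeys(["S1", "S2", "S3", "S4"], "same"),
}


def summarize_ax(rows):
    """Count answered, correct, and missed trials per (subject, item type).

    `rows` is an iterable of dicts keyed by the column names in this README.
    Practice trials (Running == "pra") are skipped.
    """
    counts = defaultdict(lambda: {"answered": 0, "correct": 0, "missed": 0})
    for row in rows:
        if row["Running"] != "Trials":
            continue  # keep experimental trials only
        cell = counts[(str(row["Subject"]), ITEM_TYPES[row["item"]])]
        if float(row["Comp.RT"]) == 0:
            cell["missed"] += 1  # RT of 0 marks a missed trial
            continue
        cell["answered"] += 1
        # Assumption: correctness is a plain match of response and key.
        if str(row["Comp.RESP"]) == str(row["Comp.CRESP"]):
            cell["correct"] += 1
    return {key: dict(value) for key, value in counts.items()}
```

Accuracy for a cell can then be taken as correct / answered. Whether missed trials should instead count as errors is an analysis decision that this sketch leaves open.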



****** 1_preprocessing_steps ******
This folder contains the templates used for pre-processing the EEG data in BrainVision Analyzer.
The EST and CHN steps differed in the editing of channels because the data were recorded by different systems. The other steps were identical for both groups.



****** 2_datasets_for_GAMM ******
*** File format ***
Each event in each series was exported separately for each subject, as three files: .dat, .vhdr, and .vmrk. The .dat file contains the EEG amplitudes. The .vhdr and .vmrk files contain the artifact information for the corresponding .dat file; they were used to exclude trials/segments containing artifacts during data preparation for the GAMM analysis, performed later in R (see the code on OSF).

*** Data structure ***
The .dat file contained unaveraged single-trial data.
Each row is a sampling point of a single trial. For example, at the 500 Hz sampling rate used in China, a segment from -100 to 600 ms (i.e., 700 ms) has 350 sampling points (rows) per trial, so each .dat file has 350*nTrial rows in total.
Each column is one EEG channel, as indicated by the column name.
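The row arithmetic above can be checked with a short calculation. The Python sketch below is illustrative only: it derives the number of rows per trial from the sampling rate and segment limits, then cuts a list of rows into trials. It assumes that the rows of each trial are stored consecutively in the .dat file, which this README does not state explicitly. The function names are hypothetical.

```python
def samples_per_segment(sampling_rate_hz: int, start_ms: int, end_ms: int) -> int:
    """Number of sampling points in one segment, e.g. 500 Hz over 700 ms -> 350."""
    total = sampling_rate_hz * (end_ms - start_ms)
    if total % 1000 != 0:
        raise ValueError("segment length is not a whole number of samples")
    return total // 1000


def split_into_trials(rows: list, n_per_trial: int) -> list:
    """Cut a flat list of rows into consecutive trials of n_per_trial rows each.

    Assumption: trials are stored one after another, without gaps.
    """
    if len(rows) % n_per_trial != 0:
        raise ValueError("row count is not a multiple of the segment length")
    return [rows[i:i + n_per_trial] for i in range(0, len(rows), n_per_trial)]
```

With the values from the example, samples_per_segment(500, -100, 600) gives 350, and a file with 350*nTrial rows splits into nTrial lists of 350 rows each. A row count that is not a multiple of 350 raises an error, which can serve as a quick integrity check on an exported file.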



****** 2_datasets_for_LMM ******
*** File format ***
Each event in each series was exported separately for each subject, as one .txt file containing clean averaged ERP amplitudes. Trials/segments containing artifacts were excluded automatically by Analyzer during averaging.

*** Data structure ***
The file contained averaged data (averaged within each subject).
Each row is a sampling point averaged across trials. For example, at the 500 Hz sampling rate used in China, a segment from -100 to 600 ms (i.e., 700 ms) has 350 sampling points per trial. Because the data were averaged across trials (within the subject), the averaged .txt file has 350 rows in total.
Each column is one EEG channel, as indicated by the column name.
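Since the averaged files hold one row per sampling point, a time axis can be reconstructed from the sampling rate and the segment start. The Python sketch below is a hedged illustration. It assumes that the first row corresponds to the segment start (-100 ms in the example) and that rows are evenly spaced at 1000 / sampling rate ms, so the 350th row falls at 598 ms. The function names are hypothetical.

```python
def time_axis_ms(n_rows: int, sampling_rate_hz: float, start_ms: float) -> list:
    """Time stamp in ms for each row, assuming row 0 sits at start_ms."""
    step_ms = 1000 / sampling_rate_hz  # 2 ms at 500 Hz
    return [start_ms + i * step_ms for i in range(n_rows)]


def rows_in_window(times_ms: list, from_ms: float, to_ms: float) -> list:
    """Indices of rows whose time stamp lies in [from_ms, to_ms)."""
    return [i for i, t in enumerate(times_ms) if from_ms <= t < to_ms]
```

For the 500 Hz example, time_axis_ms(350, 500, -100) starts at -100 ms, reaches 0 ms at row index 50, and ends at 598 ms. rows_in_window can then select the rows of any time window of interest before amplitudes are summarized per channel.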