Cleaning raw web data
The TEDS26 data file initially downloaded from the Qualtrics server, and the CATSLife data files downloaded from the Quodit server, were not suitable for long-term retention, for several reasons:
- They included a range of identifiers, present for admin purposes, including twin logins, names, postal addresses, phone numbers, IP addresses, children's names, and other details.
- They included a number of redundant variables that were added by the web server but were not useful for research purposes.
- The TEDS26 raw data file included a number of text response fields, some containing potentially identifiable responses, which needed to be coded before they could be added to the dataset.
- The TEDS26 data file contained two rows of headings, one of which was surplus to requirements.
- The CATSLife raw data were originally in separate files for the Spatial Spy and Test My Brain tests, with different identifiers, which had to be combined.
- The Test My Brain raw data file had a complex structure, with multiple rows per participant and data fields encapsulated in long JSON strings; these were restructured into a single file with data fields in conventional variables.
The final versions of the downloaded data files were therefore transformed into a single cleaned raw data file for each data collection (TEDS26 and CATSLife). The cleaned raw data files have been permanently retained, for the purpose of building the dataset. The original downloaded files were discarded once all data transformations and checks were completed.
The process of converting the original data files into the cleaned raw data files involved the following steps:
- TEDS26
- The original data file downloaded from the server was a plain text file with tab-separated variables and two rows of headings. This was imported into SPSS for processing, retaining only the shorter and more useful headings as variable names (a sketch of this import appears after this list).
- All unwanted variables were dropped. These included various identifiers as listed above, data that had been intended only for admin purposes, and redundant web server variables.
- Text response fields, together with their associated category responses, were exported into Excel files for coding. The text fields were then dropped.
- After coding, numeric coded data variables were imported and merged with the main data file.
- The cleaned raw data file was saved as a plain text file with comma-separated variables (csv file). This file is the one used to build the dataset.
- CATSLife. There were two main raw data files, containing data collected on the Quodit and TestMyBrain servers respectively. Data from the TestMyBrain server were automatically copied to the Quodit server, enabling TEDS staff to download all data from the Quodit server.
- The raw Quodit data file was a tab-delimited file, with a conventional file structure of one row per twin and separate variables for item data. This was imported into SPSS, dropping unwanted identifiers and redundant fields, and saved with renamed variables for merging with the TMB data at a later stage.
- The raw TMB data file was a comma-delimited text file (csv), with four rows of data per twin (one per test). The item data for each test were encapsulated within long JSON strings. For the initial processing, this csv file was imported into a SQL Server database table.
- Within SQL Server, a sequence of processing stages was used to restructure the data: firstly, to separate the JSON strings into distinct variables (sometimes involving a multiplication of rows of data); secondly, to restructure the data into conventional tables (one table per test) with one row per twin; thirdly, to clean the data. Cleaning included the removal of practice items for each test; removal of timed-out items from the Vocabulary and Remembering Words tests (such items were then repeated until completed in a timely fashion); and removal of duplicated instances of the Digit Symbol Matching test (which could be repeated under rare circumstances). The data for each TMB test were then exported as csv files, with renamed variables.
- The four TMB test files were then imported from csv into SPSS and merged together. This was then merged with the Quodit data file. An additional file, from the admin system, was merged in order to translate the IDs used on the Quodit and TMB servers into conventional twin identifiers.
- The combined CATSLife data file, containing all the TMB and Spatial Spy data for every twin, was then saved as a csv file. This is the file used to build the dataset.
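For illustration, the initial SPSS import of the tab-delimited TEDS26 raw file might take the form below. This is a sketch only, not the actual script: the file path, variable names and formats are hypothetical placeholders for a much longer list. FIRSTCASE=3 skips both heading rows, so the shorter headings are re-supplied manually as variable names.

  * Illustrative import of the raw tab-delimited file (hypothetical names).
  GET DATA
    /TYPE=TXT
    /FILE='raw\teds26_raw.txt'
    /DELIMITERS='\t'
    /QUALIFIER='"'
    /ARRANGEMENT=DELIMITED
    /FIRSTCASE=3
    /VARIABLES=
      TwinID A12
      StartDate A20
      EndDate A20
      item1 F2.0
      item2 F2.0.
  EXECUTE.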
Exporting raw admin data
Exporting involves copying appropriate parts of the admin data from the Access database into csv files that can be read into SPSS. The process of exporting raw data is described in general terms in the data processing summary page. In the case of the 26 Year dataset, such admin data comprise only the return dates for paper versions of the TEDS26 questionnaire.
In the event that any changes are made in the raw data stored in the Access database, for example to improve data cleaning, then the data should be re-exported into new versions of the csv files. The data stored in the database tables are exported indirectly, by means of saved "queries" (or views), rather than directly from the tables themselves. A query selects appropriate columns from the relevant tables, excluding data not needed in the dataset. The queries also modify the format of the data values in some columns, so that they are saved in a format that can easily be read by SPSS; examples are date columns (changed to dd.mm.yyyy format) and true/false columns (changed to 1/0 values).
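To illustrate why these format changes help, a date column saved as dd.mm.yyyy can be read directly into SPSS with the EDATE format, and a true/false column saved as 1/0 reads as a plain numeric variable. The file and variable names in the sketch below are hypothetical.

  * Illustrative import of an exported admin csv file (hypothetical names).
  GET DATA
    /TYPE=TXT
    /FILE='raw\teds26_returns.csv'
    /DELIMITERS=','
    /FIRSTCASE=2
    /VARIABLES=
      FamilyID A10
      ReturnDate EDATE10
      Returned F1.0.
  EXECUTE.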
A convenient way of exporting these data files is to run a macro that has been saved for this purpose in the database. See the data files summary page and the 26 Year data files page for further information about the storage of the files mentioned above.
Processing by scripts
Having exported and prepared the raw data as above, a new version of the dataset is made by running the scripts described below. The scripts must be run strictly in sequence. To run each script, simply open it in SPSS, select all the text, and click on the Run icon.
Scripts 1a and 1b: Importing and merging raw data files
The main purpose of these scripts is to import raw data from text files into SPSS, then to carry out low-level variable formatting changes, ready for further processing. Script 1a imports and processes the raw TEDS26 data. Script 1b imports and processes the raw CATSLife data.
Script 1a, importing the TEDS26 data, carries out the following steps in order.
- Import into SPSS the cleaned raw data file of twin TEDS26 questionnaire data. This is a comma-delimited (csv) text file, as described above.
- Set the visible width and number of decimal places for each item variable.
- Sort by participant identifier TwinID.
- Give the item variables systematic names.
- Some measures contain QC items - recode each of these from a raw response into an error flag.
- Derive 'status' variables as a way of keeping track of the completion of each section or 'block' in the questionnaire. This is done by counting the responses to items in each section, and by looking at the missingness of particular variables at the start and end of each section.
- Delete any rows in which the status variables indicate that no meaningful data are present.
- Create a data flag to indicate the presence of TEDS26 data for each twin in the dataset so far.
- Derive time variables for each section, based on the raw start and end date-time items. These new variables measure the time spent on each section, and flag cases where a pause seems to have occurred (such that an exact time cannot be measured).
- In the TEDS26 raw data, missing responses are coded -99, "prefer not to answer" responses are coded -11, and "don't know" responses are coded -88; recode these values to missing (see the sketch after this list).
- Recode item variables consistently throughout the dataset, using coding conventions used in other TEDS datasets. For example, "yes/no" responses are coded 1/0, ordinal responses are coded 0/1/2, etc if the first response means no/none/never, or are coded 1/2/3, etc, otherwise.
- Where appropriate, recode wide-ranging numeric responses into suitable categories. In the MCTQ measure, recode raw responses into time measurements.
- More generally, check and clean all items to eliminate invalid responses or extreme outliers. A few variables containing negligible numbers of responses are dropped.
- Create reversed versions of items where these will be needed to derive scales in a later script.
- Set the variable level (nominal, ordinal or scale) for every item.
- Save this working dataset of TEDS26 data.
- Import into SPSS the csv file of raw admin data, containing TEDS26 paper questionnaire return dates.
- Double enter the return dates in this file as follows:
- Compute twin identifier ztempid for the elder twin by appending 1 to the FamilyID. Compute the Random variable. Save as the elder twin part of the family data.
- Re-compute ztempid for the younger twin by appending 2 to the FamilyID. Reverse the values of the Random variable. Swap over elder and younger twin values in any twin-specific variables in the family data (do this by renaming variables). Save as the younger twin part of the family data.
- Combine the elder and younger twin parts together by adding cases. Sort in ascending order of ztempid and save as the double entered family data file.
- Import into SPSS the raw admin data file containing all twin IDs and birth orders.
- Merge this with the working TEDS26 twin data file (above). The files are merged using TwinID as the key variable.
- Double enter the TEDS26 twin data flag, as follows (a sketch of this double-entry procedure appears after this list):
- Compute the alternative twin identifier ztempid as the FamilyID followed by the twin order (1 or 2).
- Change the name of the twin data flag variable by appending the suffix 1.
- Sort in ascending order of ztempid and save this file as the twin 1 part.
- Change the flag variable name by changing the ending from 1 to 2. Change the values of ztempid to match the co-twin (change the final digit from 1 to 2 or vice versa). Re-sort in ascending order of ztempid and save with just the renamed variables as the twin 2 part.
- Merge the twin 1 and twin 2 parts using ztempid as the key variable. The double entered data flags can now be used to select twin pairs having data.
- Save this file as the aggregated twin data file.
- Merge the prepared file of admin data (above), using ztempid as the key variable.
- Sort by ztempid and save this TEDS26 dataset for merging at a later stage.
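Two of the steps above can be sketched in SPSS syntax: the recoding of the special missing codes, and the double entry of a twin data flag. This is a minimal illustration rather than the actual script; the item range item1 TO item40, the flag name dataflag and the file paths are hypothetical, while ztempid and the flipped final digit follow the description above.

  * Recode web-server missing codes to system-missing (hypothetical item range).
  RECODE item1 TO item40 (-99, -11, -88 = SYSMIS).
  EXECUTE.

  * Twin 1 part: suffix the flag with 1 and sort by ztempid.
  RENAME VARIABLES (dataflag = dataflag1).
  SORT CASES BY ztempid (A).
  SAVE OUTFILE='working\twin1part.sav'.

  * Twin 2 part: change the suffix to 2 and point ztempid at the co-twin
  * by flipping its final digit (1 becomes 2 and vice versa).
  RENAME VARIABLES (dataflag1 = dataflag2).
  COMPUTE ztempid = ztempid + 3 - 2 * MOD(ztempid, 10).
  SORT CASES BY ztempid (A).
  SAVE OUTFILE='working\twin2part.sav' /KEEP=ztempid dataflag2.

  * Merge the parts: each twin's row now also carries the co-twin's flag.
  GET FILE='working\twin1part.sav'.
  MATCH FILES /FILE=* /FILE='working\twin2part.sav' /BY ztempid.
  EXECUTE.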
Script 1b, importing the CATSLife data, carries out the following steps in order.
- Import into SPSS the cleaned raw data file of twin CATSLife data. This is a comma-delimited (csv) text file.
- Give the variables systematic names. Set the visible width, variable level and number of decimal places for each item variable.
- Recode raw platform-related variables (browser, device type, etc) from strings into more useful numeric categories (see the sketch after this list).
- Recode the raw Spatial Spy mission/attempt items into more useful and systematic response items, using similar conventions to those used in the very similar 18 Year Navigation data.
- Derive data flags and status flags for each Spatial Spy mission. Identify instances of missions during which the browser was apparently closed (resulting in data loss) and flag these with the status flag.
- Remove data from incomplete Spatial Spy missions.
- Derive overall duration items, and average item response times, for Spatial Spy and for TMB tests.
- Recode the feedback items, from the end of the CATSLife battery.
- Set variable levels and visible variable widths for all variables.
- Convert twin identifiers to ztempid, as used in the TEDS26 data.
- Double enter the twin CATSLife data flags, using the same procedure as for the TEDS26 data flags.
- Save the dataset for the next stage of processing.
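As an illustration of the platform recoding step, raw browser strings might be collapsed into numeric categories as below. The variable names, matching rules and category values are all hypothetical; the actual categories in the dataset may differ.

  * Hypothetical recode of a raw browser string into numeric categories.
  STRING browserU (A40).
  COMPUTE browserU = UPCASE(browser).
  DO IF INDEX(browserU, 'CHROME') > 0.
    COMPUTE browser_cat = 1.
  ELSE IF INDEX(browserU, 'FIREFOX') > 0.
    COMPUTE browser_cat = 2.
  ELSE IF INDEX(browserU, 'SAFARI') > 0.
    COMPUTE browser_cat = 3.
  ELSE.
    COMPUTE browser_cat = 9.
  END IF.
  EXECUTE.
  VALUE LABELS browser_cat 1 'Chrome' 2 'Firefox' 3 'Safari' 9 'Other'.

If the raw string is a full user-agent rather than a parsed browser name, the order of the tests matters: Chrome user agents usually also contain 'Safari', so Chrome must be matched first.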
Script 2: QC checking and cleaning
The purpose of this script (filename Z2_clean.sps) is to clean the data, as far as is possible. Cases of apparently careless or random responding in the TEDS26 twin questionnaire are identified, based on responses to QC items and item response times. In TEDS26, the response time for each section containing a QC item was recorded directly on the web server by timers embedded in the web pages. The script carries out these tasks in order:
- Open the two data files saved at the end of the previous scripts (1a and 1b) and merge them together on ztempid.
- Remove any twin pairs in which neither twin has meaningful TEDS26 or CATSLife data, using information from the status and data flags.
- Scramble the family and twin IDs; the algorithm for scrambling IDs is outlined on another page.
- Sort in ascending order of scrambled twin ID id_twin.
- Save the file and drop the raw ID variables.
- Merge in essential background variables, from a separate reference dataset, using id_twin as the key variable. These include twin birth dates (for deriving ages), 1st Contact reference variables, twin sexes and zygosities, medical exclusions and overall exclusion variables, all of which are already double entered where appropriate.
- Identify probable random responders in the TEDS26 twin data. Start by searching within each of the 13 sections (blocks) containing a QC item and a timer, as follows (a sketch of this check appears after this list).
- Measure the mean item response time by dividing the time spent answering the questions by the number of questions actually answered.
- Label the twin as a probable random responder if (a) the QC item was answered incorrectly and (b) the mean item answer time within the measure was at the low extreme of the distribution. The latter was defined using a cut-off at roughly the 10th percentile of the distribution.
- Wherever such a case is found, recode the block's status variable to value 4 to flag random responding; and recode the item variables for the entire measure to missing to prevent their use in analysis.
- Exclude the entire TEDS26 questionnaire if either (a) two or more blocks are excluded using the rules above, i.e. a QC error with fast item responses; or (b) QC errors occurred in 4 or more blocks, regardless of times.
- Where such an exclusion is made for the entire questionnaire, recode the overall TEDS26 status variable to value 4, to flag a random responder; and recode the item variables to missing for the entire questionnaire, to prevent use in analysis.
- In the CATSLife Spatial Spy data, detect and exclude any twin record that was corrupted by data loss. This could result either from software crashes or from cases where a twin closed the browser multiple times. Where detected, flag with value 3 in the Spatial Spy status flag.
- Additionally, detect cases of compromised data in Spatial Spy resulting from the apparent use of a mobile device with a small screen; these cases were rare, but were generally accompanied by low mission scores. Where detected, set the Spatial Spy status flag value to 3.
- In all cases with status flag value 3, delete the Spatial Spy data.
- Detect instances of random responding or disengaged twin activity in the CATSLife TMB tests, as follows. Where detected, set the status flag for the test to value 4.
- In Digit Symbol Matching, disengaged twins were detected in a number of ways: extremely low, outlying total scores; implausibly fast reaction times (low median item response time); highly variable item response times; and uniform responding, with many repeated presses on the same key.
- In Vocabulary and Remembering Words, probable random responding was detected in these ways: very rapid responding always associated with low total scores; and uniform responding, again where coupled with low total scores.
- In Remembering Numbers, because of the discontinue rule, disengaged twins could only be detected among failures in the first two items: firstly, those cases with outlying high response times (indicating a lack of attention); and secondly, apparently random responses involving digits other than either of the correct ones, in the first two items.
- In all cases with status flag value 4, delete the TMB data for the relevant test. In rare cases where a twin has excluded data in two or more of the four tests, set the overall TMB status flag to value 4 and delete all the TMB test item data.
- Complete any outstanding raw data recoding that was postponed until after this processing.
- Save a working SPSS data file ready for the next script.
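The random-responder check described above might be sketched as follows, for a single block. All variable names and the cut-off value are hypothetical: qc_err is assumed to flag an incorrect QC response, blocktime the seconds spent in the block, nanswered the number of items answered, and item1 TO item10 the block's items.

  * Mean item response time for the block (guard against empty blocks).
  IF (nanswered > 0) meanresp = blocktime / nanswered.
  * Fast responding: below a cut-off at roughly the 10th percentile of
  * the distribution (the value 1.2 seconds is purely illustrative).
  COMPUTE fastresp = (meanresp < 1.2).
  * A failed QC item combined with fast responding marks a probable
  * random responder: set the block status to 4 and blank the items.
  DO IF (qc_err = 1 AND fastresp = 1).
    COMPUTE blockstatus = 4.
    RECODE item1 TO item10 (ELSE = SYSMIS).
  END IF.
  EXECUTE.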
Script 3: Derive new variables
The purpose of this script (filename Z3_derive.sps) is to add scales and other derived variables. The script carries out these tasks in order:
- Open the data file saved at the end of the previous script.
- Derive variables for individual twin ages when each data collection was carried out, based on start dates in electronic data or return dates for paper booklets (see the sketch after this list).
- Add scales and composites for TEDS26 questionnaire measures.
- Derive symptom-based diagnoses for mental health conditions, based on responses in the respective measures. Such diagnoses include mania (MDQ), depression (CIDID), anxiety (CIDIA), specific phobia, social phobia, panic disorder, agoraphobia, and eating disorders.
- Derive Spatial Spy scores: accuracy scores, speed scores and total scores; firstly at the level of individual missions, secondly for each mission type, and thirdly for the Spatial Spy battery as a whole.
- Derive alternative total scores for TMB test activities.
- Drop any temporary variables that have been used to derive the new variables. Date variables are dropped at this point, having been used to derive ages.
- Save a working SPSS data file ready for the next script.
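The age derivation in the first step might look like the sketch below, assuming the dates are held as SPSS date values; the variable names are hypothetical.

  * Age in years at the start of the web questionnaire (hypothetical names).
  COMPUTE atwqage = DATEDIFF(startdate, twinbirthdate, 'days') / 365.25.
  FORMATS atwqage (F6.2).
  EXECUTE.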
Script 4: Label the variables
The purpose of this script (filename Z4_label.sps) is simply to add variable labels and value labels to the variables in the dataset. The script carries out these tasks in order:
- Open the data file saved at the end of the previous script.
- Label all variables.
- Add value labels to categorical variables, where appropriate (generally, for any numeric categorical variable having 3 or more categories); see the sketch after this list.
- Save a working SPSS data file ready for the next script.
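For illustration, the labelling commands take the form below; the variable names, labels and values here are hypothetical.

  * Hypothetical variable and value labels.
  VARIABLE LABELS
    browser_cat 'CATSLife: browser used by the twin'
    atwqage 'Twin age at start of TEDS26 web questionnaire'.
  VALUE LABELS browser_cat
    1 'Chrome'
    2 'Firefox'
    3 'Safari'
    9 'Other'.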
Script 5: Double entering the data
The purpose of this script (filename Z5_double.sps) is to double-enter all the twin-specific data in the dataset. Note that a few variables (twin-specific variables from family-level data and twin data flags) are already correctly double entered at this stage (this was achieved in script 1). The script carries out the following tasks in order:
- Open the data file saved at the end of the previous script.
- Create the twin 1 part: rename all twin-specific item and derived variables by adding 1 to the end of the name, then save the dataset.
- Create the twin 2 part (for the co-twin) as follows:
- Rename the appropriate item and derived variables by changing the suffix from 1 to 2.
- Modify the id_twin values so they will match the co-twin (change the final digit from 1 to 2 or vice versa).
- Re-sort in ascending order of id_twin and save as the twin 2 part, keeping only the renamed variables.
- Re-open the twin 1 part.
- Merge in the twin 2 part, using id_twin as the key variable. The dataset is now double entered.
- Place the dataset variables into a logical and systematic order (do this using a KEEP statement when saving the dataset, as sketched below).
- Save an SPSS data file (filename z5double in the \working files\ subdirectory).
- Save another copy as the full 26 Year dataset, with filename Zdb9456.
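A minimal sketch of the final reordering and saves follows. Apart from id_twin and the filenames given above, the variable names are hypothetical placeholders; the ALL keyword keeps the remaining variables in their existing order.

  * Reorder variables with a KEEP statement while saving.
  SAVE OUTFILE='working files\z5double.sav'
    /KEEP=id_fam id_twin random sex1 sex2 zygos
          atwqage1 atwqage2 ALL.
  * Save another copy as the full 26 Year dataset.
  SAVE OUTFILE='Zdb9456.sav'.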