TEDS Data Dictionary

Processing the 21 Year Data

Contents of this page:

  Introduction
  Exporting raw paper booklet and admin data
  Processing by scripts

Introduction

This page describes how the 21 Year analysis dataset is created. The starting point is the raw data, some of which has previously been cleaned and aggregated (the paper booklet data from TEDS 21) but some of which is still in its rawest state (electronic data from the CMS and web systems). There are four main sources of data for the 21 Year analysis dataset:

  1. The Access database file called 21yr.accdb. This provides the following:
    1. Twin paper booklet data (TEDS 21 phases 1 and 2)
    2. Parent paper booklet (TEDS 21 phase 1)
    3. General study admin data, such as booklet return dates
    4. Specifically for the g-game and covid studies, a list of twins matching their study logins to their IDs
    5. Recoded 'medical' data. This is a table in which verbatim text responses, in the TEDS 21 twin phase 2 Medical Conditions and Self Harm measures, have been cleaned up and recoded to numeric categories. This table combines responses from these measures in all three formats of the questionnaire (paper, app, web).
  2. Raw TEDS 21 data files downloaded from the CMS system, containing data collected in the app. Each data collection (parent and twin, phase 1 and phase 2) provided a separate file.
  3. Raw data files downloaded from the "backup" web system for TEDS 21, and from the web server for the g-game and covid studies. Each data collection (TEDS 21 parent and twin, phase 1 and phase 2; g-game; covid phases 1, 2, etc) provided two files: a main data file and an admin data file.
  4. Background data: twin sexes, zygosity variables, twin birth dates, medical exclusion and overall exclusion variables. Some of these variables are from the 1st Contact dataset, but the source of most of the background variables is the TEDS admin database, where they may occasionally be updated. Rather than exporting the source variables from the admin database and importing them during the creation of every dataset, they are exported once into a reference dataset containing all the background variables. This reference dataset is used here to add the background variables, ready-made.

These files were repeatedly replaced and/or updated during data collection. Once each data collection ended the respective CMS and web systems were closed down, after which the final versions of the raw data files from these sources remained unchanged. Late returns of TEDS 21 paper booklets continued for some time after the official end of data collection, so the Access database and admin records continued to be updated. Data from these sources must be "exported" from database tables into delimited text files that can be used by SPSS in dataset construction, and this process of exporting must be repeated for new versions of the dataset.

Other than exporting updated files from the databases, most steps in building the dataset are carried out by SPSS scripts (syntax files). These scripts carry out a long sequence of steps that are described in outline below. The end result is the latest version of the dataset, which is an SPSS data file.

General issues involved in creating TEDS datasets are described in the data processing summary page. The raw 21 year data files are described in more detail on another page.

Exporting raw paper booklet and admin data

Exporting involves copying the cleaned and aggregated raw TEDS 21 paper booklet data from the Access database tables where they are stored, into csv files that can be read into SPSS. The process of exporting raw data is described in general terms in the data processing summary page.

The TEDS 21 paper booklet data and the administrative data, stored in the Access 21yr.accdb database file, are subject to occasional changes, even after the end of the study. These changes are sometimes due to late returns of data, and may sometimes be due to data cleaning or data restructuring changes. These data should therefore be re-exported before a new version of the dataset is created from the SPSS scripts. The data stored in the database tables are in some cases exported indirectly, by means of saved "queries" (or views), rather than directly from the tables themselves. A query selects appropriate columns from the relevant tables, excluding inappropriate data such as verbatim text fields. The queries also modify the format of the data values in some columns, so that they are saved in a format that can easily be read by SPSS; examples are date columns (changed to dd.mm.yyyy format) and true/false columns (changed to 1/0 values). The Access queries and tables used to export the data, and the resulting csv files, are listed below.

Query or table name | Source of data | Database table(s) involved | Exported file name
TwinPhase1Part1, TwinPhase1Part2 | TEDS 21 twin phase 1 paper questionnaires | TwinPhase1Part1, TwinPhase1Part2 | TwinPhase1Part1.csv, TwinPhase1Part2.csv
TwinPhase2Part1, TwinPhase2Part2 | TEDS 21 twin phase 2 paper questionnaires | TwinPhase2Part1, TwinPhase2Part2 | TwinPhase2Part1.csv, TwinPhase2Part2.csv
ExportParent | TEDS 21 parent phase 1 paper questionnaires | ParentPhase1 | ParentPhase1.csv
ExportTEDS21admin | Return dates for paper questionnaires | TEDS21progress | TEDS21admin.csv
ExportMedicalConditions | Cleaned and recoded verbatim text responses from TEDS 21 twin phase 2 questionnaires (Medical Conditions and Self Harm measures), in all versions, not just paper | TwinPhase2MedicalRecoding | TwinPhase2Medical.csv
ExportGgameAndCovidUsernames | A list of twin IDs with corresponding usernames for the g-game and covid studies | TEDS21progress | GgameAndCovidUsernames.csv

A convenient way of exporting these data files is to run the saved macro called Export data in the 21yr.accdb Access database. See the data files summary page and the 21 Year data files page for further information about the storage of the files mentioned above.
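
As an illustration of how these exported files are later read into SPSS, the minimal sketch below reads the admin csv using a GET DATA command. The variable names utRetDate and utBookletIn are invented here for illustration and are not the names used in the real scripts; the point is that the dd.mm.yyyy dates produced by the export queries can be read with the EDATE10 input format, and the true/false columns exported as 1/0 are read as plain numeric fields.

  * Sketch only: reading an exported csv (hypothetical variable names).
  GET DATA
    /TYPE=TXT
    /FILE='TEDS21admin.csv'
    /DELIMITERS=','
    /QUALIFIER='"'
    /FIRSTCASE=2
    /VARIABLES=
      FamilyID    F7.0
      utRetDate   EDATE10
      utBookletIn F1.0 .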

Processing by scripts

Having exported and prepared the raw data as above, a new version of the dataset is made by running the scripts described below. The scripts must be run strictly in sequence. To run each script, simply open it in SPSS, select all the text, and click on the Run icon.

Scripts 1a/b/c/d/e: Importing raw data files

The main purpose of each of these five scripts is firstly to import raw data from text files into SPSS; and secondly, for TEDS 21, to ensure that equivalent variables from the three sources (CMS, web, paper) have compatible properties, including variable names, types and value codes. This is essential before the sources are combined in script 2.

The scripts also derive some new admin-based variables, but only where these have to be derived differently according to the data source.

The scripts are named U1a_import_CMS.sps, U1b_import_backup.sps, U1c_import_paper.sps, U1d_import_g_game.sps and U1e_import_covid.sps; they deal with raw data from the CMS (TEDS 21), backup web (TEDS 21), paper booklet (TEDS 21), g-game and covid (all phases) data collections respectively. The processing has been separated into multiple scripts because of its length, to make the scripts easier to use.

Each of the scripts carries out the following steps in order, where appropriate. For TEDS 21 data, each script (1a, 1b, 1c) carries out these actions separately for each data collection (parent and twins, phase 1 and phase 2). For the covid study, the script (1e) carries out the actions separately for each phase of the study (phase 1, phase 2, phase 3).

  1. Open the raw text file(s) into SPSS:
    1. For the CMS, there is a single data file per data collection. This is a text file in which variables are delimited by the pipe symbol (|).
    2. For web data (TEDS 21 web backup, g-game, each phase of the covid study), there are two data files per data collection: an "admin" data file and a "main" data file. Each is a tab-delimited text file.
    3. For TEDS 21 parent paper booklet data, there is a single text file (comma-delimited).
    4. For TEDS 21 twin paper booklet data, there are two text files (comma-delimited).
  2. Sort by participant identifier (FamilyID for parent data, TwinID for TEDS 21 twin data, partNum (twin username) for g-game and covid study data).
  3. Where there are two data files for the data collection, merge the variables together into a single file, using the participant identifier as the key variable.
  4. Delete any rows of data that were created by test users (not TEDS participants).
  5. Delete any rows of data that do not contain any data (the CMS and web data files contain a row for every potential participant prior to data collection).
  6. Name or rename each variable. For a given data collection, the equivalent variable must have the same name in the different source data files.
  7. Set the visible width and number of decimal places for each item variable.
  8. In the paper booklet raw data, 'missing' and 'not applicable' responses are coded -99 and -77 respectively; recode these values to missing (illustrated in the sketch after this list).
  9. If necessary, recode categorical variables so that the coding is consistent with equivalent variables from the other sources in the same data collection.
  10. If a variable has reversed coding in one source but not in another, and the reversed variable will be needed for creating scales, retain the variable in its original state and add a reversed/un-reversed version as appropriate. Hence each source file will provide the same reversed and un-reversed versions of each such variable.
  11. For TEDS 21, create a variable denoting the source of data (CMS app, CMS web, backup web, paper).
  12. For TEDS 21, derive variables containing a count of the number of items completed in each section/theme of the given data collection.
  13. Derive status variables for each section/theme of the given data collection. In the TEDS 21 CMS and backup files, these are derived differently from different types of status variables present in the raw data, alongside the counts of items completed; in the paper data file, the status variables are derived solely from the counts of items completed. In the g-game and covid study data, such status variables are already present in the raw data.
  14. Delete any row of data that is shown by the status variables not to contain meaningful amounts of data.
  15. For the CMS and web data files (including g-game and covid study as well as TEDS 21, but not for the paper data files), derive variables for the time spent completing each section/theme for the given data collection. In the CMS and the g-game, these are derived from item response times. In the TEDS 21 backup and covid web data, these are derived from theme start and end date-times.
  16. For CMS and web data files, convert string variables relating to platform (operating system, device type, screen size) into numeric categories if possible. These variables are not present for the TEDS 21 paper booklet data.
  17. For TEDS 21, if a variable is not present in the file from the given source, but is present in the data files from other sources, create an empty variable with the same name and data type.
  18. Drop variables that are not to be retained in the dataset. In the CMS and web data files, these include redundant string variables; admin-related string data containing participant contact details; strings that have been converted to numeric categories; and other variables if not present in all source data files and not thought to be useful.
  19. Save a working SPSS data file ready for processing by the next script.
  20. For the g-game and covid study data files, the following additional steps are needed:
    1. Open (in SPSS) the admin file containing twin IDs and usernames from the g-game and covid data collections.
    2. Sort in ascending order of the appropriate username for the study whose data are being processed. Save.
    3. Merge this file with the file of data processed so far for the relevant study, using partNum (twin username) as the key variable for merging.
    4. Sort in ascending order of TwinID, so it can be merged later with other files containing this ID.
    5. Save, dropping the partNum variable.
    6. For the covid study, there is now a data file for each phase of the study; merge these into a single covid study data file, using TwinID as the key variable; then save the file.
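
To give a flavour of what these scripts contain, the fragment below sketches a few of the steps above (items 1, 2, 8, 12, 13 and 19) for a paper booklet file containing just three hypothetical items. The item and working file variable names are invented for illustration; the real scripts handle far more variables per data collection.

  * Sketch only: hypothetical item and file names.
  * Item 1: open a comma-delimited paper booklet raw data file.
  GET DATA
    /TYPE=TXT
    /FILE='TwinPhase1Part1.csv'
    /DELIMITERS=','
    /FIRSTCASE=2
    /VARIABLES= TwinID F7.0  utitem01 F3.0  utitem02 F3.0  utitem03 F3.0 .

  * Item 2: sort by the participant identifier.
  SORT CASES BY TwinID (A).

  * Item 8: recode the -99 (missing) and -77 (not applicable) codes to missing.
  RECODE utitem01 utitem02 utitem03 (-99=SYSMIS) (-77=SYSMIS).

  * Item 12: count of items completed in this section/theme.
  COMPUTE utthemeitems = NVALID(utitem01, utitem02, utitem03).

  * Item 13: status variable derived from the count of items completed.
  COMPUTE utthemestatus = (utthemeitems > 0).

  * Item 19: save a working SPSS data file for the next script.
  SAVE OUTFILE='working files\paper_twin_phase1.sav'.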

Script 2: Merging and recoding raw data files

The main purpose of this script (filename U2_merge.sps) is to merge various raw data files together, so as to create an initial dataset with one row of data per twin. The raw data files include those created in the previous set of scripts, plus admin-related data from other sources. The script also carries out some low-level processing of the raw data, such as recoding and renaming some raw item variables, and setting variable formats. The script carries out the following tasks in order:

  1. The 21 Year study involved 5 data collections (TEDS 21 phase 1 parent, TEDS 21 phase 1 twin, TEDS 21 phase 2 twin, g-game twin, covid phase 1 twin). For each data collection in turn, carry out the following actions:
    1. For TEDS 21 data collections, combine data from the three sources (CMS, backup, paper) by merging the cases in the three relevant files prepared in the earlier scripts.
    2. In these TEDS 21 cases, identify duplicates, where the same participant has provided data in two or more different ways in the same data collection. Delete duplicates, generally prioritising more complete rows of data, so that only one row of data remains for each participant.
    3. Add a flag variable showing the presence of data in the given data collection.
    4. Recode item variables where appropriate, from the raw data coding to the final value coding to be used in the dataset.
    5. For QC items in the TEDS 21 twin data, recode the item responses into new variables containing error scores.
    6. Add reversed versions of item variables, where these will be needed for creating scales later (if the reversed items do not already exist in the raw data).
    7. Set variable levels (nominal, ordinal or scale) for all items.
    8. Sort by participant identifier (TwinID for twin data, FamilyID for parent data).
    9. Save the dataset for further processing below.
  2. There are 3 raw csv data files of admin and other data: twin IDs and birth orders; TEDS21 study admin data (booklet return dates); and cleaned and recoded TEDS21 responses for medical conditions and self-harm. For each of these files in turn, carry out the following actions where necessary:
    1. Import the csv file into SPSS.
    2. Carry out basic recoding of categorical variables where necessary.
    3. Set variable formats including numeric width and decimal places displayed, and variable level (nominal/ordinal/scale).
    4. Rename variables.
    5. Sort by participant identifier (TwinID or FamilyID as appropriate), and save for merging later.
  3. There are now 6 files of twin-level data: the admin file of twin IDs and birth orders; the TEDS 21 phase 1 questionnaire data file; the TEDS 21 phase 2 questionnaire data file; the file containing recoded TEDS 21 medical condition responses; the g-game data file; and the covid study data file. Merge these together into a single file, using TwinID as the key variable (see the sketch after this list).
  4. Double enter the main twin data flags (for each data collection), as follows:
    1. Compute the alternative twin identifier utempid2 as the FamilyID followed by the twin order (1 or 2).
    2. Change the names of the twin data flag variables by appending the suffix 1.
    3. Sort in ascending order of utempid2 and save this file as the twin 1 part.
    4. Change the flag variable names by changing the ending from 1 to 2. Change the values of utempid2 to match the co-twin (change the final digit from 1 to 2 or vice versa). Re-sort in ascending order of utempid2 and save with just the renamed variables as the twin 2 part.
    5. Merge the twin 1 and twin 2 parts using utempid2 as the key variable. The double entered data flags can now be used to select twin pairs having data.
    6. Save this file as the aggregated twin data file.
  5. At this stage, there are 2 files of parent- or family-level data: the TEDS 21 parent phase 1 questionnaire data and the TEDS21 admin data file. Merge these together into a single file, using FamilyID as the key variable.
  6. Double enter twin-specific items in the family-based data as follows:
    1. Compute twin identifier utempid2 for the elder twin by appending 1 to the FamilyID. Compute the Random variable. Save as the elder twin part of the family data.
    2. Re-compute utempid2 for the younger twin by appending 2 to the FamilyID. Reverse the values of the Random variable. Swap over elder and younger twin values in any twin-specific variables in the family data (do this by renaming variables). Save as the younger twin part of the family data.
    3. Combine the elder and younger twin parts together by adding cases. Sort in ascending order of utempid2 and save as the double entered family data file.
  7. Merge the aggregated twin data file with the double entered family data file, using utempid2 as the key variable. This dataset now contains all the raw data.
  8. Use the parent data flag and the double entered twin data flags to filter the dataset and delete any cases without any 21 Year data.
  9. Add a flag variable uteds21data to indicate the presence of any 21 year data for each twin pair (from parents and/or twins, in any data collection).
  10. Recode all data flag variables from missing to 0.
  11. Anonymise the family and twin IDs; the algorithm for scrambling IDs is described on another page.
  12. Sort in ascending order of scrambled twin ID id_twin.
  13. Save the file and drop the raw ID variables.
  14. Merge in essential background variables, from a separate reference dataset, using id_twin as the key variable. These include twin birth dates (for deriving ages), 1st Contact reference variables, twin sexes and zygosities, medical exclusions and overall exclusion variables, all of which are already double entered where appropriate.
  15. Use the data flags to filter the dataset and delete any cases without any 21 Year data.
  16. Save a working SPSS data file ready for the next script.
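
A compressed sketch of the central merging steps (items 3, 9 and 10 above) is shown below. The working file names and the data flag names (other than uteds21data) are invented for illustration, and the intervening double-entry and family-data steps are omitted.

  * Sketch only: item 3, merge the twin-level files by TwinID.
  MATCH FILES
    /FILE='working files\twin_ids.sav'
    /FILE='working files\teds21_twin_phase1.sav'
    /FILE='working files\teds21_twin_phase2.sav'
    /FILE='working files\teds21_medical.sav'
    /FILE='working files\ggame_twin.sav'
    /FILE='working files\covid_twin.sav'
    /BY TwinID.

  * Item 9: flag the presence of any 21 Year data for the twin pair.
  COMPUTE uteds21data = ANY(1, uparentdata, utwindata1, utwindata2).

  * Item 10: recode the data flag variables from missing to 0.
  RECODE uteds21data uparentdata utwindata1 utwindata2 (SYSMIS=0).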

Script 3: Clean and correct data

The purpose of this script (filename U3_clean.sps) is to clean the data, as far as is possible. Cases of apparently careless or random responses are identified, in the TEDS 21 twin questionnaires and in the twin g-game, based on responses to QC items, response times, patterns of uniform responding and (for the g-game) patterns of low scoring. In TEDS 21, response time for each theme was computed as the time interval between the recorded start and end of the theme. In the g-game, instead, the mean item response time was used. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Identify patterns of uniform responding in the immediate vicinity of each QC item (same response in the QC item and in at least 3 of the 4 surrounding items). Exclude data from the theme/section if such a pattern is found, together with a QC error, in any measure within the theme.
  3. Identify patterns of rapid responding in each theme/section (mean response time per item answered is in the lowest 20% of the distribution). Exclude data from the theme if such a pattern is found, together with a QC error occurring in any measure within the theme; for g-game sub-tests, an additional criterion of low sub-test score was used (see the sketch after this list).
  4. In each g-game sub-test, additionally exclude twins with extremes of rapid responding (roughly, below the 0.2%-ile of the distribution).
  5. Exclude across the entire questionnaire if two or more themes are excluded using the rules above (do this independently for phase 1 and phase 2 of TEDS21). Likewise in the g-game, exclude across the entire battery of 5 sub-tests if two or more sub-tests are excluded using the rules above.
  6. Where exclusions have been made, recode the relevant item data (within the theme, or across the entire questionnaire) to missing to ensure consistent use in analysis.
  7. Save a working SPSS data file ready for the next script.
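
For illustration, the rapid-responding rule (items 3 and 6 above) might look roughly like the sketch below for a single theme. The mean response time, QC error and item variable names are invented assumptions; the real script applies the rule theme by theme and questionnaire by questionnaire.

  * Sketch only: flag mean item response times in the lowest 20% (quintile 1).
  RANK VARIABLES=utthemetime (A)
    /NTILES(5) INTO utthemetimegrp.

  * Exclude the theme where rapid responding coincides with a QC error,
  * by recoding the theme's items to missing.
  DO IF (utthemetimegrp = 1 AND utthemeqcerror > 0).
    RECODE utitem01 utitem02 utitem03 (ELSE=SYSMIS).
  END IF.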

Script 4: Derive new variables

The purpose of this script (filename U4_derive.sps) is to add derived variables, including scales, time- and age-related variables, and zygosity and exclusion variables; a brief sketch follows the list below. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Derive variables for individual twin ages when each data collection was carried out, based on start dates in electronic data or return dates for paper booklets.
  3. Add scales for questionnaire measures (TEDS 21 and covid).
  4. Add scores for cognitive measures (g-game).
  5. Drop any temporary variables that have been used to derive the new variables. Date variables are dropped at this point, having been used to derive ages.
  6. Save a working SPSS data file ready for the next script.
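
For example, an age variable and a simple questionnaire scale might be derived along the following lines. The date, item and scale names are purely illustrative, as are the use of 365.25 days per year and the minimum of three valid items required by MEAN.3.

  * Sketch only: twin age in years at questionnaire return.
  COMPUTE utage = DATEDIFF(utretdate, ubdate, "days") / 365.25.

  * Sketch only: a mean-based scale requiring at least 3 valid items.
  COMPUTE utscale = MEAN.3(utitem01, utitem02, utitem03, utitem04).

  * Item 5: drop the date variables once they have been used to derive ages.
  DELETE VARIABLES utretdate ubdate.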

Script 5: Label the variables

The purpose of this script (filename U5_label.sps) is simply to add variable labels and value labels to the variables in the dataset; a brief sketch follows the list below. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Label all variables.
  3. Add value labels to categorical variables, where appropriate (generally, for any numeric categorical variable having 3 or more categories).
  4. Save a working SPSS data file ready for the next script.
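
A minimal sketch of the labelling commands is shown below. The variable names and the numeric codes for the data source variable are illustrative assumptions only; the source categories themselves are those listed in script 1, item 11.

  * Sketch only: variable labels and value labels (hypothetical names and codes).
  VARIABLE LABELS
    utage    'Twin age at TEDS 21 phase 1 (years)'
    utscale  'TEDS 21 example scale score'
    utsource 'Source of TEDS 21 twin phase 1 data'.
  VALUE LABELS utsource
    1 'CMS app'  2 'CMS web'  3 'Backup web'  4 'Paper'.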

Script 6: Double entering the data

The purpose of this script (filename U6_double.sps) is to double-enter all the twin-specific data in the dataset; a sketch of the procedure follows the list below. Note that a few variables (twin-specific variables from family-level data and twin data flags) are already correctly double-entered at this stage (this was achieved in script 2). The script carries out the following tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Create the twin 1 part: rename all twin-specific item and derived variables by adding 1 to the end of the name, then save the dataset.
  3. Create the twin 2 part (for the co-twin) as follows:
    1. Rename the appropriate item and derived variables by changing the suffix from 1 to 2.
    2. Modify the id_twin values so they will match the co-twin (change the final digit from 1 to 2 or vice versa).
    3. Re-sort in ascending order of id_twin and save as the twin 2 part, keeping only the renamed variables.
  4. Re-open the twin 1 part.
  5. Merge in the twin 2 part, using id_twin as the key variable. The dataset is now double entered.
  6. Place the dataset variables into a logical and systematic order (do this using a KEEP statement when saving the dataset).
  7. Save an SPSS data file (filename u6double in the \working files\ subdirectory).
  8. Save another copy as the full 21 Year dataset, with filename Udb9456_full.
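
In outline, the double-entry procedure corresponds to SPSS syntax of the kind sketched below, shown here for a single illustrative variable (utscale). The input file name and the twin 1/twin 2 working file names are invented assumptions, whereas u6double is the working file name mentioned above; the real script renames and merges a large number of twin-specific variables in the same way.

  * Sketch only: double entry for one illustrative variable.
  * Item 2: create the twin 1 part by appending the suffix 1.
  GET FILE='working files\u5labelled.sav'.
  RENAME VARIABLES (utscale = utscale1).
  SAVE OUTFILE='working files\u6twin1.sav'.

  * Item 3: create the twin 2 part; change the suffix to 2 and flip the
  * final digit of id_twin (assumed numeric) so each row matches the co-twin.
  RENAME VARIABLES (utscale1 = utscale2).
  DO IF (MOD(id_twin, 10) = 1).
    COMPUTE id_twin = id_twin + 1.
  ELSE.
    COMPUTE id_twin = id_twin - 1.
  END IF.
  SORT CASES BY id_twin (A).
  SAVE OUTFILE='working files\u6twin2.sav' /KEEP=id_twin utscale2.

  * Items 4 and 5: merge the twin 2 part back into the twin 1 part.
  GET FILE='working files\u6twin1.sav'.
  MATCH FILES /FILE=* /FILE='working files\u6twin2.sav' /BY id_twin.
  SAVE OUTFILE='working files\u6double.sav'.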