Exporting raw data
Exporting involves copying the cleaned and aggregated raw data from the Access database in which they are stored into csv files that can be read into SPSS. The process of exporting raw data is described in general terms on the data processing summary page.
Over time, the study booklet data, stored in the Access 4yr.accdb database file, have been subject to occasional changes, even after the end of data collection. In earlier years, changes were caused by late returns of booklets; more recently, they have occasionally been caused by data cleaning or restructuring. If such changes have been made, the data should be re-exported before a new version of the dataset is created using the SPSS scripts.
The data stored in the database tables are exported indirectly, by means of saved "queries" (or views), rather than directly from the tables themselves. Each query selects appropriate columns from the relevant tables, excluding inappropriate data such as verbatim text fields, and modifying the format of variables where appropriate. The queries used to export the data are as follows:
Query name | Source of data | Database table(s) involved | Exported file name |
---|---|---|---|
Export Adult1, Export Adult2, Export Adult3 | parent booklets | Adult1, Adult2, Adult3 | adult1.csv, adult2.csv, adult3.csv |
Export Child1, Export Child2, Export Child3 | twin booklets | Child1, Child2, Child3 | child1.csv, child2.csv, child3.csv |
Export ReturnDates | booklet return dates | ReturnDates_12dec01 | return_dates.csv |
A convenient way of exporting these data files is to run a macro saved in the Access database. See the data files summary page and the 4 Year data files page for further information about the storage of these files.
Processing by scripts
Having exported the raw data as above, a new version of the dataset is made by running the scripts described below. The scripts must be run strictly in sequence. To run each script, open it in SPSS, select all the text, and click the Run icon.
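Alternatively, the whole sequence can be driven from a short runner script using SPSS's INSERT command. A minimal sketch, assuming all four scripts sit in a single directory (the path shown is a placeholder, not the actual location):

```
* Hypothetical runner: execute the four scripts in their required order.
* The directory path is a placeholder - substitute the real location.
INSERT FILE='C:\4yr\scripts\D1_merge.sps' ERROR=STOP.
INSERT FILE='C:\4yr\scripts\D2_derive.sps' ERROR=STOP.
INSERT FILE='C:\4yr\scripts\D3_label.sps' ERROR=STOP.
INSERT FILE='C:\4yr\scripts\D4_double.sps' ERROR=STOP.
```

ERROR=STOP halts the run if a script fails, so later scripts cannot run against incomplete working files.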
Script 1: Merging raw data sources
The main purpose of this script (filename D1_merge.sps) is to merge raw files together and do some basic recoding. Basic variable properties (such as variable name, variable level, width and number of decimal places) are set. Some simple derived variables are created, and some data cleaning is carried out. The script carries out these tasks in order:
- There are 4 files of family-based raw data: the 3 files of parent booklet data plus the file of booklet return dates. These raw data files all start in csv format. For each of these files in turn, carry out the following actions (a sketch of these per-file steps, in SPSS syntax, appears after this list):
- Import into SPSS
- Sort in ascending order of family identifier FamilyID
- Recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
- For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
- Carry out basic recoding of categorical variables where necessary
- Drop raw data variables that are not to be retained in the datasets.
- Save as an SPSS data file.
- In the 3 parent booklet data files, in addition to the steps mentioned above, transform and derive further variables as follows:
- In the first parent booklet data file, add a data flag variable showing the presence of parent data.
- Convert raw twin-pair neither/elder/younger/both items to twin-specific yes/no items
- Transform some raw variables into more user-friendly compound variables where appropriate (e.g. age in years from year/month items for surgery age, and for skin problem age)
- Convert the 8 raw marital status variables into a single categorical variable
- Derive blood group/factor difference variables for the zygosity algorithm
- Derive standardised twin-specific items from the raw elder twin responses and younger twin differences (talking to twins, twin play/games, discipline, parental feelings)
- Clean some inconsistent or anomalous item variable values by recoding, typically as follows:
- In two-part questions, typically having a yes/no initial question followed by an if-yes part, remove inconsistencies by recoding the if-yes part to missing (or zero if appropriate) if the initial response was no.
- In the same types of questions, where affirmative responses were given in the if-yes part and the initial response was missing, recode the initial response to yes.
- There are 4 files of twin-based raw data: the 3 files of child booklet data plus the admin data file containing twin IDs and birth orders. These raw data files all start in csv format. For each of these files in turn, carry out the following actions:
- Import into SPSS
- Sort in ascending order of twin identifier TwinID
- Recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
- For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
- Carry out basic recoding of categorical variables where necessary (and add reversed versions of variables where needed)
- In the first twin booklet data file, add a data flag variable showing the presence of twin data.
- Transform some raw variables into more user-friendly compound variables where appropriate (heights and waist measurements in metres from feet/inches/metres/centimetres items, weights in kg from stones/pounds/kilograms/grams items)
- Derive item scores from item responses where appropriate, for test measures
- Clean some inconsistent or anomalous item variable values by recoding, in the same way as described above for parent booklet items. Additionally, clean heights/weights by recoding obviously infeasible values to missing.
- In the file of twin birth orders and sexes, compute the alternative twin identifier atempid2 as the FamilyID followed by the twin order (1 or 2).
- Drop raw data variables that are not to be retained in the datasets.
- Save as an SPSS data file.
- Merge the 4 files of family-based data together using FamilyID as the key variable.
- Check that meaningful parent booklet data are present by counting missing values; if not, recode the parent data flag from 1 to 0.
- Double enter the family-based data as follows (see the double-entry sketch after this list):
- Compute twin identifier atempid2 for the elder twin by appending 1 to the FamilyID. Save as the elder twin part of the family data.
- Re-compute atempid2 for the younger twin by appending 2 to the FamilyID. Reverse the values of the Random variable. Swap over elder and younger twin values in any twin-specific variables in the family data (do this by renaming variables). Save as the younger twin part of the family data.
- Combine the elder and younger twin parts together by adding cases. Sort in ascending order of atempid2 and save as the double entered family data file.
- Merge the 4 twin data files together using TwinID as the key variable.
- Do some further cleaning for vocab/language items in the merged data: if the first language item response says the child is not talking yet, then recode all the vocabulary items to missing.
- Check that meaningful twin booklet data are present by counting missing values; if not, recode the twin data flag from 1 to 0.
- Double enter the twin data flags, as follows:
- Append 1 to the variable name, to denote twin 1. Sort in ascending order of atempid2 and save as the twin 1 part.
- Change the variable names by appending 2 instead of 1. Change the values of atempid2 to match the co-twin (change the final digit from 1 to 2 or vice versa). Re-sort in ascending order of atempid2 and save with just the two double entered variables as the twin 2 part.
- Merge the twin 1 and twin 2 parts using atempid2 as the key variable. The double entered twin data flags can now be used to select twin pairs having data.
- Merge this twin data file with the double entered parent data file, using atempid2 as the key variable. This dataset now contains all the raw data.
- Use the parent data flag, and the double entered twin data flags, to filter the dataset and delete any cases without any 4 Year data. Add the overall 4 Year data flag variable dfouryr.
- Anonymise the family and twin IDs; the algorithm for scrambling IDs is described on another page.
- Sort in ascending order of scrambled twin ID id_twin.
- Save and drop the raw ID variables.
- Merge in essential background variables, from a separate reference dataset, using id_twin as the key variable. These include twin birth dates (for deriving ages), 1st Contact reference variables, twin sexes and zygosities, medical exclusions and overall exclusion variables, all of which are already double entered where appropriate.
- Merge in additional variables from the 1st Contact dataset (amedtot, atwed1/2) and from the 3 Year dataset (cmstatus) using id_twin as the key variable. These will be used later to derive new environment composites at age 4.
- Use variable dfouryr to filter the dataset and delete cases added from these other datasets that do not have 4 Year data.
- Save a working SPSS data file ready for the next script (filename d1merge in the \working files\ subdirectory).
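The per-file import and recoding steps at the start of this list follow a common pattern. A minimal sketch in SPSS syntax, using placeholder file and variable names rather than the script's actual ones:

```
* Sketch of the per-file import and recoding pattern (placeholder names).
GET DATA
  /TYPE=TXT
  /FILE='adult1.csv'
  /ARRANGEMENT=DELIMITED
  /DELIMITERS=","
  /FIRSTCASE=2
  /VARIABLES= FamilyID F8.0 q01 F2.0 q02 F2.0.
SORT CASES BY FamilyID (A).
* Recode the default missing (-99) and not-applicable (-77) codes.
RECODE q01 q02 (-99, -77 = SYSMIS).
* Set the displayed width/decimals and the SPSS variable level.
FORMATS q01 q02 (F1.0).
VARIABLE LEVEL q01 q02 (ORDINAL).
SAVE OUTFILE='adult1.sav'.
```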
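The double-entry step for the family-based data can be sketched as follows, again with placeholder names, assuming a numeric atempid2, a Random variable coded 0/1, and twin-specific variables arranged in elder/younger pairs:

```
* Sketch of double entering the family-based data (placeholder names).
GET FILE='family.sav'.
* Elder twin part: atempid2 is FamilyID followed by twin order 1.
COMPUTE atempid2 = FamilyID * 10 + 1.
SAVE OUTFILE='family_t1.sav'.

* Younger twin part: twin order 2, Random reversed (assumed coded 0/1),
* and elder/younger values swapped by exchanging variable names.
GET FILE='family.sav'.
COMPUTE atempid2 = FamilyID * 10 + 2.
RECODE Random (0=1) (1=0).
RENAME VARIABLES (itemE itemY = itemY itemE).  /* one pair shown */
SAVE OUTFILE='family_t2.sav'.

* Stack the two parts (one row per twin) and sort.
ADD FILES FILE='family_t1.sav' /FILE='family_t2.sav'.
SORT CASES BY atempid2 (A).
SAVE OUTFILE='family_double.sav'.
```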
Script 2: Deriving new variables
The main purpose of this script (filename D2_derive.sps) is to create scales and other new derived variables. See 4 Year derived variables for more details of how the variables were computed. The script carries out these tasks in order:
- Open the dataset file d1merge saved by the last script.
- Use the zygosity algorithm to compute derived variables dtempzyg, dalgzyg, dalg2zy based on item data in the 4 Year zygosity questionnaire in the parent booklet.
- Compute new date variables, from day/month/year item variables. Reduce the number of date variables in the dataset by making a best estimate of the completion date for each booklet section, substituting missing dates with other available dates where necessary.
- Use the new date variables, and the twin birth dates, to compute twin age variables.
- Derive language scores (picture vocabulary, MCDI vocabulary and grammar), including a transformed vocabulary score.
- Clean the vocabulary items and scores by recoding to missing if the total score is zero and other language data suggest that the vocabulary measure was not in fact completed, resulting in default zero item responses.
- Derive low-language flag variables.
- Derive total scores for each of the Parca activities (the 4 parent-administered tests plus the parent-report measure). Add an overall total score for the 4 parent-administered tests.
- Derive standardised composite cognitive scores from the MCDI and Parca measure scores. These composites are equally-weighted means of scores that have been standardised on the non-excluded sample of twins (a sketch of this standardisation recipe appears after this list).
- Derive Behar, SDQ and other behaviour scales.
- Derive gender role scales.
- Derive BMI variables and parent BMI categories.
- Derive the maternal depression scale.
- Derive standardised composite environment variables from parent booklet measures (twin play/games, talking to twins, discipline, parental feelings, chaos). These composites are means of items that have been standardised on the non-excluded sample of twins.
- Derive handedness scales.
- Drop any temporary variables that have been used to derive the new variables. Date variables are dropped at this point, having been used to derive ages.
- Save a working SPSS data file ready for the next script (filename d2derive in the \working files\ subdirectory).
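The standardised composites above share a common recipe: standardise each ingredient on the non-excluded twins, then average. A minimal sketch for a two-score composite, in which score1, score2, exclude and composite are all placeholder names:

```
* Sketch of a standardised composite (placeholder variable names).
* Copies of the scores are left blank for excluded twins, so AGGREGATE
* computes the reference mean and SD over the non-excluded sample only.
COMPUTE const = 1.
DO IF (exclude = 0).
  COMPUTE base1 = score1.
  COMPUTE base2 = score2.
END IF.
AGGREGATE /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=const
  /m1=MEAN(base1) /sd1=SD(base1)
  /m2=MEAN(base2) /sd2=SD(base2).
* Standardise all twins against the reference statistics, then average.
* The .2 suffix requires both z-scores to be present; adjust as needed.
COMPUTE z1 = (score1 - m1) / sd1.
COMPUTE z2 = (score2 - m2) / sd2.
COMPUTE composite = MEAN.2(z1, z2).
EXECUTE.
```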
Script 3: Labelling variables
The main purpose of this script (filename D3_label.sps) is to label the variables and add value labels. The script carries out these tasks in order:
- Open the dataset file d2derive saved by the last script.
- Label all the variables.
- For all categorical variables having more than 2 categories, add value labels.
- Save a working SPSS data file ready for the next script (filename d3label in the \working files\ subdirectory).
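The labelling pattern is straightforward; a minimal sketch, with illustrative (not actual) variable names and labels:

```
* Sketch of the labelling pattern (illustrative names and labels).
VARIABLE LABELS
  dbage1  'Twin 1 age (years) at 4 Year booklet completion'
  dzygos  'Twin pair zygosity'.
VALUE LABELS dzygos
  1 'MZ'
  2 'DZ same sex'
  3 'DZ opposite sex'.
```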
Script 4: Double entering the twin data
The main purpose of this script (filename D4_double.sps) is to double enter the twin variables from the child booklet, add a few derived variables that require double entered data, and save the final dataset. Note that twin-related variables from the parent booklet were double entered in the first script above. The script carries out these tasks in order:
- Open the dataset file d3label saved by the last script.
- Rename all item and derived variables from the twin booklet, by appending 1 to the name.
- Sort in ascending order of id_twin and save as the twin 1 part of the data.
- Create the twin 2 part (for the co-twin) as follows:
- Rename the twin booklet variables by changing the suffix from 1 to 2.
- Modify the id_twin values so they will match the co-twin (change the final digit from 1 to 2 or vice versa).
- Re-sort in ascending order of id_twin and save as the twin 2 part, keeping only the renamed twin booklet variables.
- Merge the twin 1 and twin 2 parts, using id_twin as the key variable. The dataset is now double entered (a sketch of this pattern appears after this list).
- Derive twin-pair age difference variables (in some pairs, the two twin booklets were completed on different dates).
- Derive new exclusion variables, specific to the 4 Year study, based on twin age and age difference criteria.
- Derive additional environment composites (life events, environment risk) that require double entered data, and hence were not derived in the earlier script. These composites are derived from 1st Contact and 3 Year data as well as 4 Year data. The environment risk composites are standardised means of variables that have been standardised on the non-excluded twin sample.
- Clean parent heights, weights and waist measurements by removing anomalies in the double-entered values (each parent's measurements were entered twice, in the two twin booklets).
- Drop any temporary variables that have been used to derive the new variables.
- Place the dataset variables into a logical and systematic order (do this using a KEEP statement when saving the dataset).
- Save an SPSS data file (filename d4double in the \working files\ subdirectory).
- Save another copy as the main 4 Year dataset, with filename ddb9456.
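The co-twin matching pattern used in this script can be sketched as follows; the item variable is a placeholder, and id_twin is assumed to be numeric with a final digit of 1 or 2, as described above:

```
* Sketch of double entering twin booklet variables (placeholder names).
GET FILE='d3label.sav'.
* Twin 1 part: booklet variables have already been given suffix 1.
SORT CASES BY id_twin (A).
SAVE OUTFILE='twin_part1.sav'.

* Twin 2 part: change the suffix to 2 and point id_twin at the co-twin
* by flipping its final digit (assumes a numeric ID ending in 1 or 2).
RENAME VARIABLES (item1 = item2).  /* repeat for each booklet variable */
DO IF (MOD(id_twin, 10) = 1).
  COMPUTE id_twin = id_twin + 1.
ELSE IF (MOD(id_twin, 10) = 2).
  COMPUTE id_twin = id_twin - 1.
END IF.
SORT CASES BY id_twin (A).
SAVE OUTFILE='twin_part2.sav' /KEEP=id_twin item2.

* Match the parts so each row carries both own and co-twin values.
MATCH FILES FILE='twin_part1.sav' /FILE='twin_part2.sav' /BY id_twin.
SAVE OUTFILE='d4double.sav'.
```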