Exporting raw data
Exporting involves copying the cleaned and aggregated raw data from the Access database in which they are stored into csv files that can be read into SPSS. The process of exporting raw data is described in general terms on the data processing summary page.
Over time, the study booklet data, stored in the Access 3yr.accdb database file, have been subject to occasional changes, even after the end of data collection. In earlier years, changes were caused by late returns of booklets; more recently, they have occasionally resulted from data cleaning or data restructuring. If such changes have been made, the data should be re-exported before a new version of the dataset is created using the SPSS scripts. The booklet data stored in the database tables are exported indirectly, by means of saved "queries" (or views), rather than directly from the tables themselves (except in the case of the ReturnDates table). Each query selects appropriate columns from the relevant tables, excluding inappropriate data such as verbatim text fields. The queries used to export the data are as follows:
Query or table name | Source of data | Database table(s) involved | Exported file name |
---|---|---|---|
Export Adult | parent booklets | Adult | Adult1.csv |
Export Child1, Export Child2, Export Child3, Export Child4 | twin booklets | Child1, Child2, Child3, Child4 | Child1.csv, Child2.csv, Child3.csv, Child4.csv |
ReturnDates | booklet return dates | ReturnDates | return_dates.csv |
A convenient way of exporting these data files is to run a macro that is saved in the Access database. See the data files summary page and the 3 Year data files page for further information about the storage of these files.
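Each exported csv file is subsequently read into SPSS at the start of the first script. As a rough illustration, a minimal import command for the Adult1.csv file might look like the sketch below, assuming the file has a header row; the file path, the variable names and their formats are placeholders rather than the actual export specification.

```
* Minimal sketch of importing an exported csv file into SPSS.
* The path, variable names and formats below are illustrative only.
GET DATA
  /TYPE=TXT
  /FILE='C:\exports\Adult1.csv'
  /DELIMITERS=","
  /QUALIFIER='"'
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=2
  /VARIABLES=
    FamilyID F7.0
    item01   F3.0
    item02   F3.0.
SORT CASES BY FamilyID (A).
```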
Processing by scripts
Having exported the raw data as above, a new version of the dataset is made by running the scripts described below. The scripts must be run strictly in sequence. To run each script, simply open it in SPSS, select all the text, and click on the Run icon.
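Alternatively, rather than opening and running each script by hand, the four scripts can be run in sequence from a single syntax window using INSERT commands, as in the sketch below (the directory path is a placeholder):

```
* Run the 3 Year scripts strictly in sequence (path is illustrative).
INSERT FILE='C:\scripts\C1_merge.sps'  ERROR=STOP.
INSERT FILE='C:\scripts\C2_derive.sps' ERROR=STOP.
INSERT FILE='C:\scripts\C3_label.sps'  ERROR=STOP.
INSERT FILE='C:\scripts\C4_double.sps' ERROR=STOP.
```

The ERROR=STOP setting halts processing if a script fails, so that later scripts do not run against incomplete working files.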
Script 1: Merging raw data sources
The main purpose of this script (filename C1_merge.sps) is to merge raw data files together and do some basic recoding. The script carries out these tasks in order:
- There are two files of family-based raw data: the file of parent booklet data plus the file of booklet return dates. These raw data files both start in csv format. For each of these files in turn, carry out the following actions:
- Import into SPSS
- Sort in ascending order of family identifier FamilyID
- Recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
- For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
- Carry out basic recoding of categorical variables where necessary
- Drop raw data variables that are not to be retained in the datasets.
- Save as an SPSS data file.
- In the parent booklet data file, in addition to the steps mentioned above, transform and derive further variables as follows:
- Add a data flag variable showing the presence of parent data.
- Convert twin-pair neither/elder/younger/both items to twin-specific yes/no items
- Convert day/week/month items into a single time interval in days (breast feeding)
- Derive blood group/factor difference variables for the zygosity algorithm
- Derive standardised twin-specific items from the raw elder twin responses and younger twin differences (talking to twins, twin play/games, discipline, parental feelings)
- Clean some inconsistent or anomalous item variable values by recoding, as follows:
- In two-part questions, typically having a yes/no initial question followed by an if-yes part, remove inconsistencies by recoding the if-yes part to missing (or zero if appropriate) if the initial response was no.
- In the same types of questions, where affirmative responses were given in the if-yes part and the initial response was missing, recode the initial response to yes.
- There are five files of twin-based raw data: the four files of child booklet data plus the admin data file containing twin IDs and birth orders. These raw data files all start in csv format. For each of these files in turn, carry out the following actions:
- Import into SPSS
- Sort in ascending order of twin identifier TwinID
- Recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
- For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
- Carry out basic recoding of categorical variables where necessary
- In the first twin booklet data file, add a data flag variable showing the presence of twin data.
- Transform some raw variables into more user-friendly compound variables where appropriate (e.g. age in years from year/month items for surgery age, heights in metres from feet/inches/metres/centimetres items, weights in kg from stones/pounds/kilograms/grams items)
- Derive item scores from item responses where appropriate, for test measures
- Clean some inconsistent or anomalous item variable values by recoding, in the same way as described above for parent booklet items. Additionally, clean heights/weights by recoding obviously infeasible values to missing.
- In the file of twin birth orders and sexes, compute the alternative twin identifier atempid2 as the FamilyID followed by the twin order (1 or 2).
- Drop raw data variables that are not to be retained in the datasets.
- Save as an SPSS data file.
- Merge the two files of family-based data together using FamilyID as the key variable.
- Double enter the family-based data as follows (a sketch of this step appears after this list):
- Compute twin identifier atempid2 for the elder twin by appending 1 to the FamilyID. Save as the elder twin part of the family data.
- Re-compute atempid2 for the younger twin by appending 2 to the FamilyID. Reverse the values of the Random variable. Swap over elder and younger twin values in any twin-specific variables in the family data (do this by renaming variables). Save as the younger twin part of the family data.
- Combine the elder and younger twin parts together by adding cases. Sort in ascending order of atempid2 and save as the double entered family data file.
- Merge the five twin data files together using TwinID as the key variable.
- Do some further cleaning for vocab/language items in the merged data, as follows:
- If the first vocabulary item response says the child is not talking yet, then recode all other vocabulary, sentence complexity and word use items to missing
- If the first sentence complexity item is missing, then recode to a 'yes' response if any other responses for either sentence complexity or word use are affirmative
- If the first communication item response shows there are no concerns about the child's language, then recode the following 8 items to missing
- Double enter the twin data flags, as follows:
- Append 1 to the variable name, to denote twin 1. Sort in ascending order of atempid2 and save as the twin 1 part.
- Change the variable names by appending 2 instead of 1. Change the values of atempid2 to match the co-twin (change the final digit from 1 to 2 or vice versa). Re-sort in ascending order of atempid2 and save with just the two double entered variables as the twin 2 part.
- Merge the twin 1 and twin 2 parts using atempid2 as the key variable. The double entered twin data flags can now be used to select twin pairs having data.
- Merge this twin data file with the double entered parent data file, using atempid2 as the key variable. This dataset now contains all the raw data.
- Use the parent data flag, and the double entered twin data flags, to filter the dataset and delete any cases without any 3 Year data. Add the overall 3 Year data flag variable cthreeyr.
- Anonymise the family and twin IDs; the algorithm for scrambling IDs is described on another page.
- Sort in ascending order of scrambled twin ID id_twin.
- Save and drop the raw ID variables.
- Merge in essential background variables, from a separate reference dataset, using id_twin as the key variable. These include twin birth dates (for deriving ages), 1st Contact reference variables, twin sexes and zygosities, medical exclusions and overall exclusion variables, all of which are already double entered where appropriate.
- Merge in additional variables from the 1st Contact dataset (aadults, amedtot, atwed1/2, for deriving environment composites at age 3) using id_twin as the key variable.
- Use variable cthreeyr to filter the dataset and delete cases added from the background variables dataset and from 1st Contact that do not have 3 Year data.
- Save a working SPSS data file ready for the next script (filename c1merge in the \working files\ subdirectory).
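To illustrate the double entry of family-based data described in the list above, the following is a minimal sketch. It assumes that FamilyID is numeric, that atempid2 is formed by appending the twin order digit to FamilyID, and that the Random variable is coded 0/1; the file names and the twin-specific variable pair varElder/varYounger are purely illustrative.

```
* Sketch of double entering family-based data.
* File names and the variables varElder/varYounger are illustrative.
* Elder twin part: atempid2 is FamilyID followed by 1.
GET FILE='family.sav'.
COMPUTE atempid2 = FamilyID * 10 + 1.
SAVE OUTFILE='family_elder.sav'.

* Younger twin part: atempid2 is FamilyID followed by 2.
* Reverse Random (assumed 0/1) and swap the elder/younger twin-specific values.
GET FILE='family.sav'.
COMPUTE atempid2 = FamilyID * 10 + 2.
RECODE Random (0=1) (1=0).
RENAME VARIABLES (varElder varYounger = varYounger varElder).
SAVE OUTFILE='family_younger.sav'.

* Combine the two parts, sort by atempid2 and save the double entered file.
ADD FILES /FILE='family_elder.sav' /FILE='family_younger.sav'.
SORT CASES BY atempid2 (A).
SAVE OUTFILE='family_double.sav'.
```

Because RENAME VARIABLES applies all of its renames simultaneously, the elder and younger variable names can be swapped directly in a single command, without a temporary name.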
Script 2: Deriving new variables
The main purpose of this script (filename C2_derive.sps) is to create scales and other new derived variables. See 3 Year derived variables for more details of how the variables were computed. The script carries out these tasks in order:
- Open the dataset file c1merge saved by the last script.
- Use the zygosity algorithm to compute derived variables ctempzyg, calgzyg, calg2zy based on item data in the 3 Year zygosity questionnaire in the parent booklet.
- Compute new date variables, from day/month/year item variables. Reduce the number of date variables in the dataset by making a best estimate of the completion date for each booklet section, substituting missing dates with other available dates where necessary.
- Use the new date variables, and the twin birth dates, to compute twin age variables.
- Derive MCDI language scores (word use, vocabulary, sentence complexity and grammar), including a transformed vocabulary score.
- Clean the vocabulary items and scores by recoding to missing if the total score is zero and other data (word use and sentence complexity) suggest that the vocabulary measure was not in fact completed, resulting in default zero item responses.
- Derive total scores for each of the Parca activities (the 3 parent-administered tests plus the parent-report measure). Add an overall total score for the 3 parent-administered tests.
- Derive standardised composite cognitive scores from the MCDI and Parca measure scores. These composites are equally-weighted means of scores that have been standardised on the non-excluded sample of twins (a sketch of this standardisation appears after this list).
- Derive Behar and other behaviour scales.
- Derive gender role scales.
- Derive BMI variables and parent BMI categories.
- Derive the maternal depression scale.
- Derive standardised composite environment variables from parent booklet measures (twin play/games, talking to twins, discipline, parental feelings, chaos). These composites are means of items that have been standardised on the non-excluded sample of twins.
- Derive handedness scales.
- Drop any temporary variables that have been used to derive the new variables. Date variables are dropped at this point, having been used to derive ages.
- Save a working SPSS data file ready for the next script (filename c2derive in the \working files\ subdirectory).
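The following is a minimal sketch of deriving an equally-weighted standardised composite of the kind described above, where the mean and SD used for standardisation come from the non-excluded sample but z-scores are computed for all twins. The variable names (score1, score2, exclude1, composite) are hypothetical, and the handling of partially missing scores is simplified.

```
* Sketch of an equally-weighted standardised composite.
* Variable names (score1, score2, exclude1, composite) are illustrative.
* Keep copies of the scores for the non-excluded sample only (exclude1 = 0 assumed).
DO IF (exclude1 = 0).
  COMPUTE s1 = score1.
  COMPUTE s2 = score2.
END IF.
* Attach the non-excluded sample mean and SD of each score to every case.
COMPUTE const = 1.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=const
  /m1 = MEAN(s1)
  /sd1 = SD(s1)
  /m2 = MEAN(s2)
  /sd2 = SD(s2).
* Standardise each score on the non-excluded sample, then average the z-scores.
COMPUTE z1 = (score1 - m1) / sd1.
COMPUTE z2 = (score2 - m2) / sd2.
COMPUTE composite = MEAN(z1, z2).
EXECUTE.
```

The actual scores entering each composite, and the treatment of missing values, are described on the 3 Year derived variables page.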
Script 3: Label variables
The purpose of this script (filename C3_label.sps) is simply to label the variables and add value labels. The script carries out these tasks in order:
- Open the dataset file c2derive saved by the last script.
- Label all the variables.
- For all categorical variables having more than 2 categories, add value labels (a sketch of the labelling commands appears after this list).
- Save a working SPSS data file ready for the next script (filename c3label in the \working files\ subdirectory).
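A minimal sketch of the labelling commands is shown below; the variable names, labels and value codes are purely illustrative and are not taken from the actual dataset.

```
* Sketch of labelling commands (names, labels and codes are illustrative).
VARIABLE LABELS
  cvarA 'Example item from the parent booklet'
  cvarB 'Example derived categorical variable'.
VALUE LABELS
  cvarB
    1 'Category one'
    2 'Category two'
    3 'Category three'.
```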
Script 4: Double enter the twin data
The purpose of this script (filename C4_double.sps) is to double enter the twin variables from the child booklets, add a few new variables that require double entered data, and save the final dataset. Note that twin-related variables from the parent booklet were double entered in the first script above. The script carries out these tasks in order (a sketch of the double entry steps appears after the list):
- Open the dataset file c3label saved by the last script.
- Rename all item and derived variables from the twin booklet, by appending 1 to the name.
- Sort in ascending order of id_twin and save as the twin 1 part of the data.
- Create the twin 2 part (for the co-twin) as follows:
- Rename the twin booklet variables by changing the suffix from 1 to 2.
- Modify the id_twin values so they will match the co-twin (change the final digit from 1 to 2 or vice versa).
- Re-sort in ascending order of id_twin and save as the twin 2 part, keeping only the renamed twin booklet variables.
- Merge the twin 1 and twin 2 parts, using id_twin as the key variable. The dataset is now double entered.
- Derive twin-pair age difference variables (in some pairs, the two twin booklets were completed on different dates).
- Derive new exclusion variables, specific to the 3 Year study, based on twin age and age difference criteria.
- Derive additional environment composites (life events, environment risk) that require double entered data, and hence were not derived in the earlier script. These composites are derived from 1st Contact data as well as 3 Year data. The environment risk composites are standardised means of variables that have been standardised on the non-excluded twin sample.
- Drop any temporary variables that have been used to derive the new variables.
- Place the dataset variables into a logical and systematic order (do this using a KEEP subcommand when saving the dataset).
- Save an SPSS data file (filename c4double in the \working files\ subdirectory).
- Save another copy as the main 3 Year dataset, with filename cdoub945.
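The double entry steps above can be sketched as follows. The working file names and the twin booklet variables cvarX/cvarY are illustrative, and id_twin is assumed to be numeric, ending in 1 for the elder twin and 2 for the younger twin.

```
* Sketch of double entering the twin booklet variables.
* File names and the variables cvarX/cvarY are illustrative.
* Twin 1 part: the twin's own booklet variables, renamed with suffix 1.
GET FILE='c3label.sav'.
RENAME VARIABLES (cvarX = cvarX1) (cvarY = cvarY1).
SORT CASES BY id_twin (A).
SAVE OUTFILE='twin1_part.sav'.

* Twin 2 part: rename with suffix 2 and switch id_twin to the co-twin's ID
* (id_twin assumed numeric, with final digit 1 or 2).
RENAME VARIABLES (cvarX1 = cvarX2) (cvarY1 = cvarY2).
DO IF (MOD(id_twin, 10) = 1).
  COMPUTE id_twin = id_twin + 1.
ELSE IF (MOD(id_twin, 10) = 2).
  COMPUTE id_twin = id_twin - 1.
END IF.
SORT CASES BY id_twin (A).
SAVE OUTFILE='twin2_part.sav' /KEEP=id_twin cvarX2 cvarY2.

* Merge the two parts: each row now holds the twin's own data (suffix 1)
* alongside the co-twin's data (suffix 2).
MATCH FILES /FILE='twin1_part.sav' /FILE='twin2_part.sav' /BY id_twin.
SAVE OUTFILE='c4double.sav'.
```

In the actual script the suffix renaming covers all item and derived variables from the twin booklets, and the final save uses a KEEP list to place the variables in a systematic order.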