Exporting raw data
Exporting involves copying the cleaned and aggregated raw data from the Access database in which they are stored into csv files that can be read into SPSS. The process of exporting raw data is described in general terms in the data processing summary page.
The study booklet data, stored in the Access 1c.accdb database file, have been subject to occasional changes, even after the end of data collection. The 1st Contact study has been extended more than once, to try to obtain important demographic and perinatal data from families that did not respond in the original study; hence new data have been added to the database. Further changes have occasionally been made due to data cleaning or data restructuring. Whenever these data are changed in any way, they should be re-exported before a new version of the dataset is created using the SPSS scripts.
The data stored in the database tables are exported indirectly, by means of saved "queries" (or views), rather than directly from the tables themselves. Each query selects appropriate columns from the relevant tables, excluding any fields not needed for the dataset. The queries used to export the booklet data are as follows:
Query name | Database table | Exported file name |
---|---|---|
Export Part1 | Part1 | Part1.csv |
Export Part2 | Part2 | Part2.csv |
Export Part3 | Part3 | Part3.csv |
Export ReturnDates | FirstContactProgress | Returndates.csv |
A convenient way of exporting these data files is to run a macro that is within the Access database. See the data files summary page and the 1st Contact data files page for further information about the storage of these files.
Processing by scripts
Having exported the raw data as described above, a new version of the dataset is made by running SPSS scripts (syntax files). The scripts must be run strictly in sequence. To run each script, simply open it in SPSS, select all the text, and click on the Run icon. The functions of each script are described below.
Script 1: Merging raw data sources
The main purpose of this script (filename A1_merge.sps) is to merge raw data files together, name the item variables, set variable display properties, and do some basic recoding and transforming of variables.
For further information about changes to names and coding of item variables, refer to the annotated booklet (pdf).
The script carries out these tasks in order:
- There are 4 files of family-based raw data: the three files of booklet data plus the file containing booklet return dates. These raw data files all start in csv format. For each of these files in turn, carry out the following actions:
  - Import into SPSS
  - Sort in ascending order of family identifier FamilyID
  - Recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
  - For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
  - Recode categorical variables to simplify and make items consistent in their coding.
  - Transform each twin-pair item, typically coded 0=neither, 1=elder, 2=younger, 3=both, into two twin-specific items coded 1=yes, 0=no. Such items are then double entered in a later script.
  - Transform numeric items (for example, time intervals in days) into ordinal variables with interval categories.
  - Clean up inconsistent or incomplete responses in multiple-part questions, consisting of an initial question typically followed by one or more "if yes then" questions:
    - Where the initial response is "no", but the follow-up responses are affirmative, remove the inconsistency by recoding the follow-up responses to missing. This assumes that the initial response is more likely to be accurate than the follow-up response.
    - Where the initial response is missing, but the follow-up responses are affirmative, assume that the initial response should be "yes" and recode accordingly.
  - Drop raw data variables that are not to be retained in the datasets.
  - Save as an SPSS data file.
- Merge together the 3 SPSS files of booklet data plus the file of return dates, using FamilyID as the key variable. Remove any cases without data, then add the acontact variable to show that cases in this merged file all have 1st Contact data.
- Save a working SPSS data file ready for the next script (filename a1merge in the \working files\ subdirectory).
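The recoding rules in this script can be illustrated outside SPSS. The following Python sketch is not the A1_merge.sps syntax itself; it only restates three of the rules above under stated assumptions. The function names are hypothetical, `None` stands in for an SPSS system-missing value, and the yes/no items are assumed to be coded 1=yes, 0=no.

```python
# Illustrative sketch only: restates three recoding rules from Script 1.
# Function names are hypothetical; None stands in for SPSS system-missing.

MISSING_CODES = {-99, -77}  # -99 = missing, -77 = not applicable


def recode_missing(value):
    """Map the default missing / not-applicable codes to None."""
    return None if value in MISSING_CODES else value


def split_pair_item(value):
    """Split a twin-pair item (0=neither, 1=elder, 2=younger, 3=both)
    into (elder, younger) items coded 1=yes, 0=no."""
    if value not in (0, 1, 2, 3):
        return None, None  # missing or unexpected code stays missing
    elder = 1 if value in (1, 3) else 0
    younger = 1 if value in (2, 3) else 0
    return elder, younger


def clean_followups(initial, followups):
    """Resolve an initial yes/no question against its follow-up items.
    Assumes 1=yes, 0=no, None=missing for all items."""
    any_yes = any(f == 1 for f in followups)
    if initial == 0 and any_yes:
        # Trust the initial "no": blank the contradictory follow-ups.
        return initial, [None] * len(followups)
    if initial is None and any_yes:
        # An affirmative follow-up implies the initial answer was "yes".
        return 1, list(followups)
    return initial, list(followups)
```

For example, `split_pair_item(3)` gives one "yes" item for each twin, and `clean_followups(None, [0, 1])` fills in the missing initial response as "yes" while leaving the follow-ups unchanged.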
Script 2: Double entering the data
The main purpose of this script (filename A2_double.sps) is to double-enter the data. The script carries out these tasks in order:
- Open the dataset file a1merge saved by the last script. This dataset so far contains just one row of data per twin pair, in which twin variables refer specifically to the older and younger twin.
- Convert this into a dataset for the set of elder twins as follows:
  - Compute twin identifier atempid2: multiply family identifier FamilyID by 10, and add 1 (to denote elder twin).
  - In a few cases, elder twin details are thought to have been recorded for the younger twin, and vice versa, in the 1st Contact booklet. These cases are flagged by value 2 in item variable atwinord. For these cases, recompute atempid2 by changing the last digit from 1 to 2.
  - Compute the random variable, assigning values 0 and 1 randomly (but with equal probability) to the elder twins.
  - Save this dataset as the elder twin part.
- Now convert this same dataset into a dataset for the younger twins as follows:
  - Re-compute twin identifier atempid2 by changing the final digit to 2 (to denote younger twins).
  - For the few cases flagged by atwinord=2, where twin data need to be reversed, change the last digit of atempid2 from 2 to 1.
  - Reverse the values of the random variable for the younger twin, by recoding 0 to 1 and vice versa.
  - For all twin-specific item variables, swap the elder and younger twin values around by a process of variable renaming.
  - Save this dataset as the younger twin part.
- Merge cases from the elder and younger twin parts, making a larger dataset containing one row of data for each twin. Sort in ascending order of atempid2 and save.
- Anonymise the family and twin IDs; the algorithm for scrambling IDs is described on another page. Drop the raw ID variables.
- Sort in ascending order of id_twin (anonymised twin ID).
- Merge in essential background variables not already present, from the reference dataset of background variables. The background variables added here include twin sexes, medical exclusions and the standard exclusion variables (all already double entered) and twin birth date (for deriving ages). This merging is done using id_twin as the key variable.
- Save a working SPSS data file ready for the next script (filename a2double in the \working files\ subdirectory).
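The double entry steps can be sketched in Python as follows. This is an illustration of the logic described above, not a translation of A2_double.sps. The names FamilyID, atwinord, atempid2 and random come from the description; the function name, the use of dictionaries for rows, and the convention that twin-specific items end in 1 (elder) and 2 (younger) are assumptions made for the example. Anonymisation and the merge of background variables are left out.

```python
import random


def double_enter(pair_rows, twin_items, seed=None):
    """Turn one row per twin pair into one row per twin.

    pair_rows:  list of dicts, each with FamilyID, atwinord and, for every
                name in twin_items, the keys name+"1" (elder twin) and
                name+"2" (younger twin). This suffix convention is assumed.
    twin_items: base names of the twin-specific items (hypothetical).
    """
    rng = random.Random(seed)
    twin_rows = []
    for row in pair_rows:
        reversed_order = row.get("atwinord") == 2  # elder/younger recorded the wrong way round
        rand = rng.randint(0, 1)  # 0 or 1 with equal probability

        # Elder twin part: last digit 1, or 2 for the flagged cases.
        elder = dict(row)
        elder["atempid2"] = row["FamilyID"] * 10 + (2 if reversed_order else 1)
        elder["random"] = rand

        # Younger twin part: last digit 2, or 1 for the flagged cases.
        younger = dict(row)
        younger["atempid2"] = row["FamilyID"] * 10 + (1 if reversed_order else 2)
        younger["random"] = 1 - rand  # reverse of the elder twin's value
        for item in twin_items:
            # Swap the twin-specific values, so that the "1" variable
            # always describes the twin that the row belongs to.
            younger[item + "1"] = row[item + "2"]
            younger[item + "2"] = row[item + "1"]

        twin_rows.extend([elder, younger])

    # Merge the two parts and sort in ascending order of atempid2.
    return sorted(twin_rows, key=lambda r: r["atempid2"])
```

Passing a fixed `seed` makes the random assignment reproducible, which is convenient when checking the output; the source does not say whether the SPSS script does this.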
Script 3: Cleaning the raw data
The main purpose of this script (filename A3_clean.sps) is to clean the raw data by recoding variables, where anomalies or inconsistencies can be detected. The script carries out these tasks in order:
- Open the dataset file a2double saved by the last script.
- Clean up inconsistent or incomplete responses in the six questions asking for the sex, relationship to twins and marital status of the respondent and partner:
  - Where a response is missing but the likely response can be deduced from other related responses, recode the missing response accordingly. For example, if the responses show the respondent is female, the natural mother of the twins, and cohabiting with the other parent, it can usually be deduced that the partner is male, the father of the twins, and also cohabiting with the other parent.
  - Where a single response is inconsistent with the 5 other responses, and the "correct" response can reasonably be deduced, recode the response accordingly. For example, if the respondent is female, the natural mother, and married to the other parent, while the partner is male and the natural father, then the partner's marital status can reasonably be recoded to "married to the other parent" if some other response is recorded.
  - Where inconsistent responses cannot be resolved as above, and it is difficult to deduce the "correct" responses, recode inconsistent responses to missing.
- Compute the total numbers of older and younger siblings.
- Clean up inconsistent responses in the sibling data:
  - Where initial responses show there to be fewer than 3 older siblings present, but details of a third older sibling are recorded, then remove the inconsistency by deleting the 3rd older sibling's details (recode them to missing). Repeat for similar sets of responses for the second and first older siblings, where initial responses show fewer than 2 or 1 older siblings respectively.
  - Repeat for the younger siblings (details of up to 2 younger sibs may be recorded).
  - If responses for any younger sibling show that the sibling is in fact older than the twins, then delete the responses relating to that sibling (recode them to missing).
  - Repeat for older siblings who are apparently younger than the twins.
  - If responses for any older or younger sibling show that their date of birth is within 300 days of the twins' date of birth, and the sibling's parents are the same as for the twins, then delete the responses relating to that sibling.
- Compute recoded, ordinal versions of some quantitative variables.
- For any new derived variables, as for item variables in the previous script, set the variable level (nominal/ordinal/scale), width and number of decimal places.
- Save a working SPSS data file ready for the next script (filename a3clean in the \working files\ subdirectory).
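The sibling cleaning rules lend themselves to a short worked example. The Python sketch below restates them under stated assumptions and is not the A3_clean.sps syntax. The function and parameter names are hypothetical, "within 300 days" is read here as a gap of fewer than 300 days (the source does not say whether exactly 300 days counts), and `None` stands in for missing data.

```python
from datetime import date


def blank_surplus_siblings(n_siblings, details):
    """Blank the details of any sibling beyond the reported count.

    details is a list of per-sibling records in booklet order
    (first, second, third sibling). If the count is missing,
    nothing can be checked and the details are returned unchanged.
    """
    if n_siblings is None:
        return list(details)
    return [d if i < n_siblings else None for i, d in enumerate(details)]


def sibling_is_plausible(sib_dob, twin_dob, recorded_as, same_parents):
    """Return False if this sibling's responses should be recoded to missing.

    recorded_as is "older" or "younger", according to the booklet
    section in which the sibling was recorded.
    """
    if sib_dob is None:
        return True  # no date of birth, so no date check is possible
    gap_days = (twin_dob - sib_dob).days  # positive if the sibling was born first
    if recorded_as == "older" and gap_days < 0:
        return False  # an "older" sibling born after the twins
    if recorded_as == "younger" and gap_days > 0:
        return False  # a "younger" sibling born before the twins
    if same_parents and abs(gap_days) < 300:
        return False  # too close to the twins' birth for the same parents
    return True
```

For instance, a sibling recorded as older, with the same parents, and born 214 days before the twins fails the last check, whereas the same dates with different parents pass.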
Script 4: Adding new derived variables
The main purpose of this script (filename A4_derive.sps) is to compute various types of derived variables. For full details of how new variables are derived, see the 1st Contact derived variables page and the zygosity algorithm page. The script carries out these tasks in order:
- Open the dataset file a3clean saved by the last script.
- Add variable sexdif to flag twin pairs having opposite sexes (needed for the zygosity algorithm).
- Use the zygosity algorithm to compute derived variables atempzyg, aalgzyg, aalg2zy based on item data in the 1st Contact zygosity questionnaire.
- Compute perinatal outlier exclusion flag variables aperi1, aperi2, aperi3, aperi4, aperi5, aperinat from various item variables relating to pregnancy and birth of the twins.
- Convert raw day/month/year item variables into date values in new variables.
- Derive the ages (when the 1st Contact booklet was completed) of the twins, respondent, partner, natural mother, and older siblings, from various date variables.
- Derive the age of the mother when her first child was born (amagechl), from various parent, twin and sibling items including their birth dates.
- Compute various other derived variables relating to the male and female parents, based on respondent and partner item variables. These derived variables include household type, and qualification and employment categories for the female and male parents.
- Derive composite variables for SES (ases), twin medical risk (atwmed1/2) and mother medical risk (amedtot). Each of these composites is derived from a range of other item and derived variables, and is standardised on the non-excluded sample of twin pairs (exclude1=0 & exclude2=0).
- For all new derived variables, as for other variables in previous scripts, set the variable level (nominal/ordinal/scale), width and number of decimal places.
- Drop all temporary and redundant variables that have been used in the computation of new derived variables. Date variables are dropped at this point, having been used to derive ages.
- Save a working SPSS data file ready for the next script (filename a4derive in the \working files\ subdirectory).
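Two of the derivations above, ages from dates and composites standardised on the non-excluded sample, can be sketched in Python. This is an illustration under assumptions, not the method used in A4_derive.sps: the exact formulas are documented on the 1st Contact derived variables page. The function names are hypothetical, the age is assumed to be in decimal years using a 365.25-day year, and the standardisation is assumed to use the sample standard deviation. Only exclude1 and exclude2 are taken from the description.

```python
from datetime import date
from statistics import mean, stdev


def age_in_years(birth_date, event_date):
    """Age at event_date in decimal years (365.25-day year assumed)."""
    if birth_date is None or event_date is None:
        return None
    return (event_date - birth_date).days / 365.25


def standardise(rows, var, out):
    """Add a z-score version of `var` to every row as `out`.

    The mean and standard deviation are taken only from rows with
    exclude1 = 0 and exclude2 = 0, but the resulting scale is applied
    to all rows, including the excluded ones.
    """
    reference = [
        r[var] for r in rows
        if r["exclude1"] == 0 and r["exclude2"] == 0 and r[var] is not None
    ]
    m = mean(reference)
    sd = stdev(reference)  # sample SD (n - 1); an assumption
    for r in rows:
        r[out] = None if r[var] is None else (r[var] - m) / sd
    return rows
```

With this approach the non-excluded sample has mean 0 and standard deviation 1 on the new variable, while excluded cases still receive a score on the same scale.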
Script 5: Adding variable and value labels
The main purpose of this script (filename A5_label.sps) is to add variable labels to all variables, and value labels where appropriate. For a full list of variables, including labels and descriptions of value coding, see the 1st Contact variables list page. The script carries out these tasks in order:
- Open the dataset file a4derive saved by the last script.
- Add a descriptive variable label to every variable in the dataset.
- For every categorical variable having 3 or more response categories, add value labels to describe the numbered categories.
- Place the variables into a logical and systematic order. The variable order generally follows the order in which respective items appear in the 1st Contact booklet; additional derived variables appear at the end of the dataset.
- Save a backup copy of this dataset (filename a5label) in the \working files\ subdirectory.
- Save another copy as the main 1st Contact dataset, with filename adb9456.