Exporting raw data
Exporting involves copying the cleaned and aggregated raw data from the Access database where they are stored, into csv files that can be read into SPSS. The process of exporting raw data is described in general terms in the data processing summary page.
The booklet/questionnaire data, the TOWRE test data, and the administrative data, stored in the Access 12yr.accdb database file, have been subject to occasional changes, even after the end of data collection. In earlier years, changes were caused by late returns of booklets, and more recently changes have occasionally been caused by data cleaning or data restructuring changes. If such changes have been made, then the data should be re-exported before a new version of the dataset is created using the SPSS scripts. The data stored in the database tables are usually exported indirectly, by means of saved "queries" (or views), rather than directly from the tables themselves. Each query selects appropriate columns from the relevant tables, excluding inappropriate data such as verbatim text fields. The queries also modify the format of the data values in some columns, so that they are saved in a format that can easily be read by SPSS; examples are date columns (changed to dd.mm.yyyy format) and true/false columns (changed to 1/0 values). The queries used to export the data are as follows:
Query name | Source of data | Database table(s) involved | Exported file name |
---|---|---|---|
(query not used; exported directly from tables) | parent booklets | Parent1, Parent2 | Parent1.csv, Parent2.csv |
Export Teacher | teacher questionnaires | Teacher | Teacher.csv |
Export Child Qnr Data | twin booklets | TwinQuestionnaire | TwinQuestionnaire.csv |
Export 12yr admin |
|
yr12Progress | 12yrAdmin.csv |
Export SRS CAST for 12yr dataset | parent CAST questionnaire (administered by SRS sub-study) | SRSparentCAST | SRSparentCAST.csv |
A convenient way of exporting these data files is to run the saved macro called export raw data in the 12yr.accdb Access database. See the data files summary page and the 12 Year data files page for further information about the storage of these files.
Preparing raw web data
Data from the twin web tests were stored on the web server during the course of each wave of the study (waves 1 and 2 plus the follow-up wave in 2008). For each test, the data for all twins that had completed the test were collected into an "analysis file" that was downloaded from the web server. These analysis files, one per web activity per wave, were the original twin web data files. There were 36 such twin test files, 16 from each of waves 1 and 2 (for the 16 activities PIAT, GOAL, Reading Fluency, Maths, Vocabulary, Picture Completion, Raven, General Knowledge, Inferences, TOAL, Expressions, Eyes, Jigsaws, Hidden Shapes, Maths and Reading Questionnaire, and Author Recognition) plus 4 from the follow-up wave (for the 4 repeated activities PIAT, GOAL, Reading Fluency and Maths). Additionally, there was one family-based file per wave: a "family status" file contain data describing the status of the various web activities.
Subsequently, for each of these data files, the two or three wave files were aggregated together, so that there is now a single file for each web activity. At the same time, the data from the Maths and Reading Questionnnaire were merged into a single file with the data from the Author Recognition task. Hence there are 16 aggregated web data files in total. These files contain too many variables to be conveniently imported into Access database tables alongside the admin and questionnaire data. Instead, they are stored separately as csv text files. When the original files were aggregated in this way, identifying fields other than IDs (e.g. names) were removed.
Processing by scripts
Having exported the raw data as above, a new version of the dataset is made by running the scripts described below. The scripts must be run strictly in sequence. To run each script, simply open it in SPSS, select all the text, and click on the Run icon.
Script 1: Merging raw data sources
The main purpose of this script (filename L1_merge.sps) is to merge various raw data files together, so as to create a basic dataset with one row of data per twin. The script also carried out some basic processing of the raw data, such as recoding and renaming the raw item variables, and creating some basic derived variables including scrambled IDs. The script carries out these tasks in order:
- There are 5 files of family-based raw data: two files of parent
questionnaire data, the file of SRS CAST questionnaire data, the web family
status data file (containing parent consent dates and twin test status
variables), and the 12 year admin data (containing booklet return dates,
TOWRE test data, parent-reported NC data and other details).
These raw data files all start in csv format. For each of these 5 files in
turn, carry out the following actions:
- Import into SPSS
- Sort in ascending order of family identifier FamilyID
- Recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
- For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
- Carry out basic recoding of categorical variables where necessary.
- Add reversed-coded versions of item variables where needed.
- Drop raw data variables that are not to be retained in the datasets.
- Save as an SPSS data file.
- Add cases from the parent questionnaire data and the SRS CAST questionnaire (a parent may have returned either but not both of these). Both sets of data contain the same item variables for the CAST measure. Create a single parent data flag lpdata, showing the presence of data from either source.
- Merge the remaining files of family-based data together using FamilyID as the key variable.
- Double enter twin-specific items in the family-based data as follows:
- Compute twin identifier atempid2 for the elder twin by appending 1 to the FamilyID. Compute the Random variable. Save as the elder twin part of the family data.
- Re-compute atempid2 for the younger twin by appending 2 to the FamilyID. Reverse the values of the Random variable. Swap over elder and younger twin values in any twin-specific variables in the family data (do this by renaming variables). Save as the younger twin part of the family data.
- Combine the elder and younger twin parts together by adding cases. Sort in ascending order of atempid2 and save as the double entered family data file.
- There are 18 files of twin-based raw
data: the admin data file containing twin IDs and birth orders, the file of twin questionnaire data, the file of teacher questionnaire
data, plus the 15 web activity files (Maths, PIAT, GOAL, Yes/No, Ravens Matrices,
Picture Completion, General Knowledge, Vocabulary, Inferences, Figurative Language,
TOAL, Eyes, Jigsaws, Hidden Shapes, Maths and Reading Questionnaire including the
Author Recognition task). These
raw data files start in csv format. For each of these files in turn, carry
out the following actions:
- Import into SPSS
- Sort in ascending order of twin identifier TwinID
- In questionnaire data files, recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
- Add reversed-coded versions of item variables where needed.
- For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
- Where response variables in raw web data files are encoded as strings, wherever possible convert these into numeric response variables (this applies in Maths, TOAL and Inferences).
- Save as an SPSS data file.
- Using TwinID as the key variable, merge together all 18 twin data files.
- Derive a flag variable lcwdata1 to show which twins have some web data, from at least one web test.
- Double enter the main twin and teacher data flags, as follows:
- Compute the alternative twin identifier atempid2 as the FamilyID followed by the twin order (1 or 2).
- Sort in ascending order of atempid2 and save this file as the twin 1 part.
- Change the flag variable names by changing the ending from 1 to 2. Change the values of atempid2 to match the co-twin (change the final digit from 1 to 2 or vice versa). Re-sort in ascending order of atempid2 and save with just the renamed variables as the twin 2 part.
- Merge the twin 1 and twin 2 parts using atempid2 as the key variable. The double entered data flags can now be used to select twin pairs having data.
- Merge this twin data file with the double entered parent data file, using atempid2 as the key variable. This dataset now contains all the raw data.
- Use the parent data flag, and the double entered twin and teacher data flags, to filter the dataset and delete any cases without any 12 Year data. Add the overall 12 Year data flag variable ltwelvyr.
- Recode all data flag variables from missing to 0.
- Anonymise the family and twin IDs; the algorithm for scrambling IDs is described on another page.
- Sort in ascending order of scrambled twin ID id_twin.
- Save and drop the raw ID variables.
- Merge in essential background variables, from a separate reference dataset, using id_twin as the key variable. These include twin birth dates (for deriving ages), 1st Contact reference variables, twin sexes and zygosities, autism, medical exclusions and overall exclusion variables, all of which are already double entered where appropriate.
- Use variable ltwelvyr to filter the dataset and delete cases added from the reference dataset that do not have 12 Year data.
- Save a working SPSS data file ready for the next script (filename l1merge in the \working files\ subdirectory).
Script 2: Recoding web data
One purpose of this script (filename L2_recode.sps) is to make the web item data easier to use, particularly by recoding missing item responses and item scores. Another purpose is to identify probable random responders within each web test, and to exclude the test data accordingly. The recoding procedures are designed to be consistent with those used in the 16 year web study. The tasks carried out in this script are necessary for some of the derivations of new variables in the next script.
- Open the data file saved at the end of the previous script.
- Derive variables to measure the variability in a twin's responses, in each web test that has a uniform response format across all items. Although these are temporary variables, designed for use in the next script, it is necessary to compute them here before the item response re-coding that follows.
- For each of the web tests, where appropriate, carry out the following
recoding stages:
- In tests involving branching, determine the branching route taken by each twin, and hence identify any items skipped due to upward branching. For these items, recode item responses from missing to -3. Item scores are nearly always automatically credited on the web server; in isolated cases where this has not happened, recode the item score to 1 (or 2 for Vocabulary).
- In tests involving discontinue rules, determine the point at which each twin discontinued. Hence identify items that were skipped due to the discontinue rule. For these items, recode item responses from missing to -2, and recode item scores from missing to 0.
- In tests involving item timeout rules, identify the items that were timed out. For these items, recode item responses to -1, and recode item answer times to missing. (Item scores have already been set to 0 on the web server, hence they do not require recoding for this purpose.)
- In all tests, assume that any remaining items with missing item responses and/or scores are crashed/malfunctioned items. Identify such items, recode item responses from missing to -4, and recode item scores from missing to 0.
- Recode the few remaining missing item scores to 0.
- For each web test, where appropriate, carry out the following additional
steps to identify and deal with compromised tests:
- In tests involving branching, identify cases where the branching rules have not been followed correctly, resulting in inappropriate upward branching and crediting of skipped items.
- Identify any cases that have been affected by a high proportion of item crashes and/or item timeouts (hence a relatively low proportion of items with meaningful answers).
- The TOAL web test has unusual and very complex rules for branching,
discontinuing, crediting and un-crediting items. In some cases, the rules
seem to have failed to work correctly, because of missing item data caused
by item crashes. Where this has happened, inappropriate crediting and
un-crediting of items has led to apparently anomalous raw test scores.
Hence, carry out the following steps:
- In cases where more than one item has crashed, assume the test is compromised and recode the status variable from 2 to 3.
- Use the item data to derive the positions of the basal and ceiling points for each twin, according to the test rules.
- Ensure that items below the basal point are credited by recoding where necessary (item score 1, item response -3).
- Ensure that items above the ceiling point are un-credited by recoding where necessary (item score 0, item response -2).
- Re-compute the test total score by summing the recoded item scores.
- In the Author Recognition test data, recode missing item responses and scores to 0. (All items were presented on one web page; items not selected have missing values in the raw data.)
- Compute a 'corrected' score for the Author Recognition test, by subtracting incorrect responses from correct responses.
- In all web activities where item times were recorded, recode outlying (very high) times to missing, so as not to distort mean times. This includes item response times, item download times, and (for PIAT only) item reading times. Anomalous outliers are thought to have occurred particularly when an item was started just before midnight and completed just after midnight.
- Derive the mean item answer time for each web test.
- Derive a categorical variable to estimate the internet connection type (broadband or dialup) used by each twin for the web tests. This estimate is based on observed mean download times for selected tests.
- Derive temporary variables, for each web test, to measure the proportion of attempted items that were answered with very fast response times (typically less than one second, but varying in some tests). These variables are used in the following step.
- In each web test, identify twins who appear to have answered randomly or without effort. Characteristics vary from test to test, but generally they involve a combination of low test scores, very short item response times, and low variability in item responses. This identification process is described in more detail in the web data cleaning page.
- For any twin identified as a random responder in a given test, exclude the test data by recoding the status variable from 2 to 4.
- In each test, where the status flag has been recoded to 3 (compromised test) or 4 (random responder), exclude the test data by recoding the test data flag from 1 to 0 and all test items and scores to missing.
- Re-compute the twin web data flag lcwdata1 to take account of these new exclusions.
- Save a working SPSS data file ready for the next script (filename l2recode in the \working files\ subdirectory), dropping any temporary variables used in the computations above.
Script 3: Create derived variables
The purpose of this script (filename L3_derive.sps) is to compute derived variables, including scales and composites and twin ages. See derived 12 Year variables for details. The script carries out these tasks in order:
- Open the data file saved at the end of the previous script.
- Derive variables for individual twin ages when the web tests were started, when the various booklets/questionnaires were returned, and when the TOWRE test results were reported.
- Derive sub-test scores for the three sub-tests of the Maths web test.
- For tests involving discontinue rules, and involving items in which there was a significant chance of guessing the correct answer, derive "adjusted" test scores. For any item skipped due to a discontinue rule, the default score of 0 is replaced by the "chance" score that would be obtained, on average, by selecting an answer at random. The adjusted total score is then the sum of the adjusted item scores.
- Derive standardized cognitive, TOWRE and academic achievement composites as
follows:
- Apply a filter (exclude1=0) to remove exclusions
- Standarise the necessary component items and scores
- Compute the mean of the appropriate standardised items/scores
- Standardise the mean, to make the final version of each composite
- Remove the filter
- For twins who finished all of the web tests, in battery A or in battery B or in the entire combined battery, derive the total time taken to complete each battery.
- Derive behaviour/environment scales (parent, teacher and child versions where appropriate) for the following measures: SDQ, MFQ, APSD, CAST, Conners, Victimisation, Puberty, Motivational (good at, liking), Chaos, Parental Feelings, Discipline, Maths and Literacy Environment. In each case, the scale is derived using a mean of the relevant items, requiring at least half the items to be non-missing.
- Derive twin BMI from heights and weights reported in the twin booklet.
- Drop any temporary variables that have been used to derive the new variables. Date variables are dropped at this point, having been used to derive ages.
- Save a working SPSS data file ready for the next script (filename l3derive).
Scripts 4 and 5: Labelling the variables
The purpose of these two scripts (filenames L4_label_part1.sps, L5_label_part2.sps) is to label all the variables in the dataset, and to add value labels for categorical variables. Because the number of variables in the dataset is very large, the labelling of the variables requires a very long script; hence this task has been split into two scripts for convenience. Each script carries out these tasks in order:
- Open the data file saved at the end of the previous script.
- Label variables.
- Add value labels to all integer-valued categorical variables having 3 or more categories.
- Save a working SPSS data file ready for the next script (filenames l4label, l5label).
Script 6: Double entering the data
The purpose of this script (filename L6_double.sps) is to double-enter all the twin and teacher data in the dataset. Note that twin-specific item variables from the parent booklet and admin data are already correctly double-entered at this stage (this was achieved in script 1). The variables to be double entered in the current script are all items from the twin booklet, teacher questionnaire and twin web tests. The script carries out these tasks in order:
- Open the data file saved at the end of the previous script.
- Create the twin 2 part (for the co-twin) as follows:
- Rename the appropriate item and derived variables (from the twin web tests and booklets and teacher questionnaires) by changing the suffix from 1 to 2.
- Modify the id_twin values so they will match the co-twin (change the final digit from 1 to 2 or vice versa).
- Re-sort in ascending order of id_twin and save as the twin 2 part, keeping only the renamed variables.
- Re-open the data file saved at the end of the previous script: this already serves as the twin 1 part of the dataset.
- Merge in the twin 2 part, using id_twin as the key variable. The dataset is now double entered.
- Place the dataset variables into a logical and systematic order (do this using a KEEP statement when saving the dataset).
- Save an SPSS data file (filename l6double in the \working files\ subdirectory).
- Save another copy as the full 12 Year dataset, with filename Ldb9456.